Add support for memory mapping models #586
Conversation
@ggerganov the changes to
@rabidcopy are you sure that you are comparing against current master? It does produce the same output on my machine.
BLAS can change the output, we observed better perplexity when using it.
Yeah that was useful, I had missed the
There might also be a significant performance increase in using C file streams in place of C++ fstreams, as tested in this PR #216. That being said, the polyfill / mmap might be worth it in case
@anzz1 on my computer this reduces load times of the (cached) 65B model to ~15ms, I am not sure that there is much more to gain here
I see, well that is certainly impossible using fopen(), as loading the 65B model from disk on a fast modern 5GB/s SSD would still take 8 seconds at least. Obviously it's impossible (today) for the access speed of storage devices to beat RAM. This can be worth pursuing even with the increased complexity and the need to maintain polyfills. However, in my opinion the move to C file streams should be done anyway before adding more complexity to the model loading framework, since the current STL method of loading files is slooooow, as benchmarked in the PR mentioned above. I'm also not certain whether a large change like this should be implemented right now, since managing states and models as their own parts is on the current roadmap. Most of the code would need to be adapted / completely changed to use the new paradigm anyway. I think currently the best way would be to not jump the gun and see how it works when the model state is separated from the model. I reckon it could be done then without #ifdef'ing any existing model load code, but by making the mmap() one a separate function. Then the whole thing could be wrapped neatly inside a #ifdef POSIX block, and the memory management code would stay portable outside of those extra functions.
@anzz1 I am not sure that I understand your objections, the model state is already kept in a different context in
The objection is mainly towards adding non-portable code to the all-platform codepaths. And while changes like these might only be trivial to performance, my opinion is that any platform-specifics should be completely invisible when compiled on non-supported platforms.
I didn't like having to add Windows-specific code for the console initialization either. As can be seen in the main.cpp example, when you haven't defined _WIN32, no Windows-specific code is run at all. I think that is the right approach if non-portable code is required, so if POSIX functions are used they should be wrapped in their own #ifdef POSIX wrappers: instead of changing the llama_model_load func, make a new llama_model_load_mmap() func which is wrapped in an ifdef.
I know there are obviously things to be gained by removing the portability requirement, since every platform has its own set of tricks which can be used. There are things unique to Linux and unique to Windows both. But my view on this is that anything non-portable should go into its own example files and not be baked into the main libraries. I'm not against implementing this, obviously using mmap() could be good on the platforms that support it. I do actually have some Windows-only ideas in my head too for improved functionality which doesn't exist on other platforms. But in both cases, anything like that should not be in the main libs. However, this is only my opinion, not a fact of life. That being said, even the current implementation would be kinda fine by just wrapping all the code in the commit in a
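To make the suggested separation concrete, here is a minimal self-contained C sketch of the guard pattern being proposed. The llama_model_load_mmap name comes from the comment above, but the LLAMA_HAVE_MMAP macro, the signatures and the fallback wiring are hypothetical illustrations, not the PR's actual code.

#include <stdbool.h>
#include <stdio.h>

#if defined (__unix__) || defined (__APPLE__)
#define LLAMA_HAVE_MMAP 1
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// mmap-based loader, only compiled where the POSIX APIs exist
static bool llama_model_load_mmap(const char * path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return false;
    off_t len = lseek(fd, 0, SEEK_END);
    void * addr = (len > 0) ? mmap(NULL, (size_t) len, PROT_READ, MAP_SHARED, fd, 0) : MAP_FAILED;
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return false;
    printf("mapped %s (%lld bytes) at %p\n", path, (long long) len, addr);
    return true;  // real code would hand the mapping to the model loader
}
#endif

// stand-in for the existing portable loader, which stays unchanged
static bool llama_model_load(const char * path) {
    printf("regular load of %s\n", path);
    return true;
}

int main(int argc, char ** argv) {
    const char * path = argc > 1 ? argv[1] : "model.bin";
#ifdef LLAMA_HAVE_MMAP
    if (llama_model_load_mmap(path)) return 0;  // fast path where supported
#endif
    return llama_model_load(path) ? 0 : 1;
}

On a non-POSIX build the whole mmap block simply does not exist and only the portable path is compiled, which is the invisibility property argued for above.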
@anzz1 I see, I understand it now. In this case however, keep in mind that windows support will be added very soon. This is not going to be dead code on windows for long, and the benefits of being able to work with just one copy of the model are huge (as opposed to one copy in the OS cache and another in llama.cpp). It may still be the case that there are some platforms that we want to support that cannot do memory mapped files at all, but the overhead here is so small that even then I think it is preferable to keep it this way than to litter the code with #ifdefs, because that would make the code harder to maintain in the future. I suspect that the optimizer would be able to statically determine that
That would be platforms like Arduino which don't have MMUs, which is the only reason I can think of that a platform wouldn't support memory mappings. This project shocked the world by managing to get LLMs to run on Raspberry Pis and phones. Perhaps one day we'll do microcontrollers, but I don't think we're quite there yet.
Yeah, there is certainly a huge benefit in implementing a feature like this. Faster loading times objectively create a better user experience, no question about that. You have made extremely valuable contributions to this project and they are well appreciated. I have seen the work you have done so far and liked every bit of it, so this is in no way an objection towards your work. I do also get that sometimes my criticisms may come off as unnecessarily harsh objections and I need to be more mindful about my tone. And I am certainly no authority here whatsoever, nor do I try to steer the boat, as it's not my place to do that.
When it comes to open-source projects, especially such a fast-moving and popular one like this, there is always a big threat when the project moves too fast that regressions will be caused along the way, eventually leading to a less coherent and more chaotic structure overall. Too many times I have seen great projects fall into a pattern of taking two steps forward but one step back, eventually reaching a point where the project diverges and end-users opt to use an older code base. Of course even proprietary closed-source products aren't immune to this, as a simple search for something like "which minecraft version is the best" shows, but open-source projects are especially susceptible to it. If the answer to "which version of the product is best" isn't "the latest one", something has gone wrong along the way. Or if the answer to a question from a downstream importer of a library, "is the latest version still compatible", isn't "yes".
It is a delicate balancing act to keep a coherent project but also not hamper innovation. When that balance is in the right place, new innovation will spur great development where every step is a step forward, and the latest version is also the greatest. Stable but innovative can be a hard thing to achieve, but this project has succeeded very well in that regard so far. The only thing I'm really trying to say is to be mindful and careful about not introducing incompatibility or regressions in general, and that especially for the more specialized and less-portable stuff great care should be taken; it's not an objection towards this feature per se.
It definitely was and is a shocking thing, one I didn't think was possible before this project. In the "before llama.cpp" times, LLMs were something to be run on $100k 8xA100 datacenter computers, or maybe on an enthusiast-grade consumer PC with a high-end graphics card. The general wisdom regarding AI was that this couldn't and shouldn't be done on a general-purpose CPU. Now it is not only running on CPUs, but on low-power CPUs like the Raspberry Pi or mobile phones. It's really an astonishing feat.
Developments like this will greatly accelerate the already fast-moving AI space, since many people were excluded by the simple cost of entry, which is no longer the case. I have a strong feeling that this is the very beginning of an AI explosion, now that the idea of "inference on the edge" brings strong LLM capabilities from the datacenters to the homes. I can see a similarity to when computers themselves moved from large mainframes in corporate offices to the homes with the advent of the PC in the 1980s. The whole idea is being redefined: AI is no longer something that can be used only by large companies with vast resources; instead it's something you can run on your phone. It really is a game changer.
Ideas like this are massively important and also go to show that the 'general wisdom' isn't always correct. After it became the general consensus that running LLMs definitely requires a huge amount of batched parallel computing power only found on GPUs, preferably many of them, most people probably just accepted it as a fact and didn't consider researching whether that's actually true. Well, now we can see that it was not. 😄
We're not just taking great care with the introduction of mmap(), we're taking tender loving care. However, technical work like that is just one small piece of the puzzle. The softer-skill topics you raised are the hardest challenges of them all, so it makes me happy you're thinking about them. We must maintain the stability of this project while elevating its technical superiority, in order to better serve the users in our community who trust and depend upon this project. Even if we ourselves are just volunteers having fun hacking on open source, I still believe the responsibility for the positive impact of our work is something worth focusing on.
@anzz1 Absolutely, you raise good points. I apologize if my responses made you think otherwise; of course I appreciate your feedback and your other invaluable contributions to the project. I agree that this shouldn't be merged unless we are convinced that this is the way to go in the future. It is too easy to accumulate too much technical debt in a project early on that may doom it in the future. Personally I believe that it is, but I may be wrong, and that's why it is always important to get as much feedback as possible. We are just getting started with llama.cpp and the possibilities are endless, I think that's why we all are here. Being able to run an LLM not just locally but on the CPU is an amazing idea that just a few weeks ago seemed impossible. We are just getting started!
Yeah, it's a bit sad we don't have designated initializers and meaningful compiler errors in C++ for such cases, and it sets the default values in a single place in the code base. What do C people normally do to solve this correctly? Regarding the cross-platform stuff: I think this will always be an easy change to make, so whatever you guys think is better now, we can do.
@ggerganov I suspect that the answer may be to replace structs with more encapsulation, opaque pointers and accessor functions. We could add a function like
The downside would be that it would be harder for the users of the library to implement support for memory mapped models. I would be concerned that many users wouldn't bother implementing mmap support at all, and more fundamentally I am not sure that it is a good idea to force them to deal with this complexity for a feature that, from what I can tell, has essentially no downsides.
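For what it's worth, a rough self-contained C sketch of the opaque-pointer / accessor pattern mentioned above. Every name here is hypothetical and not the actual llama.cpp API; it only illustrates how the struct layout and the default values can be kept in one place inside the library.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* "header" part: callers only ever see a forward declaration, so they can
   never leave a newly added field uninitialized or depend on the layout */
typedef struct llama_params llama_params;
llama_params * llama_params_create(void);   /* defaults are set inside the library */
void           llama_params_set_use_mmap(llama_params * p, bool v);
bool           llama_params_get_use_mmap(const llama_params * p);
void           llama_params_free(llama_params * p);

/* "implementation" part: the layout and the defaults live in one place only */
struct llama_params {
    bool use_mmap;
    int  n_ctx;
};

llama_params * llama_params_create(void) {
    llama_params * p = malloc(sizeof *p);
    if (p) {
        p->use_mmap = true;   /* new fields get their defaults here and nowhere else */
        p->n_ctx    = 512;
    }
    return p;
}

void llama_params_set_use_mmap(llama_params * p, bool v) { p->use_mmap = v; }
bool llama_params_get_use_mmap(const llama_params * p)   { return p->use_mmap; }
void llama_params_free(llama_params * p)                  { free(p); }

int main(void) {
    llama_params * p = llama_params_create();
    llama_params_set_use_mmap(p, false);
    printf("use_mmap = %d\n", llama_params_get_use_mmap(p));
    llama_params_free(p);
    return 0;
}

The trade-off is exactly the one described above: more boilerplate for the library and its users in exchange for defaults defined once and a layout that can grow without breaking callers.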
Of course, WASM! I had completely missed this platform. I suppose that's going to be the main case where mmap is not available.
Just wanted to update that pulling the latest master and rerunning the same seed and prompt as I did before now matches what I was getting. So output changed but not because of this PR. Odd. And these two warnings remain on master now.
It appears 436e561 is where these warnings started and the output started to be slightly different. Either way, sorry for attributing it to this PR.
Successfully compiled the master branch, successfully compiled slaren's master branch, and successfully ran ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512. If you're confused about how exactly I compiled it, read #103 (comment)
Great work, @slaren - I just tested this on an iPhone 14 Pro Max (I'm trying to make room for running llama.cpp in an iPhone app) and this actually made it possible to load the app, but it was too slow to make any sense. After doing some profiling I also found that the CPUs were only used at around 30% while most of the work was reported as memory ops. (Here is a quick demo of the app running in the Simulator: https://twitter.com/chrfalch/status/1641092675146244097)
llama.cpp
Outdated
@@ -12,6 +12,13 @@
#include <cassert>
#include <cstring>

// headers for POSIX mmap
Recommended:
#ifdef __has_include
#if __has_include(<sys/mman.h>)
#include <sys/mman.h>
#endif
#if __has_include(<fcntl.h>)
#include <fcntl.h>
#endif
#if __has_include(<unistd.h>)
#include <unistd.h>
#endif
#elif defined (__unix__) || defined (__APPLE__)
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#endif
Due to:
ggml.c line 101
#ifdef __has_include
#if __has_include(<sys/mman.h>)
#undef GGML_MLOCK_SUPPORT
#define GGML_MLOCK_SUPPORT 1
#include <sys/mman.h>
#endif
#endif
ggml.c is C, but this is C++ and currently we are compiling with -std=c++11, and __has_include is C++17. Do you know of any platforms that may fail with the current implementation?
The thing is, it would not fail on any platform, because #ifdef __has_include will skip the block entirely if the __has_include macro does not exist.
I understand, but in practical terms what are the advantages of doing that instead? Do you know of any specific platform that may fail with the current code?
The advantages are that if a person created their own implementation of mmap, the code would use their implementation instead, and it decreases the amount of work needed to support another platform.
Hey folks! Thanks for testing this PR. I've created a live (non-draft) PR in #613 that's rebased on this PR. The new PR not only solves the loading problem for single-file models like 7B, it also supports multi-file models like 13B, because the underlying issue with the file format is now being solved. I'd love to hear feedback!
@slaren just curious, I was reading through https://github.com/slaren/llama.cpp/commits/mmap (and mmap_file() here slaren@ef9afe1#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR299) and I'm wondering how much was written by you and how much by jart?
@InconsolableCellist I really don't want to start any drama so I'll just say that I wrote the code in my commits.
@slaren Thanks. My contact info is in my profile if there's more to say. I was a bit surprised when I drilled down into the mmap history. This was partially a technical concern too: I'm wondering if there's a discussion somewhere about how mmap just defers memory usage rather than reducing it? If you eventually hit every path through the model, I think it ends up loaded in RAM anyway. It seems to have savings for frequent loading/unloading, if I understand it correctly. I'm not sure where to start this convo.
@InconsolableCellist I haven't looked much into it, but to the best of my knowledge llama is not a sparse model in any way and every tensor in the model file is used every time the network is evaluated. I have no idea why people think that this reduces memory usage, but I may be missing something.
Yeah, I was reading #638 (comment) and the concept of these networks being sparse seems to be just a misunderstanding of the profiling, AFAIK.
@slaren
@ggerganov In order to read a file into memory using malloc()+read(), the system will in fact perform an anonymous mmap() for the memory area, read the file using an equivalent mechanism, and then copy from the file's mapping to your anonymous area. That's a huge waste of RAM and load time, unless you need to transform the data on the fly while loading it. But if your file is used as-is, you must absolutely not load it, and map it instead. The file will then be accessed exactly as if this memory area had been swapped out: accesses to parts of the memory area that are not yet in memory cause a page fault resulting in an I/O from the file to that area, and the application sees the data. When memory becomes scarce, some less used areas will get swapped out again. You can even further optimize this using madvise() to indicate that some areas are read a lot and others less, etc. But in general, keep in mind that mmap() is an essential and critical component of modern operating systems and has been present everywhere since the early 90s. Hoping this helps.
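As a concrete illustration of the mechanism described above, here is a minimal POSIX C program (my own sketch, not code from this PR) that maps a file read-only, gives the kernel an access hint with madvise(), and lets page faults pull the data in lazily:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); close(fd); return 1; }

    /* no read() and no copy: pages come straight from the page cache on demand */
    void * addr = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after closing the descriptor */
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    /* optional hint: tell the kernel we intend to read this range soon */
    madvise(addr, (size_t) st.st_size, MADV_WILLNEED);

    /* the first touch of a page triggers a fault that reads it from the file */
    printf("first byte: 0x%02x\n", ((const unsigned char *) addr)[0]);

    munmap(addr, (size_t) st.st_size);
    return 0;
}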
@wtarreau Thank you - I now have a better understanding of how it works
You're welcome! I wanted to implement this last week-end and was very pleased to see that someone had already done it :-)
Significantly reduces model loading times, especially if they are already cached by the OS.
Requires a single-part model to work, either a 7B model or any model converted with New conversion script #545 (use with --n_parts 1). Only tested on Linux, needs to be tested with other POSIX-compatible operating systems. Should work on OS X as is.
The tensors may not be aligned, which may cause performance issues or crashes on some microarchitectures. However, testing by @jart showed no issues so far. This needs to be fixed in the conversion script.
Still missing:
llama_free
Co-authored with @jart.
🤖 Generated by Copilot at ef9afe1
Summary
✨⚡🐧
Added a feature to llama.cpp and ggml.c that allows loading and evaluating models from memory mapped files. This can improve performance and reduce memory usage for large models. Added a new parameter no_alloc to the ggml_context and ggml_init_params structs to control this feature. Implemented memory mapping for Unix-like systems and left a placeholder for Windows.
Walkthrough
- no_alloc in ggml_context and ggml_init_params to indicate whether the context should allocate memory for the tensor data or not (link, link) (see the sketch after this list)
- no_alloc field from the params argument in ggml_init (link)
- no_alloc field in ggml_new_tensor_impl to prevent allocating memory for the tensor data and to assign the data field to the memory mapped address if available (link, link)
- no_alloc field to false in ggml_opt and llama_eval_internal to ensure that the optimization and evaluation processes allocate memory for the intermediate tensors as usual (link, link)
- llama.cpp (link)
- mmap_file to llama.cpp to try to memory map a given file and return the mapped address or NULL on failure (link)
- llama_model_load to determine whether to use memory mapping or not based on the number of model parts and the success of mmap_file (link)
- llama_model_load (link)
- no_alloc field to the value of use_mmap in llama_model_load (link)
- llama_model_load (link)
- llama_model_load by setting the tensor data pointer to the offset from the mapped address and advancing the file pointer accordingly (link)
- no_alloc field to false in kv_cache_init to ensure that the key-value cache allocates memory for the tensors as usual (link)
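To make the no_alloc flow above easier to follow, here is a rough self-contained C sketch of the idea. The names and structs are simplified stand-ins for the real ggml/llama.cpp types, not the code in this PR: with the flag set, creating a tensor records its metadata but reserves no space for its data, and the loader later points the data field directly into the mapped file.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

struct sketch_tensor {
    size_t size;   /* bytes of tensor data */
    void * data;   /* left NULL under no_alloc, filled in by the loader */
};

struct sketch_context {
    bool   no_alloc;  /* the flag added by this PR, in spirit */
    char * buffer;    /* arena used on the normal allocation path */
    size_t used;
};

static struct sketch_tensor sketch_new_tensor(struct sketch_context * ctx, size_t size) {
    struct sketch_tensor t = { size, NULL };
    if (!ctx->no_alloc) {
        t.data = ctx->buffer + ctx->used;  /* normal path: carve space out of the arena */
        ctx->used += size;
    }
    /* under no_alloc the data pointer is left for the loader to fill in */
    return t;
}

int main(void) {
    static char arena[1024];
    static char fake_mapped_file[1024];  /* stand-in for the mmap'ed model file */

    struct sketch_context normal = { false, arena, 0 };
    struct sketch_context mapped = { true,  NULL,  0 };

    struct sketch_tensor a = sketch_new_tensor(&normal, 256);  /* data lives in the arena */
    struct sketch_tensor b = sketch_new_tensor(&mapped, 256);  /* data not allocated... */
    b.data = fake_mapped_file + 128;  /* ...the loader points it into the mapping instead */

    printf("a.data in arena: %d, b.data in mapping: %d\n",
           a.data == (void *) arena, b.data == (void *) (fake_mapped_file + 128));
    return 0;
}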