Add support for memory mapping models #586
Conversation
@ggerganov the changes to
@rabidcopy are you sure that you are comparing against current master? It does produce the same output on my machine.
BLAS can change the output, we observed better perplexity when using it.
Yeah that was useful, I had missed the
There might also be a significant performance increase in using C file streams in place of C++ fstreams, as tested in this PR #216. That being said, the polyfill / mmap might be worth it in case
@anzz1 on my computer this reduces load times of the (cached) 65B model to ~15ms, I am not sure that there is much more to gain here
I see, well that is certainly impossible using fopen(), as loading the 65B model from disk on a fast modern 5GB/s SSD would still take 8 seconds at least. Obviously it's impossible (today) for the access speed of storage devices to beat RAM. This can be worth pursuing even with the increased complexity and the need to maintain polyfills. However, in my opinion the move to C file streams should be done anyway before adding more complexity to the model loading framework, since the current STL method of loading files is slooooow, as benchmarked in the PR mentioned above. I'm also not certain whether a large change like this should be implemented right now, since managing states and models as their own parts is on the current roadmap. Most of the code would need to be adapted / completely changed to use the new paradigm anyway. I think currently the best way would be to not jump the gun and see how it works when the model state is separated from the model. I reckon it could be done then without #ifdef'ing any existing model load code, but by making the mmap() one a separate function. Then the whole thing could be wrapped neatly inside a #ifdef POSIX block, and the memory management code would stay portable outside of those extra functions.
@anzz1 I am not sure that I understand your objections, the model state is already kept in a different context in
The objection is mainly towards adding non-portable code to the all-platform codepaths. And while changes like these might only be trivial to performance, my opinion is that any platform-specifics should be completely invisible when compiled on non-supported platforms.
I didn't like having to add Windows-specific code for the console initialization either. As can be seen in the main.cpp example, when you haven't defined _WIN32, no Windows-specific code is run at all. I think that is the right approach if non-portable code is required, so if POSIX functions are used they should be wrapped in their own #ifdef POSIX wrappers: instead of changing the llama_model_load func, make a new llama_model_load_mmap() func which is wrapped in an ifdef.
I know there are obviously things to be gained by removing the portability requirement, since every platform has its own set of tricks which can be used. There are things unique to Linux and unique to Windows both. But my view on this is that anything non-portable should go into its own example files and not be baked into the main libraries. I'm not against implementing this, obviously using mmap() could be good on the platforms that support it. I do actually have some Windows-only ideas in my head too for improved functionality which doesn't exist on other platforms. But in both cases, anything like that should not be in the main libs. However, this is only my opinion, not a fact of life. That being said, even the current implementation would be kinda fine by just wrapping all the code in the commit in a
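To make the suggested separation concrete, here is a minimal self-contained C sketch of the guard pattern being proposed. The llama_model_load_mmap name comes from the comment above, but the LLAMA_HAVE_MMAP macro, the signatures and the fallback wiring are hypothetical illustrations, not the PR's actual code.

#include <stdbool.h>
#include <stdio.h>

#if defined (__unix__) || defined (__APPLE__)
#define LLAMA_HAVE_MMAP 1
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// mmap-based loader, only compiled where the POSIX APIs exist
static bool llama_model_load_mmap(const char * path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return false;
    off_t len = lseek(fd, 0, SEEK_END);
    void * addr = (len > 0) ? mmap(NULL, (size_t) len, PROT_READ, MAP_SHARED, fd, 0) : MAP_FAILED;
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return false;
    printf("mapped %s (%lld bytes) at %p\n", path, (long long) len, addr);
    return true;  // real code would hand the mapping to the model loader
}
#endif

// stand-in for the existing portable loader, which stays unchanged
static bool llama_model_load(const char * path) {
    printf("regular load of %s\n", path);
    return true;
}

int main(int argc, char ** argv) {
    const char * path = argc > 1 ? argv[1] : "model.bin";
#ifdef LLAMA_HAVE_MMAP
    if (llama_model_load_mmap(path)) return 0;  // fast path where supported
#endif
    return llama_model_load(path) ? 0 : 1;
}

On a non-POSIX build the whole mmap block simply does not exist and only the portable path is compiled, which is the invisibility property argued for above.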
@anzz1 I see, I understand it now. In this case however, keep in mind that windows support will be added very soon. This is not going to be dead code on windows for long, and the benefits of being able to work with just one copy of the model are huge (as opposed to one copy in the OS cache and another in llama.cpp). It may still be the case that there are some platforms that we want to support that cannot do memory mapped files at all, but the overhead here is so small that even then I think it is preferable to keep it this way than to litter the code with #ifdefs, because that would make the code harder to maintain in the future. I suspect that the optimizer would be able to statically determine that
That would be platforms like Arduino which don't have MMUs, which is the only reason I can think of that a platform wouldn't support memory mappings. This project shocked the world by managing to get LLMs to run on Raspberry Pis and phones. Perhaps one day we'll do microcontrollers, but I don't think we're quite there yet.
Yeah, there is certainly a huge benefit in implementing a feature like this. Faster loading times objectively create a better user experience, no question about that. You have made extremely valuable contributions to this project and they are well appreciated. I have seen the work you have done so far and liked every bit of it, so this is in no way an objection towards your work. I do also get that sometimes my criticisms may come off as unnecessarily harsh objections and I need to be more mindful about my tone. And I am certainly no authority here whatsoever, nor do I try to steer the boat, as it's not my place to do that.
When it comes to open-source projects, especially such a fast-moving and popular one like this, there is always a big threat when the project moves too fast that regressions will be caused along the way, eventually leading to a less coherent and more chaotic structure overall. Too many times I have seen great projects fall into a pattern of taking two steps forward but one step back, eventually reaching a point where the project diverges and end-users opt to use an older code base. Of course even proprietary closed-source products aren't immune to this, as a simple search for something like "which minecraft version is the best" shows, but open-source projects are especially susceptible to it. If the answer to "which version of the product is best" isn't "the latest one", something has gone wrong along the way. Or if the answer to a question from a downstream importer of a library, "is the latest version still compatible", isn't "yes".
It is a delicate balancing act to keep a coherent project but also not hamper innovation. When that balance is in the right place, new innovation will spur great development where every step is a step forward, and the latest version is also the greatest. Stable but innovative can be a hard thing to achieve, but this project has succeeded very well in that regard so far. The only thing I'm really trying to say is to be mindful and careful about not introducing incompatibility or regressions in general, and that especially for the more specialized and less-portable stuff great care should be taken; it's not an objection towards this feature per se.
It definitely was and is a shocking thing, one I didn't think was possible before this project. In the "before llama.cpp" times, LLMs were something to be run on $100k 8xA100 datacenter computers, or maybe on an enthusiast-grade consumer PC with a high-end graphics card. The general wisdom regarding AI was that this couldn't and shouldn't be done on a general-purpose CPU. Now it is not only running on CPUs, but on low-power CPUs like the Raspberry Pi or mobile phones. It's really an astonishing feat.
Developments like this will greatly accelerate the already fast-moving AI space, since many people were excluded by the simple cost of entry, which is no longer the case. I have a strong feeling that this is the very beginning of an AI explosion, now that the idea of "inference on the edge" brings strong LLM capabilities from the datacenters to the homes. I can see a similarity to when computers themselves moved from large mainframes in corporate offices to the homes with the advent of the PC in the 1980s. The whole idea is being redefined: AI is no longer something that can be used only by large companies with vast resources; instead it's something you can run on your phone. It really is a game changer.
Ideas like this are massively important and also go to show that the 'general wisdom' isn't always correct. After it became the general consensus that running LLMs definitely requires a huge amount of batched parallel computing power only found on GPUs, preferably many of them, most people probably just accepted it as a fact and didn't consider researching whether that's actually true. Well, now we can see that it was not. 😄
We're not just taking great care with the introduction of mmap(), we're taking tender loving care. However, technical work like that is just one small piece of the puzzle. The softer-skill topics you raised are the hardest challenges of them all, so it makes me happy you're thinking about them. We must maintain the stability of this project while elevating its technical superiority, in order to better serve the users in our community who trust and depend upon this project. Even if we ourselves are just volunteers having fun hacking on open source, I still believe the responsibility for the positive impact of our work is something worth focusing on.
@anzz1 Absolutely, you raise good points. I apologize if my responses made you think otherwise; of course I appreciate your feedback and your other invaluable contributions to the project. I agree that this shouldn't be merged unless we are convinced that this is the way to go in the future. It is too easy to accumulate too much technical debt in a project early on that may doom it in the future. Personally I believe that it is, but I may be wrong, and that's why it is always important to get as much feedback as possible. We are just getting started with llama.cpp and the possibilities are endless, I think that's why we all are here. Being able to run an LLM not just locally but on the CPU is an amazing idea that just a few weeks ago seemed impossible. We are just getting started!
Yeah, it's a bit sad we don't have designated initializers and meaningful compiler errors in C++ for such cases, and it sets the default values in a single place in the code base. What do C people normally do to solve this correctly? Regarding the cross-platform stuff: I think this will always be an easy change to make, so whatever you guys think is better now, we can do.
@ggerganov I suspect that the answer may be to replace structs with more encapsulation, opaque pointers and accessor functions. We could add a function like
The downside would be that it would be harder for the users of the library to implement support for memory mapped models. I would be concerned that many users wouldn't bother implementing mmap support at all, and more fundamentally I am not sure that it is a good idea to force them to deal with this complexity for a feature that, from what I can tell, has essentially no downsides.
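For what it's worth, a rough self-contained C sketch of the opaque-pointer / accessor pattern mentioned above. Every name here is hypothetical and not the actual llama.cpp API; it only illustrates how the struct layout and the default values can be kept in one place inside the library.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* "header" part: callers only ever see a forward declaration, so they can
   never leave a newly added field uninitialized or depend on the layout */
typedef struct llama_params llama_params;
llama_params * llama_params_create(void);   /* defaults are set inside the library */
void           llama_params_set_use_mmap(llama_params * p, bool v);
bool           llama_params_get_use_mmap(const llama_params * p);
void           llama_params_free(llama_params * p);

/* "implementation" part: the layout and the defaults live in one place only */
struct llama_params {
    bool use_mmap;
    int  n_ctx;
};

llama_params * llama_params_create(void) {
    llama_params * p = malloc(sizeof *p);
    if (p) {
        p->use_mmap = true;   /* new fields get their defaults here and nowhere else */
        p->n_ctx    = 512;
    }
    return p;
}

void llama_params_set_use_mmap(llama_params * p, bool v) { p->use_mmap = v; }
bool llama_params_get_use_mmap(const llama_params * p)   { return p->use_mmap; }
void llama_params_free(llama_params * p)                  { free(p); }

int main(void) {
    llama_params * p = llama_params_create();
    llama_params_set_use_mmap(p, false);
    printf("use_mmap = %d\n", llama_params_get_use_mmap(p));
    llama_params_free(p);
    return 0;
}

The trade-off is exactly the one described above: more boilerplate for the library and its users in exchange for defaults defined once and a layout that can grow without breaking callers.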
Of course, WASM! I had completely missed this platform. I suppose that's going to be the main case where mmap is not available.
Just wanted to update that pulling the latest master and rerunning the same seed and prompt as I did before now matches what I was getting. So output changed but not because of this PR. Odd. And these two warnings remain on master now.
It appears 436e561 is where these warnings started and the output started to be slightly different. Either way, sorry for attributing it to this PR.
Successfully compiled the master branch, successfully compiled slaren's master branch, and successfully ran ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512. If you're confused about how exactly I compiled it, read #103 (comment)
Great work, @slaren - I just tested this on an iPhone 14 Pro Max (I'm trying to make room for running llama.cpp in an iPhone app) and this actually made it possible to load the app, but it was too slow to make any sense. After doing some profiling I also found that the CPUs were only used at around 30% while most of the work was reported as memory ops. (Here is a quick demo of the app running in the Simulator: https://twitter.com/chrfalch/status/1641092675146244097)
llama.cpp
Outdated
@@ -12,6 +12,13 @@
#include <cassert>
#include <cstring>

// headers for POSIX mmap
Recommended:
#ifdef __has_include
#if __has_include(<sys/mman.h>)
#include <sys/mman.h>
#endif
#if __has_include(<fcntl.h>)
#include <fcntl.h>
#endif
#if __has_include(<unistd.h>)
#include <unistd.h>
#endif
#elif defined (__unix__) || defined (__APPLE__)
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#endif
Due to:
ggml.c line 101
#ifdef __has_include
#if __has_include(<sys/mman.h>)
#undef GGML_MLOCK_SUPPORT
#define GGML_MLOCK_SUPPORT 1
#include <sys/mman.h>
#endif
#endif
ggml.c is C, but this is C++ and currently we are compiling with -std=c++11, and __has_include is C++17. Do you know of any platforms that may fail with the current implementation?
The thing is, it would not fail on any platform, because #ifdef __has_include will skip the block entirely if the __has_include macro does not exist.
I understand, but in practical terms what are the advantages of doing that instead? Do you know of any specific platform that may fail with the current code?
The advantages are that if a person created their own implementation of mmap, the code would use their implementation instead, and it decreases the amount of work needed to support another platform.
Hey folks! Thanks for testing this PR. I've created a live (non-draft) PR in #613 that's rebased on this PR. The new PR not only solves the loading problem for single-file models like 7B, it also supports multi-file models like 13B, because the underlying issue with the file format is now being solved. I'd love to hear feedback!
@slaren just curious, I was reading through https://github.com/slaren/llama.cpp/commits/mmap (and mmap_file() here slaren@ef9afe1#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR299) and I'm wondering how much was written by you and how much by jart?
@InconsolableCellist I really don't want to start any drama so I'll just say that I wrote the code in my commits.
@slaren Thanks. My contact info is in my profile if there's more to say. I was a bit surprised when I drilled down into the mmap history. This was partially a technical concern too: I'm wondering if there's a discussion somewhere about how mmap just defers memory usage rather than reducing it? If you eventually hit every path through the model, I think it ends up loaded in RAM anyway. It seems to have savings for frequent loading/unloading, if I understand it correctly. I'm not sure where to start this convo.
@InconsolableCellist I haven't looked much into it, but to the best of my knowledge llama is not a sparse model in any way and every tensor in the model file is used every time the network is evaluated. I have no idea why people think that this reduces memory usage, but I may be missing something.
Yeah, I was reading #638 (comment) and the concept of these networks being sparse seems to be just a misunderstanding of the profiling, AFAIK.
@slaren
@ggerganov In order to read a file into memory using malloc()+read(), the system will in fact perform an anonymous mmap() for the memory area, read the file using an equivalent mechanism, and then copy from the file's mapping to your anonymous area. That's a huge waste of RAM and load time, unless you need to transform the data on the fly while loading it. But if your file is used as-is, you must absolutely not load it, and map it instead. The file will then be accessed exactly as if this memory area had been swapped out: accesses to parts of the memory area that are not yet in memory cause a page fault resulting in an I/O from the file to that area, and the application sees the data. When memory becomes scarce, some less used areas will get swapped out again. You can even further optimize this using madvise() to indicate that some areas are read a lot and others less, etc. But in general, keep in mind that mmap() is an essential and critical component of modern operating systems and has been present everywhere since the early 90s. Hoping this helps.
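As a concrete illustration of the mechanism described above, here is a minimal POSIX C program (my own sketch, not code from this PR) that maps a file read-only, gives the kernel an access hint with madvise(), and lets page faults pull the data in lazily:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); close(fd); return 1; }

    /* no read() and no copy: pages come straight from the page cache on demand */
    void * addr = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after closing the descriptor */
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    /* optional hint: tell the kernel we intend to read this range soon */
    madvise(addr, (size_t) st.st_size, MADV_WILLNEED);

    /* the first touch of a page triggers a fault that reads it from the file */
    printf("first byte: 0x%02x\n", ((const unsigned char *) addr)[0]);

    munmap(addr, (size_t) st.st_size);
    return 0;
}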
@wtarreau Thank you - I now have a better understanding of how it works
You're welcome! I wanted to implement this last week-end and was very pleased to see that someone had already done it :-)
Significantly reduces model loading times, especially if they are already cached by the OS.
Requires a single-part model to work, either a 7B model or any model converted with New conversion script #545 (use with --n_parts 1). Only tested on Linux, needs to be tested with other POSIX-compatible operating systems. Should work on OS X as is.
The tensors may not be aligned, which may cause performance issues or crashes on some microarchitectures. However, testing by @jart showed no issues so far. This needs to be fixed in the conversion script.
Still missing:
llama_free
Co-authored with @jart.
🤖 Generated by Copilot at ef9afe1
Summary
✨⚡🐧
Added a feature to llama.cpp and ggml.c that allows loading and evaluating models from memory mapped files. This can improve performance and reduce memory usage for large models. Added a new parameter no_alloc to the ggml_context and ggml_init_params structs to control this feature. Implemented memory mapping for Unix-like systems and left a placeholder for Windows.
Walkthrough
- no_alloc in ggml_context and ggml_init_params to indicate whether the context should allocate memory for the tensor data or not (link, link) (see the sketch after this list)
- no_alloc field from the params argument in ggml_init (link)
- no_alloc field in ggml_new_tensor_impl to prevent allocating memory for the tensor data and to assign the data field to the memory mapped address if available (link, link)
- no_alloc field to false in ggml_opt and llama_eval_internal to ensure that the optimization and evaluation processes allocate memory for the intermediate tensors as usual (link, link)
- llama.cpp (link)
- mmap_file to llama.cpp to try to memory map a given file and return the mapped address or NULL on failure (link)
- llama_model_load to determine whether to use memory mapping or not based on the number of model parts and the success of mmap_file (link)
- llama_model_load (link)
- no_alloc field to the value of use_mmap in llama_model_load (link)
- llama_model_load (link)
- llama_model_load by setting the tensor data pointer to the offset from the mapped address and advancing the file pointer accordingly (link)
- no_alloc field to false in kv_cache_init to ensure that the key-value cache allocates memory for the tensors as usual (link)
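To make the no_alloc flow above easier to follow, here is a rough self-contained C sketch of the idea. The names and structs are simplified stand-ins for the real ggml/llama.cpp types, not the code in this PR: with the flag set, creating a tensor records its metadata but reserves no space for its data, and the loader later points the data field directly into the mapped file.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

struct sketch_tensor {
    size_t size;   /* bytes of tensor data */
    void * data;   /* left NULL under no_alloc, filled in by the loader */
};

struct sketch_context {
    bool   no_alloc;  /* the flag added by this PR, in spirit */
    char * buffer;    /* arena used on the normal allocation path */
    size_t used;
};

static struct sketch_tensor sketch_new_tensor(struct sketch_context * ctx, size_t size) {
    struct sketch_tensor t = { size, NULL };
    if (!ctx->no_alloc) {
        t.data = ctx->buffer + ctx->used;  /* normal path: carve space out of the arena */
        ctx->used += size;
    }
    /* under no_alloc the data pointer is left for the loader to fill in */
    return t;
}

int main(void) {
    static char arena[1024];
    static char fake_mapped_file[1024];  /* stand-in for the mmap'ed model file */

    struct sketch_context normal = { false, arena, 0 };
    struct sketch_context mapped = { true,  NULL,  0 };

    struct sketch_tensor a = sketch_new_tensor(&normal, 256);  /* data lives in the arena */
    struct sketch_tensor b = sketch_new_tensor(&mapped, 256);  /* data not allocated... */
    b.data = fake_mapped_file + 128;  /* ...the loader points it into the mapping instead */

    printf("a.data in arena: %d, b.data in mapping: %d\n",
           a.data == (void *) arena, b.data == (void *) (fake_mapped_file + 128));
    return 0;
}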