-
Notifications
You must be signed in to change notification settings - Fork 6
Memory leak exists, should use GC_malloc_ignore_off_page when alloc large object. #31
Comments
Thanks for the report @johnlanni - to confirm, are you saying you don't see the issue with the patch in your branch of nottinygc (https://github.com/higress-group/nottinygc/pull/1/files) or you do see it with that as well? I'm currently running locally first with the code in that branch, using |
We resolved the issue in There are two keypoints:
|
If the "malloc" symbol is not exported, proxy wasm will use the "proxy_on_memory_allocate" symbol. The symbol is correctly bound to the function when using the tetrate sdk: https://github.com/tetratelabs/proxy-wasm-go-sdk/blob/1b9daaf70730bd6197ca8a34b2a8af713bbce7ad/proxywasm/internal/abi_callback_alloc.go#L20 |
Thanks for the pointers, and sorry I had an issue with my build command, I'm able to verify the fix. For me, only ignore_page seems to be enough to not see an issue, I will start with it and thinnk more about malloc. AFAIK, Envoy uses malloc to allocate static memory so it should not need to be GC'd (languages without a GC would basically not be able to implement it otherwise), in fact it could cause worse performance for the GC otherwise. For the first change I sent #32, perhaps you can try using commit 667d0d4 to see if you also see the same behavior as me, that it is enough to fix it in your repro? Thank you. |
@anuraaga Envoy needs to use malloc in the wasm module when copying memory from the host to the wasm vm, such as copying the request body to the memory section of the wasm module. This memory should be managed by the wasm vm, not Envoy itself. So the second change is needed, otherwise, this copid memory will leak. |
Thanks @johnlanni - I also found https://prog.world/webassembly-will-bring-them-all-together/ which mentions it. That's a surprising model 😅 In that case it does seem like the only option is to stop exporting malloc, I will do some more checking on it and then make the change. |
@anuraaga Yes, this is a bit counterintuitive. But because envoy doesn't know when the wasm module is not using the copied buffer. So it can only be released by the wasm module itself. |
I have another question that is a bit confusing. Why do we need to introduce mimalloc after using bdwgc? Bdwgc doesn't really hand back the memory to OS, right? So the memory pool in mimalloc is not used? |
My understanding is that when using GC_malloc, bdwgc will consider that any section in the allocated memory will be refrenced by pointers, so the larger the allocated memory, the greater the possibility of being mistaken for a potential pointer to generate a reference. |
This is inherited from original exploration where it improved performance in apps that use malloc a lot like coraza-proxy-wasm. Later we added bdwgc and I think it was the far more significant improvement so I have been wondering if dlmalloc will perform fine and want to compare it. Mimalloc may still have general improvements besides pooling though.
Since 0.5, we switched from GC_malloc to _typed, which indicates which parts may be pointers, and most importantly that there aren't pointers in byte arrays. Notably, there is a version of _typed to ignore off page, so it appears that they are orthogonal issues. I'll try to look more at the code in bdwgc to see if I can understand the difference. |
Thanks @johnlanni! I also realized that the manual coraza-proxy-wasm ftw test I always use should be in CI in this repo. Any other tests will also be welcome here, or there as well of course depending on the scope. I'll ping after adding the coraza one to see what makes sense structure-wise, thank you |
@johnlanni I have added an e2e test based on the higress gc-test https://github.com/wasilibs/nottinygc/tree/main/e2e/higress-gc-test Let me know if you have any suggestions for it. Thanks. |
@anuraaga That's look great👍, maybe we should add more types of memory allocation in this plugin (with or without layout, etc.) which could cover the different if-else branches here: Lines 86 to 113 in a3593a7
|
Hi, |
Hi @behnamnikbakht - are you seeing perpetually increasing memory usage over time or is it that memory is not reduced when idle? Unfortunately Wasm has no means to return memory to the os, this is fundamental to the spec and nothing GC can do about it. So res memory will never go down even when idle By the way, your test case appears to be the same as this e2e https://github.com/wasilibs/nottinygc/blob/main/e2e/higress-gc-test/main.go#L64 AFAICT it is behaving as expected. |
In our production environment, the WASM plugin processes 2000 requests per minute. The plugin heavily uses make for allocation. Our performance environment simulated 2000 RPM using Jmeter and ran the test for almost 2 days continuously. RES RES Memory growth of the envoy went from 150 MB to 900MB (see the below picture). If the RES memory will never go down, restarting the envoy every day is the only option. Any suggestions, please? |
Thanks for the graph @cheranm - I realized that a lot of the focus in our e2e tests is on the GC allocation and not process memory. When running the higress e2e and looking at I don't know what to make of this - the numbers being reported by bdwgc and mimalloc could be wrong for some reason, but I don't think the chance of this is high. But in that case, it means the memory increase is outside what is allocated by this library, and I have no clues on how to debug that. |
Hi @anuraaga, thanks for your project, we are using this nottinygc in our higress project which based on Envoy's wasm extension mechanism.
We discovered a memory leak during use, and you can find demo code to reproduce the memory leak here:
https://github.com/alibaba/higress/tree/main/plugins/wasm-go/extensions/gc-test
Note here that we replace nottinygc with
github.com/higress-group/nottinygc
, if we comment out this replacement like this:Then build the wasm file and start envoy with the following configuration:
Use the following command to perform a stress test:
After a few seconds, we can see the following log:
The text was updated successfully, but these errors were encountered: