
Memory leak at least since version v10.9.0 #6402

Closed
jakubgs opened this issue Mar 10, 2025 · 22 comments · Fixed by #6449

@jakubgs
Member

jakubgs commented Mar 10, 2025

I have recently deployed the v10.9.0 release on our notify.prod fleet, which had been running v0.166.4 before.

Here is the memory usage for the status-go node on node-01.do-ams3.notify.prod over the last 20 days:

Image

You can clearly see the change in memory usage on the 3rd of March when the Docker image was upgraded.

This is not sustainable and needs to be investigated.

@jakubgs
Member Author

jakubgs commented Mar 10, 2025

I've decided to try to bisect it a bit. Since v0.166.4 is from 30 Aug 2023 and v10.9.0 is from 27 Feb 2025, the midpoint happens to fall on 2024-05-29:

| Version | Commit | Commit Date | Deploy Date | Leak? |
|---|---|---|---|---|
| v0.166.4 | bf748de | 2023-08-30 15:47:29 | unknown | 🆗 |
| v0.179.13 | 4f4b7a9 | 2024-05-07 10:17:25 | 2025-03-10 | 🆗 |
| v3.0.0 | a08319f | 2024-10-07 14:05:37 | 2025-03-11 | 🆗 |
| v3.7.0 | a08319f | 2024-11-19 08:11:52 | 2025-03-13 | 🆗 |
| v3.10.0 | 3824691 | 2024-11-22 02:23:17 | 2025-03-17 | 🆗 |
| - | ed36d60 | 2024-11-22 08:23:42 | 2025-03-20 | 🆗 |
| - | d18dfef | 2024-11-22 09:03:22 | 2025-03-22 | 🆗 |
| - | c5c28cc | 2024-11-22 12:32:49 | 2025-03-21 | 🎯 |
| - | f80042c | 2024-11-22 12:32:49 | 2025-03-19 | |
| - | 5fa57c8 | 2024-11-22 12:32:49 | 2025-03-18 | |
| v4.0.0 | ad28f15 | 2024-12-03 21:20:26 | 2025-03-16 | |
| v6.0.0 | 991d5df | 2024-11-28 10:55:02 | 2025-03-15 | |
| v7.1.0 | d07e61f | 2024-12-06 14:16:54 | 2025-03-14 | |
| v8.0.0 | a4e36d4 | 2024-12-23 08:32:48 | 2025-03-12 | |
| v10.9.0 | fb70236 | 2025-02-27 16:29:20 | 2025-03-03 | |

@jakubgs
Member Author

jakubgs commented Mar 11, 2025

It appears v0.179.13 is free from the memory leak based on one day of metrics:

Image

Next closest release between that and v10.9.0 is v3.0.0 from 2024-10-07.

Deployed: https://ci.infra.status.im/job/status-go/job/deploy-notify-prod/17/

@jakubgs
Member Author

jakubgs commented Mar 12, 2025

It appears v3.0.0 is free from the memory leak based on one day of metrics:

Image

Next closest release between that and v10.9.0 is v8.0.0 from 2024-12-23.

@jakubgs
Member Author

jakubgs commented Mar 13, 2025

And v8.0.0 seems to have the memory leak:

Image

So we have to keep bisecting between v3.0.0 and v8.0.0. Next up is v3.7.0 from 2024-11-19.

@jakubgs jakubgs self-assigned this Mar 13, 2025
@jakubgs
Member Author

jakubgs commented Mar 14, 2025

It seems v3.7.0 is fine:

Image

Next should be v7.1.0 from 2024-12-06.

@jakubgs
Member Author

jakubgs commented Mar 15, 2025

Memory leak present in v7.1.0:

Image

Next up is v6.0.0.

@jakubgs
Member Author

jakubgs commented Mar 16, 2025

And v6.0.0 is also leaking:

Image

We need to go back to v4.0.0.

@jakubgs
Member Author

jakubgs commented Mar 17, 2025

And v4.0.0 is a dud too:

Image

Next up v3.10.0.

@jakubgs
Member Author

jakubgs commented Mar 18, 2025

v3.10.0 is fine:

Image

So there are no more releases to check, just individual commits.

@jakubgs
Member Author

jakubgs commented Mar 19, 2025

And 5fa57c8 is broken too:

Image

@jakubgs
Member Author

jakubgs commented Mar 20, 2025

We're getting closer, f80042c is broken:

Image

We're down to 6 commits:

```
 > g sl v3.10.0~1..f80042c
f80042c5c 2024-11-22 Sale Djenic          > chore(wallet)_: ens resolver identified under ens api  (origin/test-mem-leak, test-mem-leak)
c5dede93b 2024-11-22 Alex Jbanca          > feat(browser connect)_: Implementing signTypedData_V4
c5c28cc56 2024-11-22 Patryk Osmaczko      > chore(logging)_: switch to zap.Logger as central logger
d18dfeffc 2024-11-22 Patryk Osmaczko      > chore(logging)_: introduce geth adapter
ed36d6054 2024-11-22 Alex Jbanca          > fix(BC)_: Fix personal_sign RPC event processing
3912e7645 2024-11-22 Patryk Osmaczko      > chore(logging)_: add benchmark for custom zap core
eae034aea 2024-11-22 Patryk Osmaczko      > feat(logging)_: introduce custom zap.Core enabling runtime changes
3824691f7 2024-11-22 frank                > chore_: remove CLI option  MEDIA_HTTPS (#6112)  (tag: v3.10.0)
```

@igor-sirotin
Collaborator

So there's a 50% chance it's @osmaczko's commit 😂

@jakubgs
Member Author

jakubgs commented Mar 21, 2025

And ed36d60 is clean:

Image

So we have 3 left, and c5c28cc5687963491ce2815c4523ce6248ff1adb should tell us which one it is.

@jakubgs
Member Author

jakubgs commented Mar 22, 2025

The c5c28cc commit has the memory leak:

Image

Which means it's either that commit, or d18dfef.

@jakubgs
Member Author

jakubgs commented Mar 23, 2025

And it turns out that d18dfef has no memory leak:

Image

Which means I'm reverting to c5c28cc to verify one last time that it's the cause.

@osmaczko
Contributor

osmaczko commented Mar 23, 2025

Image

Thanks @jakubgs for the blame. I'll try to fix it ASAP, sorry for the inconvenience 😓

@osmaczko osmaczko self-assigned this Mar 23, 2025
osmaczko added a commit that referenced this issue Mar 23, 2025
References prevented GC to remove unused loggers.

Potentially fixes #6402
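
The commit message points at the classic pattern behind this kind of leak. Purely as a minimal illustration (hypothetical names, not the actual status-go code; the real fix lives in #6449): a central logging registry that keeps a strong reference to every named zap logger it hands out prevents the garbage collector from ever reclaiming loggers that are no longer used.

```go
// Hypothetical sketch of the leak pattern, not the actual status-go code.
package logutil

import (
	"sync"

	"go.uber.org/zap"
)

// Registry hands out named child loggers from a central root logger and
// remembers every one of them (e.g. so log levels can be tweaked at runtime).
type Registry struct {
	mu      sync.Mutex
	root    *zap.Logger
	loggers map[string]*zap.Logger // strong references held forever
}

func NewRegistry(root *zap.Logger) *Registry {
	return &Registry{root: root, loggers: make(map[string]*zap.Logger)}
}

// Named returns a child logger and records it in the registry. If callers
// create many transient named loggers (one per session, peer, request, ...),
// the map grows without bound and the garbage collector can never reclaim
// the unused loggers, because the registry still references them.
func (r *Registry) Named(name string) *zap.Logger {
	r.mu.Lock()
	defer r.mu.Unlock()
	if l, ok := r.loggers[name]; ok {
		return l
	}
	l := r.root.Named(name)
	r.loggers[name] = l // the leak: reference retained forever
	return l
}

// NamedNoRetain is the leak-free variant: the child logger is handed out
// without being retained, so it becomes garbage once its owner drops it.
func (r *Registry) NamedNoRetain(name string) *zap.Logger {
	return r.root.Named(name)
}
```

Dropping the retained reference, or retaining only what is genuinely needed for runtime level changes, lets unused child loggers be collected once their owners go away.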
@osmaczko
Contributor

osmaczko commented Mar 23, 2025

@jakubgs could you please do a test against #6449? The branch is on top of c5c28cc.

@jakubgs
Member Author

jakubgs commented Mar 24, 2025

Done: https://ci.infra.status.im/job/status-go/job/deploy-notify-prod/55/

@igor-sirotin
Collaborator

@jakubgs The neatness of your investigation is one for the books and deserves a round of applause, thank you! 👏

@jrainville jrainville added this to the 2.33.1 milestone Mar 24, 2025
@osmaczko
Contributor

Looks promising, I'll give it a few more hours.

Image

osmaczko added a commit that referenced this issue Mar 24, 2025
References prevented GC to remove unused loggers.

fixes: #6402
@osmaczko osmaczko moved this to Code Review in Status Desktop/Mobile Board Mar 24, 2025
@osmaczko
Contributor

osmaczko commented Mar 25, 2025

Looks good except for the large interim spike.

Image

Similar spikes also appear before the upgrade to v10.9.0, so I consider this one "normal."

Image

osmaczko added a commit that referenced this issue Mar 25, 2025
References prevented GC to remove unused loggers.

fixes: #6402
@jakubgs
Member Author

jakubgs commented Mar 25, 2025

Indeed, it does look better. I have no idea what this spike is, though. But if it's only temporary we can live with it until it becomes too big of a problem. Thanks for looking into the issue.
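
As a side note for investigating such spikes later (not something done in this thread): a minimal sketch of exposing Go's pprof endpoint and logging basic heap stats, so a heap profile can be captured while a spike is in progress. The port and interval below are arbitrary assumptions, not status-go configuration.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"runtime"
	"time"
)

func main() {
	// Expose pprof locally; a heap profile can then be captured during a
	// spike with: go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Periodically log heap stats. Comparing HeapAlloc (live Go heap) with
	// the container RSS helps distinguish real allocations from memory the
	// runtime has freed but not yet returned to the OS.
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heapAlloc=%dMiB heapSys=%dMiB numGC=%d",
			m.HeapAlloc>>20, m.HeapSys>>20, m.NumGC)
		time.Sleep(30 * time.Second)
	}
}
```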

osmaczko added a commit that referenced this issue Mar 25, 2025
References prevented GC to remove unused loggers.

fixes: #6402
@github-project-automation github-project-automation bot moved this from Code Review to Done in Status Desktop/Mobile Board Mar 25, 2025
osmaczko added a commit that referenced this issue Mar 25, 2025
References prevented GC to remove unused loggers.

fixes: #6402
igor-sirotin pushed a commit that referenced this issue Mar 26, 2025
References prevented GC to remove unused loggers.

fixes: #6402