SIPNET swallows all memory when met is missing in the middle of a time series #38

Open
ashiklom opened this issue Oct 19, 2018 · 8 comments

@ashiklom Member

Describe the bug
Long (multi-decadal) runs of SIPNET can exceed system memory limits, at least on Docker.
The model run process is then killed by the kernel OOM killer.
In the logfile, this appears only as the cryptic message "Killed".
In the system logs (the log command on Mac), it looks like:

2018-10-19 10:56:16.133675-0400 0x20cf5    Default     0x0                  11671  com.docker.hyperkit: [51322.605158] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
...
2018-10-19 10:56:16.191227-0400 0x20cf5    Default     0x0                  11671  com.docker.hyperkit: [51322.662809] [75007]     0 75007  1968408  1845966    3849      11   121487             0 sipnet.136
2018-10-19 10:56:16.191775-0400 0x20cf5    Default     0x0                  11671  com.docker.hyperkit: [51322.663345] Out of memory: Kill process 75007 (sipnet.136) score 855 or sacrifice child
2018-10-19 10:56:16.192534-0400 0x20cf5    Default     0x0                  11671  com.docker.hyperkit: [51322.663901] Killed process 75007 (sipnet.136) total-vm:7873632kB, anon-rss:7383864kB, file-rss:0kB, shmem-rss:0kB

To Reproduce
Run SIPNET (r136) for 100 years.

Expected behavior
The model should run to completion.

Machine (please complete the following information):

  • Server: Docker
  • OS: Mac OS High Sierra
  • Machine itself has plenty of memory (32 GB total; >10 GB free at time of execution), so this is likely an artificial bottleneck imposed by Docker or the system.
@ashiklom Member Author

An extended kernel log is available here.

@robkooper Member

Does the same thing happen if you run this directly on the machine (outside the container)?

@ashiklom Member Author

Haven't tried yet, but I'll give it a whirl.

@ashiklom Member Author

So the problem here turned out to be that I was missing a year of met (2004, for a run from 1902 to 2008) in the middle of the time series. That apparently tripped up SIPNET. Running 1900 to 2003 worked fine and, profiling via Valgrind, only used about 20-30 MB of RAM.

That missing met would kill it this way is insidious. I'm leaving this issue open but changing the title to reflect the real problem.

(The missing met was partially user error -- I had already downloaded year 2004 for this site, but because I did it before my land-sea mask fix, it was all NA.)

@ashiklom ashiklom changed the title Dockerized SIPNET runs out of memory for long runs SIPNET swallows all memory when met is missing in the middle of a time series Oct 19, 2018
@github-actions

This issue is stale because it has been open 365 days with no activity.

@ashiklom Member Author

This is a SIPNET issue, not a PEcAn issue, so transferring to #7.

@dlebauer dlebauer transferred this issue from PecanProject/pecan Mar 3, 2025
@dlebauer dlebauer reopened this Mar 3, 2025
@Alomir Collaborator commented Mar 5, 2025

Update: fscanf just spins its wheels when it hits a string that doesn't match the specified format ("NA" where a double is expected, in this case). By "spins its wheels" I mean it never advances past that NA in the file, so the read loop never hits EOF and keeps creating empty clim records.

  1. The C standard doesn't pin down where the stream ends up after a failed conversion (only one character of pushback is guaranteed), so anything we do here won't be foolproof...
  2. BUT - I can put in a check that every field was parsed for each record, and print an error and exit if not (sketched below).

IMO, we should also document that input files (.clim and the new events file) SHALT BE constructed according to spec, else all bets are off.
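
Here's a minimal sketch of the check I have in mind -- the field count, names, and error wording are illustrative, not SIPNET's actual reader:

#include <stdio.h>
#include <stdlib.h>

#define NUM_CLIM_FIELDS 14  /* columns per .clim record; illustrative */

/* Read one climate record. Returns 1 on success, 0 on clean EOF.
 * Exits with a clear message instead of looping forever when a field
 * fails to parse (e.g. a literal "NA" where a number is expected). */
static int readClimRecord(FILE *in, double vals[NUM_CLIM_FIELDS], long recNum) {
    for (int i = 0; i < NUM_CLIM_FIELDS; i++) {
        int n = fscanf(in, "%lf", &vals[i]);
        if (n == EOF && i == 0) {
            return 0;  /* end of file before the start of a new record */
        }
        if (n != 1) {
            fprintf(stderr, "Error parsing field %d of climate record %ld "
                    "(non-numeric value such as \"NA\"?)\n", i + 1, recNum);
            exit(EXIT_FAILURE);
        }
    }
    return 1;
}

The point is simply that fscanf's return value (items assigned, or EOF) is checked on every call, so a bad token stops the run immediately instead of spinning.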

@Alomir Alomir self-assigned this Mar 5, 2025
@infotroph Member

Supporting @Alomir's analysis, it looks like this doesn't need to be a fully missing interval -- even a single NA in an otherwise-complete row is enough. Here's a 3-line file that hangs:

0	2016	1	0	0.125	9.62	9.30	7.19	-6.93e-15	504.41	478.76	697.26	1.23	0.6
0	2016	1	3	0.125	4.22	NA	0.31	-6.93e-15	200.31	46.88	626.89	1.85	0.6
0	2016	1	6	0.125	0.89	9.34	-1.92e-15	-6.93e-15	100.29	625.06	551.67	0.85	0.6
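
For the record, a stripped-down reader with the same feof-keyed pattern (a hypothetical repro, not the actual SIPNET source) never terminates on that file:

#include <stdio.h>

int main(void) {
    FILE *in = fopen("bad.clim", "r");  /* the 3-line file above */
    if (in == NULL) { perror("bad.clim"); return 1; }
    double v;
    long count = 0;
    /* The trap: when fscanf hits "NA" the %lf conversion fails without
     * reliably advancing the stream, so feof() never becomes true and
     * count grows without bound -- hence the memory blow-up upstream. */
    while (!feof(in)) {
        fscanf(in, "%lf", &v);  /* return value ignored -- the bug */
        count++;
    }
    fclose(in);
    printf("read %ld values\n", count);
    return 0;
}

Checking fscanf's return value, as in the sketch above, instead makes the read stop with an error at the NA in row 2.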
