
[BUGS] Default ulimit setting too low #1656

Closed
PhilippeAB opened this issue Feb 23, 2017 · 16 comments

@PhilippeAB

PhilippeAB commented Feb 23, 2017

Hi,
In a production environment with about 20 shares with snapshots, we ran into the "too many open files" problem.
Snapshots, shadow copies and exportfs fail, and then the whole web UI crashes with an internal server error 500.

Created a file in /etc/security/limits.d/:

cat 30-nofile.conf
root soft nofile 8192
root hard nofile 16364

(the default is 1024)
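
A quick way to confirm which limits actually apply (a minimal check; note that systemd services do not read these files, as discussed below):

# soft and hard open-file limits for the current shell
ulimit -Sn
ulimit -Hn
# limits of an already running process, e.g. the oldest gunicorn PID
grep 'Max open files' /proc/$(pgrep -o gunicorn)/limits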

@PhilippeAB
Author

PhilippeAB commented Mar 1, 2017

With systemd, the configuration through /etc/security/limits.d/ doesn't work for the service.
You need to update rockstor.service with the needed file limits:

[Service]
LimitNOFILE=10000
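
If you would rather not edit the unit file in place, the same limit can be applied via a systemd drop-in (a sketch, assuming the service is named rockstor.service):

mkdir -p /etc/systemd/system/rockstor.service.d
cat > /etc/systemd/system/rockstor.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=10000
EOF
systemctl daemon-reload
systemctl restart rockstor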

@PhilippeAB PhilippeAB changed the title Default ulimit setting maybee too low [BUGS] Default ulimit setting maybee too low Mar 1, 2017
@PhilippeAB PhilippeAB changed the title [BUGS] Default ulimit setting maybee too low [BUGS] Default ulimit setting too low Mar 1, 2017
@schakrava schakrava added this to the Point Bonita milestone Mar 24, 2017
@PhilippeAB
Author

The web frontend crashes even with 10000 files; it just takes longer.
There is an FD leak on /tmp/exports.
lsof gives this:
gunicorn 13354 13367 root 615u REG 0,40 592 9875127 /tmp/exports (deleted)

3700+ open files after a few days
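
To see whether the leak is still growing, something like this logs the count of deleted /tmp/exports handles over time (a rough sketch; the sampling interval and log path are arbitrary):

# append a timestamped count of leaked /tmp/exports descriptors every 10 minutes
while true; do
    echo "$(date '+%F %T')  $(lsof -nP 2>/dev/null | grep -c '/tmp/exports (deleted)')"
    sleep 600
done >> /root/exports-fd-leak.log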

@schakrava schakrava self-assigned this Aug 28, 2017
@phillxnet
Member

Linking to relevant forum threads where forum members are experiencing this same issue:
nfriedly in https://forum.rockstor.com/t/web-ui-errors-mostly-too-many-files-open-while-runing-a-balance/3689
and:
peter in https://forum.rockstor.com/t/error-24-too-many-open-files-causes-scheduled-scrub-job-to-fail-and-prevents-login-to-web-ui/3649
Please update these forum threads with this issue's resolution.

@nfriedly

nfriedly commented Aug 28, 2017

Yeah, I ran into this during a rebalance; here are a couple of comments I can add to follow up on my forum thread:

  • I have a number of rockons, but I turned off the rockons service before I noticed much of this issue
  • I have a scheduled scrub, but it did not occur during the balance (so far)
  • I have scheduled nightly, weekly, and monthly snapshots of my main share (~4TB of data). Since starting the balance it's attempted 3 nightly and 1 weekly snapshot, all have failed.
  • Grabbing files over SMB, or even just navigating folders, frequently failed. It would often time out several times, and then go very quickly when it did eventually work.
  • I installed lsof today, and running lsof | wc -l right now reports 10925. I'll see if I can come up with some bash foo to narrow down which commands make up the bulk of those

Both SMB and the web UI seem to be getting worse as time goes on, so I'm wondering if the snapshots are making things worse - perhaps each one is still attempting to run or something? The day I started the balance, I don't recall getting many issues navigating the web ui - it might have happened once or twice, but not enough to leave an impression. Today every page takes several attempts before it will load.

Update: This is a bit ugly, but it gives us a summary of which processes have the most open files (it omits anything with < 200):

lsof | awk 'BEGIN{print "command     files"} NR!=1 {a[$1]++}END{for (i in a) if(a[i]>200){printf("%-10s %6.0f\n", i, a[i])}}'
command     files
data-coll     635
gssproxy      264
afpd         1123
gunicorn     3748
postgres     1125
django        369

gunicorn is in the lead, followed by afpd and postgres. I'm not sure what's normal for any of those, so I can't really comment further today. I'll run it again after the balance finishes, and then again after a reboot, and see what things look like then.

@nfriedly

nfriedly commented Aug 29, 2017

Balance finished; this is what the output looks like now:

command     files
data-coll     595
smbd          397
gssproxy      264
gunicorn     3748
postgres      828
django        426

gunicorn definitely looks suspicious. Looking at the output, it appeared that a lot of the files were random temp files, e.g. /tmp/tmpXzUEIe

lsof | grep gunicorn | grep '/tmp/' | wc -l
2957

Curiously enough, none of those files actually appear to exist - /tmp/ looks to be empty:

[root@rockstor ~]# cat /tmp/tmpXzUEIe
cat: /tmp/tmpXzUEIe: No such file or directory
[root@rockstor ~]# ls /tmp/
[root@rockstor ~]#

I'm going to reboot it now, I'll post one more update in a minute.

Edit: the output from lsof actually marks the tmp files as deleted:

COMMAND     PID   TID     USER   FD      TYPE             DEVICE SIZE/OFF       NODE NAME
...
gunicorn  10442           root   39u      REG               0,44      350      48673 /tmp/tmpovv04R (deleted)
...

I saved the full output of lsof to a text file in case it's needed.
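
For what it's worth, a variant of the earlier one-liner restricted to deleted entries makes the leak easier to spot (a sketch in the same spirit as the summary above):

# count open-but-deleted files per command
lsof -nP | awk '/\(deleted\)/ {a[$1]++} END {for (i in a) printf("%-10s %6d\n", i, a[i])}'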

@nfriedly

nfriedly commented Aug 29, 2017

So, after a couple of attempts, I concluded that the web UI's reboot wasn't working - the "graceful shutdown" would go for 5 minutes, and then I'd log back in and be greeted with another "too many files open" error. Then I noticed that my ssh connection never closed. So I rebooted it there.

After the real reboot, this is what things look like:

command     files
data-coll     452
gssproxy      264
gunicorn      799
postgres      489
django        426

The total is about 5k:

lsof | wc -l
5020

gunicorn still has a few deleted temp files, although now they're all prefixed with wgunicorn-:

lsof | grep gunicorn | grep '/tmp/' 
gunicorn  10381           root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10381           root    8u      REG               0,44        0      31188 /tmp/wgunicorn-XbnXgu (deleted)
gunicorn  10396           root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10396 10415     root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10396 10416     root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10397           root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10397           root    8u      REG               0,44        0      31188 /tmp/wgunicorn-XbnXgu (deleted)
gunicorn  10397 10413     root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10397 10413     root    8u      REG               0,44        0      31188 /tmp/wgunicorn-XbnXgu (deleted)
gunicorn  10397 10414     root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10397 10414     root    8u      REG               0,44        0      31188 /tmp/wgunicorn-XbnXgu (deleted)

@nfriedly

nfriedly commented Aug 29, 2017

Not sure if it's related, but just for completeness: after the reboot my Rock-ons service wouldn't start. After a bit of digging around, the solution here worked for me: https://forum.rockstor.com/t/docker-service-doesnt-start/1657

Plex seems to think it's a new server and that the old one is offline with a second copy of all of my media... but I think that is just a Plex bug.

Update: actually the Plex issue seems to be a side effect of none of my shares mounting. The data is all there, e.g. /mnt2/tank/e is full of files, but /mnt2/e is empty. (tank is the name of my pool, e is my share with all of my data)

Any ideas?

Second update: things work when I mount the shares manually. I am guessing that they failed to mount because I disabled quotas to make a rebalance not take weeks.
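
For reference, remounting a share by hand looks roughly like this (a hedged sketch: /dev/sdb is a hypothetical pool member device, and tank/e are the pool and share names mentioned above):

# Rockstor shares are btrfs subvolumes; mount one by subvolume name
mkdir -p /mnt2/e
mount -t btrfs -o subvol=e /dev/sdb /mnt2/e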

@PhilippeAB
Author

The files do not exist because they have been deleted, but the file descriptors are still held by the process, so the OS keeps the files open. You just can't see them when doing an ls.
You need to close the file descriptors.
I get this problem a lot on the UI.
There is more than one tmp file with this problem. For me it's mainly:
/tmp/afp.conf
/tmp/exports
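
One way to see the orphaned descriptors directly, without lsof, is via /proc (a minimal sketch that walks every gunicorn PID):

# list open-but-deleted files held by each gunicorn process
for pid in $(pgrep gunicorn); do
    echo "== PID $pid =="
    ls -l /proc/$pid/fd 2>/dev/null | grep '(deleted)'
done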

schakrava added a commit to schakrava/rockstor-core that referenced this issue Oct 21, 2017
schakrava added a commit that referenced this issue Oct 21, 2017
@schakrava
Member

I've updated our dependency on gunicorn; it was severely outdated. If the file descriptor leak still exists after this change, we can debug it a bit better.

@schakrava schakrava modified the milestones: Point Bonita, After Six Nov 10, 2017
@phillxnet
Member

Linking to another report by forum user smanley of "Too many open files":
https://forum.rockstor.com/t/support-server-down-gui-crashing/4350

@jvanderb

I'm having this same issue with alarming frequency. Following the directions in the error message, I opened a support ticket for it several months ago, but evidently nobody looks at those. This issue has been open for 11 months now, with no fix? Is anybody working on this, or is the solution to reboot the system all the time?

@phillxnet
Member

Linking to another forum thread (with recent activity by member erisler) on "Too many open files":
https://forum.rockstor.com/t/exception-while-running-command-usr-bin-hostnamectl-static-errno-24-too-many-open-files/2272/5
with more investigative info on gunicorn and afpd.

@ericrisler

ericrisler commented Jun 6, 2018

Thank you @phillxnet for linking the above troubleshooting...I'll add here as directed:

I poked around on the gunicorn GitHub for issues with file descriptors (https://github.com/benoitc/gunicorn/issues?utf8=✓&q=file+descriptor) and found several where gunicorn worker processes inherit open file descriptors from parent processes... perhaps some of the pros can take a look at these for inspiration:

benoitc/gunicorn#1428
benoitc/gunicorn#1375
benoitc/gunicorn#1327
https://bugs.python.org/issue10099

Also this: http://carsonip.me/posts/gevent-pywsgi-http-keep-alive-fd-leak, which suggests reverse-proxying connections to pywsgi to ensure proper detection of closed HTTP connections.

Update and possible fix:
I've searched the rockstor codebase and found that rockstor-core/src/rockstor/cli/api_wrapper.py bypasses nginx for local API calls... this seems to be the only place where port 8000 is called. I changed line 37 from self.url = 'http://127.0.0.1:8000' to self.url = 'http://127.0.0.1:443' and found that the web UI still works and that lsof | grep gunicorn | wc -l holds a steady count.

Added PR with possible fix: #1934

@phillxnet
Member

@Hooverdan96
Member

Along with #2020, has this been an issue since we moved to openSUSE? Do we still think that AFP might have been the underlying issue (which would then have been resolved after deprecating it)?

I believe @phillxnet closed #1934 due to age back then.

@phillxnet
Member

@Hooverdan96 I'll close this as we haven't had a report of this on our openSUSE base and since we dropped AFP.
