
[BUGS] Default ulimit setting too low #1656

Closed
PhilippeAB opened this issue Feb 23, 2017 · 16 comments

@PhilippeAB

PhilippeAB commented Feb 23, 2017

Hi,
In a production environment with about 20 shares with snapshots, we ran into the "too many open files" problem.
Snapshots, shadow copies and exportfs fail, and then the whole web UI crashes with an internal server error 500.

Created a file in /etc/security/limits.d/:

cat 30-nofile.conf
root soft nofile 8192
root hard nofile 16364

(the default is 1024)
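
A quick way to confirm which limits actually apply (a minimal check; note that systemd services do not read these files, as discussed below):

# soft and hard open-file limits for the current shell
ulimit -Sn
ulimit -Hn
# limits of an already running process, e.g. the oldest gunicorn PID
grep 'Max open files' /proc/$(pgrep -o gunicorn)/limits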

@PhilippeAB
Author

PhilippeAB commented Mar 1, 2017

With systemd, the configuration through /etc/security/limits.d/ doesn't work for the service.
You need to update rockstor.service with the needed file limits:

[Service]
LimitNOFILE=10000
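
If you would rather not edit the unit file in place, the same limit can be applied via a systemd drop-in (a sketch, assuming the service is named rockstor.service):

mkdir -p /etc/systemd/system/rockstor.service.d
cat > /etc/systemd/system/rockstor.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=10000
EOF
systemctl daemon-reload
systemctl restart rockstor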

@PhilippeAB PhilippeAB changed the title Default ulimit setting maybee too low [BUGS] Default ulimit setting maybee too low Mar 1, 2017
@PhilippeAB PhilippeAB changed the title [BUGS] Default ulimit setting maybee too low [BUGS] Default ulimit setting too low Mar 1, 2017
@schakrava schakrava added this to the Point Bonita milestone Mar 24, 2017
@PhilippeAB
Author

The web frontend crashes even with 10000 files; it just takes longer.
There is an FD leak on /tmp/exports.
lsof gives this:
gunicorn 13354 13367 root 615u REG 0,40 592 9875127 /tmp/exports (deleted)

3700+ open files after a few days
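
To see whether the leak is still growing, something like this logs the count of deleted /tmp/exports handles over time (a rough sketch; the sampling interval and log path are arbitrary):

# append a timestamped count of leaked /tmp/exports descriptors every 10 minutes
while true; do
    echo "$(date '+%F %T')  $(lsof -nP 2>/dev/null | grep -c '/tmp/exports (deleted)')"
    sleep 600
done >> /root/exports-fd-leak.log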

@schakrava schakrava self-assigned this Aug 28, 2017
@phillxnet
Member

Linking to relevant forum threads where forum members are experiencing this same issue:
nfriedly in https://forum.rockstor.com/t/web-ui-errors-mostly-too-many-files-open-while-runing-a-balance/3689
and:
peter in https://forum.rockstor.com/t/error-24-too-many-open-files-causes-scheduled-scrub-job-to-fail-and-prevents-login-to-web-ui/3649
Please update these forum threads with this issue's resolution.

@nfriedly

nfriedly commented Aug 28, 2017

Yeah, I ran into this during a rebalance; here are a couple of comments I can add to follow up on my forum thread:

  • I have a number of rockons, but I turned off the rockons service before I noticed much of this issue
  • I have a scheduled scrub, but it did not occur during the balance (so far)
  • I have scheduled nightly, weekly, and monthly snapshots of my main share (~4TB of data). Since starting the balance it's attempted 3 nightly and 1 weekly snapshot, all have failed.
  • Grabbing files over SMB, or even just navigating folders, frequently failed. It would often time out several times, and then go very quickly when it did eventually work.
  • I installed lsof today, and running lsof | wc -l right now reports 10925. I'll see if I can come up with some bash foo to narrow down which commands make up the bulk of those

Both SMB and the web UI seem to be getting worse as time goes on, so I'm wondering if the snapshots are making things worse - perhaps each one is still attempting to run or something? The day I started the balance, I don't recall getting many issues navigating the web ui - it might have happened once or twice, but not enough to leave an impression. Today every page takes several attempts before it will load.

Update: This is a bit ugly, but it gives us a summary of which processes have the most open files (it omits anything with < 200):

lsof | awk 'BEGIN{print "command     files"} NR!=1 {a[$1]++}END{for (i in a) if(a[i]>200){printf("%-10s %6.0f\n", i, a[i])}}'
command     files
data-coll     635
gssproxy      264
afpd         1123
gunicorn     3748
postgres     1125
django        369

gunicorn is in the lead, followed by afpd and postgres. I'm not sure what's normal for any of those, so I can't really comment further today. I'll run it again after the balance finishes, and then again after a reboot, and see what things look like then.

@nfriedly

nfriedly commented Aug 29, 2017

Balance finished; this is what the output looks like now:

command     files
data-coll     595
smbd          397
gssproxy      264
gunicorn     3748
postgres      828
django        426

gunicorn definitely looks suspicious. Looking at the output, it appeared that a lot of the files were random temp files, e.g. /tmp/tmpXzUEIe

lsof | grep gunicorn | grep '/tmp/' | wc -l
2957

Curiously enough, none of those files actually appear to exist - /tmp/ looks to be empty:

[root@rockstor ~]# cat /tmp/tmpXzUEIe
cat: /tmp/tmpXzUEIe: No such file or directory
[root@rockstor ~]# ls /tmp/
[root@rockstor ~]#

I'm going to reboot it now, I'll post one more update in a minute.

Edit: the output from lsof actually marks the tmp files as deleted:

COMMAND     PID   TID     USER   FD      TYPE             DEVICE SIZE/OFF       NODE NAME
...
gunicorn  10442           root   39u      REG               0,44      350      48673 /tmp/tmpovv04R (deleted)
...

I saved the full output of lsof to a text file in case it's needed.
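
For what it's worth, a variant of the earlier one-liner restricted to deleted entries makes the leak easier to spot (a sketch in the same spirit as the summary above):

# count open-but-deleted files per command
lsof -nP | awk '/\(deleted\)/ {a[$1]++} END {for (i in a) printf("%-10s %6d\n", i, a[i])}'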

@nfriedly

nfriedly commented Aug 29, 2017

So, after a couple of attempts, I concluded that the web UI's reboot wasn't working - the "graceful shutdown" would go for 5 minutes, and then I'd log back in and be greeted with another "too many files open" error. Then I noticed that my ssh connection never closed. So I rebooted it there.

After the real reboot, this is what things look like:

command     files
data-coll     452
gssproxy      264
gunicorn      799
postgres      489
django        426

The total is about 5k:

lsof | wc -l
5020

gunicorn still has a few deleted temp files, although now they're all prefixed with wgunicorn-:

lsof | grep gunicorn | grep '/tmp/' 
gunicorn  10381           root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10381           root    8u      REG               0,44        0      31188 /tmp/wgunicorn-XbnXgu (deleted)
gunicorn  10396           root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10396 10415     root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10396 10416     root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10397           root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10397           root    8u      REG               0,44        0      31188 /tmp/wgunicorn-XbnXgu (deleted)
gunicorn  10397 10413     root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10397 10413     root    8u      REG               0,44        0      31188 /tmp/wgunicorn-XbnXgu (deleted)
gunicorn  10397 10414     root    7u      REG               0,44        0      31187 /tmp/wgunicorn-ECwU7j (deleted)
gunicorn  10397 10414     root    8u      REG               0,44        0      31188 /tmp/wgunicorn-XbnXgu (deleted)

@nfriedly

nfriedly commented Aug 29, 2017

Not sure if it's related, but just for completeness: after the reboot my Rock-ons service wouldn't start. After a bit of digging around, the solution here worked for me: https://forum.rockstor.com/t/docker-service-doesnt-start/1657

Plex seems to think it's a new server and that the old one is offline with a second copy of all of my media... but I think that is just a Plex bug.

Update: actually the Plex issue seems to be a side effect of none of my shares mounting. The data is all there, e.g. /mnt2/tank/e is full of files, but /mnt2/e is empty. (tank is the name of my pool, e is my share with all of my data)

Any ideas?

Second update: things work when I mount the shares manually. I am guessing that they failed to mount because I disabled quotas to make a rebalance not take weeks.
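
For reference, remounting a share by hand looks roughly like this (a hedged sketch: /dev/sdb is a hypothetical pool member device, and tank/e are the pool and share names mentioned above):

# Rockstor shares are btrfs subvolumes; mount one by subvolume name
mkdir -p /mnt2/e
mount -t btrfs -o subvol=e /dev/sdb /mnt2/e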

@PhilippeAB
Author

The files do not exist because they have been deleted, but the file descriptors are still held by the process, so the OS keeps the files open. You just can't see them when doing an ls.
You need to close the file descriptors.
I get this problem a lot on the UI.
There is more than one tmp file with this problem. For me it's mainly:
/tmp/afp.conf
/tmp/exports
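
One way to see the orphaned descriptors directly, without lsof, is via /proc (a minimal sketch that walks every gunicorn PID):

# list open-but-deleted files held by each gunicorn process
for pid in $(pgrep gunicorn); do
    echo "== PID $pid =="
    ls -l /proc/$pid/fd 2>/dev/null | grep '(deleted)'
done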

schakrava added a commit to schakrava/rockstor-core that referenced this issue Oct 21, 2017
schakrava added a commit that referenced this issue Oct 21, 2017
@schakrava
Member

I've updated our dependency on gunicorn; it was severely outdated. If the file descriptor leak still exists after this change, we can debug it a bit better.

@schakrava schakrava modified the milestones: Point Bonita, After Six Nov 10, 2017
@phillxnet
Member

Linking to another report by forum user smanley of "Too many open files":
https://forum.rockstor.com/t/support-server-down-gui-crashing/4350

@jvanderb

I'm having this same issue with alarming frequency. Following the directions in the error message, I opened a support ticket for it several months ago, but evidently nobody looks at those. This issue has been open for 11 months now, with no fix? Is anybody working on this, or is the solution to reboot the system all the time?

@phillxnet
Member

Linking to another forum thread (with recent activity by member erisler) on "Too many open files":
https://forum.rockstor.com/t/exception-while-running-command-usr-bin-hostnamectl-static-errno-24-too-many-open-files/2272/5
with more investigative info on gunicorn and afpd.

@ericrisler

ericrisler commented Jun 6, 2018

Thank you @phillxnet for linking the above troubleshooting...I'll add here as directed:

I poked around on the gunicorn GitHub for issues with file descriptors (https://github.com/benoitc/gunicorn/issues?utf8=✓&q=file+descriptor) and found several where gunicorn worker processes inherit open file descriptors from parent processes... perhaps some of the pros can take a look at these for inspiration:

benoitc/gunicorn#1428
benoitc/gunicorn#1375
benoitc/gunicorn#1327
https://bugs.python.org/issue10099

Also this: http://carsonip.me/posts/gevent-pywsgi-http-keep-alive-fd-leak, which suggests reverse-proxying connections to pywsgi to ensure proper detection of closed HTTP connections.

Update and possible fix:
I've searched the rockstor codebase and found that rockstor-core/src/rockstor/cli/api_wrapper.py bypasses nginx for local API calls... this seems to be the only place where port 8000 is called. I changed line 37 from self.url = 'http://127.0.0.1:8000' to self.url = 'http://127.0.0.1:443' and found that the web UI still works and that lsof | grep gunicorn | wc -l holds a steady count.

Added PR with possible fix: #1934

@phillxnet
Member

@Hooverdan96
Member

Along with #2020, has this been an issue since we moved to openSUSE? Do we still think that AFP might have been the underlying issue (which would then have been resolved after deprecating it)?

I believe @phillxnet closed #1934 due to age back then.

@phillxnet
Member

@Hooverdan96 I'll close this as we haven't had a report of this on our openSUSE base and since we dropped AFP.
