504 Timeout #978
I don't really know - I don't believe this error would be generated for just one particular user (unless they were perhaps using a proxy, in which case they might get a non-CantusDB-themed error page), but it could be generated for just a few specific pages. It would be helpful if these users could list the pages they're trying to visit, ideally providing URLs.
I asked, but "It seemed to be any page and any action, just at certain times it wouldn't load. I got some "500" Error messages briefly today but it was pretty brief." [This would seem to be a different problem from the 504?]
Hm... in the middle of the afternoon, I updated the Production server and forgot to update one of our configuration files (the one that specifies whether the code is running on Staging or Production), which took about 5 minutes to fix; there were 500 errors on all pages for those few minutes.
But yeah, I don't know what would cause all pages on the site to 504 sporadically. Hopefully it was something transient with the Alliance servers, and hopefully it doesn't happen again!
Different user, yesterday:
Immediately following yesterday's update, I noticed the whole site was a bit sluggish. It improved after a few minutes.
So it seems like this is just users noticing the updates happening from afar? (I would say to maybe try to choose the timing to be minimally disruptive, but it's always going to be working hours somewhere...)
Today it's my turn, at 8:43 PM ET, trying to load a chant page I needed to consult: https://cantusdatabase.org/chant/702606. For fun I also tried https://cantusdatabase.org/source/696271 and https://cantusdatabase.org/ and they all gave me a timeout.
Hm... so chant/702606 has a Cantus ID of g01258. Around 8:43, Cantus Index took more than 5s to give us concordances when we made a request to https://cantusindex.org/json-con/g01258. It's possible that we could decrease this timeout, which would potentially prevent 504s on chant detail pages, at the cost of more often not displaying concordances. Not sure whether it would work, but might be worth a try.
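A minimal sketch of that idea (not the actual CantusDB code; the helper name, timeout value, and fallback behaviour are placeholders): fetch the concordances with a shorter timeout and let the page render without them when Cantus Index is slow.

```python
import requests

# Placeholder for the concordances endpoint mentioned above.
CI_CONCORDANCES_URL = "https://cantusindex.org/json-con/{cantus_id}"


def fetch_concordances(cantus_id, timeout_seconds=2.0):
    """Return concordances JSON, or None if Cantus Index is slow or unreachable."""
    try:
        response = requests.get(
            CI_CONCORDANCES_URL.format(cantus_id=cantus_id),
            timeout=timeout_seconds,  # lower than the current ~5s cutoff
        )
        response.raise_for_status()
        return response.json()
    except (requests.RequestException, ValueError):
        # The template can render "concordances unavailable" instead of the
        # whole page request hanging until nginx returns a 504.
        return None
```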
I'm not sure what to say about source/696271 - trying to load it locally, there are no requests being made to any other sites, so that can't be it. But I see that someone got a 504 at [...]
I spent a little while gathering some statistics on how often users are getting 504 codes. It appears these 504s are not happening systematically on certain pages, even if they are perhaps occurring clustered in time. If we want to make progress on this, it might make sense to set a target for how many 504 errors are acceptable, as I expect that it won't be possible to fully eliminate them. We might use the ratio of 200 Success codes to 504 Gateway Timeout errors as a metric to optimize. I downloaded the Nginx logs for Production for the last hour.
Looking at a different one-hour window earlier in the day, I found similar results:
Looking at a 1-hour window containing 8:43pm yesterday:
Looking at the past 24 hours:
What's a good ratio to aim for - >1000:1 (in which case, we seem to be good already)? >10000:1?
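For context, a rough sketch of how such a ratio could be pulled out of an access log (this assumes nginx's default "combined" log format and is not the exact procedure used for the numbers above):

```python
from collections import Counter
import re

# In the "combined" log format the status code is the first three-digit
# field after the quoted request line.
STATUS_RE = re.compile(r'" (\d{3}) ')


def status_counts(log_path):
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = STATUS_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts


counts = status_counts("access.log")
ratio = counts["200"] / counts["504"] if counts["504"] else float("inf")
print(counts.most_common(), f"200:504 ratio ~ {ratio:.0f}:1")
```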
I feel like this depends a bit on the nature of the problem: "all users have to reload a chant page 1 in 10 times" (just to exaggerate the problem) is a slightly different situation from "1 in 10 users cannot use Cantus at all" or "Cantus is down for 2.4 hours of every day," with different levels of acceptability for each.
Aren't the 504 errors caused by calling out to Cantus Index at page load time? If Cantus Index doesn't provide a timely return, does this mean the Cantus Database page load stops? If so, you could eliminate this by caching Cantus Index data locally and refreshing it periodically in a separate process (cron?).
For views where we send requests to Cantus Index, if it takes more than 5s to get a response, we give up and display an unobtrusive error message on the relevant part of the page. Our Source Detail view should not make any such requests; nor should the homepage. On the Source Detail page, we run a sometimes-costly function to decide which Feasts to display in the sidebar panel, which could possibly lead to timeouts.
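For what it's worth, here is a sketch of the caching approach suggested in the previous comment, assuming Django's cache framework; the names, TTL, and refresh mechanism are placeholders rather than anything currently in CantusDB. Views read only from the cache, and a periodic job (a cron-driven management command, say) repopulates it, so a slow Cantus Index never blocks a user-facing page.

```python
import requests
from django.core.cache import cache

CACHE_TTL = 60 * 60 * 24  # arbitrary: keep cached concordances for a day


def cached_concordances(cantus_id):
    """Called from the view: returns whatever is cached, never contacts Cantus Index."""
    return cache.get(f"concordances:{cantus_id}")


def refresh_concordances(cantus_id):
    """Run periodically outside the request cycle to repopulate the cache."""
    try:
        response = requests.get(
            f"https://cantusindex.org/json-con/{cantus_id}", timeout=10
        )
        response.raise_for_status()
        cache.set(f"concordances:{cantus_id}", response.json(), CACHE_TTL)
    except requests.RequestException:
        pass  # keep serving the previously cached copy until the next refresh
```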
Today it seems like all users were reaching a 504 Gateway Timeout even when trying to reach the homepage. That leads me to suspect that the Cantus Index requests aren't the sole cause of this error. I restarted the containers on the production server to resolve the 504 timeouts, but I'm still unsure what exactly caused the issue in the first place, since the code wasn't changed or updated between when it was working and when it wasn't. Here are a few possibilities related to the Docker configuration:
We should continue by analyzing logs, monitoring resource usage, checking for application bugs, reviewing network configurations, etc.
A couple of thoughts:
To start, a comparison between Staging and Production as they are before restarting Production:
STAGING
postgres container:
PRODUCTION
postgres container:
(and again, because values were actively changing)
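Since the values were changing while being read, one hypothetical way to take comparable readings (assuming the figures above come from `docker stats`) is to capture several one-shot snapshots a few seconds apart:

```python
import subprocess
import time


def snapshot(samples=5, interval_seconds=5.0):
    """Print a one-shot `docker stats` reading every few seconds."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"))
        subprocess.run(
            [
                "docker", "stats", "--no-stream",
                "--format", "{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.PIDs}}",
            ],
            check=False,  # don't raise if docker returns a non-zero exit code
        )
        time.sleep(interval_seconds)


snapshot()
```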
I'm not yet really sure what to look for, but a few observations:
About to update the production machine after merging staging into production. Django container:
Postgres container:
Observations: django container: we're still using a fair bit of CPU - it adds up to ~70%
Production VM, within a minute after restarting the containers, django container:
Postgres container, within 2-3 minutes:
about 15 minutes later (after running tests etc.)
postgres container:
Observations: I keep noticing the value for [...]
Looking through logs in the django container, I'm noticing a few things:
I'm going to do some looking into Nginx and caching - again, it's not likely to fully solve the problem, but reducing the number of requests that make it all the way to gunicorn can't hurt.
I sent an email to Jan about [...]. I sampled a minute or two of nginx logs from earlier in August, and out of ~6800 requests, 2400 contained [...]. Another 1000-request sample (about 1 minute) from today contained ~400 [...]. In summary, we're going to continue to get rogue requests from CI for the next while, and people are going to continue to request static resources. If nginx can handle these requests itself, we can reduce the load on gunicorn/Django by approximately half. So I'm going to spend some time over the next few days learning about Nginx, so that we can get it to serve 404s to CantusIndex and cache static files.
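A rough sketch of the kind of tally described here (the path patterns are placeholders, since the inline snippets in the comment above did not survive): group a sample of access-log requests by whether nginx could handle them itself or they have to reach gunicorn/Django.

```python
from collections import Counter


def classify(path):
    if path.startswith("/static/"):
        return "static file (nginx could serve this directly)"
    if path.startswith("/some-ci-endpoint/"):  # placeholder for the rogue-request pattern
        return "rogue Cantus Index request (nginx could return 404)"
    return "application page (has to reach gunicorn/Django)"


counts = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        try:
            # combined log format: the request line is the first quoted field,
            # e.g. "GET /chant/702606 HTTP/1.1"
            path = line.split('"')[1].split()[1]
        except IndexError:
            continue
        counts[classify(path)] += 1

for category, count in counts.most_common():
    print(f"{count:6d}  {category}")
```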
Jan fixed the [...]. Looking at the last 20,000 requests in our nginx logs, there is not a single instance of a 502 or 504. While I still think it would be a good idea to set up caching of static files, in a spirit of optimism I'm going to tentatively close this issue.
Original issue description: Users seem to sometimes be getting 504 errors.
From an email Sunday night (9 PM):
"I'm getting a "504 Gateway Timeout" on the Cantus website. Any idea if this is a database-wide problem or if it's specific to me?"
Another user confirmed that it wasn't happening for them, but do we know why users would be hitting this?