
504 Timeout #978

Closed
annamorphism opened this issue Aug 14, 2023 · 23 comments

@annamorphism

Users seem to sometimes be getting 504 errors.

From an email Sunday night (9 PM):
"I'm getting a "504 Gateway Timeout" on the Cantus website. Any idea if this is a database-wide problem or if it's specific to me? "

Another user confirmed that it wasn't happening for them, but do we know why users would be hitting this?

@jacobdgm
Contributor

I don't really know - I don't believe this error would be generated for just one particular user (unless they were perhaps using a proxy, in which case they might get a non-CantusDB-themed error page), but it could be occurring on only a few specific pages. It would be helpful if these users could list the pages they're trying to visit, ideally providing URLs.

@annamorphism
Author

I asked, but "It seemed to be any page and any action, just at certain times it wouldn't load. I got some "500" Error messages briefly today but it was pretty brief." [this would seem to be a different problem from the 504?]

@jacobdgm
Contributor

Hm... in the middle of the afternoon, I updated the Production server, and forgot to update one of our configuration files (the one that specifies whether the code is running on Staging or Production), which took about 5 minutes to fix; there were 500 errors on all pages for those few minutes.

@jacobdgm
Contributor

But yeah, I don't know what would cause all pages on the site to 504 sporadically. Hopefully it was something transient with the Alliance servers, and hopefully it doesn't happen again!

@annamorphism
Author

different user, yesterday:
"Incidentally, I got a 504 tonight trying to load the home page, but I refreshed it after a minute and it loaded properly. shrug"

@jacobdgm
Contributor

jacobdgm commented Aug 15, 2023

different user, yesterday:
"Incidentally, I got a 504 tonight trying to load the home page, but I refreshed it after a minute and it loaded properly. shrug"

Immediately following yesterday's update, I noticed the whole site was a bit sluggish. It improved after a few minutes.

@annamorphism
Author

so it seems like this is just users noticing the updates happening from afar? (I would say to maybe try to choose the timing to be minimally disruptive, but it's always going to be working hours somewhere...)

@annamorphism
Author

Today it's my turn, at 8:43 PM ET, trying to load a chant page I needed to consult: https://cantusdatabase.org/chant/702606. For fun I also tried https://cantusdatabase.org/source/696271 and https://cantusdatabase.org/ and they all gave me a timeout.
I had been merrily working along with no issues an hour earlier.
I tried refreshing the source page and, after a long wait, got it to load, but the homepage was still churning away a minute later, and eventually gave me another timeout (this at 8:48).
I tried the same pages in Safari instead of Firefox, just to see, and they also churned away for several minutes before giving a 504.
In the ten+ minutes I have had this comment open I have refreshed several times, and at last got the homepage just now (at 8:52) and then got https://cantusdatabase.org/chants/?source=696271 to cooperate. https://cantusdatabase.org/chant/702606 is still not loading up in either browser, and neither is https://cantusdatabase.org/source/696271 in my Safari tab.

@jacobdgm
Contributor

jacobdgm commented Aug 17, 2023

hm... so chant/702606 has a Cantus ID of g01258. Around 8:43, Cantus Index took more than 5s to give us concordances when we made a request to https://cantusindex.org/json-con/g01258. It's possible that we could decrease this timeout, which would potentially prevent 504s on chant detail pages, at the cost of more often not displaying concordances. Not sure whether it would work, but might be worth a try.
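
For reference, a rough sketch of what a shorter timeout might look like, assuming the concordance request is made with the requests library (the function name and timeout value are illustrative, not the actual CantusDB code):

import requests

CANTUS_INDEX_TIMEOUT = 2  # seconds; shorter than the ~5s mentioned above (value is an assumption)

def fetch_concordances(cantus_id):
    # Hypothetical helper: fetch concordances, or give up quickly so the chant
    # detail page can still render (minus concordances) instead of 504ing.
    url = f"https://cantusindex.org/json-con/{cantus_id}"
    try:
        response = requests.get(url, timeout=CANTUS_INDEX_TIMEOUT)
        response.raise_for_status()
        return response.json()
    except (requests.RequestException, ValueError):
        return None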

@jacobdgm
Contributor

jacobdgm commented Aug 17, 2023

I'm not sure what to say about source/696271 - loading it locally, no requests are made to any other sites, so Cantus Index can't be the cause there. I do see that someone got a 504 at 17/Aug/2023:00:48:17 +0000, yet one of the resources required to load the page (style.css) was returned with a 200 at the same timestamp (I'm not sure how to interpret this last bit).

@jacobdgm
Contributor

jacobdgm commented Aug 17, 2023

I spent a little while gathering some statistics on how often users are getting 504 codes. These 504s do not appear to be happening systematically on certain pages, even if they do seem to cluster in time. If we want to make progress on this, it might make sense to set a target for how many 504 errors are acceptable, as I expect it won't be possible to eliminate them entirely. We might use the ratio of 200 OK responses to 504 Gateway Timeout errors as the metric to optimize.

I downloaded the Nginx logs for Production for the last hour. cmd-Fing through the file, there were

  • 26297 instances of GET
  • 17709 instances of 200
  • no instances of 504
  • 200:504 ratio: ∞:1

Looking at a different one-hour window earlier in the day, I found similar results:

  • 22894 instances of GET
  • 16510 instances of 200
  • no instances of 504
  • 200:504 ratio: ∞:1

Looking at a 1-hour window containing 8:43pm yesterday:

  • 20392 instances of GET
  • 10607 instances of 200
  • 118 instances of 504
  • 200:504 ratio: ~90:1

Looking at the past 24 hours:

  • ~617000 for GET
  • 419420 for 200
  • 118 for 504
  • 200:504 ratio: ~3500:1

What's a good ratio to aim for - >1000:1 (in which case, we seem to be good already)? >10000:1?
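
For reproducibility, here is roughly how these counts could be tallied with a short script rather than cmd-F; the log path and the assumption of nginx's default "combined" log format are mine, not taken from the server config.

import re
from collections import Counter

# Assumes nginx's default "combined" log format, e.g.
# 1.2.3.4 - - [17/Aug/2023:00:48:17 +0000] "GET /source/696271 HTTP/1.1" 504 ...
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) [^"]*" (?P<status>\d{3}) ')

methods = Counter()
statuses = Counter()
with open("access.log") as log:  # path is an assumption
    for line in log:
        m = REQUEST_RE.search(line)
        if m:
            methods[m.group("method")] += 1
            statuses[m.group("status")] += 1

ok, timeouts = statuses["200"], statuses["504"]
print("GET:", methods["GET"], "| 200:", ok, "| 504:", timeouts)
print("200:504 ratio:", "inf" if timeouts == 0 else f"~{round(ok / timeouts)}:1")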

@annamorphism
Author

I feel like this depends a bit on the nature of the problem: "all users have to reload a chant page 1 in 10 times" (just to exaggerate the problem) is a slightly different situation from "1 in 10 users cannot use Cantus at all" or "Cantus is down for 2.4 hours of every day," with different levels of acceptability for each.

@ahankinson
Member

Aren’t the 504 errors caused by calling out to Cantus Index at page load time? If Cantus Index doesn’t provide a timely return, does this mean the Cantus Database page load stops?

If so, you could eliminate this by caching cantus index data locally and refreshing it periodically in a separate process (cron?).
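
A minimal sketch of that approach, assuming Django's low-level cache framework plus a hypothetical management command that a cron job could invoke (the module path, model fields, and timeouts are illustrative, not the actual CantusDB code):

# hypothetical file: main_app/management/commands/refresh_concordances.py
import requests
from django.core.cache import cache
from django.core.management.base import BaseCommand

from main_app.models import Chant  # assumption: a Chant model with a cantus_id field

class Command(BaseCommand):
    help = "Pre-fetch Cantus Index concordances into the local cache (run from cron)."

    def handle(self, *args, **options):
        cantus_ids = (
            Chant.objects.exclude(cantus_id=None)
            .values_list("cantus_id", flat=True)
            .distinct()
        )
        for cantus_id in cantus_ids:
            try:
                response = requests.get(
                    f"https://cantusindex.org/json-con/{cantus_id}", timeout=5
                )
                response.raise_for_status()
            except requests.RequestException:
                continue  # keep whatever value is already cached
            # Page views would read this key instead of calling Cantus Index inline.
            cache.set(f"concordances_{cantus_id}", response.json(), timeout=60 * 60 * 24)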

@jacobdgm
Contributor

Aren’t the 504 errors caused by calling out to Cantus Index at page load time? If Cantus Index doesn’t provide a timely return, does this mean the Cantus Database page load stops?

For views where we send requests to Cantus Index, if it takes more than 5s to get a response, we give up and display an unobtrusive error message on the relevant part of the page.

Our Source Detail view should not make any such requests; nor should the homepage.

On the Source Detail page, we run a sometimes-costly function to decide which Feasts to display in the sidebar panel, which could possibly lead to timeouts.
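
If that function does turn out to be the culprit, one low-risk mitigation would be memoizing its result per source with Django's low-level cache. A hedged sketch (the function names and cache timeout are hypothetical, not the actual CantusDB code):

from django.core.cache import cache

def get_sidebar_feasts(source):
    # Hypothetical wrapper around the existing costly computation (name assumed).
    cache_key = f"feasts_for_source_{source.id}"
    feasts = cache.get(cache_key)
    if feasts is None:
        feasts = compute_feasts_for_sidebar(source)  # the existing expensive function
        cache.set(cache_key, feasts, timeout=60 * 60)  # recompute at most hourly
    return feasts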

@lucasmarchd01
Contributor

Today it seems like all users were reaching a 504 Gateway Timeout even when trying to reach the homepage. That leads me to suspect that the Cantus Index requests aren't the sole cause of this error.

I restarted the containers on the production server to resolve the 504 timeouts, but I'm still unsure what exactly caused the issue in the first place, since the code wasn't changed or updated between when it was working and when it wasn't. Here are a few possibilities related to the docker configuration:

  1. Resource Exhaustion: If one or more containers were consuming excessive CPU, memory, or network bandwidth, the server could have been left unable to respond to requests.
  2. Stale State: Containers can accumulate stale connections or cached data over time, leading to performance degradation.
  3. Network Configuration: Misconfigurations can lead to connectivity issues.
  4. Resource Leaks: Containers might not release resources properly when they're done with them. Over time, this can lead to resource exhaustion and performance issues.

We should continue by analyzing logs, monitoring resource usage, checking for application bugs, reviewing network configurations, etc.
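
As a starting point for the resource-usage monitoring, a rough sketch using the docker Python SDK (this is not part of CantusDB; running docker stats on the host gives similar numbers):

import docker  # pip install docker

client = docker.from_env()
for container in client.containers.list():
    stats = container.stats(stream=False)  # one-shot snapshot
    mem = stats["memory_stats"]
    cpu = stats["cpu_stats"]
    precpu = stats["precpu_stats"]

    cpu_delta = cpu["cpu_usage"]["total_usage"] - precpu["cpu_usage"]["total_usage"]
    system_delta = cpu.get("system_cpu_usage", 0) - precpu.get("system_cpu_usage", 0)
    cpu_percent = 0.0
    if system_delta > 0:
        cpu_percent = cpu_delta / system_delta * cpu.get("online_cpus", 1) * 100

    print(
        f"{container.name}: "
        f"mem {mem.get('usage', 0) / 2**20:.0f} MiB of {mem.get('limit', 0) / 2**20:.0f} MiB, "
        f"cpu {cpu_percent:.1f}%"
    )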

@jacobdgm
Contributor

A couple of thoughts:

  • I wonder whether we can have Nginx cache certain rarely-changing pages such as the homepage - I don't think it will affect 504 errors across the rest of the site very much, but we should never have to show a 504 on, say, the "Fields and Content Descriptions" page (see the sketch after this list).
  • Since restarting the Docker containers seems to fix the problem at least for a little while, updating the Production VM for #1009 (Merge Staging into Production, 2023-Aug-21 edition) would be a natural time to see how resource usage changes following a restart of the various containers. I'll spend a while looking at the state of resource usage in the various containers before the restart, and then see whether anything changes dramatically once I restart them to run our new code.
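
On the first bullet: while the suggestion above is about Nginx-side caching, a cheap interim step at the Django layer would be per-view caching of rarely-changing pages. A sketch, assuming a class-based homepage view (the view name and timeout are illustrative, not the actual CantusDB code):

# hypothetical excerpt from the project's urls.py
from django.urls import path
from django.views.decorators.cache import cache_page

from main_app.views import HomePageView  # name assumed

urlpatterns = [
    # Serve a cached copy of the homepage for up to 5 minutes, so a slow
    # backend moment doesn't turn every homepage load into a 504.
    path("", cache_page(60 * 5)(HomePageView.as_view()), name="home"),
]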

@jacobdgm
Contributor

To start, a comparison between Staging and Production as they are before restarting Production:

STAGING
django container

top - 18:28:50 up 31 days, 23:30,  0 users,  load average: 0.00, 0.02, 0.04
Tasks:   9 total,   1 running,   8 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7975.2 total,   2130.8 free,    845.6 used,   4998.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   6692.9 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1 root      20   0   30940  26736   7952 S   0.0   0.3   2:02.57 gunicorn
8 root      20   0  203128 115252  16288 S   0.0   1.4 219:21.77 gunicorn
9 root      20   0  206380 118588  16096 S   0.0   1.5 218:18.03 gunicorn
10 root      20   0  202304 114368  16040 S   0.0   1.4 218:29.29 gunicorn
11 root      20   0  231468 143844  16288 S   0.0   1.8 217:55.63 gunicorn
12 root      20   0  200144 112472  16040 S   0.0   1.4 217:28.97 gunicorn
13 root      20   0    6052   3812   3292 S   0.0   0.0   0:00.03 bash
1417 root      20   0    6052   3728   3296 S   0.0   0.0   0:00.01 bash
1424 root      20   0    8952   3748   3168 R   0.0   0.0   0:00.00 top

postgres container:

Mem: 5984536K used, 2182096K free, 143012K shrd, 343008K buff, 4470332K cached
CPU:   0% usr   0% sys   0% nic 100% idle   0% io   0% irq   0% sirq
Load average: 0.00 0.01 0.03 4/290 28942
PID    PPID  USER     STAT   VSZ %VSZ CPU %CPU COMMAND
27     1     postgres S     166m   2%   3   0% postgres: autovacuum launcher
29     1     postgres S     166m   2%   7   0% postgres: logical replication launcher
24     1     postgres S     166m   2%   5   0% postgres: checkpointer
25     1     postgres S     166m   2%   0   0% postgres: background writer
26     1     postgres S     166m   2%   7   0% postgres: walwriter
1      0     postgres S     166m   2%   4   0% postgres
28     1     postgres S    24988   0%   5   0% postgres: stats collector
28936  0     root     S     2656   0%   6   0% bash
28942  28936 root     R     1616   0%   0   0% top

PRODUCTION
django container

top - 18:33:33 up 44 days,  2:43,  0 users,  load average: 1.06, 1.26, 1.50
Tasks:   8 total,   1 running,   7 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.6 us,  1.5 sy,  0.0 ni, 76.5 id,  6.9 wa,  0.0 hi,  0.4 si,  0.0 st
MiB Mem :   7975.2 total,    190.1 free,   2436.4 used,   5348.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   5114.5 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
28778 root      20   0  477908 321992  14044 S  40.5   3.9 393:52.48 gunicorn
28783 root      20   0  481088 330628  13960 S  31.0   4.0 398:53.74 gunicorn
28771 root      20   0  478380 317684  13996 S  12.7   3.9 399:44.89 gunicorn
28795 root      20   0  515996 363476  14128 S  10.1   4.5 395:27.13 gunicorn
28798 root      20   0  480220 328952  13936 S   9.5   4.0 400:43.75 gunicorn
1     root      20   0   30936  24396   5612 S   0.0   0.3   1:47.98 gunicorn
12449 root      20   0    6048   3548   3148 S   0.0   0.0   0:00.03 bash
12484 root      20   0    8948   3688   3112 R   0.0   0.0   0:00.00 top

postgres container

Mem: 7906948K used, 259684K free, 143012K shrd, 310828K buff, 4374132K cached
CPU:   5% usr   2% sys   0% nic  92% idle   0% io   0% irq   0% sirq
Load average: 1.29 1.28 1.50 2/313 8448
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
 8445     1 postgres S     168m   2%   6   0% postgres: cantusdb cantusdb 192.168.128.4(57632) idle
 8442     1 postgres S     168m   2%   0   0% postgres: cantusdb cantusdb 192.168.128.4(57598) idle
   26     1 postgres S     166m   2%   3   0% postgres: autovacuum launcher 
   28     1 postgres S     166m   2%   6   0% postgres: logical replication launcher 
   23     1 postgres S     166m   2%   5   0% postgres: checkpointer 
   24     1 postgres S     166m   2%   3   0% postgres: background writer 
   25     1 postgres S     166m   2%   2   0% postgres: walwriter 
    1     0 postgres S     166m   2%   5   0% postgres
   27     1 postgres S    25092   0%   0   0% postgres: stats collector 
 8414     0 root     S     2588   0%   7   0% bash
 8446  8414 root     R     1596   0%   0   0% top

(and again, because values were actively changing)

Mem: 7910008K used, 256624K free, 143544K shrd, 310828K buff, 4377812K cached
CPU:  12% usr   3% sys   0% nic  83% idle   0% io   0% irq   0% sirq
Load average: 1.08 1.24 1.48 6/312 8475
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    1     0 postgres S     166m   2%   2   0% postgres
 8471     1 postgres R     170m   2%   0   0% postgres: cantusdb cantusdb 192.168.128.4(40612) SELECT
 8467     1 postgres S     168m   2%   6   0% postgres: cantusdb cantusdb 192.168.128.4(40584) idle
   26     1 postgres S     166m   2%   3   0% postgres: autovacuum launcher 
   28     1 postgres S     166m   2%   6   0% postgres: logical replication launcher 
   23     1 postgres S     166m   2%   5   0% postgres: checkpointer 
   24     1 postgres S     166m   2%   3   0% postgres: background writer 
   25     1 postgres S     166m   2%   2   0% postgres: walwriter 
   27     1 postgres S    25092   0%   3   0% postgres: stats collector 
 8414     0 root     S     2588   0%   7   0% bash
 8446  8414 root     R     1596   0%   0   0% top

I'm not yet really sure what to look for, but a few observations:

  • Postgres
    • both systems appear to be using about the same amount of memory
    • CPU usage is a little bit higher on production than on staging (makes sense), but it doesn't seem that production is being pushed close to its limits at all.
  • Django
    • both systems are using a lot of their memory, though much of this is taken up by buffers/caches. This is especially true of production.
    • production is using a lot more CPU than staging is. The CPU values for the six gunicorn processes add up to ~104% (I'm assuming the system can handle >100% since it has multiple VCPUs or something).

@jacobdgm
Contributor

jacobdgm commented Aug 30, 2023

About to update the production machine after merging staging into production. Django container:


top - 13:23:50 up 44 days, 21:33,  0 users,  load average: 0.98, 0.88, 0.92
Tasks:   8 total,   1 running,   7 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.0 us,  2.8 sy,  0.0 ni, 86.3 id,  0.5 wa,  0.0 hi,  0.4 si,  0.0 st
MiB Mem :   7975.2 total,    207.3 free,   2230.8 used,   5537.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   5314.0 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                      
28783 root      20   0  480804 326968  10584 S  17.9   4.0 568:36.20 gunicorn
28771 root      20   0  493392 335716  10676 S  15.9   4.1 569:08.02 gunicorn
28798 root      20   0  461748 307280  10584 S  15.3   3.8 569:35.61 gunicorn
28795 root      20   0  486436 324024  10780 S  10.3   4.0 564:02.10 gunicorn
28778 root      20   0  485076 326576  10632 S   6.3   4.0 559:13.53 gunicorn
    1 root      20   0   30936  22132   3348 S   0.0   0.3   2:01.74 gunicorn
21314 root      20   0    6048   3636   3216 S   0.0   0.0   0:00.02 bash
21325 root      20   0    8948   3656   3080 R   0.0   0.0   0:00.00 top

Postgres container:

Mem: 7961432K used, 205200K free, 143012K shrd, 307256K buff, 4681388K cached
CPU:   0% usr   0% sys   0% nic 100% idle   0% io   0% irq   0% sirq
Load average: 0.81 0.90 0.91 2/314 6644
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
   26     1 postgres S     166m   2%   6   0% postgres: autovacuum launcher 
   28     1 postgres S     166m   2%   6   0% postgres: logical replication launcher 
   23     1 postgres S     166m   2%   7   0% postgres: checkpointer 
   24     1 postgres S     166m   2%   4   0% postgres: background writer 
   25     1 postgres S     166m   2%   1   0% postgres: walwriter 
    1     0 postgres S     166m   2%   5   0% postgres
   27     1 postgres S    25092   0%   4   0% postgres: stats collector 
 6636     0 root     S     2592   0%   5   0% bash
 6644  6636 root     R     1596   0%   0   0% top

Observations: in the django container, we're still using a fair bit of CPU - the gunicorn processes' %CPU adds up to ~70%.

@jacobdgm
Contributor

jacobdgm commented Aug 30, 2023

Production VM, within a minute after restarting the containers, django container:

top - 14:10:02 up 44 days, 22:19,  0 users,  load average: 2.98, 1.91, 1.38
Tasks:   8 total,   1 running,   7 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.8 us,  0.9 sy,  0.0 ni, 85.4 id, 11.8 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :   7975.2 total,    994.8 free,   1444.0 used,   5536.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   6100.9 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                      
    8 root      20   0  254296 165200  14828 S   7.0   2.0   0:04.97 gunicorn                                                                                                     
    7 root      20   0  168320  78460  14972 S   1.7   1.0   0:04.02 gunicorn                                                                                                     
   10 root      20   0  322460 233384  15012 S   1.3   2.9   0:07.82 gunicorn                                                                                                     
   11 root      20   0  319824 231432  14884 S   0.3   2.8   0:06.74 gunicorn                                                                                                     
    1 root      20   0   30936  26488   7708 S   0.0   0.3   0:00.92 gunicorn                                                                                                     
    9 root      20   0  199040 110156  14972 S   0.0   1.3   0:05.86 gunicorn                                                                                                     
   37 root      20   0    6048   3724   3288 S   0.0   0.0   0:00.03 bash                                                                                                         
   45 root      20   0    8948   3652   3076 R   0.0   0.0   0:00.00 top                                                                                                          

Postgres container, within 2-3 minutes:

Mem: 7218436K used, 948196K free, 142940K shrd, 307960K buff, 4679872K cached
CPU:   8% usr   3% sys   0% nic  86% idle   0% io   0% irq   1% sirq
Load average: 1.55 1.71 1.35 3/319 820
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
   26     1 postgres S     166m   2%   1   0% postgres: autovacuum launcher 
   28     1 postgres S     166m   2%   3   0% postgres: logical replication launcher 
   24     1 postgres S     166m   2%   4   0% postgres: background writer 
   25     1 postgres S     166m   2%   4   0% postgres: walwriter 
   23     1 postgres S     166m   2%   3   0% postgres: checkpointer 
    1     0 postgres S     166m   2%   4   0% postgres
   27     1 postgres S    25092   0%   1   0% postgres: stats collector 
  812     0 root     S     2592   0%   4   0% bash
  819   812 root     R     1596   0%   6   0% top

about 15 minutes later (after running tests etc.)
django container:

top - 14:25:39 up 44 days, 22:35,  0 users,  load average: 1.49, 1.81, 1.75
Tasks:   8 total,   2 running,   6 sleeping,   0 stopped,   0 zombie
%Cpu(s): 21.6 us,  3.4 sy,  0.0 ni, 74.0 id,  0.2 wa,  0.0 hi,  0.5 si,  0.3 st
MiB Mem :   7975.2 total,    611.9 free,   1805.5 used,   5557.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   5737.8 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                      
    7 root      20   0  397352 309272  14972 S  75.3   3.8   2:15.58 gunicorn                                                                                                     
   11 root      20   0  296732 208596  15076 R  24.7   2.6   2:14.69 gunicorn                                                                                                     
   10 root      20   0  300336 212232  15012 S  20.0   2.6   2:18.67 gunicorn                                                                                                     
    8 root      20   0  283876 195640  14828 S  14.3   2.4   2:16.47 gunicorn                                                                                                     
    9 root      20   0  361952 274036  15160 S  10.3   3.4   2:13.90 gunicorn                                                                                                     
    1 root      20   0   30936  26488   7708 S   0.0   0.3   0:01.12 gunicorn                                                                                                     
  128 root      20   0    6048   3684   3164 S   0.0   0.0   0:00.04 bash                                                                                                         
  692 root      20   0    8948   3720   3144 R   0.0   0.0   0:00.00 top                                                                                                          

postgres container:

top - 14:26:09 up 44 days, 22:35,  0 users,  load average: 1.93, 1.89, 1.78
Tasks:   8 total,   1 running,   7 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.2 us,  2.2 sy,  0.0 ni, 89.4 id,  0.0 wa,  0.0 hi,  0.2 si,  0.1 st
MiB Mem :   7975.2 total,    728.5 free,   1692.4 used,   5554.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   5852.6 avail Mem 
Mem: 7444340K used, 722292K free, 143244K shrd, 308128K buff, 4692776K cached
CPU:   8% usr   2% sys   0% nic  83% idle   5% io   0% irq   0% sirq
Load average: 1.40 1.75 1.73 2/341 8285
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
 8279     1 postgres S     168m   2%   1   0% postgres: cantusdb cantusdb 192.168.128.5(56852) idle
 8284     1 postgres S     168m   2%   6   0% postgres: cantusdb cantusdb 192.168.128.5(56886) idle
 8282     1 postgres S     168m   2%   3   0% postgres: cantusdb cantusdb 192.168.128.5(56866) idle
   26     1 postgres S     166m   2%   0   0% postgres: autovacuum launcher 
   28     1 postgres S     166m   2%   3   0% postgres: logical replication launcher 
   24     1 postgres S     166m   2%   4   0% postgres: background writer 
   23     1 postgres S     166m   2%   7   0% postgres: checkpointer 
   25     1 postgres S     166m   2%   3   0% postgres: walwriter 
    1     0 postgres S     166m   2%   7   0% postgres
   27     1 postgres S    25120   0%   4   0% postgres: stats collector 
 8064     0 root     S     2592   0%   6   0% bash
 8085  8064 root     R     1596   0%   4   0% top

Observations: I keep noticing the value for %CPU in the django container. It dropped right off upon restarting, but did it ever ramp back up quickly! Within 10-15 minutes, it was back up to ~150%. I wonder what could be going on here.

@jacobdgm
Contributor

jacobdgm commented Aug 30, 2023

Looking through logs in the django container, I'm noticing a few things:

  • there are a bunch of requests for static files - for example, see this flurry of requests made by a single IP address:
23.98.137.97 - - [30/Aug/2023:14:10:05 +0000] "GET /about/acknowledgements/ HTTP/1.1" 301 162 "-" "lychee/0.13.0" "-"
23.98.137.97 - - [30/Aug/2023:14:10:06 +0000] "GET /static/sshrc-logo.jpg HTTP/1.1" 200 28423 "http://cantusdatabase.org/static/sshrc-logo.jpg" "lychee/0.13.0" "-"
23.98.137.97 - - [30/Aug/2023:14:10:06 +0000] "GET /static/CantusLogoSmall.gif HTTP/1.1" 200 7083 "http://cantusdatabase.org/static/CantusLogoSmall.gif" "lychee/0.13.0" "-"
23.98.137.97 - - [30/Aug/2023:14:10:06 +0000] "GET /static/dalhousie-logo.png HTTP/1.1" 200 30901 "http://cantusdatabase.org/static/dalhousie-logo.png" "lychee/0.13.0" "-"
23.98.137.97 - - [30/Aug/2023:14:10:06 +0000] "GET /static/digital-alliance-logo.png HTTP/1.1" 200 33447 "http://cantusdatabase.org/static/digital-alliance-logo.png" "lychee/0.13.0" "-"
23.98.137.97 - - [30/Aug/2023:14:10:06 +0000] "GET /#/?next=/article/674793 HTTP/1.1" 200 23548 "http://cantusdatabase.org/#/?next=/article/674793" "lychee/0.13.0" "-"
23.98.137.97 - - [30/Aug/2023:14:10:06 +0000] "GET /static/creative-commons-logo.png HTTP/1.1" 200 1672 "http://cantusdatabase.org/static/creative-commons-logo.png" "lychee/0.13.0" "-"
23.98.137.97 - - [30/Aug/2023:14:10:06 +0000] "GET /static/uwaterloo-logo.png HTTP/1.1" 200 67090 "http://cantusdatabase.org/static/uwaterloo-logo.png" "lychee/0.13.0" "-"
23.98.137.97 - - [30/Aug/2023:14:10:06 +0000] "GET /static/facebookIcon.png HTTP/1.1" 200 615 "http://cantusdatabase.org/static/facebookIcon.png" "lychee/0.13.0" "-"
23.98.137.97 - - [30/Aug/2023:14:10:06 +0000] "GET /static/fonts/volpiano.css HTTP/1.1" 200 1121 "http://cantusdatabase.org/static/fonts/volpiano.css" "lychee/0.13.0" "-"
23.98.137.97 - - [30/Aug/2023:14:10:06 +0000] "GET /article/712335 HTTP/1.1" 200 25412 "http://cantusdatabase.org/article/712335" "lychee/0.13.0" "-"
23.98.137.97 - - [30/Aug/2023:14:10:06 +0000] "GET /static/favicon.ico HTTP/1.1" 200 1150 "http://cantusdatabase.org/static/favicon.ico" "lychee/0.13.0" "-"
23.98.137.97 - - [30/Aug/2023:14:10:06 +0000] "GET /static/sendMailIcon.png HTTP/1.1" 200 859 "http://cantusdatabase.org/static/sendMailIcon.png" "lychee/0.13.0" "-"
  • In principle, since files with /static/ in their paths don't change very often, we could have nginx serve these resources itself, without passing the requests on to the upstream (django) server. Not sure how to set this up, and it's unlikely to solve all our problems, but it could be worth looking into.
    • (update: checking the nginx logs over the past two hours since I restarted our containers, we've received ~167,000 requests, and about 30% of these are requests to a url with /static/ in its path)
  • one specific IP address keeps on making requests to /json-cid/cantus_id:
46.28.105.119 - - [30/Aug/2023:14:10:10 +0000] "GET /json-cid/cantus_id HTTP/1.0" 301 162 "-" "Drupal (+http://drupal.org/)" "-"
46.28.105.119 - - [30/Aug/2023:14:10:10 +0000] "GET /json-cid/008364a HTTP/1.0" 301 162 "-" "Drupal (+http://drupal.org/)" "-"
46.28.105.119 - - [30/Aug/2023:14:10:10 +0000] "GET /json-cid/cantus_id HTTP/1.0" 301 162 "-" "Drupal (+http://drupal.org/)" "-"
46.28.105.119 - - [30/Aug/2023:14:10:10 +0000] "GET /json-cid/cantus_id HTTP/1.0" 200 14 "-" "Drupal (+http://drupal.org/)" "-"
46.28.105.119 - - [30/Aug/2023:14:10:10 +0000] "GET /json-cid/008364a HTTP/1.0" 200 14 "-" "Drupal (+http://drupal.org/)" "-"
46.28.105.119 - - [30/Aug/2023:14:10:11 +0000] "GET /json-cid/cantus_id HTTP/1.0" 200 14 "-" "Drupal (+http://drupal.org/)" "-"
  • it seems to be a website running Drupal; the IP address is for someplace in the Czech Republic. Notice that it makes a couple of requests for proper Cantus IDs, but mostly requests info on chants with a Cantus ID of cantus_id. My guess is that there's a bug with Cantus Index, where a string template was not set up correctly, causing all of these useless requests (our server dutifully serves a response consisting of {"chants": []}). Again, it won't solve all our problems, but it might be worth sending an email to Jan to ask about this.
    • (update: checking the nginx logs over the past two hours since I restarted our containers, we've received ~167,000 requests, and requests by 46.28.105.119 to /json-cid/cantus_id make up about 5% of all requests.)

@jacobdgm
Contributor

jacobdgm commented Sep 8, 2023

I'm going to do some looking into Nginx and caching - again, it's not likely to fully solve the problem, but reducing the number of requests that make it all the way to gunicorn can't hurt.

jacobdgm self-assigned this Sep 8, 2023
@jacobdgm
Contributor

jacobdgm commented Sep 13, 2023

I sent an email to Jan about /json-cid/cantus_id/, and his response amounted to "Yes, that's Cantus Index's IP address. The site is being rebuilt, so I hope these requests stop once the new site is up. In the meantime, I hope they're not causing problems."

I sampled a minute or two of nginx logs from earlier in August, and out of ~6800 requests, 2400 contained /cantus_id in their path, and 1800 contained /static.

Another 1000-request sample (about 1 minute) from today contained ~400 requests to /static and ~50 to /cantus_id.

In summary, we're going to continue to get rogue requests from Cantus Index for the next while, and people are going to continue to request static resources. If nginx can handle these requests itself, we can reduce the load on gunicorn/Django by approximately half. So I'm going to spend some time over the next few days learning about Nginx, so that we can get it to serve 404s to Cantus Index and cache static files.
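
If configuring this in Nginx turns out to be fiddly, the rogue /json-cid/cantus_id requests could also be rejected in Django itself. A sketch, where the view name and structure are assumptions rather than the actual CantusDB code:

from django.http import Http404, JsonResponse

PLACEHOLDER = "cantus_id"  # the literal string Cantus Index's broken template sends

def json_cid_export(request, cantus_id):
    # Hypothetical sketch of the /json-cid/ view, not the actual implementation.
    if cantus_id == PLACEHOLDER:
        # Short-circuit before touching the database, instead of returning {"chants": []}.
        raise Http404("Not a valid Cantus ID")
    chants = []  # ...existing lookup/serialization logic would go here...
    return JsonResponse({"chants": chants})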

@jacobdgm
Contributor

Jan fixed the json-cid/cantus_id/ thing! Turns out we were at least partly at fault (see #1053).

Looking at the last 20,000 requests in our nginx logs, there is not a single instance of a 502 or 504. While I still think it would be a good idea to set up caching of static files, in a spirit of optimism I'm going to tentatively close this issue.
