Fix handful of networking issues including silence errors #4875
Fixes
Fixes #4774 by @stacimc
Description
Okay, this is a bit of a big one. We can split it up, but I'm running out of time before I need to end working for the week, and I needed to get this all down at least somewhere. The changes here (except the PIL ones) are all related to the linked issue, either by just ignoring the errors as non-actionable the way we discussed, or by actually fixing some of them (hopefully).
The main fix is the refactor to treat aiohttp client responses as async context managers. This, apparently, is necessary when using aiohttp. I wrote the original code that introduced aiohttp, and I don't know how I missed this, or how the application was able to keep working so long with it missed, especially in such heavily trafficked endpoints as the thumbnail. In any case, I believe this is the root cause of the RuntimeErrors we've seen persistently. They "went away" for a while, but only because we stopped logging them. I was really worried about silencing them again, so I spent some time poring over the issues lists for aiohttp, uvloop, uvicorn, httpx (to compare), asgiref, and django. I came up with nothing, and then sort of randomly read some of the aiohttp code trying to understand the request flow better... and yeah, noticed that it returns a context manager from `get` (et al.). The docs also say this rather clearly if you read the client reference, but not in the quickstart (and some examples of network calls just use await without closing).
This section, for example, does not use async with, just await.
But if you look here, on the actual `get` method's docs, it's clear about returning a context manager that needs closing.
So anyway, we got away with it for a long time, but hopefully this fixes it! It certainly makes sense that not using the context manager there would cause weird state issues with the client session being open but the underlying transport being closed and not recycled.
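As a rough illustration of the pattern described above (not the code from this PR), here is a minimal sketch. `fetch_body` is a hypothetical helper, and `session` is assumed to be an `aiohttp.ClientSession`; the only aiohttp calls relied on are `session.get(...)` used with `async with`, `response.raise_for_status()`, and `await response.read()`.

```python
async def fetch_body(session, url: str) -> bytes:
    """Fetch a URL and return the raw body.

    `session` is assumed to be an aiohttp.ClientSession (hypothetical
    helper for illustration, not a function from the codebase).
    """
    # Entering the context manager sends the request. Leaving it releases
    # the response, even if raise_for_status() or read() raises, so the
    # connection is not left dangling.
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.read()
```

With a real session, this would be awaited inside `async with aiohttp.ClientSession() as session:`. The version that causes trouble is `response = await session.get(url)` with nothing ever releasing `response`.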
The rest of the changes around the aiohttp code are to get at the rest of the errors described by the issue, the ones we feel are non-actionable. I'm not entirely sure I agree with the list, and if we had more time, I would like to spend it looking at `ClientOSError` in particular. But we don't have time, and considering we're hopefully fixing a large portion of errors that otherwise did not need to happen, I'm fine going forward with this list as "non-actionable". I've made it clear in the comments on the lists of non-actionable exception classes that we might change how we approach this in the future. So this isn't closing any doors.
Finally, when looking at the context manager bits, I realised there were PIL Image objects never getting closed. They are only used in the oembed and watermark endpoints, both of which are known to be underused. Nevertheless, this should fix whatever small potential memory leak might have come about through that.
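For readers unfamiliar with the "non-actionable" approach mentioned above, a minimal sketch of the idea follows. Everything here is hypothetical: the tuple contents are standard-library stand-ins (the real lists hold aiohttp classes, of which the description only names `ClientOSError`), and `report_network_error` is an illustrative name, not a function from the codebase.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

# Stand-in list for illustration only. As in the PR, the intent is that
# this list can be revisited later; an entry here is not a permanent verdict.
NON_ACTIONABLE_EXCEPTIONS = (asyncio.TimeoutError, ConnectionResetError)


def report_network_error(exc: BaseException) -> bool:
    """Log `exc` and return True if it was treated as non-actionable."""
    if isinstance(exc, NON_ACTIONABLE_EXCEPTIONS):
        # Expected upstream flakiness: record it quietly, no traceback.
        logger.warning("Ignoring non-actionable network error: %r", exc)
        return True
    # Anything else is still surfaced with full traceback information.
    logger.error("Unexpected network error", exc_info=exc)
    return False
```

Keeping the classes in one named tuple means a later decision to treat one of them as actionable is a one-line change.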
Testing Instructions
CI should pass. It took a while to get tests in a good state again, but only because the tests themselves are somewhat complex and obscure the underlying errors. The image_proxy.get function's tendency to just swallow errors makes debugging failing tests really difficult.
Test the following routes by visiting them on your local computer, after running `ov just api/up`:
… `filter_dead=true`)
All of these should return correct responses without errors.
Lastly, try to think of places where there might be dangling context managers, not necessarily to add to this PR, but to open issues to follow up for. If you search the codebase and find any aiohttp references that aren't doing this correctly, we should include them here, I think.
Checklist
… `Update index.md`).
… `main`) or a parent feature branch.
… `ov just catalog/generate-docs` for catalog PRs) or the media properties generator (`ov just catalog/generate-docs media-props` for the catalog or `ov just api/generate-docs` for the API) where applicable.
Developer Certificate of Origin