Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Fix handful of networking issues including silence errors #4875

Merged
merged 1 commit into from
Sep 6, 2024

Conversation

sarayourfriend
Copy link
Collaborator

Fixes

Fixes #4774 by @stacimc

Description

Okay, this is a bit of a big one. We can split it up, but I'm running out of time before I need to end working for the week, and I needed to get this all down at least somewhere. The changes here (except the PIL ones) are all related to the linked issue, either by just ignoring the errors as non-actionable the way we discussed, or by actually fixing some of them (hopefully).

The main difference there being the refactor to treat aiohttp client responses as async context managers. This, apparently, is necessary when using aiohttp. I wrote the original code that introduced aiohttp, and I don't know how I missed this, or how the application was able to keep working so long with it missed, especially in such heavily trafficked endpoints like the thumbnail. In any case, I believe this is the root cause of the RuntimeErrors we've seen persistently. They "went away" for a while, but only because we stopped logging them. I was really worried about silencing them again, so spent some time pouring over the issues list for aiohttp, uvloop, uvicorn, httpx (to compare), asgiref, and django. I came up with nothing, and then sort of randomly read some of the aiohttp code trying to understand the request flow better... and yeah, noticed that it returns a context manager from get (et co). The docs also say this rather clearly if you read the client reference, but do not on the quickstart (and some instances of network calls just use await without closing).

This section, for example, does not use async with, just await.

But if you look here, on the actual get method's docs, it's clear about returning a context manager that needs closing.

So anyway, we got away with it for a long time, but hopefully this fixes it! It certainly makes sense that not using the context manager there would cause weird state issues with the client session being open but the underlying transport being closed and not recycled.

The rest of the changes around the aiohttp code are to get at the rest of the errors described by the issue, the ones we feel are non-actionable. I'm not entirely sure I agree with the list, and if we had more time, I would like to spend it looking at ClientOSError in particular. But we don't have time, and considering we're hopefully fixing a large portion of errors that otherwise did not need to happen, I'm fine going forward with this list as "non-actionable". I've made it clear in the comments on the lists of non-actionable exception classes that we might change how we approach this in the future. So this isn't closing any doors.

Finally, when looking at the context manager bits, I realised there were PIL Image objects never getting closed. They are only used in the oembed and watermark endpoints, both of which are known to be underused. Nevertheless, this should fix whatever small potential memory leak might have come about through that.

Testing Instructions

CI should pass. It took a while to get tests in a good state again, but only because the tests themselves are somewhat complex and obscure the underlying errors. The image_proxy.get function's tendency to just swallow errors makes debugging failing tests really difficult.

Test the following routes by visiting them on your local computer, after running ov just api/up:

  • image and audio search (test with and without filter_dead=true)
  • thumbnails
  • watermark
  • oembed

All of these should return correct responses without errors.

Lastly, try to think of places where there might be dangling context managers, not necessarily to add to this PR, but to open issues to follow up for. If you search the codebase and find any aiohttp references that aren't doing this correctly, we should include them here, I think.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • [N/A] I ran the DAG documentation generator (ov just catalog/generate-docs for catalog
    PRs) or the media properties generator (ov just catalog/generate-docs media-props
    for the catalog or ov just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🧱 stack: api Related to the Django API labels Sep 6, 2024
@sarayourfriend sarayourfriend requested a review from a team as a code owner September 6, 2024 04:26
@sarayourfriend sarayourfriend requested review from obulat, stacimc and dhruvkb and removed request for a team September 6, 2024 04:26
Copy link
Member

@dhruvkb dhruvkb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes make total sense to me. LGTM!

@dhruvkb dhruvkb merged commit 18d8416 into main Sep 6, 2024
71 checks passed
@dhruvkb dhruvkb deleted the fix/api-request-error-reporting branch September 6, 2024 06:11
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Do not log timeout errors for upstream thumbnail exception
2 participants