-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Fix performance issues when having remote schema with multiple refs #451
Conversation
We would request the refs without any caching on disk and also do the network request for each file that is loaded. Instead, also cache refs of the given schema. To avoid more HTTP overhead, cached schemas are not revalidated when they were loaded/revalidated in the same process. Before: ``` time check-jsonschema --verbose --schemafile https://autoconfig.kde.org/jsonschemas/_combined.schema.json plugins/**.json ok -- validation done The following files were checked: .... plugins/virtualmonitor/kdeconnect_virtualmonitor.json ________________________________________________________ Executed in 31.14 secs fish external usr time 19.73 secs 5.08 millis 19.72 secs sys time 0.28 secs 0.01 millis 0.28 secs ``` After: ``` time python3 ~/projects/check-jsonschema/src/check_jsonschema/__main__.py --verbose --schemafile https://autoconfig.kde.org/jsonschemas/_combined.schema.json plugins /**.json ok -- validation done The following files were checked: ... plugins/virtualmonitor/kdeconnect_virtualmonitor.json ________________________________________________________ Executed in 2.23 secs fish external usr time 1.43 secs 4.72 millis 1.43 secs sys time 0.08 secs 0.09 millis 0.08 secs ```
Thanks for raising this issue and coming to the door with working code in a PR! In this particular case, I don't think I can accept this as-is and wanted to step back and think about what my requirements are. I've written this feature up in an issue (#452). There are three mechanical issues I see with this as-is:
If you're interested in continuing to work on this, I'd really appreciate it if you took a look at that issue. |
def open(self) -> t.Iterator[t.IO[bytes]]:
if (not self._cache_dir) or self._disable_cache:
yield io.BytesIO(self._get_request().content) We have a separate codepath for that.
Hmm, I see. But that does not quite work when one wants to cache multiple files. Like one might need to somehow use the complete URL to avoid those conflicts I also wondered if one might just want to use some in-memory cache. Like there should not be a need to re-read the file from disk.
Yep As a sidenote: the current caching does not prevent requests, it is only for writing the file on disk? Like some webservers set cache headers that might be leveraged. |
The
The issue in which I outlined a solution discusses this. By taking the full URI and using it to build a filename, I think we can solve this problem.
I'm slightly confused by this comment, given that this PR is about adding on-disk caching.
This seems inaccurate to me? Or if we are doing downloads of the schema even on a cache hit, it would mean there is a bug. Are you seeing redownloads of schemas even when there is a |
It hit me as soon as I posted that last comment that there is a bug and I even know when/how it got introduced. I can't work on a fix for it right now, during my workday, but I'll be circling back on it as soon as I have time. |
Should I still work on the discussed changes here? |
I think it's best if you wait at least until I put together a fix for #453, which I will hopefully be able to start working on in a few hours. The fix will likely refactor some of the existing codepaths, and I wouldn't want you to lose extra time on the resulting merge conflicts. Although normally I love to share authorship as much as possible, I have a lot of relevant context and some specific ideas about the requisite testing, so I think it will be most efficient for me to make that fix myself. I'll credit you in the changelog for raising the issue, however, since thanks are definitely deserved here! |
May I resume working on this or do you expect other conflicitng changes? |
I would advise against working on this PR. |
I'm closing this as superseded by #454, which should deliver these gains. This work was the original inspiration and it showed a lot of what had to happen, but I was unfortunately not able to structure the changeset in a way that could meaningfully be based off of this commit. I did, however, credit you in the changelog. This will be released in v0.29.0 in just a few minutes time. Thanks again for the contribution! |
Thank you for your efforts and credeting me! However, even with the latest version the performance is still very slow. Like if I pass in the same list of files, the schemas get requested lots of times. The latency still adds up to around 30 seconds. I think this is due to determining if we have a cache hit, we still make the request. |
To determine if there is a cache hit, we need to open a connection and get some bytes back, to get the headers. Are you saying that the behavior you see is that there's a full download? It's possible that there's some case that I missed during testing which causes this. It might be best to open an issue with a couple of example files to demonstrate the issue. When I tested things out after the improvements, I saw the behavior I was expecting, which showed a measurable improvement. |
But if we have revalidated/downloaded the same file in the same process, there is no need to open a HTTP connection again. My original issue was not about the data transfered, but the amount of requests that add up to quite a slow linting. |
There is an in-memory cache which is used here to retrieve data when ref resolution requests the same URL twice. This is built once per validator, and only one validator should be built per process. I'm very much open to the idea that there's some bug which causes this not to function correctly, but I need more information to reproduce in order to track it down. Can you please include some sample files which demonstrate the issue? I have the schema from your initial PR text, but I don't know what the content is supposed to be, so it's difficult for me to understand what you're seeing. |
But create_retrieve_callable gets called multiple times in case we have mutliple files passed in? The example repo I have used is https://invent.kde.org/network/kdeconnect-kde/ |
Are you certain of this? That should be called once per Validator instantiation, and therefore once per top-level schema, i.e. once per invocation. If it is being called for each |
I have just gotten some time to test and confirm; we're building a new validator for each document. So, yes, confirmed bug. Now that I have the example data I can repro this and see the issue. I'm going to file a distinct ticket as a bug report with some detail on what's happening. |
We would request the refs without any caching on disk and also do the network request for each file that is loaded.
Instead, also cache refs of the given schema.
To avoid more HTTP overhead, cached schemas are not revalidated when they were loaded/revalidated in the same process.
Before:
After: