Fix performance issues when having remote schema with multiple refs #451

alex1701c · 2024-06-26T19:30:56Z

We would request the refs without any caching on disk and also do the network request for each file that is loaded.
Instead, also cache refs of the given schema.

To avoid more HTTP overhead, cached schemas are not revalidated when they were loaded/revalidated in the same process.

Before:

time check-jsonschema --verbose --schemafile https://autoconfig.kde.org/jsonschemas/_combined.schema.json plugins/**.json
ok -- validation done
The following files were checked:
  ....
  plugins/virtualmonitor/kdeconnect_virtualmonitor.json

________________________________________________________
Executed in   31.14 secs    fish           external
   usr time   19.73 secs    5.08 millis   19.72 secs
   sys time    0.28 secs    0.01 millis    0.28 secs

After:

time python3 ~/projects/check-jsonschema/src/check_jsonschema/__main__.py --verbose --schemafile https://autoconfig.kde.org/jsonschemas/_combined.schema.json plugins
/**.json
ok -- validation done
The following files were checked:
  ...
  plugins/virtualmonitor/kdeconnect_virtualmonitor.json

________________________________________________________
Executed in    2.23 secs    fish           external
   usr time    1.43 secs    4.72 millis    1.43 secs
   sys time    0.08 secs    0.09 millis    0.08 secs

We would request the refs without any caching on disk and also do the network request for each file that is loaded. Instead, also cache refs of the given schema. To avoid more HTTP overhead, cached schemas are not revalidated when they were loaded/revalidated in the same process. Before: ``` time check-jsonschema --verbose --schemafile https://autoconfig.kde.org/jsonschemas/_combined.schema.json plugins/**.json ok -- validation done The following files were checked: .... plugins/virtualmonitor/kdeconnect_virtualmonitor.json ________________________________________________________ Executed in 31.14 secs fish external usr time 19.73 secs 5.08 millis 19.72 secs sys time 0.28 secs 0.01 millis 0.28 secs ``` After: ``` time python3 ~/projects/check-jsonschema/src/check_jsonschema/__main__.py --verbose --schemafile https://autoconfig.kde.org/jsonschemas/_combined.schema.json plugins /**.json ok -- validation done The following files were checked: ... plugins/virtualmonitor/kdeconnect_virtualmonitor.json ________________________________________________________ Executed in 2.23 secs fish external usr time 1.43 secs 4.72 millis 1.43 secs sys time 0.08 secs 0.09 millis 0.08 secs ```

sirosen · 2024-06-27T14:46:51Z

Thanks for raising this issue and coming to the door with working code in a PR!

In this particular case, I don't think I can accept this as-is and wanted to step back and think about what my requirements are. I've written this feature up in an issue (#452). There are three mechanical issues I see with this as-is:

--no-cache would not be respected for refs
multiple refs with the same filename would conflict (--cache-filename works around this issue for the existing caching)
we need to add tests for this feature, including testing for scenarios like two refs with the same filename but different URIs

If you're interested in continuing to work on this, I'd really appreciate it if you took a look at that issue.
Alternatively, I'm happy to pick this up and run with it, making changes to meet the requirements and adding tests. Cutting 30s runtimes to 2s is a laudable goal; we just need to make sure we don't create problematic scenarios as we do it.

alex1701c · 2024-06-27T15:15:21Z

--no-cache would not be respected for refs

    def open(self) -> t.Iterator[t.IO[bytes]]:
        if (not self._cache_dir) or self._disable_cache:
            yield io.BytesIO(self._get_request().content)

We have a separate codepath for that.

multiple refs with the same filename would conflict (--cache-filename works around this issue for the existing caching)

Hmm, I see. But that does not quite work when one wants to cache multiple files. Like one might need to somehow use the complete URL to avoid those conflicts

I also wondered if one might just want to use some in-memory cache. Like there should not be a need to re-read the file from disk.

we need to add tests for this feature, including testing for scenarios like two refs with the same filename but different URIs

Yep

As a sidenote: the current caching does not prevent requests, it is only for writing the file on disk? Like some webservers set cache headers that might be leveraged.

sirosen · 2024-06-27T16:13:11Z

The CacheDownloader used inside of the ref resolver does not have disable_cache=... set on init. As a result, it uses the class default. i.e. self._disable_cache is always False in the new usage site, even if --no-cache was passed.

But that does not quite work when one wants to cache multiple files. Like one might need to somehow use the complete URL to avoid those conflicts

The issue in which I outlined a solution discusses this. By taking the full URI and using it to build a filename, I think we can solve this problem.

I also wondered if one might just want to use some in-memory cache. Like there should not be a need to re-read the file from disk.

I'm slightly confused by this comment, given that this PR is about adding on-disk caching.
The ref resolution mechanism provided by referencing is already holding results in memory.

As a sidenote: the current caching does not prevent requests, it is only for writing the file on disk? Like some webservers set cache headers that might be leveraged.

This seems inaccurate to me? Or if we are doing downloads of the schema even on a cache hit, it would mean there is a bug.
The caching usage today reads the Last-Modified header and compares against the mtime on disk.
There are lots of other headers (e.g. E-Tags) which could be used, but this was the simple 95% effective solution I started with.

Are you seeing redownloads of schemas even when there is a Last-Modified header provided? A reproducer would be key to tracking down this sort of bug.

sirosen · 2024-06-27T16:23:37Z

It hit me as soon as I posted that last comment that there is a bug and I even know when/how it got introduced.
#453 captures it, but the caching got borked when I added a safety check that the remote content is parseable -- I added it incorrectly and didn't catch it during testing.

I can't work on a fix for it right now, during my workday, but I'll be circling back on it as soon as I have time.

alex1701c · 2024-06-27T19:49:50Z

Should I still work on the discussed changes here?

sirosen · 2024-06-27T22:34:21Z

I think it's best if you wait at least until I put together a fix for #453, which I will hopefully be able to start working on in a few hours. The fix will likely refactor some of the existing codepaths, and I wouldn't want you to lose extra time on the resulting merge conflicts.

Although normally I love to share authorship as much as possible, I have a lot of relevant context and some specific ideas about the requisite testing, so I think it will be most efficient for me to make that fix myself. I'll credit you in the changelog for raising the issue, however, since thanks are definitely deserved here!

alex1701c · 2024-06-30T17:44:20Z

May I resume working on this or do you expect other conflicitng changes?

sirosen · 2024-07-01T14:50:13Z

I would advise against working on this PR.
I have a complete implementation I worked up at this point, but want to make some minor changes for non-JSON refs (e.g. YAML files being used). I'll have it in a PR shortly and plan to close this and #452 once it's merged and released.

sirosen · 2024-07-08T20:35:08Z

I'm closing this as superseded by #454, which should deliver these gains. This work was the original inspiration and it showed a lot of what had to happen, but I was unfortunately not able to structure the changeset in a way that could meaningfully be based off of this commit. I did, however, credit you in the changelog.

This will be released in v0.29.0 in just a few minutes time. Thanks again for the contribution!

alex1701c · 2024-07-09T20:07:36Z

Thank you for your efforts and credeting me!

However, even with the latest version the performance is still very slow. Like if I pass in the same list of files, the schemas get requested lots of times. The latency still adds up to around 30 seconds. I think this is due to determining if we have a cache hit, we still make the request.

sirosen · 2024-07-09T20:21:21Z

To determine if there is a cache hit, we need to open a connection and get some bytes back, to get the headers. Are you saying that the behavior you see is that there's a full download? It's possible that there's some case that I missed during testing which causes this.

It might be best to open an issue with a couple of example files to demonstrate the issue. When I tested things out after the improvements, I saw the behavior I was expecting, which showed a measurable improvement.

alex1701c · 2024-07-10T19:55:12Z

But if we have revalidated/downloaded the same file in the same process, there is no need to open a HTTP connection again. My original issue was not about the data transfered, but the amount of requests that add up to quite a slow linting.

sirosen · 2024-07-10T21:28:20Z

There is an in-memory cache which is used here to retrieve data when ref resolution requests the same URL twice. This is built once per validator, and only one validator should be built per process.

I'm very much open to the idea that there's some bug which causes this not to function correctly, but I need more information to reproduce in order to track it down. Can you please include some sample files which demonstrate the issue? I have the schema from your initial PR text, but I don't know what the content is supposed to be, so it's difficult for me to understand what you're seeing.

alex1701c · 2024-07-13T19:59:39Z

But create_retrieve_callable gets called multiple times in case we have mutliple files passed in?

The example repo I have used is https://invent.kde.org/network/kdeconnect-kde/

sirosen · 2024-07-15T22:12:40Z

But create_retrieve_callable gets called multiple times in case we have mutliple files passed in?

Are you certain of this? That should be called once per Validator instantiation, and therefore once per top-level schema, i.e. once per invocation. If it is being called for each $ref, that would be a bug which needs resolution.

sirosen · 2024-07-16T00:01:36Z

I have just gotten some time to test and confirm; we're building a new validator for each document. So, yes, confirmed bug. Now that I have the example data I can repro this and see the issue.

I'm going to file a distinct ticket as a bug report with some detail on what's happening.

sirosen mentioned this pull request Jun 27, 2024

Design a solution for caching downloads of $refs in order to improve performance in cases with many remote refs #452

Closed

sirosen mentioned this pull request Jun 27, 2024

Remote file caching improvement is destroyed by use of parse as a validation callback #453

Closed

sirosen closed this Jul 8, 2024

sirosen mentioned this pull request Jul 16, 2024

New validator instances are built for each instancefile, discarding any internal caching #463

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix performance issues when having remote schema with multiple refs #451

Fix performance issues when having remote schema with multiple refs #451

alex1701c commented Jun 26, 2024

sirosen commented Jun 27, 2024

alex1701c commented Jun 27, 2024

sirosen commented Jun 27, 2024

sirosen commented Jun 27, 2024

alex1701c commented Jun 27, 2024

sirosen commented Jun 27, 2024

alex1701c commented Jun 30, 2024

sirosen commented Jul 1, 2024

sirosen commented Jul 8, 2024

alex1701c commented Jul 9, 2024

sirosen commented Jul 9, 2024

alex1701c commented Jul 10, 2024

sirosen commented Jul 10, 2024

alex1701c commented Jul 13, 2024

sirosen commented Jul 15, 2024

sirosen commented Jul 16, 2024

Fix performance issues when having remote schema with multiple refs #451

Fix performance issues when having remote schema with multiple refs #451

Conversation

alex1701c commented Jun 26, 2024

sirosen commented Jun 27, 2024

alex1701c commented Jun 27, 2024

sirosen commented Jun 27, 2024

sirosen commented Jun 27, 2024

alex1701c commented Jun 27, 2024

sirosen commented Jun 27, 2024

alex1701c commented Jun 30, 2024

sirosen commented Jul 1, 2024

sirosen commented Jul 8, 2024

alex1701c commented Jul 9, 2024

sirosen commented Jul 9, 2024

alex1701c commented Jul 10, 2024

sirosen commented Jul 10, 2024

alex1701c commented Jul 13, 2024

sirosen commented Jul 15, 2024

sirosen commented Jul 16, 2024