Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Fix performance issues when having remote schema with multiple refs #451

Closed
wants to merge 1 commit into from

Conversation

alex1701c
Copy link

We would request the refs without any caching on disk and also do the network request for each file that is loaded.
Instead, also cache refs of the given schema.

To avoid more HTTP overhead, cached schemas are not revalidated when they were loaded/revalidated in the same process.

Before:

time check-jsonschema --verbose --schemafile https://autoconfig.kde.org/jsonschemas/_combined.schema.json plugins/**.json
ok -- validation done
The following files were checked:
  ....
  plugins/virtualmonitor/kdeconnect_virtualmonitor.json

________________________________________________________
Executed in   31.14 secs    fish           external
   usr time   19.73 secs    5.08 millis   19.72 secs
   sys time    0.28 secs    0.01 millis    0.28 secs

After:

time python3 ~/projects/check-jsonschema/src/check_jsonschema/__main__.py --verbose --schemafile https://autoconfig.kde.org/jsonschemas/_combined.schema.json plugins
/**.json
ok -- validation done
The following files were checked:
  ...
  plugins/virtualmonitor/kdeconnect_virtualmonitor.json

________________________________________________________
Executed in    2.23 secs    fish           external
   usr time    1.43 secs    4.72 millis    1.43 secs
   sys time    0.08 secs    0.09 millis    0.08 secs

We would request the refs without any caching on disk and also do the
network request for each file that is loaded.
Instead, also cache refs of the given schema.

To avoid more HTTP overhead, cached schemas are not revalidated when
they were loaded/revalidated in the same process.

Before:
```
time check-jsonschema --verbose --schemafile https://autoconfig.kde.org/jsonschemas/_combined.schema.json plugins/**.json
ok -- validation done
The following files were checked:
  ....
  plugins/virtualmonitor/kdeconnect_virtualmonitor.json

________________________________________________________
Executed in   31.14 secs    fish           external
   usr time   19.73 secs    5.08 millis   19.72 secs
   sys time    0.28 secs    0.01 millis    0.28 secs
```

After:
```
time python3 ~/projects/check-jsonschema/src/check_jsonschema/__main__.py --verbose --schemafile https://autoconfig.kde.org/jsonschemas/_combined.schema.json plugins
/**.json
ok -- validation done
The following files were checked:
  ...
  plugins/virtualmonitor/kdeconnect_virtualmonitor.json

________________________________________________________
Executed in    2.23 secs    fish           external
   usr time    1.43 secs    4.72 millis    1.43 secs
   sys time    0.08 secs    0.09 millis    0.08 secs

```
@sirosen
Copy link
Member

sirosen commented Jun 27, 2024

Thanks for raising this issue and coming to the door with working code in a PR!

In this particular case, I don't think I can accept this as-is and wanted to step back and think about what my requirements are. I've written this feature up in an issue (#452). There are three mechanical issues I see with this as-is:

  • --no-cache would not be respected for refs
  • multiple refs with the same filename would conflict (--cache-filename works around this issue for the existing caching)
  • we need to add tests for this feature, including testing for scenarios like two refs with the same filename but different URIs

If you're interested in continuing to work on this, I'd really appreciate it if you took a look at that issue.
Alternatively, I'm happy to pick this up and run with it, making changes to meet the requirements and adding tests. Cutting 30s runtimes to 2s is a laudable goal; we just need to make sure we don't create problematic scenarios as we do it.

@alex1701c
Copy link
Author

--no-cache would not be respected for refs

    def open(self) -> t.Iterator[t.IO[bytes]]:
        if (not self._cache_dir) or self._disable_cache:
            yield io.BytesIO(self._get_request().content)

We have a separate codepath for that.

multiple refs with the same filename would conflict (--cache-filename works around this issue for the existing caching)

Hmm, I see. But that does not quite work when one wants to cache multiple files. Like one might need to somehow use the complete URL to avoid those conflicts

I also wondered if one might just want to use some in-memory cache. Like there should not be a need to re-read the file from disk.

we need to add tests for this feature, including testing for scenarios like two refs with the same filename but different URIs

Yep

As a sidenote: the current caching does not prevent requests, it is only for writing the file on disk? Like some webservers set cache headers that might be leveraged.

@sirosen
Copy link
Member

sirosen commented Jun 27, 2024

The CacheDownloader used inside of the ref resolver does not have disable_cache=... set on init. As a result, it uses the class default. i.e. self._disable_cache is always False in the new usage site, even if --no-cache was passed.

But that does not quite work when one wants to cache multiple files. Like one might need to somehow use the complete URL to avoid those conflicts

The issue in which I outlined a solution discusses this. By taking the full URI and using it to build a filename, I think we can solve this problem.

I also wondered if one might just want to use some in-memory cache. Like there should not be a need to re-read the file from disk.

I'm slightly confused by this comment, given that this PR is about adding on-disk caching.
The ref resolution mechanism provided by referencing is already holding results in memory.

As a sidenote: the current caching does not prevent requests, it is only for writing the file on disk? Like some webservers set cache headers that might be leveraged.

This seems inaccurate to me? Or if we are doing downloads of the schema even on a cache hit, it would mean there is a bug.
The caching usage today reads the Last-Modified header and compares against the mtime on disk.
There are lots of other headers (e.g. E-Tags) which could be used, but this was the simple 95% effective solution I started with.

Are you seeing redownloads of schemas even when there is a Last-Modified header provided? A reproducer would be key to tracking down this sort of bug.

@sirosen
Copy link
Member

sirosen commented Jun 27, 2024

It hit me as soon as I posted that last comment that there is a bug and I even know when/how it got introduced.
#453 captures it, but the caching got borked when I added a safety check that the remote content is parseable -- I added it incorrectly and didn't catch it during testing.

I can't work on a fix for it right now, during my workday, but I'll be circling back on it as soon as I have time.

@alex1701c
Copy link
Author

Should I still work on the discussed changes here?

@sirosen
Copy link
Member

sirosen commented Jun 27, 2024

I think it's best if you wait at least until I put together a fix for #453, which I will hopefully be able to start working on in a few hours. The fix will likely refactor some of the existing codepaths, and I wouldn't want you to lose extra time on the resulting merge conflicts.

Although normally I love to share authorship as much as possible, I have a lot of relevant context and some specific ideas about the requisite testing, so I think it will be most efficient for me to make that fix myself. I'll credit you in the changelog for raising the issue, however, since thanks are definitely deserved here!

@alex1701c
Copy link
Author

May I resume working on this or do you expect other conflicitng changes?

@sirosen
Copy link
Member

sirosen commented Jul 1, 2024

I would advise against working on this PR.
I have a complete implementation I worked up at this point, but want to make some minor changes for non-JSON refs (e.g. YAML files being used). I'll have it in a PR shortly and plan to close this and #452 once it's merged and released.

@sirosen
Copy link
Member

sirosen commented Jul 8, 2024

I'm closing this as superseded by #454, which should deliver these gains. This work was the original inspiration and it showed a lot of what had to happen, but I was unfortunately not able to structure the changeset in a way that could meaningfully be based off of this commit. I did, however, credit you in the changelog.

This will be released in v0.29.0 in just a few minutes time. Thanks again for the contribution!

@sirosen sirosen closed this Jul 8, 2024
@alex1701c
Copy link
Author

Thank you for your efforts and credeting me!

However, even with the latest version the performance is still very slow. Like if I pass in the same list of files, the schemas get requested lots of times. The latency still adds up to around 30 seconds. I think this is due to determining if we have a cache hit, we still make the request.

@sirosen
Copy link
Member

sirosen commented Jul 9, 2024

To determine if there is a cache hit, we need to open a connection and get some bytes back, to get the headers. Are you saying that the behavior you see is that there's a full download? It's possible that there's some case that I missed during testing which causes this.

It might be best to open an issue with a couple of example files to demonstrate the issue. When I tested things out after the improvements, I saw the behavior I was expecting, which showed a measurable improvement.

@alex1701c
Copy link
Author

But if we have revalidated/downloaded the same file in the same process, there is no need to open a HTTP connection again. My original issue was not about the data transfered, but the amount of requests that add up to quite a slow linting.

@sirosen
Copy link
Member

sirosen commented Jul 10, 2024

There is an in-memory cache which is used here to retrieve data when ref resolution requests the same URL twice. This is built once per validator, and only one validator should be built per process.

I'm very much open to the idea that there's some bug which causes this not to function correctly, but I need more information to reproduce in order to track it down. Can you please include some sample files which demonstrate the issue? I have the schema from your initial PR text, but I don't know what the content is supposed to be, so it's difficult for me to understand what you're seeing.

@alex1701c
Copy link
Author

But create_retrieve_callable gets called multiple times in case we have mutliple files passed in?

The example repo I have used is https://invent.kde.org/network/kdeconnect-kde/

@sirosen
Copy link
Member

sirosen commented Jul 15, 2024

But create_retrieve_callable gets called multiple times in case we have mutliple files passed in?

Are you certain of this? That should be called once per Validator instantiation, and therefore once per top-level schema, i.e. once per invocation. If it is being called for each $ref, that would be a bug which needs resolution.

@sirosen
Copy link
Member

sirosen commented Jul 16, 2024

I have just gotten some time to test and confirm; we're building a new validator for each document. So, yes, confirmed bug. Now that I have the example data I can repro this and see the issue.

I'm going to file a distinct ticket as a bug report with some detail on what's happening.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants