Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

New validator instances are built for each instancefile, discarding any internal caching #463

Closed
sirosen opened this issue Jul 16, 2024 · 5 comments · Fixed by #466
Closed
Labels
bug Something isn't working

Comments

@sirosen
Copy link
Member

sirosen commented Jul 16, 2024

First reported via #451

For each file, we get a validator from the "schema loader" objects. This interface is correct for the top-level checker. It allows, for example, metaschema checking to proceed with different schemas for different files.

However, this also is being applied to the remote and local class -- SchemaLoader.
As a result, a new validator is built for each file, and any internal caching done under the validator is discarded.

To resolve, one of two approaches should be used:

  • create exactly one validator per remote schema (or (schema, settings) where there might not be any current settings)
  • create a new validator per call, but reuse the components from any matching schemas
@sirosen sirosen added the bug Something isn't working label Jul 16, 2024
@sirosen
Copy link
Member Author

sirosen commented Jul 16, 2024

There's a secondary bug here, which is that the retrieve callable's in-memory caching is not working correctly because it's mixing absolute and relative URIs.

I have a potential fix for this pair of issues which I need to think about for a little before I proceed:

$ git diff HEAD
diff --git a/src/check_jsonschema/schema_loader/main.py b/src/check_jsonschema/schema_loader/main.py
index 4ce95c9..88ff6a0 100644
--- a/src/check_jsonschema/schema_loader/main.py
+++ b/src/check_jsonschema/schema_loader/main.py
@@ -1,5 +1,6 @@
 from __future__ import annotations
 
+import functools
 import pathlib
 import typing as t
 import urllib.error
@@ -130,11 +131,21 @@ class SchemaLoader(SchemaLoaderBase):
         instance_doc: dict[str, t.Any],
         format_opts: FormatOptions,
         fill_defaults: bool,
+    ) -> jsonschema.protocols.Validator:
+        return self._get_validator(format_opts, fill_defaults)
+
+    @functools.lru_cache
+    def _get_validator(
+        self,
+        format_opts: FormatOptions,
+        fill_defaults: bool,
     ) -> jsonschema.protocols.Validator:
         retrieval_uri = self.get_schema_retrieval_uri()
         schema = self.get_schema()
 
         schema_dialect = schema.get("$schema")
+        if schema_dialect is not None and not isinstance(schema_dialect, str):
+            schema_dialect = None
 
         # format checker (which may be None)
         format_checker = make_format_checker(format_opts, schema_dialect)
diff --git a/src/check_jsonschema/schema_loader/resolver.py b/src/check_jsonschema/schema_loader/resolver.py
index c63b7bb..5084328 100644
--- a/src/check_jsonschema/schema_loader/resolver.py
+++ b/src/check_jsonschema/schema_loader/resolver.py
@@ -79,8 +79,8 @@ def create_retrieve_callable(
         else:
             full_uri = uri
 
-        if full_uri in cache._cache:
-            return cache[uri]
+        if full_uri in cache:
+            return cache[full_uri]
 
         full_uri_scheme = urllib.parse.urlsplit(full_uri).scheme
         if full_uri_scheme in ("http", "https"):
@@ -100,8 +100,8 @@ def create_retrieve_callable(
         else:
             parsed_object = get_local_file(full_uri)
 
-        cache[uri] = parsed_object
-        return cache[uri]
+        cache[full_uri] = parsed_object
+        return cache[full_uri]
 
     return retrieve_reference

This makes validator creation cached internally for the main SchemaLoader, and fixes the remote resource caching bug. Between these two, we get much improved performance in the "many files with many refs" case.

That said, I need to at the very least devise some tests for these to ensure there are no regressions in the future.

@alex1701c
Copy link

It would be amazing to get this fixed - then I can integrate it efficiently in the CI setup I am planning to :)

@sirosen
Copy link
Member Author

sirosen commented Jul 27, 2024

I wasn't able to work on this last week, but I've just put the above into a PR with a test to run against. I should be able to get this all merged and released soon, assuming that there are no surprises when CI runs.

@sirosen
Copy link
Member Author

sirosen commented Jul 27, 2024

v0.29.1 has the above fix and is freshly available.

I tested it against the usage described in #451 and saw execution times of around 0.8s on average. Please let me know (comment or fresh issue) if the new version is still performing in an unexpectedly bad way!

@alex1701c
Copy link

I can confirm the execution time very fast now - thanks!

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants