Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Critical Bugfix: Stop caching nested sub-schemas. #111

Merged
merged 2 commits into from
May 20, 2020

Conversation

theturtle32
Copy link
Contributor

At Dollar Shave Club, for several years we've been using our own fork of avro_turf with a much simpler version of the patch in this pull request, as none of our applications can function at all using the official release. I felt it was high time to finally contribute a high quality, tested version of our patch upstream.

All our .avsc files embed the sub-schemas for every nested field all the way down the hierarchy, so that the one .avsc file is all that is needed to encode an entire, complex, nested data structure consisting of many different types into Avro. Many of our top-level schemas end up embedding their own copies of the same sub-types. For example, a Person and a Company might both embed the Address schema. We generate these nested .avsc files using Salsify's delightful avro-builder gem.

With every official version of avro_turf up to and including 0.11.0, trying to load more than one of these deeply-nested .avsc files will result in the Apache Avro gem raising an exception:

#<Avro::SchemaParseError: The name "foo.bar" is already in use.>

The reason for this is that AvroTurf::StateStore passes its own internal @schemas cache to Avro::Schema.real_parse for the Avro gem to directly use as the internal names cache as it parses the schema definition tree. The Avro gem mutates this cache in place as it parses, adding to it every named type that it encounters along the way. (Incidentally, this also means that it can pollute the @schemas object with partial/invalid data in the event that it ultimately ends up raising a parsing exception, because there is no mechanism for restoring @schemas to its original state after a failed call to Avro::Schema.real_parse. The patch in this pull request partially mitigates the effects of that behavior as well.)

The solution is to create a new, empty context to pass as the names parameter to Avro::Schema.real_parse, and only add the one resulting top-level schema it returns to the @schemas hash, because only schemas that have their own .avsc file on disk should be resolvable by AvroTurf::SchemaStore.find.

Our simple patched version wasn't compatible with having two different .avsc files containing circular references to each other, so I fixed that up and made sure the rest of the test suite still passed. I added lots of descriptive comments along the way.

I also added a couple new test cases:

  1. Verify that multiple .avsc files may define their own totally independent sub-schemas having the same fully-qualified name. It's illegal for schemas to define a type using the same fully-qualified name more than once, but it is absolutely acceptable for two independent schemas to each define the same or different types using the same fully-qualified name.

  2. Verify that sub-schemas do not get cached in the top level @schemas cache inside the schema store. The only schemas that should be resolvable via a call to AvroTurf::SchemaStore#find are those that have their own .avsc file on disk.

It is important to note that this does change the behavior of avro_turf's schema store API:

If an application using avro_turf expects to pre-cache all its avsc files and then be able to resolve all the nested sub-schemas directly by calling store.find('subschema', 'some.namespace'), those applications will break with this patch. I have no idea if anyone relies upon this incorrect behavior, and there are no existing tests in the rspec suite that expect sub-schemas to be resolvable after caching the top-level schema that embedded them. So, incorporating this change may warrant adding a warning to the README.

@dasch
Copy link
Owner

dasch commented May 4, 2020

Can you add a note to the README as you suggest? I think this patch is fine 👍 thanks!

@dasch
Copy link
Owner

dasch commented May 11, 2020

@theturtle32 just wanted to bump this – I'm ready to merge as soon as there's a README entry! :D

@theturtle32
Copy link
Contributor Author

@dasch - Sorry for the delay. I'm not quite sure how to word it in the README, or where to put it.

I was thinking a message like "as of version XYZ, this behavior changes" but I'm not sure what release you'd plan to incorporate it into.

@theturtle32 theturtle32 force-pushed the dont-cache-nested-subschemas branch 2 times, most recently from bc9f067 to 60d5f75 Compare May 19, 2020 22:59
@theturtle32
Copy link
Contributor Author

theturtle32 commented May 19, 2020

@dasch - Ok, added a speculative README.md update in 1dad2f0.

Let me know what you think.

@theturtle32 theturtle32 force-pushed the dont-cache-nested-subschemas branch from 60d5f75 to f406efb Compare May 19, 2020 23:03
@theturtle32 theturtle32 force-pushed the dont-cache-nested-subschemas branch from f406efb to 1dad2f0 Compare May 19, 2020 23:05
@dasch
Copy link
Owner

dasch commented May 20, 2020

Awesome, thanks!

@dasch dasch merged commit c3fb2ef into dasch:master May 20, 2020
@dasch
Copy link
Owner

dasch commented May 20, 2020

Released in v1.0.0.

@dorner
Copy link
Contributor

dorner commented May 21, 2020

Awesome! I've been unable to cache more than one schema at a time because of this - schemas A and B both define a subtype named C and AvroTurf would crash. This should allow me to cache or prefetch all schemas and not have to worry about this. 👍

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants