Critical Bugfix: Stop caching nested sub-schemas. #111
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
At Dollar Shave Club, for several years we've been using our own fork of avro_turf with a much simpler version of the patch in this pull request, as none of our applications can function at all using the official release. I felt it was high time to finally contribute a high quality, tested version of our patch upstream.
All our .avsc files embed the sub-schemas for every nested field all the way down the hierarchy, so that the one .avsc file is all that is needed to encode an entire, complex, nested data structure consisting of many different types into Avro. Many of our top-level schemas end up embedding their own copies of the same sub-types. For example, a
Person
and aCompany
might both embed theAddress
schema. We generate these nested .avsc files using Salsify's delightful avro-builder gem.With every official version of avro_turf up to and including 0.11.0, trying to load more than one of these deeply-nested .avsc files will result in the Apache Avro gem raising an exception:
The reason for this is that AvroTurf::StateStore passes its own internal
@schemas
cache toAvro::Schema.real_parse
for the Avro gem to directly use as the internalnames
cache as it parses the schema definition tree. The Avro gem mutates this cache in place as it parses, adding to it every named type that it encounters along the way. (Incidentally, this also means that it can pollute the@schemas
object with partial/invalid data in the event that it ultimately ends up raising a parsing exception, because there is no mechanism for restoring@schemas
to its original state after a failed call toAvro::Schema.real_parse
. The patch in this pull request partially mitigates the effects of that behavior as well.)The solution is to create a new, empty context to pass as the
names
parameter toAvro::Schema.real_parse
, and only add the one resulting top-level schema it returns to the@schemas
hash, because only schemas that have their own .avsc file on disk should be resolvable byAvroTurf::SchemaStore.find
.Our simple patched version wasn't compatible with having two different .avsc files containing circular references to each other, so I fixed that up and made sure the rest of the test suite still passed. I added lots of descriptive comments along the way.
I also added a couple new test cases:
Verify that multiple .avsc files may define their own totally independent sub-schemas having the same fully-qualified name. It's illegal for schemas to define a type using the same fully-qualified name more than once, but it is absolutely acceptable for two independent schemas to each define the same or different types using the same fully-qualified name.
Verify that sub-schemas do not get cached in the top level
@schemas
cache inside the schema store. The only schemas that should be resolvable via a call toAvroTurf::SchemaStore#find
are those that have their own .avsc file on disk.It is important to note that this does change the behavior of avro_turf's schema store API:
If an application using avro_turf expects to pre-cache all its avsc files and then be able to resolve all the nested sub-schemas directly by calling
store.find('subschema', 'some.namespace')
, those applications will break with this patch. I have no idea if anyone relies upon this incorrect behavior, and there are no existing tests in the rspec suite that expect sub-schemas to be resolvable after caching the top-level schema that embedded them. So, incorporating this change may warrant adding a warning to the README.