proto3 and unknown fields #272

Closed
joshuarubin opened this issue Apr 7, 2015 · 69 comments

@joshuarubin

I know that unknown fields have been removed from proto3, but I am trying to get an explanation about why this change was made and if there is any way to replicate that behavior in proto3.

Thanks so much.

referred from golang/protobuf#25

@dhendry

dhendry commented May 6, 2015

I too am wondering about this. I am looking into migrating what is essentially a messaging system to gRPC (where proto3 seems to be recommended). In my case, clients send messages (text plus rendering information) to each other via a server where the server needs to understand the text and certain parts of the rendering info. I want to allow client developers to experiment with new features (pre-release) without having to deploy server code for every change.

Essentially, it's a case where I want a shared proto definition between the client(s) and server, but don't want to require the server's proto definition to be the latest in order to process requests.

@solicomo

I'd like to hear about the explanation, too.

The behavior of proto2 makes sense to me.

@jeremyong

I have a lot of concerns about silently deleting data upon deserialization, to the point that even though we have internally been using proto3 for several months, I am considering changing things back to proto2. This change would be a lot easier to stomach if there was a message option to allow serialization and deserialization of unknown fields instead of discarding them.

@jeremyong

Being unable to add unknown fields that persist is also unacceptable for us. Reading the code, it's pretty clear the decision to omit unknown fields happens at compile time rather than at runtime (based on the generated code), so it seems proto3 is a no-go. Personally, I very much liked most of the changes to the new version except this one. Changing the default behavior alone might have been ok, especially given that the new behavior is well-documented, but doing so without a way to restore old behavior seems like a misstep. Supporting a plugin that reverts that behavior seems too expensive relative to the cost of just using proto2 with restrictions (optional only, etc).

@dhendry

dhendry commented Apr 19, 2016

Still no answers to this? This is a fundamental issue that is seriously hindering our adoption of protobuf in many areas.

@jeremyong

+1 proto2 is a permanent fixture for us. Changing default behavior is one thing but changing it in a way that doesn't let the user even control it is a strict loss in my opinion. What I foresee moving forward is a huge fragmentation in the client ecosystem. Maintaining support for both proto2 and proto3 semantics is too much to chew for most developers, and I'm already seeing some client libraries do this awkward dance where they have some proto2 properties and some proto3 properties. The easiest example of this causing a problem in history is the move from Python2 to Python3. One possible solution might be a file level option that informs the protobuf compiler not to strip unknown fields.

@liujisi
Contributor

liujisi commented Apr 20, 2016

The proto3 spec doesn't forbid preserving unknown fields. Instead, it allows each implementation to choose whether to preserve them. The current C++ and Java implementations chose to drop them, though. We are currently looking into the issue and will keep this thread posted.

@jeremyong

Thanks @pherl for providing the update. FWIW, I think it is worth considering how the behavior might be standardized, for the same reason people argue against undefined behavior in C or C++. Undefined behavior, where it exists, should really be the result of a lack of foresight; for something like this, we might as well come up with an actual solution since we're already aware of the problem.

@joshuarubin
Author

Thanks for keeping this issue alive. I'd just like to add that we are interested in support for Go, but that might need to be addressed in golang/protobuf.

@jeremyong

@pherl Any progress on this front?

@gfecher

gfecher commented Jun 12, 2016

+1 for preserving unknown fields.

I accept that you can not trivially maintain compatibility with the JSON format (at least as long as you want to marshal fields with their names), but I think a lot of shops would be happy to pay this price for not having to release their low-level infrastructure in lock step with their newest clients.

In fact, Kenton seems to wonder himself (https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html): "Apparently, version 3 of Protocol Buffers, aka “proto3”, removes this feature. I honestly don’t know what they’re thinking. This feature has been absolutely essential in many of Google’s internal systems."

In my opinion the right approach would be to make this an option of the proto compiler on compiling the proto: this way everybody can decide for themselves whether the benefits outweigh the downsides.

For now I have overridden the PreserveUnknownFields function in both cpp_helpers.h and java_helpers.h in the compiler code to always return true and this seems to work, but I would appreciate it if someone from google could confirm.

@xfxyjwf
Contributor

xfxyjwf commented Jun 12, 2016

Some updates: we tried to gather data to prove "unknown fields are essential for Google systems", but the result is not so convincing (the experiment is done in a Google sub-system, not the whole of Google).

For those of you who are interested in adding back unknown fields in proto3, could you describe your use case in more detail and explain why unknown fields are required (e.g., can the same use case be supported using some other proto3 features)? We need to prove unknown fields are needed in some common use cases in order to add them back.

@jeremyong

Here is a use case I developed internally that makes heavy usage of unknown
fields:

In addition to the message itself, we often annotate the message before
sending it over the wire with metadata indicating if a field was deleted or
not, if it was set to a default field, etc. Internally, we use a diff-ing
scheme to create a protobuf message "diff" which handles maps, fields, and
messages (recursively applied). The application of the diff itself is
associative, so many diffs can accumulate into one, and this makes for a
fairly elegant scheme for updating state for a particular message across
many clients that may or may not be online.

Generalizing this use case, any protobuf message that is derived from the
reflection API must necessarily leverage the unknown field set, since by
definition, we cannot know the shape of the message a priori. Think of this
as a "higher order message" whereas messages that are schema defined are
first order messages.

@gfecher

gfecher commented Jun 13, 2016

Hi,

We have a use case with a mixture of data validation/data transformation and storage.
Our infrastructure component understands certain bits of the schema that it validates/changes, but it is oblivious to the rest of the payload. It does store it, however, and clients running on the new schema expect the newly introduced fields to be returned intact.

In general, any component that needs only a partial understanding of the message could benefit from preserving unknown fields, especially where the bits the component does care about do not change often, but the rest of the schema does. I can think of routing, storage, certain types of data transformation, etc.

I would be interested in knowing how you managed to solve these use cases (which I'm sure you have internally at google) without preserving unknown fields.

@InfinitiesLoop

InfinitiesLoop commented Jun 18, 2016

We need unknown fields, because it's one of the ways we know on the server-side that our proto definition is out of date, and needs to be re-synchronized. Without unknown fields, we would have to resort to polling or some other less authoritative way of detecting when the client has added fields.

Also while I understand trying to reduce feature surface area, unknown fields don't exactly cause a problem, do they? Dropping them has more negatives than positives, please add them back to proto3.

@JesseChisholm

JesseChisholm commented Jun 18, 2016

If the proto3 way was to set some option, like option (ProtoOptions).preserveUnknownFields = True; that would allow those of us who need it to keep it, and those of you who don't need it to do without it.

Best of both worlds. :)

@dhendry

dhendry commented Jun 28, 2016

I would absolutely want the ability to preserve or strip unknown fields at runtime. Some levels of our system get deployed regularly, are kept up to date, and should be validating the well-known schema (and stripping unknown fields). Other internal layers get deployed far less frequently and are not directly exposed to clients or potentially malicious actors; there, preserving unknown fields is highly desirable so we don't have to do full and extensive deploys for every little change.

@rohitsaboo

Hey guys,

We would love to have this feature, too :) During my relatively long time at Google, I was aware of many services that relied on this behavior from proto2.

Essentially, think of any set of three or more services where A talks to C via B, and we don't want to redeploy B when a proto that is being passed between A and C gets a new field added to it. (I also posted this as a question on stackoverflow.)

Would be great to have an update for supporting this feature and/or an alternative mechanism that you believe can solve this problem for us.

Thanks,
Rohit

@jeremyong

Still no word on what the original justification was too.

@Kaiserchen

Kaiserchen commented Aug 24, 2016

The use-case we have is the following:

We use stream processors, namely kafka-streams, that rearrange protobuf messages. For example, we have two streams of protobuf messages that we join with each other. The join just outputs a joined message that has the two others as fields. Sometimes we also aggregate streams into lists of messages from previous streams. The stream processors only know about the fields relevant to them (join fields, group-by fields, ...); all the other fields are carried along as unknown fields.

This allows the stream processor to continue working even when upstream schema changes happen, we do not need to redeploy our stream processing application, and the new fields end up in the output for free.

To add some drama: I think losing the unknown fields will force us to move to Avro.

@matthewrj

This is a bit of a deal breaker for us too. We have the same use case where A sends data to B which reads some fields and forwards the message to C. We don't want to have to constantly update B when the schema changes even though it doesn't read any of the new fields. The current behaviour is quite dangerous since C can't tell if one of the new fields was set to the default value or if B is just out of date and lost data.

@InfinitiesLoop

Would really appreciate an update on the feedback here. Whether Proto3 is going to ever support unknown fields can impact decisions being made even for folks still on Proto2, because if it isn't, we may need to invent other ways of solving our problems in order to avoid rearchitecting things when/if we move to proto3.

@chmod007

chmod007 commented Oct 5, 2016

I have two use cases, both of which have sub-optimal workarounds:

  1. Include a signature in the same protobuf as the payload to be signed. To verify the signature, I deserialize, extract and remove the signature, reserialize and verify the signature. This breaks if the signed message contains any new fields unknown to the process verifying the signature. The workaround is to serialize in two levels, with the inner (signed) message serialized as bytes in the outer message.

  2. A server is the ultimate source of small update packets that are then routed peer-to-peer. Unserializing and reserializing before passing the message on to other peers strips out unknown fields. The workaround is for peers to share the original bytes instead of deserializing and reserializing.
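
The two-level workaround from the first use case can be sketched as follows. This is a hand-rolled illustration of the wire format, not the official protobuf library: the envelope layout (field 1 for the payload bytes, field 2 for the signature), the `sign` and `verify` helpers, and the demo key are all assumptions made for the example, and HMAC-SHA256 stands in for whatever signature scheme is actually used.

```python
import hashlib
import hmac


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Return (value, next position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_bytes_field(number, payload):
    """Encode a length-delimited field (wire type 2)."""
    return encode_varint((number << 3) | 2) + encode_varint(len(payload)) + payload


def decode_bytes_fields(data):
    """Return {field number: bytes} for a message made only of wire type 2 fields."""
    fields, pos = {}, 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        if key & 7 != 2:
            raise ValueError("envelope fields must be length-delimited")
        length, pos = decode_varint(data, pos)
        fields[key >> 3] = data[pos:pos + length]
        pos += length
    return fields


KEY = b"demo-key"  # hypothetical shared secret, for illustration only


def sign(inner_bytes):
    """Wrap already-serialized inner bytes in an envelope: 1 = payload, 2 = signature."""
    signature = hmac.new(KEY, inner_bytes, hashlib.sha256).digest()
    return encode_bytes_field(1, inner_bytes) + encode_bytes_field(2, signature)


def verify(envelope):
    """Check the signature over the payload bytes exactly as received."""
    fields = decode_bytes_fields(envelope)
    expected = hmac.new(KEY, fields[1], hashlib.sha256).digest()
    if not hmac.compare_digest(expected, fields[2]):
        raise ValueError("bad signature")
    return fields[1]


# The inner message may contain fields the verifier has never heard of.
inner = b"\x08\x96\x01\x10\x07"
assert verify(sign(inner)) == inner
```

Because the verifier never reserializes the inner message, fields it does not know about cannot change the bytes that were signed.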

@acozzette
Member

One thing to keep in mind is that proto2 is not going away. We are still actively improving it and plan to keep doing so indefinitely, so proto2 is still a good choice if you have a use case that depends on unknown fields. The one main drawback is that a few languages (such as C# and Ruby) are currently proto3-only, but if you're not using those languages then that's not a problem.

@chmod007 , have you thought about using proto2 for your two use cases? Is that possible or do your schemas have to be proto3 for another reason?

@Xorlev

Xorlev commented Nov 18, 2016

I'll add a few usecases.

  • We have a gRPC service proxying RPC traffic. It would be awfully nice to not have a hard requirement to deploy the proxy first upon schema changes in any of the services it proxies.
  • We also maintain stream processing services which are processing protos from other parts of the organization. If they add a field, I'd prefer that field doesn't disappear unexpectedly just by flowing through our stream processor. There's some pretty awful documentation / tooling / coupling implications of needing to redeploy stream jobs any time upstream producers evolve their schema. Depending on any cycles in data flows, there may be no topological order that produces valid schema updates without doing a 2-step deploy: 1) upgrade proto schema, redeploy all the (many) things that might rely on it 2) update producer to fill in field, deploy producer. Pray all the systems were updated.

re: proto2 vs. proto3, it's kind of annoying to mix and match. It's pretty counterintuitive to only use proto2 to maintain unknown fields, but have proto3 definitions for gRPC servers. I agree with most of the design choices in proto3 (e.g. removing optional/required fields, map types), but not this.

I'd actually been unaware proto3 removed unknown field support until I expected it to maintain an unknown field and it didn't (and came to report it as an issue). I'd touted unknown field support as a huge selling point for protobufs when we'd first implemented them.

The protobuf website originally recommended that new projects use proto3, which is why we'd adopted it, but this is a pretty huge issue for us. We'll likely be forking the compiler similarly to @gfecher as the proto3 ship has long since sailed and this behavior is very important to helping us produce robust infrastructure.

@stevvooe

@pherl @xfxyjwf Do you have suggestions for how to work around this with proto3? If this was removed, what techniques were used to avoid requiring this pattern within Google?

As far as I see it, this was the chief benefit of protobuf:

+----------+                        +----------+
|          |   +----------------+   |          |
|          |   |                |   |          |
| Producer +--->  Intermediate  +---> Consumer |
|          |   |                |   |          |
|          |   +----------------+   |          |
+----------+                        +----------+

Producer and Consumer could be updated with new fields, while intermediate can remain on the same version. If intermediate is a proxy of sorts, then this is important.
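
To make the diagram concrete, here is a minimal sketch of what happens to a field the intermediate does not know about. It is a hand-rolled model of the wire format that handles only varint fields (wire type 0), not the official protobuf library; the `parse`, `serialize`, and `relay` helpers and the `keep_unknown` flag are assumptions made for the example.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Return (value, next position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(number, value):
    """Encode one varint-typed field (wire type 0)."""
    return encode_varint(number << 3) + encode_varint(value)


def parse(data, known, keep_unknown):
    """Split varint fields into known values and the raw bytes of unknown ones."""
    fields, unknown, pos = {}, b"", 0
    while pos < len(data):
        start = pos
        key, pos = decode_varint(data, pos)
        value, pos = decode_varint(data, pos)
        number = key >> 3
        if number in known:
            fields[number] = value
        elif keep_unknown:
            unknown += data[start:pos]  # carried along untouched
    return fields, unknown


def serialize(fields, unknown):
    """Write known fields in field-number order, then any preserved unknown bytes."""
    return b"".join(encode_field(n, v) for n, v in sorted(fields.items())) + unknown


def relay(data, keep_unknown):
    """The intermediate: it only knows field 1 and re-serializes what it parsed."""
    fields, unknown = parse(data, known={1}, keep_unknown=keep_unknown)
    return serialize(fields, unknown)


# The producer has been upgraded and now also sets field 2.
produced = encode_field(1, 150) + encode_field(2, 7)

assert relay(produced, keep_unknown=True) == produced             # consumer still sees field 2
assert relay(produced, keep_unknown=False) == encode_field(1, 150)  # field 2 is silently lost
```

With `keep_unknown=False` the consumer cannot tell whether field 2 was never set or was stripped on the way, which is the ambiguity described earlier in the thread.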

@jeremyong

@stevvooe We've been continuing to use proto2 for the intermediate proxy type thing since they are binary compatible. Throughout our codebase, we've been propagating proto2 everywhere since it's really annoying to maintain two different semantics for the proto definitions themselves but if you wanted, producer and consumer could use proto3.

I do have some plans eventually to do a separate C++ compiler entirely that consumes proto3 syntax but retains the API of the unknown fields unless someone else gets to it first. I want to do other changes like using more STL containers (vectors and maps) as the backing in-memory storage and fix the oddities with the arenas we've been seeing.

@liujisi
Contributor

liujisi commented Aug 14, 2017

@jeremyherbert This is the same issue: proto3 currently doesn't preserve unknown fields. It should be addressed in the next couple of releases.

Note that adding new fields into oneofs is risky. Even with unknown fields preserved, the new field will not be visible in the oneof of the old binary. Instead of seeing an unrecognized type, the old message will treat the oneof as not set. You would have to dig into the unknown fields to distinguish between an unset oneof vs an unrecognized oneof.
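
The oneof caveat can be illustrated with a small sketch. It is a hand-rolled model of the wire format that handles only varint fields, not the official protobuf library; the `read_oneof` helper and the field numbers are assumptions made for the example.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Return (value, next position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(number, value):
    """Encode one varint-typed field (wire type 0)."""
    return encode_varint(number << 3) + encode_varint(value)


def read_oneof(data, oneof_members):
    """Return (case, value, unknown field numbers) as seen by one schema version.

    oneof_members is the set of field numbers this binary knows to be in the oneof.
    """
    case = value = None
    unknown_numbers = []
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        field_value, pos = decode_varint(data, pos)
        number = key >> 3
        if number in oneof_members:
            case, value = number, field_value  # the last member on the wire wins
        else:
            unknown_numbers.append(number)
    return case, value, unknown_numbers


# A new sender sets oneof member 3, which the old binary does not know.
message = encode_field(3, 42)

old_view = read_oneof(message, oneof_members={1, 2})
new_view = read_oneof(message, oneof_members={1, 2, 3})

assert old_view == (None, None, [3])  # oneof looks unset; field 3 only shows up as unknown
assert new_view == (3, 42, [])
```

The old binary reports the oneof as unset, and only the list of unknown field numbers reveals that something was actually sent.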

@vozbu

vozbu commented Sep 7, 2017

@pherl, the pattern "save unknown fields and then discard them" seems excessive to me. Isn't it better to just pass a flag to the parsing function telling it whether to save unknown fields while parsing? That would save memory and CPU when you don't need these fields, while retaining all the desired benefits. In our workflows, most of the fields in a message are sometimes unknown, and I'm afraid that parsing them will degrade our performance.

Actually, I would like to have such flag in proto2 too.

@liujisi
Contributor

liujisi commented Sep 7, 2017

@vozbu what language are you using? We do have an API to skip unknown fields in Java. Other languages chose to provide a discard-unknown-fields API that runs after parsing has finished, mostly to reduce implementation complexity.

@vozbu

vozbu commented Sep 8, 2017

@pherl, I'm talking about C++. I haven't looked at the implementation, so I can't judge it; I'm just sharing my thoughts as a user.

@jbolla

jbolla commented Sep 13, 2017

@pherl, the doc you shared states "3.4 release (ETA: Q3 2017): Google protobuf implementation for each language will provide APIs to explicitly drop or preserve unknowns for proto3. A temporary flag will be introduced for the default parsing behavior - default to drop unknowns."

3.4 is released. Did that actually make it in? I'm using Java and I see the flag for retaining unknowns, explicitDiscardUnknownFields in CodedInputStream, but the parsing code I see is using:
final boolean shouldDiscardUnknownFieldsProto3() { return explicitDiscardUnknownFields ? true : proto3DiscardUnknownFieldsDefault; }
So even if you don't set that flag you get proto3DiscardUnknownFieldsDefault, which defaults to false and appears not to have any way for external users to change.

@liujisi
Contributor

liujisi commented Sep 14, 2017 via email

@liujisi
Contributor

liujisi commented Dec 11, 2017

All languages will be fixed in 3.5.x releases.

@leighmcculloch
Contributor

@liujisi Now that the direction has changed and support for preserving unknown fields has been added to some implementations, will this recommendation in the official proto3 documentation be changing?

Proto3 implementations can parse messages with unknown fields successfully, however, implementations may or may not support preserving those unknown fields. You should not rely on unknown fields being preserved or dropped. For most Google protocol buffers implementations, unknown fields are not accessible in proto3 via the corresponding proto runtimes, and are dropped and forgotten at deserialization time.

Ref: https://developers.google.com/protocol-buffers/docs/proto3#unknowns

@acozzette
Member

@leighmcculloch Good catch, I'll update that documentation to say that unknown fields are now preserved for proto3 messages as of version 3.5.

@MalteJ

MalteJ commented Aug 26, 2018

Is there a public method to detect if a deserialized message has unknown fields?

This would be useful to check a message which is coming from an untrusted source.
I do not want to relay the message to other services if I am not sure it complies with my proto format. Also, in my case I cannot reserialize it, because the serialized message bytes are cryptographically signed (the serializer is not deterministic across different protobuf implementations).

I'm about to replace protobuf with JWT for this :(
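
One way to approach this without reserializing is to scan the received bytes directly and report any field number that is not in the expected schema. The sketch below is a hand-rolled reader of the wire format, not an API of the official protobuf library; the `unknown_field_numbers` helper is an assumption made for the example, and it inspects top-level fields only (nested messages are skipped as opaque bytes, and the deprecated group wire types are rejected).

```python
def decode_varint(data, pos):
    """Return (value, next position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def unknown_field_numbers(data, known):
    """Return the set of top-level field numbers in data that are not in known."""
    found = set()
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:      # varint
            _, pos = decode_varint(data, pos)
        elif wire_type == 1:    # 64-bit fixed
            pos += 8
        elif wire_type == 2:    # length-delimited (strings, bytes, nested messages)
            length, pos = decode_varint(data, pos)
            pos += length
        elif wire_type == 5:    # 32-bit fixed
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if pos > len(data):
            raise ValueError("truncated message")
        if number not in known:
            found.add(number)
    return found


# Field 1 (varint), field 2 (length-delimited "hi"), field 3 (32-bit fixed).
received = b"\x08\x96\x01" + b"\x12\x02hi" + b"\x1d\x00\x00\x80\x3f"

assert unknown_field_numbers(received, known={1, 2}) == {3}
assert unknown_field_numbers(received, known={1, 2, 3}) == set()
```

A relay could reject or quarantine the message when the returned set is non-empty, while still forwarding the original signed bytes unchanged when it is empty.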

@MalteJ

MalteJ commented Aug 26, 2018

There are methods to get a list of unknown fields. But:

In Go, the field name suggests it should not be used ("XXX_unrecognized").
And the C++ docs say:

Get the UnknownFieldSet for the message.

This contains fields which were seen when the Message was parsed but were not recognized according to the Message's definition. For proto3 protos, this method will always return an empty UnknownFieldSet.

https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message#Reflection.GetUnknownFields.details

@dsnet
Contributor

dsnet commented Sep 4, 2018

In Go, there is not currently a reliable way to programmatically interact with unknown fields. At best, you can use proto.DiscardUnknown to recursively discard all unknown fields. However, there is no stable API to iterate and/or modify the current set of unknown fields.

Furthermore, not all unknown fields are stored in XXX_unrecognized; unknown fields in the extension ranges are stored in proto.XXX_InternalExtensions. The current state of affairs is unfortunate, and we're working on v2 of the API, which will provide a stable way to read, modify, and write unknown fields.

@kditrj2d

I'm coming to this party rather late... I've just upgraded a C# application that uses protocol buffers from version 3.4.0 to 3.6.1. The application relies on unknown fields not being preserved. Now they ARE preserved by default, and I've seen a significant and unacceptable increase in memory consumption. (The ratio of known to unknown fields is about 1:5.) There is mention here of APIs being available to explicitly discard the unknown fields, but it's not clear to me whether these were temporary and have now been removed, or still exist. What is the current situation? Do these APIs still exist in the version 3.6.1 C# distribution? If so, where can I find details?

@Xorlev

Xorlev commented Mar 19, 2019

From my understanding (though I don't work on protobufs, I've just been a part of this thread for a long time), these APIs are here to stay -- you will be able to keep or discard unknown fields depending on your use case.

/// <summary>
/// Creates a new message parser which optionally discards unknown fields when parsing.
/// </summary>
/// <param name="discardUnknownFields">Whether or not to discard unknown fields when parsing.</param>
/// <returns>A newly configured message parser.</returns>
public new MessageParser<T> WithDiscardUnknownFields(bool discardUnknownFields) =>
    new MessageParser<T>(factory, discardUnknownFields);

This appears to be what you want: applied to a MessageParser, it returns a new MessageParser that either discards or keeps unknown fields.

@kditrj2d

Thanks for the reply. Found it, tried it, code now works again.
