-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
proto: reject invalid UTF-8 in strings #499
Conversation
The proto2 and proto3 specifications explicitly say that strings must be composed of valid UTF-8 characters. Thus, report a marshal/unmarshal error if any string fields are invalid. The C++ implementation prints an error anytime invalid UTF-8 strings is present during parsing or serializing. For proto3, it explicitly errors out when parsing an invalid string. In Go, we'll attempt to apply this check on proto2 and proto3, but may relax that constraint depending on the extent of problems.
@@ -1514,6 +1515,9 @@ func unmarshalStringValue(b []byte, f pointer, w int) ([]byte, error) { | |||
return nil, io.ErrUnexpectedEOF | |||
} | |||
v := string(b[:x]) | |||
if !utf8.ValidString(v) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought, but we could test the values with utf8.Valid
before converting them to a string.
I mean, the code is essentially the same, but if we can validate it sooner, why not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no performance penalty for utf8.ValidString
since it doesn't convert the string to a []byte
and call utf8.Valid
under the hood.
I'm going to keep it as is since utf8.ValidString(v)
is more readable than utf8.Valid(b[:x])
, where you don't have to consciously think about slicing b
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, good call. 👍
This seems like a mistake. It is correct according to the spec, but, as you noticed, C++ doesn't actually enforce it for proto2, allowing it to interoperate with bad clients. One could make the argument that the spec, in practice, is whatever the C++ implementation does. I suggest you revert this for proto2 string fields until C++ actually enforces it. |
\cc @xfxyjwf. Should Go drop UTF-8 enforcement on proto2 strings because that's what C++ does, as opposed to what the spec specifies? This won't be the first time Go follows C++ and finds out what C++ does is actually wrong and should not be emulated. |
It's hard to argue that C++ violating the spec is particular bad to follow. We've been burned too many times from people complaining that something works with C++ but not with Go, despite it not strictly following the spec. It's skating uphill for Go to be stricter than C++ in ways that people will trip over. For most of the same reasons that C++ isn't following the spec, Go shouldn't too. |
We've been burned in both directions. For example, C++ unmarshals known fields with an incorrect wire type as an unknown field, while Go had rejected them as a parse error. People complained and my investigation (including asking the proto team) showed that parsing as an unknown field was the right behavior. So I made the change (#511) to be more compliant with the other languages, but then other Go users started complaining that they were relying on the parse error. Everything we do in protos breaks someone and it's fairly tiring defending our actions to all parties involved. In this situation, I'm personally convinced holding to the spec is the correct stance. I checked what Python does and they also fail on both proto2 and proto3 for invalid UTF-8. I didn't check what Java does, but I'm nearly certain they fail as well (given that Java strings cannot represent arbitrary binary). Of course, that's my opinion. I'll let the proto team make the final call. |
…lid UTF-8. For more information see: - golang/protobuf#484 - golang/protobuf#499
* Prevent logs agent from submitting protocol buffer payloads with invalid UTF-8. For more information see: - golang/protobuf#484 - golang/protobuf#499 * Reno
Proto team says to drop enforcement for proto2, but keep it for proto3. See #622. |
The proto2 and proto3 specifications explicitly say that strings must be
composed of valid UTF-8 characters.
The C++ implementation prints an error anytime strings with invalid UTF-8
is present during parsing or serializing. For proto3, it explicitly
errors out when parsing an invalid string.
Thus, we cause marshal/unmarshal to error if any string fields are invalid.
The error returned is fail-fast and indistinguishable. This means that the
we stop the unmarshal process the moment we detect an invalid string,
as opposed to finishing the unmarshal. An indistinguishable error means that
we provide no API for the user to programmatically check specifically for
invalid UTF-8 errors so that they can ignore it.
In conversations with the protobuf team, they felt strongly that there should be
no ability for users to ignore the UTF-8 validation error. However, if this change is
overly problematic for users, we can consider a workaround for users already
depending on strings containing invalid UTF-8.