Querier error returned as compressed binary data #938

Closed
leth opened this issue Aug 20, 2018 · 17 comments

@leth (Contributor) commented Aug 20, 2018

If I run a certain query in graph mode, I am told: Error: rpc error: code = Code(413) desc = exceeded maximum number of samples in a query (100000)
If I run the same query in table mode, instead I am told: Server error (rpc error: code = Code(500) desc = ����������4�M �0�����V-���*ЕW��1��B��{ ��w��g3�<Śo�� U�����^)N�ܨٯ�-N��|�!�H̸���<�FDZ� n���H~[RO�=ݨ(w�Ou�a��xt�a:~��������������+W����� )

@leth (Contributor, Author) commented Aug 20, 2018

This issue started after 6024681. @tomwilkie, perhaps this was introduced in #910, as that was the next build to release a new querier image?

@leth (Contributor, Author) commented Sep 3, 2018

I'm not sure what fixed this, but it seems to no longer happen on the latest releases!

leth closed this as completed on Sep 3, 2018
@leth (Contributor, Author) commented Sep 3, 2018

Turns out we didn't test what we thought we had tested!

leth reopened this on Sep 3, 2018
@leth (Contributor, Author) commented Sep 5, 2018

Can confirm this is still an issue on the latest image

@bboreham (Contributor) commented Sep 5, 2018

It's working on 6024681c and broken on 57b71bbf, which is the first commit that produced a Querier image after the working one.

57b71bbf is from #910, which changed httpgrpc.pb.go.

@bboreham (Contributor) commented Sep 5, 2018

@leth gave me a tip which revealed what the corruption is: the code generating the error message writes to a compressedResponseWriter, while ErrorFromHTTPResponse, which pulls it out into gRPC, assumes it can use those bytes directly.

@tomwilkie does that raise any thoughts as to how we should avoid this?
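
For illustration, here is a minimal, self-contained Go sketch of that mismatch. The compressedResponseWriter type and the direct reuse of the buffered bytes are stand-ins modelled on the names used in this thread, not the actual Cortex code:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// compressedResponseWriter stands in for the gzip-wrapping middleware on the
// query API: everything the handler writes goes through a gzip.Writer.
type compressedResponseWriter struct {
	http.ResponseWriter
	gz *gzip.Writer
}

func (c *compressedResponseWriter) Write(p []byte) (int, error) {
	return c.gz.Write(p)
}

func main() {
	rec := httptest.NewRecorder()
	cw := &compressedResponseWriter{ResponseWriter: rec, gz: gzip.NewWriter(rec)}

	// The handler writes a plain-text error message through the wrapper...
	http.Error(cw, "exceeded maximum number of samples in a query (100000)", http.StatusInternalServerError)
	cw.gz.Close()

	// ...but if the buffered body is lifted verbatim into a gRPC error, the
	// "message" is gzip bytes, which is exactly the garbage seen in table mode.
	fmt.Printf("rpc error: code = Code(500) desc = %q\n", rec.Body.String())
}
```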

@tomwilkie (Contributor) commented Sep 5, 2018 via email

@bboreham (Contributor) commented Sep 5, 2018

So this was actually caused by #898, not #910; the reason I was looking at the wrong commit is that a unit test failed on the merge for #910: https://circleci.com/gh/weaveworks/cortex/2700

@bboreham (Contributor) commented Sep 5, 2018

@tomwilkie yeah it can wait. Maybe I'll just delete httpgrpc 😉

@tomwilkie (Contributor) commented Sep 5, 2018 via email

bboreham changed the title from "Querier error returned as binary data" to "Querier error returned as compressed binary data" on Sep 6, 2018
@bboreham (Contributor) commented Sep 6, 2018

One approach would be to inspect the Content-Encoding of the response buffer and decompress if necessary.

Seems like a lot of work.
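
Roughly, that approach could look like the hypothetical helper below: check the captured response's Content-Encoding and gunzip the body before reusing it as an error message. This is a sketch, not existing Cortex code, and it assumes the original response headers are still available at that point:

```go
package querier

import (
	"bytes"
	"compress/gzip"
	"io"
	"net/http"
)

// decodeErrorBody is a hypothetical helper: if the captured response declares
// gzip encoding, decompress the body so the plain-text error can be reused;
// otherwise return it unchanged.
func decodeErrorBody(header http.Header, body []byte) ([]byte, error) {
	if header.Get("Content-Encoding") != "gzip" {
		return body, nil
	}
	zr, err := gzip.NewReader(bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}
```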

@tomwilkie (Contributor) commented:

> So this was actually caused by #898

I don't see how - would you mind explaining? AFAICT the Prometheus API Register has always compressed results. I don't see how it ever worked...

@tomwilkie (Contributor) commented:

> One approach would be to inspect the Content-Encoding of the response buffer and decompress if necessary.

Not sure; everything should round-trip. If we return a non-200 compressed response from the querier, it gets propagated as a gRPC error from querier -> authfe, then converted back to an HTTP response and written (compression header and all) back to the client.
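
To make that intended round trip concrete, here is a rough stand-alone sketch. The errorResponse struct and writeBack function are illustrative stand-ins for the payload httpgrpc carries inside a gRPC error and for the front end (authfe) writing it back out; they are not the real weaveworks/common API:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errorResponse is an illustrative stand-in for what the gRPC error would
// carry across querier -> authfe: the original status code, headers and body.
type errorResponse struct {
	Code    int
	Headers http.Header
	Body    []byte
}

// writeBack mimics the front end turning the error back into an HTTP response,
// headers and all, so a gzip-encoded body still reaches the client with its
// Content-Encoding header intact and the browser can decode it.
func writeBack(w http.ResponseWriter, e errorResponse) {
	for k, vs := range e.Headers {
		for _, v := range vs {
			w.Header().Add(k, v)
		}
	}
	w.WriteHeader(e.Code)
	w.Write(e.Body)
}

func main() {
	// Pretend the querier produced a compressed 500.
	from := errorResponse{
		Code:    http.StatusInternalServerError,
		Headers: http.Header{"Content-Encoding": {"gzip"}},
		Body:    []byte("\x1f\x8b...gzipped error body..."),
	}

	rec := httptest.NewRecorder()
	writeBack(rec, from)
	fmt.Println(rec.Code, rec.Header().Get("Content-Encoding"))
}
```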

@bboreham (Contributor) commented:

> would you mind explaining

I haven't identified a specific code change, but we can make the issue come and go by switching between the querier versions noted above.

Even if the round trip were working, we'd still have a smaller problem: the error message is dumped into the OpenTracing span.

@bboreham (Contributor) commented:

I wonder if this relates to #780, which is also about corruption of response data.

@bboreham (Contributor) commented:

One difference between query and query_range is that the Prometheus API turns a storage error into an internal error (500) on query, but not on query_range.

Cortex does send back storage errors, so we should avoid that in cases like "exceeded maximum number of samples", but there is a deeper problem: that Prometheus switch statement actually treats all errors as storage errors, because a type assertion on an interface type is an 'implements' check, not an exact type match.
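
For illustration, a small Go sketch of that pitfall, assuming ErrStorage is declared as a named interface type the way promql did at the time; any non-nil error then matches the ErrStorage case:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStorage mirrors how promql declared it at the time: a named type whose
// underlying type is the error interface, not a concrete struct.
type ErrStorage error

func classify(err error) string {
	// A type-switch case naming an interface type is an "implements" check,
	// so every non-nil error satisfies ErrStorage and lands in this case.
	switch err.(type) {
	case ErrStorage:
		return "storage error -> HTTP 500"
	default:
		return "other error"
	}
}

func main() {
	err := errors.New("exceeded maximum number of samples in a query (100000)")
	fmt.Println(classify(err)) // prints "storage error -> HTTP 500"
}
```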

Checking into where it breaks: it returns 500s, but readable, at 75de09a5, and returns compressed data at 57b71bbf, so it broke in #910. And if I update our front-end proxy to the same version of weaveworks/common, the garbled data goes away, so I conclude the change to gRPC/proto generation was the culprit.

@bboreham (Contributor) commented:

Problems eventually went away after updating everything to the latest version following the protobuf changes.
