
Jaeger-collector stops storing traces #897

Closed
tmszdmsk opened this issue Jun 29, 2018 · 6 comments

tmszdmsk commented Jun 29, 2018

Requirement - what kind of business use case are you trying to solve?

I want to be sure jaeger-collector collects and persists data

Problem - what in Jaeger blocks you from solving the requirement?

We started using Jaeger in production 5 days ago. Since we began sending traces from 60+ services, it has stopped working twice. By that I mean that traces sent to the affected collector are no longer available in jaeger-query. We have 4 collectors (on prod & dev), all using the same Elasticsearch cluster to store their traces. Still, only the ones handling production traffic break; dev traces go through.

Despite the logs on stderr not being very readable (#896), I was able to fish out a stacktrace which may be helpful. The appearance of logs on stderr correlates with this collector not working; there are no logs on stderr or stdout when it operates normally.

"stacktrace":"github.com/jaegertracing/jaeger/pkg/es/config.(*Configuration).NewClient.func2\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/es/config/config.go:90\ngithub.heygears.com/jaegertracing/jaeger/vendor/gopkg.in/olivere/elastic%2ev5.(*bulkWorker).commit\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/gopkg.in/olivere/elastic.v5/bulk_processor.go:506\ngithub.heygears.com/jaegertracing/jaeger/vendor/gopkg.in/olivere/elastic%2ev5.(*bulkWorker).work\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/gopkg.in/olivere/elastic.v5/bulk_processor.go:442"

jaeger-collector v1.5.0 is deployed on a Kubernetes cluster.

A pod restart resolves the issue for some time.

Proposal - what do you suggest to solve the problem or improve the existing situation?

N/A

Any open questions to address

N/A

jpkrohling (Contributor) commented

@jkandasa, are we using Elasticsearch for the soak tests? For how long are we running those?


tmszdmsk commented Jun 29, 2018

This may add some more insight: #896 (comment)


tmszdmsk commented Jun 29, 2018

After some digging, it looks like the culprit is:

github.com/jaegertracing/jaeger/pkg/es/config.(*Configuration).NewClient.func2
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/pkg/es/config/config.go:90
github.com/jaegertracing/jaeger/vendor/gopkg.in/olivere/elastic%2ev5.(*bulkWorker).commit
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/gopkg.in/olivere/elastic.v5/bulk_processor.go:506
github.com/jaegertracing/jaeger/vendor/gopkg.in/olivere/elastic%2ev5.(*bulkWorker).work
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/gopkg.in/olivere/elastic.v5/bulk_processor.go:455

which is reported in #779. However, I think we should stop buffering messages that were rejected because of validation, so that Jaeger can still process and save subsequent traces.
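
To illustrate what I mean, here is a rough sketch (not the actual collector code; the URL, processor name, and batch settings below are just placeholders) using the olivere/elastic.v5 bulk processor API: the After callback receives the commit error and the per-document results, so rejected documents can be logged and dropped instead of being kept around.

```go
// Minimal sketch, assuming an olivere/elastic.v5 client; placeholder values throughout.
package main

import (
	"context"
	"log"
	"time"

	elastic "gopkg.in/olivere/elastic.v5"
)

func main() {
	// Placeholder URL; the collector takes this from its own ES configuration.
	client, err := elastic.NewClient(elastic.SetURL("http://localhost:9200"))
	if err != nil {
		log.Fatal(err)
	}

	// After runs once per bulk commit with the commit error (if any) and the
	// per-document results, so rejected documents can be logged and dropped.
	after := func(id int64, _ []elastic.BulkableRequest, resp *elastic.BulkResponse, err error) {
		if err != nil {
			log.Printf("bulk commit %d failed: %v", id, err)
		}
		if resp == nil {
			return
		}
		for _, item := range resp.Failed() {
			// item.Error carries the Elasticsearch rejection reason (e.g. a mapping/validation error).
			log.Printf("bulk commit %d: document %s rejected: %+v", id, item.Id, item.Error)
		}
	}

	processor, err := client.BulkProcessor().
		Name("sketch").
		Workers(1).
		BulkActions(1000).                     // placeholder batch size
		FlushInterval(200 * time.Millisecond). // placeholder flush interval
		After(after).
		Do(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer processor.Close()
}
```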

kevinearls (Contributor) commented

@jpkrohling In the past I've run soak tests for up to 12 hours with as many as 5 collectors and have not seen this issue. We have been waiting to get a better performance testing setup to run longer tests. @jkandasa can help more with this in the future.

yurishkuro (Member) commented

@tmszdmsk does #905 solve this issue? I am not sure what the problem was here.

tmszdmsk (Author) commented

Yes
