Plugin crashes when search returns docs containing invalid UTF-8 byte sequences. #101

Open
yaauie opened this issue Mar 6, 2019 · 3 comments

Comments

@yaauie
Contributor

yaauie commented Mar 6, 2019

This is a rephrasing of elastic/logstash#10516, opened by @matteogrolla on 2019-03-06.

I have a document in Elasticsearch that crashes the Logstash Elasticsearch input plugin when it tries to read it.
The document is included at the end of this message, along with the error log reported by Logstash.
I'm using Logstash to migrate documents from Elasticsearch to Mongo, but when Logstash encounters the problematic document, the input plugin is restarted and starts again from the beginning.
I'd like to at least skip the documents that can't be parsed, but I can't find a way to do so.
Can you help me?

P.S. If I create a new document in ES using curl and the textual representation of the problematic document given here, I don't get a parse error from Logstash on this new document.

-------Error log-------

[2019-03-06T12:43:47,696][ERROR][logstash.pipeline        ] A plugin had an unrecoverable error. Will restart this plugin.
  Pipeline_id:main
  Plugin: <LogStash::Inputs::Elasticsearch index=>"fulltextmg_33", id=>"3d2d80a0e02debd1b54d39b3e6b88b54a1ea45fe2c8ae8ddf2b0ec42e080ff61", hosts=>["pbauci01"], query=>"{ \"query\": { \"term\": { \"_id\": \"http://www.facebook.com/114701051917886_2073179089403396\"} } }", enable_metric=>true, codec=><LogStash::Codecs::JSON id=>"json_149580ae-80e8-4f8f-8728-66db3890cf1f", enable_metric=>true, charset=>"UTF-8">, size=>1000, scroll=>"1m", docinfo=>false, docinfo_target=>"@metadata", docinfo_fields=>["_index", "_type", "_id"], ssl=>false>
  Error: invalid byte sequence in UTF-8
  Exception: MultiJson::ParseError
  Stack: /opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/jrjackson-0.4.6-java/lib/jrjackson/jrjackson.rb:91:in `is_time_string?'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/jrjackson-0.4.6-java/lib/jrjackson/jrjackson.rb:36:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/multi_json-1.13.1/lib/multi_json/adapters/jr_jackson.rb:11:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/multi_json-1.13.1/lib/multi_json/adapter.rb:21:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/multi_json-1.13.1/lib/multi_json.rb:122:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/transport/serializer/multi_json.rb:24:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/transport/base.rb:322:in `perform_request'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/client.rb:131:in `perform_request'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-api-5.0.5/lib/elasticsearch/api/actions/search.rb:183:in `search'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/logstash-input-elasticsearch-4.2.1/lib/logstash/inputs/elasticsearch.rb:200:in `do_run'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/logstash-input-elasticsearch-4.2.1/lib/logstash/inputs/elasticsearch.rb:188:in `run'
/opt/logstash-6.6.1/logstash-core/lib/logstash/pipeline.rb:426:in `inputworker'
/opt/logstash-6.6.1/logstash-core/lib/logstash/pipeline.rb:420:in `block in start_input'

[...]

Unfortunately, the document pasted into the original bug report is valid UTF-8, likely a result of the GitHub UI's form auto-coercing from the pasted encoding to UTF-8.

@matteogrolla would you be able to paste the response into a file, and upload the file without any character encoding?
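
One way to confirm that a saved response still contains the offending bytes before uploading it is to read the file as raw bytes and ask Ruby whether they form valid UTF-8. The following is a minimal sketch, not part of Logstash or the plugin; `invalid_utf8_offsets` is a hypothetical helper name introduced here for illustration.

```ruby
# Hypothetical helper (not part of Logstash or the plugin): returns the byte
# offsets at which a byte string fails to decode as UTF-8.
def invalid_utf8_offsets(bytes)
  # Work on a copy so the caller's string keeps its original encoding tag.
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  return [] if text.valid_encoding?

  offsets = []
  pos = 0
  # each_char yields undecodable bytes as separate, invalid "characters",
  # so tracking bytesize gives the offset of each bad byte.
  text.each_char do |char|
    offsets << pos unless char.valid_encoding?
    pos += char.bytesize
  end
  offsets
end
```

For example, `invalid_utf8_offsets(File.binread("response.json"))` (the file name is a placeholder) returns an empty array when the file is clean UTF-8, which would suggest the bytes were already converted somewhere along the way. `File.binread` is used so that Ruby does not apply any encoding conversion while reading.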


Potentially related:

@matteogrolla

Hi Ry,
I don't understand why you stripped my workaround when you moved the issue.
At minimum, it clearly shows where the problem comes from.
The workaround isn't the proper solution, since it modifies jrjackson, but it works and could help those who need an urgent solution.
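
For readers who need a stopgap outside of Logstash, a generic Ruby technique is to replace invalid bytes before parsing. The sketch below is not the workaround mentioned above, which is not reproduced in this thread, and it does not hook into the plugin: per the stack trace, the exception is raised while elasticsearch-transport deserializes the search response, before the plugin handles individual documents. `parse_json_lossy` is a hypothetical helper name, and the sketch assumes that losing the invalid bytes is acceptable.

```ruby
require "json"

# Hypothetical helper (not part of Logstash, jrjackson, or the plugin):
# replaces any bytes that are not valid UTF-8 with U+FFFD, then parses
# the result as JSON.
def parse_json_lossy(bytes)
  # Work on a copy so the caller's string is left untouched.
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  # String#scrub swaps each invalid byte sequence for the replacement string.
  text = text.scrub("\uFFFD") unless text.valid_encoding?
  JSON.parse(text)
end
```

This trades fidelity for progress: the document parses, but the invalid bytes are replaced by the Unicode replacement character rather than recovered.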

@yaauie
Contributor Author

yaauie commented Apr 3, 2019

@matteogrolla there was no malicious intent on my part; the issue was initially filed in the wrong place and I attempted to move and link to it in the places where it would be better addressed, but failed to also copy along the commentary.

We are still waiting on a follow-up from you with a document that exhibits the symptom:

Unfortunately, the document pasted into the original bug report is valid UTF-8, likely a result of the GitHub UI's form auto-coercing from the pasted encoding to UTF-8.

@matteogrolla would you be able to paste the response into a file, and upload the file without any character encoding?

@matteogrolla

I've downloaded the content with

curl -X POST http://pbauci01:9200/fulltext_33/_s85f59a70' -H 'cache-control: no-cache' -d '{ 4-965b8
"query": {
"term": { "url": "http://www.facebook.com/114701051917886_2073179089403396"}
}
}' > logstash_problematic_doc.json

and edited the file to keep only the _source field value

logstash_problematic_doc.txt
