What to do about commits? #130

Open
tokee opened this issue Oct 4, 2017 · 0 comments
Comments

@tokee
Collaborator

tokee commented Oct 4, 2017

During a batch index by @ruebot, using warc-indexer, Solr failed multiple times. It turned out that the problem was commits: Unless the option -d / --disable-commit is given, each call to warc-indexer starts with a commit and ends with a commit. Those commits are both hard & soft, meaning that the data are flushed to storage and a new searcher is opened. With @ruebot's 24 workers, about 1 WARC was processed each minute, meaning 2 new searcher-events each minute. That does not sound bad, until we remember that X shards with Y replicas means that X*Y shard replicas go through the new searcher-process each time.
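To put a number on that multiplication, here is a minimal back-of-the-envelope sketch. The function name and the example cluster size (4 shards, 2 replicas) are made up for illustration; only the "2 commits per WARC" and "about 1 WARC per minute" figures come from the description above.

```python
def searcher_reopens_per_minute(warcs_per_minute, shards, replicas, commits_per_warc=2):
    """Estimate how many individual searcher reopen events the cluster handles per minute.

    Each warc-indexer call commits at start and end (commits_per_warc=2), and every
    commit opens a new searcher on each of the shards * replicas cores.
    """
    return warcs_per_minute * commits_per_warc * shards * replicas


# Hypothetical cluster: 4 shards with 2 replicas each, about 1 WARC finished per minute.
print(searcher_reopens_per_minute(1, shards=4, replicas=2))  # prints 16
```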

Running a lot of warc-indexer-processes in parallel is a very easy way of speeding up indexing. Doing so without specifying -d leads to problems. Unfortunately we cannot reliably detect if multiple warc-indexers are running at the same time, so maybe warc-indexer-driven commits should be off by default?

This leads us to the next problem: Without explicit commits, changes will not show up in Solr. curl "http://mysolrcloud:8983/solr/update?commit=true&openSearcher=true" is a simple way of triggering a commit, but it is not very user-friendly to require such a call, and if the user forgets, there will be worrying & debugging.
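For scripted workflows, the same call could be wrapped in a few lines of Python. This is only a sketch using the standard library: the helper names and the timeout value are assumptions, and the URL is the example one from the curl command above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_commit_url(update_url, open_searcher=True):
    """Build the explicit-commit URL for a Solr update endpoint."""
    params = {"commit": "true", "openSearcher": "true" if open_searcher else "false"}
    return f"{update_url}?{urlencode(params)}"


def commit(update_url, timeout=600):
    """Trigger the commit; urlopen raises HTTPError on a non-2xx response."""
    with urlopen(build_commit_url(update_url), timeout=timeout) as response:
        return response.status


# Only builds and shows the URL; call commit(...) against a real Solr to send it.
print(build_commit_url("http://mysolrcloud:8983/solr/update"))
```

The generous timeout is a guess: a commit that opens new searchers on a large index can take a while to return.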

I suggest that we

  • Set autoSoftCommit/maxTime in solrconfig.xml to 10+ minutes so that documents will be visible automatically at some point
  • Make a note in the log, if -d has been specified, that there will be delayed visibility
  • Write clearly in the documentation that for multi-threaded indexing, -d and issuing an explicit post-index commit is the recommended workflow
  • Make a note in the command line arguments list to check the documentation if one runs multiple parallel jobs
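The recommended workflow from the list above (parallel indexing with -d, then one explicit commit) could be driven by something like the following sketch. Both callables are hypothetical placeholders: `index_one` is assumed to run warc-indexer with -d on a single WARC, and `commit_once` is assumed to issue the post-index commit. Here they are replaced by stand-ins so the ordering is visible without a Solr installation.

```python
from concurrent.futures import ThreadPoolExecutor


def index_all_then_commit(warc_paths, index_one, commit_once, workers=4):
    """Index every WARC in parallel without per-job commits, then commit once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all jobs; a failed job re-raises here, so no commit is issued.
        results = list(pool.map(index_one, warc_paths))
    commit_once()
    return results


# Stand-ins that only record what would happen, in order.
calls = []
index_all_then_commit(
    ["a.warc.gz", "b.warc.gz"],
    index_one=lambda path: calls.append(("index", path)),
    commit_once=lambda: calls.append(("commit",)),
)
print(calls[-1])  # prints ('commit',)
```

Skipping the commit when a job fails is a design choice of this sketch, not something the issue prescribes; with autoSoftCommit configured, already-indexed documents would become visible eventually anyway.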