During a batch index by @ruebot, using warc-indexer, Solr failed multiple times. It turned out that the problem was commits: unless the option -d / --disable-commit is given, each call to warc-indexer starts with a commit and ends with a commit. Those commits are both hard and soft, meaning that the data are flushed to storage and a new searcher is opened. @ruebot's 24 workers meant that about 1 WARC was processed each minute, giving 2 new-searcher events each minute. That does not sound bad, until we remember that X shards with Y replicas means that X*Y shard replicas go through the new-searcher process each time.
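To make the multiplication concrete, here is a small back-of-the-envelope sketch. The shard and replica counts are hypothetical, since the actual values of X and Y are not stated above; only the 1 WARC per minute and 2 commits per warc-indexer call come from the description.

```python
def searcher_reopens_per_minute(warcs_per_minute, shards, replicas, commits_per_call=2):
    """Estimate how many new-searcher events the whole cluster handles per minute.

    Each warc-indexer call (without -d) commits at start and at end, and each
    commit opens a new searcher on every replica of every shard.
    """
    return warcs_per_minute * commits_per_call * shards * replicas


# Hypothetical cluster layout: 4 shards with 2 replicas each (X=4, Y=2).
print(searcher_reopens_per_minute(warcs_per_minute=1, shards=4, replicas=2))  # prints 16
```

With these assumed numbers, the seemingly harmless 2 commits per minute turn into 16 searcher reopenings per minute across the cluster.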
Running a lot of warc-indexer processes in parallel is a very easy way of speeding up indexing. Doing so without specifying -d leads to the problems described above. Unfortunately we cannot reliably detect whether multiple warc-indexer processes are running at the same time, so maybe warc-indexer-driven commits should be off by default?
This leads us to the next problem: without explicit commits, changes will not show up in Solr. curl "http://mysolrcloud:8983/solr/update?commit=true&openSearcher=true" is a simple way of triggering one, but requiring such a call is not very user-friendly, and if the user forgets it, there will be worrying and debugging.
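For scripted workflows, the same post-index commit could be issued from Python once all parallel jobs have finished. This is only a sketch: the helper names `build_commit_url` and `issue_commit` are made up for illustration, and the base URL is the placeholder from the curl example above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_commit_url(solr_base_url, open_searcher=True):
    """Build the update URL that asks Solr for an explicit commit."""
    params = {
        "commit": "true",
        "openSearcher": "true" if open_searcher else "false",
    }
    return solr_base_url.rstrip("/") + "/update?" + urlencode(params)


def issue_commit(solr_base_url, timeout_seconds=600):
    """Send the commit request and return the HTTP status code (hypothetical helper)."""
    with urlopen(build_commit_url(solr_base_url), timeout=timeout_seconds) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder host from the curl example; replace with the real Solr endpoint.
    print(build_commit_url("http://mysolrcloud:8983/solr"))
```

Calling `issue_commit` once after the last warc-indexer job has ended triggers a single new searcher per shard replica, instead of two per WARC.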
I suggest that we:
- Set autoSoftCommit/maxTime in solrconfig.xml to 10+ minutes, so that documents will become visible automatically at some point
- Make a note in the log, if -d has been specified, that there will be delayed visibility
- Write clearly in the documentation that for multi-threaded indexing, using -d and issuing an explicit post-index commit is the recommended workflow
- Make a note in the command line arguments list to check the documentation if one runs multiple parallel jobs