
Distributed backend to make OpenGrok scale on large deployments #779

Open
MetaGerCodeSearch opened this issue Mar 6, 2014 · 7 comments

@MetaGerCodeSearch

While working on the MetaGer CodeSearch deployment (http://code.metager.de/source/), we seem to have hit some roadblocks for a single machine.

We cover over 3100 repos with about 500 GB of sources in total, all on a single machine with 48 GB RAM and 24 cores. Oftentimes the Java/Tomcat/OpenGrok combination will simply break down and hog all available CPUs and RAM, probably due to some bugs that haven't received much exposure yet. Another factor may be that, as far as I understand, index searchers are spawned for every repo. I understand this might be a deployment size not everyone is willing to reach (or test for). :)

A distributed/clustered approach across several Tomcats could prove beneficial to OpenGrok installations such as ours, and others with a smaller number of repos, files and disk space might find it helpful as well.

Thanks,
Chris, CodeSearch project

@tarzanek tarzanek self-assigned this Mar 6, 2014
@tarzanek tarzanek added this to the 0.13 milestone Mar 6, 2014
@cdgz

cdgz commented Apr 20, 2014

Hi,

As you might know, the goal of the indexer is to process only fresh files, so the fact that it takes all available CPU/RAM on every cron run points to a misconfiguration, unless the activity on your projects is incredibly high and the nightly deltas are constantly big.
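For context, "only fresh files" in Lucene terms means updating the per-file document rather than rebuilding the whole index. A minimal sketch of the idea, assuming a `path` key field (the path, field names and contents are illustrative, not OpenGrok's actual schema):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class IncrementalIndexSketch {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/var/opengrok/data/index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            // Only files whose on-disk state is newer than their index entry
            // need to be revisited; unchanged files are left alone.
            String relPath = "/project/src/Main.java"; // illustrative
            Document doc = new Document();
            doc.add(new StringField("path", relPath, Field.Store.YES));
            doc.add(new TextField("full", "file contents here", Field.Store.NO));
            // updateDocument() atomically replaces any previous document
            // keyed on the same path, so only the delta gets reindexed.
            writer.updateDocument(new Term("path", relPath), doc);
        }
    }
}
```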

I have dealt with some big OpenGrok installations similar to what you've described (even a bit heavier). In my experience, with careful tuning a routine reindex of fresh code took 10 to 20 minutes, with minimal resource consumption (RAM jumped a bit during indexing, that's all).

My main advice is to track the indexer log (${DATA_ROOT}/../log/opengrok.1.0.log) while it runs. The output is rather verbose and can give precious information on where the indexer gets stuck or spends most of its time. Maybe some files/paths should be added to IGNORE_PATTERNS?

Note that in old versions, if you are not using Derby to store the history cache, it is regenerated from scratch every night. This can be painful if your repos have a long VCS history. I faced it in 0.10; I'm not sure whether it is fixed now (more details in this discussion). Also take into account Trond's notes about the flip-flop indexing pattern (keeping two copies of the index and switching between them only during a reindex), sketched below.
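For readers unfamiliar with the flip-flop pattern, here is a minimal hypothetical sketch in Java, assuming the webapp reads the index through a single symlink (all paths and the runIndexer() helper are illustrative, not OpenGrok API):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import static java.nio.file.StandardCopyOption.ATOMIC_MOVE;

public class FlipFlopReindex {
    public static void main(String[] args) throws Exception {
        Path a = Paths.get("/var/opengrok/index_a");
        Path b = Paths.get("/var/opengrok/index_b");
        Path current = Paths.get("/var/opengrok/index"); // what Tomcat reads

        // Whichever copy is *not* currently live is the reindex target.
        Path live = Files.readSymbolicLink(current);
        Path target = live.equals(a) ? b : a;

        runIndexer(target); // placeholder for the actual indexer run

        // Atomically repoint the symlink: build a temporary link and
        // rename() it over the old one, so readers never see a half state.
        Path tmp = Paths.get("/var/opengrok/index.tmp");
        Files.deleteIfExists(tmp);
        Files.createSymbolicLink(tmp, target);
        Files.move(tmp, current, ATOMIC_MOVE);
    }

    private static void runIndexer(Path dataRoot) {
        // e.g. exec java -jar opengrok.jar against dataRoot (omitted)
    }
}
```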

Last and maybe least: what is your FS? OpenGrok stores its index on disk and processes all Lucene queries there, which is why some combinations (Solaris/ZFS) are preferable to others (Linux/ext3) in terms of overall performance.

@vladak
Member

vladak commented Apr 22, 2014

The file-based history index has been regenerated incrementally since #305, which is in 0.12, except that it needs a fix for #818, so 0.12 will need a respin.

@vladak
Member

vladak commented Oct 10, 2016

The problem with not reusing index searchers will be addressed in 0.13 (#1186 and others).
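For reference, the Lucene-level idea is to share one long-lived SearcherManager per index instead of constructing a fresh IndexSearcher per request; a minimal sketch (the path is illustrative, and this is not the actual #1186 code):

```java
import java.nio.file.Paths;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.FSDirectory;

public class SearcherReuseSketch {
    public static void main(String[] args) throws Exception {
        // One SearcherManager per project, created once at webapp startup.
        SearcherManager mgr = new SearcherManager(
                FSDirectory.open(Paths.get("/var/opengrok/data/index/project")),
                null);

        IndexSearcher searcher = mgr.acquire();
        try {
            // run queries against the shared, warm searcher here
        } finally {
            mgr.release(searcher);
        }

        // After a reindex, refresh instead of reopening everything:
        mgr.maybeRefresh();
    }
}
```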

@vladak vladak removed this from the 1.0 milestone Mar 9, 2017
@vladak
Member

vladak commented Mar 10, 2017

It seems to me that in order to do this, the history cache would have to be moved into the Lucene index as well (otherwise some distributed file system would have to be used). After all, documents have a 1:1 mapping to source code files for both the index and the history cache, so why not unite the two? @tarzanek, any insights?
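A hypothetical sketch of that unification, assuming the per-file history cache entry is serialized and stored on the very same Lucene document as the file's index entry (field names are made up for illustration):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;

public class HistoryInDocSketch {
    static Document forFile(String relPath, String serializedHistory) {
        Document doc = new Document();
        doc.add(new StringField("path", relPath, Field.Store.YES));
        // The history cache entry rides along as a stored field, so index
        // and history are written, replicated and sharded together.
        doc.add(new StoredField("history", serializedHistory));
        return doc;
    }
}
```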

@tarzanek
Contributor

The question is whether Lucene as a backend can be configured to properly support distributed operation the way NoSQL stores do; Solr and Elasticsearch already do it, so how to do it is obvious (or look at Scylla, Aerospike, CouchDB, HBase, Mongo, Redis, ...).
Doing so will require a bigger replication factor (RF) for the data and proper sharding, so a distributed OpenGrok depends on distributing the data and either configuring Lucene accordingly or leveraging Solr (or the like).
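As a rough illustration of the Solr route, a minimal SolrJ sketch querying a hypothetical sharded "opengrok" collection, where SolrCloud would handle the sharding and replication factor transparently (the URL, collection name and field names are assumptions; no such OpenGrok integration exists today):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SolrSearchSketch {
    public static void main(String[] args) throws Exception {
        // The client talks to one node; the collection behind it may be
        // sharded and replicated across the cluster.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/opengrok").build()) {
            SolrQuery q = new SolrQuery("full:OutOfMemoryError");
            q.setRows(25);
            QueryResponse rsp = solr.query(q);
            rsp.getResults().forEach(d -> System.out.println(d.get("path")));
        }
    }
}
```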

@idodeclare
Contributor

@tarzanek, Solr seems the most tractable, but I would hope to see the internal APIs reworked so that either local Lucene or distributed Solr is a choice left to the user.

@vladak
Member

vladak commented Sep 23, 2019

I have never operated a distributed backend; however, I quite like what is presented at https://github.blog/2019-03-05-vulcanizer-a-library-for-operating-elasticsearch/
