
Distributed backend to make OpenGrok scale on large deployments #779

Open
MetaGerCodeSearch opened this issue Mar 6, 2014 · 7 comments

@MetaGerCodeSearch

While working on the MetaGer CodeSearch deployment (http://code.metager.de/source/), we seem to have hit some roadblocks for a single machine.

We cover over 3100 repos with about 500 GB of sources in total, all on a single machine with 48 GB RAM and 24 cores. Oftentimes the Java/Tomcat/OpenGrok combination will simply break down and hog all available CPUs and RAM, probably due to some bugs that haven't received much exposure yet. Another factor may be that, as far as I understand, index searchers are spawned for every repo. I understand this might be a deployment size not everyone is willing to reach (or test for). :)

A distributed/clustered approach across several Tomcats could prove beneficial to OpenGrok installations such as ours, and others with a smaller number of repos, files and disk space might find it helpful as well.

Thanks,
Chris, CodeSearch project

@tarzanek tarzanek self-assigned this Mar 6, 2014
@tarzanek tarzanek added this to the 0.13 milestone Mar 6, 2014
@cdgz

cdgz commented Apr 20, 2014

Hi,

As you might know, the goal of the indexer is to process only fresh files, so the fact that it takes all available CPU/RAM on every cron run points to a misconfiguration, unless the activity on your projects is incredibly high and the nightly deltas are constantly big.
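For context, "only fresh files" in Lucene terms means updating the per-file document rather than rebuilding the whole index. A minimal sketch of the idea, assuming a `path` key field (the path, field names and contents are illustrative, not OpenGrok's actual schema):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class IncrementalIndexSketch {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/var/opengrok/data/index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            // Only files whose on-disk state is newer than their index entry
            // need to be revisited; unchanged files are left alone.
            String relPath = "/project/src/Main.java"; // illustrative
            Document doc = new Document();
            doc.add(new StringField("path", relPath, Field.Store.YES));
            doc.add(new TextField("full", "file contents here", Field.Store.NO));
            // updateDocument() atomically replaces any previous document
            // keyed on the same path, so only the delta gets reindexed.
            writer.updateDocument(new Term("path", relPath), doc);
        }
    }
}
```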

I have dealt with some big OpenGrok installations similar to what you've described (even a bit heavier). In my experience, with careful tuning a routine reindex of fresh code took 10 to 20 minutes, with minimal resource consumption (RAM jumped a bit during indexing, that's all).

My main advice is to track the indexer log (${DATA_ROOT}/../log/opengrok.1.0.log) while it runs. The output is rather verbose and can give precious information on where the indexer gets stuck or spends most of its time. Maybe some files/paths should be added to IGNORE_PATTERNS?

Note that in old versions, if you are not using Derby to store the history cache, it is regenerated from scratch every night. This can be painful if your repos have a long VCS history. I faced it in 0.10; I'm not sure whether it is fixed now (more details in this discussion). Also take into account Trond's notes about the flip-flop indexing pattern (keeping two copies of the index and switching between them only during a reindex), sketched below.
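For readers unfamiliar with the flip-flop pattern, here is a minimal hypothetical sketch in Java, assuming the webapp reads the index through a single symlink (all paths and the runIndexer() helper are illustrative, not OpenGrok API):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import static java.nio.file.StandardCopyOption.ATOMIC_MOVE;

public class FlipFlopReindex {
    public static void main(String[] args) throws Exception {
        Path a = Paths.get("/var/opengrok/index_a");
        Path b = Paths.get("/var/opengrok/index_b");
        Path current = Paths.get("/var/opengrok/index"); // what Tomcat reads

        // Whichever copy is *not* currently live is the reindex target.
        Path live = Files.readSymbolicLink(current);
        Path target = live.equals(a) ? b : a;

        runIndexer(target); // placeholder for the actual indexer run

        // Atomically repoint the symlink: build a temporary link and
        // rename() it over the old one, so readers never see a half state.
        Path tmp = Paths.get("/var/opengrok/index.tmp");
        Files.deleteIfExists(tmp);
        Files.createSymbolicLink(tmp, target);
        Files.move(tmp, current, ATOMIC_MOVE);
    }

    private static void runIndexer(Path dataRoot) {
        // e.g. exec java -jar opengrok.jar against dataRoot (omitted)
    }
}
```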

Last and maybe least: what is your FS? OpenGrok stores its index on disk and processes all Lucene queries there, which is why some combinations (Solaris/ZFS) are preferable to others (Linux/ext3) in terms of overall performance.

@vladak
Member

vladak commented Apr 22, 2014

The file-based history index has been regenerated incrementally since #305, which is in 0.12, except that it needs a fix for #818, so 0.12 will need a respin.

@vladak
Member

vladak commented Oct 10, 2016

The problem with not reusing index searchers will be addressed in 0.13 (#1186 and others).
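For reference, the Lucene-level idea is to share one long-lived SearcherManager per index instead of constructing a fresh IndexSearcher per request; a minimal sketch (the path is illustrative, and this is not the actual #1186 code):

```java
import java.nio.file.Paths;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.FSDirectory;

public class SearcherReuseSketch {
    public static void main(String[] args) throws Exception {
        // One SearcherManager per project, created once at webapp startup.
        SearcherManager mgr = new SearcherManager(
                FSDirectory.open(Paths.get("/var/opengrok/data/index/project")),
                null);

        IndexSearcher searcher = mgr.acquire();
        try {
            // run queries against the shared, warm searcher here
        } finally {
            mgr.release(searcher);
        }

        // After a reindex, refresh instead of reopening everything:
        mgr.maybeRefresh();
    }
}
```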

@vladak vladak removed this from the 1.0 milestone Mar 9, 2017
@vladak
Member

vladak commented Mar 10, 2017

It seems to me that in order to do this, the history cache would have to be moved into the Lucene index as well (otherwise some distributed file system would have to be used). After all, documents have a 1:1 mapping to source code files for both the index and the history cache, so why not unite the two? @tarzanek, any insights?
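A hypothetical sketch of that unification, assuming the per-file history cache entry is serialized and stored on the very same Lucene document as the file's index entry (field names are made up for illustration):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;

public class HistoryInDocSketch {
    static Document forFile(String relPath, String serializedHistory) {
        Document doc = new Document();
        doc.add(new StringField("path", relPath, Field.Store.YES));
        // The history cache entry rides along as a stored field, so index
        // and history are written, replicated and sharded together.
        doc.add(new StoredField("history", serializedHistory));
        return doc;
    }
}
```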

@tarzanek
Contributor

The question is whether Lucene as a backend can be configured to properly support distributed operation the way NoSQL stores do; Solr and Elasticsearch already do it, so how to do it is obvious (or look at Scylla, Aerospike, CouchDB, HBase, Mongo, Redis, ...).
Doing so will require a bigger replication factor (RF) for the data and proper sharding, so a distributed OpenGrok depends on distributing the data and either configuring Lucene accordingly or leveraging Solr (or the like).
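As a rough illustration of the Solr route, a minimal SolrJ sketch querying a hypothetical sharded "opengrok" collection, where SolrCloud would handle the sharding and replication factor transparently (the URL, collection name and field names are assumptions; no such OpenGrok integration exists today):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SolrSearchSketch {
    public static void main(String[] args) throws Exception {
        // The client talks to one node; the collection behind it may be
        // sharded and replicated across the cluster.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/opengrok").build()) {
            SolrQuery q = new SolrQuery("full:OutOfMemoryError");
            q.setRows(25);
            QueryResponse rsp = solr.query(q);
            rsp.getResults().forEach(d -> System.out.println(d.get("path")));
        }
    }
}
```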

@idodeclare
Contributor

@tarzanek, Solr seems the most tractable, but I would hope to see the internal APIs reworked so that either local Lucene or distributed Solr is a choice left to the user.

@vladak
Member

vladak commented Sep 23, 2019

I have never operated a distributed backend; however, I quite like what is presented at https://github.blog/2019-03-05-vulcanizer-a-library-for-operating-elasticsearch/
