Distributed backend to make OpenGrok scale on large deployments #779
Hi,

As you might know, the goal of the indexer is to process only fresh files. The fact that it takes all available CPU/RAM on every cron run is definitely a misconfiguration, unless the activity on your projects is incredibly high and the nightly deltas are constantly big.

I have dealt with some big OpenGrok installations, similar to what you've mentioned (even a bit heavier). From my experience, with careful tuning a usual indexing run over fresh code took 10 to 20 minutes, with minimal resource consumption (RAM jumped a bit during indexing, that's all).

The only advice is to track the indexer log (${DATA_ROOT}/../log/opengrok.1.0.log) during its work. The output is rather verbose and can give precious information on where it gets stuck or spends most of its time. Maybe some files/paths should be added to IGNORE_PATTERNS?

Note that in old versions, if you are not using Derby to store the history cache, it is regenerated from scratch every night. This might be painful if your repos have long VCS history. I faced this in 0.10 and am not sure if it is fixed now (more details in this discussion).

Take into account Trond's notes about the flip-flop indexing pattern (keeping two copies of the index and switching between them only during reindex).

The last and maybe the least: what is your FS? OpenGrok stores its index on disk and processes all Lucene queries there, which is why some combinations (Solaris/ZFS) are preferable to others (Linux/ext3) in terms of overall performance.
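As a concrete illustration of the log-watching advice, here is a minimal, self-contained Java sketch that follows the indexer log and surfaces per-file lines. The log path is the one mentioned above; the "Adding:" and "WARNING" markers are assumptions about what the indexer prints, so adjust them to whatever your version actually logs.

```java
// Minimal sketch: follow the OpenGrok indexer log and print lines that name
// individual files, to spot paths where indexing stalls. Assumes DATA_ROOT
// is set in the environment; "Adding:"/"WARNING" are assumed log markers.
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TailIndexerLog {
    public static void main(String[] args) throws Exception {
        Path log = Paths.get(System.getenv("DATA_ROOT"), "..", "log", "opengrok.1.0.log");
        try (BufferedReader reader = Files.newBufferedReader(log)) {
            while (true) {
                String line = reader.readLine();
                if (line == null) {          // reached the current end of file,
                    Thread.sleep(1000);      // wait for the indexer to write more
                    continue;
                }
                if (line.contains("Adding:") || line.contains("WARNING")) {
                    System.out.println(line);
                }
            }
        }
    }
}
```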
The problem with not reusing indexsearchers will be addressed in 0.13 (#1186 and others).
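For context on what searcher reuse looks like, below is a minimal sketch using Lucene's SearcherManager, the stock mechanism for sharing IndexSearcher instances across requests instead of opening a new one per query. This is not the actual change from #1186; the index path and the "full" field name are made up.

```java
// Minimal sketch of reusing an IndexSearcher via Lucene's SearcherManager.
// Callers acquire a shared searcher, run queries, and release it; the
// manager refreshes searchers cheaply when the index changes.
import java.nio.file.Paths;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SearcherReuse {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get("/var/opengrok/data/index/myproject"));
        SearcherManager manager = new SearcherManager(dir, new SearcherFactory());
        IndexSearcher searcher = manager.acquire();   // shared across threads
        try {
            TopDocs hits = searcher.search(new TermQuery(new Term("full", "main")), 10);
            System.out.println("hits: " + hits.totalHits);
        } finally {
            manager.release(searcher);                // never close it directly
        }
        manager.maybeRefresh();                       // pick up index changes
        manager.close();
    }
}
```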
It seems to me that in order to do this, the history cache would have to be moved into the Lucene index as well (otherwise some distributed file system would have to be used). After all, documents have a 1:1 mapping to source code files for both the index and the history cache, so why not unite the two? @tarzanek, any insights?
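A minimal sketch of that idea: since index documents and history cache entries both map 1:1 to source files, a single Lucene document could carry both. The field names ("path", "full", "history") and the serialized-history approach are assumptions, not OpenGrok's actual schema.

```java
// One Lucene document per source file, carrying both the searchable text
// and an opaque per-file history cache blob.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class UnifiedDocument {
    static Document build(String path, String contents, byte[] serializedHistory) {
        Document doc = new Document();
        // Exact-match key shared by search results and history lookups.
        doc.add(new StringField("path", path, Field.Store.YES));
        // Tokenized source text for full-text search.
        doc.add(new TextField("full", contents, Field.Store.NO));
        // Per-file history, stored but not indexed.
        doc.add(new StoredField("history", serializedHistory));
        return doc;
    }
}
```

Updating history would then ride along with reindexing a file, instead of being a separate on-disk cache to keep in sync.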
The question is whether Lucene as a backend can be configured to properly support a distributed NoSQL setup, but Solr and Elasticsearch do it, so how to do it is obvious (or look at Scylla, Aerospike, CouchDB, HBase, MongoDB, Redis, ...).
@tarzanek, Solr seems most tractable, but I would hope to see the internal APIs reworked so that either local Lucene or distributed Solr is a choice left to the user.
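A hypothetical shape for that abstraction: the webapp codes against one interface, and the deployment picks an embedded Lucene implementation or a distributed Solr/Elasticsearch one. All names here are illustrative, not an existing OpenGrok API.

```java
// A pluggable search backend: the rest of the application would depend only
// on this interface, never on Lucene or Solr classes directly.
import java.util.List;

interface SearchBackend extends AutoCloseable {
    /** Add or update the document for one source file. */
    void index(String path, String contents) throws Exception;

    /** Run a query and return the paths of matching files. */
    List<String> search(String query, int limit) throws Exception;
}

// A LuceneBackend would wrap an IndexWriter/SearcherManager pair for the
// single-machine case; a SolrBackend would wrap an HTTP client talking to a
// SolrCloud cluster. Deployments choose one at configuration time.
```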
I have never operated a distributed backend; however, I quite like what is presented at https://github.blog/2019-03-05-vulcanizer-a-library-for-operating-elasticsearch/
While working on the MetaGer CodeSearch deployment ( http://code.metager.de/source/ ), we seem to have hit some roadblocks on a single machine.
We cover over 3100 repos with about 500GB of sources total, all on a machine with 48GB RAM and 24 cores. Often the Java/Tomcat/OpenGrok combination will simply break down and hog all available CPUs and RAM, probably due to some bugs that haven't received much exposure yet. Another factor may be that, as far as I understand, indexsearchers are spawned for every repo. I understand this might be a deployment size not everyone is willing to reach (or test for). :)
A distributed/clustered approach running several Tomcats could prove beneficial to OpenGrok installations such as ours, but others with fewer repos, files, and less hard drive space might find it helpful as well.
Thanks,
Chris, CodeSearch project