Don't depend on rethinkdb #159

Open
traverseda opened this issue Aug 18, 2019 · 5 comments

Comments

@traverseda

Rethinkdb seems to be essentially dead in the water. There are some attempts to get it back on track, but right now it sadly doesn't seem to be maintained.

@gravypod

I've been looking at Brozzler, the mitm warc proxy, and some of the other IA things. I have been interested in deploying these as well but Rethink, and most importantly the configuration tied up within it, make that difficult to do.

If someone had some high-level documentation about what rethink is used for (other than just Service Discovery) I might be able to extract that code, put it into some ABC class, and allow the selection of multiple backend configuration modes. From a high level it looks like much of what rethink is being used for can be done with redis and a lot of the service discovery it's doing could be managed (for most people) with config files.
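A rough sketch of what such an ABC with selectable backends might look like. Everything here is hypothetical: `CrawlStateBackend`, its methods, `InMemoryBackend`, and `make_backend` are illustrative names, not brozzler's actual API, and the in-memory backend only stands in for a real store such as redis.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class CrawlStateBackend(ABC):
    """Hypothetical interface for crawl state storage (not part of brozzler)."""

    @abstractmethod
    def save_page(self, page_id: str, page: dict) -> None:
        """Insert or overwrite a page record."""

    @abstractmethod
    def get_page(self, page_id: str) -> Optional[dict]:
        """Return the page record, or None if it does not exist."""

    @abstractmethod
    def pages_with_status(self, status: str) -> List[dict]:
        """Return all page records whose 'status' field matches."""


class InMemoryBackend(CrawlStateBackend):
    """Toy backend that keeps everything in a dict; a stand-in for a real store."""

    def __init__(self) -> None:
        self._pages: Dict[str, dict] = {}

    def save_page(self, page_id: str, page: dict) -> None:
        # Copy so later changes by the caller do not leak into the store.
        self._pages[page_id] = dict(page)

    def get_page(self, page_id: str) -> Optional[dict]:
        return self._pages.get(page_id)

    def pages_with_status(self, status: str) -> List[dict]:
        return [p for p in self._pages.values() if p.get("status") == status]


# Registry mapping a config value to a backend class (assumed config scheme).
BACKENDS = {"memory": InMemoryBackend}


def make_backend(name: str) -> CrawlStateBackend:
    """Pick a backend by name, e.g. from a config file entry."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown backend: {name!r}") from None
```

The rest of the crawler would then depend only on `CrawlStateBackend`, and a rethinkdb or redis implementation could be registered in `BACKENDS` alongside the in-memory one.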

Was there a specific design decision that coupled all of these services together (job submission, job state tracking, warc serving, scraping) into rethinkdb? If there is a document that covers this I'd be really happy to take a look at it to get a better picture of what the motivations were for these design choices.

@traverseda
Author

I mean honestly rethinkdb used to be very easy to deploy, and was one DB that could handle all those different services nicely. It's a shame that development stalled like it did, and that no one was able to perform maintenance releases after they lost funding.

@nlevitt
Contributor

nlevitt commented Aug 23, 2019

It is a shame that the rethinkdb company folded, and that the community hasn't really gotten the project on track at this point. But it's still a really solid piece of software.

Brozzler stores all the crawl state in rethinkdb - jobs, sites, and pages, finished, in progress, and queued. We chose rethinkdb primarily because it is truly distributed (implements raft consensus, thus has no single point of failure) and because it supports secondary indexes (unlike a key-value store). It's also very easy to deploy and cluster. Even now, I think it was a pretty good choice.
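To illustrate the secondary-index point: with a plain key-value store you can only fetch a record by its key, so a query like "all queued pages" needs either a full scan or an index you maintain yourself. The toy below shows that hand-maintained index in memory. It is only an illustration of the idea, not RethinkDB's or brozzler's code; `PageStore` and its methods are made-up names.

```python
from collections import defaultdict


class PageStore:
    """Toy key-value store with one hand-maintained secondary index on status."""

    def __init__(self):
        self._status_by_id = {}               # primary: page id -> status
        self._ids_by_status = defaultdict(set)  # secondary: status -> page ids

    def put(self, page_id, status):
        # The index must be updated on every write, or it drifts out of sync.
        old = self._status_by_id.get(page_id)
        if old is not None:
            self._ids_by_status[old].discard(page_id)
        self._status_by_id[page_id] = status
        self._ids_by_status[status].add(page_id)

    def ids_with_status(self, status):
        # Index lookup instead of scanning every record.
        return sorted(self._ids_by_status.get(status, ()))


store = PageStore()
store.put("page-1", "queued")
store.put("page-2", "queued")
store.put("page-1", "finished")  # page-1 moves from one index bucket to another
queued = store.ids_with_status("queued")
finished = store.ids_with_status("finished")
```

A database with built-in secondary indexes does this bookkeeping for you, which is the convenience being described.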

The parts of brozzler that interact with rethinkdb are mostly in frontier.py. We didn't really design the database backend to be modular. But the code is simple enough that replacing rethinkdb should be pretty doable.

We're certainly open to pull requests adding support for another database backend, but have no plans to do this work ourselves.

Since we're not planning to address the issue as written ("don't depend on rethinkdb"), I'm gonna close it. Feel free to continue discussing the topic here though.

@gravypod

@nlevitt is there an IRC/Slack that the maintainers of this code use? I'd like to get a higher-level feel for how this, warcprox, and some of the other tooling in this group fit together.

@nlevitt
Contributor

nlevitt commented Aug 26, 2019

There's a channel on iipc.slack.com but unfortunately that's not completely open to the public. I just created a channel #brozzler on freenode and I'll hang out there.


3 participants