Don't depend on rethinkdb #159

traverseda · 2019-08-18T20:09:22Z

Rethinkdb seems to be essentially dead in the water. There are some attempts to get it back on track, but right now it sadly doesn't seem to be maintained.

gravypod · 2019-08-23T17:23:47Z

I've been looking at Brozzler, the mitm warc proxy, and some of the other IA things. I have been interested in deploying these as well but Rethink, and most importantly the configuration tied up within it, make that difficult to do.

If someone had some high-level documentation about what rethink is used for (other than just service discovery), I might be able to extract that code, put it behind an abstract base class (ABC), and allow a choice between multiple backend configuration modes. From a high level it looks like much of what rethink is being used for could be done with redis, and a lot of the service discovery it's doing could be managed (for most people) with config files.

Was there a specific design decision that coupled all of these services together (job submission, job state tracking, warc serving, scraping) into rethinkdb? If there is a document that covers this I'd be really happy to take a look at it to get a better picture of what the motivations were for these design choices.
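To make the ABC idea concrete, here is a minimal sketch of what such a backend interface might look like. Everything in it is hypothetical: `CrawlStateBackend`, `InMemoryBackend`, `make_backend`, and the page fields are illustrative names, not brozzler's actual classes or schema.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class CrawlStateBackend(ABC):
    """Hypothetical storage interface; not part of brozzler."""

    @abstractmethod
    def put_page(self, page_id: str, page: dict) -> None:
        """Store or overwrite a page record."""

    @abstractmethod
    def get_page(self, page_id: str) -> Optional[dict]:
        """Return the page record, or None if it does not exist."""

    @abstractmethod
    def pages_with_status(self, status: str) -> List[dict]:
        """Return all page records whose 'status' field matches."""


class InMemoryBackend(CrawlStateBackend):
    """Toy backend keeping everything in a dict; for illustration only."""

    def __init__(self) -> None:
        self._pages: Dict[str, dict] = {}

    def put_page(self, page_id: str, page: dict) -> None:
        # Copy so later changes by the caller do not alter stored state.
        self._pages[page_id] = dict(page)

    def get_page(self, page_id: str) -> Optional[dict]:
        page = self._pages.get(page_id)
        return dict(page) if page is not None else None

    def pages_with_status(self, status: str) -> List[dict]:
        return [dict(p) for p in self._pages.values()
                if p.get("status") == status]


# Registry mapping a config value to a backend class (assumed names).
BACKENDS = {"memory": InMemoryBackend}


def make_backend(name: str) -> CrawlStateBackend:
    """Pick a backend by the name given in a config file."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown backend: {name!r}") from None
```

A rethinkdb or redis implementation would subclass `CrawlStateBackend` and register itself in `BACKENDS`; the calling code would only ever see the abstract interface.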

traverseda · 2019-08-23T18:35:30Z

I mean honestly rethinkdb used to be very easy to deploy, and was one DB that could handle all those different services nicely. It's a shame that development stalled like it did, and that no one was able to perform maintenance releases after they lost funding.

nlevitt · 2019-08-23T20:25:05Z

It is a shame that the rethinkdb company folded, and that the community hasn't really gotten the project on track at this point. But it's still a really solid piece of software.

Brozzler stores all the crawl state in rethinkdb - jobs, sites, and pages, finished, in progress, and queued. We chose rethinkdb primarily because it is truly distributed (implements raft consensus, thus has no single point of failure) and because it supports secondary indexes (unlike a key-value store). It's also very easy to deploy and cluster. Even now, I think it was a pretty good choice.
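To illustrate why secondary indexes matter for this kind of crawl state, here is a small sketch using SQLite from the Python standard library as a stand-in (it is not rethinkdb, and it is not distributed). The table layout, column names, and the `queued_pages` helper are assumptions for illustration, not brozzler's actual schema.

```python
import sqlite3

# In-memory database standing in for the crawl-state store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages ("
    "id TEXT PRIMARY KEY, site_id TEXT, status TEXT, url TEXT)"
)
# Secondary index: lets the store find pages by (site, status) without
# scanning every record, which a plain key-value store cannot do.
conn.execute("CREATE INDEX pages_by_site_status ON pages (site_id, status)")

conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?, ?)",
    [
        ("p1", "site-a", "queued", "http://a.example/1"),
        ("p2", "site-a", "finished", "http://a.example/2"),
        ("p3", "site-a", "queued", "http://a.example/3"),
        ("p4", "site-b", "queued", "http://b.example/1"),
    ],
)


def queued_pages(conn: sqlite3.Connection, site_id: str) -> list:
    """Return URLs of queued pages for one site, sorted by URL."""
    cur = conn.execute(
        "SELECT url FROM pages WHERE site_id = ? AND status = ? ORDER BY url",
        (site_id, "queued"),
    )
    return [row[0] for row in cur]
```

With only a primary key (the key-value case), answering "which pages of this site are still queued?" would mean reading every page record or maintaining extra lookup keys by hand.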

The parts of brozzler that interact with rethinkdb are mostly in frontier.py. We didn't really design the database backend to be modular. But the code is simple enough that replacing rethinkdb should be pretty doable.

We're certainly open to pull requests adding support for another database backend, but have no plans to do this work ourselves.

Since we're not planning to address the issue as written ("don't depend on rethinkdb"), I'm gonna close it. Feel free to continue discussing the topic here though.

gravypod · 2019-08-24T02:01:21Z

@nlevitt is there an IRC/Slack that the maintainers of this code use? I'd like to get a higher-level feel for how this, warcproxy, and some of the other tooling in this group fit together.

nlevitt · 2019-08-26T18:50:58Z

There's a channel on iipc.slack.com but unfortunately that's not completely open to the public. I just created a channel #brozzler on freenode and I'll hang out there.
