Home
Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations. So, while a word-counting aggregation in pure Scala might look like this:
def wordCount(source: Iterable[String], store: MutableMap[String, Long]) =
  source.flatMap { sentence =>
    toWords(sentence).map(_ -> 1L)
  }.foreach { case (k, v) => store.update(k, store.getOrElse(k, 0L) + v) }
Counting words in Summingbird looks like this:
def wordCount[P <: Platform[P]]
  (source: Producer[P, String], store: P#Store[String, Long]) =
    source.flatMap { sentence =>
      toWords(sentence).map(_ -> 1L)
    }.sumByKey(store)
The logic is exactly the same, and the code is almost the same. The main difference is that you can execute the Summingbird program in “batch mode” (using Scalding), in “realtime mode” (using Storm), or on both Scalding and Storm in a hybrid batch/realtime mode that offers your application very attractive fault-tolerance properties.
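One way to see why the same job can run in batch, in realtime, or in both at once: counts are combined by addition, which is associative, so partial results computed separately can be merged into the same totals. The following is a plain-Scala sketch of that idea, not Summingbird code. `HybridCountSketch`, `count`, `merge`, and this `toWords` tokenizer are hypothetical names introduced only for illustration.

```scala
object HybridCountSketch {
  // Hypothetical tokenizer standing in for the `toWords` helper used above.
  def toWords(sentence: String): Seq[String] =
    sentence.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Count the words in a collection of sentences.
  def count(sentences: Seq[String]): Map[String, Long] =
    sentences
      .flatMap(toWords)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size.toLong }

  // Merge two partial results by adding the counts of matching keys.
  def merge(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }
}
```

Under these assumptions, `merge(count(batchSentences), count(realtimeSentences))` gives the same map as counting all of the sentences in one pass, regardless of how the input is split between the two.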
Summingbird provides you with the primitives you need to build rock solid production systems.
- History and Motivation
- The Producer API
- Batch and Realtime
- The (deprecated) Builder API
- Frequently Asked Questions
- LambdaJam 2013 Summingbird Workshop
- Summingbird: Streaming MapReduce at Twitter (Sam’s talk from the AK Data Science Summit)
The Summingbird project has spawned a number of related subprojects, notably:
Algebird is an abstract algebra library for Scala. Many of the data structures included in Algebird have Monoid implementations, making them ideal to use as values in Summingbird aggregations.
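To make the role of a Monoid concrete, here is a minimal sketch of a key-wise aggregation driven by a monoid instance. The `Monoid` trait below is a hand-rolled stand-in written for this example, not Algebird's own class, and `MonoidSketch`, `longMonoid`, `setMonoid`, and `sumByKey` are illustrative names rather than library APIs.

```scala
object MonoidSketch {
  // Hand-rolled stand-in for a monoid type class (an assumption for this
  // sketch, not the Algebird library itself): an identity element plus an
  // associative combine operation.
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  // Longs combine by addition, with 0 as the identity.
  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    val zero: Long = 0L
    def plus(a: Long, b: Long): Long = a + b
  }

  // Sets combine by union, with the empty set as the identity.
  implicit def setMonoid[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    def zero: Set[A] = Set.empty[A]
    def plus(a: Set[A], b: Set[A]): Set[A] = a ++ b
  }

  // Aggregate values per key using whichever monoid is in scope for V.
  def sumByKey[K, V](pairs: Iterable[(K, V)])(implicit m: Monoid[V]): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, m.plus(acc.getOrElse(k, m.zero), v))
    }
}
```

The aggregation logic never mentions a concrete value type, so swapping counts for sets (or any other structure with a monoid instance) requires no change to `sumByKey`.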
Summingbird uses the Bijection project’s Injection typeclass to share serialization between different execution platforms and clients.
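As a rough illustration of the idea, an injection converts a value into another representation in a way that can always be undone, while the reverse direction may fail on inputs that were never produced by the forward direction. The trait below is a simplified stand-in written for this sketch and should not be read as the Bijection library's actual definition; `InjectionSketch` and `longToString` are hypothetical names.

```scala
import scala.util.Try

object InjectionSketch {
  // Simplified stand-in for an injection (an assumption for this sketch):
  // `apply` always succeeds, `invert` may fail on arbitrary input.
  trait Injection[A, B] {
    def apply(a: A): B
    def invert(b: B): Try[A]
  }

  // Every Long has a String form, but not every String parses as a Long.
  val longToString: Injection[Long, String] = new Injection[Long, String] {
    def apply(a: Long): String = a.toString
    def invert(b: String): Try[Long] = Try(b.toLong)
  }
}
```

Defining the conversion once and passing it to every platform and client is what keeps the serialized form consistent between the code that writes a value and the code that reads it back.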
Summingbird’s Storm and Scalding platforms both use the Kryo library for serialization. Chill augments Kryo with a number of helpful configuration options, and provides modules for use with Storm, Scala, and Hadoop. Chill is also used by the Berkeley AMPLab’s Spark project.
Tormenta provides a type-safe layer over Storm’s Scheme and Spout interfaces.
Summingbird’s client is implemented using Storehaus’s async key-value store traits. The Storm platform makes use of Storehaus’s MergeableStore trait to perform real-time aggregations into a number of commonly used backing stores, including Memcached and Redis.
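The merge-into-store pattern can be sketched with a small in-memory class. This is a synchronous, hypothetical stand-in meant only to show the shape of the operation. It is not Storehaus's trait, which is asynchronous, and `MergeableStoreSketch`, `InMemoryMergeableStore`, and the `combine` parameter are names invented for this example.

```scala
import scala.collection.mutable

object MergeableStoreSketch {
  // Hypothetical, synchronous stand-in for a mergeable key-value store.
  // `combine` is assumed to be associative (for example, addition).
  class InMemoryMergeableStore[K, V](combine: (V, V) => V) {
    private val data = mutable.Map.empty[K, V]

    // Look up the current aggregate for a key, if any.
    def get(k: K): Option[V] = data.get(k)

    // Combine `v` into the value stored at `k` and return the value
    // that was present before the merge.
    def merge(k: K, v: V): Option[V] = {
      val before = data.get(k)
      data.update(k, before.map(combine(_, v)).getOrElse(v))
      before
    }
  }
}
```

For example, with `new InMemoryMergeableStore[String, Long](_ + _)`, merging `1L` and then `2L` under the same key leaves `3L` in the store. The caller sends only the increment and the store owns the read-modify-write step, which is what lets a realtime job aggregate into an external backing store.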