Home

Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations. So, while a word-counting aggregation in pure Scala might look like this:

def wordCount(source: Iterable[String], store: MutableMap[String, Long]) =
  source.flatMap { sentence =>
    toWords(sentence).map(_ -> 1L)
  }.foreach { case (k, v) => store.update(k, store.get(k) + v) }

Counting words in Summingbird looks like this:

def wordCount[P <: Platform[P]]
  (source: Producer[P, String], store: P#Store[String, Long]) =
    source.flatMap { sentence =>
      toWords(sentence).map(_ -> 1L)
    }.sumByKey(store)

The logic is exactly the same, and the code is almost the same. The main difference is that you can execute the Summingbird program in “batch mode” (using Scalding), in “realtime mode” (using Storm), or on both Scalding and Storm in a hybrid batch/realtime mode that offers your application very attractive fault-tolerance properties.

Summingbird provides you with the primitives you need to build rock solid production systems.

Pages

Around the Web

LambdaJam 2013 Summingbird Workshop
Summingbird: StreamingMapReduce at Twitter (Sam’s talk from the AK Data Science Summit)

Future Plans

We’re very excited about Summingbird’s development as we move beyond our initial release. Some of our future plans include:

Support for more execution platforms (Spark and Akka seem like clear candidates)
Pluggable optimizations for the Producer graph layer
- Projection and filter pushdown
Support for filter-aware data sources, like Parquet
Libraries of higher-level mathematics and machine learning code on top of Summingbird’s Producer primitives
More extensions to Summingbird’s related projects (listed below)
- More data structures with Monoid instances via Algebird
- More key-value stores implementations via Storehaus
- More Storm data sources, via Tormenta
More tutorials and examples with public data sources

More information on these issues and many more can be found on the Summingbird issue tracker.

Getting Involved

Discussion about development and new features occurs primarily on the Summingbird mailing list. To join the mailing list, email [summingbird@librelist.com](summingbird@librelist.com). The same address is used for posting once you’ve joined.

Issues should be reported on the GitHub issue tracker. Simpler issues appropriate for first-time contributors looking to help out are tagged “newbie”. To see which issues are slated for a planned release, check out the milestones page.

Follow @summingbird on Twitter for updates.

Related Projects

The Summingbird projects spawned a number of related subprojects, notably:

Algebird

Algebird is an abstract algebra library for Scala. Many of the data structures included in Algebird have Monoid implementations, making them ideal to use as values in Summingbird aggregations.

Bijection

Summingbird uses the Bijection project’s Injection typeclass to share serialization between different execution platforms and clients.

Chill

Summingbird’s Storm and Scalding platforms both use the Kryo library for serialization. Chill augments Kryo with a number of helpful configuration options, and provides modules for use with Storm, Scala, Hadoop. Chill is also used by the Berkeley Amp Lab’s Spark project.

Tormenta

Tormenta provides a type-safe layer over Storm’s Scheme and Spout interfaces.

Storehaus

Summingbird’s client is implemented using Storehaus’s async key-value store traits. The Storm platform makes use of Storehaus’s MergeableStore trait to perform real-time aggregations into a number of commonly used backing stores, including Memcached and Redis.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Home

Pages

Around the Web

Future Plans

Getting Involved

Related Projects

Documentation TODO:

Getting Help

Background

Documentation

Implementation Details

Clone this wiki locally