-
Notifications
You must be signed in to change notification settings - Fork 266
Core concepts
This page introduces the core concepts of Summingbird. Summingbird jobs produce two main types of data: streams and snapshots. A stream is a full history of data, whereas a snapshot captures system state at a specific time. Below it will help to ask: is this data a stream or a snapshot?
The Producer
is Summingbird’s data stream abstraction. Every job begins by creating an initial Producer[P, T]
node out of a P#Source[T]
by calling the Producer.source
method:
def source[P <: Platform[P], T: Manifest](s: P#Source[T]): Producer[P, T]
Once you’ve created a Producer
, all operations in the Producer API are fair game. After building up your desired MapReduce workflow, you’ll need to hand your Producer
to an instance of Platform
to compile the MapReduce workflow down to that particular Platform
’s notion of a “plan”. More on the “plan” later.
A Summingbird Platform
instance can be implemented for any streaming MapReduce library that knows how to make sense of the operations one can perform on a Producer
. The Summingbird repository contains Platform
implementations for Storm, Scalding and in-memory. Here’s the Platform
trait:
trait Platform[P <: Platform[P]] {
type Source[_]
type Store[_, _]
type Sink[_]
type Service[_, _]
type Plan[_]
def plan[T](completed: Producer[P, T]): Plan[T]
}
Each of the type variables locks down one of Summingbird’s core concepts for a particular execution platform. As you build up a graph of operations on your Producer
, you’ll pull in various instances of these types. Let’s discuss each type variable in turn.
A Source
represents a stream of data. Each execution platform has its own notion of a data source. For example, the Memory platform defines a Source[T]
to be any TraversableOnce[T]
:
type Source[T] = TraversableOnce[T]
As a result, any Scala sequence is fair game:
import com.twitter.summingbird._
import com.twitter.summingbird.memory._
val producer: Producer[Memory, Int] = Producer.source(Seq(1,2,3))
Storm’s sources are often backed by realtime distributed queues, so the Storm platform’s Source
type is a bit different:
type Source[+T] = com.twitter.tormenta.spout.Spout[(Long, T)]
This source is a Tormenta Spout tuned to produce instances of T
along with an associated timestamp. While different, the Memory platform’s TraversableOnce[T]
and the Storm platform’s Spout[(Long, T)]
are, logically, generators of data with sensical implementations of the methods in the Producer API.
The Store
is where the “reduce” of Summingbird’s streaming MapReduce comes into play. A Store[K, V]
contains a snapshot of the aggregated value for each of its keys. When a Producer
instance contains (K, V)
pairs, calling producer.sumByKey(store)
will group all V
(by K
), sum all V
’s using an instance of Monoid[V]
and push a snapshot representation of all resulting (key, summed-value) pairs together into the supplied Store[K, V]
.
Because Summingbird uses a Monoid[V]
, updates into a Store
are always associative:
newStore = assoc(oldStore, value)
where assoc(assoc(a, b), c) == assoc(a, assoc(b, c))
. The associative property is exploited by the Storm and Scalding execution platforms for parallelism and fault tolerance. Associativity is also helpful when merging results from separate batch and realtime computations; see batch and realtime for more information.
Unlike the Store
, the Sink
allows you to materialize an un-aggregated “stream” representation of the Producer
’s values. A Sink
is a stream, not a snapshot. In Storm
and Memory
, a Sink
is just a function call. In Storm
that that function call might populate a log stream, or another realtime queue that a further Summingbird topology could pull in as a Source
.
In the word count example displayed on Summingbird’s README, inserting a call to write
and sinking before the call to sumByKey
would send a stream of (word, 1L)
pairs into the sink:
source.flatMap { sentence => toWords(sentence).map(_ -> 1L) }
.write(sink)
.sumByKey(store)
Calling producer.write(sink)
after a sumByKey
will produce a stream of monotonically increasing counts for each word, effectively a scanLeft
output of the updates to every key in the stream.
A Service
allows the user to perform a “lookup join”, or leftJoin
, against the current values within a Producer
’s stream.
The joined values can come from another Store
’s snapshot, another Sink
’s stream, or even some other asynchronous function call that requires non-negligible time to compute.
A Service
can also provide access to data from a Store
that is being materialized earlier in the job, or in a dependent job. Here at Twitter, one might imagine joining a Service[TCOShortenedURL, ExpandedURL]
against a Producer[(TCOShortenedURL, V)]
to create a stream of (TCOShortenedURL, (V, Option[ExpandedURL]))
.
On the Memory or Storm platforms, a Service is a key-value store against which the stream performs lookups. Before a service join, a producer must have type Producer[P, (K, V)]
. A join against a P#Service[K, RightV]
will return a Producer[P, (K, (V, Option[RightV]))]
.
The Scalding platform allows you to join against the stream output by another Summingbird job’s sink. The join is implemented in a way that prevents any key on the left side of the join from seeing the right side’s value at a future point in time. Put another way, the Scalding platform’s service join is non-clairvoyant.
The plan is the final representation of the MapReduce flow produced by a Platform
after a call to platform.plan(producer)
. For Storm, the plan is a StormTopology
instance that the user can execute using Storm’s supplied methods. For the Memory
platform, the plan is an in-memory Stream[T]
containing the output of the supplied producer. Once you have a plan, you can jump back out to the APIs backing the specific Platform
for job customization and submission.