Skip to content
Tom Switzer edited this page Aug 26, 2014 · 47 revisions

Framian has a few fundamental, internal structures that are used to build up the 2 main data structures: Series and Frame.

Index

An Index[Key] provides fast access to an ordered sequence of (Key, Int) pairs. We call the Key value the key and the Int value the row, so an Index is a set of key-row pairs. The job of the index is just to provide fast, key-related operations for Frame and Series (lookups, joins, grouping, co-grouping, etc.) on top of this sequence. An Index does not need to be in sorted order; it will maintain the original traversal order of the pairs. It is neither surjective or injective mapping; the keys can be duplicated and multiple keys can point to the same "row" (the Int).

Basics

To get a feel for this, let's create an Index.

scala> val index = Index("b" -> 0, "c" -> 2, "a" -> 0, "b" -> 1)
index: Index[String] = UnorderedIndex((b,0), (c,2), (a,0), (b,1))

Now we can traverse this Index, in the original order, using foreach.

scala> index.foreach((k, v) => println(s"$k -> $v"))
b -> 0
c -> 2
a -> 0
b -> 1

Any Scala collection method, such as foreach, map, flatMap, filter, iterator, etc, will use this order. While the Index does sort the key-pairs internally, this is purely an implementation detail of the index - it gives us fast access to the rows by key via binary search.

scala> index.get("a")
res0: Option[Int] = Some(0)

scala> index.get("c")
res1: Option[Int] = Some(2)

scala> index.get("d")
res2: Option[Int] = None

In those examples, each key was mapped to at most 1 row. However, as noted above, a key may be mapped to more than 1 row. In this case, get(key) will return the first result, in traversal order. Index provides the more general getAll(key) method that will return all the key-row mappings for key as another Index.

scala> index.get("b") // There is more than 1 answer, so we get the first!
res3: Option[Int] = Some(0)

scala> index.getAll("b")
res4: Index[String] = OrderedIndex((b,0), (b,1))

The astute reader may have noticed that there are 2 types of Index returned by our methods: OrderedIndex and UnorderedIndex. An OrderedIndex guarantees that its traversal order is also its sorted order. An UnorderedIndex does not make this guarantee. Framian attempts to use OrderedIndex where possible, since it is somewhat cheaper to maintain (smaller memory footprint and less indirection). For instance, the Index#sorted method returns an OrderedIndex.

scala> index.sorted
res5: OrderedIndex[String] = OrderedIndex((a,0),(b,0),(b,1),(c,2))

However, other than sorted, there is no way to explicitly construct an OrderedIndex outside of Framian. Framian's Index constructors will choose the appropriate sub-type automatically, based on your data.

scala> Index("a" -> 0, "b" -> 1, "c" -> 2)
res0: Index[String] = OrderedIndex((a,0), (b,1), (c,2))

scala> Index("b" -> 1, "c" -> 2, "a" -> 0)
res1: framian.Index[String] = UnorderedIndex((b,1), (c,2), (a,0))

Grouping

One of the primary purposes of an Index is to provide fast grouping capabilities for both Frame and Series. This is accomplished by creating a custom Index.Grouper[Key] and using Index.group(index)(grouper). The Index.group method will call the group method in Grouper once for each unique row key, feeding the State output from the previous call into the next call. The result of Index.group is the final State returned from the Grouper.

An example of an implementation of Index.Grouper is Reduction, which will reduce each group using the values from the column and a reducer, and return the results as a map from group key to reduced value. This is then used as the basis for all of Series *ByKey reduction methods, such as meanByKey and sumByKey.

Another example is Frame's mapRowGroups method, which let's you map each group (set of rows with the same key) in a frame to another frame.

Cogrouping

Another primary purpose of Index is to provide fast cogrouping. Much like Grouper, Index.Cogrouper groups together identical keys, but it does this for 2 Indexes at the same time. For each unique key in the union of keys of 2 indices, Cogrouper will call cogroup exactly once. For each cogroup, it is passed the matching rows on each side, with the State output from the previous call to cogroup passed as the input to this one. Either group of rows in a cogroup call may be empty, but at least one side will be non-empty.

Framian's Joiner (table joins) and Merger (table merges) are both implemented as Cogroupers. These 2 instances are the basis behind Frame's joins and merges, as well as Series's merge and zipMap.

Clone this wiki locally