Skip to content
Tom Switzer edited this page Aug 26, 2014 · 47 revisions

This page focuses on the internal details of Framian and is intended for the curious and those who want to start contribute to Framian itself.

Data Structures

Framian has a few fundamental, internal structures that are used to build up the 2 main data structures: Series and Frame. The main data structures are

Index

An Index[Key] provides fast access to an ordered sequence of (Key, Int) pairs. We call the Key value the key and the Int value the row, so an Index is a set of key-row pairs. The job of the index is just to provide fast, key-related operations for Frame and Series (lookups, joins, grouping, co-grouping, etc.) on top of this sequence. An Index does not need to be in sorted order; it will maintain the original traversal order of the pairs. It is neither surjective or injective mapping; the keys can be duplicated and multiple keys can point to the same "row" (the Int).

Lookups

To get a feel for this, let's create an Index.

scala> val index = Index("b" -> 0, "c" -> 2, "a" -> 0, "b" -> 1)
index: Index[String] = UnorderedIndex((b,0), (c,2), (a,0), (b,1))

Now we can traverse this Index, in the original order, using foreach.

scala> index.foreach((k, v) => println(s"$k -> $v"))
b -> 0
c -> 2
a -> 0
b -> 1

Any Scala collection method, such as foreach, map, flatMap, filter, iterator, etc, will use this order. While the Index does sort the key-pairs internally, this is purely an implementation detail of the index - it gives us fast access to the rows by key via binary search.

scala> index.get("a")
res0: Option[Int] = Some(0)

scala> index.get("c")
res1: Option[Int] = Some(2)

scala> index.get("d")
res2: Option[Int] = None

In those examples, each key was mapped to at most 1 row. However, as noted above, a key may be mapped to more than 1 row. In this case, get(key) will return the first result, in traversal order.

scala> index.get("b") // There is more than 1 answer, so we get the first!
res3: Option[Int] = Some(0)

Index provides the more general getAll(key) method that will return all the key-row mappings for key as another Index. This is useful when you want all the mappings and not just the first.

scala> index.getAll("b")
res4: Index[String] = OrderedIndex((b,0), (b,1))

OrderedIndex vs UnorderedIndex

The astute reader may have noticed that there are 2 types of Index returned by our methods: OrderedIndex and UnorderedIndex. An OrderedIndex guarantees that its traversal order is also its sorted order. An UnorderedIndex does not make this guarantee. Framian attempts to use OrderedIndex where possible, since it is somewhat cheaper to maintain (smaller memory footprint and less indirection). For instance, the Index#sorted method returns an OrderedIndex.

scala> index.sorted
res5: OrderedIndex[String] = OrderedIndex((a,0),(b,0),(b,1),(c,2))

However, other than sorted, there is no way to explicitly construct an OrderedIndex outside of Framian. Framian's Index constructors will choose the appropriate sub-type automatically, based on your data.

scala> Index("a" -> 0, "b" -> 1, "c" -> 2)
res0: Index[String] = OrderedIndex((a,0), (b,1), (c,2))

scala> Index("b" -> 1, "c" -> 2, "a" -> 0)
res1: framian.Index[String] = UnorderedIndex((b,1), (c,2), (a,0))

Grouping

One of the primary purposes of an Index is to provide fast grouping capabilities for both Frame and Series. This is accomplished by creating a custom Index.Grouper[Key] and using Index.group(index)(grouper). The Index.group method will call the group method in Grouper once for each unique row key, feeding the State output from the previous call into the next call. The result of Index.group is the final State returned from the Grouper.

An example of an implementation of Index.Grouper is Reduction, which will reduce each group using the values from the column and a reducer, and return the results as a map from group key to reduced value. This is then used as the basis for all of Series *ByKey reduction methods, such as meanByKey and sumByKey.

Another example is Frame's mapRowGroups method, which let's you map each group (set of rows with the same key) in a frame to another frame.

Cogrouping

Another primary purpose of Index is to provide fast cogrouping. Much like Grouper, Index.Cogrouper groups together identical keys, but it does this for 2 Indexes at the same time. For each unique key in the union of keys of 2 indices, Cogrouper will call cogroup exactly once. For each cogroup, it is passed the matching rows on each side, with the State output from the previous call to cogroup passed as the input to this one. Either group of rows in a cogroup call may be empty, but at least one side will be non-empty.

Framian's Joiner (table joins) and Merger (table merges) are both implemented as Cogroupers. These 2 instances are the basis behind Frame's joins and merges, as well as Series's merge and zipMap.

Column

The Column[A] is the next major component in Framian. It provides an "infinite" mapping from rows (eg. Int) to values* of type A. Essentially, it is a wrapper around the total function Int => Cell[A] - you can even construct a Column from such a function and Column provides an apply method so you can treat it like a function.

You may notice that if you have an Index[K], which provides a finite multimap from K to Int, and a Column[V], which provides an "infinite" map from Int to Cell[V], then you end up with a finite multimap from K to Cell[V]. This is intentional. The index/column pairing is the essential structure in Framian. We've even given it a name; a Series[K, V] is simply just a wrapper around a pair of (Index[K], Column[V]). In almost all cases in Framian, a Column is meant to be paired with an Index in some way or another.

Columns are Infinite

Well, sort of - they are defined for all rows from Int.MinValue to Int.MaxValue. This let's us be quite flexible when manipulating and "reindexing" columns (a common operation used in joins, merges, etc). It also means we can define things like,

scala> val one = Column.const(1)
one: framian.Column[Int] = Column(Value(1), Value(1), Value(1), Value(1), Value(1), ...)

This has some nice mathematical properties (makes it easier to define certain algebras on columns). There are plans to add better support for "finite" columns (columns where we can bound the range of non-NA values). This support would be mostly aimed at optimizing columns and allowing us to force columns (see the section on laziness).

Unboxing Columns

Columns aim to support boxless access, so, while I stated earlier that a Column[A] was just Int => Cell[A], that isn't the whole story. The problem with Int => Cell[A] is that it means every access (eg apply) force the value into a Cell.

scala> val col = Column.fromMap(Map(0 -> "x", 3 -> "y"))
col: framian.Column[String] = Column(Value(x), NA, NA, Value(y), NA, ...)

scala> col(0)
res0: framian.Cell[String] = Value(x)

scala> col(1)
res1: framian.Cell[String] = NA

In this case, we wrapped a Map, so perhaps we don't care much about boxing overhead. But Framian provides Column constructors whose values aren't boxed. For instance, Column.fromArray provides a way to construct a column that wraps a dense array of values.

scala> Column.fromArray(Array(1,2,3,4))
res2: framian.Column[Int] = Column(Value(1), Value(2), Value(3), Value(4), NA, ...)

Another way we can construct a Column containing unboxed values is with a ColumnBuilder. This will store values in an array and use bit sets to store flags indicating if a row is NA or NM.

scala> :paste
// Entering paste mode (ctrl-D to finish)

(Column.builder[Int]
  += Value(1)
  += Value(2)
  += NA
  += NM
  += Value(5)
).result()

// Exiting paste mode, now interpreting.

res3: framian.Column[Int] = Column(Value(1), Value(2), NA, NM, Value(5), ...)

We wish to be able to access these values without having to explicitly box them in a Cell. This desire is why we end up with Column's interface:

trait Column[@spec(Int,Long,Float,Double) +A] extends ColumnLike[Column[A]] {
  // Returns true if apply(row) returns a Value.
  def isValueAt(row: Int): Boolean

  // Returns the NonValue returned by apply(row)
  def nonValueAt(row: Int): NonValue

  // Returns the Value returned by apply(row)
  def valueAt(row: Int): A

  def foldRow[B](row: Int)(f: A => B, g: NonValue => B): B =
    if (isValueAt(row)) f(valueAt(row)) else g(nonValueAt(row))

  def apply(row: Int): Cell[A] =
    foldRow(row)(Value(_), m => m)

  ...
}

The API gives us 2 ways to access unboxed values in a Column. First, we can use check if a row is a Value by using isValueAt(row), and then use the appropriate accessor inside an if statement.

val col: Column[A] = ...

if (col.isValueAt(row)) {
  val value: A = col.valueAt(row)
  println(value.toString)
} else {
  val nonValue: NonValue = col.nonValueAt(row) // NA | NM
  nonValue match {
    case NA => "N/A"
    case NM => "N/M"
  }
}

The other way is to use foldRow,

val asString: A => String = _.toString
val printNonValue: NonValue => String = {
 case NA => "N/A"
 case NM => "N/M"
}

println(col.foldRow(asString, printNonValue))

Columns are (too) Lazy

The other thing is that columns, currently, are very lazy. All operations are deferred and evaluated on each value every time you access them. So, col.map(n => expensiveOperation(n)) means that we may end up executing expensiveOperation more times than we'd like. This is not ideal and this should be addressed in the near future.

UntypedColumn

An UntypedColumn is a column that has lost its value type information. This means we also lose all operations that work with the values (map, zipMap, filter, etc.). However, we can still manipulate the column index with methods like shift, reindex, mask, etc. The main method on UntypedColumn is really cast[A] though. It provides a way to extract a typed Column from the UntypedColumn.

trait UntypedColumn extends ColumnLike[UntypedColumn] {
  def cast[A: ColumnTyper]: Column[A]
}

As you can see, cast method depends on an implicit ColumnTyper being available. The ColumnTyper actually does the magic to convert an UntypedColumn back into a typed Column. If a conversion doesn't make sense (eg converting Strings to Ints), then the Column[Int] returned by cast[Int] will contain NM - because the conversion was Not Meaningful.

val col: UntypedColumn = TypedColumn(Column.fromArray(Array("a", "b", "c")))
col: UntypedColumn = TypedColumn(Column(Value(a), Value(b), Value(c), NA, NA, ...))

scala> col.cast[Int]
res0: framian.Column[Int] = Column(NM, NM, NM, NA, NA, ...)

ColumnTyper

Currently, a ColumnTyper really only has ClassTag information available for the type of things stored in the underlying TypedColumns (there may be more than 1) that back most UntypedColumns. So, generally we resort to simple isAssignableFrom-style checks. This is a bit janky and there are plans to change this to something more principled. One exception to the simple check is numeric types.

There are special ColumnTypers defined for most Scala/Spire number types that allows safe conversions between all these types. This means that we can convert between, eg, Int, BigDecimal, Rational, Real, Double, etc. Here is an example of using these conversions:

scala> import spire.math._
import spire.math._

scala> val col = TypedColumn(Column.fromArray(Array(0.5, 1, -2.8)))
col: framian.TypedColumn[Double] = TypedColumn(Column(Value(0.5), Value(1.0), Value(-2.8), NA, NA, ...))

scala> col.cast[Int] // 0.5 and -2.8 can't be "safely" converted to Int
res0: framian.Column[Int] = Column(NM, Value(1), NM, NA, NA, ...)

scala> col.cast[Double]
res1: framian.Column[Double] = Column(Value(0.5), Value(1.0), Value(-2.8), NA, NA, ...)

scala> col.cast[Rational]
res2: framian.Column[spire.math.Rational] = Column(Value(1/2), Value(1), Value(-3152519739159347/1125899906842624), NA, NA, ...)

scala> col.cast[BigDecimal]
res3: framian.Column[BigDecimal] = Column(Value(0.5), Value(1.0), Value(-2.8), NA, NA, ...)