-
Notifications
You must be signed in to change notification settings - Fork 29
Internals
This page focuses on the internal details of Framian and is intended for the curious and those who want to start contribute to Framian itself.
Framian has a few fundamental, internal structures that are used to build up the 2 main data structures: Series and Frame. The main data structures are
-
Index[K]
, -
Column[V]
, and -
UntypedColumn
.
An Index[Key]
provides fast access to an ordered sequence of (Key, Int)
pairs. We call the Key
value the key and the Int
value the row, so an Index
is a set of key-row pairs. The job of the index is just to provide fast, key-related operations for Frame
and Series
(lookups, joins, grouping, co-grouping, etc.) on top of this sequence. An Index
does not need to be in sorted order; it will maintain the original traversal order of the pairs. It is neither surjective or injective mapping; the keys can be duplicated and multiple keys can point to the same "row" (the Int
).
To get a feel for this, let's create an Index
.
scala> val index = Index("b" -> 0, "c" -> 2, "a" -> 0, "b" -> 1)
index: Index[String] = UnorderedIndex((b,0), (c,2), (a,0), (b,1))
Now we can traverse this Index
, in the original order, using foreach.
scala> index.foreach((k, v) => println(s"$k -> $v"))
b -> 0
c -> 2
a -> 0
b -> 1
Any Scala collection method, such as foreach
, map
, flatMap
, filter
, iterator
, etc, will use this order. While the Index
does sort the key-pairs internally, this is purely an implementation detail of the index - it gives us fast access to the rows by key via binary search.
scala> index.get("a")
res0: Option[Int] = Some(0)
scala> index.get("c")
res1: Option[Int] = Some(2)
scala> index.get("d")
res2: Option[Int] = None
In those examples, each key was mapped to at most 1 row. However, as noted above, a key may be mapped to more than 1 row. In this case, get(key)
will return the first result, in traversal order.
scala> index.get("b") // There is more than 1 answer, so we get the first!
res3: Option[Int] = Some(0)
Index
provides the more general getAll(key)
method that will return all the key-row mappings for key
as another Index
. This is useful when you want all the mappings and not just the first.
scala> index.getAll("b")
res4: Index[String] = OrderedIndex((b,0), (b,1))
The astute reader may have noticed that there are 2 types of Index
returned by our methods: OrderedIndex
and UnorderedIndex
. An OrderedIndex
guarantees that its traversal order is also its sorted order. An UnorderedIndex
does not make this guarantee. Framian attempts to use OrderedIndex
where possible, since it is somewhat cheaper to maintain (smaller memory footprint and less indirection). For instance, the Index#sorted
method returns an OrderedIndex
.
scala> index.sorted
res5: OrderedIndex[String] = OrderedIndex((a,0),(b,0),(b,1),(c,2))
However, other than sorted
, there is no way to explicitly construct an OrderedIndex
outside of Framian. Framian's Index constructors will choose the appropriate sub-type automatically, based on your data.
scala> Index("a" -> 0, "b" -> 1, "c" -> 2)
res0: Index[String] = OrderedIndex((a,0), (b,1), (c,2))
scala> Index("b" -> 1, "c" -> 2, "a" -> 0)
res1: framian.Index[String] = UnorderedIndex((b,1), (c,2), (a,0))
One of the primary purposes of an Index
is to provide fast grouping capabilities for both Frame
and Series
. This is accomplished by creating a custom Index.Grouper[Key]
and using Index.group(index)(grouper)
. The Index.group
method will call the group
method in Grouper
once for each unique row key, feeding the State
output from the previous call into the next call. The result of Index.group
is the final State
returned from the Grouper
.
An example of an implementation of Index.Grouper
is Reduction
, which will reduce each group using the values from the column and a reducer, and return the results as a map from group key to reduced value. This is then used as the basis for all of Series
*ByKey
reduction methods, such as meanByKey
and sumByKey
.
Another example is Frame
's mapRowGroups
method, which let's you map each group (set of rows with the same key) in a frame to another frame.
Another primary purpose of Index
is to provide fast cogrouping. Much like Grouper
, Index.Cogrouper
groups together identical keys, but it does this for 2 Index
es at the same time. For each unique key in the union of keys of 2 indices, Cogrouper
will call cogroup
exactly once. For each cogroup
, it is passed the matching rows on each side, with the State
output from the previous call to cogroup
passed as the input to this one. Either group of rows in a cogroup
call may be empty, but at least one side will be non-empty.
Framian's Joiner
(table joins) and Merger
(table merges) are both implemented as Cogrouper
s. These 2 instances are the basis behind Frame
's join
s and merge
s, as well as Series
's merge
and zipMap
.
The Column[A]
is the next major component in Framian. It provides an "infinite" mapping from rows (eg. Int
) to values* of type A
. Essentially, it is a wrapper around the total function Int => Cell[A]
- you can even construct a Column
from such a function and Column
provides an apply
method so you can treat it like a function.
You may notice that if you have an Index[K]
, which provides a finite multimap from K
to Int
, and a Column[V]
, which provides an "infinite" map from Int
to Cell[V]
, then you end up with a finite multimap from K
to Cell[V]
. This is intentional. The index/column pairing is the essential structure in Framian. We've even given it a name; a Series[K, V]
is simply just a wrapper around a pair of (Index[K], Column[V])
. In almost all cases in Framian, a Column
is meant to be paired with an Index
in some way or another.
Well, sort of - they are defined for all rows from Int.MinValue
to Int.MaxValue
. This let's us be quite flexible when manipulating and "reindexing" columns (a common operation used in joins, merges, etc). It also means we can define things like,
scala> val one = Column.const(1)
one: framian.Column[Int] = Column(Value(1), Value(1), Value(1), Value(1), Value(1), ...)
This has some nice mathematical properties (makes it easier to define certain algebras on columns). There are plans to add better support for "finite" columns (columns where we can bound the range of non-NA
values). This support would be mostly aimed at optimizing columns and allowing us to force columns (see the section on laziness).
Columns aim to support boxless access, so, while I stated earlier that a Column[A]
was just Int => Cell[A]
, that isn't the whole story. The problem with Int => Cell[A]
is that it means every access (eg apply) force the value into a Cell
.
scala> val col = Column.fromMap(Map(0 -> "x", 3 -> "y"))
col: framian.Column[String] = Column(Value(x), NA, NA, Value(y), NA, ...)
scala> col(0)
res0: framian.Cell[String] = Value(x)
scala> col(1)
res1: framian.Cell[String] = NA
In this case, we wrapped a Map
, so perhaps we don't care much about boxing overhead. But Framian provides Column
constructors whose values aren't boxed. For instance, Column.fromArray
provides a way to construct a column that wraps a dense array of values.
scala> Column.fromArray(Array(1,2,3,4))
res2: framian.Column[Int] = Column(Value(1), Value(2), Value(3), Value(4), NA, ...)
Another way we can construct a Column
containing unboxed values is with a ColumnBuilder
. This will store values in an array and use bit sets to store flags indicating if a row is NA
or NM
.
scala> :paste
// Entering paste mode (ctrl-D to finish)
(Column.builder[Int]
+= Value(1)
+= Value(2)
+= NA
+= NM
+= Value(5)
).result()
// Exiting paste mode, now interpreting.
res3: framian.Column[Int] = Column(Value(1), Value(2), NA, NM, Value(5), ...)
We wish to be able to access these values without having to explicitly box them in a Cell
. This desire is why we end up with Column
's interface:
trait Column[@spec(Int,Long,Float,Double) +A] extends ColumnLike[Column[A]] {
// Returns true if apply(row) returns a Value.
def isValueAt(row: Int): Boolean
// Returns the NonValue returned by apply(row)
def nonValueAt(row: Int): NonValue
// Returns the Value returned by apply(row)
def valueAt(row: Int): A
def foldRow[B](row: Int)(f: A => B, g: NonValue => B): B =
if (isValueAt(row)) f(valueAt(row)) else g(nonValueAt(row))
def apply(row: Int): Cell[A] =
foldRow(row)(Value(_), m => m)
...
}
The API gives us 2 ways to access unboxed values in a Column
. First, we can use check if a row is a Value
by using isValueAt(row)
, and then use the appropriate accessor inside an if
statement.
val col: Column[A] = ...
if (col.isValueAt(row)) {
val value: A = col.valueAt(row)
println(value.toString)
} else {
val nonValue: NonValue = col.nonValueAt(row) // NA | NM
nonValue match {
case NA => "N/A"
case NM => "N/M"
}
}
The other way is to use foldRow
,
val asString: A => String = _.toString
val printNonValue: NonValue => String = {
case NA => "N/A"
case NM => "N/M"
}
println(col.foldRow(asString, printNonValue))
The other thing is that columns, currently, are very lazy. All operations are deferred and evaluated on each value every time you access them. So, col.map(n => expensiveOperation(n))
means that we may end up executing expensiveOperation
more times than we'd like. This is not ideal and this should be addressed in the near future.
An UntypedColumn
is a column that has lost its value type information. This means we also lose all operations that work with the values (map
, zipMap
, filter
, etc.). However, we can still manipulate the column index with methods like shift
, reindex
, mask
, etc. The main method on UntypedColumn
is really cast[A]
though. It provides a way to extract a typed Column
from the UntypedColumn
.
trait UntypedColumn extends ColumnLike[UntypedColumn] {
def cast[A: ColumnTyper]: Column[A]
}
As you can see, cast
method depends on an implicit ColumnTyper
being available. The ColumnTyper
actually does the magic to convert an UntypedColumn
back into a typed Column
. If a conversion doesn't make sense (eg converting String
s to Int
s), then the Column[Int]
returned by cast[Int]
will contain NM
- because the conversion was Not Meaningful.
val col: UntypedColumn = TypedColumn(Column.fromArray(Array("a", "b", "c")))
col: UntypedColumn = TypedColumn(Column(Value(a), Value(b), Value(c), NA, NA, ...))
scala> col.cast[Int]
res0: framian.Column[Int] = Column(NM, NM, NM, NA, NA, ...)
Currently, a ColumnTyper
really only has ClassTag
information available for the type of things stored in the underlying TypedColumn
s (there may be more than 1) that back most UntypedColumn
s. So, generally we resort to simple isAssignableFrom
-style checks. This is a bit janky and there are plans to change this to something more principled. One exception to the simple check is numeric types.
There are special ColumnTyper
s defined for most Scala/Spire number types that allows safe conversions between all these types. This means that we can convert between, eg, Int
, BigDecimal
, Rational
, Real
, Double
, etc. Here is an example of using these conversions:
scala> import spire.math._
import spire.math._
scala> val col = TypedColumn(Column.fromArray(Array(0.5, 1, -2.8)))
col: framian.TypedColumn[Double] = TypedColumn(Column(Value(0.5), Value(1.0), Value(-2.8), NA, NA, ...))
scala> col.cast[Int] // 0.5 and -2.8 can't be "safely" converted to Int
res0: framian.Column[Int] = Column(NM, Value(1), NM, NA, NA, ...)
scala> col.cast[Double]
res1: framian.Column[Double] = Column(Value(0.5), Value(1.0), Value(-2.8), NA, NA, ...)
scala> col.cast[Rational]
res2: framian.Column[spire.math.Rational] = Column(Value(1/2), Value(1), Value(-3152519739159347/1125899906842624), NA, NA, ...)
scala> col.cast[BigDecimal]
res3: framian.Column[BigDecimal] = Column(Value(0.5), Value(1.0), Value(-2.8), NA, NA, ...)