Framian Guide
This tutorial is best if you follow along with me, trying the examples. To get started, clone the tutorial repo, and start up a Scala REPL from SBT.
$ git clone https://github.com/tixxit/framian-tutorial.git
$ cd framian-tutorial
$ sbt
> console
[info] Starting scala interpreter...
[info]
import framian._
import framian.csv.Csv
import framian.tutorial._
import spire.implicits._
import org.joda.time.LocalDate
Welcome to Scala version 2.11.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45).
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Great! We'll just be using the REPL in this tutorial. Aside from bringing in some dependencies and imports, the tutorial project also adds some utility functions that'll let us fetch company information and share price data from Yahoo! Finance.
Among the imports we see above, there are some obvious ones, like `framian._` (this is a tutorial on Framian, after all), but less obvious is `spire.implicits._`. Spire is a Scala library that provides many useful abstractions for working with numbers (and things that sort of behave like numbers). Framian uses Spire so it can abstract its operations over the actual number type used. This means you choose the number type, whether that is `Double`, `BigDecimal`, or perhaps something more exotic like `spire.math.Rational` or `spire.math.Number`. The `implicits` import just brings in the necessary implicit instances required to use Spire's abstractions.
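To get a feel for what abstracting over the number type means, here is a minimal sketch that uses the standard library's `Numeric` type class as a stand-in for Spire. It is not Spire or Framian code, and the names `total`, `asDoubles` and `asDecimals` are made up for illustration.

```scala
// Sketch only: scala.math.Numeric stands in for Spire's richer abstractions.
import scala.math.Numeric

// Sums a sequence of numbers without committing to a concrete number type.
def total[A](xs: Seq[A])(implicit num: Numeric[A]): A =
  xs.foldLeft(num.zero)(num.plus)

// The caller picks the number type; the same function works for both.
val asDoubles: Double = total(Seq(123.0, 45.5, 1.0))
val asDecimals: BigDecimal =
  total(Seq(BigDecimal("123.00"), BigDecimal("45.67"), BigDecimal("1.00")))
```

The function body never mentions `Double` or `BigDecimal`, which is the same kind of freedom Framian gets from Spire.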
Framian was made to help us easily work with tabular, heterogeneous data. In Framian, we call one of these heterogeneous tables a frame and it is represented by the `Frame` type. So, we'll start by making some frames.
`Frame` provides a number of constructors, but we'll start with one that builds a frame from a list of case class instances. So, first, we can model a company as a case class.
scala> case class Company(name: String, exchange: String, ticker: String, currency: String, marketCap: BigDecimal)
defined class Company
This case class includes some basic information we may find useful later, such as the company's reporting currency, their exchange, ticker, and a human readable name. Let's create a few fake companies to populate a frame with.
scala> val Acme = Company("Acme", "NASDAQ", "ACME", "USD", BigDecimal("123.00"))
Acme: Company = Company(Acme,NASDAQ,ACME,USD,123.00)
scala> val BobCo = Company("Bob Company", "NASDAQ", "BOB", "USD", BigDecimal("45.67"))
BobCo: Company = Company(Bob Company,NASDAQ,BOB,USD,45.67)
scala> val Cruddy = Company("Cruddy Inc", "XETRA", "CRUD", "EUR", BigDecimal("1.00"))
Cruddy: Company = Company(Cruddy Inc,XETRA,CRUD,EUR,1.00)
And now we create the frame using `Frame.fromRows`.
scala> Frame.fromRows(Acme, BobCo, Cruddy)
res0: framian.Frame[Int,Int] =
0 . 1 . 2 . 3 . 4
0 : Acme | NASDAQ | ACME | USD | 123.00
1 : Bob Company | NASDAQ | BOB | USD | 45.67
2 : Cruddy Inc | XETRA | CRUD | EUR | 1.00
Great! The `fromRows` constructor will work with any simple case class, tuples or Shapeless `HList`s. These values are transformed into tabular form by creating a row for each value, and a column for each field in the case class or tuple.
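If you're curious how a case class can turn into a row at all, here is a rough sketch in plain Scala. It relies on the fact that every case class (and tuple) is a `Product`. Framian's own `fromRows` may well work differently, and `Listing`, `toRow`, `rows` and `tickers` are hypothetical names used only here.

```scala
// Toy illustration only; this is not how Framian is implemented.
case class Listing(name: String, exchange: String, ticker: String)

// Every case class and tuple is a Product, so its fields can be
// listed in declaration order: one row per value, one column per field.
def toRow(p: Product): Vector[Any] = p.productIterator.toVector

val rows: Vector[Vector[Any]] = Vector(
  Listing("Acme", "NASDAQ", "ACME"),
  Listing("Cruddy Inc", "XETRA", "CRUD")
).map(toRow)

// Column 2 holds the third field of every row.
val tickers: Vector[Any] = rows.map(row => row(2))
```

Note that the cells come out as `Any`: the table itself no longer knows the field types, which is why typed access matters later on.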
You'll notice that along the left side and top of our data frame, we see some numbers that label the rows and columns. These are the `Frame`'s row and column indexes. They define the primary way of selecting and manipulating rows, columns, and groups of rows or columns in a `Frame`. They also don't have to be `Int`: the type of our frame isn't just `Frame`, but `Frame[Int, Int]`. Those 2 type parameters define the type of our row and column index respectively.
Having our company frame's columns indexed by `Int`s is silly, when we can give them perfectly good names. A `Frame` has many ways of changing how we index the columns or rows. For now, let's create a custom index for them by mapping each column index to a `String`.
scala> res0.mapColKeys {
| case 0 => "Name"
| case 1 => "Exchange"
| case 2 => "Ticker"
| case 3 => "Currency"
| case 4 => "Market Cap"
| }
res1: framian.Frame[Int,String] =
Name . Exchange . Ticker . Currency . Market Cap
0 : Acme | NASDAQ | ACME | USD | 123.00
1 : Bob Company | NASDAQ | BOB | USD | 45.67
2 : Cruddy Inc | XETRA | CRUD | EUR | 1.00
Here we used `Frame`'s `mapColKeys` method. This keeps the columns in the original order and simply changes the key values. You'll see our result now has type `Frame[Int,String]`, since our column keys are now strings!
OK, great... so how do we actually use these fancy new column keys to access the data? Let's start by trying to use the ticker symbol as the row keys. Rather than mapping the keys by hand with `mapRowKeys`, as we did for the columns above, we'd like to use an approach that uses the ticker symbol that already exists as data in the frame.
In Framian, a `Frame` doesn't know anything about the type of the data stored in it. Rather, it depends on the user to know these types of details. When we access data in a frame, we must provide a way to extract the data we want as the type we want. We do this by using `Cols` and `Rows`.
Almost all methods on `Frame` that manipulate the data in any meaningful way will have at least one `Cols` or `Rows` argument. The choice of `Cols` or `Rows` defines the axis we are selecting along. In the case of our ticker symbol, we know that it is in the column `"Ticker"` and that it is a `String`. So, let's define a `Cols` that can extract our tickers.
scala> val ticker = Cols("Ticker").as[String]
ticker: framian.Cols[String,String] = ...
When we construct a `Cols` instance using `Cols("Ticker")`, it will extract each ticker symbol as a `Rec`. A `Rec` is rarely what we want, so we use the `as[A]` method to tell the `Cols` instance to extract the column as a type `A`.
Using `ticker`, we can now reindex our rows using the ticker symbol, instead of the row index.
scala> val frame = res1.reindex(ticker)
frame: framian.Frame[String,String] =
Name . Exchange . Ticker . Currency . Market Cap
ACME : Acme | NASDAQ | ACME | USD | 123.00
BOB : Bob Company | NASDAQ | BOB | USD | 45.67
CRUD : Cruddy Inc | XETRA | CRUD | EUR | 1.00
Now we're getting somewhere. We have a frame with basic company information, where each company is indexed by that company's stock ticker and each field is indexed by the field name. Let's see how we can work with some of the data in this frame.
We can get a single cell out of the frame by using `Frame`'s `apply` method. This method requires a type, the row key and the column key, and will return a single cell from the frame. Remember, a frame doesn't know anything about the type of data it contains, so the type parameter must be supplied.
scala> frame[BigDecimal]("ACME", "Market Cap")
res2: framian.Cell[BigDecimal] = Value(123.00)
The first thing we can notice is that we didn't get back a value of type `BigDecimal`; we got a value of type `Cell[BigDecimal]`. This is because Framian does not assume your data is dense. A cell in a frame may contain data, but it may also be missing or be invalid. `Cell` is a data type that has 3 cases (sub-classes):

- `Value(value)` - the cell's data exists and is valid,
- `NA` - the cell's data is missing or not available, and
- `NM` - the cell's data is invalid or not meaningful.

Whenever we extract a value out of a frame we will actually be working with `Cell`s instead. You can think of `Cell` like Scala's `Option`, except we have 2 cases of missing/invalid data (`NA` and `NM`), instead of just 1 (`None`).
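To make the comparison with `Option` concrete, here is a toy re-creation of a three-case cell type in plain Scala. It only mirrors the description above and is not Framian's actual definition; the wrapper object `CellSketch` and the helpers `toOption` and `describe` are invented for this example.

```scala
// Toy re-creation for illustration; not Framian's real Cell.
object CellSketch {
  sealed trait Cell[+A]
  case class Value[A](get: A) extends Cell[A]
  case object NA extends Cell[Nothing] // missing or not available
  case object NM extends Cell[Nothing] // invalid or not meaningful

  // Collapsing to Option loses the distinction between NA and NM.
  def toOption[A](cell: Cell[A]): Option[A] = cell match {
    case Value(a) => Some(a)
    case NA       => None
    case NM       => None
  }

  // Pattern matching lets us treat the two "empty" cases differently.
  def describe[A](cell: Cell[A]): String = cell match {
    case Value(a) => s"value: $a"
    case NA       => "missing"
    case NM       => "not meaningful"
  }
}
```

Because the trait is sealed, the compiler can warn us if a match forgets one of the three cases.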
Aside from getting back a cell, you'll also note that we requested the value as a `BigDecimal`. What would happen if we asked for it as something else? Let's try!
scala> frame[Double]("BOB", "Market Cap")
res2: framian.Cell[Double] = Value(45.67)
You'll notice that for Acme, we asked for the cell as type `BigDecimal`, but for Bob Co, we asked for the market cap as a `Double`. Both returned a value. In Framian, we support conversions between most common numeric types found in Scala and Spire. Framian attempts to keep numbers abstract, letting the user decide what kind of precision/speed trade-off they want to make, rather than forcing 1 type. Even though we had stored the data as a `BigDecimal`, when we requested it as a `Double`, Framian performed the conversion for us.
OK, well, what if we ask for a less sensible type?
scala> frame[LocalDate]("CRUD", "Market Cap")
res3: framian.Cell[LocalDate] = NM
We got back an `NM`, which indicates that the data is invalid or not meaningful. This makes sense, since we cannot meaningfully convert a decimal number to a `LocalDate` (a `LocalDate` is just a product of year, month and day, like 2014-10-31).
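The behaviour we just saw (a value when the conversion makes sense, `NM` when it doesn't, and `NA` when nothing is there) can be sketched in plain Scala with a small type class. This is only a guess at the general shape, not Framian's implementation. Everything in `ExtractSketch` is hypothetical, including the `Reader` type class, and it uses `java.time.LocalDate` rather than Joda-Time so that it needs no extra dependencies.

```scala
import java.time.LocalDate

// Sketch under stated assumptions; none of these names come from Framian.
object ExtractSketch {
  sealed trait Cell[+A]
  case class Value[A](get: A) extends Cell[A]
  case object NA extends Cell[Nothing]
  case object NM extends Cell[Nothing]

  // Knows how to read an untyped stored value as an A, when that makes sense.
  trait Reader[A] { def read(raw: Any): Option[A] }

  object Reader {
    implicit val bigDecimalReader: Reader[BigDecimal] = new Reader[BigDecimal] {
      def read(raw: Any): Option[BigDecimal] = raw match {
        case b: BigDecimal => Some(b)
        case d: Double     => Some(BigDecimal(d))
        case _             => None
      }
    }
    implicit val doubleReader: Reader[Double] = new Reader[Double] {
      def read(raw: Any): Option[Double] = raw match {
        case b: BigDecimal => Some(b.toDouble) // numeric conversion
        case d: Double     => Some(d)
        case _             => None
      }
    }
    implicit val localDateReader: Reader[LocalDate] = new Reader[LocalDate] {
      def read(raw: Any): Option[LocalDate] = raw match {
        case d: LocalDate => Some(d)
        case _            => None // a number is not a date
      }
    }
  }

  // An untyped table keyed by (row key, column key).
  val table: Map[(String, String), Any] = Map(
    ("ACME", "Market Cap") -> BigDecimal("123.00"),
    ("BOB", "Market Cap")  -> BigDecimal("45.67")
  )

  // The caller supplies the type; the table itself does not know it.
  def cell[A](row: String, col: String)(implicit reader: Reader[A]): Cell[A] =
    table.get((row, col)) match {
      case None => NA
      case Some(raw) =>
        reader.read(raw) match {
          case Some(a) => Value(a)
          case None    => NM
        }
    }
}
```

For example, `ExtractSketch.cell[Double]("BOB", "Market Cap")` converts the stored `BigDecimal`, while asking for a `LocalDate` at the same position yields `NM`.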
Working with individual values has its uses, but we usually want to work with an entire subset of the frame instead, such as a group of columns or rows. We do this using the same `Cols`/`Rows` mechanism described above. `Cols` (or `Rows`) describe some selection of columns (or rows), along with their type. Previously, we had defined `ticker` as `Cols("Ticker").as[String]` and used it to reindex the frame by the companies' tickers. Let's define another one to extract the market cap.
scala> val marketCap = Cols("Market Cap").as[BigDecimal]
marketCap: framian.Cols[String,BigDecimal] = ...
We can use `marketCap` to extract a `Series` from the frame.
scala> frame.get(marketCap)
res4: framian.Series[String,BigDecimal] = Series(ACME -> Value(123.00), BOB -> Value(45.67), CRUD -> Value(1.00))
Now we have the market caps as a series of numbers, indexed by their tickers. `Series` are how we work with typed, 1-dimensional data. `Cell`s are typed too, but have no axis, so are 0-dimensional (just a value). A `Frame` is 2-dimensional (has 2 axes), but is untyped. In Scala, we require a type to do pretty much anything useful with a value, so you will find much of your work with frames will require converting subsets of columns and rows to `Series` first.
Much like `Frame`s, `Series` also don't assume the data is dense. Series are actually an indexed list of cells, rather than values. In the case of market cap, the data is dense, so everything is wrapped in `Value`. If any of the data were missing or invalid, we would see `NA`s and `NM`s in the output above.
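As a rough mental model, a series can be pictured as an ordered list of key and cell pairs. The sketch below is plain Scala, not Framian's `Series`; `SeriesSketch`, `ToySeries`, `denseValues` and the sample data (with a deliberately missing entry for BOB) are all made up for illustration.

```scala
// Toy model only; Framian's Series is a richer type than this.
object SeriesSketch {
  sealed trait Cell[+A]
  case class Value[A](get: A) extends Cell[A]
  case object NA extends Cell[Nothing]
  case object NM extends Cell[Nothing]

  // A series here is just an ordered list of (key, cell) pairs.
  type ToySeries[K, A] = Vector[(K, Cell[A])]

  // Unlike the dense output above, BOB's market cap is missing here.
  val marketCaps: ToySeries[String, BigDecimal] = Vector(
    ("ACME", Value(BigDecimal("123.00"))),
    ("BOB", NA),
    ("CRUD", Value(BigDecimal("1.00")))
  )

  // Keep only the entries whose cell actually holds a value.
  def denseValues[K, A](series: ToySeries[K, A]): Vector[(K, A)] =
    series.collect { case (key, Value(a)) => (key, a) }

  // Because the series is typed, ordinary numeric operations just work.
  val total: BigDecimal = denseValues(marketCaps).map(_._2).sum
}
```

Having a type for the values is what makes the final `sum` possible; an untyped frame could not offer it directly.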
Before we go further, let's work with a slightly more interesting data set than 3 fake companies. The framian-tutorial project includes some nice utility methods that will fetch basic data from Yahoo! Finance for us and stuff 'em into frames. Let's fetch some basic company information about a few car companies.
scala> fetchCompanyInfo("GM", "HMC", "BMW.DE")
res5: framian.Frame[Int,String] =
Name . Stock Exchange . Ticker . Currency . Market Cap
0 : General Motors Co | NYSE | GM | USD | 50.097B
1 : Honda Motor Compa | NYSE | HMC | USD | 58.394B
2 : BMW | XETRA | BMW.DE | EUR | 55.959B
This looks very similar to our previous frame... almost like it was planned that way. Well, let's get this reindexed by ticker.
scala> val companies = res5.reindex(ticker)
companies: framian.Frame[String,String] =
Name . Stock Exchange . Ticker . Currency . Market Cap
GM : General Motors Co | NYSE | GM | USD | 50.097B
HMC : Honda Motor Compa | NYSE | HMC | USD | 58.394B
BMW.DE : BMW | XETRA | BMW.DE | EUR | 55.959B
That's great. We can also fetch their share price data for the last 5 years.
scala> val pricing = fetchSharePriceData("GM", "HMC", "BMW.DE").reindex(ticker)
pricing: framian.Frame[String,String] =
Adj Close . Close . Date . High . Low . Open . Ticker . Volume
GM : 31.40 | 31.40 | 2014-10-31 | 31.62 | 30.95 | 31.15 | GM | 15521200
GM : 30.78 | 30.78 | 2014-10-30 | 31.03 | 30.45 | 30.57 | GM | 10073900
GM : 30.72 | 30.72 | 2014-10-29 | 31.33 | 30.35 | 31.20 | GM | 11607900
GM : 31.17 | 31.17 | 2014-10-28 | 31.22 | 30.14 | 30.39 | GM | 26034400
GM : 30.08 | 30.08 | 2014-10-27 | 30.49 | 29.82 | 30.15 | GM | 12772000
GM : 30.04 | 30.04 | 2014-10-24 | 31.28 | 29.98 | 31.04 | GM | 30267600
GM : 30.93 | 30.93 | 2014-10-23 | 31.99 | 30.81 | 31.95 | GM | 25436900
GM : 31.31 | 31.31 | 2014-10-22 | 31.51 | 30.57 | 30.61 | GM | 17817800
GM ...
We also reindexed the frame by the ticker right away.