Loading and saving matrices
Matrices can be read with commands like this:
scala> val a:IMat = load("d:\\data\\sentiment\\data1.mat","tokens")
scala> val b:SMat = load("d:\\data\\sentiment\\data2.mat","trigrams")
The load command takes a filename argument, followed by the name of a variable in the file. Assuming the data were created by Matlab (with the "-v7.3" option to save), the variable name is the name of the object saved in Matlab.
Note that each variable declaration includes a matrix type. This is important. The load function can return FMat, DMat, IMat, SMat, SDMat, CMat, CSMat or String objects, and its actual return type is AnyRef. Providing a type declaration for the assigned value or variable tells the compiler exactly what type to expect, and allows the variable to be bound to the correct type. Note that CSMat is similar to Matlab's "cell matrix" and its elements may be of any of the above types. Most commonly though, the CSMat will hold string data.
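The role of the type declaration can be illustrated with a small, self-contained sketch. It does not use BIDMat: `IntMatrix`, `StringCell`, `loadAny` and `loadAs` are hypothetical stand-ins, and the assumption is only that an untyped loader is wrapped by a generic function whose type parameter is inferred from the declared type on the left-hand side.

```scala
// Hypothetical stand-ins for matrix types; these are not BIDMat's classes.
final case class IntMatrix(data: Array[Int])
final case class StringCell(data: Array[String])

object TypedLoadSketch {
  // Stand-in for a data file: variable names mapped to untyped objects.
  private val store: Map[String, AnyRef] = Map(
    "tokens" -> IntMatrix(Array(1, 2, 3)),
    "words"  -> StringCell(Array("a", "b"))
  )

  // Untyped lookup, analogous to a loader whose return type is AnyRef.
  def loadAny(name: String): AnyRef = store(name)

  // Typed wrapper: T is inferred from the expected type at the call site.
  // If no type is declared there, T is inferred as Nothing and the cast
  // typically fails at runtime, which is why the declaration matters.
  def loadAs[T](name: String): T = loadAny(name).asInstanceOf[T]

  def main(args: Array[String]): Unit = {
    val a: IntMatrix = loadAs("tokens") // T inferred as IntMatrix
    println(a.data.sum)

    // When the type is not known in advance, match on the runtime type.
    loadAny("words") match {
      case IntMatrix(d)  => println(s"int matrix with ${d.length} elements")
      case StringCell(d) => println(s"string cell with ${d.length} elements")
      case other         => println(s"unsupported object: $other")
    }
  }
}
```

Pattern matching, as in the second half of `main`, is one way to handle a variable whose stored type is not known when the code is written.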
The underlying representation is HDF5, a widely-used format for storing matrices of scientific data, and the format now used by Matlab. Matlab's version of this format is prefaced by a 512-byte header. That is the only difference between Matlab's HDF5 files and non-Matlab HDF5 files. Without the header though, Matlab will not read a data file. It will also complain if certain metadata on each array are missing. So it's best to use a save function that is compatible with Matlab.
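One way to see the header in practice is to look for the 8-byte HDF5 format signature. The sketch below assumes the rule from the HDF5 file format specification that the signature may sit at offset 0, 512, 1024, 2048 and so on, so a file with a 512-byte header should report offset 512 and a plain HDF5 file offset 0. The object and function names are made up for illustration and are not part of BIDMat.

```scala
object Hdf5HeaderSketch {
  // The 8-byte HDF5 signature: 0x89 'H' 'D' 'F' '\r' '\n' 0x1a '\n'
  val Signature: Array[Byte] =
    Array(0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a).map(_.toByte)

  // Returns the first permitted offset (0, 512, 1024, ...) at which the
  // signature is found, or None if the bytes do not look like HDF5.
  def signatureOffset(bytes: Array[Byte]): Option[Long] = {
    val candidates = Iterator(0L) ++ Iterator.iterate(512L)(_ * 2)
    candidates
      .takeWhile(off => off + Signature.length <= bytes.length)
      .find { off =>
        val start = off.toInt // safe: off is bounded by bytes.length
        bytes.slice(start, start + Signature.length).sameElements(Signature)
      }
  }

  def main(args: Array[String]): Unit = {
    // Fake "file": 512 bytes of header padding followed by the signature.
    val header = Array.fill[Byte](512)(' '.toByte)
    println(signatureOffset(header ++ Signature)) // Some(512)
  }
}
```

For a real file, the same function could be applied to the first few kilobytes read from disk rather than to the whole file.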
Saving variables to a file is straightforward:
scala> saveAs("d:\\data\\sentiment\\data1.mat", a, "tokens", b, "trigrams")
You can save an arbitrary number of variables to a file. The first argument to saveAs is a filename, and the remaining arguments form an alternating list of variables from the environment and String names. The effect is that variable a is saved as "tokens", b is saved as "trigrams", and so on. In fact a and b don't have to be references to matrices; they can be any expressions that return the appropriate matrix types.
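The alternating value/name convention can be sketched with a small helper that turns such an argument list into (name, value) pairs. This is only an illustration of the calling convention under the assumption that arguments arrive as Scala varargs; `pairUp` is a hypothetical function, not how saveAs is implemented.

```scala
object SaveArgsSketch {
  // Hypothetical helper: splits (value, name, value, name, ...) into
  // (name, value) pairs, rejecting malformed argument lists.
  def pairUp(args: Any*): List[(String, Any)] = {
    require(args.length % 2 == 0, "expected alternating values and names")
    args.grouped(2).map {
      case Seq(value, name: String) => name -> value
      case other =>
        throw new IllegalArgumentException(s"expected a value followed by a String name, got: $other")
    }.toList
  }

  def main(args: Array[String]): Unit = {
    val a = Array(1, 2, 3)       // stands in for a dense int matrix
    val b = Array(0.5, 0.0, 1.5) // stands in for a sparse matrix
    // Values may be any expressions, not only variable references.
    pairUp(a, "tokens", b.map(_ * 2), "trigrams").foreach {
      case (name, _) => println(s"would save a variable named $name")
    }
  }
}
```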
You can load this data directly into Matlab with the load command (which doesn't need the "-v7.3" option). It will create variables named "tokens" and "trigrams" that are respectively a dense matrix of int32 and a sparse matrix of double.
Not all Matlab types are supported. Currently supported are dense matrices of double, float and int32, and sparse double matrices (you can also save and load sparse matrices with float coefficients, which do not exist in Matlab). String data are stored as uint16, which matches well with the internal formats of Matlab and Java/Scala, and will be read by Matlab as strings. A CSMat of string data will be read by Matlab as a cell array of strings. Unfortunately, this is very inefficient in HDF5. Matlab really only has cell string arrays to handle variable-length strings. As in Matlab, the contents of each cell are stored as a separate array. In HDF5, compression only happens within a given array (i.e. within one string). Arrays of short strings, like dictionaries, cannot be compressed at all. It would be better to use another format, e.g. a sparse array of uint16, that could hold variable-length strings for I/O and be converted to a cell string array for manipulation.
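The idea of holding many variable-length strings in one array can be sketched as follows. This is an assumption-laden illustration rather than a format BIDMat defines: instead of a sparse uint16 matrix, it uses a single flat buffer of 16-bit characters plus an offsets array, which plays a role similar to the column pointers of a compressed sparse matrix. Because all characters live in one contiguous array, a compressor could in principle work across string boundaries.

```scala
object PackedStringsSketch {
  // chars holds every string back to back (Scala's Char is a 16-bit unsigned value);
  // string i occupies chars(offsets(i)) until chars(offsets(i + 1)).
  final case class Packed(chars: Array[Char], offsets: Array[Int])

  def pack(strings: Seq[String]): Packed = {
    val offsets = strings.scanLeft(0)((acc, s) => acc + s.length).toArray
    Packed(strings.mkString.toCharArray, offsets)
  }

  def unpack(p: Packed): List[String] =
    p.offsets.zip(p.offsets.drop(1)).map {
      case (from, until) => new String(p.chars, from, until - from)
    }.toList

  def main(args: Array[String]): Unit = {
    val dict = Seq("cat", "", "horse")
    val packed = pack(dict)
    println(packed.offsets.mkString(","))  // 0,3,3,8
    println(unpack(packed) == dict.toList) // true
  }
}
```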