Scalaz Analytics provides a high-performance, purely-functional library for doing computational analysis and statistics over data in a type-safe way.
Scalaz Analytics is a principled functional programming library for data processing and analytics.
- Simple and principled
- First class support for analytics and data science
- Pure, type-safe functional interface that integrates with other Scalaz projects
- Supports batch and streaming
- Efficient on both small and large data sets, single machine and distributed
- Can be used from a REPL for interactive analysis or as a library for applications
Below is a selection of analytics and data processing libraries that we are using as inspiration. Some of these metrics are somewhat subjective, but they give an idea of what we are looking at in each library. Note that these metrics assume native support, so capabilities that a library achieves only via another library are not counted.
Library | Scales to Big Data | Supports Batch | Supports Streaming | FP | Easy to Debug | Out of the box analytics |
---|---|---|---|---|---|---|
Spark | ✔ | ✔ | ✔ (mini batch) | ✘ | ✘ | ✘ |
Flink | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ |
Pandas | ✘ | ✔ | ✘ | ✘ | ✔ | ✘ |
R | ✘ | ✔ | ✔ | ✘ | ? | ✔ |
Dask | ✔ | ✔ | ✔ | ✘ | ? | ✘ |
Apex | ✔ | ✔ | ✔ | ✘ | ? | ✘ |
Beam | ✔ | ✔ | ✔ | ✘ | ? | ✘ |
- Apache Spark - Internal Data Formats, Tungsten 1/Tungsten 2 and Catalyst optimisations, Mastering Apache Spark, Frameless - a library that adds type-safety to Spark
- Quark - Functional Data Processing DSL - presentation and video
- Flink documentation
- Apache Arrow - High-performance in-memory data platform
- Icicle - Optimisation of DSLs into efficient code for data processing talk
- Machine Fusion
- An algebra for distributed big data analytics - Leonidas Fegaras
- Compiling to categories - Conal Elliott
- Linear algebra for data processing and other similar work by José Nuno Oliveira
- Data cube as typed linear algebra operator
- Comprehension syntax for query languages: Monoid and others
- Efficient derivation of folds in Haskell's foldl
- Mining Massive Data Sets free online book
- Scipy stats package
- Concurrent commutative folds in Tesser
- Explanation of zero copy