Overview
Dstats is a library for statistics and (eventually) machine learning, written in the D programming language. Its goals are as follows:
- Be as easy to use as statistics packages like R for statistics and machine learning, yet be perfectly integrated with a general-purpose, efficient, statically typed, close-to-the-metal programming language. The existing solutions I've looked at fall short. R is not a general-purpose programming language because it was never meant to be. Python with RPy is usable, but the friction between the Python parts of a program and the R parts is ever-present. Dstats already covers more functionality than Python plus SciPy.
- Be efficient. At first glance, this shouldn't matter much for a statistics library, but when you're doing -omics research or Monte Carlo simulations, it starts to.
- Gradually integrate with SciD as it matures, to provide a more comprehensive library for scientific code in D. SciD provides matrices, vectors, and expression templates. Dstats was largely written before SciD existed and therefore often takes an ad-hoc approach to matrix operations, since solving that problem for the general case in D was beyond the scope of the project. So far, Dstats can optionally use SciD to accelerate some matrix computations, enable a few extra features, and share an instance of the custom memory allocator that both use under the hood. This happens when it's compiled with -version=scid.
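As a rough illustration of the optional SciD integration mentioned above, the `-version=scid` switch is the standard dmd mechanism for enabling a `version(scid)` block at compile time. The file paths below are hypothetical; adjust them to wherever the Dstats and SciD sources live on your system.

```shell
# Build a program against Dstats with the optional SciD integration enabled.
# -version=scid turns on the version(scid) code paths inside Dstats.
# "myprog.d" and the source locations are illustrative, not part of Dstats.
dmd -version=scid -I/path/to/dstats -I/path/to/scid myprog.d /path/to/dstats/*.d
```

Without the switch, the same sources build in the default configuration and the SciD-dependent features are simply compiled out.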
Dstats currently consists of the following modules:
- dstats.all: A module that publicly imports the rest of Dstats. For convenience only.
- dstats.alloc: An implementation of the RegionAllocator memory allocator, plus hash table, hash set, and AVL tree implementations designed specifically for this allocator.
- dstats.base: Utility functionality that higher-level Dstats functionality builds on.
- dstats.cor: Pearson, Spearman, and Kendall correlation, and covariance.
- dstats.distrib: Probability distribution functions (CDFs, PDFs, inverse CDFs, etc.).
- dstats.infotheory: Basic information theory calculations such as entropy, joint and conditional entropy, and mutual information (including conditional mutual information).
- dstats.kerneldensity: Non-parametric kernel estimation of probability densities.
- dstats.pca: Principal component analysis.
- dstats.random: Random number generators for various probability distributions, ported from NumPy.
- dstats.regress: Linear and logistic regression, both ordinary and L1/L2 penalized.
- dstats.sort: Highly optimized sorting algorithm implementations with added features for statistical use cases.
- dstats.summary: Summary statistics such as mean, median, interquartile range, standard deviation, skewness, and kurtosis.
- dstats.tests: Hypothesis testing, such as t-tests, Wilcoxon tests, and various one-way ANOVAs. Also contains several multiple-testing corrections, such as Benjamini-Hochberg.
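To give a feel for the API, here is a minimal sketch using a few of the modules listed above. The function names (`mean`, `median`, `pearsonCor`) match Dstats' documented API, but check them against the version you build, as signatures may have changed.

```d
// Minimal usage sketch for Dstats (names per its documented API; verify
// against your checkout). dstats.all pulls in every module at once.
import dstats.all;
import std.stdio;

void main() {
    auto x = [1.0, 2.0, 3.0, 4.0, 5.0];
    auto y = [2.0, 4.0, 5.0, 4.0, 5.0];

    writeln("mean(x):      ", mean(x));          // dstats.summary
    writeln("median(x):    ", median(x));        // dstats.summary
    writeln("pearsonCor:   ", pearsonCor(x, y)); // dstats.cor
}
```

Because Dstats functions accept ranges rather than a bespoke vector type, they compose directly with ordinary D arrays and std.range pipelines.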