dsimcha edited this page Dec 23, 2011 · 5 revisions

Dstats is a library for statistics and (eventually) machine learning, written in the D programming language. Its goals are as follows:

  1. Be as easy to use as statistics packages like R for statistics and machine learning, yet be perfectly integrated with a general purpose, efficient, statically typed, close-to-the-metal programming language. The existing solutions I've looked at fall short. R is not a general purpose programming language because it was never meant to be. Python with RPy is usable, but the friction between the Python parts of a program and the R parts is ever-present. Dstats already covers more functionality than Python plus SciPy.

  2. Be efficient. At first glance, this shouldn't matter much for a statistics library, but when you're doing -omics research or Monte Carlo simulations, it starts to.

  3. Gradually integrate with SciD as it matures to provide a more comprehensive library for scientific code in D. SciD provides matrices, vectors and expression templates. Dstats was largely written before SciD existed and therefore often takes an ad-hoc approach to dealing with matrix operations, since solving that problem for the general case in D was beyond the scope of the project. So far, Dstats can optionally use SciD to accelerate some matrix computations, enable a few extra features, and share an instance of the custom memory allocator that both libraries use under the hood. These features are enabled when Dstats is compiled with -version=scid.


Dstats currently consists of the following modules:

dstats.all: A module that publicly imports the rest of Dstats. For convenience only.

dstats.alloc: Contains an implementation of the RegionAllocator memory allocator, plus hash table, hash set, and AVL tree implementations designed specifically for this allocator.

dstats.base: Utility functions that higher-level Dstats modules build on.

dstats.cor: Pearson, Spearman and Kendall correlation and covariance.
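As a rough sketch of how the correlation routines are typically called (the function names `pearsonCor` and `spearmanCor` are my recollection of the Dstats API, not confirmed by this page; check the module documentation):

```d
import std.stdio;
import dstats.cor;  // module path as given in the listing above

void main() {
    // Two paired samples of equal length.
    auto height = [1.60, 1.72, 1.68, 1.81, 1.75];
    auto weight = [55.0, 70.0, 66.0, 82.0, 74.0];

    // pearsonCor/spearmanCor are assumed to take two ranges and
    // return the correlation coefficient as a floating-point value.
    writeln("Pearson:  ", pearsonCor(height, weight));
    writeln("Spearman: ", spearmanCor(height, weight));
}
```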

dstats.distrib: Probability distribution functions (CDFs, PDFs, inverse CDFs, etc.).
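A minimal sketch of using a distribution function; the name `normalCDF` and its `(x, mean, stdev)` parameter order are my assumptions about the API, not stated on this page:

```d
import std.stdio;
import dstats.distrib;  // module path as given in the listing above

void main() {
    // Assumed: normalCDF(x, mean, stdev) returns P(X <= x) for a
    // normal distribution with the given mean and standard deviation.
    // For N(0, 1), the CDF at 1.96 should be roughly 0.975.
    writeln(normalCDF(1.96, 0.0, 1.0));
}
```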

dstats.infotheory: Basic information theory calculations such as entropy, joint and conditional entropy, and mutual information (including conditional mutual information).
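A hedged sketch of an entropy calculation; `entropy` is my guess at the function name, assumed to compute the empirical Shannon entropy of a range of discrete observations:

```d
import std.stdio;
import dstats.infotheory;  // module path as given in the listing above

void main() {
    // A sample with two equally frequent symbols: empirical entropy
    // should come out near 1 bit (assuming entropy() works in bits).
    auto fairCoin = [0, 1, 0, 1, 0, 1, 0, 1];
    writeln(entropy(fairCoin));
}
```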

dstats.kerneldensity: A module for non-parametric kernel estimation of probability densities.

dstats.pca: Principal component analysis.

dstats.random: Random number generators for various probability distributions, ported from NumPy.
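A sketch of drawing from one of these distributions; the name `rNorm` and its `(mean, stdev, gen)` signature are my assumptions about the NumPy-style API, so verify them against the dstats.random documentation:

```d
import std.stdio;
import std.random;
import dstats.random;  // module path as given in the listing above

void main() {
    // Seed a standard-library engine for reproducibility.
    auto gen = Random(42);

    // Assumed: rNorm(mean, stdev, gen) draws one normal variate.
    foreach (i; 0 .. 3)
        writeln(rNorm(0.0, 1.0, gen));
}
```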

dstats.regress: Linear and logistic regression, ordinary and L1 and L2 penalized.

dstats.sort: Highly optimized sorting algorithm implementations with added features for statistical use cases.

dstats.summary: Summary statistics such as mean, median, interquartile range, standard deviation, skewness, and kurtosis.
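A minimal sketch of computing a few of these summaries; the names `mean`, `median`, and `stdev` follow the descriptions above but are my assumptions about the exact API, each presumed to take a range of numeric values:

```d
import std.stdio;
import dstats.summary;  // module path as given in the listing above

void main() {
    auto data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0];

    // Assumed function names; each takes a range and returns a value.
    writeln("mean:   ", mean(data));
    writeln("median: ", median(data));
    writeln("stdev:  ", stdev(data));
}
```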

dstats.tests: Hypothesis testing, such as T-tests, Wilcoxon tests, and various one-way ANOVAs. Also contains several multiple testing corrections, such as Benjamini-Hochberg.
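A hedged sketch of running a one-sample t-test; `studentsTTest` is my recollection of the function name, assumed to test a sample against a hypothesized mean and return a result carrying the test statistic and p-value:

```d
import std.stdio;
import dstats.tests;  // module path as given in the listing above

void main() {
    auto sample = [4.8, 5.1, 5.3, 4.9, 5.6, 5.2];

    // Assumed: studentsTTest(sample, testMean) performs a one-sample
    // t-test of H0: mean == 5.0; the result is assumed printable and
    // to expose the statistic and p-value as fields.
    auto result = studentsTTest(sample, 5.0);
    writeln(result);
}
```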
