Notebooks and related material from my data-intelligence.ai talk:
Audience level: Novice
Topic area: Misc
Description:
Medicare payments, UPC code descriptions, fertility rate and fires. All of it is data, some of which is erroneous and some of which is anomalous. Seeking Exotics introduces the audience to the world of outliers and anomaly detection through the use of metrics, visualizations and open source machine learning tools.
Abstract:
In 1777, Daniel Bernoulli wrote in his paper on the Most Likely Induction (maximum likelihood): "Is it right to hold that the several observations are of the same weight or moment, or equally prone to any and every error?"
Ever since, mankind has struggled with what to do with erroneous and anomalous data. Finding them looks like
a simple clustering problem, and labeling them like a simple classification problem, but that would
be oversimplifying the problem.
The "Seeking Exotics" talk will start with a light introduction to the world of outliers and anomaly detection. For more historical and background information, the audience is kindly invited to listen to episode 2 of the podcast "Something for Your Mind" before the talk.
Through several sets of data covering various fields and types of data, several visualization techniques will be
demonstrated. This will range from static box and stemgraphic plots to interactive mpld3 scatter plots. These will
be combined with dimensionality reduction and clustering techniques (beyond PCA) in order to derive more insight
from the data.
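As a minimal sketch of the rule behind a box plot's whiskers (not code from the notebooks), the following flags points outside Tukey's fences, that is, more than 1.5 times the interquartile range beyond the quartiles. The function name `tukey_fences` and the sample values are made up for illustration.

```python
import numpy as np


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a 1-D array of values."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr


# Toy data, invented for this example: one value sits far from the rest.
data = np.array([2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 9.6])

low, high = tukey_fences(data)
outliers = data[(data < low) | (data > high)]
print(low, high, outliers)
```

A point outside the fences is only a candidate outlier; whether it is an error or a genuine anomaly still has to be decided by looking at the data.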
Finally, one-class classifiers (such as Isolation Forest) will do some heavy lifting for us, with the
easy-to-use scikit-learn giving us some results, ranging from sobering to surprising.
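To give a feel for what this looks like in scikit-learn, here is a minimal sketch, not taken from the talk's notebooks. The data is synthetic (a Gaussian blob plus three planted points), and the `contamination` value is an assumption chosen for this toy set.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data, invented for this example: 200 "normal" points
# around the origin plus three planted points far away from them.
rng = np.random.RandomState(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X = np.vstack([inliers, planted])

# contamination is the expected share of anomalies (an assumption here).
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
model.fit(X)

labels = model.predict(X)            # +1 for inliers, -1 for anomalies
scores = model.decision_function(X)  # lower score = more anomalous
flagged = np.where(labels == -1)[0]
print(flagged)
```

The share of points flagged follows `contamination`, so on real data that parameter deserves more thought than the toy value used here.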
Besides installing Jupyter notebook and having Python 3.4 or greater (all of that installable through Anaconda, if you are new to Python), you will need a few extra packages.
If you want to reproduce the results exactly as I demonstrated at the conference, here are the versions I used of each package (I had an incompatibility issue between seaborn and a more recent matplotlib):
- matplotlib==1.5.0
- mpld3==0.3
- numpy==1.12.0
- pandas==0.19.2
- scikit-learn==0.18.2
- scikit-sos==0.1.10
- scipy==0.18.1
- git+https://github.com/mwaskom/seaborn.git
- statsmodels==0.8.0
- stemgraphic==0.3.6
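If you want to check an existing environment against the pinned versions above, a small helper along these lines may be useful. It is only a sketch: the names `PINNED` and `version_mismatches` are made up here, and it assumes each package exposes a `__version__` attribute, which may not hold for all of them.

```python
import importlib

# Import name -> version pinned in the list above. seaborn is left out
# because it was installed from the git repository, not a tagged release.
PINNED = {
    "matplotlib": "1.5.0",
    "mpld3": "0.3",
    "numpy": "1.12.0",
    "pandas": "0.19.2",
    "sklearn": "0.18.2",   # import name of scikit-learn
    "sksos": "0.1.10",     # import name of scikit-sos
    "scipy": "0.18.1",
    "statsmodels": "0.8.0",
    "stemgraphic": "0.3.6",
}


def version_mismatches(pinned):
    """Return {module: (expected, found)} for modules that differ.

    found is None when the module cannot be imported or does not
    expose a __version__ attribute.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            mismatches[name] = (expected, None)
            continue
        found = getattr(module, "__version__", None)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling `version_mismatches(PINNED)` returns an empty dictionary when every package matches its pinned version.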
Optionally, install the Python package RISE to use the slide deck presentation mode in Jupyter for the first two notebooks (that is how they were presented). Following is a breakdown of all of this by area.
See also:
During my talk, I brought up a few paper and book titles (the first 10). I also talked about a few more during the questions at the end (continuing over lunch and even into the afternoon). So for the benefit of many, here are some of them (related to outliers, anomalies and visualization):
- Daniel Bernoulli, The Most Probable Choice Between Several Discrepant Observations and the Formation Therefrom of the Most Likely Induction (1777), translated from Latin by C. G. Allen, republished in Biometrika 48, 1 and 2 p.1 (1961) by M. G. Kendall
- K. E. Basford and J. W. Tukey, Graphical Analysis of Multiresponse Data: Illustrated With a Plant Breeding Trial (1999), Chapman & Hall
- J. W. Tukey, Exploratory Data Analysis (1977), Addison-Wesley
- F. Dion, Stemgraphic: A Stem-and-Leaf Plot for the Age of Big Data (2016)
- P. J. Rousseeuw, A. M. Leroy, Robust Regression & Outlier Detection (1987), Wiley
- W. J. Youden, Experimentation and Measurement (1961), NIST special publication 672
- F. J. Anscombe, "Graphs in Statistical Analysis" (1973)
- G. Lemaitre, F. Nogueira, C. K. Aridas, "Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning" (2017), Journal of Machine Learning Research
- F. T. Liu, K. M. Ting, Z.-H. Zhou, "Isolation Forest" (2008), IEEE International Conference on Data Mining
- A. Cairo, The Functional Art: An Introduction to Information Graphics and Visualization (2013), New Riders
- E. J. Candès, X. Li, Y. Ma, J. Wright, "Robust Principal Component Analysis?" (2011), Journal of the ACM 58, 3, p.1-37
- M. Lima, Visual Complexity: Mapping Patterns of Information (2011), Princeton Architectural Press
- N. N. Taleb, Fooled by Randomness (2001), Random House
- V. Barnett, T. Lewis, Outliers in Statistical Data (1978), Wiley
- C. Aggarwal, Outlier Analysis (2013), Springer
- D. Salsburg, The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century (2001), Henry Holt and Co
- R. D. Cook, S. Weisberg, Residuals and Influence in Regression (1982), Chapman and Hall
- M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms (2003), IEEE Press, Wiley-Interscience
And of course the various other publications I have listed in my "ex-libris" series on LinkedIn (part I, part II, part III and part IV; part V on visualization and part VI on communication are not yet available).