Machine learning is the practice of teaching a computer to recognize patterns by providing it with data and an algorithm for making sense of that data.
Machine learning and data science are becoming more prevalent in healthcare.
- Classification
- The purpose of the Classification model is to determine a label or category – it is either one thing or another. We train the model using a set of labelled data.
- Regression
- A Regression model is created when we want to predict a number – for example, how many days until a patient discharged from hospital with a chronic condition such as diabetes will return.
- Clustering
- We would create a Clustering model if we had a lot of data without a predetermined outcome, and we just wanted to see whether it contained any distinctive patterns.
GOAL: to create an algorithm that draws a line between the two labeled groups, called a decision boundary.
Note: the ‘x’s on the graph above represent data points, here two-dimensional points (x, y) where x refers to the item’s color and y to the item’s size.
In general, data points will have some number n of features (here, n = 2), and will thus lie in some *n*-dimensional space.
When we are classifying data, we want to find some boundary that divides our space into two regions: one where, say, all data points that are ‘apples’ live, and another where all data points that are ‘oranges’ live. For now, we will only consider the case where we have two classes to separate, but we can extend this to any number of classes!
In general, our model needs to have 2 things:
- a predictor function $f$, that maps input data to a predicted label
- a loss function $L$, that maps a predicted label to a loss value
Once we have these two things defined for us, we can optimize our predictor function so that it predicts a label as well as possible. In other words, we want to minimize our loss.
In order to find the minimum of our loss function, we’ll use an algorithm called gradient descent.
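As a preview, here is a minimal sketch of the idea on a toy one-parameter loss $L(w) = (w - 3)^2$ (the loss, names, and numbers below are purely illustrative, not the model we are building):

```python
def gradient(w):
    """Gradient of the toy loss L(w) = (w - 3)**2, i.e. dL/dw = 2 * (w - 3)."""
    return 2 * (w - 3)

def gradient_descent(w=0.0, learning_rate=0.1, steps=100):
    """Repeatedly step opposite the gradient so the loss shrinks."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w
```

Starting from w = 0, repeated steps move w toward 3, where this toy loss is minimized.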
Imagine our problem is something more realistic than classifying fruits as apples and oranges. Let’s say we want to predict if Arjun will wear a Patagonia based on today’s forecast.
This could be as easy as just finding a threshold temperature and claiming if the input temperature is below a certain threshold (say, 70˚F), we will output 1 (yes). If the temperature is above the threshold, we will output a 0 (no).
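That hard-threshold model takes only a couple of lines of Python (the function name and the 70˚F cutoff are just the illustration above):

```python
def wears_patagonia(temperature_f, threshold=70):
    """Hard threshold: output 1 (yes) below the threshold, else 0 (no)."""
    return 1 if temperature_f < threshold else 0
```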
However, life is not as black and white as the above model would suggest.
In reality, Arjun won’t immediately put on a Patagonia the moment it drops below some predefined temperature. It’s more as if at any temperature he has a certain “chance” of putting on a Patagonia. Maybe at 45 F he would have a 95% chance of putting on a Patagonia, and at 60 F he would have a 30% chance of putting on a Patagonia.
To better model this, we use logistic regression to find these probabilities. This involves fitting a logistic curve (like the one below) to our data. To do this, we again use gradient descent to choose the best parameters for the model.
The general form of the logistic model is:

$P(y = 1 \mid x) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$

which represents the probability that $y = 1$ given the input $x$ (we will get to the reasoning another day).
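As a sketch, the logistic curve can be evaluated directly; the parameter values below are made up for illustration, not fitted to any data:

```python
import math

def logistic(x, beta0, beta1):
    """P(y = 1 | x) under a two-parameter logistic model."""
    return 1 / (1 + math.exp(-(beta0 + beta1 * x)))

# With a negative slope, colder temperatures give a higher probability
# of putting on a Patagonia (illustrative parameters only).
p_cold = logistic(45, 10, -0.2)  # chance at 45 F
p_warm = logistic(60, 10, -0.2)  # chance at 60 F
```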
So then we can take the gradient of the loss function, and perform gradient descent to find the best set of parameters to help us understand when Arjun will wear his Patagonia.
(I realize we haven’t actually gone over the gradient descent algorithm at all – we’ll save that for next week, don’t worry.)
Let’s go with the simplest model that requires NO training!!!
The algorithm: Preprocessing: Split data up into 2 main groups:
- training set
- test set
Training time: do nothing. Test time: you are given an unlabeled point (from the test set). To predict its label, look at the labels of the k nearest training points (in the training set) and make an estimate.
- Categorical data: majority vote on the class.
- Numerical data: take an average. Perhaps weighted by distance.
What does it mean to be “near”? The K-Nearest Neighbors algorithm generally uses Euclidean distance to quantify how near or far two points in the data are from each other.
Let $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$ be two points with $n$ features each. Then the Euclidean distance between them is

$d(p, q) = \sqrt{(p_1 - q_1)^2 + \cdots + (p_n - q_n)^2}$
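The test-time step described above can be sketched as follows, using Euclidean distance and a majority vote for categorical labels (the function names and sample points are illustrative):

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two n-dimensional points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, query, k):
    """Label a query point by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; `query` is an unlabeled point.
    """
    neighbors = sorted(train, key=lambda pl: euclidean(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

For numerical targets, the same neighbor search applies, but the final step would average the neighbors’ values instead of voting.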
What is K? K is a hyperparameter. You choose it. Here's a nice visualization of the effect of K on classification.
- If K = 1, you classify a point based on only its 1 nearest neighbor. Moving a little can cause you to flip classes (high variance). But each training point is classified correctly (low bias).
- If K = N (# training points), every point gets the same classification, since it looks at the entire training set. Low variance. But you will surely get many things wrong (high bias).
So… how do we determine the best value for K?
Solution: try a bunch of them…
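One common way to “try a bunch of them” is to hold out a validation set and keep the K with the best accuracy – a minimal sketch, assuming a simple majority-vote KNN predictor (all names and sample points are illustrative):

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, query, k):
    neighbors = sorted(train, key=lambda pl: euclidean(pl[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def best_k(train, validation, candidates):
    """Return the K that labels the most held-out validation points correctly."""
    def accuracy(k):
        hits = sum(knn_predict(train, point, k) == label
                   for point, label in validation)
        return hits / len(validation)
    return max(candidates, key=accuracy)
```

In practice you would repeat this over several train/validation splits (cross-validation) rather than trusting a single split.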
We are going to tackle the task of predicting whether or not someone has diabetes.
We will be using the diabetes data set, which originated from the UCI Machine Learning Repository.
Go to https://github.com/medtech-berkeley/ML-for-Healthcare and clone the repository.
Make sure you have Python and Jupyter installed on your computer.
```shell
python3 -m pip install --upgrade pip
python3 -m pip install jupyter
```
Once you have cloned the repository, navigate to it inside your terminal and run:
```shell
pip3 install -r requirements.txt
```
Now you should be good to go!!
(If this part is proving difficult, you can download Python through Anaconda, and that should take care of everything.)