Machine learning is the practice of teaching a computer to recognize patterns by providing it with data and an algorithm for making sense of that data.
Machine learning and data science are becoming more prevalent in healthcare.
- Classification
- The purpose of the Classification model is to determine a label or category – it is either one thing or another. We train the model using a set of labelled data.
- Regression
- A Regression model is created when we want to predict a number – for example, how many days until a patient discharged from hospital with a chronic condition such as diabetes will return.
- Clustering
- We would create a Clustering model if we had a lot of data without a predetermined outcome, and we just wanted to see whether it contained any distinctive patterns.
GOAL: to create an algorithm that draws a line between the two labeled groups, called a decision boundary.
Note: the ‘x’s on the graph above represent data points, here two-dimensional points (x, y) where x refers to the item’s color and y to the item’s size.
In general, data points will have some number n of features (here, n = 2), and will thus lie in some *n*-dimensional space.
When we are classifying data, we want to find some boundary that divides our space into two regions: one where, say, all data points that are ‘apples’ live, and another where all data points that are ‘oranges’ live. For now, we will only consider the case where we have two classes to separate, but we can extend this to any number of classes!
In general, our model needs to have 2 things:
- a predictor function $f$, that maps input data to a predicted label
- a loss function $L$, that maps a predicted label to a loss value
Once we have these two things defined for us, we can optimize our predictor function so that it predicts a label as well as possible. In other words, we want to minimize our loss.
In order to find the minimum of our loss function, we’ll use an algorithm called gradient descent.
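As a preview, here is a minimal sketch of the idea on a toy one-parameter loss $L(w) = (w - 3)^2$ (the loss, names, and numbers below are purely illustrative, not the model we are building):

```python
def gradient(w):
    """Gradient of the toy loss L(w) = (w - 3)**2, i.e. dL/dw = 2 * (w - 3)."""
    return 2 * (w - 3)

def gradient_descent(w=0.0, learning_rate=0.1, steps=100):
    """Repeatedly step opposite the gradient so the loss shrinks."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w
```

Starting from w = 0, repeated steps move w toward 3, where this toy loss is minimized.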
Imagine our problem is something more realistic than classifying fruits as apples and oranges. Let’s say we want to predict if Arjun will wear a Patagonia based on today’s forecast.
This could be as easy as just finding a threshold temperature and claiming if the input temperature is below a certain threshold (say, 70˚F), we will output 1 (yes). If the temperature is above the threshold, we will output a 0 (no).
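That hard-threshold model takes only a couple of lines of Python (the function name and the 70˚F cutoff are just the illustration above):

```python
def wears_patagonia(temperature_f, threshold=70):
    """Hard threshold: output 1 (yes) below the threshold, else 0 (no)."""
    return 1 if temperature_f < threshold else 0
```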
However, life is not as black and white as the above model would suggest.
In reality, Arjun won’t immediately put on a Patagonia the moment it drops below some predefined temperature. It’s more as if at any temperature he has a certain “chance” of putting on a Patagonia. Maybe at 45 F he would have a 95% chance of putting on a Patagonia, and at 60 F he would have a 30% chance of putting on a Patagonia.
To better model this, we use logistic regression to find these probabilities. This involves fitting a logistic curve (like the one below) to our data. To do this, we again use gradient descent to choose the best parameters for the model.
The general form of the logistic model is:

$P(y = 1 \mid x) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$

which represents the probability that $y = 1$ given the input $x$ (we will get to the reasoning another day).
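As a sketch, the logistic curve can be evaluated directly; the parameter values below are made up for illustration, not fitted to any data:

```python
import math

def logistic(x, beta0, beta1):
    """P(y = 1 | x) under a two-parameter logistic model."""
    return 1 / (1 + math.exp(-(beta0 + beta1 * x)))

# With a negative slope, colder temperatures give a higher probability
# of putting on a Patagonia (illustrative parameters only).
p_cold = logistic(45, 10, -0.2)  # chance at 45 F
p_warm = logistic(60, 10, -0.2)  # chance at 60 F
```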
So then we can take the gradient of the loss function, and perform gradient descent to find the best set of parameters to help us understand when Arjun will wear his Patagonia.
(I realize we haven’t actually gone over the gradient descent algorithm at all – we’ll save that for next week, don’t worry.)
Let’s go with the simplest model that requires NO training!!!
The algorithm: Preprocessing: Split data up into 2 main groups:
- training set
- test set
Training time: do nothing. Test time: you are given an unlabeled point (from the test set). To predict its label, look at the labels of the k nearest training points (in the training set) and make an estimate.
- Categorical data: majority vote on the class.
- Numerical data: take an average. Perhaps weighted by distance.
What does it mean to be “near”? The K-Nearest Neighbors algorithm generally uses Euclidean distance to quantify how near or far two points in the data are from each other.
Let $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$ be two points with $n$ features each. Then the Euclidean distance between them is

$d(p, q) = \sqrt{(p_1 - q_1)^2 + \cdots + (p_n - q_n)^2}$
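The test-time step described above can be sketched as follows, using Euclidean distance and a majority vote for categorical labels (the function names and sample points are illustrative):

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two n-dimensional points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, query, k):
    """Label a query point by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; `query` is an unlabeled point.
    """
    neighbors = sorted(train, key=lambda pl: euclidean(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

For numerical targets, the same neighbor search applies, but the final step would average the neighbors’ values instead of voting.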
What is K? K is a hyperparameter. You choose it. Here's a nice visualization of the effect of K on classification.
- If K = 1, you classify a point based on only its 1 nearest neighbor. Moving a little can cause you to flip classes (high variance). But each training point is classified correctly (low bias).
- If K = N (# training points), every point gets the same classification, since it looks at the entire training set. Low variance. But you will surely get many things wrong (high bias).
So… how do we determine the best value for K?
Solution: try a bunch of them…
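One common way to “try a bunch of them” is to hold out a validation set and keep the K with the best accuracy – a minimal sketch, assuming a simple majority-vote KNN predictor (all names and sample points are illustrative):

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, query, k):
    neighbors = sorted(train, key=lambda pl: euclidean(pl[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def best_k(train, validation, candidates):
    """Return the K that labels the most held-out validation points correctly."""
    def accuracy(k):
        hits = sum(knn_predict(train, point, k) == label
                   for point, label in validation)
        return hits / len(validation)
    return max(candidates, key=accuracy)
```

In practice you would repeat this over several train/validation splits (cross-validation) rather than trusting a single split.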
We are going to tackle the task of predicting whether or not someone has diabetes.
We will be using the diabetes data set, which originated from the UCI Machine Learning Repository.
Go to https://github.com/medtech-berkeley/ML-for-Healthcare and clone the repository.
Make sure you have Python and Jupyter installed on your computer.
```shell
python3 -m pip install --upgrade pip
python3 -m pip install jupyter
```
Once you have cloned the repository, navigate to it inside your terminal and run:
```shell
pip3 install -r requirements.txt
```
Now you should be good to go!!
(If this part is proving difficult, you can download Python through Anaconda, and that should take care of everything.)