MachineLearningSamples

This repo hosts a variety of examples based on Apache Spark MLlib.

Databricks Notebooks

Decision Tree

Census Income Decision Tree

Census Income Random Decision Forest

Scala IDE Based Examples

Decision Tree

A vanilla decision tree example.

Decision Tree with Stratified Sampling

How to draw a stratified sample so that the train and test datasets are sampled across all possible values.
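
The idea of a stratified split can be sketched in plain Scala, without Spark. This is only an illustration and not the code used in this repo: it assumes stratification is done on the class label, and the names `StratifiedSplitSketch`, `split`, `trainFraction` and `labelOf` are hypothetical. Each label group is shuffled and split separately, so every label keeps roughly the same share in the train and test parts.

```scala
import scala.util.Random

object StratifiedSplitSketch {
  // Splits rows into (train, test) so that each label keeps roughly the same
  // share in both parts. `labelOf` extracts the class label from a row.
  def split[A, L](rows: Seq[A], trainFraction: Double, seed: Long)(
      labelOf: A => L): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    // One (train, test) pair per label value.
    val perLabel = rows.groupBy(labelOf).values.toSeq.map { group =>
      val shuffled = rng.shuffle(group)
      val nTrain = math.round(group.size * trainFraction).toInt
      shuffled.splitAt(nTrain)
    }
    (perLabel.flatMap(_._1), perLabel.flatMap(_._2))
  }
}
```

With eight rows of one label and four of another, a `trainFraction` of 0.75 puts six and three rows into the train part, which preserves the 2:1 ratio of the labels in both parts.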

Decision Tree with Categorical Feature in the DataSet

How to index and encode categorical features.
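
As a rough illustration of what indexing and encoding mean, here is a minimal plain-Scala sketch. It does not use Spark and is not the repo's implementation; `CategoricalEncodingSketch`, `buildIndex` and `oneHot` are hypothetical names. Indexing maps each category string to an integer (here the most frequent category gets index 0), and encoding turns that integer into a one-hot vector.

```scala
object CategoricalEncodingSketch {
  // Assigns index 0 to the most frequent category, 1 to the next, and so on.
  // Ties are broken alphabetically so the result is deterministic.
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.groupBy(identity).toSeq
      .map { case (category, occurrences) => (category, occurrences.size) }
      .sortBy { case (category, count) => (-count, category) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // Turns a category index into a one-hot vector of length `numCategories`.
  def oneHot(index: Int, numCategories: Int): Vector[Double] =
    Vector.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)
}
```

A column would first be passed through `buildIndex` once, and each row's value would then be looked up in the resulting map and passed to `oneHot`.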

Predicting Income Based on Census Data Using Decision Tree

How to handle multiple categorical and continuous features on a real-life data set. Uses the Census Income data set.

Predicting Income Based on Census Data Using Random Decision Forest

How to handle multiple categorical and continuous features on a real-life data set. Uses the Census Income data set.

Predicting Income Based on Census Data Using Random Decision Forest With Surrogate Decision Tree

How to handle multiple categorical and continuous features on a real-life data set. Uses the Census Income data set.

Data Sets References

Census Income DataSet

The first line of the adult.test file was removed so that the file can be loaded into Spark.
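
For anyone starting from a fresh download, the same preparation step could be reproduced with a few lines of Scala. This is a sketch under assumptions: the object and method names are hypothetical, the file is assumed to be UTF-8 compatible text, and the output is written to a separate file rather than in place.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object DropFirstLineSketch {
  // Copies `input` to `output`, skipping the first line of the input file.
  def dropFirstLine(input: File, output: File): Unit = {
    val source = Source.fromFile(input, "UTF-8")
    val writer = new PrintWriter(output, "UTF-8")
    try source.getLines().drop(1).foreach(line => writer.println(line))
    finally {
      writer.close()
      source.close()
    }
  }
}
```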

Census Income data set citation: Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Default of Credit Card Clients DataSet

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
X2: Gender (1 = male; 2 = female).
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
X4: Marital status (1 = married; 2 = single; 3 = others).
X5: Age (year).
X6-X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; ...; X17 = amount of bill statement in April, 2005.
X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; ...; X23 = amount paid in April, 2005.
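
The repayment status scale used by X6 to X11 can be written down as a small lookup, which may help when inspecting rows by hand. This is a minimal sketch and not part of the repo; `RepaymentStatusSketch` and `describe` are hypothetical names, and any code outside the scale documented above is simply reported as undocumented.

```scala
object RepaymentStatusSketch {
  // Translates a repayment status code (columns X6 to X11) into the
  // description given in the data set documentation.
  def describe(code: Int): String = code match {
    case -1                    => "pay duly"
    case 9                     => "payment delay for nine months and above"
    case n if n >= 1 && n <= 8 => s"payment delay for $n month(s)"
    case other                 => s"undocumented code $other"
  }
}
```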
