GitHub - datudar/data-prep-pipeline: A useful preprocessing pipeline that handles heterogeneous data such as binary, categorical, and numerical features

Data Preprocessing Pipeline

This is a preprocessing pipeline for handling heterogeneous data such as binary, categorical, and numerical data. The steps in this particular pipeline are purely for demonstration purposes, so it is highly recommended you modify the pipeline to suit the needs of your analysis.

Data

The example input file contains ten made-up samples with one target column, y, and eight feature columns, X, of various data types.

Target (y)

The target column has two classes, positive and negative, which are labeled 1 and 0, respectively.

Features (X)

Binary (features 1 and 2)
- These features contain only ones and zeros, and we keep their values as they are
Categorical
- Numerical (features 3 and 4): These are features that have at least three numerical classes with no inherent order
- Textual (features 5 and 6): These are features that have at least three textual classes
- We transform both kinds of categorical features into dummy variables of ones and zeros
- Due to multicollinearity concerns, we also drop one dummy variable per feature, so a feature with n classes is left with n-1 dummy variables
Numerical (features 7 and 8)
- These features are typically integers or floats
- We normalize these values
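
The categorical and numerical treatments above can be sketched with pandas. This is a minimal illustration, not the repository's code: the column names (`x3`, `x5`, `x7`) and the toy values are hypothetical, and min-max scaling is an assumption, since the README does not say which normalization is used.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "x3": [1, 2, 3, 1],                      # categorical, unordered numerical codes
    "x5": ["red", "green", "blue", "red"],   # categorical, textual
    "x7": [10.0, 20.0, 30.0, 40.0],          # numerical
})

# Create dummy variables and drop the first level of each feature,
# leaving n-1 dummy columns for a feature with n classes.
encoded = pd.get_dummies(df, columns=["x3", "x5"], drop_first=True, dtype=int)

# Min-max normalization of the numerical column to [0, 1] (assumed method).
col = encoded["x7"]
encoded["x7"] = (col - col.min()) / (col.max() - col.min())
```

Each three-class feature contributes two dummy columns here; the dropped level is the implicit baseline.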

How It Works

The pipeline reads in the input file and performs a few basic preprocessing steps on the features. First, the data is upsampled, because the example data is intentionally imbalanced (there are only two examples of the positive class). Then, the data is fed through a pipeline that performs:

- imputation of missing values
- feature engineering by creating dummy variables, adding polynomial features, and adding interaction features
- transformation using normalization scaling
- feature selection by placing a minimum variability requirement
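
The four steps above could be wired together with scikit-learn roughly as follows. This is a sketch under stated assumptions rather than the contents of preprocessing_pipeline.py: the toy matrix, the column indices, the imputation strategies, the use of `MinMaxScaler`, and the variance threshold of 0.01 are all illustrative choices.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, PolynomialFeatures

# Hypothetical toy matrix: column 0 is binary, column 1 is categorical,
# columns 2 and 3 are numerical. np.nan marks missing values.
X = np.array([
    [1, 0, 2.0, 10.0],
    [0, 1, np.nan, 20.0],
    [1, 2, 4.0, np.nan],
    [0, 0, 6.0, 40.0],
    [1, 1, 8.0, 50.0],
])

# Categorical: impute, then create n-1 dummy variables per feature.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("dummies", OneHotEncoder(drop="first")),
])

# Numerical: impute, add squared and interaction terms, then scale to [0, 1].
numerical = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", MinMaxScaler()),
])

preprocess = ColumnTransformer(
    [
        ("binary", SimpleImputer(strategy="most_frequent"), [0]),
        ("categorical", categorical, [1]),
        ("numerical", numerical, [2, 3]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Feature selection: drop columns whose variance falls below the threshold.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("select", VarianceThreshold(threshold=0.01)),
])

X_out = pipeline.fit_transform(X)
```

Routing each column group through its own sub-pipeline keeps the binary features untouched apart from imputation, while only the numerical features receive polynomial and interaction terms.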

Finally, the pipeline outputs thirteen samples with sixteen features. The outputs, y and X, are now ready for further analysis.
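
The upsampling step that precedes the pipeline can be illustrated with `sklearn.utils.resample`. This is only a sketch: the toy arrays are hypothetical, and drawing enough minority samples to fully balance the classes is an assumption that may differ from the sample count the actual script produces.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced data: eight negative samples and two positive samples.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X = np.arange(20).reshape(10, 2)

X_neg, X_pos = X[y == 0], X[y == 1]

# Sample the minority class with replacement until it matches the majority
# class in size (assumed target; other target sizes are equally valid).
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=0)

X_balanced = np.vstack([X_neg, X_pos_up])
y_balanced = np.concatenate([
    np.zeros(len(X_neg), dtype=int),
    np.ones(len(X_pos_up), dtype=int),
])
```

Upsampling only duplicates existing minority rows, so it should be applied to training data only, before any evaluation split is scored.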

License

This project is licensed under the MIT License.
