🔧 A useful preprocessing pipeline that handles heterogeneous data such as binary, categorical, and numerical features

datudar/data-prep-pipeline

Data Preprocessing Pipeline

This is a preprocessing pipeline for heterogeneous data, that is, data that mixes binary, categorical, and numerical features. The steps in this pipeline are for demonstration only, so you should adapt them to the needs of your own analysis.

Data

The example input file contains ten made-up samples. Each sample has one target column, y, and eight feature columns, X, which cover several data types.

Target (y)

  • The target column has two classes: the positive class and the negative class, which are labeled 1 and 0, respectively

Features (X)

  • Binary (features 1 and 2)
    • These features contain only ones and zeros, and we keep their values as they are
  • Categorical
    • Numerical (features 3 and 4): These are features that have at least three numerical classes and have no order
    • Textual (features 5 and 6): These are features that have at least three textual classes
    • We transform these features into dummy variables of ones and zeros
    • Due to multicollinearity concerns, we also drop one of the dummy variables, so a feature with n classes is left with n-1 dummy variables
  • Numerical (features 7 and 8)
    • These features are typically integers or floats
    • We apply normalization on these values
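
The categorical and numerical treatments above can be illustrated with a small, library-free sketch. It is not the repository's code: the function names `dummy_encode` and `min_max_normalize`, the sample values, and the choice of min-max scaling as the form of normalization are all assumptions made for illustration.

```python
def dummy_encode(values):
    """One-hot encode a categorical column, dropping the first class.

    Returns (kept_classes, rows). A column with n distinct classes
    yields n-1 dummy variables per row.
    """
    classes = sorted(set(values), key=str)  # deterministic class order
    kept = classes[1:]                      # drop one class (multicollinearity)
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


def min_max_normalize(values):
    """Rescale numbers to [0, 1]; min-max scaling is an assumed choice."""
    lo, hi = min(values), max(values)
    if hi == lo:                            # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


colors = ["red", "green", "blue", "green"]  # hypothetical textual feature
kept, dummies = dummy_encode(colors)
print(kept)     # ['green', 'red']
print(dummies)  # [[0, 1], [1, 0], [0, 0], [1, 0]]

heights = [150, 160, 170, 200]              # hypothetical numerical feature
print(min_max_normalize(heights))           # [0.0, 0.2, 0.4, 1.0]
```

The dropped class ("blue" here) is represented by a row of all zeros, which is why n-1 columns are enough to encode n classes.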

How It Works

The pipeline reads in the input file and performs a few basic preprocessing steps on the features. First, the data is upsampled, because the example is intentionally imbalanced (there are only two examples of the positive class). Then, the data is fed through a pipeline which performs:

  1. imputation of missing values
  2. feature engineering by creating dummy variables, adding polynomial features, and adding interaction features
  3. transformation using normalization scaling
  4. feature selection by placing a minimum variability requirement
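
The four steps can be sketched end to end on numerical columns using only the Python standard library. This is a simplified illustration under stated assumptions, not the repository's implementation: mean imputation, degree-2 terms, min-max scaling, a zero-variance cutoff, and every function name are hypothetical choices, and dummy-variable creation is left out to keep the example short.

```python
from itertools import combinations
from statistics import mean, pvariance


def impute_mean(column):
    """Step 1: replace None entries with the mean of the observed values."""
    fill = mean(v for v in column if v is not None)
    return [fill if v is None else v for v in column]


def add_polynomial_and_interactions(columns):
    """Step 2: append each column's square and every pairwise product."""
    squares = [[v * v for v in col] for col in columns]
    products = [[a * b for a, b in zip(c1, c2)]
                for c1, c2 in combinations(columns, 2)]
    return columns + squares + products


def min_max_scale(column):
    """Step 3: rescale a column to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]


def select_by_variance(columns, threshold=0.0):
    """Step 4: keep columns whose population variance exceeds the threshold."""
    return [col for col in columns if pvariance(col) > threshold]


def run_pipeline(columns, threshold=0.0):
    columns = [impute_mean(col) for col in columns]
    columns = add_polynomial_and_interactions(columns)
    columns = [min_max_scale(col) for col in columns]
    return select_by_variance(columns, threshold)


# Hypothetical data, stored column-wise; None marks a missing value.
raw = [
    [1.0, None, 3.0, 2.0],
    [10.0, 20.0, None, 30.0],
    [5.0, 5.0, 5.0, 5.0],   # constant column, removed by step 4
]
result = run_pipeline(raw)
print(len(result))  # 7
```

Three input columns expand to nine (three originals, three squares, three pairwise products). The constant column and its square carry no variability, so seven columns survive the selection step.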

Finally, the pipeline outputs thirteen samples with sixteen features. The outputs, y and X, are then ready for further analysis.
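
The upsampling that precedes the pipeline can be sketched as random sampling with replacement from the minority class. This is one possible approach, not necessarily the one the repository uses. The function name `upsample_minority`, its `n_minority` parameter, the fixed seed, and the target of four positive examples are all hypothetical; the README does not state the ratio it upsamples to.

```python
import random


def upsample_minority(rows, labels, n_minority, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until there are n_minority of them.

    Original rows are all kept; extra rows are drawn with replacement.
    """
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    if not minority_idx:
        raise ValueError("no minority-class rows to sample from")
    rng = random.Random(seed)                       # seeded for reproducibility
    n_extra = max(n_minority - len(minority_idx), 0)
    extra = rng.choices(minority_idx, k=n_extra)    # sampling with replacement
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Ten hypothetical samples: eight negatives and two positives.
labels = [0] * 8 + [1] * 2
rows = [[i] for i in range(10)]

X_up, y_up = upsample_minority(rows, labels, n_minority=4)
print(len(y_up), y_up.count(1))  # 12 4
```

When a held-out test set is used, upsampling is usually applied only to the training split, because duplicated rows that land in both splits would make evaluation overly optimistic.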

License

This project is licensed under the MIT License.
