Simple Data Workflow

A simple end-to-end data workflow (preprocessing, modelling, visualization) orchestrated using Prefect tasks and flows.

🚀 Quickstart

The easiest way to get started is to clone the repo:

git clone git@github.com:topher-lo/simple-data-workflow.git

Then install its dependencies using pip:

pip install -r requirements.txt

✨ Quick Example

`instance` `e2e_pipeline`

An end-to-end data workflow that:

Downloads data from a URL;
Cleans data;
Runs linear regression;
Plots regression results as a box-and-whisker chart.

from src.flow import e2e_pipeline

# Flow parameters
kwargs = {
    'url': 'https://vincentarelbundock.github.io/Rdatasets/csv/stevedata/fakeTSD.csv',
    'cat_cols': ['year'],  # List of categorical variables in dataset
    'na_strategy': 'mice',  # Method to deal with missing values
    'transf_cols': ['x1', 'x2'],  # Variables to apply transformation on
    'transf_func': 'arcsinh',  # Transformation function
    'endog': 'y',  # Endogenous (outcome) variable
    'exog': ['x1', 'x2']  # Exogenous (feature) variables
}

# Execute flow
state = e2e_pipeline.run(**kwargs)

# Check if flow run was successful
if state.is_successful():
    
    # Get task's reference ID from its name
    task_name = 'plot_confidence_intervals'
    task_ref = e2e_pipeline.get_tasks(name=task_name)[0]
    
    # Get the Altair chart returned by the task
    conf_int_chart = state.result[task_ref].result

🎛 Tasks API

These are individual data tasks that make up each part (i.e. preprocessing, modelling, post-processing) of the end-to-end data flow.

Preprocessing:

`function` `sanitize_col_names`

Sanitizes each string in a list by: 1. stripping all whitespace at the start and end; 2. replacing any excess whitespace with an underscore; and 3. lower-casing all characters.

Parameters:

cols (List[str]): List of strings (e.g. column names) to sanitize.

Returns: Sanitized list of strings (e.g. column names).
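
As a rough illustration of the three steps above, here is a minimal sketch. The function name `sanitize_col_names_sketch` is hypothetical, and reading "excess whitespace" as any run of interior whitespace is an assumption; the actual task in `src` may differ.

```python
import re
from typing import List


def sanitize_col_names_sketch(cols: List[str]) -> List[str]:
    """Hypothetical re-implementation of the three steps described above."""
    sanitized = []
    for col in cols:
        col = col.strip()               # 1. trim leading/trailing whitespace
        col = re.sub(r"\s+", "_", col)  # 2. replace each whitespace run with "_" (assumed reading)
        col = col.lower()               # 3. lower-case all characters
        sanitized.append(col)
    return sanitized


print(sanitize_col_names_sketch(["  Total  Sales ", "Year"]))
# → ['total_sales', 'year']
```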

`function` `retrieve_data`

Reads data (from url string) into a DataFrame.

Parameters:

url (str): URL to data. Data is a delimiter-separated text file.
sep (str): Delimiter to use.
nrows (int): Number of rows of the file to read.

Returns: The delimiter-separated text file as a Pandas DataFrame.
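
The behaviour described above maps closely onto `pandas.read_csv`, so a sketch could look like the following. The name `retrieve_data_sketch` and the defaults for `sep` and `nrows` are assumptions, not the task's actual signature. An in-memory buffer stands in for a remote URL so the example runs offline.

```python
import io

import pandas as pd


def retrieve_data_sketch(url, sep=",", nrows=None):
    """Hypothetical stand-in: read a delimiter-separated file into a DataFrame."""
    # pd.read_csv accepts a URL string, a local path, or a file-like object
    return pd.read_csv(url, sep=sep, nrows=nrows)


# Semicolon-separated text in a buffer, standing in for a remote file
buffer = io.StringIO("id;x1;x2\n1;0.5;2.0\n2;1.5;3.0\n3;2.5;4.0\n")
df = retrieve_data_sketch(buffer, sep=";", nrows=2)
print(df.shape)
# → (2, 3)
```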

`function` `_column_wrangler`

Returns DataFrame with columns transformed into a consistent format (see sanitize_col_names).

Parameters:

data (pd.DataFrame): The data.

Returns: DataFrame with sanitized column names.

`function` `_obj_wrangler`

Converts columns with object dtype into StringDtype.

Parameters:

data (pd.DataFrame): The data.

Returns: A copy of the inputted Pandas DataFrame with any object dtype columns cast as StringDtype.

`function` `_factor_wrangler`

Converts the columns listed in cat_cols into CategoricalDtype.

Parameters:

data (pd.DataFrame): The data.
cat_cols (list of str): List of columns to convert to CategoricalDtype.
ordered_cols (list of str): List of categorical columns whose categories have an ordered relationship.
categories (dict of [str, int, float]): Dictionary with column names as keys and lists of str, int, or float (the categories for that column) as values.
str_to_cat (bool): If True, converts all StringDtype columns to CategoricalDtype.
dummy_to_bool (bool): If True, converts all columns with integer [0, 1] values or float [0.0, 1.0] values into BooleanDtype.

Returns: A copy of the inputted Pandas DataFrame. Converts specified columns to CategoricalDtype, both ordered and unordered, and sets specified categorical columns' categories. All other columns' dtypes are unchanged.
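
For readers unfamiliar with pandas categoricals, the snippet below sketches the two kinds of conversion this task performs, using plain pandas rather than `_factor_wrangler` itself. The column names and category lists are made up for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({
    "year": [2019, 2020, 2019],
    "size": ["small", "large", "medium"],
})

# Unordered categorical: categories are inferred from the observed values
df["year"] = df["year"].astype("category")

# Ordered categorical with an explicit list of categories
size_dtype = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
df["size"] = df["size"].astype(size_dtype)

print(df["year"].cat.ordered)         # → False
print(df["size"].cat.ordered)         # → True
print(df["size"].cat.codes.tolist())  # → [0, 2, 1]
```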

`function` `_check_model_assumptions`

Empty function to be implemented.

`function` `clean_data`

Data preprocessing pipeline. Runs the following data wranglers on data:

convert_dtypes
_replace_na
_column_wrangler
_obj_wrangler
_factor_wrangler
_check_model_assumptions

Parameters:

data (pd.DataFrame): The data.
na_values (list of str, int, or float): List of values to replace with NA.
kwargs: Keyword arguments passed to _factor_wrangler.

Returns: The preprocessed data.

`function` `encode_data`

Transforms columns with unordered CategoricalDtype into dummy columns, which are cast as BooleanDtype. Transforms columns with ordered CategoricalDtype into their integer category codes.

Parameters:

data (pd.DataFrame): The data.

Returns: The encoded data.
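
The encoding rule above can be sketched in plain pandas as follows. This is not the code of `encode_data`: the example data is invented, and details such as dummy column naming or whether one dummy level is dropped are assumptions (here every level is kept).

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({
    "region": pd.Categorical(["north", "south", "north"]),
    "size": pd.Categorical(
        ["small", "large", "medium"],
        categories=["small", "medium", "large"],
        ordered=True,
    ),
    "y": [1.0, 2.0, 3.0],
})

# Split categorical columns by whether their categories are ordered
cat_cols = [c for c in df.columns if isinstance(df[c].dtype, CategoricalDtype)]
ordered = [c for c in cat_cols if df[c].cat.ordered]
unordered = [c for c in cat_cols if not df[c].cat.ordered]

encoded = df.copy()

# Ordered categoricals become their integer category codes
for col in ordered:
    encoded[col] = encoded[col].cat.codes

# Unordered categoricals become dummy columns cast to the nullable boolean dtype
dummies = pd.get_dummies(encoded[unordered]).astype(bool).astype("boolean")
encoded = pd.concat([encoded.drop(columns=unordered), dummies], axis=1)

print(list(encoded.columns))
# → ['size', 'y', 'region_north', 'region_south']
```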

`function` `wrangle_na`

Wrangles missing values. Five strategies are available: complete case ("cc"), fill-in ("fi"), fill-in with indicators ("fii"), grand model ("gm"), and MICE ("mice").

Parameters:

data (pd.DataFrame): The data.
strategy (str): Strategy to deal with missing values.
cols (list of str): Columns to wrangle.

Returns: The data with missing data wrangled according to the specified strategy.
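
To make two of these strategies concrete, the snippet below sketches complete case and fill-in with indicators in plain pandas. It does not call `wrangle_na`. The `_missing` suffix for indicator columns and the use of the column mean as the fill value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [1.0, np.nan, 3.0], "x2": [4.0, 5.0, np.nan]})

# Complete case ("cc"): keep only rows with no missing values
complete_case = df.dropna()

# Fill-in with indicators ("fii"): flag missing entries, then fill the gaps
filled = df.copy()
for col in ["x1", "x2"]:
    filled[col + "_missing"] = filled[col].isna()          # hypothetical indicator name
    filled[col] = filled[col].fillna(filled[col].mean())   # mean fill is an assumption

print(len(complete_case))     # → 1
print(filled["x1"].tolist())  # → [1.0, 2.0, 3.0]
```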

`function` `transform_data`

Applies either a log or an inverse hyperbolic sine (arcsinh) transformation to the data.

Parameters:

data (pd.DataFrame): The data.
cols (list of str): Columns to transform.
func (str): log transform ("log") or inverse hyperbolic sine transform ("arcsinh").

Returns: The data with transformation applied to specified columns.
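
A minimal sketch of the two transformations using NumPy is shown below. The helper name `transform_sketch` and the lookup table are hypothetical. One practical difference worth noting: arcsinh is defined at zero and for negative values, whereas log is not.

```python
import numpy as np
import pandas as pd

# Map the documented func strings to NumPy functions
TRANSFORMS = {"log": np.log, "arcsinh": np.arcsinh}


def transform_sketch(data, cols, func):
    """Hypothetical stand-in: apply the named transformation to selected columns."""
    out = data.copy()
    out[cols] = out[cols].apply(TRANSFORMS[func])
    return out


df = pd.DataFrame({"x1": [0.0, 1.0, 10.0], "y": [1.0, 2.0, 3.0]})
result = transform_sketch(df, ["x1"], "arcsinh")

print(result["x1"].iloc[0])  # → 0.0, since arcsinh(0) = 0
```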

`function` `gelman_standardize_data`

Standardizes data by mean-centering each variable and dividing it by two standard deviations.

Parameters:

data (pd.DataFrame): The data.

Returns: The standardized data.
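
The rescaling described above can be sketched in one line of pandas. The function name is hypothetical, and two details are assumptions: the sample standard deviation (pandas' default, `ddof=1`) is used, and every column is treated as numeric. The actual task may handle binary or categorical columns differently.

```python
import pandas as pd


def gelman_standardize_sketch(data):
    """Hypothetical stand-in: mean-center, then divide by two standard deviations."""
    return (data - data.mean()) / (2 * data.std())


df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0, 5.0]})
z = gelman_standardize_sketch(df)

# After rescaling, each column has mean 0 and standard deviation 0.5
print(round(z["x1"].std(), 6))  # → 0.5
```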

Modelling:

`function` `run_model`

Fits a linear regression (OLS) using statsmodels.

Parameters:

data (pd.DataFrame): The data.
y (str): Endogenous (outcome) variable.
X (list of str): Exogenous (feature) variables.

Returns: The fitted regression results (a statsmodels RegressionResultsWrapper).

Post-processing:

`function` `plot_confidence_intervals`

Given a fitted OLS model in statsmodels, returns a box and whisker regression coefficient plot.

Parameters:

res (RegressionResultsWrapper): regression results from statsmodels OLS.

Returns: An Altair chart containing a box-and-whisker plot of the regression coefficients' point estimates and confidence intervals.

