Add Data analysis steps: data-cleaning, data-outlier-detection #30

Open: 1 commit to be merged into base: master

@memona008 memona008 commented Mar 19, 2024

This pull request adds two new library steps to the project: one for data cleaning and one for outlier detection.

Step 1: Data Cleaning

Implemented a data cleaning step that accepts the following parameters:

  • remove_null: when enabled, removes null values from the dataset.
  • null_lookup_columns: the columns to check for null values.
  • duplicate_lookup_columns: the columns to compare when looking for duplicate values.
  • clear_formatting: when enabled, clears formatting from the dataset for consistency.
  • output_file_name: the name and path of the cleaned output file.
  • remove_duplicate_rows: when enabled, removes duplicate rows.
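
To illustrate how these parameters could interact, here is a minimal sketch using only the Python standard library. It is not the code from this PR: the function name `clean_rows`, the row-as-dict representation, treating empty strings as null, and reading "formatting" as surrounding whitespace are all assumptions made for the example. Writing to `output_file_name` is left out.

```python
from typing import Any, Optional


def clean_rows(
    rows: list[dict[str, Any]],
    remove_null: bool = True,
    null_lookup_columns: Optional[list[str]] = None,
    remove_duplicate_rows: bool = True,
    duplicate_lookup_columns: Optional[list[str]] = None,
    clear_formatting: bool = False,
) -> list[dict[str, Any]]:
    """Hypothetical cleaning step; parameter names mirror the PR description."""
    cleaned = []
    seen = set()  # keys of rows already kept, used for duplicate detection
    for row in rows:
        if clear_formatting:
            # Assumption: "formatting" means leading/trailing whitespace.
            row = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}

        # With no lookup columns given, every column is checked.
        null_cols = null_lookup_columns or list(row)
        if remove_null and any(row.get(c) in (None, "") for c in null_cols):
            continue

        if remove_duplicate_rows:
            dup_cols = duplicate_lookup_columns or list(row)
            key = tuple(row.get(c) for c in dup_cols)  # values must be hashable
            if key in seen:
                continue
            seen.add(key)

        cleaned.append(row)
    return cleaned


rows = [
    {"id": 1, "name": " Ann "},
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": None},
    {"id": 3, "name": "Bob"},
]
print(clean_rows(rows, clear_formatting=True))
```

In this sketch, restricting `duplicate_lookup_columns` to `["id"]` would treat the first two rows as duplicates even without `clear_formatting`, which is the kind of flexibility the lookup parameters appear intended to provide.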

Step 2: Outlier Detection

Developed an outlier detection step employing four methods:

Z-score:

Identifies outliers by how many standard deviations a value lies from the mean.
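
A minimal sketch of the Z-score rule, assuming a single numeric column and the population standard deviation. The function name `zscore_outliers` and the default threshold of 3 are illustrative assumptions, not taken from this PR.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose |z-score| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(zscore_outliers(data, threshold=2.5))
```

With very small samples a threshold of 3 may never trigger: when the population standard deviation is used, a single point's z-score cannot exceed the square root of n - 1, which is exactly 3 for the ten values above. That is why the usage line lowers the threshold to 2.5.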

IQR (Interquartile Range):

Detects outliers using the range between the first and third quartiles.
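
A minimal sketch of the IQR rule, assuming the common convention of flagging values more than 1.5 × IQR outside the quartiles. The multiplier `k`, the quartile interpolation method, and the function name `iqr_outliers` are assumptions for illustration; the PR may use different choices.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(iqr_outliers(data))
```

Because quartiles are barely affected by a single extreme value, this rule tends to be more robust than the Z-score on small or skewed samples.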

Isolation Forest:

Uses an ensemble of random isolation trees; points that can be isolated with fewer splits are flagged as anomalies.

Autoencoder:

Trains a neural network to reconstruct the input data and flags points with a high reconstruction error as outliers.


Additionally, the step generates visualizations to aid outlier analysis and interpretation:

  • Scatter plot
  • Box plot
  • Histogram
