Data Science Experiment 2 - Data Preprocessing and Cleaning

This repository contains the code and results for the second experiment in the Data Science Fundamentals with Python course. The objective of this experiment is to preprocess and clean a dataset obtained from the UCI Machine Learning Repository.

Experiment Overview

Steps:

Loading the Dataset: The dataset is loaded using the Pandas library.
Inspecting the Dataset: Key statistical measures, such as the mean of 'age' and 'marks', are calculated.
Cleaning Column Names: Leading or trailing spaces in column names are removed.
Standardizing Column Data: The 'name' column is standardized by converting each name to title case (first letter of every word capitalized, the rest lowercased).
Handling Duplicate Rows: Duplicate rows are identified and removed.
Handling Missing Values: Missing values are handled through imputation using the mean, median, or mode, as appropriate.
Forward and Backward Fill: Missing values are also handled using forward fill and backward fill techniques.
Saving the Cleaned Dataset: The cleaned dataset is saved to a CSV file.
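
The steps above can be sketched as one function. This is a minimal illustration rather than the notebook's actual code: the `clean` helper and the toy rows are hypothetical, and only the column names ('name', 'age', 'marks ') follow the ones used later in this README.

```python
import numpy as np
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps in order and return a new DataFrame."""
    out = df.copy()
    out.columns = out.columns.str.strip()        # 'marks ' -> 'marks'
    out["name"] = out["name"].str.title()        # 'asha' -> 'Asha'
    # reset_index gives the de-duplicated frame a fresh 0..n-1 index
    out = out.drop_duplicates().reset_index(drop=True)
    out["marks"] = out["marks"].fillna(out["marks"].mean())
    out["age"] = out["age"].fillna(out["age"].median())
    # Forward fill, then backward fill, catches anything still missing
    return out.ffill().bfill()


# Hypothetical toy data: one duplicate row, one missing age, one missing mark
raw = pd.DataFrame({
    "name": ["asha", "ravi", "asha", "meera"],
    "age": [20.0, np.nan, 20.0, 22.0],
    "marks ": [80.0, 60.0, 80.0, np.nan],
})

cleaned = clean(raw)
print(cleaned)
```

Keeping the steps in a function that works on a copy leaves the raw DataFrame untouched, which makes it easy to compare before and after.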

Concepts Used

Pandas Library: Used for data manipulation and analysis.
Data Cleaning: Techniques such as removing duplicates, handling missing values, and standardizing data.
Imputation: Filling missing data using statistical measures like mean, median, or mode.
Forward Fill and Backward Fill: Techniques to fill missing values by propagating previous or subsequent values.
DataFrame Operations: Methods to inspect and clean the dataset.

Steps to Reproduce

Set up Google Colab:
- Open Google Colab.
- Create a new notebook.
Import Necessary Libraries:
- Start by importing the required libraries.
```
import pandas as pd
import matplotlib.pyplot as plt
```

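Load the Dataset:
- Read the CSV file from its path on Google Drive.
- If you run this in Colab, mount Google Drive first (from google.colab import drive, then drive.mount('/content/drive')) so that the path resolves; when running locally, point the path at your own copy of the file.
```
# Path to the dataset on the mounted Google Drive
dataset1 = '/content/drive/MyDrive/Ds Data Sets/student3.csv'
df = pd.read_csv(dataset1)
```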

Inspect the Dataset:
- Display the first few rows of the dataset and calculate key statistics.
```
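display(df.head())  # first five rows
display(df['age'].mean())
# At this point the column is still named 'marks ' (trailing space); the next step strips it.
display(df['marks '].mean())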
```
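
Beyond individual means, a quick summary of every numeric column and a count of missing values helps decide how to clean the data. The sketch below uses a hypothetical toy DataFrame, not the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "age": [20.0, 22.0, np.nan, 24.0],
    "marks": [80.0, 60.0, 70.0, np.nan],
})

summary = toy.describe()    # count, mean, std, min, quartiles, max per column
missing = toy.isna().sum()  # number of missing values per column

print(summary.loc[["count", "mean"]])
print(missing)
```

Note that `describe()` and `mean()` skip missing values by default, so the reported count is the number of non-null entries.
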
Clean Column Names:
- Remove leading or trailing spaces from column names.
```
df.columns = df.columns.str.strip()
df.columns
```
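
Stray spaces in column names are easy to miss because they are invisible when printed. The sketch below, on a hypothetical one-row DataFrame, shows how `repr` exposes them and why a lookup by the clean name fails until the names are stripped.

```python
import pandas as pd

# Hypothetical header with a leading and a trailing space
toy = pd.DataFrame({" name": ["asha"], "marks ": [80]})

# repr() puts quotes around each name, making the spaces visible
print([repr(c) for c in toy.columns])

# Looking up the clean name fails while the space is still there
try:
    toy["marks"]
    found_before = True
except KeyError:
    found_before = False

toy.columns = toy.columns.str.strip()
found_after = "marks" in toy.columns

print(found_before, found_after)
```
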
Standardize Column Data:
- Convert each value in the 'name' column to title case: the first letter of every word is capitalized and the remaining letters are lowercased.
```
df['name'] = df['name'].str.title()
df['name']
```
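
Title casing has a few edge cases worth knowing. The following sketch uses hypothetical names (not taken from the dataset) and also strips surrounding whitespace first, which the notebook step above does not do.

```python
import pandas as pd

# Hypothetical names chosen to show edge cases
names = pd.Series([" alice", "BOB smith", "mary-ann", "o'neil", None])

# Strip stray whitespace, then convert to title case.
# Missing values pass through unchanged.
cleaned = names.str.strip().str.title()

print(cleaned.tolist())
```

`str.title()` capitalizes the letter after any non-letter character, so "mary-ann" becomes "Mary-Ann" and "o'neil" becomes "O'Neil". Names such as "mcdonald" come out as "Mcdonald", so manual review may still be needed.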

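Handle Duplicate Rows:
- Identify and remove duplicate rows.
```
print(df.duplicated().sum())  # number of duplicate rows before removal
df.drop_duplicates(inplace=True)
print(df.duplicated().sum())  # 0 once the duplicates are removed
```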
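
To make the duplicate handling concrete, here is a small sketch on hypothetical data. It also resets the index afterwards, an optional extra that the notebook step does not perform.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first
toy = pd.DataFrame({
    "name": ["Asha", "Ravi", "Asha", "Meera"],
    "age": [20, 21, 20, 22],
})

# duplicated() marks every repeat of an earlier row as True
dup_mask = toy.duplicated()
print(dup_mask.tolist())

# drop_duplicates() keeps the first occurrence by default;
# reset_index(drop=True) closes the gap left in the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Without the reset, the surviving rows keep their original labels (0, 1, 3 here), which matters later if the index is written to the CSV.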

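Check and Impute Missing Values:
- Count the missing values in each column, then fill them with an appropriate statistical measure: mean for 'marks', median for 'age', and mode for 'class'.
```
display(df.count())         # non-null values per column
display(df.isnull().sum())  # missing values per column
# Assign the result back instead of calling fillna(inplace=True) on a selected column,
# which recent pandas versions warn may not update df.
df['marks'] = df['marks'].fillna(df['marks'].mean())
df['age'] = df['age'].fillna(df['age'].median())
df['class'] = df['class'].fillna(df['class'].mode()[0])
```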
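
The three imputation choices can be checked by hand on a small example. The values below are hypothetical; only the column names match the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "marks": [80.0, np.nan, 60.0, 70.0],
    "age": [20.0, 21.0, np.nan, 40.0],
    "class": ["A", None, "B", "A"],
})

# Mean of the known marks: (80 + 60 + 70) / 3 = 70
toy["marks"] = toy["marks"].fillna(toy["marks"].mean())

# Median of the known ages (20, 21, 40) is 21; the outlier 40 does not pull it up
toy["age"] = toy["age"].fillna(toy["age"].median())

# mode() returns a Series because ties are possible; [0] takes the first mode
toy["class"] = toy["class"].fillna(toy["class"].mode()[0])

print(toy)
```

The median is often the safer choice for a numeric column with outliers, while the mode is the usual option for categorical columns such as 'class'.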

Forward Fill and Backward Fill Missing Values:
- Apply forward fill and backward fill techniques to handle missing values.
```
df_ffill = df.ffill()
df_bfill = df.bfill()
display(df_ffill.head())
display(df_bfill.head())
```
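
One caveat: forward fill cannot fill a gap in the first row, and backward fill cannot fill a gap in the last row. The sketch below, on hypothetical data, shows the leftover value and one possible remedy, chaining the two fills, which the notebook itself does not do.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'marks' is missing in the very first row
toy = pd.DataFrame({
    "marks": [np.nan, 55.0, np.nan, 72.0],
    "age": [19.0, np.nan, np.nan, 21.0],
})

# Forward fill copies the previous valid value downward;
# the first 'marks' entry has no predecessor and stays missing
ff = toy.ffill()
print(ff)

# Chaining a backward fill picks up whatever forward fill left behind
both = toy.ffill().bfill()
print(both)
```
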

Save the Cleaned Dataset:
- Save the forward-filled dataset to a CSV file.
```
cleaned_dataset = df_ffill
# index=True also writes the row index as the first column of the CSV.
cleaned_dataset.to_csv('cleaned_dataset.csv', index=True)
print("Cleaned Dataset saved as 'cleaned_dataset.csv'")
```
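
The effect of the `index` argument shows up when the file is read back. This sketch writes a hypothetical two-row DataFrame to a temporary file (the file name `toy_cleaned.csv` is made up for the example) and compares the options.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data; the index has a gap, as it would after dropping duplicates
toy = pd.DataFrame(
    {"name": ["Asha", "Ravi"], "marks": [80.0, 60.0]},
    index=[0, 2],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_cleaned.csv"

    # index=True stores the index as an unnamed first column
    toy.to_csv(path, index=True)
    with_index = pd.read_csv(path)               # index comes back as a data column
    restored = pd.read_csv(path, index_col=0)    # index comes back as the index

    # index=False leaves the index out of the file entirely
    toy.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

If the index carries no meaning, `index=False` keeps the CSV free of the extra column; otherwise read the file back with `index_col=0`.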

How to Use

Clone this repository.
Run the notebook Experiment_2_AIDS_SUJAL_SOKANDE_23(DS).ipynb in Google Colab or locally.

Dataset

The dataset used in this experiment is sourced from the UCI Machine Learning Repository.

License

This project is licensed under the MIT License - see the LICENSE file for details.
