This repository contains the code and results for the second experiment in the Data Science Fundamentals with Python course. The objective of this experiment is to preprocess and clean a dataset obtained from the UCI Machine Learning Repository. The notebook performs the following steps:
- Loading the Dataset: The dataset is loaded using the Pandas library.
- Inspecting the Dataset: Key statistical measures, such as the mean of 'age' and 'marks', are calculated.
- Cleaning Column Names: Leading or trailing spaces in column names are removed.
- Standardizing Column Data: The 'name' column is standardized by converting each name to title case (the first letter of each word capitalized).
- Handling Duplicate Rows: Duplicate rows are identified and removed.
- Handling Missing Values: Missing values are handled through imputation using the mean, median, or mode, as appropriate.
- Forward and Backward Fill: Missing values are also handled using forward fill and backward fill techniques.
- Saving the Cleaned Dataset: The cleaned dataset is saved to a CSV file.

Key concepts:
- Pandas Library: Used for data manipulation and analysis.
- Data Cleaning: Techniques such as removing duplicates, handling missing values, and standardizing data.
- Imputation: Filling missing data using statistical measures like mean, median, or mode.
- Forward Fill and Backward Fill: Techniques to fill missing values by propagating previous or subsequent values.
- DataFrame Operations: Methods to inspect and clean the dataset.

Procedure:
- Set up Google Colab:
  - Open Google Colab.
  - Create a new notebook.
- Import Necessary Libraries:
  - Start by importing the required libraries.

        import pandas as pd
        import matplotlib.pyplot as plt
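- Load the Dataset:
  - Pass the path of the dataset to pd.read_csv. The path below points into Google Drive, so the drive must be mounted in Colab before it resolves.

        dataset1 = '/content/drive/MyDrive/Ds Data Sets/student3.csv'
        df = pd.read_csv(dataset1)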
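A missing or unmounted path is a common failure at this step. The following is a minimal sketch of a loader that reports a clear error when the file is absent. The helper name `load_dataset` and the throwaway demo file are hypothetical and not part of the notebook. The Colab mount call appears only as a comment because it runs only inside Colab.

```python
import tempfile
from pathlib import Path

import pandas as pd

# In Colab, Google Drive is typically mounted first (works only inside Colab):
#   from google.colab import drive
#   drive.mount('/content/drive')


def load_dataset(path):
    """Read a CSV file, raising a clear error if the path does not exist."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path}")
    return pd.read_csv(path)


# Demo with a throwaway file, since the real student3.csv is not available here.
# The header deliberately keeps a trailing space after "marks".
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("name,age,marks \nasha,20,81\n")
    demo_df = load_dataset(demo_path)

print(demo_df.shape)
print(list(demo_df.columns))
```

Checking the path up front keeps the error message specific, instead of surfacing later as a failure inside the CSV parser.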
- Inspect the Dataset:
  - Display the first few rows of the dataset and calculate key statistics.

        display(df.head())
        display(df['age'].mean())
        # The raw column name still has a trailing space at this point.
        display(df['marks '].mean())
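Beyond the two means, a quick structural overview often helps before cleaning. The sketch below uses a small made-up DataFrame (the rows are hypothetical, since the real file's contents are not shown here) to illustrate `head`, `info`, and `describe`, and to show that `mean` skips missing values by default.

```python
import pandas as pd

# Hypothetical sample rows; the real student3.csv may differ.
sample = pd.DataFrame({
    "name": ["asha", "ravi", "meera"],
    "age": [20, 22, None],
    "marks ": [80.0, None, 70.0],  # trailing space kept, as in the raw file
})

print(sample.head())      # first rows
sample.info()             # column dtypes and non-null counts
print(sample.describe())  # summary statistics for numeric columns

# NaN entries are skipped by default when computing the mean.
age_mean = sample["age"].mean()
marks_mean = sample["marks "].mean()
print(age_mean, marks_mean)
```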
- Clean Column Names:
  - Remove leading or trailing spaces from column names.

        df.columns = df.columns.str.strip()
        df.columns
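To see why this step matters, the sketch below builds a one-row DataFrame with padded column names (hypothetical data) and shows that a lookup by the clean name only works after stripping.

```python
import pandas as pd

# Hypothetical frame whose headers carry stray spaces.
raw = pd.DataFrame({" name": ["asha"], "age ": [20], "marks ": [81]})

found_before = "marks" in raw.columns  # False: the real name is "marks "

raw.columns = raw.columns.str.strip()

found_after = "marks" in raw.columns   # True once the padding is removed
cleaned_columns = list(raw.columns)
print(cleaned_columns)
```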
- Standardize Column Data:
  - Convert the 'name' column to title case, which capitalizes the first letter of each word and lowercases the rest.

        df['name'] = df['name'].str.title()
        df['name']
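As an illustration of what `str.title()` does to differently cased input, here is a small sketch on made-up names. Missing entries pass through unchanged rather than raising an error.

```python
import pandas as pd

# Hypothetical names in mixed casing, plus one missing value.
names = pd.Series(["asha rao", "RAVI KUMAR", "mEERA", None])

titled = names.str.title()
print(titled.tolist())
```

Note that title case also lowercases every letter that does not start a word, so it may not suit names with internal capitals.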
- Handle Duplicate Rows:
  - Identify and remove duplicate rows.

        df.duplicated()  # True marks a row that repeats an earlier one
        df.drop_duplicates(inplace=True)
        df.duplicated()  # every value should now be False
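The sketch below, on hypothetical rows, counts duplicates before removing them. It also resets the index afterwards, which is an optional extra not done in the notebook: `drop_duplicates` leaves gaps in the index where rows were removed.

```python
import pandas as pd

# Hypothetical rows; the third row repeats the first.
rows = pd.DataFrame({
    "name": ["Asha", "Ravi", "Asha", "Meera"],
    "age": [20, 22, 20, 21],
})

n_duplicates = int(rows.duplicated().sum())

# Keep the first occurrence of each row, then renumber the index from 0.
deduped = rows.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduped))
```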
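- Check and Impute Missing Values:
  - Count the missing values per column, then fill them with an appropriate statistical measure: the mean for 'marks', the median for 'age', and the mode for 'class'.

        df.count()
        df.isnull().sum()
        # Assign the result back: fillna(..., inplace=True) on a selected
        # column may not update df in recent pandas versions.
        df['marks'] = df['marks'].fillna(df['marks'].mean())
        df['age'] = df['age'].fillna(df['age'].median())
        df['class'] = df['class'].fillna(df['class'].mode()[0])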
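A worked example may make the three imputation choices concrete. The data below is made up for illustration. The column names mirror the notebook, but the values are assumptions.

```python
import pandas as pd

# Hypothetical data with one missing value per column.
scores = pd.DataFrame({
    "marks": [60.0, 80.0, None, 100.0],
    "age": [20.0, None, 22.0, 30.0],
    "class": ["A", "B", "A", None],
})

missing_before = scores.isnull().sum()

# Mean of 60, 80, 100 is 80.0.
scores["marks"] = scores["marks"].fillna(scores["marks"].mean())
# Median of 20, 22, 30 is 22.0.
scores["age"] = scores["age"].fillna(scores["age"].median())
# mode() returns a Series because ties are possible; [0] takes the first mode.
scores["class"] = scores["class"].fillna(scores["class"].mode()[0])

missing_after = int(scores.isnull().sum().sum())
print(scores)
```

The median is often preferred over the mean for skewed numeric columns, and the mode is the usual choice for categorical columns where an average is undefined.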
- Forward Fill and Backward Fill Missing Values:
  - Apply forward fill and backward fill techniques to handle missing values.

        df_ffill = df.ffill()
        df_bfill = df.bfill()
        df_ffill.head()
        df_bfill.head()
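The two fills differ at the edges of the data, which the following sketch on a made-up Series tries to show: forward fill cannot fill a leading gap, and backward fill cannot fill a trailing one. Chaining both is one possible way to cover both ends. It is an illustration, not something the notebook does.

```python
import pandas as pd

# Hypothetical series with gaps at the start, middle, and end.
s = pd.Series([None, 10.0, None, 30.0, None])

forward = s.ffill()       # copies the previous value down; leading gap stays NaN
backward = s.bfill()      # copies the next value up; trailing gap stays NaN
both = s.ffill().bfill()  # forward fill, then backward fill what is left

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```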
- Save the Cleaned Dataset:
  - Save the cleaned dataset to a CSV file.

        cleaned_dataset = df_ffill
        # index=True also writes the row index as the first CSV column.
        cleaned_dataset.to_csv('cleaned_dataset.csv', index=True)
        print("Cleaned Dataset saved as 'cleaned_dataset.csv'")
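The effect of the `index` argument is easiest to see by reading the file back. This sketch writes a small hypothetical DataFrame to a temporary directory both ways and compares the resulting columns. The file names are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned data; the gap in the index mimics a dropped duplicate.
cleaned = pd.DataFrame(
    {"name": ["Asha", "Ravi"], "marks": [80.0, 75.0]},
    index=[0, 2],
)

with tempfile.TemporaryDirectory() as tmp:
    with_index = Path(tmp) / "with_index.csv"
    without_index = Path(tmp) / "without_index.csv"

    cleaned.to_csv(with_index, index=True)
    cleaned.to_csv(without_index, index=False)

    # Reading back shows the saved index as an extra, unnamed column.
    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

If the row index carries no meaning, `index=False` keeps the saved file free of the extra column.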

To run the experiment:
- Clone this repository.
- Run the notebook Experiment_2_AIDS_SUJAL_SOKANDE.ipynb in Google Colab or locally.

The dataset used in this experiment is sourced from the UCI Machine Learning Repository.
This project is licensed under the MIT License - see the LICENSE file for details.