Goal: Data analysis using google collaboratory python pandas on university ranking dataset
Before doing data analysis, we have to do data processing. It is always a good practice to verify your data preprocessing before you start your analysis.
The dataset of "Top 2000 Universities of the World" has been provided in the link below.
https://drive.google.com/file/d/1C0jgmQlO5xps06WOalrnrbmZ_l2tclgU/view?usp=sharing
https://colab.research.google.com/drive/1S-Z0yBo5T4rZ6CZAbP7mbag49dducyY_?usp=sharing
import pandas as pd
df=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Top 2000 Universities of the World.csv")
df.head()
After having looked at the dataset, it is expected that null or NA values are verified within the dataset, and the corresponding rows are to be dropped as needed to prepare the data for analysis.
Count of Null / NA values:
df.isnull().sum()
After verifying the data for missing values and formulating our research questions. We always observe our dataset, describe it and check the dimensions of the dataset.
Data description provides us a quick view of the data columns and the recorded values in our dataset. We need to make a quick view of our dataset to see how many rows, columns we have in our dataset, and how the data is distributed.
df.shape
Number of Rows: 2000 Number of Columns: 9
df.info
Number of Categorical Columns: 2 Number of Numerical Columns: 7
df.describe()
As a data analyst, after having the initial description of the dataset, we need to understand the columns in the dataset so that we can choose our data aggregation and data summary strategies accordingly for generating insights from our dataset.
We will do some analysis for 4 of the data columns
- Type (Numerical, Categorical) of each column,
- the description of each column, and
- values in each column (list out 3 sample values for categorical column, and the range of value for numerical values)
A student wants to get into the best university. This dataset provides us information about the Universities and their world ranking of different areas (e.g. Employment, Research etc.).
Let’s he/she you want to find a university that has good alumni employment opportunities. From the dataset given,
- list out the top 10 universities based on employment ability
- list out the countries where these universities come from
using data aggregation strategies such as sorting, filtering etc.
df_work = df.filter(["World Rank","Institution", "Country", "Alumni Employment Rank"])
sorted_df = df_work.sort_values(by=['Alumni Employment Rank'] , ascending= True )
sorted_df.head(10)
sorted_df["Country"].head(10).nunique()
7
sorted_df["Country"].head(10).unique()
'China', 'USA', 'Greece', 'Egypt', 'Hungary', 'Bulgaria', 'Spain'
While performing Data Analysis, we might not need to consider all the rows and columns. For example, if we want to have a high level overview of the universities in the USA based on their world rank and overall scores, we can make a subset of data having World Rank, Score and Institution only. We can also further filter out universities that are not in the USA.
To achieve that, below steps have taken
- Create a subset data frame using filters on above mentioned columns
- And from this subset, filter out the Universities which belong to the USA (the output data frame should only contain universities from the USA).
- Create a filtered variable.
df_score = df.filter(["World Rank","Institution", "Country", "Score"])
df_score.head()
- Sort the filtered variable.
sorted_df_score = df_score.sort_values(by=['Score'] , ascending= False )
Display the result in the country USA
sorted_df_score[(sorted_df["Country"] == "USA")].head(5)
Next, we would like to analyse which countries are having more high ranking and high quality universities and which countries have less.
In order to generate these insights, we are required to make a summarised data frame with all the countries and their average score of the universities. (Hint: consider using the “groupby” function)
Create a group by table for country then include the mean function
grouped_df = df.groupby('Country')
grouped_df.mean()
By using the filter function in google collaboratory, we can find and filter for each individual country The average score of
- Ireland:72.02
- United Kingdom:73.63
- Pakistan: 68.38
- Germany: 74.47
Two lowest performing countries:
- North Macedonia
- Kazakhstan
We have studied the Ranges, Quartiles and Interquartile range in detail. We can use it to help us in finding the outliers in the university dataset (i.e. those extremely good or bad universities).
In order to perform this analysis, we will use the 1.5IQR method
Score
df["Score"].min()
df["Score"].max()
65.7, 100.0
q1, q3 = df["Score"].quantile([0.25,0.75])
q1,q3
(67.7, 74.1),
iqr = q3- q1
iqr
6.40
lower_min = q1 - (1.5*iqr)
upper_max = q3 + (1.5*iqr)
print("Lower expected min of IQR = ", lower_min)
print("Upper expected max of IQR = ", upper_max)
Lower expected min of IQR = 58.100000000000016 Upper expected max of IQR = 83.69999999999999
df[(df["Score"] > 83.69999999999999) | (df["Score"] < 58.100000000000016 )]
63 outliers found
Taking help from the Interquartile Range, we have identified 63 Institutions scored higher than the maximum of IQR or the Upper bound threshold of data distribution. 63 universities score higher than the max interquartile range.