Job-Salaries-Estimator-for-Different-Data-Science-Positions-

Data Science positions are unique across the country so we can try and predict the salary of data science positions based on Job Title, Company, and Geography, etc. This project estimates salaries for different data science positions, so if anyone is trying to negotiate then this is a pretty cool tool for them to use.

Web Scraping

Scraped the job descriptions from glassdoor.com using python and selenium. With each job, we obatin the following:

Job Title
Salary Estimate
Job Description
Rating
Company Name
Location
Headquarters
Size
Founded
Type of Ownernship
Industry
Sector
Revenue
Competitors

Data Cleaning

After scraping the data I needed to clean it so that it can be usable for our model. I made the following modifications and created the following variables:

Parsed only the numeric data out of Salary
Made seperate columns for employer provided salary and hourly wages
Salary column contained few empty values so removed the rows without Salary
Parsed rating out of company text
Made a seperate column for the Company State
Made a new column to check if the job is at the company’s headquarters
Added a new column age of company by using the founded date
Added columns to check if the different skills were listed in the job description:
Made a new column for simplified job title and Seniority
Made a new column for description length

Exploratory Data Analysis

EDA plays a very important role at this stage as the summarization of clean data helps in identifying the structure, outliers, anomalies, and patterns in data. I looked at the distributions of the data and the value counts for the various categorical variables. Have done the univariate, bivariate analysis, and plotted histograms,boxplots,bar graphs,pivot tables etc. to represent the data.

Model Building

First, I modified all the categorical variables into dummy variables. Then I splited the data into training and test sets with a test size of 20%. I tried three different models and evaluated them using Mean Absolute Error. I chose MAE because it is kind off easy to interpret and outliers aren’t particularly bad in for this type of model.

Multiple Linear Regression: Base Model
Lasso Regression: As there are any 0s and 1s(because of the sparse data from the many categorical variables), I have chosen a normalized regression like Lasso and thought it would be effective.
Random Forest Regression – With the sparsity associated with the data, I thought that this would be a good fit for our data.

Model Performance

The Random Forest model perfored better than the other models on the test set.

Linear Regression MAE: 18.885
Lasso Regression MAE: 19.665
Random Forest Regression MAE: 11.142

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
EDA_Job.ipynb		EDA_Job.ipynb
Glassdoor Salaries.csv		Glassdoor Salaries.csv
Glassdoor_scraper.py		Glassdoor_scraper.py
README.md		README.md
chromedriver		chromedriver
data_cleaning.ipynb		data_cleaning.ipynb
data_collection.py		data_collection.py
eda_data.csv		eda_data.csv
glassdoor_jobs.csv		glassdoor_jobs.csv
model.py		model.py
model1.pkl		model1.pkl
model2.pkl		model2.pkl
modelbuilding.py		modelbuilding.py
salary_data_cleaned.csv		salary_data_cleaned.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Job-Salaries-Estimator-for-Different-Data-Science-Positions-

Web Scraping

Data Cleaning

Exploratory Data Analysis

Model Building

Model Performance

About

Releases

Packages

Languages

guptapriya-83900/Job-Salaries-Estimator-for-Different-Data-Science-Positions

Folders and files

Latest commit

History

Repository files navigation

Job-Salaries-Estimator-for-Different-Data-Science-Positions-

Web Scraping

Data Cleaning

Exploratory Data Analysis

Model Building

Model Performance

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages