Hardik Sharma edited this page May 20, 2024 · 3 revisions

Welcome to the Stackoverflow-Analysis wiki!

About the project: Stack Overflow is a professional community for developers. It has run a developer survey every year since 2011, and the collected data is openly available on the web. The latest dataset, for 2020, was released on March 5th, 2021. With proper analysis, the dataset can help answer real-world questions. For instance, we can find the most popular language among developers, or the developer role that pays the highest salary. Our project analyzes the last three years of the developer survey to gather meaningful insights from it.

As a first step, we cleaned the data by removing null values and outliers in each column. We then refactored the columns to make them easier to analyze. Finally, we performed data analysis and machine learning on the cleaned dataset. We used machine learning to model the growth of languages and the salary of data scientists in the upcoming years.
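The cleaning step described above could be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the column name `ConvertedComp`, the `"NA"` missing-value marker, and the 1.5 × IQR outlier rule are all assumptions made for the example.

```python
import statistics

MISSING = (None, "", "NA")  # assumed markers for a null value


def drop_missing(rows, column):
    """Keep only rows that have a usable value in `column`."""
    return [r for r in rows if r.get(column) not in MISSING]


def remove_outliers_iqr(rows, column, k=1.5):
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [float(r[column]) for r in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in rows if low <= float(r[column]) <= high]


# Made-up sample rows; "ConvertedComp" is a hypothetical salary column.
sample = [{"ConvertedComp": v} for v in (
    "50000", "52000", "55000", "58000", "NA",
    "60000", "62000", "65000", "70000", "5000000",
)]

cleaned = remove_outliers_iqr(drop_missing(sample, "ConvertedComp"), "ConvertedComp")
print(len(sample), "rows before cleaning,", len(cleaned), "after")
```

The IQR rule is only one common choice for outlier removal; a different threshold or method may suit salary data better.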

The project is licensed under the MIT License.

Project Goals:

  1. Analyze the last three years of Stack Overflow developer survey data to extract insights.
  2. Analyze the impact of higher education, experience, and responsibilities on salary and gender inequalities.
  3. Investigate participation rates based on ethnicity and differences in income between men and women.
  4. Explore the popularity of programming languages and predict their growth based on survey responses.
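As an illustration of the language-popularity goal, the sketch below counts how many respondents report each language. It assumes that languages are stored in a single column as a semicolon-separated string; the column name `LanguageWorkedWith` and the `"NA"` marker are assumptions for the example and may differ between survey years.

```python
from collections import Counter


def language_share(rows, column="LanguageWorkedWith", sep=";"):
    """Return {language: share of respondents who listed it}.

    Rows with no languages listed are left out of the denominator.
    """
    counts = Counter()
    respondents = 0
    for row in rows:
        raw = row.get(column) or ""
        if raw == "NA":  # assumed missing-value marker
            raw = ""
        langs = {part.strip() for part in raw.split(sep) if part.strip()}
        if not langs:
            continue
        respondents += 1
        counts.update(langs)
    if respondents == 0:
        return {}
    return {lang: n / respondents for lang, n in counts.most_common()}


# Made-up sample rows for demonstration only.
sample = [
    {"LanguageWorkedWith": "Python;SQL"},
    {"LanguageWorkedWith": "JavaScript;Python"},
    {"LanguageWorkedWith": "JavaScript"},
    {"LanguageWorkedWith": "NA"},
]

shares = language_share(sample)
for lang, share in shares.items():
    print(f"{lang}: {share:.0%}")
```

Running the same function on each survey year would give a per-year share for every language, which is the kind of series a growth model could then be fitted to.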

Data Source and Background

The dataset comes from the annual Stack Overflow developer survey, covering responses from developers in 180 countries. The data are available in CSV format, with files ranging from 40 to 150 MB and responses from about 1.5 lakh (150,000) survey participants.

Data Format

The data is in CSV format, with 252,199 observations and 62 variables.
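A quick way to check the number of observations and variables in a downloaded file is sketched below, using only the standard library. The in-memory sample stands in for a real file, and the file name in the comment is an assumption.

```python
import csv
import io


def csv_shape(handle):
    """Return (number of data rows, number of columns) for an open CSV file."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:  # empty file
        return 0, 0
    n_rows = sum(1 for _ in reader)
    return n_rows, len(header)


# Small in-memory sample so the example runs without a download.
sample_file = io.StringIO(
    "Respondent,Country,ConvertedComp\n"
    "1,India,50000\n"
    "2,Germany,NA\n"
)
shape = csv_shape(sample_file)
print(shape)

# With a real file (hypothetical name):
# with open("survey_results_public.csv", newline="", encoding="utf-8") as f:
#     print(csv_shape(f))
```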
