This notebook contains PySpark code that manipulates and analyzes a Stack Overflow dataset.

eshentong/pyspark

Uncovering Stack Overflow's Business Problems

Collaborators: Dhruv Shah, Jessica Tong, Celeste Chen, Estelle Yang

Tech Stack: PySpark, GCP, Dataproc

Founded in 2008, Stack Overflow is a cornerstone of the online developer community, providing a platform for knowledge sharing and problem-solving. Understanding user behavior and content trends is crucial for optimizing the platform and keeping users engaged. This project proposes a comprehensive analysis of Stack Overflow user data to uncover valuable insights for improving the platform's user experience and overall effectiveness.

We examine user engagement patterns, identify content trends such as popular programming languages, and analyze user expertise through badges and reputation scores. Using techniques such as frequency analysis, we derive findings that inform platform improvements: targeted support around peak posting times, content creation focused on popular languages, and strategies to improve user onboarding and retention. The project delivers a report with actionable recommendations, data visualizations, and a public code repository for further exploration.
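As a rough illustration of the frequency analysis described above, the sketch below counts questions per hour of day and occurrences of each tag. It uses only the Python standard library on a few made-up rows, so it is a stand-in for the idea rather than the notebook's actual PySpark code. The sample rows, the function names `posts_per_hour` and `tag_frequencies`, and the assumption that tags arrive as a pipe-delimited string are all hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows (not real data): (creation timestamp, tags).
# Assumption: tags are stored as a single pipe-delimited string.
SAMPLE_QUESTIONS = [
    ("2021-03-01 14:05:00", "python|pandas"),
    ("2021-03-01 14:40:00", "javascript|reactjs"),
    ("2021-03-02 09:15:00", "python|pyspark"),
    ("2021-03-02 14:59:00", "java"),
]


def posts_per_hour(rows):
    """Count questions by the hour of day (0-23) of their creation timestamp."""
    return Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour for ts, _ in rows
    )


def tag_frequencies(rows):
    """Count how often each tag appears across all questions."""
    return Counter(
        tag for _, tags in rows for tag in tags.split("|") if tag
    )


hour_counts = posts_per_hour(SAMPLE_QUESTIONS)
tag_counts = tag_frequencies(SAMPLE_QUESTIONS)

# most_common(1) returns a one-element list holding the top (key, count) pair.
peak_hour, peak_count = hour_counts.most_common(1)[0]
top_tag, top_tag_count = tag_counts.most_common(1)[0]

print(f"Peak posting hour: {peak_hour}:00 with {peak_count} questions")
print(f"Most frequent tag: {top_tag} with {top_tag_count} questions")
```

On a cluster, the same counts would more plausibly come from a grouped aggregation over a Spark DataFrame (for example `groupBy(...).count()`), since the full tables are far too large for in-memory Python counters.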

The dataset comes from Google Cloud's BigQuery public data. The Stack Overflow dataset contains 16 tables, including badges, comments, users, and votes. We analyze historical data from 2008 to 2022 in the users, stackoverflow_posts, posts_questions, posts_answers, comments, badges, and post history tables to address the topics above and to provide business insights that help the platform avoid potential threats and risks.
