Skip to content

Built an unsupervised clustering model using K-Prototypes clustering and anomaly detection algorithms to discover patterns in the dataset containing 13000+ projects.

Notifications You must be signed in to change notification settings

knayyar0416/unsupervised-model-clustering

Repository files navigation

Exploring distinctive features for various projects using unsupverised clustering algorithm

In this project, I applied K-Prototypes clustering and anomaly detection to discover patterns in the dataset keeping in mind the potential benefits the stakeholders can get from the model.

πŸ’‘ Instead of using K-Means clustering which is best suited for contuinuous variables, I used K-Prototypes clustering to handle a mix of categorical and continuous variables present in my dataset.

🌐 About Kickstarter

Kickstarter is a platform where creators share their project visions with the communities that will come together to fund them.

πŸ† What did I discover?

By grouping the projects using clustering algorithm, the management could uncover distinct characteristics within each cluster. Below are the unique characteristics of 8 clusters.

βš–οΈ Moderate Goals, Quick Launchers

These projects showed a preference for swift project initiation, and the pledged amount tended to increase as the launch- to-deadline period extended.

πŸ“† High Goals, Recent Projects

Projects with high fundraising goals and recent deadlines. Interestingly, some projects in this cluster managed to attract a high number of backers, even with lofty fundraising goals.

πŸ‘₯ Backer-Friendly Projects

This cluster stood out for its projects' ability to attract a significant number of backers.

πŸ’‘ Modest Achievers

These projects struck a balance between goal and backers, with an average duration between project creation and launch.

πŸ‘ Well-Supported Initiatives

Projects in this cluster enjoyed high backers and pledged amounts. These projects maintained a relatively low goal and were picked by staff, potentially for the spotlight.

πŸš€ Rapid Projects

This cluster embodied a preference for rapid project development and execution.

🎯 Ambitious Newcomers

With a 24% success rate, Cluster 7 features projects with relatively higher fundraising goals compared to the amount pledged.

πŸ•°οΈ Long-Term High-Stakes

Cluster 8, with a 29% success rate, represents projects with the highest fundraising goals across clusters. Notably, these projects were created a long time back, suggesting a lower success rate over time. image

πŸ› οΈ How did I achieve this?

Below is the detailed process for model building.

  1. 🧹 Data preprocessing:
    • Removed non-informative columns including 𝑖𝑑, π‘›π‘Žπ‘šπ‘’, π‘›π‘Žπ‘šπ‘’_𝑙𝑒𝑛, π‘π‘™π‘’π‘Ÿπ‘_𝑙𝑒𝑛, and 𝑝𝑙𝑒𝑑𝑔𝑒𝑑.
    • Created a new feature, π‘”π‘œπ‘Žπ‘™_𝑒𝑠𝑑, by multiplying π‘”π‘œπ‘Žπ‘™ and π‘ π‘‘π‘Žπ‘‘π‘–π‘_𝑒𝑠𝑑_π‘Ÿπ‘Žπ‘‘π‘’.
    • Replaced non-US countries with 'Non-US' and filled missing values in π‘π‘Žπ‘‘π‘’π‘”π‘œπ‘Ÿπ‘¦ with 'No Category'.
    • Dropped irrelevant columns like original date columns and hour-specific columns.
  2. πŸ” Anomaly detection:
    • Before running any clustering algorithm, I ran an Isolation Forest Model for Anomaly Detection to identify and remove anomalies. The model deteted 1,344 anomalies with unusually high π‘”π‘œπ‘Žπ‘™_𝑒𝑠𝑑, π‘π‘Žπ‘π‘˜π‘’π‘Ÿπ‘ _π‘π‘œπ‘’π‘›π‘‘ and 𝑒𝑠𝑑_𝑝𝑙𝑒𝑑𝑔𝑒𝑑.
  3. πŸ€– Clustering model:
    • Applied K-Prototypes clustering to accommodate both the numerical and categorical features.
    • By testing cost function for different values of K for K-Prototypes Clustering, I could observe K=8 and K=10 are the elbow points at which the cost drops drastically. I chose K=8 since it was giving me better business interpretations.

πŸŽ‰ Conclusion

In summary, the K-Prototypes clustering algorithm provided valuable insights into diverse project profiles, offering a comprehensive understanding of Kickstarter projects. Overall, successful projects have high number of backers, pledged amounts, and are staff-picked, ultimately securing a place on Kickstarter spotlight page.

❗️ Challenges faced

When it comes to computation of performance, it is hard to measure β€˜distance’ between categorical variables. Instead, it is more suitable to assess the "dissimilarity" between categorical variables. To address this, I attempted to use 'kprototypes.matching_dissim' and 'kprototypes.check_distance' to compute dissimilarity for categorical variables and distance for continuous ones. My goal was to obtain the silhouette score, but unfortunately, I encountered errors in the process.

πŸ”— Supporting files

About

Built an unsupervised clustering model using K-Prototypes clustering and anomaly detection algorithms to discover patterns in the dataset containing 13000+ projects.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages