Measuring toxicity in Dota 2

Introduction

Dota 2 is toxic af

Purpose

Hypothesis: the early and late game in Dota 2 are more toxic than the mid game.

To run the project, download the data set from https://www.kaggle.com/romovpa/gosuai-dota-2-game-chats and extract it into the folder data (if the folder doesn't exist, create it with mkdir data). Then create a folder for the models with mkdir models. Finally, run:

sh run.sh

Important

You'll also need a file called token.json in ./src, which should contain your API key for the Perspective API. For information on how to obtain a key, see https://github.com/conversationai/perspectiveapi/tree/master/1-get-started

Format of token.json:

{"token": "your_token_here"}
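
As a rough illustration of how this file might be consumed, here is a minimal sketch that reads the key and builds a toxicity request body. The helper names load_token and build_request are hypothetical and are not taken from this repository's ./src code. The endpoint and payload shape follow the public Perspective API documentation as I understand it, so check them against the current docs before relying on them.

```python
import json

# Endpoint as given in the public Perspective API docs (assumption: still current).
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def load_token(path="src/token.json"):
    """Read the API key from a file shaped like {"token": "your_token_here"}."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["token"]


def build_request(text):
    """Build a request body asking for the TOXICITY attribute of one English message."""
    return {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
```

Under these assumptions, the body would be sent as a JSON POST to ANALYZE_URL with the key passed as the key query parameter, and the score would be read from attributeScores.TOXICITY.summaryScore.value in the response.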

TODO

DONE

  • Add a script to keep only text in English.
  • Get some statistics about the dataset.
  • Define a time for early, mid and late game: since the early, mid and late game depend on the pace of each game, and every game differs, the early, mid and late game times are defined using the tertiles of the time distribution of the matches. It's assumed that on average this will reflect the appropriate times for each stage of the game.
  • Research toxicity-measuring models in academia and Kaggle competitions.
  • Look for toxicity models from Kaggle. Using those, decide whether to remove all non-English messages from the data. The answer is yes, I'm going to remove all non-English messages.
  • Remove languages that aren't handled by the toxicity model.
  • Research topic modelling and use the best model for this case.
  • Create and train 3 models for the early, mid and late game.
  • Use the generated topics for the early, mid and late game and measure their toxicity with the toxicity model above.
  • Explore the topics and the toxicity scores to add to the analysis.
  • Write the report.
  • Upload the project to GitHub.
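
The tertile-based definition of early, mid and late game from the list above can be sketched as follows. This is only an illustration under stated assumptions: it treats the input as chat message timestamps in seconds, the sample values are made up, and the function names tertile_cutoffs and game_stage are hypothetical rather than taken from the project code.

```python
from statistics import quantiles


def tertile_cutoffs(times):
    """Return the two cut points that split `times` into three equal-sized groups."""
    lower, upper = quantiles(times, n=3)
    return lower, upper


def game_stage(t, lower, upper):
    """Label a timestamp as early, mid or late game relative to the cut points."""
    if t <= lower:
        return "early"
    if t <= upper:
        return "mid"
    return "late"


# Hypothetical message timestamps in seconds (not real data from the data set).
times = [30, 120, 400, 900, 1500, 1800, 2100, 2600, 3000]
lower, upper = tertile_cutoffs(times)
stages = [game_stage(t, lower, upper) for t in times]
```

Because the cut points come from the data itself, each stage ends up with roughly a third of the messages, which is what lets one model per stage be trained on comparable amounts of text.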