The main goal of this project is to provide a useful tool for keeping track of events related to live chat on Twitch using Sentiment Analysis.
Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.
Source: Wikipedia
Live Twitch chats, especially when there are many spectators, are really difficult to follow and moderate. Moderators are the people the streamer relies on to prevent the chat from becoming a jungle of frustrated monkeys.
This tool aims to help moderators and streamers keep track of the interactions between the streamer and their audience, making use of Sentiment Analysis.
- Ingestion: Kafka Connect with a custom connector and PircBotX
- Streaming: Apache Kafka
- Processing: Spark Streaming, Spark SQL, PySpark (2.4.6)
- Machine Learning/Sentiment Analysis: VADER Sentiment Analysis
- Indexing: Elasticsearch
- Visualization: Kibana
- Containerization: Docker
The project workflow follows the structure above.
- The bot created with the PircBotX library receives the messages sent in the chat selected by the user over the IRC (Internet Relay Chat) protocol.
- A JSON document is built from the data and metadata provided by the bot and inserted into a message queue, from which the connector picks it up and publishes it to the Kafka topic: twitch.
- From there, the messages are consumed by a Python script through the Spark Streaming interface.
- Spark SQL parses the JSON of each consumed message and makes a DataFrame available. Spark SQL also calls the VADER Sentiment Analysis library, which returns the result of the analysis of the message.
- A "sentiment" field is added with the result of VADER's analysis, mapped to one of the following classes: very_positive, positive_opinion, neutral_opinion, negative_opinion, very_negative, ironic (see the sketches after this list).
- The newly built RDD is indexed through Elasticsearch, the core product of the Elastic family.
- Kibana (another Elastic tool) aggregates the data and computes metrics, making them available through a user interface.
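To illustrate how the class labels can be derived from VADER's output, here is a minimal sketch using the vaderSentiment package. The compound-score thresholds below are assumptions for illustration, not the project's actual cut-offs, and the ironic class is left out because deriving it requires project-specific logic beyond plain VADER scores.

```python
# Minimal sketch: map VADER's compound score to the project's classes.
# The thresholds are illustrative assumptions, not the real ones.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def classify(text: str) -> str:
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= 0.65:
        return "very_positive"
    if compound >= 0.05:
        return "positive_opinion"
    if compound <= -0.65:
        return "very_negative"
    if compound <= -0.05:
        return "negative_opinion"
    return "neutral_opinion"  # the "ironic" class needs extra, project-specific logic

print(classify("This play was absolutely insane, I love it!"))
```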
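Along the same lines, here is a hedged sketch of how the Kafka → Spark SQL → Elasticsearch leg could be wired up with Structured Streaming. The field names (user, channel, text), the broker address kafkaServer:9092, the Elasticsearch hostname, and the index name are all assumptions; the project's actual consumer lives in the Spark folder and may use the classic DStream API instead.

```python
# Sketch only: field names, addresses, and index name are assumptions.
# Needs the Kafka and Elasticsearch connectors on the classpath, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6,org.elasticsearch:elasticsearch-hadoop:7.9.0 ...
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, udf
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("twitch-sentiment").getOrCreate()

# Hypothetical message schema; the real one is defined by the connector.
schema = StructType([
    StructField("user", StringType()),
    StructField("channel", StringType()),
    StructField("text", StringType()),
])

sentiment = udf(classify, StringType())  # classify() from the previous sketch

messages = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafkaServer:9092")  # assumed broker
    .option("subscribe", "twitch")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("msg"))
    .select("msg.*")
    .withColumn("sentiment", sentiment(col("text")))
)

# Index each micro-batch into Elasticsearch via the ES-Hadoop connector.
(messages.writeStream
    .format("es")
    .option("checkpointLocation", "/tmp/twitch-checkpoint")
    .option("es.nodes", "elasticsearch")  # assumed container hostname
    .start("twitch/_doc")
    .awaitTermination())
```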
There is a doc file similar to this one in each folder, with information about each specific component.
The /bin folder contains shell scripts that allow you to start the project.
N.B. This project uses Docker as a containerization tool. Make sure you have it installed; look online for how to install it on your system.
N.B. When files are downloaded to a Linux machine, the execution permission is often removed for security reasons. To add it back to all the .sh files in this project folder, run:
$ cd path_to_cloned_repo
$ find ./ -type f -iname "*.sh" -exec chmod +x {} \;
First time running: in the Kafka/Kafka-Settings folder, rename chat-channel.properties.dist to chat-channel.properties and set all the parameters required by the Twitch connection (instructions are in the file itself). Once set up, continue.
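For orientation only, here is a purely hypothetical example of what such a properties file could contain; the actual key names and their instructions are documented in chat-channel.properties.dist itself.

```
# Hypothetical keys for illustration; follow chat-channel.properties.dist.
botName=my_bot_account
oauthToken=oauth:xxxxxxxxxxxxxxxxxxxx
channel=some_twitch_channel
```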
N.B. You need to download the tgz file and place it in the Kafka/Kafka-Settings folder; you can download it from here.
N.B. You need to download the tgz file and place it in the Spark/Python folder; you can download it from here.
Subsequent runs: to change the observed Twitch channel, use:
$ bin/set-observed-channel.sh CHANNELNAME
To start all the containers at once with Docker Compose, run the following script from the bin folder:
$ bin/docker-compose.sh
Alternatively (the long solution), start the following scripts in order, each in a separate bash shell, waiting for each one to finish its logging before starting the next:
$ bin/create-network.sh
$ bin/zookeeper-start.sh
$ bin/kafka-start.sh
$ bin/elasticsearch-start.sh
$ bin/kibana-start.sh
$ bin/spark-consumer-start.sh
These scripts start the individual components: Zookeeper, Kafka, Elasticsearch, Kibana. When Spark starts, follow the instructions on the screen and choose Python.
To stop the running containers, just press Ctrl+C in their respective shells.
In the browser, enter the following address: http://10.0.100.52:5601. To set up Kibana, see its guide in the Kibana folder.
A Docker volume is used to keep data when the container is deleted or pruned. If you do not want to use it, just delete it from docker-compose.yml. If you are running the long solution, delete the -v parameter in the bin/elasticsearch-start.sh file instead. You can then run bin/drop-elasticsearch-volume.sh to delete the volume permanently (see the sketch below).
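For context, a hedged sketch of what the volume wiring in such a script typically looks like; the actual container name, volume name, and image tag are the ones defined in bin/elasticsearch-start.sh, not these:

```sh
# Illustrative only: real names and tags live in bin/elasticsearch-start.sh.
# Removing the -v line disables persistence across container removals.
docker run --name elasticsearch \
  -v es-data:/usr/share/elasticsearch/data \
  docker.elastic.co/elasticsearch/elasticsearch:7.9.0

# drop-elasticsearch-volume.sh presumably wraps something like:
docker volume rm es-data
```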