Sentiment analysis involves analyzing digital text to determine whether the emotional tone is positive, negative, or neutral. This analysis provides valuable insights for companies to enhance customer service and increase brand reputation.
The project aims to explore, train, and compare multiple machine learning models for sentiment analysis using textual data sourced from social media. Specifically, it investigates the effectiveness of different pretrained language models, comparing encoder-only and decoder-only models (known as LLM) variants, and evaluates their performance, data requirements and efficiency based on parameter count.
Beyond model training, the project includes comprehensive exploratory data analysis, which provides actionable intelligence for marketing strategies using insights such as popular hashtags or emerging trends. A general analysis and detailed hashtag analysis is performed, integrating various supervised and unsupervised techniques, including graph analysis and semantic clustering.
This project uses TweetEval dataset, which is a collection of English datasets for seven multi-class classification tasks, all utilizing Twitter data. The tasks include irony detection, hate speech detection, offensive content identification, stance detection, emoji prediction, emotion recognition and sentiment analysis.
Specifically, we focus on the sentiment analysis task, which aims to classify the sentiment of a tweet as negative, neutral or positive.
To download the data and load it into the project, the following code can be used:
from datasets import load_dataset
dataset = load_dataset("tweet_eval", "sentiment")
Additionally, the dataset can be manually downloaded from the HuggingFace datasets hub.
We apply different preprocessing steps to the data to prepare it for training and evaluation depending on the type of model.
For machine learning models, we apply the following preprocessing steps:
- Cleaning and normalization (removing non-alphabetic characters, unicode normalization, lowercasing, etc.).
- Tokenization (word tokenization).
- Stopwords removal.
- Stemming (Porter stemming).
- Vectorization (TF-IDF, word embeddings).
For encoder-only language models, we apply the following preprocessing steps:
- Tokenization.
- Padding.
- Truncation.
- Token to ID conversion.
- Special tokens addition.
- Attention mask creation.
For decoder-only language models (LLM), we apply the following preprocessing steps:
- Prompt creation.
- Stratified sampling to create subsets of the data.
- Mapping of the data to the prompt (including system prompts).
- Special tokens addition.
- Tokenization.
We experiment with multiple types of models to perform sentiment analysis on the TweetEval dataset. The models are divided into two main categories: machine learning models and transformer models.
The explored machine learning models are:
- Logistic Regression
- Random Forest
Additionally, we experiment with VADER.
We experiment with two types of transformer models, encoder-only models and decoder-only models.
The selected encoder-only models are:
Additionally, we run the evaluation process for the following reference encoder-only models:
The selected decoder-only models are:
Additionally, we run the evaluation process using few-shot learning for the following reference decoder-only models:
-
Clone this repository to your local machine. To clone it from the command line run
git clone https://github.com/acampillos/social-media-nlp
. -
(Optional but recommended) Create a virtual environment. To create a virtual environment, run
python3.10.9 -m venv venv
in the root directory of the project. To activate the virtual environment, runsource venv/bin/activate
on Unix systems orvenv\Scripts\activate
on Windows. -
Install dependencies. Install by running
poetry install
. Poetry can be installed by runningpip install poetry
. -
Install flash-attention. To install, run
pip install flash-attn --no-build-isolation
.
The available experiments are located in the experiments
directory and divided by type of model. These are also listed in the following table:
Model Type | Task | Command |
---|---|---|
Language Model (decoder-only) | Training | python .src/social_media_nlp/experiments/llm/train.py model_id -r r -a a -e e -t -v v |
Model and LoRA merging | python .src/social_media_nlp/experiments/llm/merge.py model_path subfolder |
|
Evaluation | python .src/social_media_nlp/experiments/llm/evaluate.py model_path |
|
Language Model (encoder-only) | Training | python .src/social_media_nlp/experiments/lm/train.py model_id |
Evaluation | python .src/social_media_nlp/experiments/lm/evaluation.py model_path |
|
Machine Learning | Training and evaluation | python .src/social_media_nlp/experiments/ml/tune.py |
The parameters used are as follows:
r
: Rank for fine-tuning using LoRA.a
: Alpha for fine-tuning using LoRA.e
: Number of epochs for training.t
: Number of entries to select for the training set.v
: Number of entries to select for the validation set.model_path
: Path to the selected model for the experiment.subfolder
: Path to a specific version of the model after fine-tuning.model_id
: Hugging Face identifier of the selected model or its local path
To visualize the hyperparameter tuning logs from Optuna, follow these steps:
- Start Optuna dashboard by running
optuna-dashboard sqlite:///mlflow.db
in the root directory of the project. - Access dashboard on port 8080.
- View studies and their details (e.g., best parameters, best value, hyperparameter relationships, etc.).
To visualize the logs from MLFlow used in the training and hyperparameter tuning of the models, follow these steps:
- Start MLFlow UI by running
mlflow ui
in the root directory of the project. - Access dashboard on port 5000
- Explore dashboard (e.g., experiments, runs, metrics, etc.)
├── mlruns <- MLflow logs for experiments.
├── models <- Trained models predictions.
├── notebooks <- Jupyter notebooks for EDA and experiments.
│ └── eda.ipynb
├── src
│ └── social_media_nlp <- Source code for the project.
│ ├── data <- Data processing scripts.
│ │ ├── cleaning.py
│ │ └── preprocessing.py
│ ├── experiments <- Experiments scripts.
│ │ ├── llm <- Decoder-only transformer experiments.
│ │ │ ├── evaluate_few_shot.py
│ │ │ ├── evaluate.py
│ │ │ ├── merge_models.py
│ │ │ └── train.py
│ │ ├── ml <- Machine learning experiments.
│ │ │ └── tune.py
│ │ └── seq_lm <- Encoder-only transformer experiments.
│ │ ├── evaluation.py
│ │ └── train.py
│ ├── models <- Model training, evaluation and inference scripts.
│ │ ├── evaluation.py <- Metrics evaluation.
│ │ ├── hyperparameter_tuning.py <- Optuna hyperparameter tuning.
│ │ ├── ml <- Machine learning models' scripts.
│ │ │ └── vader.py
│ │ ├── transformers <- Transformers models' scripts.
│ │ │ ├── inference.py
│ │ │ ├── train.py
│ │ │ └── utils.py
│ │ └── utils.py
│ ├── prompts.py <- Prompts for the LLMs.
│ └── visualization <- Scripts for exploratory analysis and results plots.
│ ├── features.py
│ ├── graph.py
│ ├── hashtag_clustering.py
│ └── similarity.py
├── mlflow.db <- Optuna database for hyperparameter tuning.
├── poetry.lock
├── pyproject.toml <- Project dependencies.
└── README.md
-
Rosenthal, S., Farra, N., & Nakov, P. (2017). SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017) (pp. 502-518).
-
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171-4186). Association for Computational Linguistics. https://aclanthology.org/N19-1423, doi: 10.18653/v1/N19-1423
-
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv. http://arxiv.org/abs/1907.11692
-
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv, abs/1910.01108. https://api.semanticscholar.org/CorpusID:203626972
-
Camacho-Collados, J., Rezaee, K., Riahi, T., Ushio, A., Loureiro, D., Antypas, D., Boisson, J., Espinosa Anke, L., Liu, F., Martínez Cámara, E., et al. (2022). TweetNLP: Cutting-edge natural language processing for social media. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 38-49). Association for Computational Linguistics. https://aclanthology.org/2022.emnlp-demos.5
-
Loureiro, D., Barbieri, F., Neves, L., Espinosa Anke, L., & Camacho-Collados, J. (2022). TimeLMs: Diachronic language models from Twitter. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 251-260). Association for Computational Linguistics. https://aclanthology.org/2022.acl-demo.25, doi: 10.18653/v1/2022.acl-demo.25
-
Pérez, J. M., Giudici, J. C., & Luque, F. (2021). pysentimiento: A Python toolkit for sentiment analysis and SocialNLP tasks. arXiv. https://arxiv.org/abs/2106.09462
-
Abdin, M. I., Jacobs, S. A., Awan, A. A., Aneja, J., Awadallah, A., Awadalla, H., Bach, N., Bahree, A., Bakhtiari, A., Behl, H., et al. (2024). Phi-3 technical report: A highly capable language model locally on your phone. Microsoft. https://www.microsoft.com/en-us/research/publication/phi-3-technical-report-a-highly-capable-language-model-locally-on-your-phone/
-
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). Mistral 7B. arXiv. https://arxiv.org/abs/2310.06825
-
Gemma Team, Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., et al. (2024). Gemma: Open models based on Gemini research and technology. arXiv. https://arxiv.org/abs/2403.08295