"Glad I could help. (Side tip: hit return twice to break out of the '>>')"
Emotions: Pride, Relief, Gratitude, Joy
"Yawn. They’re toxic together and their only trait seemed to be getting naked together."
Emotions: Fear, Disgust
- This project demonstrates a multi-label text classification pipeline built on Google's GoEmotions dataset, a corpus of Reddit comments annotated with emotion labels. Because each comment can carry several labels at once, the task is framed as multi-label classification.
- Overview
- Data Preparation
- Data Exploration & Visualization
- Data Preprocessing
- Feature Extraction & Modeling
- Model Evaluation
- Future Work
- Data Source: Google's GoEmotions Dataset (A Dataset for Fine-Grained Emotion Classification)
- Techniques: Data Exploration; TF-IDF, DistilBERT, and RoBERTa for Feature Extraction; Perceptron, SVM, and Logistic Regression Models
- Goal: Predict which emotions are associated with a given text input.
- The GoEmotions dataset is downloaded and stored locally.
- The file is then read into a Pandas DataFrame, keeping only the required columns (i.e., text and emotion columns).
- Invalid rows are removed (i.e., rows marked as unclear or with no positive emotions labeled).
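A minimal loading sketch, assuming the raw GoEmotions CSV layout (a text column, an example_very_unclear flag, and one binary column per emotion); the file path and exact column names are assumptions rather than the project's actual configuration:

```python
import pandas as pd

# Assumed local path to one of the raw GoEmotions CSV files.
DATA_PATH = "data/goemotions_1.csv"

# Assumed label set: the 27 GoEmotions emotion categories plus "neutral".
EMOTION_COLS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]

df = pd.read_csv(DATA_PATH)

# Drop rows annotators flagged as unclear, then keep only the text and label columns.
if "example_very_unclear" in df.columns:
    df = df[~df["example_very_unclear"].astype(bool)]
df = df[["text"] + EMOTION_COLS]

# Drop rows with no positive emotion label.
df = df[df[EMOTION_COLS].sum(axis=1) > 0].reset_index(drop=True)
```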
- Randomly sample and print a few data samples.
- Calculate and display (via plots):
- Positive value counts of each emotion label (distribution of emotions).
- Positive and negative value counts per emotion label.
This step provides an understanding of the dataset’s class distribution and reveals significant class imbalances.
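A possible way to produce these samples and plots, reusing the df and EMOTION_COLS names from the loading sketch above:

```python
import matplotlib.pyplot as plt

# Print a few random samples together with their positive labels.
for _, row in df.sample(5, random_state=42).iterrows():
    labels = [c for c in EMOTION_COLS if row[c] == 1]
    print(f"{row['text'][:80]!r} -> {labels}")

# Positive counts per emotion label (class distribution).
positive_counts = df[EMOTION_COLS].sum().sort_values(ascending=False)
positive_counts.plot(kind="bar", figsize=(12, 4), title="Positive examples per emotion")
plt.tight_layout()
plt.show()

# Negative counts per label are simply len(df) - positive_counts.
```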
- Convert text to lowercase.
- Expand contractions.
- Replace URLs with [URL] and @handles with [USER].
- Multi-label stratified 80/20 split to maintain balanced class proportions (see the sketch below).
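A preprocessing and splitting sketch under a few assumptions: the third-party contractions package handles contraction expansion, and scikit-multilearn's iterative_train_test_split provides the multi-label stratification (the project may use a different implementation):

```python
import re

import contractions  # assumed helper package for expanding contractions
from skmultilearn.model_selection import iterative_train_test_split

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")

def preprocess(text: str) -> str:
    """Lowercase, expand contractions, and mask URLs / user handles."""
    text = contractions.fix(text.lower())
    text = URL_RE.sub("[URL]", text)
    return USER_RE.sub("[USER]", text)

df["text"] = df["text"].apply(preprocess)

# Multi-label stratified 80/20 split (iterative stratification).
X = df["text"].to_numpy().reshape(-1, 1)
y = df[EMOTION_COLS].to_numpy()
X_train, y_train, X_test, y_test = iterative_train_test_split(X, y, test_size=0.2)
```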
- Features are extracted from the text using TF-IDF, and embeddings are learned from DistilBERT and RoBERTa models. These feature vector representations are then used to train and evaluate classical ML models (e.g., SVM and Logistic Regression models).
- TF-IDF is a statistical measure used to evaluate how important a word is within a text sample relative to a collection (corpus) of samples. It is calculated as the product of two metrics: Term Frequency (TF) and Inverse Document Frequency (IDF).
- This simple method effectively highlights words that are frequent in a given document but rare across the corpus, helping distinguish the unique vocabulary of each document. While TF-IDF performs well in many text classification tasks, it does not capture semantic or contextual relationships between words.
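For illustration, TF-IDF features could be extracted with scikit-learn's TfidfVectorizer; the max_features and ngram_range values below are arbitrary choices rather than the project's settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vectorizer on the training texts only, then transform both splits.
vectorizer = TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train.ravel())
X_test_tfidf = vectorizer.transform(X_test.ravel())
```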
- DistilBERT is designed to be lighter and faster while retaining much of BERT's performance. It achieves this by using a distillation process where a smaller model learns to mimic the behavior of a larger one.
- RoBERTa is a robustly optimized BERT approach that modifies key hyperparameters, removes the next-sentence pretraining objective, and trains with much larger mini-batches and learning rates.
- Unlike TF-IDF, these models capture contextual information and semantic relationships between words, leading to a denser, more informative text representation.
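One common way to obtain such embeddings is masked mean pooling over the encoder's last hidden states, sketched below with Hugging Face transformers; the project may instead use the [CLS] token or another pooling strategy, and the batch size and sequence length are assumptions:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def embed(texts, model_name="distilbert-base-uncased", batch_size=32):
    """Encode texts into fixed-size vectors via masked mean pooling."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    vectors = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(list(texts[i:i + batch_size]), padding=True,
                              truncation=True, max_length=128, return_tensors="pt")
            hidden = model(**batch).last_hidden_state        # (B, T, H)
            mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
            vectors.append((hidden * mask).sum(1) / mask.sum(1))  # masked mean
    return torch.cat(vectors).numpy()

X_train_emb = embed(X_train.ravel())   # DistilBERT features
X_test_emb = embed(X_test.ravel())
# Swap in "roberta-base" as model_name for RoBERTa features.
```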
Once the features are extracted using both methods, they are used to train and evaluate classic ML models. To perform a multi-label classification task, the following approach is adopted:
- Binary Relevance: This is the most straightforward strategy, which treats each label as a separate binary classification problem. Separate binary classifiers are trained for each label, and the final prediction is the union of the predictions of all classifiers.
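A Binary Relevance sketch using scikit-learn's OneVsRestClassifier, continuing from the embedding sketch above; LinearSVC stands in for the SVM, and all hyperparameters are library defaults rather than tuned values:

```python
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Binary Relevance: one independent binary classifier per emotion label.
models = {
    "Logistic Regression": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "SVM": OneVsRestClassifier(LinearSVC()),
    "Perceptron": OneVsRestClassifier(Perceptron()),
}
for name, clf in models.items():
    clf.fit(X_train_emb, y_train)
```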
- After training the models, each model's performance is evaluated on the training and test data. The following metrics are computed and compared (a computation sketch follows the definitions):
Definition: Accuracy measures the proportion of correctly predicted instances (positive and negative) out of the total instances.
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
Use Case: Useful when the dataset is balanced (similar numbers of positive and negative instances).
Definition: Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive (i.e., of all the positive predictions made by the model, what percentage of them are truly positive?).
Formula: Precision = TP / (TP + FP)
Use Case: Important in scenarios where minimizing false positives is critical (e.g., spam detection).
Definition: Recall (or Sensitivity) measures the proportion of correctly predicted positive instances out of all actual positive instances (i.e., out of all the truly positive instances that the model was tested on, what percentage did the model correctly identify as positive?).
Formula: Recall = TP / (TP + FN)
Use Case: Crucial in scenarios where minimizing false negatives is important (e.g., disease detection).
Definition: F1-Score is the harmonic mean of Precision and Recall, providing a single metric that balances both.
Formula: F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Use Case: Useful when there is an imbalance in class distribution.
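These metrics can be computed with scikit-learn as sketched below; micro-averaging is assumed for the label-wise metrics, since the averaging strategy behind the reported scores is not specified:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_pred = models["Perceptron"].predict(X_test_emb)

print("Accuracy :", accuracy_score(y_test, y_pred))  # exact-match (subset) accuracy
print("Precision:", precision_score(y_test, y_pred, average="micro", zero_division=0))
print("Recall   :", recall_score(y_test, y_pred, average="micro", zero_division=0))
print("F1-Score :", f1_score(y_test, y_pred, average="micro", zero_division=0))
```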
- The table below shows the performance (F1-Score only) of the models on the test set using different feature extraction methods.
Model | Feature Extraction Method | F1-Score (%) |
---|---|---|
Logistic Regression | TF-IDF | 13.04 |
SVM | TF-IDF | 12.19 |
Logistic Regression | DistilBERT | 32.82 |
SVM | DistilBERT | 0.45 |
Perceptron | DistilBERT | 36.26 |
Logistic Regression | RoBERTa | 8.83 |
SVM | RoBERTa | 17.09 |
Perceptron | RoBERTa | 13.34 |
- The Perceptron model with DistilBERT features outperforms all other models, reaching an F1-Score of 36.26%, up from the 13.04% achieved by the best TF-IDF-based model (Logistic Regression).
- Note: Additional metrics were computed and other models were evaluated; for brevity, only the top results are shown above.
- Other classical machine learning models, such as the Perceptron, can be trained using TF-IDF features, and alternative problem formulation methods beyond Binary Relevance are worth exploring.
- Additionally, simpler embedding techniques such as Word2Vec or GloVe could be compared against the transformer-based methods to quantify the performance gap.
- Furthermore, embeddings obtained from DistilBERT and RoBERTa can serve as inputs for more advanced models, such as MLP, LSTM, or GRU, potentially enhancing overall performance.
- Finally, fine-tuning an advanced embedding model on the dataset may lead to better, task-specific representations.