This project is designed to classify text using embeddings and a fast classifier.
To install the required dependencies, run:

```bash
pip install -r requirements.txt
```
1. Prepare your dataset:
   - Ensure your dataset is in CSV format, with columns for the text and the corresponding labels (see the example below).
   - Place the dataset file in the `data` directory.
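   For reference, a minimal input file might look like this; the column names `text` and `label` are illustrative assumptions, so match whatever the notebook expects:

   ```
   text,label
   "Great product, works as advertised",positive
   "Stopped working after two days",negative
   ```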
2. Fetch embeddings:
   - Set up an OpenAI API key by setting the `OPENAI_API_KEY` environment variable.
   - Run the `embeddings.py` script to fetch embeddings for your text data (the underlying API call is sketched below).
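   In case you want to adapt the fetching step, here is a minimal sketch of the underlying call; the model name is an assumption, and the project's actual logic lives in `embeddings.py`:

   ```python
   import os
   from openai import OpenAI

   # The client reads OPENAI_API_KEY from the environment; passed explicitly for clarity.
   client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

   texts = ["first document", "second document"]

   # One request can embed a whole batch of texts; the model name here is assumed.
   response = client.embeddings.create(model="text-embedding-3-small", input=texts)
   embeddings = [item.embedding for item in response.data]
   ```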
3. Train and evaluate classifiers:
   - Open the `how_to_use.ipynb` notebook.
   - Update the dataset file path in the notebook to point to your dataset.
   - Run the notebook cells to train and evaluate different classifiers (logistic regression, random forest, SVM).
   - The notebook will select the best classifier based on the F1-score (the selection logic is sketched after this list).
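   The notebook's actual code may differ, but selection by cross-validated F1 typically looks like this minimal scikit-learn sketch (synthetic stand-in data; swap in your embeddings and labels):

   ```python
   import numpy as np
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.linear_model import LogisticRegression
   from sklearn.model_selection import cross_val_score
   from sklearn.svm import SVC

   rng = np.random.default_rng(0)
   X = rng.normal(size=(300, 8))     # stand-in for embedding vectors
   y = rng.integers(0, 2, size=300)  # stand-in for binary labels

   candidates = {
       "logistic_regression": LogisticRegression(max_iter=1000),
       "random_forest": RandomForestClassifier(),
       "svm": SVC(),
   }

   # Score each candidate with cross-validated F1 and keep the best one.
   scores = {
       name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
       for name, model in candidates.items()
   }
   best_name = max(scores, key=scores.get)
   best_model = candidates[best_name].fit(X, y)
   print(f"best classifier: {best_name} (F1 = {scores[best_name]:.3f})")
   ```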
4. Save and use the trained classifier:
   - The best classifier will be saved as `best_classifier.pkl` in the project directory.
   - To use the trained classifier in your own code, load the saved model using the `TextClassifier.load_model()` method (see the usage sketch below).
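   A plausible usage sketch; the module name, the `load_model()` signature, and the predict call are assumptions, so check the class definition in this repository:

   ```python
   # Hypothetical usage; verify the actual API in this repository.
   from text_classifier import TextClassifier  # module name is an assumption

   classifier = TextClassifier.load_model("best_classifier.pkl")
   print(classifier.predict(["Some new text to classify"]))
   ```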
5. Assess the need for more data:
   - The `how_to_use.ipynb` notebook includes a section to plot the learning curve for the best classifier.
   - Analyze the learning curve plot to determine if the model would benefit from more training data.
   - If the model is overfitting (high training score, low cross-validation score), consider collecting more data to improve generalization (a learning-curve sketch follows this list).
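   To reproduce the learning curve outside the notebook, here is a minimal sketch with scikit-learn's `learning_curve` (synthetic stand-in data; swap in your embeddings, labels, and best classifier):

   ```python
   import matplotlib.pyplot as plt
   import numpy as np
   from sklearn.linear_model import LogisticRegression
   from sklearn.model_selection import learning_curve

   rng = np.random.default_rng(0)
   X = rng.normal(size=(300, 8))     # stand-in for embedding vectors
   y = rng.integers(0, 2, size=300)  # stand-in for binary labels

   sizes, train_scores, cv_scores = learning_curve(
       LogisticRegression(max_iter=1000), X, y,
       cv=5, scoring="f1", train_sizes=np.linspace(0.1, 1.0, 5),
   )

   # A persistent gap between the curves signals overfitting: more data may help.
   plt.plot(sizes, train_scores.mean(axis=1), label="training F1")
   plt.plot(sizes, cv_scores.mean(axis=1), label="cross-validation F1")
   plt.xlabel("training set size")
   plt.ylabel("F1-score")
   plt.legend()
   plt.show()
   ```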
Thanks to the magic of embeddings, a binary classifier on a medium-complexity NLP task typically needs only around 300 curated data points to reach very high precision, even in a multilingual setting. My suggestion, especially if you have a small data source, is to check at various dataset sizes whether the model is strongly overfitting, as that is a good indicator of the need for more data.
Consider the following:
- Collect a diverse and representative dataset covering various text patterns and labels.
- Start with a small dataset and gradually increase its size while monitoring the model's performance.
- Use cross-validation to assess the model's generalization ability and detect overfitting.
- If overfitting occurs, prioritize collecting more data over tuning the model architecture or hyperparameters.
- Regularly evaluate the model's performance on a held-out test set to ensure it generalizes well to unseen data (see the sketch after this list).
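A minimal sketch of such a held-out evaluation with scikit-learn (synthetic stand-in data; replace with your embeddings and labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))     # stand-in for embedding vectors
y = rng.integers(0, 2, size=300)  # stand-in for binary labels

# Hold out a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out F1:", f1_score(y_test, model.predict(X_test)))
```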
By iteratively assessing the model's performance and gathering more data as needed, you can build a robust and accurate text classifier using embeddings and a fast classifier.