MBassi91/embedding_classifier

Project Overview and Instructions

This project is designed to classify text using embeddings and a fast classifier.

Installation

To install the required dependencies, run:

```
pip install -r requirements.txt
```

Usage

  1. Prepare your dataset:

    • Ensure your dataset is in a CSV format with columns for text and corresponding labels.
    • Place the dataset file in the data directory.
  2. Fetch embeddings:

    • Set up an OpenAI API key by setting the OPENAI_API_KEY environment variable.
    • Run the embeddings.py script to fetch embeddings for your text data (a sketch of this step follows the list).
  3. Train and evaluate classifiers:

    • Open the how_to_use.ipynb notebook.
    • Update the dataset file path in the notebook to point to your dataset.
    • Run the notebook cells to train and evaluate different classifiers (logistic regression, random forest, SVM).
    • The notebook will select the best classifier based on the F1-score (a training sketch follows the list).
  4. Save and use the trained classifier:

    • The best classifier will be saved as best_classifier.pkl in the project directory.
    • To use the trained classifier in your own code, load the saved model using the TextClassifier.load_model() method (a loading sketch follows the list).
  5. Assess the need for more data:

    • The how_to_use.ipynb notebook includes a section to plot the learning curve for the best classifier.
    • Analyze the learning curve plot to determine whether the model would benefit from more training data (a plotting sketch follows the list).
    • If the model is overfitting (high training score, low cross-validation score), consider collecting more data to improve generalization.
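For step 2, the sketch below shows one way to fetch embeddings. It assumes the openai v1 Python client, the text-embedding-3-small model, and a data/dataset.csv file with a text column; the repo's embeddings.py may differ in model choice, batching, and file paths.

```python
# Hedged sketch of step 2: embed each row of the dataset.
# Assumes OPENAI_API_KEY is set in the environment; model name,
# file paths, and column names are assumptions, not the repo's exact code.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically

def fetch_embeddings(texts, model="text-embedding-3-small"):
    # One batched API call; resp.data preserves the input order.
    resp = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in resp.data]

df = pd.read_csv("data/dataset.csv")                  # assumed path
df["embedding"] = fetch_embeddings(df["text"].tolist())
df.to_pickle("data/dataset_with_embeddings.pkl")      # assumed output file
```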
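For step 3, here is a sketch of training the three classifier types the notebook mentions and keeping the one with the best F1-score. The split ratio, scoring average, and column names are assumptions; how_to_use.ipynb remains the authoritative procedure.

```python
# Hedged sketch of step 3: train candidate classifiers, pick the best by F1.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

df = pd.read_pickle("data/dataset_with_embeddings.pkl")
X = np.array(df["embedding"].tolist())
y = df["label"].values                                # assumed column name

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
    "svm": SVC(),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    scores[name] = f1_score(y_test, clf.predict(X_test), average="weighted")

best_name = max(scores, key=scores.get)
print(f"best classifier: {best_name} (F1 = {scores[best_name]:.3f})")
```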
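For step 4, a short usage sketch; the TextClassifier import path, the load_model() signature, and the predict() call are assumptions, so check the repo's source for the exact interface.

```python
# Hedged sketch of step 4: load the saved model and classify new text.
from text_classifier import TextClassifier  # assumed module name

clf = TextClassifier.load_model("best_classifier.pkl")    # assumed signature
print(clf.predict(["An example sentence to classify."]))  # assumed method
```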
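For step 5, a sketch of the learning-curve check using scikit-learn's learning_curve; X, y, candidates, and best_name carry over from the training sketch above, and the cross-validation settings are assumptions.

```python
# Hedged sketch of step 5: plot training vs. cross-validation scores
# over growing training-set sizes.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    candidates[best_name], X, y, cv=5, scoring="f1_weighted",
    train_sizes=np.linspace(0.1, 1.0, 5),
)

plt.plot(sizes, train_scores.mean(axis=1), "o-", label="training score")
plt.plot(sizes, val_scores.mean(axis=1), "o-", label="cross-validation score")
plt.xlabel("training set size")
plt.ylabel("weighted F1")
plt.legend()
plt.show()
# A persistent gap (high training score, low cross-validation score)
# is the overfitting signal described in step 5.
```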

Suggestions

Thanks to the magic of embeddings, a binary classifier on a medium-complexity NLP task needs only around 300 curated data points to reach very high precision, even in a multilingual setting. Especially if your data source is small, check at several training-set sizes whether the model is strongly overfitting; that is a good indicator of the need for more data.

Consider the following:

  • Collect a diverse and representative dataset covering various text patterns and labels.
  • Start with a small dataset and gradually increase its size while monitoring the model's performance.
  • Use cross-validation to assess the model's generalization ability and detect overfitting (see the sketch after this list).
  • If overfitting occurs, prioritize collecting more data over tuning the model architecture or hyperparameters.
  • Regularly evaluate the model's performance on a held-out test set to ensure it generalizes well to unseen data.
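As a concrete version of the cross-validation bullet, the sketch below compares training F1 to cross-validated F1; variable names follow the training sketch in the Usage section and are assumptions.

```python
# Hedged sketch: a large train-vs-CV gap signals overfitting.
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

clf = candidates[best_name]
cv_f1 = cross_val_score(clf, X, y, cv=5, scoring="f1_weighted").mean()
clf.fit(X_train, y_train)
train_f1 = f1_score(y_train, clf.predict(X_train), average="weighted")
print(f"train F1 = {train_f1:.3f}, 5-fold CV F1 = {cv_f1:.3f}")
```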

By iteratively assessing the model's performance and gathering more data as needed, you can build a robust and accurate text classifier using embeddings and a fast classifier.
