
Model benchmarks on SDGi Corpus, a multilingual dataset for text classification by Sustainable Development Goals.


UNDP-Data/dsc-sdgi-corpus


dsc-sdgi-corpus


Introduction

Model benchmarks on SDGi Corpus, a multilingual dataset for text classification by Sustainable Development Goals.

Getting Started

Python Environment

The codebase has been developed and tested with Python 3.11. To create a local Python environment, clone the repository and run the following commands in the project directory:

python -m venv .venv/
source .venv/bin/activate
pip install -r requirements.txt

Environment Variables

The following environment variables may need to be set in a .env file:

# Location for an MLflow database
MLFLOW_TRACKING_URI="sqlite:///mlruns.db"

# The variables below are only required for GPT experiments or OOD data
AZURE_OPENAI_API_KEY="<Azure OpenAI API Key>"
AZURE_OPENAI_ENDPOINT="<Azure OpenAI API Endpoint>"
AZURE_OPENAI_EMBEDDING_MODEL="<Azure OpenAI Embedding Model Deployment>"
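Before launching an experiment, it can help to confirm that the variables are actually visible to the Python process. The following is a minimal sketch, not part of the repository: the helper `missing_env_vars` and the split into `CORE_VARS` and `AZURE_VARS` are illustrative assumptions based on the comments in the example above. It only inspects the process environment, so it assumes the .env file has already been loaded by whatever mechanism the project uses.

```python
import os

# Variable names are taken from the .env example above. Grouping them into
# "always needed" and "GPT/OOD only" is an assumption drawn from its comments.
CORE_VARS = ["MLFLOW_TRACKING_URI"]
AZURE_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_EMBEDDING_MODEL",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    needs_azure = False  # set to True for GPT experiments or OOD data
    required = CORE_VARS + (AZURE_VARS if needs_azure else [])
    missing = missing_env_vars(required)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.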

Running Experiments

To run out-of-domain (OOD) experiments, first prepare the OOD dataset using a function from src. This requires access to Azure OpenAI and the environment variables described above. To create and save the dataset, run:

from src import prepare_ood_dataset

dataset = prepare_ood_dataset()
dataset.save_to_disk("data/sdg-meter")

To replicate the supervised results from the paper, run the shell script:

chmod 755 main.sh
./main.sh

If you prefer running individual experiments, you can use main.py:

python main.py --size s --language xx
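To run several individual experiments in sequence, the command above can be wrapped in a small driver. This is a sketch under stated assumptions rather than a script from the repository: `build_command` and `run_all` are hypothetical helpers, and the only values used are the `s` and `xx` placeholders from the example above. Check main.py or main.sh for the sizes and language codes that are actually accepted.

```python
import subprocess
import sys

# Placeholder values copied from the example command; the real set of
# accepted sizes and language codes is defined by main.py, not here.
SIZES = ["s"]
LANGUAGES = ["xx"]


def build_command(size, language):
    """Build the argument list for a single main.py run."""
    return [sys.executable, "main.py", "--size", size, "--language", language]


def run_all(sizes, languages, dry_run=True):
    """Run (or, in dry-run mode, only print) one experiment per pair."""
    commands = [build_command(s, lang) for s in sizes for lang in languages]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the loop as soon as one experiment fails.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    run_all(SIZES, LANGUAGES, dry_run=True)
```

The dry-run default prints the commands without executing them, which is a cheap way to verify the combinations before committing to long runs.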

Results are saved to the local SQLite database specified by MLFLOW_TRACKING_URI in the environment variables. To view the results in MLflow, run:

mlflow ui --port 8080 --backend-store-uri sqlite:///mlruns.db
# open http://127.0.0.1:8080
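If the MLflow UI shows no runs, a quick sanity check is to confirm that the database file exists and contains tables. The sketch below uses only the Python standard library. The helpers `sqlite_path_from_uri` and `list_tables` are illustrative and not part of the repository, and the sketch makes no assumption about which tables MLflow creates.

```python
import os
import sqlite3


def sqlite_path_from_uri(uri):
    """Extract the file path from a 'sqlite:///path' style URI."""
    prefix = "sqlite:///"
    if not uri.startswith(prefix):
        raise ValueError(f"not a sqlite URI: {uri!r}")
    return uri[len(prefix):]


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    path = sqlite_path_from_uri("sqlite:///mlruns.db")
    # Check first: sqlite3.connect would silently create an empty file.
    if os.path.exists(path):
        print(list_tables(path))
    else:
        print(f"{path} not found; run an experiment first.")
```

An empty list for an existing file suggests that no experiment has written to this database yet, or that the experiments used a different tracking URI.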

Contribute

If you have any questions or notice any issues, feel free to open an issue.
