Model benchmarks on SDGi Corpus, a multilingual dataset for text classification by Sustainable Development Goals.
The codebase has been developed and tested with Python 3.11. To create a local Python environment, clone the repository and run the following commands in the project directory:
python -m venv .venv/
source .venv/bin/activate
pip install -r requirements.txt
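As a quick sanity check of the environment, you can try loading the SDGi Corpus with the Hugging Face datasets library. This is a minimal sketch; the dataset identifier below is an assumption and may differ from how the codebase actually loads the data:

# Sanity check: load the SDGi Corpus and inspect a sample.
# NOTE: "UNDP/sdgi-corpus" is an assumed Hub identifier; adjust if the repo loads data differently.
from datasets import load_dataset
dataset = load_dataset("UNDP/sdgi-corpus")
print(dataset)
print(dataset["train"][0])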
The following environment variables may need to be set in a .env file:
# Location for an MLflow database
MLFLOW_TRACKING_URI="sqlite:///mlruns.db"
# The variables below are only required for GPT experiments or OOD data
AZURE_OPENAI_API_KEY="<Azure OpenAI API Key>"
AZURE_OPENAI_ENDPOINT="<Azure OpenAI API Endpoint>"
AZURE_OPENAI_EMBEDDING_MODEL="<Azure OpenAI Embedding Model Deployment>"
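The experiments are expected to pick these values up at runtime; if you want to verify the file manually, a small sketch like the following works (it assumes the python-dotenv package, which may not be what the codebase itself uses):

# Load variables from .env into the process environment and check one of them.
# Assumes python-dotenv is installed; the repository may load the file differently.
import os
from dotenv import load_dotenv
load_dotenv()
print(os.environ["MLFLOW_TRACKING_URI"])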
To run out-of-domain (OOD) experiments, you first need to prepare the OOD dataset using a function from src. This requires access to Azure OpenAI and setting the environment variables mentioned above. To create and save the dataset, run:
from src import prepare_ood_dataset
dataset = prepare_ood_dataset()
dataset.save_to_disk("data/sdg-meter")
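Since save_to_disk suggests a Hugging Face Dataset/DatasetDict, the prepared data can later be reloaded without calling Azure OpenAI again:

# Reload the previously prepared OOD dataset from disk.
from datasets import load_from_disk
dataset = load_from_disk("data/sdg-meter")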
To replicate supervised results from the paper, you can run the Shell script:
chmod 755 main.sh
./main.sh
If you prefer running individual experiments, you can use main.py:
python main.py --size s --language xx
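To queue several runs in one go, you can wrap main.py in a small driver script. The sketch below keeps the same --size and --language flags, but the value lists are placeholders rather than the benchmark's actual options:

# Hypothetical driver looping main.py over several configurations.
# The size and language values are placeholders; substitute the options used in the paper.
import subprocess
sizes = ["s"]
languages = ["xx"]
for size in sizes:
    for language in languages:
        subprocess.run(["python", "main.py", "--size", size, "--language", language], check=True)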
Results are saved to the local SQLite database you specify in the environment variables. To view the results in MLflow, run:
mlflow ui --port 8080 --backend-store-uri sqlite:///mlruns.db
# open http://127.0.0.1:8080
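Results can also be queried programmatically through the MLflow client API instead of the UI, for example:

# Query logged runs directly from the SQLite tracking backend.
import mlflow
mlflow.set_tracking_uri("sqlite:///mlruns.db")
runs = mlflow.search_runs(search_all_experiments=True)
print(runs.head())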
If you have any questions or notice any issues, feel free to open an issue.