Use the Python scripts in this repo to run LLM model evaluation on the mmlu and tmmluplus datasets.
- Introduction on Papers with Code: Paper-with-code
- Introduction: Medium Article
- Hugging Face dataset: Huggingface Dataset
- Step 1: Download the model from Hugging Face. The following commands use the Mistral-7B-v0.1 model as an example:
```
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-v0.1
```
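If `git lfs` is not available, the same snapshot can be fetched with the `huggingface_hub` Python library. This is a minimal sketch; the target directory `./models/Mistral-7B-v0.1` is an assumed layout, not one the scripts require:

```python
# Sketch: download the model with huggingface_hub instead of git-lfs.
# The local_dir path is an assumption; point it wherever --model expects.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mistralai/Mistral-7B-v0.1",
    local_dir="./models/Mistral-7B-v0.1",
)
```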
- Step 2: Arrange the dataset from the tmmluplus `data` folder into the `data_arrange` folder, as sketched below.
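The exact layout that `data_arrange` expects depends on the repo's scripts; as an illustration only, here is a minimal sketch that mirrors every per-subject CSV from the tmmluplus `data` folder into `data_arrange`, preserving subfolders:

```python
# Sketch: mirror per-subject CSV files from data/ into data_arrange/.
# The source/destination paths and CSV-per-subject layout are assumptions.
import shutil
from pathlib import Path

src = Path("./llm_evaluation_tmmluplus/data")
dst = Path("./llm_evaluation_tmmluplus/data_arrange")

for csv_file in src.rglob("*.csv"):
    target = dst / csv_file.relative_to(src)
    target.parent.mkdir(parents=True, exist_ok=True)  # create split/subject dirs
    shutil.copy2(csv_file, target)
```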
- Step 3: Run the following command to generate predictions:
```
python3 evaluation_hf_testing.py \
  --model ./models/llama2-7b-hf \
  --data_dir ./llm_evaluation_tmmluplus/data_arrange/ \
  --save_dir ./llm_evaluation_tmmluplus/results/
```
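For orientation, the core of an MMLU-style prediction script is usually a loop that formats each multiple-choice question into a prompt and picks the choice letter whose token gets the highest next-token logit. The sketch below illustrates that pattern with the `transformers` API; the function name and prompt handling are illustrative, not the repo's actual code:

```python
# Sketch of a typical MMLU-style scoring step: choose the answer letter
# (A-D) with the highest next-token logit. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_choice(model, tokenizer, prompt, choices=("A", "B", "C", "D")):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    # Take the last sub-token id of each choice letter.
    ids = [tokenizer(c, add_special_tokens=False).input_ids[-1] for c in choices]
    return choices[int(torch.argmax(logits[ids]))]

tokenizer = AutoTokenizer.from_pretrained("./models/llama2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "./models/llama2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
print(predict_choice(model, tokenizer, "Question: 2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:"))
```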
- Step 4: Run the evaluation code to produce the output JSON file:
```
!python /content/llm_model_evaluation/catogories_result_eval.py \
  --catogory "mmlu" \
  --model ./models/llama2-7b-hf \
  --save_dir "./results/results_llama2-7b-hf"
```
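The category step aggregates per-subject accuracies into the four MMLU groups (STEM, humanities, social sciences, other) plus a weighted overall score, which is how the result tables below are populated. A minimal sketch of that aggregation follows; the JSON filename and field names are assumptions:

```python
# Sketch: weight each subject's accuracy by its question count to get
# per-category scores and a weighted overall accuracy.
import json
from collections import defaultdict

# Assumed format: {"abstract_algebra": {"category": "STEM",
#                                       "correct": 30, "total": 100}, ...}
with open("./results/results_llama2-7b-hf/subject_scores.json") as f:
    subjects = json.load(f)

totals = defaultdict(lambda: [0, 0])
for stats in subjects.values():
    totals[stats["category"]][0] += stats["correct"]
    totals[stats["category"]][1] += stats["total"]

for category, (correct, total) in totals.items():
    print(f"{category}: {correct / total:.4f}")

overall = sum(c for c, _ in totals.values()) / sum(t for _, t in totals.values())
print(f"Weighted Accuracy: {overall:.4f}")
```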
- mmlu dataset:
  - Google Colab - mmlu
  - Google Colab - mmlu in the phi-2 model [the Colab free tier can run this example]
- tmmluplus dataset:
- Results on the mmlu dataset:
Model | Weighted Accuracy | STEM | Humanities | Social Sciences | Other | Inference Time (s)
---|---|---|---|---|---|---
Mistral-7B-v0.1 | 0.6254 | 0.5252 | 0.5637 | 0.7358 | 0.7036 | 15624.0
- Results on the tmmluplus dataset:
Model | Weighted Accuracy | STEM | Humanities | Social Sciences | Other | Inference Time (s)
---|---|---|---|---|---|---
Mistral-7B-v0.1 | - | - | - | - | - | -