
Study how LM Evaluation Harness works and try to implement it #231


Open

ggerganov opened this issue Mar 17, 2023 · 9 comments
Labels
enhancement New feature or request generation quality Quality of model output help wanted Extra attention is needed high priority Very important issue research 🔬

Comments

@ggerganov
Member

ggerganov commented Mar 17, 2023

Update 10 Apr 2024: #231 (comment)


It would be great to start doing this kind of quantitative analysis of ggml-based inference:

https://bellard.org/ts_server/

It looks like Fabrice evaluates the models using something called LM Evaluation Harness:

https://github.com/EleutherAI/lm-evaluation-harness

I have no idea what this is yet, but it would be nice to study it and try to integrate it here and in other ggml-based projects.
This will be a very important step, needed to estimate the quality of the generated output and to see if we are on the right track.

@ggerganov ggerganov added enhancement New feature or request high priority Very important issue generation quality Quality of model output labels Mar 17, 2023
@ggerganov ggerganov pinned this issue Mar 17, 2023
@anzz1
Contributor

anzz1 commented Mar 17, 2023

Half the fun in AI, though, is not completely understanding why the results are what they are.

I'm only (half) joking; this will obviously be a good thing. Pitting various models against each other in a common environment seems like the right way forward. This would not only help in training better models, but also present more options varying in quality, speed, and the amount of resources required to run them.

@ggerganov ggerganov mentioned this issue Mar 19, 2023
@gjmulder gjmulder unpinned this issue Mar 27, 2023
@Green-Sky
Collaborator

Green-Sky commented Apr 20, 2023

As far as I can tell, you just have to implement a Python class for the model.

E.g.:
https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/models/gpt2.py

Edit: or here is the "model" API usage for Bellard's textsynth API.

Edit 2: someone created an issue on their end: EleutherAI/lm-evaluation-harness#417

@StellaAthena

Hi! We are quite interested in supporting ggml, but nobody on our team has experience with Python bindings for C AFAIK.

Copying from the issue on our side,

The process would look something like:

  • Make a new file in lm_eval/models called "ggml_model.py" or similar.
  • In that file, make a BaseLM subclass called GGMLLM or similar.

This class should do the following:

  • In initialization, instantiate a model using the Python bindings.
  • Implement the loglikelihood_rolling(), loglikelihood(), and greedy_until() class methods to support all 3 completion types (see gpt3.py or BaseLM for a template to compare to).
  • Add any helper methods needed for those functions!

We’d be happy to help however we can!
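For illustration, here is a rough, untested sketch of what such a lm_eval/models/ggml_model.py could look like, assuming llama-cpp-python as the Python bindings. Method names follow the description above; depending on the harness version, BaseLM may require additional properties (batch_size, device, max_gen_toks, ...) and slightly different request formats than shown here.

```python
# Rough sketch only: a BaseLM subclass backed by llama-cpp-python.
# Assumes llama-cpp-python exposes per-token logits via `llm.scores`
# when the model is constructed with logits_all=True.
import numpy as np
from llama_cpp import Llama
from lm_eval.base import BaseLM


class GGMLLM(BaseLM):
    def __init__(self, model_path, n_ctx=2048):
        super().__init__()
        self.llm = Llama(model_path=model_path, logits_all=True,
                         n_ctx=n_ctx, verbose=False)
        self._n_ctx = n_ctx

    # --- plumbing expected by BaseLM -----------------------------------
    @property
    def eot_token_id(self):
        return self.llm.token_eos()

    @property
    def max_length(self):
        return self._n_ctx

    def tok_encode(self, string):
        return self.llm.tokenize(string.encode("utf-8"), add_bos=False)

    def tok_decode(self, tokens):
        return self.llm.detokenize(tokens).decode("utf-8", errors="ignore")

    # --- the 3 completion types ----------------------------------------
    def loglikelihood(self, requests):
        results = []
        for context, continuation in requests:
            ctx_toks = self.tok_encode(context) if context else [self.llm.token_bos()]
            cont_toks = self.tok_encode(continuation)
            toks = ctx_toks + cont_toks  # no truncation handling, for brevity
            self.llm.reset()
            self.llm.eval(toks)
            logits = np.asarray(self.llm.scores[: len(toks)], dtype=np.float32)
            logprobs = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
            # the token at position i is predicted by the logits at position i - 1
            cont_range = range(len(ctx_toks), len(toks))
            ll = sum(logprobs[i - 1, toks[i]] for i in cont_range)
            greedy = all(int(np.argmax(logprobs[i - 1])) == toks[i] for i in cont_range)
            results.append((float(ll), greedy))
        return results

    def loglikelihood_rolling(self, requests):
        # perplexity-style tasks: score the whole string with an empty context
        return [self.loglikelihood([("", text)])[0][0] for (text,) in requests]

    def greedy_until(self, requests):
        outputs = []
        for context, until in requests:
            stop = until if isinstance(until, list) else [until]
            res = self.llm(context, max_tokens=256, temperature=0.0, stop=stop)
            outputs.append(res["choices"][0]["text"])
        return outputs
```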

@github-actions github-actions bot added the stale label Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@StellaAthena

For the record, we successfully integrated this into the eval harness via llama-cpp-python. Currently it's llama.cpp-specific; extending it to the entire ggml ecosystem would be awesome. Our real bottleneck is not being very familiar with Python bindings (and also manpower).

@farbodbj
Contributor

A built-in model evaluation feature would be of great value to researchers who want to assess model quality. Along with the originally mentioned lm-evaluation-harness, other tools provide this feature, e.g., Google's BIG-bench.
The core of most evaluation tools is simple: they use an API (mostly OpenAI-like) together with standard academic LLM evaluation datasets such as HellaSwag, MMLU, ARC, and other publicly available datasets to ask the model certain questions and verify the answers. Since the evaluator is a separate module and only communicates through the API, I believe existing tools could be added as submodules or dependencies of llama.cpp to provide easy, on-hand evaluation for its users.
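As a minimal sketch of that pattern, the snippet below asks one made-up multiple-choice question through a local llama.cpp server's OpenAI-compatible /v1/chat/completions endpoint (e.g. started with `llama-server -m model.gguf --port 8080`) and checks the answer. The prompt format and answer parsing are illustrative only, not what any particular harness actually does.

```python
# Illustrative evaluator loop: talks only to an OpenAI-like HTTP API.
import requests

API_URL = "http://localhost:8080/v1/chat/completions"  # local llama.cpp server

def ask_multiple_choice(question, choices):
    letters = "ABCD"[: len(choices)]
    prompt = (question + "\n"
              + "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
              + "\nAnswer with a single letter.")
    resp = requests.post(API_URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
        "max_tokens": 4,
    })
    text = resp.json()["choices"][0]["message"]["content"].strip().upper()
    # take the first answer letter that appears in the reply
    return next((l for l in letters if l in text[:3]), None)

# one made-up ARC/MMLU-style item; a real harness would loop over a dataset
pred = ask_multiple_choice(
    "Which data structure gives O(1) average-time lookups by key?",
    ["linked list", "hash table", "binary heap", "stack"],
)
print("correct" if pred == "B" else f"wrong (got {pred})")
```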

@farbodbj
Contributor

farbodbj commented Apr 3, 2025

@ggerganov is this issue still relevant?

@ggerganov
Member Author

It's relevant.

> Since the evaluator is a separate module and only communicates through the API, I believe existing tools could be added as submodules or dependencies of llama.cpp to provide easy, on-hand evaluation for its users.

Most evaluations that I've seen (e.g. HellaSwag, MMLU, PPL) are quite simple to implement as tools in the repository. I don't think there is a need to bring in external dependencies for that.
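For reference, perplexity (PPL) is just the exponential of the average negative log-likelihood of the tokens, so such a tool mostly needs the per-token log-probabilities from the model. A minimal sketch of the calculation itself (illustrative only, not the existing llama.cpp perplexity example):

```python
# PPL = exp( -(1/N) * sum_i log p(t_i | t_<i) )
import math

def perplexity(token_logprobs):
    # token_logprobs: natural-log probability of each token given its prefix
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

print(perplexity([-0.5, -1.2, -0.3]))  # ~1.95
```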

@farbodbj
Contributor

farbodbj commented Apr 4, 2025

> It's relevant.
>
> > Since the evaluator is a separate module and only communicates through the API, I believe existing tools could be added as submodules or dependencies of llama.cpp to provide easy, on-hand evaluation for its users.
>
> Most evaluations that I've seen (e.g. HellaSwag, MMLU, PPL) are quite simple to implement as tools in the repository. I don't think there is a need to bring in external dependencies for that.

Agreed. I don't really know much about how you manage your backlog, but I think creating an issue requesting this feature and tagging it "good first issue" could be helpful. I would be more than glad to implement it, but I can't guarantee doing it in the next two months; maybe someone else can get to it before then.
