Fix `TGI` (Text Generation Inference) Endpoint Inference and TGI JSON Grammar Generation #502

cpcdoy · 2025-01-15T14:31:42Z

Description

While implementing a custom task with lighteval, I needed constrained grammar generation with TGI, but the TGI integration appears to be out of date and not working.

Fixes for TGI Endpoint Inference

The /info route of TGI 3.0.1 doesn't always return required fields such as model_dtype, so model_dtype now defaults to None when it is missing:

$ curl http://localhost:8080/info
{"model_id":"unsloth/Qwen2.5-0.5B-Instruct","model_sha":"6a7b5090fc11df0706c796b7ba76762d7beb688b","model_pipeline_tag":"text-generation","max_concurrent_requests":128,"max_best_of":2,"max_stop_sequences":4,"max_input_tokens":32767,"max_total_tokens":32768,"validation_workers":2,"max_client_batch_size":4,"router":"text-generation-router","version":"3.0.1","sha":"bb9095aae339579fbf3b4e7be3909932de26a7ee","docker_label":"sha-bb9095a"}
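A minimal sketch of the defaulting behaviour described above, assuming the /info response has already been parsed into a dict. The helper name `read_model_dtype` is hypothetical and is not lighteval's actual code; the sample payload is an abridged copy of the response shown above.

```python
import json
from typing import Any, Optional

# Abridged copy of the /info response above; note that it has no "model_dtype" key.
INFO_RESPONSE = (
    '{"model_id": "unsloth/Qwen2.5-0.5B-Instruct", '
    '"max_input_tokens": 32767, "max_total_tokens": 32768, "version": "3.0.1"}'
)


def read_model_dtype(info: dict[str, Any]) -> Optional[str]:
    """Return model_dtype if the server reported it, otherwise None (hypothetical helper)."""
    return info.get("model_dtype")


info = json.loads(INFO_RESPONSE)
dtype = read_model_dtype(info)
print(dtype)  # None
```

Using `dict.get` rather than indexing avoids a `KeyError` when the server omits the field.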

AsyncClient from TGI has a generate function that expects individual parameters rather than a single structure.
- I've set the do_sample, return_full_text and watermark parameters to False by default: they come from huggingface_hub, which accepts None as their default, but TGI doesn't accept None for them.
  - Question for a maintainer: Should they be set this way by default? I don't see them being provided to _async_process_request anyway, so maybe this should be fixed in another PR. The same goes for adapter_id for LoRA heads.
ModelClient's usage has been fixed to take config: TGIModelConfig instead of named parameters.
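A minimal sketch of the None-to-False coercion described above. `GenerationConfigSketch` and `to_generate_kwargs` are hypothetical names used only for illustration; they are not the huggingface_hub or lighteval types, and only a handful of the generation parameters are shown.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class GenerationConfigSketch:
    """Hypothetical stand-in for a generation config whose fields default to None."""

    max_new_tokens: Optional[int] = None
    do_sample: Optional[bool] = None
    return_full_text: Optional[bool] = None
    watermark: Optional[bool] = None
    stop: Optional[list[str]] = None


def to_generate_kwargs(cfg: GenerationConfigSketch) -> dict[str, Any]:
    """Flatten the config into keyword arguments, mapping None booleans to False."""
    return {
        "max_new_tokens": cfg.max_new_tokens,
        "stop_sequences": cfg.stop,
        # `x or False` leaves True unchanged and turns None into False.
        "do_sample": cfg.do_sample or False,
        "return_full_text": cfg.return_full_text or False,
        "watermark": cfg.watermark or False,
    }


kwargs = to_generate_kwargs(GenerationConfigSketch(max_new_tokens=128, stop=["\n\n"]))
print(kwargs["do_sample"], kwargs["return_full_text"], kwargs["watermark"])
```

The resulting dict can then be unpacked into a call that takes individual keyword arguments.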

Fixes for TGI JSON Grammar Generation

Updated text_generation to 0.7.0
Added support for the grammar field to enable JSON grammar generation
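A minimal sketch of what a JSON grammar request can look like, assuming the `{"type": "json", "value": <schema>}` grammar shape that appears in the TGI logs further down and the `inputs`/`parameters` request body of TGI's /generate route. The function name `build_generate_payload` and the prompt text are hypothetical; no request is actually sent.

```python
from typing import Any

# JSON schema matching the grammar visible in the TGI logs of this PR.
ENTITY_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "entity": {"type": "string"},
                    "classification": {
                        "type": "string",
                        "enum": ["merchant", "bank", "individual", "date", "location", "unknown"],
                    },
                },
                "required": ["entity", "classification"],
            },
        }
    },
    "required": ["entities"],
}


def build_generate_payload(prompt: str, schema: dict[str, Any], max_new_tokens: int = 128) -> dict[str, Any]:
    """Build a request body with a JSON grammar constraint (hypothetical helper)."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "grammar": {"type": "json", "value": schema},
        },
    }


payload = build_generate_payload("Extract the entities from: ...", ENTITY_SCHEMA)
print(payload["parameters"]["grammar"]["type"])  # json
```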

Environment

Command

uv run lighteval endpoint tgi tgi.yaml "custom|...|0|0" --custom-tasks "ner_eval.py" --output-dir "results" --max-samples 10 --override-batch-size 1 --use-chat-template --save-details --no-public-run

Dependencies

dependencies = [
    "datasets>=3.2.0",
    "huggingface-hub>=0.27.1",
    "lighteval[tgi]>=0.7.0",
    "numpy>=1.26.4",
    "pandas>=2.2.3",
    "pydantic>=1.10.21",
    "text-generation==0.6.0",
    "torch>=2.4.1",
    "torchvision>=0.19.1",
]

[tool.uv.sources]
lighteval = { path = "../../../../lighteval", editable = true } # This branch

`model_config_path` argument for TGI

tgi.yaml:

model:
  instance:
    inference_server_address: "http://localhost:8080"
    inference_server_auth: null
    model_id: null # Optional, only required if the TGI container was launched with model_id pointing to a local directory

Test Results

The evaluation runs end to end, as the TGI and lighteval logs below show.

TGI Logs with JSON Grammar Generation

2025-01-15T17:09:34.811955Z  INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("1-nvidia-geforce-rtx-3060"))}:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(128), return_full_text: Some(false), stop: ["\n\n", "<|im_end|>"], truncate: None, watermark: false, details: true, decoder_input_details: true, seed: None, top_n_tokens: None, grammar: Some(Json(Object {"type": String("object"), "properties": Object {"entities": Object {"type": String("array"), "items": Object {"type": String("object"), "properties": Object {"entity": Object {"type": String("string")}, "classification": Object {"type": String("string"), "enum": Array [String("merchant"), String("bank"), String("individual"), String("date"), String("location"), String("unknown")]}}, "required": Array [String("entity"), String("classification")]}}}, "required": Array [String("entities")]})), adapter_id: None } total_time="428.587752ms" validation_time="716.935µs" queue_time="82.504µs" inference_time="427.788413ms" time_per_token="25.164024ms" seed="None"}: text_generation_router::server: router/src/server.rs:422: Success

Lighteval Logs

(py3.11.3) cpcdoy@cpcdoy-desktop:~/projects/.../llm_tasks_eval$ uv run lighteval endpoint tgi tgi.yaml "custom|...|0|0" --custom-tasks "ner_eval.py" --output-dir "results" --max-samples 10 --override-batch-size 1 --use-chat-template --save-details --no-public-run
warning: `VIRTUAL_ENV=/home/cpcdoy/py3.11.3` does not match the project environment path `.venv` and will be ignored
[2025-01-15 15:11:24,861] [    INFO]: PyTorch version 2.4.1 available. (config.py:54)
[2025-01-15 15:11:28,418] [ WARNING]: --max_samples WAS SET. THESE NUMBERS ARE ONLY PARTIAL AND SHOULD NOT BE USED FOR COMPARISON UNLESS YOU KNOW WHAT YOU ARE DOING. (pipeline.py:132)
[2025-01-15 15:11:28,418] [    INFO]: --- LOADING MODEL --- (pipeline.py:168)
[2025-01-15 15:11:28,418] [    INFO]: Load model from inference server: http://localhost:8080 (model_loader.py:110)
[2025-01-15 15:11:28,846] [    INFO]: --- LOADING TASKS --- (pipeline.py:195)
[2025-01-15 15:11:28,858] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`. (registry.py:136)
[2025-01-15 15:11:28,858] [    INFO]: Found 1 custom tasks in /home/cpcdoy/.cache/huggingface/modules/datasets_modules/datasets/ner_eval/1739d6fd80c40f11df64fba54bf39bd05b1b1408659c4325f28f0ca9ee2a04b0/ner_eval.py (registry.py:141)
[2025-01-15 15:11:28,861] [    INFO]: ... default (lighteval_task.py:187)
[2025-01-15 15:11:28,861] [ WARNING]: Careful, the task ... is using evaluation data to build the few shot examples. (lighteval_task.py:261)
[2025-01-15 15:11:28,898] [    INFO]: --- INIT SEEDS --- (pipeline.py:224)
[2025-01-15 15:11:28,899] [    INFO]: --- RUNNING MODEL --- (pipeline.py:267)
[2025-01-15 15:11:28,899] [    INFO]: Running RequestType.GREEDY_UNTIL requests (pipeline.py:271)
[2025-01-15 15:11:28,903] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:260)
Splits: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.90s/it]
[2025-01-15 15:11:33,800] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:299)                                                                  
[2025-01-15 15:11:33,802] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:342)
|            Task             |Version|        Metric         |Value|   |Stderr|
|-----------------------------|------:|-----------------------|----:|---|-----:|
...

[2025-01-15 15:11:33,824] [    INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:332)
[2025-01-15 15:11:33,825] [    INFO]: Saving experiment tracker (evaluation_tracker.py:154)
[2025-01-15 15:11:33,848] [    INFO]: Saving results to ... (evaluation_tracker.py:208)
[2025-01-15 15:11:33,851] [    INFO]: Saving details to ... (evaluation_tracker.py:216)
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 82.46ba/s]

Note: I have anonymized parts of the logs

cpcdoy · 2025-01-15T17:07:59Z

Updated the PR to add support for JSON Grammar Constrained Generation for TGI

NathanHB

Thanks for the PR! There are a few things I'm not sure I understand.

NathanHB · 2025-02-05T09:33:38Z

src/lighteval/models/endpoints/endpoint_model.py

@@ -491,6 +492,7 @@ def _process_batch_logprob(
                context=request.context if rolling else request.context + request.choice,
                stop_tokens=[],
                max_tokens=1,
+                grammar=request.generation_grammar,


Why is the grammar in the request here while it is defined in the generation config?

cpcdoy · 2025-02-07

Thank you for the feedback @NathanHB 🙏🏻

I did it similarly to how I've seen it done in the same file. See here for an example.
Potentially the usage could be improved in a follow-up PR.

Lmk if I'm also missing something on my end

NathanHB · 2025-04-08

Hey! Sorry for the delay. In your PR you are adding generation_grammar to the GenerationParameters and not to the requests. You would need to add a field to the requests (defaulting to None). Though maybe I'm missing something: did you test with your tasks and make sure it was using the correct grammar?
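A minimal sketch of the suggestion above, assuming a request object that carries an optional grammar. `GreedyUntilRequestSketch` and `grammar_for` are hypothetical names for illustration and are not lighteval's request classes.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class GreedyUntilRequestSketch:
    """Hypothetical request type with an optional grammar that defaults to None."""

    context: str
    stop_sequence: list[str] = field(default_factory=list)
    generation_size: Optional[int] = None
    generation_grammar: Optional[dict[str, Any]] = None


def grammar_for(request: GreedyUntilRequestSketch) -> Optional[dict[str, Any]]:
    """Return the grammar to forward to the endpoint, or None for unconstrained generation."""
    return request.generation_grammar


plain = GreedyUntilRequestSketch(context="Classify: ...")
constrained = GreedyUntilRequestSketch(
    context="Classify: ...",
    generation_grammar={"type": "json", "value": {"type": "object"}},
)
print(grammar_for(plain), grammar_for(constrained)["type"])
```

Defaulting the field to None keeps existing tasks that define no grammar unchanged.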

cpcdoy · 2025-04-09

Hey @NathanHB, no worries! Yes, I've used my branch to run evaluations on several models. To double-check, I just reran the evaluation pipeline I wrote at the time. It runs on a base Qwen2.5-0.5B-Instruct that is not fine-tuned, so following my grammar is harder for the model, which helps demonstrate the point. TGI does receive my grammar, as its logs show:

2025-04-09T12:48:30.781244Z INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("1-nvidia-geforce-rtx-3060"))}:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(256), return_full_text: Some(false), stop: ["\n\n", "<|im_end|>"], truncate: None, watermark: false, details: true, decoder_input_details: true, seed: None, top_n_tokens: None, grammar: Some(Json(Object {"type": String("object"), "properties": Object {"reason": Object {"type": String("string"), "description": String("Reasoning for the classification")}, "classification": Object {"type": String("string"), "description": String("Banking transaction classification"), "enum": Array [String("label 0"), ..., String("label n"),]}, "confidence": Object {"type": String("string"), "enum": Array [String("high"), String("medium"), String("low")]}}, "required": Array [String("classification"), String("reason"), String("confidence")]})), adapter_id: None } total_time="1.430241619s" validation_time="1.408552ms" queue_time="107.404µs" inference_time="1.428725763s" time_per_token="18.085136ms" seed="None"}: text_generation_router::server: router/src/server.rs:422: Success
2025-04-09T12:48:30.787272Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 468

The model's predictions look like this and follow the grammar format exactly:
"predictions":"['{\"reason\": \"...\", \"classification\": \"...\", \"confidence\": \"high\"}']

This is how I define the grammar and then I run it using the lighteval command on my script:

...

def get_bank_system_classification_grammar() -> TextGenerationInputGrammarType:
    return TextGenerationInputGrammarType(
        type="json",
        value={
            "type": "object",
            "properties": {
                "reason": {
                    "type": "string",
                    "description": "Reasoning for the classification",
                },
                "classification": {
                    "type": "string",
                    "description": "Banking transaction classification",
                    "enum": BANK_SYSTEM_LABELS,
                },
                "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
            },
            "required": ["reason", "classification", "confidence"],
        },
    )


DATASET_DIR = "src/llm_tasks_eval/datasets/bank_system_classification_dataset"

BANK_SYSTEM_CLASSIFICATION_TASK = LightevalTaskConfig(
    name="bank_system_classification",
    prompt_function=prompt_bank_system_classification,
    suite=["custom"],
    hf_repo=DATASET_DIR,
    hf_subset=None,
    metric=[
        bank_system_group,
        semantic_similarity_metric,
        detailed_classification_metric,
        openai_comparison_metric,
    ],
    generation_size=256,
    generation_grammar=get_bank_system_classification_grammar(),
    stop_sequence=["\n\n"],
    trust_dataset=True,
    evaluation_splits=["test"],
    hf_avail_splits=["test"],
)

TASKS_TABLE = [BANK_SYSTEM_CLASSIFICATION_TASK]

Note: I have censored/truncated multiple outputs and fields since they contain sensitive elements from work.
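A minimal sketch of how one could check that a prediction fits a flat JSON grammar like the one above, using only the standard library. `check_prediction` is a hypothetical helper that handles just `required`, string types and `enum`; it is not a full JSON Schema validator. The labels and sample predictions are placeholders, since the real ones are redacted.

```python
import json
from typing import Any

# Placeholder schema with the same shape as the grammar above; labels are hypothetical.
SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "reason": {"type": "string"},
        "classification": {"type": "string", "enum": ["label 0", "label 1"]},
        "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
    },
    "required": ["reason", "classification", "confidence"],
}


def check_prediction(raw: str, schema: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the prediction fits this flat schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]
    problems = [f"missing required key: {key}" for key in schema.get("required", []) if key not in obj]
    for key, spec in schema.get("properties", {}).items():
        if key not in obj:
            continue
        if spec.get("type") == "string" and not isinstance(obj[key], str):
            problems.append(f"{key} is not a string")
        elif "enum" in spec and obj[key] not in spec["enum"]:
            problems.append(f"{key} not in enum: {obj[key]!r}")
    return problems


good = '{"reason": "placeholder", "classification": "label 0", "confidence": "high"}'
bad = '{"classification": "label 0", "confidence": "certain"}'
print(check_prediction(good, SCHEMA))
print(check_prediction(bad, SCHEMA))
```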

NathanHB · 2025-02-05T09:33:52Z

src/lighteval/models/endpoints/tgi_model.py

+        generated_text = self.client.generate(
+            prompt=context,
+            do_sample=generation_config.do_sample or False,
+            max_new_tokens=generation_config.max_new_tokens,
+            best_of=generation_config.best_of,
+            repetition_penalty=generation_config.repetition_penalty,
+            return_full_text=generation_config.return_full_text or False,
+            seed=generation_config.seed,
+            stop_sequences=generation_config.stop,
+            temperature=generation_config.temperature,
+            top_k=generation_config.top_k,
+            top_p=generation_config.top_p,
+            truncate=generation_config.truncate,
+            typical_p=generation_config.typical_p,
+            watermark=generation_config.watermark or False,
+            decoder_input_details=generation_config.decoder_input_details,
+            grammar=generation_config.grammar,
+        )


Is this needed?

cpcdoy · 2025-02-07

IIRC, this is the interface that text-generation==0.7.0 exposes; it differs from before because this PR upgrades the package from 0.6.0.

Did you mean to ask whether there is a cleaner way to do this?

naufalso · 2025-02-07T09:16:57Z

UP! I encountered a similar issue: this bug prevented us from using the TGI endpoint. The key issues I found are:

- Lines 111-113 in `src/lighteval/models/model_loader.py`. The current implementation:
```
model = ModelClient(address=config.inference_server_address, auth_token=config.inference_server_auth, model_id=config.model_id)
```
should be updated to:
```
model = ModelClient(config=config)
```
This ensures that the initialization parameters are correctly passed to ModelClient, resolving configuration-related issues.
- model_dtype issue: the model_dtype field is not consistently available on the /info route of TGI, which leads to errors when the field is required. To address this, model_dtype should be set to None by default.
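A minimal sketch of why the old call fails, assuming a client whose constructor accepts only a config object. `TGIConfigSketch` and `ModelClientSketch` are hypothetical stand-ins that mirror the three fields of the tgi.yaml shown in the PR description; they are not lighteval's classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TGIConfigSketch:
    """Hypothetical config mirroring the fields of tgi.yaml."""

    inference_server_address: str
    inference_server_auth: Optional[str] = None
    model_id: Optional[str] = None


class ModelClientSketch:
    """Hypothetical client that takes a single config object."""

    def __init__(self, config: TGIConfigSketch) -> None:
        self.address = config.inference_server_address
        self.auth_token = config.inference_server_auth
        self.model_id = config.model_id


config = TGIConfigSketch(inference_server_address="http://localhost:8080")

# Old call style: named parameters that the constructor no longer accepts.
try:
    ModelClientSketch(address=config.inference_server_address)
except TypeError as exc:
    old_call_error = str(exc)

# New call style: pass the config object itself.
client = ModelClientSketch(config=config)
print(old_call_error)
print(client.address)
```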

cpcdoy · 2025-02-07T18:48:48Z

Exactly, @naufalso, this is already solved in this PR!

ZQ-Dev8 · 2025-04-02T18:07:27Z

+1, is this going to be merged, @NathanHB? I would really like to use lighteval with a locally hosted TGI, but I'm seeing the same error described above: TypeError: ModelClient.__init__() got an unexpected keyword argument 'address'.

cpcdoy added 2 commits January 15, 2025 15:12

fix: Lighteval communication with TGI

ab68a1b

fix: JSON grammar constrained generation

f442a29

cpcdoy changed the title ~~Fix TGI (Text Generation Inference) Endpoint Inference~~ Fix TGI (Text Generation Inference) Endpoint Inference and TGI JSON Grammar Generation Jan 15, 2025

NathanHB reviewed Feb 5, 2025

Merge branch 'main' into fix/tgi_inference

6bb6b13

NathanHB and others added 2 commits April 8, 2025 11:49

Merge branch 'main' into fix/tgi_inference

cb13617

Merge branch 'main' into fix/tgi_inference

749887e
