@jingkang99 I think I know what the issue is in your case -- please check which version of huggingface_hub is installed in your env. The newest version has an issue with details=True in stream mode: huggingface#1876.
To resolve this issue, please install requirements as mentioned in the README.
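To see whether the installed huggingface_hub matches what the requirements pin, a quick check like the following can help (a minimal sketch; the exact version to install comes from the repo's requirements file, not from this snippet):

```python
# Minimal sketch: report the installed huggingface_hub version so it can be
# compared against the version pinned in the tgi-gaudi requirements.
import importlib.metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


print("huggingface_hub:", installed_version("huggingface_hub"))
```

If the printed version differs from the pinned one, reinstalling from the requirements file as the README describes should resolve it.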
System Info
Ubuntu 22.04
ghcr.io/huggingface/tgi-gaudi 2.0.0
cd tgi-gaudi/examples
python run_generation.py --model_id meta-llama/Llama-2-7b-hf --max_concurrent_requests 50 --max_input_length 200 --max_output_length 200 --total_sample_count 200
100%|████████████| 200/200 [01:08<00:00, 2.92it/s]
----- Performance summary -----
Throughput: 0.0 tokens/s
Throughput: 0.0 queries/s
First token latency:
Median: 18783.24ms
Average: 16835.90ms
Output token latency:
Median: 14.22ms
Average: 15.22ms
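For reference, the throughput numbers the summary should report can be sanity-checked by hand from the run above (a hedged sketch; the variable names are illustrative, not taken from run_generation.py, and the 68 s wall time is read off the tqdm bar's 01:08):

```python
# Hedged back-of-the-envelope check of what the summary should print,
# using the benchmark parameters and the tqdm wall time from the run above.
total_sample_count = 200
max_output_length = 200      # upper bound on tokens generated per query
wall_time_s = 68.0           # tqdm bar: 200/200 in 01:08

tokens_per_s = total_sample_count * max_output_length / wall_time_s
queries_per_s = total_sample_count / wall_time_s
print(f"Throughput: {tokens_per_s:.1f} tokens/s")   # roughly 588.2
print(f"Throughput: {queries_per_s:.1f} queries/s")  # roughly 2.9
```

Non-zero values in this range are what the summary would be expected to show, so the reported 0.0 points at the metrics collection rather than the run itself.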
Information
Tasks
Reproduction
By the way, if --max_concurrent_requests 50 is not specified, the following error occurs:
Thread failed with error: Model is overloaded
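The overload error can also be avoided by capping concurrency on the client side. The sketch below shows the general idea with a semaphore (a hedged illustration; the names are made up for this example and are not from run_generation.py or TGI):

```python
# Hedged sketch: limit the number of in-flight requests client-side with a
# semaphore, the same pressure valve that --max_concurrent_requests 50
# provides on the server side. Names here are illustrative only.
import threading
import time

MAX_CONCURRENT = 50
slots = threading.BoundedSemaphore(MAX_CONCURRENT)
lock = threading.Lock()
in_flight = 0
peak_in_flight = 0


def send_request(payload):
    """Stand-in for one generate call against the TGI endpoint."""
    global in_flight, peak_in_flight
    with slots:  # blocks while MAX_CONCURRENT requests are already in flight
        with lock:
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
        time.sleep(0.005)  # placeholder for the real request round-trip
        with lock:
            in_flight -= 1


threads = [threading.Thread(target=send_request, args=(i,)) for i in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrent requests:", peak_in_flight)
```

With the semaphore in place, no more than 50 requests reach the server at once, which mirrors what the server-side flag enforces.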
Expected behavior
Throughput should be calculated and reported with non-zero values, instead of:
Throughput: 0.0 tokens/s
Throughput: 0.0 queries/s