@jingkang99 I think I know what the issue is in your case -- please check which version of huggingface_hub is installed in your env. The newest version has an issue with details=True in stream mode: huggingface#1876.
To resolve this issue, please install requirements as mentioned in the README.
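To see whether the installed huggingface_hub matches what the requirements pin, a quick check like the following can help (a minimal sketch; the exact version to install comes from the repo's requirements file, not from this snippet):

```python
# Minimal sketch: report the installed huggingface_hub version so it can be
# compared against the version pinned in the tgi-gaudi requirements.
import importlib.metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


print("huggingface_hub:", installed_version("huggingface_hub"))
```

If the printed version differs from the pinned one, reinstalling from the requirements file as the README describes should resolve it.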
System Info
Ubuntu 22.04
ghcr.io/huggingface/tgi-gaudi 2.0.0
cd tgi-gaudi/examples
python run_generation.py --model_id meta-llama/Llama-2-7b-hf --max_concurrent_requests 50 --max_input_length 200 --max_output_length 200 --total_sample_count 200
100%|████████████| 200/200 [01:08<00:00, 2.92it/s]
----- Performance summary -----
Throughput: 0.0 tokens/s
Throughput: 0.0 queries/s
First token latency:
Median: 18783.24ms
Average: 16835.90ms
Output token latency:
Median: 14.22ms
Average: 15.22ms
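For reference, the throughput numbers the summary should report can be sanity-checked by hand from the run above (a hedged sketch; the variable names are illustrative, not taken from run_generation.py, and the 68 s wall time is read off the tqdm bar's 01:08):

```python
# Hedged back-of-the-envelope check of what the summary should print,
# using the benchmark parameters and the tqdm wall time from the run above.
total_sample_count = 200
max_output_length = 200      # upper bound on tokens generated per query
wall_time_s = 68.0           # tqdm bar: 200/200 in 01:08

tokens_per_s = total_sample_count * max_output_length / wall_time_s
queries_per_s = total_sample_count / wall_time_s
print(f"Throughput: {tokens_per_s:.1f} tokens/s")   # roughly 588.2
print(f"Throughput: {queries_per_s:.1f} queries/s")  # roughly 2.9
```

Non-zero values in this range are what the summary would be expected to show, so the reported 0.0 points at the metrics collection rather than the run itself.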
Information
Tasks
Reproduction
By the way, if --max_concurrent_requests 50 is not specified, the following error occurs:
Thread failed with error: Model is overloaded
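The overload error can also be avoided by capping concurrency on the client side. The sketch below shows the general idea with a semaphore (a hedged illustration; the names are made up for this example and are not from run_generation.py or TGI):

```python
# Hedged sketch: limit the number of in-flight requests client-side with a
# semaphore, the same pressure valve that --max_concurrent_requests 50
# provides on the server side. Names here are illustrative only.
import threading
import time

MAX_CONCURRENT = 50
slots = threading.BoundedSemaphore(MAX_CONCURRENT)
lock = threading.Lock()
in_flight = 0
peak_in_flight = 0


def send_request(payload):
    """Stand-in for one generate call against the TGI endpoint."""
    global in_flight, peak_in_flight
    with slots:  # blocks while MAX_CONCURRENT requests are already in flight
        with lock:
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
        time.sleep(0.005)  # placeholder for the real request round-trip
        with lock:
            in_flight -= 1


threads = [threading.Thread(target=send_request, args=(i,)) for i in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrent requests:", peak_in_flight)
```

With the semaphore in place, no more than 50 requests reach the server at once, which mirrors what the server-side flag enforces.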
Expected behavior
Throughput should be calculated and reported with non-zero values, instead of:
Throughput: 0.0 tokens/s
Throughput: 0.0 queries/s