Error with 32k Long Text in chatglm2-6b-32k Model #1725
Does it fail consistently regardless of input, or only on specific inputs? It looks like a PyTorch / GPU memory access issue.
Short text inputs work without any problems; long text triggers the error above. I used about 30,000 tokens for the long text, since the model supports 32k.
What's your hardware configuration? I wonder whether this is an OOM issue in disguise...
A100 and 3090 both show the same error, with CUDA 12.2. I found slight differences in model inference between THUDM/chatglm2-6b-32k and THUDM/chatglm2-6b: the RotaryEmbedding and KV-cache logic are different. Currently, vLLM supports chatglm2-6b but does not support chatglm2-6b-32k.
I found the cause of this: the original logic has a bug, and two places need to change. First, in the _compute_inv_freq function in rotary_embedding.py, add base = base * self.rope_ratio (or simply hard-code base = base * 50, since the official code does not yet support passing the rope_ratio parameter). Second, change line 79 of the GLMAttention class to self.attn = PagedAttentionWithRoPE(
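For anyone applying this by hand, here is a minimal standalone sketch of the inverse-frequency change described in the comment above. The function name and arguments are illustrative rather than the exact vLLM method signature; the rope_ratio value of 50 is the one mentioned above for chatglm2-6b-32k.

```python
import torch

# Illustrative sketch (not the exact vLLM code): chatglm2-6b-32k scales the
# RoPE base by rope_ratio before computing the inverse frequencies, which is
# the adjustment the fix above adds to _compute_inv_freq.
def compute_inv_freq(rotary_dim: int, base: float = 10000.0,
                     rope_ratio: float = 1.0) -> torch.Tensor:
    # chatglm2-6b-32k ships rope_ratio = 50; plain chatglm2-6b effectively uses 1.
    scaled_base = base * rope_ratio
    return 1.0 / (scaled_base ** (
        torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))

# Example: with rope_ratio = 50 the frequencies drop, stretching the rotary
# positions so that long (up to 32k-token) sequences are encoded correctly.
print(compute_inv_freq(rotary_dim=64, rope_ratio=50.0)[:4])
```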
@junior-zsy, can you submit a PR to address this so others won't run into the same issue? 🙇♂️
Ah, I think so.
Yes, it has been resolved by #1841.
python3 api_server.py --model /hbox2dir/chatglm2-6b-32k --trust-remote-code --host 0.0.0.0 --port 7070 --tensor-parallel-size 2
Strangely, the inference process fails even on 8 GPUs, whereas the Hugging Face version of the model performs well on a 2-GPU setup.