Process isAssertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_recv.h at line 1160: cma_read_nbytes == size #107

Closed
JunxiChhen opened this issue Dec 1, 2023 · 6 comments

Comments

@JunxiChhen
Contributor

Issue occurred:
image

This happens when running llama2-7b with input length 4096, output length 2048, batch size 16, and beam width 1 on SPR-HBM in flat mode with SNC4.
Benchmarking command:
image

@pujiang2018
Contributor

It looks like something is wrong with oneCCL. @JunxiChhen Do we still encounter this issue?

@JunxiChhen
Contributor Author

> It looks like something is wrong with oneCCL. @JunxiChhen Do we still encounter this issue?

Yes, but it now occurs only on SPR-HBM in SNC4 flat mode. I didn't see any issue on SPR in Quad mode.

@pujiang2018
Contributor

@shanzhou2186 Have you ever encountered this issue? We would need an environment that reproduces it in order to debug.

@shanzhou2186

Have you checked the memory usage on each sub-NUMA node? Is it possible that one of the sub-NUMA nodes ran out of memory (OOM)?
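
One way to check this is sketched below. It is only a rough sketch, assuming a Linux host that exposes per-node statistics under `/sys/devices/system/node/node*/meminfo`; the helper names `parse_node_meminfo` and `node_memory_report` are made up for illustration and are not part of xFasterTransformer or oneCCL. In flat mode the HBM is usually exposed as additional NUMA nodes, so a nearly exhausted node should stand out in the output.

```python
import glob
import os
import re

# Matches lines such as "Node 0 MemFree:        12345 kB".
_LINE = re.compile(r"Node\s+(\d+)\s+(\w+):\s+(\d+)\s*kB")


def parse_node_meminfo(text):
    """Parse one node's meminfo text into {field_name: size_in_kB}."""
    fields = {}
    for line in text.splitlines():
        m = _LINE.match(line.strip())
        if m:
            fields[m.group(2)] = int(m.group(3))
    return fields


def node_memory_report(pattern="/sys/devices/system/node/node*/meminfo"):
    """Return {node_name: (MemTotal_kB, MemFree_kB)} for every matching node."""
    report = {}
    for path in glob.glob(pattern):
        with open(path) as f:
            info = parse_node_meminfo(f.read())
        node = os.path.basename(os.path.dirname(path))  # e.g. "node3"
        report[node] = (info.get("MemTotal", 0), info.get("MemFree", 0))
    return report


if __name__ == "__main__":
    report = node_memory_report()
    # Sort numerically so that node10 comes after node2.
    for node in sorted(report, key=lambda n: int(n[4:])):
        total_kb, free_kb = report[node]
        print(f"{node}: {free_kb / 2**20:.1f} GiB free of {total_kb / 2**20:.1f} GiB")
```

Running this in a loop while the benchmark is active would show whether free memory on any single node approaches zero before the assertion fires.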

@shanzhou2186

> @shanzhou2186 Have you ever encountered this issue? We would need an environment that reproduces it in order to debug.

No. I haven't tried such large input/output sizes before.

@zongy17

zongy17 commented Dec 26, 2024

@JunxiChhen Hi, I also got this error when using Intel oneAPI (version 2023.2.0). Did you ever find out how to avoid this bug?

Duyi-Wang added a commit that referenced this issue Mar 19, 2025

4 participants