Fix error showing time spent in llama perf context print #1898

Conversation

shakalaca
Contributor
This PR addresses the issue reported in #1830: after the 0.3.0 update, llama_perf_context_print() no longer correctly displays inference time, tokens per second, and other related data.

After some investigation, I found that this change, f8fcb3e, introduced the issue by picking up a commit from the upstream llama.cpp repo: ggml-org/llama.cpp@0abc6a2. That commit added the no_perf parameter; although it defaults to false, llama_context_default_params() sets it to true for external callers, so llama_synchronize() no longer accumulates performance metrics. As a result, llama-cpp-python displays incorrect information when llama_perf_context_print() is called.

In addition to adding the no_perf field in llama_cpp.py, we should also set no_perf to false in llama.py (see the sketch below). Since the llama-cpp-python project always calls llama_perf_context_print() during usage, I don't see a reason not to collect this information. Of course, if we want to stay consistent with llama.cpp's defaults, we could add an API that lets users set the no_perf value themselves, giving them a way to toggle performance statistics.
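For illustration, a minimal sketch of how the flag could be re-enabled through the low-level bindings once the field is exposed; the exact call sequence here is an assumption for demonstration, not the project's documented API:

```python
import llama_cpp

# Fetch the default context parameters; per the analysis above, upstream
# llama.cpp sets no_perf to True here, which suppresses the timing data
# that llama_perf_context_print() reports.
ctx_params = llama_cpp.llama_context_default_params()

# Re-enable performance collection so llama_synchronize() accumulates timings.
# Assumes the no_perf field added by this PR is present on llama_context_params.
ctx_params.no_perf = False
```

Setting the flag on the default params before the context is created mirrors what this PR proposes for llama.py.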

shakalaca and others added 3 commits January 18, 2025 10:37
Add `no_perf` field to `llama_context_params` to optionally disable performance timing measurements.
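A rough sketch of what the ctypes declaration in llama_cpp.py might look like; the neighboring fields shown are placeholders, and the real declaration must mirror the field order of struct llama_context_params in llama.h exactly:

```python
import ctypes

class llama_context_params(ctypes.Structure):
    _fields_ = [
        # ... earlier fields elided; ctypes reads each field at a fixed offset,
        #     so the order must match the C struct exactly ...
        ("embeddings", ctypes.c_bool),
        ("offload_kqv", ctypes.c_bool),
        ("flash_attn", ctypes.c_bool),
        ("no_perf", ctypes.c_bool),  # new: disables performance timing measurements when True
        # ... trailing callback fields elided ...
    ]
```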
abetlen merged commit 4442ff8 into abetlen:main on Jan 29, 2025
14 checks passed