-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
C RecMatMul Using BenchmarkRunner #749
Conversation
The benchmark has a race condition, so we expect some entries to have smaller values than they should. When you run it with a single thread, do the validation failed messages go away? (The number of rows and columns must also be a power of 2, but I don't think that's the issue here.) |
I see. Interesting.
Yes they do. Thanks for the clarification. |
Co-authored-by: Peter Donovan <33707478+petervdonovan@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This helps a lot, but for some reason C++ is still a bit faster (C++ is blue in the plot below).
I don't think we need to worry about it though.
Note that I significantly simplified the C++ benchmark runner. It only has a start and a finished port now.
I think in order to merge this we should add this version including the runner alongside the original benchmark. Unfortunately, the python runner script will not parse the output of this modified benchmark correctly.
I do not disagree, but in case anyone is still concerned: On my machine, median execution time decreased from 5.5s to 3.1s for the single-threaded runtime when I used this function
as a replacement for Apparently transposing a matrix in this way is a standard trick for speeding up matrix multiplication (because of cache performance). I know that our results are invalid if there are algorithmic differences between implementations, but this seems fairly low-level. If I understand correctly, it is little more than a change in how we interact with the hardware prefetcher, which seems mild enough compared to what a JIT compiler can do. Then again, it appears to me that the C version with the benchmark runner but without transposing might be the most similar to the C++ version, judging from the first and subsequent execution times on my machine. One might also just dismiss this as silliness: It only highlights the fact that this benchmark tells us more about the content of the inner loop than it tells us about the runtime. |
Interesting! I don't think this explains the small gap between C and C++ though (unless the C++ compiler is smart enough to make this optimization automatically). I don't think it would be unfair to optimize the matrix access, but we should do it similarly in both the C++ and the C version. Then, however, I would expect to see the same gap again. |
This is based on a suggestion from Christian, since the Python runner script does not parse the output of the modified benchmark correctly. The original benchmark has a couple of minor corrections that also appear in the one that uses the benchmark runner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thank you both
This adds back the BenchmarkRunner reactor and makes use of it in the RecMatMul (
MatMul.lf
) benchmark.@petervdonovan When I use the benchmark runner script to run this benchmark, I get validation failed messages. Do you know what might be the reason for this?