C RecMatMul Using BenchmarkRunner #749

Merged: 5 commits, Nov 22, 2021


Soroosh129
Contributor

@Soroosh129 Soroosh129 commented Nov 12, 2021

This adds back the BenchmarkRunner reactor and makes use of it in the RecMatMul (MatMul.lf) benchmark.

@petervdonovan When I use the benchmark runner script to run this benchmark, I get validation failed messages. Do you know what might be the reason for this?

$ ./run_benchmark.py benchmark=savina_parallelism_recmatmul.yaml target=lf-c
[2021-11-12 10:41:02,614][bash][INFO] - ---- Start execution at time Fri Nov 12 10:40:59 2021
[2021-11-12 10:41:02,614][bash][INFO] - ---- plus 319410005 nanoseconds.
[2021-11-12 10:41:02,614][bash][INFO] - Benchmark: ThreadRingReactorLFCppBenchmark
[2021-11-12 10:41:02,614][bash][INFO] - System information
[2021-11-12 10:41:02,614][bash][INFO] - O/S Name: Linux
[2021-11-12 10:41:02,614][bash][INFO] - Validation failed for (i,j)=(1, 179) with (181506.000000, 183296.000000)
[2021-11-12 10:41:02,614][bash][INFO] - Iteration: 1	 Duration: 341.537 msec
[2021-11-12 10:41:02,614][bash][INFO] - Validation failed for (i,j)=(2, 663) with (1345890.000000, 1357824.000000)
[2021-11-12 10:41:02,614][bash][INFO] - Iteration: 2	 Duration: 267.330 msec
[2021-11-12 10:41:02,614][bash][INFO] - Validation failed for (i,j)=(1, 727) with (729908.000000, 744448.000000)
[2021-11-12 10:41:02,614][bash][INFO] - Iteration: 3	 Duration: 265.782 msec
[2021-11-12 10:41:02,614][bash][INFO] - Validation failed for (i,j)=(2, 277) with (558432.000000, 567296.000000)
[2021-11-12 10:41:02,614][bash][INFO] - Iteration: 4	 Duration: 267.440 msec
[2021-11-12 10:41:02,614][bash][INFO] - Validation failed for (i,j)=(1, 288) with (290304.000000, 294912.000000)
[2021-11-12 10:41:02,614][bash][INFO] - Iteration: 5	 Duration: 270.877 msec
[2021-11-12 10:41:02,614][bash][INFO] - Validation failed for (i,j)=(1, 1017) with (1017000.000000, 1041408.000000)
[2021-11-12 10:41:02,614][bash][INFO] - Iteration: 6	 Duration: 265.654 msec
[2021-11-12 10:41:02,614][bash][INFO] - Validation failed for (i,j)=(1, 256) with (259328.000000, 262144.000000)
[2021-11-12 10:41:02,614][bash][INFO] - Iteration: 7	 Duration: 269.025 msec
[2021-11-12 10:41:02,615][bash][INFO] - Validation failed for (i,j)=(1, 951) with (943392.000000, 973824.000000)
[2021-11-12 10:41:02,615][bash][INFO] - Iteration: 8	 Duration: 269.129 msec
[2021-11-12 10:41:02,615][bash][INFO] - Validation failed for (i,j)=(2, 790) with (1545240.000000, 1617920.000000)
[2021-11-12 10:41:02,615][bash][INFO] - Iteration: 9	 Duration: 270.081 msec
[2021-11-12 10:41:02,615][bash][INFO] - Validation failed for (i,j)=(1, 987) with (631680.000000, 1010688.000000)
[2021-11-12 10:41:02,615][bash][INFO] - Iteration: 10	 Duration: 266.024 msec
[2021-11-12 10:41:02,615][bash][INFO] - Validation failed for (i,j)=(1, 390) with (396240.000000, 399360.000000)
[2021-11-12 10:41:02,615][bash][INFO] - Iteration: 11	 Duration: 269.198 msec
[2021-11-12 10:41:02,615][bash][INFO] - Validation failed for (i,j)=(1, 847) with (864787.000000, 867328.000000)
[2021-11-12 10:41:02,615][bash][INFO] - Iteration: 12	 Duration: 267.104 msec
[2021-11-12 10:41:02,615][bash][INFO] - Execution - Summary:
[2021-11-12 10:41:02,615][bash][INFO] - Best Time:	 265.782 msec
[2021-11-12 10:41:02,615][bash][INFO] - Worst Time:	 341.537 msec
[2021-11-12 10:41:02,615][bash][INFO] - Median Time:	 268.065 msec
[2021-11-12 10:41:02,615][bash][INFO] - ---- Elapsed logical time (in nsec): 0
[2021-11-12 10:41:02,615][bash][INFO] - ---- Elapsed physical time (in nsec): 3,294,657,984

@petervdonovan
Collaborator

petervdonovan commented Nov 12, 2021

The benchmark has a race condition, so we expect some entries to have smaller values than they should. When you run it with a single thread, do the validation failed messages go away?

(The number of rows and columns must also be a power of 2, but I don't think that's the issue here.)
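As a reading aid for the log above: every expected value it prints is consistent with 1024 * i * j (for example, 1 * 179 * 1024 = 183296), and every observed value is smaller, which matches lost updates from a race. The following is a minimal sketch of such a check, assuming a row-major n x n result and an expected entry of n * i * j. The helper names are hypothetical and are not taken from the benchmark source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if n is a nonzero power of two. */
static bool is_power_of_two(size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

/* Assumption inferred from the log: the expected value of C[i][j] is n * i * j. */
static double expected_entry(size_t n, size_t i, size_t j) {
    return (double)n * (double)i * (double)j;
}

/* Counts the entries of a row-major n x n matrix that differ from the
   expected value. A lost update would show up here as an entry that is
   smaller than expected. */
static size_t count_invalid(const double *c, size_t n) {
    assert(c != NULL);
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (c[i * n + j] != expected_entry(n, i, j)) {
                bad++;
            }
        }
    }
    return bad;
}
```

A single-threaded run should report zero invalid entries under these assumptions, which is what the reply below confirms.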

@Soroosh129
Contributor Author

> The benchmark has a race condition, so we expect some entries to have smaller values than they should.

I see. Interesting.

> When you run it with a single thread, do the validation failed messages go away?

Yes they do. Thanks for the clarification.

Co-authored-by: Peter Donovan <33707478+petervdonovan@users.noreply.github.com>
Collaborator

@cmnrd cmnrd left a comment


Thanks! This helps a lot, but for some reason C++ is still a bit faster (C++ is blue in the plot below).
[Plot: matmul]
I don't think we need to worry about it though.

Note that I significantly simplified the C++ benchmark runner. It only has a start and a finished port now.

I think that, in order to merge this, we should add this version (including the runner) alongside the original benchmark. Unfortunately, the Python runner script will not parse the output of this modified benchmark correctly.

@petervdonovan
Collaborator

> I don't think we need to worry about it though.

I do not disagree, but in case anyone is still concerned: On my machine, median execution time decreased from 5.5s to 3.1s for the single-threaded runtime when I used this function

double* transposed_mat_at_d(matrix_t matrix, size_t i, size_t j) {
    return mat_at_d(matrix, j, i);
}

as a replacement for mat_at_d when accessing the B matrix (both writing and reading).

Apparently transposing a matrix in this way is a standard trick for speeding up matrix multiplication (because of cache performance).
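To illustrate why the swapped indices help, here is a minimal sketch that uses a plain triple loop rather than the benchmark's recursive decomposition. The accessor `at` is a hypothetical stand-in for `mat_at_d`, assuming row-major storage; the real `matrix_t` layout is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical row-major accessor, standing in for the benchmark's mat_at_d. */
static double *at(double *m, size_t n, size_t i, size_t j) {
    return &m[i * n + j];
}

/* Same storage with the indices swapped: logical element (i, j) is stored at (j, i). */
static double *transposed_at(double *m, size_t n, size_t i, size_t j) {
    return at(m, n, j, i);
}

/* Baseline: the inner loop reads B down a column, so consecutive reads are n doubles apart. */
static void matmul_naive(double *a, double *b, double *c, size_t n) {
    assert(a != c && b != c); /* output must not alias the inputs */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                sum += *at(a, n, i, k) * *at(b, n, k, j);
            }
            *at(c, n, i, j) = sum;
        }
    }
}

/* bt stores B transposed (written and read through transposed_at), so the
   inner loop reads both operands at consecutive addresses. */
static void matmul_bt(double *a, double *bt, double *c, size_t n) {
    assert(a != c && bt != c); /* output must not alias the inputs */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                sum += *at(a, n, i, k) * *transposed_at(bt, n, k, j);
            }
            *at(c, n, i, j) = sum;
        }
    }
}
```

Both functions compute the same product as long as B is also filled through `transposed_at`; only the memory access pattern of the inner loop changes.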

I know that our results are invalid if there are algorithmic differences between implementations, but this seems fairly low-level. If I understand correctly, it is little more than a change in how we interact with the hardware prefetcher, which seems mild enough compared to what a JIT compiler can do. Then again, it appears to me that the C version with the benchmark runner but without transposing might be the most similar to the C++ version, judging from the first and subsequent execution times on my machine.

One might also just dismiss this as silliness: It only highlights the fact that this benchmark tells us more about the content of the inner loop than it tells us about the runtime.

@cmnrd
Collaborator

cmnrd commented Nov 15, 2021

Interesting! I don't think this explains the small gap between C and C++ though (unless the C++ compiler is smart enough to make this optimization automatically). I don't think it would be unfair to optimize the matrix access, but we should do it similarly in both the C++ and the C version. Then, however, I would expect to see the same gap again.

This is based on a suggestion from Christian, since the Python runner script does not parse the output of the modified benchmark correctly.

The original benchmark has a couple of minor corrections that also appear in the one that uses the benchmark runner.
Collaborator

@cmnrd cmnrd left a comment


Looks good! Thank you both

@cmnrd cmnrd merged commit f230123 into master Nov 22, 2021
@cmnrd cmnrd deleted the c-matmul-benchmarkRunner branch November 22, 2021 08:11
3 participants