Add a coins benchmark from nofib #132
I have observed an oddity with this benchmark, namely that the sequential version is way slower than the parallel one on a single thread. Here's an example run:
For this run, parallel has As far as I can tell I've ported the benchmark to SML faithfully, but I might have overlooked something. I suspect that there's something going on with the sequential version, rather than the parallel one. I'm struggling with |
I forgot to mention that that Haskell file has several variants of |
Cool! Just to satisfy my own curiosity: could you tell me a bit about this benchmark? Is it figuring out how to make change for a given amount of money? It seems like the output is morally just a sequence of integers (I'm guessing the purpose of
As for the performance, there are a number of things to consider. Typically, the first run of any microbenchmark is much slower due to warmup issues (including OS scheduling but also memory allocation: initially, the program begins with a small heap, and allocations are more expensive until the heap reaches a good size and memory can start to be reused). So, for a more thorough comparison, we should do lots of repetitions. Here's 10 each back-to-back:
Now we see that there's only a ~3% difference in average performance. :) But there's more to this story, because memory management in MPL depends heavily on parallelism, and actually benefits from it! So that could be another reason why the parallel performance appears to be typically better in this benchmark.
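To make the "lots of repetitions" idea concrete, here is a minimal sketch of a repetition harness that uses only the Standard ML Basis Library. The names `timeReps`, `mean`, `median`, and `work` are hypothetical and are not part of this PR or of MPL; `work` is a stand-in for the benchmark body. Since the first run tends to be slower, a real comparison might also discard the first sample.

```sml
(* Run f () `reps` times; return the wall-clock seconds of each run. *)
fun timeReps (reps : int) (f : unit -> 'a) : real list =
  List.tabulate (reps, fn _ =>
    let
      val t0 = Time.now ()
      val _ = f ()
      val t1 = Time.now ()
    in
      Time.toReal (Time.- (t1, t0))
    end)

(* Arithmetic mean of the samples. *)
fun mean (xs : real list) : real =
  foldl op+ 0.0 xs / Real.fromInt (length xs)

(* Insertion sort; fine for a handful of timing samples. *)
fun insert (x : real, []) = [x]
  | insert (x, y :: ys) =
      if x <= y then x :: y :: ys else y :: insert (x, ys)

fun sort (xs : real list) : real list = foldl insert [] xs

(* Median is less sensitive to one slow outlier than the mean.
   Assumes a non-empty list. *)
fun median (xs : real list) : real =
  let
    val sorted = sort xs
    val n = length sorted
  in
    if n mod 2 = 1 then List.nth (sorted, n div 2)
    else (List.nth (sorted, n div 2 - 1) + List.nth (sorted, n div 2)) / 2.0
  end

(* Hypothetical stand-in workload, only here to exercise the harness. *)
fun work () = foldl op+ 0 (List.tabulate (100000, fn i => i mod 7))

val times = timeReps 10 work
val fmt = Real.fmt (StringCvt.FIX (SOME 4))
val () = print ("mean: " ^ fmt (mean times) ^ "s, median: "
                ^ fmt (median times) ^ "s\n")
```

Reporting both numbers helps separate a genuine slowdown from a single outlier run.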
I should mention also, there's a current issue with MPL where top-level (outside any calls to
When I make this change (only to the sequential version), the average performance of the parallel and sequential code becomes nearly identical on one processor. It also eliminates that nasty outlier in the parallel repetitions we saw above (the single 3.8s run). I believe this outlier was caused by the GC trying to make up for lost space on the earlier sequential runs.
Indeed.
The
I fully agree! I did not include this code in the pull request, but I was actually looking at a median of 9 runs. If I run it more times with a for loop:
    val _ = Util.for
      (0, 10)
      (fn _ =>
        let
          val t0 = Time.now ()
          val result_par = payA_par 3 size coins_input
          val t1 = Time.now ()
        in
          print ("Parallel: " ^ Int.toString (lenA result_par) ^ ". Finished in: " ^ Time.fmt 4 (Time.- (t1, t0)) ^ "s.\n")
        end)
this is what I see:
It's definitely better than before, but still has a high overhead. What did you use for the
Oh, I see! Actually, MPL performs really well on all other benchmark programs, and this single-thread overhead for
Aha, looks like #115 is the culprit! With your suggested change, I can reproduce the almost |
Oh I see! I'd love to see the updated version, if you have some free time to make the change.
Excellent, I'm glad you were able to reproduce. We're in the process of fixing some of these little GC issues, so hopefully soon the workaround won't be necessary anymore.
Oh and yes, I did something very similar for |
I made some changes in af049b3. Now by default, it does a single run of the parallel version. You can use
If you get a chance and would like to update the output of the benchmark to be more interesting like you mentioned above, that would be cool!
Super! Yep, I'll update it to track the coins which get used, and then print the full |
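For readers unfamiliar with the problem, here is a rough sketch of what "tracking the coins which get used" could look like. It assumes the benchmark is a bounded change-making problem (pay an amount from a limited supply of each denomination), as guessed earlier in this thread. The names `supply` and `pay` are hypothetical; this is a plain sequential illustration, not the `payA_par` code from the PR.

```sml
(* Each coin kind is (denomination, how many are available).
   Denominations are assumed to be positive. *)
type supply = (int * int) list

(* Return every way to pay `amount`; each way lists the coins used. *)
fun pay (amount : int) (coins : supply) : int list list =
  if amount = 0 then [[]]
  else
    case coins of
      [] => []
    | (c, q) :: rest =>
        let
          (* Option 1: spend one coin of denomination c, if it fits
             and one is still available. *)
          val withC =
            if c <= amount andalso q > 0
            then map (fn way => c :: way)
                     (pay (amount - c) ((c, q - 1) :: rest))
            else []
          (* Option 2: stop using denomination c altogether. *)
          val withoutC = pay amount rest
        in
          withC @ withoutC
        end

(* Two 5s and ten 1s can pay 10 in three ways:
   5+5, 5 plus five 1s, or ten 1s. *)
val ways = pay 10 [(5, 2), (1, 10)]
val () = print (Int.toString (length ways) ^ " ways\n")
```

Returning the full list of ways rather than only a count gives output that can be checked by hand for small inputs.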
GHC's version: https://github.com/ghc/nofib/blob/f481777acf608c132db47cb8badb618ef39a0d6f/parallel/coins/coins.hs