
Add a coins benchmark from nofib #132

Merged
merged 2 commits into MPLLang:master on Feb 26, 2021

Conversation

@ckoparkar (Contributor Author)

I have observed an oddity with this benchmark: the sequential version is way slower than the parallel version on a single thread. Here's an example run:

numactl --physcpubind=17 ./bin/coins '@MLton' number-processors 1 -- -N 999
Sequential: 2.1437s.
Parallel: 0.8474s.

For this run, parallel has -60.47% overhead! Granted, this is just a single run, but I've observed similar results (roughly -40% overhead) with a different methodology (median of 9) as well.

As far as I can tell I've ported the benchmark to SML faithfully, but I might have overlooked something. I suspect that there's something going on with the sequential version, rather than the parallel one. I'm struggling with mlprof a bit, and I would appreciate any help in debugging it.

@ckoparkar (Contributor Author)

I forgot to mention that the Haskell file has several variants of payA_par. I've ported V3, but mine returns an AList Int rather than an AList [Int].
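For readers unfamiliar with the nofib benchmark, the core recursion is the classic change-counting problem with bounded coin quantities. Here is a minimal Python sketch of the idea (the coin values and quantities in the example are illustrative, not necessarily the benchmark's actual inputs):

```python
# Count the ways to pay `amt` from a list of (value, quantity) coin pairs,
# processed left to right -- an illustrative sketch of the nofib "coins"
# recursion, not the benchmark's actual SML or Haskell code.
def pay(amt, coins):
    if amt == 0:
        return 1          # one way: take no more coins
    if not coins:
        return 0          # money left over but no coins to use
    (value, quantity), rest = coins[0], coins[1:]
    ways = 0
    if value <= amt and quantity > 0:
        # branch 1: spend one coin of this value
        ways += pay(amt - value, [(value, quantity - 1)] + rest)
    # branch 2: skip this coin value entirely
    ways += pay(amt, rest)
    return ways

print(pay(5, [(5, 3), (1, 6)]))  # 2 (one 5-coin, or five 1-coins)
```

The parallel versions in the benchmark fork the two branches of this recursion up to a depth cutoff, which is why the result is naturally an append-structure rather than a flat list.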

@shwestrick (Collaborator)

Cool! Just to satisfy my own curiosity: could you tell me a bit about this benchmark? Is it figuring out how to make change for a given amount of money? It seems like the output is morally just a sequence of integers (I'm guessing the purpose of AList is just for fast append) -- what does each integer mean?

As for the performance, there's a number of things to consider. Typically, the first run of any microbenchmark is much slower, due to warmup issues (including e.g. OS scheduling but also memory allocation: initially, the program begins with a small heap and allocations are more expensive until the heap reaches a good size and memory can start to be reused). So, for a more thorough comparison we should do lots of repetitions. Here's 10 each back-to-back:

$ bin/coins @mpl procs 1 -- -N 999 -repeat 10
Sequential: 329565. Finished in: 2.7010s.
Sequential: 329565. Finished in: 1.9687s.
Sequential: 329565. Finished in: 1.9678s.
Sequential: 329565. Finished in: 1.9668s.
Sequential: 329565. Finished in: 1.9662s.
Sequential: 329565. Finished in: 1.9698s.
Sequential: 329565. Finished in: 1.9717s.
Sequential: 329565. Finished in: 1.9693s.
Sequential: 329565. Finished in: 1.9674s.
Sequential: 329565. Finished in: 1.9670s.
Parallel: 329565. Finished in: 1.5374s.
Parallel: 329565. Finished in: 3.8510s.
Parallel: 329565. Finished in: 1.7965s.
Parallel: 329565. Finished in: 1.8258s.
Parallel: 329565. Finished in: 1.8121s.
Parallel: 329565. Finished in: 1.8052s.
Parallel: 329565. Finished in: 1.8117s.
Parallel: 329565. Finished in: 1.7989s.
Parallel: 329565. Finished in: 1.8224s.
Parallel: 329565. Finished in: 1.7937s.

Average sequential: 2.042s
Average parallel: 1.985s

Now we see that there's only a ~3% difference in average performance. :)

But there's more to this story, because memory management in MPL depends heavily on parallelism, and actually benefits from it! So that could be another reason why the parallel performance appears to be typically better in this benchmark.

@shwestrick (Collaborator)

I should mention also, there's a current issue with MPL where top-level (outside any calls to par) sequential code is not handled correctly (#115). The workaround at the moment is to put sequential code inside a single par, e.g.:

val (result_seq, _) = ForkJoin.par (fn _ => payA_seq amt coins_input, fn _ => "workaround")

When I make this change (only to the sequential version), the average performance of the parallel and sequential code becomes nearly identical on one processor. It also eliminates that nasty outlier in the parallel repetitions we saw above (the single 3.8s run). I believe this outlier was caused by the GC trying to make up for lost space on the earlier sequential runs.

@ckoparkar (Contributor Author)

Cool! Just to satisfy my own curiosity: could you tell me a bit about this benchmark? Is it figuring out how to make change for a given amount of money?

Indeed.

It seems like the output is morally just a sequence of integers (I'm guessing the purpose of AList is just for fast append) -- what does each integer mean?

The AList is indeed for fast appends. The integer is pointless at the moment and can be dropped safely. The original program keeps track of which coins got used each time. For example, if the amount of money is 5, then the AList looks like this: Append (ASing [5]) (ASing [1,1,1,1,1]). Perhaps I should update the SML program to do this as well.
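The append-list structure being described can be sketched in a few lines of Python. This is a hypothetical illustration of the idea (the names mirror the constructors mentioned above, but the actual SML datatype may differ):

```python
# An append list: O(1) append by building a binary tree of segments,
# with the length cached at each node so `len` is also O(1).
# Illustrative sketch only, not the benchmark's actual SML datatype.
class ANil:
    def __len__(self):
        return 0
    def to_list(self):
        return []

class ASing:
    def __init__(self, x):
        self.x = x
    def __len__(self):
        return 1
    def to_list(self):
        return [self.x]

class Append:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.n = len(left) + len(right)   # cached length
    def __len__(self):
        return self.n
    def to_list(self):
        return self.left.to_list() + self.right.to_list()

# The `Append (ASing [5]) (ASing [1,1,1,1,1])` example from above:
t = Append(ASing([5]), ASing([1, 1, 1, 1, 1]))
print(len(t), t.to_list())  # 2 [[5], [1, 1, 1, 1, 1]]
```

The point of the structure is that the two halves of a divide-and-conquer recursion can be combined without copying either side; flattening to a real list happens once, at the end.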

As for the performance, there's a number of things to consider. Typically, the first run of any microbenchmark is much slower, due to warmup issues (including e.g. OS scheduling but also memory allocation: initially, the program begins with a small heap and allocations are more expensive until the heap reaches a good size and memory can start to be reused). So, for a more thorough comparison we should do lots of repetitions. Here's 10 each back-to-back:

I fully agree! I did not include this code in the pull request, but I was actually looking at a median of 9 runs. If I run it more times with a for loop:

val _ = Util.for (0, 10)
          (fn _ =>
             let
               val t0 = Time.now ()
               val result = payA_par 3 size coins_input
               val t1 = Time.now ()
             in
               print ("Parallel: " ^ Int.toString (lenA result) ^
                      ". Finished in: " ^ Time.fmt 4 (Time.- (t1, t0)) ^ "s.\n")
             end)

this is what I see:

Sequential: 329565. Finished in: 2.1605s.
Sequential: 329565. Finished in: 2.1827s.
Sequential: 329565. Finished in: 2.1845s.
Sequential: 329565. Finished in: 2.1822s.
Sequential: 329565. Finished in: 2.1836s.
Sequential: 329565. Finished in: 2.1830s.
Sequential: 329565. Finished in: 2.2066s.
Sequential: 329565. Finished in: 2.1847s.
Sequential: 329565. Finished in: 2.1840s.
Sequential: 329565. Finished in: 2.1824s.
Parallel: 329565. Finished in: 1.5463s.
Parallel: 329565. Finished in: 1.5505s.
Parallel: 329565. Finished in: 1.6255s.
Parallel: 329565. Finished in: 1.6270s.
Parallel: 329565. Finished in: 1.6479s.
Parallel: 329565. Finished in: 1.6367s.
Parallel: 329565. Finished in: 1.6354s.
Parallel: 329565. Finished in: 1.6290s.
Parallel: 329565. Finished in: 1.6241s.
Parallel: 329565. Finished in: 1.6234s.

It's definitely better than before, but still has a high overhead. What did you use for -repeat?

But there's more to this story, because memory management in MPL depends heavily on parallelism, and actually benefits from it! So that could be another reason why the parallel performance appears to be typically better in this benchmark.

Oh, I see! Actually, MPL performs really well on all other benchmark programs, and this single-thread overhead for coins was an odd one. That's why I thought I should ask here to see if I'm doing something wrong.

I should mention also, there's a current issue with MPL where top-level (outside any calls to par) sequential code is not handled correctly (#115). The workaround at the moment is to put sequential code inside a single par, e.g.:

Aha, looks like #115 is the culprit! With your suggested change, I can reproduce the almost 0% overhead. This coins benchmark performs a lot of cons/head/tail operations. If every cons has to allocate fresh memory instead of being able to rely on the GC reusing space, it'll probably slow things down a lot.

@shwestrick (Collaborator)

The original program keeps track of which coins got used each time. For example, if the amount of money is 5, then the AList looks like this: Append (ASing [5]) (ASing [1,1,1,1,1]). Perhaps I should update the SML program to do this as well.

Oh I see! I'd love to see the updated version, if you have some free time to make the change.

Aha, looks like #115 is the culprit! With your suggested change, I can reproduce the almost 0% overhead. This coins benchmark performs a lot of cons/head/tail operations. If every cons has to allocate fresh memory instead of being able to rely on the GC reusing space, it'll probably slow things down a lot.

Excellent, I'm glad you were able to reproduce. We're in the process of fixing some of these little GC issues, so hopefully soon the workaround won't be necessary anymore.

@shwestrick (Collaborator)

What did you use for the -repeat?

Oh and yes, I did something very similar for -repeat: just a for-loop to run the code back-to-back. Although I guess I should also have been careful to ignore the warmup outliers 👍🏻
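The discard-the-warmup-then-aggregate approach mentioned here is easy to sketch generically. The helper below is my own illustration (the benchmark's `-repeat` flag is its own SML implementation, not this):

```python
import statistics
import time

def bench(f, repeat=10, warmup=1):
    """Time `f` warmup+repeat times, drop the first `warmup` runs
    (slow due to heap growth, OS scheduling, etc.), and report the
    median of the rest, in seconds."""
    times = []
    for _ in range(warmup + repeat):
        t0 = time.perf_counter()
        f()
        times.append(time.perf_counter() - t0)
    return statistics.median(times[warmup:])
```

Using the median rather than the mean also keeps a single outlier (like the 3.8s parallel run above) from skewing the comparison.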

@shwestrick shwestrick merged commit f736ebf into MPLLang:master Feb 26, 2021
@shwestrick (Collaborator) commented Feb 26, 2021

I made some changes in af049b3. Now by default, it does a single run of the parallel version. You can use --sequential to instead run sequential (with the workaround), and -repeat ... to specify a number of repetitions.

$ make coins
$ bin/coins -N 999 --sequential -repeat 10
$ bin/coins @mpl procs 4 -- -N 999 -repeat 10

If you get a chance and would like to update the output of the benchmark to be more interesting like you mentioned above, that would be cool!

@ckoparkar (Contributor Author)

Super! Yep, I'll update it to track the coins which get used, and then print the full AList with a -print flag.
