Add a coins benchmark from nofib #132

ckoparkar · 2021-02-25T13:26:06Z

GHC's version: https://github.com/ghc/nofib/blob/f481777acf608c132db47cb8badb618ef39a0d6f/parallel/coins/coins.hs

ckoparkar · 2021-02-25T13:49:28Z

I have observed an oddity with this benchmark: the sequential version is much slower than the parallel version on a single thread. Here's an example run:

numactl --physcpubind=17 ./bin/coins '@MLton' number-processors 1 -- -N 999
Sequential: 2.1437s.
Parallel: 0.8474s.

For this run, parallel has -60.47% overhead relative to sequential! Granted, this is just a single run, but I've observed similar results (roughly -40% overhead) with a different methodology (median of 9) as well.
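For reference, the overhead figure can be reproduced from the two timings above. This is a minimal sketch; the name overheadPct is hypothetical and not part of the benchmark.

```sml
(* Relative overhead of the parallel run versus the sequential baseline,
   in percent. Negative values mean the parallel run was faster.
   overheadPct is a hypothetical helper, not part of the benchmark. *)
fun overheadPct (seqTime : real, parTime : real) : real =
  (parTime - seqTime) / seqTime * 100.0

(* Timings from the single run quoted above. *)
val pct = overheadPct (2.1437, 0.8474)

(* Print with two decimals; SML renders a negative sign as a leading ~. *)
val () = print (Real.fmt (StringCvt.FIX (SOME 2)) pct ^ "%\n")
```

With these inputs the result is about -60.47, matching the number quoted above.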

As far as I can tell I've ported the benchmark to SML faithfully, but I might have overlooked something. I suspect that there's something going on with the sequential version, rather than the parallel one. I'm struggling with mlprof a bit, and I would appreciate any help in debugging it.

ckoparkar · 2021-02-25T13:59:28Z

I forgot to mention that the Haskell file has several variants of payA_par. I've ported V3, but mine returns an AList Int rather than an AList [Int].

shwestrick · 2021-02-25T15:32:14Z

Cool! Just to satisfy my own curiosity: could you tell me a bit about this benchmark? Is it figuring out how to make change for a given amount of money? It seems like the output is morally just a sequence of integers (I'm guessing the purpose of AList is just for fast append) -- what does each integer mean?

As for the performance, there's a number of things to consider. Typically, the first run of any microbenchmark is much slower, due to warmup issues (including e.g. OS scheduling but also memory allocation: initially, the program begins with a small heap and allocations are more expensive until the heap reaches a good size and memory can start to be reused). So, for a more thorough comparison we should do lots of repetitions. Here's 10 each back-to-back:

$ bin/coins @mpl procs 1 -- -N 999 -repeat 10
Sequential: 329565. Finished in: 2.7010s.
Sequential: 329565. Finished in: 1.9687s.
Sequential: 329565. Finished in: 1.9678s.
Sequential: 329565. Finished in: 1.9668s.
Sequential: 329565. Finished in: 1.9662s.
Sequential: 329565. Finished in: 1.9698s.
Sequential: 329565. Finished in: 1.9717s.
Sequential: 329565. Finished in: 1.9693s.
Sequential: 329565. Finished in: 1.9674s.
Sequential: 329565. Finished in: 1.9670s.
Parallel: 329565. Finished in: 1.5374s.
Parallel: 329565. Finished in: 3.8510s.
Parallel: 329565. Finished in: 1.7965s.
Parallel: 329565. Finished in: 1.8258s.
Parallel: 329565. Finished in: 1.8121s.
Parallel: 329565. Finished in: 1.8052s.
Parallel: 329565. Finished in: 1.8117s.
Parallel: 329565. Finished in: 1.7989s.
Parallel: 329565. Finished in: 1.8224s.
Parallel: 329565. Finished in: 1.7937s.

Average sequential: 2.042s
Average parallel: 1.985s

Now we see that there's only a ~3% difference in average performance. :)
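The averages above can be recomputed from the ten listed timings per variant. A small sketch, assuming a plain arithmetic mean over all ten runs (warmup run and outlier included); the names mean, seqTimes and parTimes are made up for illustration.

```sml
(* Arithmetic mean of a non-empty list of timings (seconds). *)
fun mean (xs : real list) : real =
  List.foldl (fn (x : real, acc) => x + acc) 0.0 xs
  / Real.fromInt (List.length xs)

(* The ten sequential and ten parallel timings listed above. *)
val seqTimes = [2.7010, 1.9687, 1.9678, 1.9668, 1.9662,
                1.9698, 1.9717, 1.9693, 1.9674, 1.9670]
val parTimes = [1.5374, 3.8510, 1.7965, 1.8258, 1.8121,
                1.8052, 1.8117, 1.7989, 1.8224, 1.7937]

val seqAvg = mean seqTimes
val parAvg = mean parTimes

(* Relative difference between the two averages, in percent. *)
val diffPct = (seqAvg - parAvg) / seqAvg * 100.0
```

Here diffPct comes out slightly under 3, consistent with the "~3%" figure. Note that both averages are pulled upward by one slow run each (2.7010s and 3.8510s).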

But there's more to this story, because memory management in MPL depends heavily on parallelism, and actually benefits from it! So that could be another reason why the parallel performance appears to be typically better in this benchmark.

shwestrick · 2021-02-25T15:50:45Z

I should mention also, there's a current issue with MPL where top-level (outside any calls to par) sequential code is not handled correctly (#115). The workaround at the moment is to put sequential code inside a single par, e.g.:

val (result_seq, _) = ForkJoin.par (fn _ => payA_seq amt coins_input, fn _ => "workaround")

When I make this change (only to the sequential version), the average performance of the parallel and sequential code becomes nearly identical on one processor. It also eliminates that nasty outlier in the parallel repetitions we saw above (the single 3.8s run). I believe this outlier was caused by the GC trying to make up for lost space on the earlier sequential runs.
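The workaround can be wrapped in a small helper so it is easy to remove once #115 is fixed. This is only a sketch: inSinglePar is a hypothetical name, and the ForkJoin structure below is a sequential stand-in that mimics how ForkJoin.par is used in this thread so the snippet compiles outside MPL. Delete the stand-in when building with MPL.

```sml
(* Stand-in for illustration only: mimics the way ForkJoin.par is called
   in this thread, but runs both thunks sequentially.
   Remove this structure when compiling with MPL. *)
structure ForkJoin =
struct
  fun par (f : unit -> 'a, g : unit -> 'b) : 'a * 'b = (f (), g ())
end

(* Hypothetical helper: run a sequential computation inside a single par
   so that it does not execute at top level (the #115 workaround). *)
fun inSinglePar (f : unit -> 'a) : 'a =
  let
    val (result, _) = ForkJoin.par (f, fn () => ())
  in
    result
  end

(* Intended use in the benchmark (payA_seq, amt and coins_input come
   from the PR and are not defined in this sketch):
     val result_seq = inSinglePar (fn () => payA_seq amt coins_input) *)
val demo = inSinglePar (fn () => 1 + 2)
```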

ckoparkar · 2021-02-25T17:15:42Z

> Cool! Just to satisfy my own curiosity: could you tell me a bit about this benchmark? Is it figuring out how to make change for a given amount of money?

Indeed.

> It seems like the output is morally just a sequence of integers (I'm guessing the purpose of AList is just for fast append) -- what does each integer mean?

The AList is indeed for fast appends. The integer is meaningless at the moment and can be dropped safely. The original program keeps track of which coins were used each time. For example, if the amount of money is 5, then the AList looks like this: Append (ASing [5]) (ASing [1,1,1,1,1]). Perhaps I should update the SML program to do this as well.
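To make the shape of that value concrete, here is a minimal SML sketch of an append list. The constructors ASing and Append and the function name lenA appear in this thread; the ANil constructor, the toList helper and the exact definitions are assumptions for illustration, not the code in the PR.

```sml
(* Append list: Append is O(1) because it only allocates one node. *)
datatype 'a alist =
    ANil
  | ASing of 'a
  | Append of 'a alist * 'a alist

(* Number of elements stored in the append list. *)
fun lenA ANil = 0
  | lenA (ASing _) = 1
  | lenA (Append (l, r)) = lenA l + lenA r

(* Flatten to an ordinary list, left to right, using an accumulator. *)
fun toList xs =
  let
    fun go (ANil, acc) = acc
      | go (ASing x, acc) = x :: acc
      | go (Append (l, r), acc) = go (l, go (r, acc))
  in
    go (xs, [])
  end

(* The example from the comment above: two ways to make change for 5. *)
val changeFor5 : int list alist =
  Append (ASing [5], ASing [1, 1, 1, 1, 1])
```

Under this definition, lenA changeFor5 counts the number of ways to make change (2 here), which would explain why the benchmark prints a single integer.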

> As for the performance, there's a number of things to consider. Typically, the first run of any microbenchmark is much slower, due to warmup issues (including e.g. OS scheduling but also memory allocation: initially, the program begins with a small heap and allocations are more expensive until the heap reaches a good size and memory can start to be reused). So, for a more thorough comparison we should do lots of repetitions. Here's 10 each back-to-back:

I fully agree! I did not include this code in the pull-request, but I was actually looking at a median of 9 runs. If I run it more times with a for loop:

val _ = Util.for
            (0,10)
            (fn _ => let (* time one run of the parallel version *)
                         val t0 = Time.now ()
                         val result_par = payA_par 3 size coins_input
                         val t1 = Time.now ()
                     in print ("Parallel: " ^ Int.toString (lenA result_par) ^ ". Finished in: " ^ Time.fmt 4 (Time.- (t1, t0)) ^ "s.\n")
                     end)

this is what I see:

Sequential: 329565. Finished in: 2.1605s.
Sequential: 329565. Finished in: 2.1827s.
Sequential: 329565. Finished in: 2.1845s.
Sequential: 329565. Finished in: 2.1822s.
Sequential: 329565. Finished in: 2.1836s.
Sequential: 329565. Finished in: 2.1830s.
Sequential: 329565. Finished in: 2.2066s.
Sequential: 329565. Finished in: 2.1847s.
Sequential: 329565. Finished in: 2.1840s.
Sequential: 329565. Finished in: 2.1824s.
Parallel: 329565. Finished in: 1.5463s.
Parallel: 329565. Finished in: 1.5505s.
Parallel: 329565. Finished in: 1.6255s.
Parallel: 329565. Finished in: 1.6270s.
Parallel: 329565. Finished in: 1.6479s.
Parallel: 329565. Finished in: 1.6367s.
Parallel: 329565. Finished in: 1.6354s.
Parallel: 329565. Finished in: 1.6290s.
Parallel: 329565. Finished in: 1.6241s.
Parallel: 329565. Finished in: 1.6234s.

It's definitely better than before, but still has a high overhead. What did you use for the -repeat?

> But there's more to this story, because memory management in MPL depends heavily on parallelism, and actually benefits from it! So that could be another reason why the parallel performance appears to be typically better in this benchmark.

Oh, I see! Actually, MPL performs really well on all other benchmark programs, and this single-thread overhead for coins was an odd one. That's why I thought I should ask here to see if I'm doing something wrong.

> I should mention also, there's a current issue with MPL where top-level (outside any calls to par) sequential code is not handled correctly (#115). The workaround at the moment is to put sequential code inside a single par, e.g.:

Aha, looks like #115 is the culprit! With your suggested change, I can reproduce the almost 0% overhead. This coins benchmark performs a lot of cons/head/tail operations. If every cons has to allocate instead of being able to rely on the GC, it'll probably slow things down a lot.

shwestrick · 2021-02-26T15:25:36Z

> The original program keeps track of which coins were used each time. For example, if the amount of money is 5, then the AList looks like this: Append (ASing [5]) (ASing [1,1,1,1,1]). Perhaps I should update the SML program to do this as well.

Oh I see! I'd love to see the updated version, if you have some free time to make the change.

> Aha, looks like #115 is the culprit! With your suggested change, I can reproduce the almost 0% overhead. This coins benchmark performs a lot of cons/head/tail operations. If every cons has to allocate instead of being able to rely on the GC, it'll probably slow things down a lot.

Excellent, I'm glad you were able to reproduce. We're in the process of fixing some of these little GC issues, so hopefully soon the workaround won't be necessary anymore.

shwestrick · 2021-02-26T15:29:24Z

> What did you use for the -repeat?

Oh and yes, I did something very similar for -repeat: just a for-loop to run the code back-to-back. Although I guess I should also have been careful to ignore the warmup outliers 👍🏻
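One way to ignore warmup outliers is to drop the first few runs and report the median instead of the mean. The sketch below assumes that approach; dropWarmup, median and the insertion sort are hypothetical helpers, not what the benchmark harness actually does.

```sml
(* Insertion sort on reals; fine for a handful of timings. *)
fun insert (x : real, []) = [x]
  | insert (x, y :: ys) =
      if x <= y then x :: y :: ys else y :: insert (x, ys)

fun sort (xs : real list) : real list = List.foldl insert [] xs

(* Median of a non-empty list; averages the two middle values
   when the length is even. *)
fun median [] = raise Empty
  | median xs =
      let
        val sorted = sort xs
        val n = List.length sorted
        val mid = n div 2
      in
        if n mod 2 = 1 then List.nth (sorted, mid)
        else (List.nth (sorted, mid - 1) + List.nth (sorted, mid)) / 2.0
      end

(* Discard the first k runs (or everything, if there are fewer than k). *)
fun dropWarmup (k : int, xs : real list) : real list =
  List.drop (xs, Int.min (k, List.length xs))

(* The ten parallel timings from the earlier -repeat 10 run. *)
val parTimes = [1.5374, 3.8510, 1.7965, 1.8258, 1.8121,
                1.8052, 1.8117, 1.7989, 1.8224, 1.7937]

val parMedian = median (dropWarmup (1, parTimes))
```

The median is barely affected by the single 3.8510s run, whereas the mean over all ten runs is shifted noticeably by it.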

shwestrick · 2021-02-26T15:49:02Z

I made some changes in af049b3. Now by default, it does a single run of the parallel version. You can use --sequential to instead run sequential (with the workaround), and -repeat ... to specify a number of repetitions.

$ make coins
$ bin/coins -N 999 --sequential -repeat 10
$ bin/coins @mpl procs 4 -- -N 999 -repeat 10

If you get a chance and would like to update the output of the benchmark to be more interesting like you mentioned above, that would be cool!

ckoparkar · 2021-02-26T23:35:07Z

Super! Yep, I'll update it to track the coins which get used, and then print the full AList with a -print flag.

Add a coins benchmark from nofib

87022fa

GHC's version: https://github.com/ghc/nofib/blob/f481777acf608c132db47cb8badb618ef39a0d6f/parallel/coins/coins.hs

coins: print the answer

562f072

shwestrick merged commit f736ebf into MPLLang:master Feb 26, 2021
