timing / benchmarking kernels from RustaCUDA code #29

Closed
zeroexcuses opened this issue Feb 17, 2019 · 7 comments

Comments

@zeroexcuses

Can we please get examples for benchmarking / timing kernels via RustaCUDA?

I'm not familiar with how to benchmark CUDA code and would love to learn from examples.

@bheisler
Owner

bheisler commented Mar 5, 2019

Hey, thanks for your patience - I've been on vacation for a week and a half or so.

That's a fair point; it's not obvious how to measure execution time for CUDA kernels.

The quick-and-dirty version is to enqueue an event, launch the kernel, and enqueue another event, then sync. Afterwards the events can provide the time when they were processed so you can subtract them to get the time for the kernel. Or you could, if RustaCUDA supported events.

It will take some work on both RustaCUDA and Criterion.rs but I think I can improve on this.
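
For reference, a minimal sketch of that quick-and-dirty pattern, assuming the Event API that later landed via Add Cuda Event #37 (Event::new, record, synchronize, and elapsed_time_f32; check the crate docs for the exact names and signatures). The kernel launch itself is a placeholder:

```rust
use rustacuda::event::{Event, EventFlags};
use rustacuda::prelude::*;
use std::error::Error;

// Times whatever work is enqueued on `stream` between the two events.
fn time_on_stream(stream: &Stream) -> Result<f32, Box<dyn Error>> {
    let start = Event::new(EventFlags::DEFAULT)?;
    let stop = Event::new(EventFlags::DEFAULT)?;

    start.record(stream)?;  // enqueue the "start" event
    // launch!(module.my_kernel<<<grid, block, 0, stream>>>(/* args */))?;  // hypothetical kernel
    stop.record(stream)?;   // enqueue the "stop" event

    stop.synchronize()?;    // block until the stop event has been processed
    let millis = stop.elapsed_time_f32(&start)?;  // GPU-side elapsed time in milliseconds
    Ok(millis)
}
```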

@zeroexcuses
Author

Hi @bheisler, thanks for the suggestion. I personally don't need this anymore, but it may be useful to others.

I'm using https://doc.rust-lang.org/std/time/struct.Instant.html to measure time on the Rust side.

For my particular case, I care about kernel execution time, not data load time (which happens rarely), so it goes something like:

move data async to gpu
sync
start = Instant::now();
launch kernel
sync
end = Instant::now();
print end-start

There are probably all kinds of problems with it, but for kernels whose run time is measured in seconds, it's been fine for me so far.
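
For anyone who wants the above in actual code, a sketch of the same pattern using RustaCUDA's usual setup (as in the crate README) and std::time::Instant; the kernel launch and real data are placeholders:

```rust
use rustacuda::prelude::*;
use std::error::Error;
use std::time::Instant;

fn main() -> Result<(), Box<dyn Error>> {
    rustacuda::init(CudaFlags::empty())?;
    let device = Device::get_device(0)?;
    let _ctx = Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device)?;
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    // Move data to the GPU (a synchronous copy here, for simplicity).
    let _input = DeviceBuffer::from_slice(&[1.0f32; 1024])?;
    stream.synchronize()?;

    let start = Instant::now();
    // launch!(module.my_kernel<<<grid, block, 0, stream>>>(/* args */))?;  // hypothetical kernel
    stream.synchronize()?;  // wait for the kernel to finish before stopping the clock
    println!("kernel took {:?}", start.elapsed());

    Ok(())
}
```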

@bheisler
Owner

bheisler commented Mar 9, 2019

Yeah, I'll reopen this as a reminder to myself to document this later.

@bheisler bheisler reopened this Mar 9, 2019
@LutzCle
Contributor

LutzCle commented May 10, 2019

There are four approaches to timing that I'm aware of:

  1. Host timing, e.g. with Instant::now(), like you suggested. That gives you coarse-grained times, as you don't have any guarantees about when synchronize() returns, and CUDA calls usually involve (slow) processor interrupts.
  2. CUDA events. This is what Nvidia recommends on their developer blog (see the post here). This is better than host timing, because it potentially allows the driver to do more exact timings. In my experience, timing results have very little variance, compared to host timing. You will be able to use this in RustaCUDA when Add Cuda Event #37 is merged.
  3. clock64() within your CUDA kernel. See the documentation here. This is the most exact timer-based method you can use, as you're reading a register twice (start and stop) on the GPU. This method has a few easy-to-miss pitfalls, though, which usually come up when you want to measure at a very fine level of detail (e.g., individual operations). I won't go into that here; a small sketch of the basic pattern follows after this comment.
  4. Profilers like nvprof and nvvp. These are exact, as they internally use hardware performance counters. They're also convenient, because you don't have to modify your code.

My recommendation is definitely nvvp. It's really nice to see a detailed visual chart of your program's performance. That has helped me debug and avoid pitfalls many times already. But all of the above approaches have their merits and use-cases, so it really depends on what you're trying to achieve. Good luck!
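
The clock64() sketch mentioned in item 3: this is CUDA C held in a Rust string only to keep everything in one place; in practice you would compile it to PTX (e.g. with nvcc --ptx) and load the PTX with RustaCUDA's Module::load_from_string. The kernel body is a placeholder:

```rust
// clock64() reads the per-SM cycle counter; subtracting two reads gives elapsed
// cycles (convert to time with the SM clock rate). Hypothetical kernel for illustration.
const TIMING_KERNEL_CU: &str = r#"
extern "C" __global__ void timed_kernel(float *data, long long *cycles) {
    long long start = clock64();

    // ... the work you actually want to measure ...

    long long stop = clock64();
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *cycles = stop - start;
    }
}
"#;
```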

@bheisler
Owner

Huh, I didn't know about clock64().

Yeah, I'd second the recommendation to use nvprof or nvvp for profiling. For simple benchmarking, I usually use events, although they aren't yet available in RustaCUDA. Hopefully they will be soon, time permitting.

What I would really want for benchmarking is to use events for measurement combined with Criterion.rs for analysis, but that will take some careful development work and I haven't had time to do that work lately.

@saona-raimundo

Just for the record:
We have events now in RustaCUDA!! :D

(newbie here: is the support enough to refresh the roadmap in README?)

@bheisler
Owner

bheisler commented Feb 29, 2020

This is now kinda old; I would recommend using Criterion.rs' Bencher::iter_custom and implementing whichever timing technique you prefer.
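
To make that concrete, here is the rough shape of Bencher::iter_custom with a custom timer; only the Criterion.rs scaffolding is real, and the CUDA setup and kernel launch are placeholders (an event-based timer would slot in the same way as Instant does here):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::time::Instant;

fn bench_kernel(c: &mut Criterion) {
    c.bench_function("my_kernel", |b| {
        b.iter_custom(|iters| {
            // One-time setup (context, module, stream, device buffers) would go here.
            let start = Instant::now();
            for _ in 0..iters {
                // Launch the kernel and synchronize the stream -- hypothetical placeholder.
            }
            start.elapsed()  // return the total measured Duration for `iters` iterations
        })
    });
}

criterion_group!(benches, bench_kernel);
criterion_main!(benches);
```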
