timing / benchmarking kernels from RustaCUDA code #29

Closed
zeroexcuses opened this issue Feb 17, 2019 · 7 comments

Comments

@zeroexcuses

Can we please get examples for benchmarking / timing kernels via RustaCUDA?

I'm not familiar with how to benchmark CUDA code and would love to learn from examples.

@bheisler
Owner

bheisler commented Mar 5, 2019

Hey, thanks for your patience - I've been on vacation for a week and a half or so.

That's a fair point; it's not obvious how to measure execution time for CUDA kernels.

The quick-and-dirty version is to enqueue an event, launch the kernel, and enqueue another event, then sync. Afterwards the events can provide the time when they were processed so you can subtract them to get the time for the kernel. Or you could, if RustaCUDA supported events.

It will take some work on both RustaCUDA and Criterion.rs but I think I can improve on this.
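
For reference, a minimal sketch of that quick-and-dirty pattern, assuming the Event API that later landed via Add Cuda Event #37 (Event::new, record, synchronize, and elapsed_time_f32; check the crate docs for the exact names and signatures). The kernel launch itself is a placeholder:

```rust
use rustacuda::event::{Event, EventFlags};
use rustacuda::prelude::*;
use std::error::Error;

// Times whatever work is enqueued on `stream` between the two events.
fn time_on_stream(stream: &Stream) -> Result<f32, Box<dyn Error>> {
    let start = Event::new(EventFlags::DEFAULT)?;
    let stop = Event::new(EventFlags::DEFAULT)?;

    start.record(stream)?;  // enqueue the "start" event
    // launch!(module.my_kernel<<<grid, block, 0, stream>>>(/* args */))?;  // hypothetical kernel
    stop.record(stream)?;   // enqueue the "stop" event

    stop.synchronize()?;    // block until the stop event has been processed
    let millis = stop.elapsed_time_f32(&start)?;  // GPU-side elapsed time in milliseconds
    Ok(millis)
}
```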

@zeroexcuses
Author

Hi @bheisler, thanks for the suggestion. I personally don't need this anymore, but it may be useful to others.

I'm using https://doc.rust-lang.org/std/time/struct.Instant.html to measure time on the Rust side.

For my particular case, I care about kernel execution time, not data load time (which happens rarely), so it goes something like:

move data async to gpu
sync
start = Instant::now();
launch kernel
sync
end = Instant::now();
print end-start

There are probably all kinds of problems with it, but for kernels whose run time is measured in seconds, it's been fine for me so far.
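
For anyone who wants the above in actual code, a sketch of the same pattern using RustaCUDA's usual setup (as in the crate README) and std::time::Instant; the kernel launch and real data are placeholders:

```rust
use rustacuda::prelude::*;
use std::error::Error;
use std::time::Instant;

fn main() -> Result<(), Box<dyn Error>> {
    rustacuda::init(CudaFlags::empty())?;
    let device = Device::get_device(0)?;
    let _ctx = Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device)?;
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    // Move data to the GPU (a synchronous copy here, for simplicity).
    let _input = DeviceBuffer::from_slice(&[1.0f32; 1024])?;
    stream.synchronize()?;

    let start = Instant::now();
    // launch!(module.my_kernel<<<grid, block, 0, stream>>>(/* args */))?;  // hypothetical kernel
    stream.synchronize()?;  // wait for the kernel to finish before stopping the clock
    println!("kernel took {:?}", start.elapsed());

    Ok(())
}
```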

@bheisler
Owner

bheisler commented Mar 9, 2019

Yeah, I'll reopen this as a reminder to myself to document this later.

@bheisler bheisler reopened this Mar 9, 2019
@LutzCle
Contributor

LutzCle commented May 10, 2019

There are four approaches to timing that I'm aware of:

  1. Host timing, e.g. with Instant::now(), like you suggested. That gives you coarse-grained times, as you don't have any guarantees about when synchronize() returns, and CUDA calls usually involve (slow) processor interrupts.
  2. CUDA events. This is what Nvidia recommends on their developer blog (see the post here). This is better than host timing, because it potentially allows the driver to do more exact timings. In my experience, timing results have very little variance, compared to host timing. You will be able to use this in RustaCUDA when Add Cuda Event #37 is merged.
  3. clock64() within your CUDA kernel. See the documentation here. This is the most exact timer-based method you can use, as you're reading a register twice (start and stop) on the GPU. This method has a few easy-to-miss pitfalls, though, which usually come up when you want to measure at a very fine level of detail (e.g., individual operations). I won't go into that here; a small sketch of the basic pattern follows after this comment.
  4. Profilers like nvprof and nvvp. These are exact, as they internally use hardware performance counters. They're also convenient, because you don't have to modify your code.

My recommendation is definitely nvvp. It's really nice to see a detailed visual chart of your program's performance. That has helped me debug and avoid pitfalls many times already. But all of the above approaches have their merits and use-cases, so it really depends on what you're trying to achieve. Good luck!
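
The clock64() sketch mentioned in item 3: this is CUDA C held in a Rust string only to keep everything in one place; in practice you would compile it to PTX (e.g. with nvcc --ptx) and load the PTX with RustaCUDA's Module::load_from_string. The kernel body is a placeholder:

```rust
// clock64() reads the per-SM cycle counter; subtracting two reads gives elapsed
// cycles (convert to time with the SM clock rate). Hypothetical kernel for illustration.
const TIMING_KERNEL_CU: &str = r#"
extern "C" __global__ void timed_kernel(float *data, long long *cycles) {
    long long start = clock64();

    // ... the work you actually want to measure ...

    long long stop = clock64();
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *cycles = stop - start;
    }
}
"#;
```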

@bheisler
Owner

Huh, I didn't know about clock64().

Yeah, I'd second the recommendation to use nvprof or nvvp for profiling. For simple benchmarking, I usually use events, although they aren't yet available in RustaCUDA. Hopefully they will be soon, time permitting.

What I would really want for benchmarking is to use events for measurement combined with Criterion.rs for analysis, but that will take some careful development work and I haven't had time to do that work lately.

@saona-raimundo

Just for the record:
We have events now in RustaCUDA!! :D

(newbie here: is the support enough to refresh the roadmap in README?)

@bheisler
Owner

bheisler commented Feb 29, 2020

This is now kinda old; I would recommend using Criterion.rs' Bencher::iter_custom and implementing whichever timing technique you prefer.
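
To make that concrete, here is the rough shape of Bencher::iter_custom with a custom timer; only the Criterion.rs scaffolding is real, and the CUDA setup and kernel launch are placeholders (an event-based timer would slot in the same way as Instant does here):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::time::Instant;

fn bench_kernel(c: &mut Criterion) {
    c.bench_function("my_kernel", |b| {
        b.iter_custom(|iters| {
            // One-time setup (context, module, stream, device buffers) would go here.
            let start = Instant::now();
            for _ in 0..iters {
                // Launch the kernel and synchronize the stream -- hypothetical placeholder.
            }
            start.elapsed()  // return the total measured Duration for `iters` iterations
        })
    });
}

criterion_group!(benches, bench_kernel);
criterion_main!(benches);
```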
