Add min rel accuracy stopping criterion #1744

maartenarnst · 2024-01-25T14:49:38Z

This is work in progress to try to add a statistical stopping criterion. While the existing statistics features seek to characterise the variability between repetitions of the benchmark, the statistical stopping criterion in this PR seeks to ensure that within a single repetition, the estimate of the mean duration has become sufficiently statistically accurate.

This is a draft.

google-cla · 2024-01-25T14:49:42Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

LebedevRI · 2024-01-25T15:14:42Z

src/thread_timer.h

  // Accumulated time so far (does not contain current slice if running_)
  double real_time_used_ = 0;
+
  double cpu_time_used_ = 0;
  // Manually set iteration time. User sets this with SetIterationTime(seconds).
  double manual_time_used_ = 0;
+  double manual_time_used2_ = 0;


So we need to track every timer as-is and as as a sum-of-squares?

Yes, the idea is to look at the time durations as independent identically distributed samples of a random variable. And then use second-order statistics to estimate mean, variance, and accuracy of the estimate of the mean. So for this, we need indeed to keep track of the sum of the squares of the time durations.

LebedevRI · 2024-01-25T15:15:17Z

src/benchmark_runner.cc

-         i.seconds >=
-             GetMinTimeToApply() ||  // The elapsed time is large enough.
+         (((i.seconds >= GetMinTimeToApply()) &&
+          (b.use_manual_time() && !IsZero(GetMinRelAccuracy()) && (std::sqrt(i.seconds2 / i.iters - std::pow(i.seconds / i.iters, 2.)) / (i.seconds / i.iters) / sqrt(i.iters) <= GetMinRelAccuracy()))) || // The relative accuracy is enough.


Err, i think this needs a bit of refactoring...

LebedevRI · 2024-01-25T15:17:48Z

src/benchmark_runner.cc

-
-  BM_VLOG(3) << "Next iters: " << next_iters << ", " << multiplier << "\n";
-  return next_iters;  // round up before conversion to integer.
-}


I do not understand the changes to this function.
Does this even compile? multiplier is a local variable, and it seems unused now.

Hi @LebedevRI. Thanks for looking at the PR. I must admit that I meant to push it to my own fork, rather than to the Google repository. This is why the PR is in a very early draft stage. But it's here now :) So maybe that's indeed a good opportunity to work on it together.

For this particular change, it's an unintentional leftover from a wrong manipulation. I'll delete it.

src/benchmark_runner.cc

LebedevRI · 2024-01-25T18:19:35Z

As you probably noticed, non-manual timers don't produce per-iteration times
(I'm somewhat planning on looking into that, but i don't expect the default to change),
so this fundamentally only works for manual time.

include/benchmark/benchmark.h

src/benchmark.cc

maartenarnst · 2024-01-26T14:45:54Z

Hi @LebedevRI. In the mean time, I have elaborated the PR more. It is now in a better state for discussion.

Questions/issues that @romintomasetti and I are thinking of are:

Is this feature of controlling the number of iterations through a statistical criterion of interest to you? Our main motivation is that it's not always easy to set a "min time" in advance to ensure that the result has converged. Even if it's ultimately not retained as a stopping criterion, reporting a measure of statistical accuracy may be of interest.
I agree that as it stands, the feature is limited to benchmarks that use manual timing. On our side, we use benchmark for GPU kernels, mainly with Kokkos, an HPC library that itself also uses benchmark for its performance tests. We use manual timing, as you also suggest in your user guide. This use case may be sufficiently widespread as a motivation. (It may be worth exploring whether calls to the iterator for the state might be a way of obtaining per-iteration times for benchmarks that don't use manual timing).
How should the new stopping criterion (statistical relative accuracy) interact with the existing stopping criterion (time or number of iterations)? In the current state of the PR, ShouldReportIterationResults uses an "AND" logic. It will return true if both criteria are reached. For both criteria, there is an opt-out/opt-in by setting the threshold to zero. So the current behaviour of benchmark is maintained, and this PR is mainly adding functionality.
Do you agree with the current way of implementing the feature? The current implementation keeps track of the sum of the squares of the per-iteration times. An alternative could be to store all the per-iteration times. An advantage would be that such an alternative implementation would allow your existing code for computing statistics to be reused. It would also be more flexible for future extensions, e.g. looking at statistics other than the mean and variance, such as detecting outliers. Drawbacks may be the increased storage requirement and needs to dynamically resize the storage space.

Many thanks in advance!

LebedevRI · 2024-01-26T15:34:59Z

Is this feature of controlling the number of iterations through a statistical criterion of interest to you? Our main motivation is that it's not always easy to set a "min time" in advance to ensure that the result has converged. Even if it's ultimately not retained as a stopping criterion, reporting a measure of statistical accuracy may be of interest.

I don't have an answer to this question. Following replies assume "yes", but i'm undecided.
(FWIW, at the very least, i find this experiment, at the very least, to be entertaining/educating.)

I agree that as it stands, the feature is limited to benchmarks that use manual timing. On our side, we use benchmark for GPU kernels, mainly with Kokkos, an HPC library that itself also uses benchmark for its
performance tests. We use manual timing, as you also suggest in your user guide.

Yes, that use of manual time for GPU kernels is indeed very reasonable.

This use case may be sufficiently widespread as a motivation.

(It may be worth exploring whether calls to the iterator for the state might be a way of obtaining per-iteration times for benchmarks that don't use manual timing).

True per-iteration times are, a complicated topic.
The current library has been very intentionally designed
to prohibit that, and while, as i have said,
i was already planning on looking into that,
i'm not sure how that would look like.

How should the new stopping criterion (statistical relative accuracy) interact with the existing stopping criterion (time or number of iterations)? In the current state of the PR, ShouldReportIterationResults uses an "AND" logic. It will return true if both criteria are reached. For both criteria, there is an opt-out/opt-in by setting the threshold to zero. So the current behaviour of benchmark is maintained, and this PR is mainly adding functionality.

That's my biggest problem honestly. The existing lack of max-time limit
for manual time already seems problematic. What if the requested accuracy
is more precise than what is actually possible on a given machine?
(run out of time to reply here, may elaborate later)

Do you agree with the current way of implementing the feature? The current implementation keeps track of the sum of the squares of the per-iteration times. An alternative could be to store all the per-iteration times. An advantage would be that such an alternative implementation would allow your existing code for computing statistics to be reused. It would also be more flexible for future extensions, e.g. looking at statistics other than the mean and variance, such as detecting outliers. Drawbacks may be the increased storage requirement and needs to dynamically resize the storage space.

This feature does not need to store per-iteration times,
it only needs to compute standard deviation of all of the per-iteration times,
which can be computed as a "live stdev" the way you currently do it,
so i certainly agree with only just a new accumulator,
and would not agree with an alternative design.

As i have said, true per-iteration times is a complicated topic, and storing them doubly so.

LebedevRI · 2024-01-29T14:21:20Z

Also, BenchmarkRunner::DoOneRepetition() is wrong now,
we now can't re-use iteration count from one repetition to the others.
(now == with this accuracy check).

maartenarnst · 2024-01-29T17:56:45Z

Also, BenchmarkRunner::DoOneRepetition() is wrong now,
we now can't re-use iteration count from one repetition to the others.
(now == with this accuracy check).

I'm not sure to understand entirely what you mean.

The condition

benchmark/src/benchmark_runner.cc

Lines 432 to 434 in 30a37e1

    
           const bool results_are_significant = !is_the_first_repetition || 
        
                                                has_explicit_iteration_count || 
        
                                                ShouldReportIterationResults(i);

seems to be such that even after the changes in this PR, the behavior is that the first repetition sets the number of iterations and the next repetitions then reuse that number.

An issue may be that subsequent repetitions don't reach the required relative accuracy for the reused number of iterations. But isn't this issue also present for the current stopping criterion based on the time?

LebedevRI · 2024-01-29T19:12:11Z

The condition

benchmark/src/benchmark_runner.cc

Lines 432 to 434 in 30a37e1

const bool results_are_significant = !is_the_first_repetition ||

has_explicit_iteration_count ||

ShouldReportIterationResults(i);

seems to be such that even after the changes in this PR, the behavior is that the first repetition sets the number of iterations and the next repetitions then reuse that number.

An issue may be that subsequent repetitions don't reach the required relative accuracy for the reused number of iterations.

Precisely, yes. Presumably that is not what you want if you explicitly request convergence-based run-time.

But isn't this issue also present for the current stopping criterion based on the time?

Sure.

LebedevRI · 2024-01-31T03:45:05Z

(To be noted, i'm just pointing out that failure mode,
perhaps you consider such unexpected outcome to be acceptable.)

maartenarnst · 2024-02-01T16:15:45Z

Hi @LebedevRI.

In the mean time, I've thought a bit more about it.

An answer may be to look at it as two steps. This PR could be a first step focused on introducing a new convergence criterion. And exploring changes to the behavior for multiple repetitions could be a follow-up.

If I do think now about the behavior for multiple repetitions, I'm not sure I would call the issue a failure mode.

From a statistical point of view, doing the repetitions using a fixed number of iterations may be quite sensible. An advantage would be that basic versions of theoretical results from statistics would give some support to the approach. Such as basic versions of the central limit theorem to characterize theoretically the variability of a partial sum of a fixed number of random variables.

The current behavior for multiple repetitions, and keeping it unchanged for the new convergence criterion, also appears quite clear and easy to understand. The first repetition fixes the number of iterations. It can be done via a maximum number of iterations, a minimum number of time to apply, and/or, now, a minimum relative accuracy. Then, in subsequent repetitions, the fixed number of iterations is reused. Perhaps doing the experiment of trying to write a piece of doc about the new convergence criterion could be a good next step.

Those would be my two cents, coming more from a background of statistics.

Clean up the initial commit Further cleaning of initial commit. Add test. Improvements to comments thanks to review Reformat thanks to clang format. Static cast to avoid conversion warning

maartenarnst closed this Jan 25, 2024

maartenarnst reopened this Jan 25, 2024

maartenarnst marked this pull request as draft January 25, 2024 15:03

LebedevRI reviewed Jan 25, 2024

View reviewed changes

romintomasetti reviewed Jan 25, 2024

View reviewed changes

src/benchmark_runner.cc Outdated Show resolved Hide resolved

Hphu2004 approved these changes Jan 26, 2024

View reviewed changes

maartenarnst force-pushed the min_rel_accuracy branch from c242f5a to 3bca691 Compare January 26, 2024 11:19

romintomasetti reviewed Jan 26, 2024

View reviewed changes

include/benchmark/benchmark.h Outdated Show resolved Hide resolved

romintomasetti reviewed Jan 26, 2024

View reviewed changes

include/benchmark/benchmark.h Outdated Show resolved Hide resolved

romintomasetti reviewed Jan 26, 2024

View reviewed changes

src/benchmark.cc Outdated Show resolved Hide resolved

maartenarnst force-pushed the min_rel_accuracy branch 2 times, most recently from 6947d6c to d3990ce Compare May 24, 2024 11:11

Add min rel accuracy stopping criterion

35285be

Clean up the initial commit Further cleaning of initial commit. Add test. Improvements to comments thanks to review Reformat thanks to clang format. Static cast to avoid conversion warning

maartenarnst force-pushed the min_rel_accuracy branch from d3990ce to 35285be Compare December 10, 2024 15:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add min rel accuracy stopping criterion #1744

Add min rel accuracy stopping criterion #1744

maartenarnst commented Jan 25, 2024 •

edited

Loading

google-cla bot commented Jan 25, 2024

LebedevRI Jan 25, 2024

maartenarnst Jan 25, 2024

LebedevRI Jan 25, 2024

LebedevRI Jan 25, 2024

maartenarnst Jan 25, 2024

LebedevRI commented Jan 25, 2024

maartenarnst commented Jan 26, 2024

LebedevRI commented Jan 26, 2024 •

edited

Loading

LebedevRI commented Jan 29, 2024

maartenarnst commented Jan 29, 2024

LebedevRI commented Jan 29, 2024

LebedevRI commented Jan 31, 2024

maartenarnst commented Feb 1, 2024

Add min rel accuracy stopping criterion #1744

Are you sure you want to change the base?

Add min rel accuracy stopping criterion #1744

Conversation

maartenarnst commented Jan 25, 2024 • edited Loading

google-cla bot commented Jan 25, 2024

LebedevRI Jan 25, 2024

Choose a reason for hiding this comment

maartenarnst Jan 25, 2024

Choose a reason for hiding this comment

LebedevRI Jan 25, 2024

Choose a reason for hiding this comment

LebedevRI Jan 25, 2024

Choose a reason for hiding this comment

maartenarnst Jan 25, 2024

Choose a reason for hiding this comment

LebedevRI commented Jan 25, 2024

maartenarnst commented Jan 26, 2024

LebedevRI commented Jan 26, 2024 • edited Loading

LebedevRI commented Jan 29, 2024

maartenarnst commented Jan 29, 2024

LebedevRI commented Jan 29, 2024

LebedevRI commented Jan 31, 2024

maartenarnst commented Feb 1, 2024

maartenarnst commented Jan 25, 2024 •

edited

Loading

LebedevRI commented Jan 26, 2024 •

edited

Loading