Stack painting with `--measure-stack` is slow #258

jonas-schievink · 2021-09-08T16:02:51Z

With --measure-stack, added in #254, we paint the whole area the stack could occupy with a bit pattern, and then read it back to determine the program's stack usage. This can write and read hundreds of KBs of RAM, which takes several seconds, so it would be great to speed this up.

One idea for speeding this up was to essentially run memset on the MCU, but probe-rs does not seem to expose an API for this (if this is even possible at all, with the vendor-provided on-device algorithms).


japaric · 2022-01-25T11:32:37Z

Context

the measurement consists of two steps:

1. before program start, fill the memory region that corresponds to the call stack with a known bit pattern
2. after program end, linearly search that memory region for the first address that does not contain the known bit pattern

note that the search has to start at the "end" of the stack. in the case of the ARM ISA that would be the lowest address
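
The two steps can be sketched on the host against a plain byte buffer that stands in for the stack region. This is only an illustration: the `PATTERN` value and the function names `paint` and `measure_usage` are assumptions made for the example, not names taken from probe-run. Index 0 plays the role of the lowest address, so the scan starts there.

```rust
/// Pattern written to every byte of the region (an arbitrary choice for this sketch).
const PATTERN: u8 = 0xAA;

/// Step 1: fill the whole region with the known pattern.
fn paint(stack: &mut [u8]) {
    stack.fill(PATTERN);
}

/// Step 2: scan upward from the lowest address (index 0) and return the
/// number of bytes from the first overwritten byte up to the top of the region.
/// On a descending stack this approximates the deepest point the stack reached.
fn measure_usage(stack: &[u8]) -> usize {
    match stack.iter().position(|&b| b != PATTERN) {
        Some(first_touched) => stack.len() - first_touched,
        None => 0, // pattern fully intact: nothing was written
    }
}
```

Note that a program that happens to write the pattern value itself at the deepest point would be under-counted, which is why the result is an estimate.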

Solution

here's how to make those two steps (hopefully) faster:

first, we should measure how long that takes right now.
the operation is currently done using a probe_rs API that does a memcpy from the host to the target over USB.
to make step (1) faster try this:
- load a fill_stack subroutine to the target
- have the target execute that subroutine and pause (breakpoint instruction) when it's done
- the host busy waits until the target is done (hits the breakpoint)
to make step (2) faster try this:
- load a search_stack subroutine to the target
- have the target execute that subroutine, store the result address in a register and pause when it's done
- the host busy waits until the target is done then it reads the target's register that contains the result
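
The host side of both variants follows the same sequence: load, set PC, resume, busy wait, read the result. Below is a rough sketch of that sequence. The `Target` trait and all of its method names are hypothetical stand-ins invented for this example; they are not the probe_rs API, and a prototype would map each call onto whatever probe_rs actually offers. `FakeTarget` exists only so the sequence can be exercised without hardware.

```rust
/// Hypothetical stand-in for a debug-probe session.
/// These method names are invented for this sketch and are NOT probe_rs calls.
trait Target {
    fn write_ram(&mut self, addr: u32, bytes: &[u8]);
    fn set_pc(&mut self, addr: u32);
    fn resume(&mut self);
    fn is_halted(&mut self) -> bool;
    fn read_result_register(&mut self) -> u32;
}

/// Load a subroutine, run it until it halts on its breakpoint, and return
/// the value it left in the result register.
fn run_subroutine<T: Target>(target: &mut T, load_addr: u32, code: &[u8]) -> u32 {
    target.write_ram(load_addr, code);
    target.set_pc(load_addr);
    target.resume();
    while !target.is_halted() {
        std::hint::spin_loop(); // the host busy waits for the breakpoint
    }
    target.read_result_register()
}

/// In-memory fake used only to exercise the sequence above.
struct FakeTarget {
    ram: Vec<u8>,
    pc: u32,
    polls_until_halt: u32,
    result: u32,
}

impl Target for FakeTarget {
    fn write_ram(&mut self, addr: u32, bytes: &[u8]) {
        let start = addr as usize;
        self.ram[start..start + bytes.len()].copy_from_slice(bytes);
    }
    fn set_pc(&mut self, addr: u32) {
        self.pc = addr;
    }
    fn resume(&mut self) {}
    fn is_halted(&mut self) -> bool {
        // Pretend the subroutine needs a few polls before it finishes.
        if self.polls_until_halt == 0 {
            true
        } else {
            self.polls_until_halt -= 1;
            false
        }
    }
    fn read_result_register(&mut self) -> u32 {
        self.result
    }
}
```

For the fill_stack case the returned register value can simply be ignored; only search_stack needs it.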

these two operations can be prototyped outside probe-run using the probe_rs library.

these two alternative approaches should be timed before being integrated into probe-run. if it turns out they are slower then there's no point in integrating them.
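
For the timing comparison, a small helper like the following may be enough. It is a sketch using only the standard library; the name `time_best_of` is made up for this example, and the closure would wrap either the current host-side write or the on-target subroutine run.

```rust
use std::time::{Duration, Instant};

/// Run `op` `runs` times and return the fastest wall-clock duration.
/// Taking the minimum damps out USB and scheduler jitter.
/// With `runs == 0` nothing is measured and `Duration::MAX` is returned.
fn time_best_of<F: FnMut()>(runs: u32, mut op: F) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        op();
        best = best.min(start.elapsed());
    }
    best
}
```

Timing both approaches with the same helper and the same region size keeps the comparison fair.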

More context

more details on loading and executing the program on the target:

How to write the subroutine?

the fill_stack function can be written in Rust but must be cross compiled to the thumbv6m-none-eabi target so that it also works with Cortex-M0.
after that function is cross compiled it'll become machine code (a bunch of bytes); that's what needs to be loaded to the target.
the function should be written in a way that's self-contained and does not perform any other function call (otherwise executing it becomes tricky)
it's also OK to write the function in assembly -- actually it may be easier to avoid stack usage and function calls that way; as we'll only use the machine code it doesn't matter what the source code is
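
As a starting point, a Rust version of fill_stack could look roughly like this. The signature (start pointer, end pointer, pattern) is an assumption made for the sketch. The volatile write is there so the compiler should not turn the loop into a `memset` call, which would break the "no other function call" requirement; the generated assembly still needs to be checked by hand.

```rust
/// Fill the words in `[start, end)` with `pattern`.
///
/// Written as a single leaf function with no calls so that, when compiled
/// for thumbv6m-none-eabi, it should reduce to a short store loop.
///
/// # Safety
/// `start..end` must be a valid, writable, 4-byte-aligned range with `start <= end`.
unsafe fn fill_stack(mut start: *mut u32, end: *mut u32, pattern: u32) {
    while start < end {
        // Volatile store: not eligible for being merged into a memset call.
        unsafe { core::ptr::write_volatile(start, pattern) };
        start = unsafe { start.add(1) };
    }
    // On the target, a breakpoint instruction would follow here so the host
    // can tell the subroutine is done. It is omitted so this runs on a host.
}
```

Because the function only touches its arguments and the target range, it can be tried on the host against an ordinary array before cross compiling.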

Where to load the subroutine?

after that, the question is where to load the subroutine: I would suggest loading it to RAM because that's easier than writing to Flash and that way there's no risk it'll collide with the program we want to run on the target.
careful here: the subroutine will write to RAM so the subroutine itself must be written somewhere it won't overwrite itself
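
One way to guard against the subroutine overwriting itself is to check on the host that its load range and the painted range do not intersect before writing anything. A minimal sketch, with hypothetical helper names and made-up addresses in the usage:

```rust
use std::ops::Range;

/// True if two non-empty half-open address ranges share at least one address.
fn overlaps(a: &Range<u32>, b: &Range<u32>) -> bool {
    a.start < b.end && b.start < a.end
}

/// True if a `code_len`-byte subroutine loaded at `load_addr` stays clear of
/// the region that will be painted. Address overflow counts as unsafe.
fn is_safe_load_address(load_addr: u32, code_len: u32, painted: &Range<u32>) -> bool {
    match load_addr.checked_add(code_len) {
        Some(end) => !overlaps(&(load_addr..end), painted),
        None => false,
    }
}
```

For example, with a painted region of `0x2000_0100..0x2000_1000`, a 0x40-byte subroutine at `0x2000_0000` passes the check, while the same subroutine at `0x2000_00F0` does not.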

How to run the subroutine?

to run the subroutine it should suffice to set the program counter (PC) register to the start of it and resume the target
that would only be the case if the subroutine does not use any stack space; that should be the case for these simple functions but double check the assembly (the Stack Pointer register should NOT be modified)

302: Make stack painting fast again! 🇪🇺 r=Urhengulas a=Urhengulas

This PR implements the first of the improvements outlined in #258. Fixes #258.

## But what is "stack painting" anyways?

The idea is to write a specific byte pattern to (part of) the stack before the program is executed. After the program has finished, either because it is done with its task or because there was an error, we read out the previously painted area and check how much of it is still intact. If the pattern is still the same, we can be rather certain that the program didn't write to this part of the stack. This information helps either to detect a stack overflow or just to measure how much of the stack was used.

So far both reading and writing of the memory was done via the probe. While this works, it is also rather slow, because the host and probe communicate via USB, which takes time. The new approach is writing a subroutine to the MCU, which will paint the memory from within.

## Measurements

The following table shows how much time the old and new approaches take for memory sizes from 8 to 256 KiB.

![data](https://user-images.githubusercontent.com/37087391/154973187-c17e66f7-cb22-4e56-8dff-a9798ab3a39a.png)

The results are pretty impressive. The new approach is about 170 times faster!

## Further work

- A similar approach can also be applied to reading out the stack after the program finished.
- Additionally, the stack canary can be simplified quite a lot. So far we are not painting the whole stack unless the user asks for it, because this _was_ slow. Because it is fast now, we can always paint all of it, which simplifies the code and removes the need for the `--measure-stack` flag.

Co-authored-by: Johann Hemmann <johann.hemmann@code.berlin>

Urhengulas · 2022-02-25T10:23:46Z

Reopening because only part of it is fixed so far.

Urhengulas added the "difficulty: medium", "status: needs design" and "type: enhancement" labels Jan 25, 2022

Urhengulas self-assigned this Jan 25, 2022

Urhengulas mentioned this issue Feb 21, 2022

Make stack painting fast again! 🇪🇺 #302

Merged

bors bot closed this as completed in 12228a0 Feb 25, 2022

Urhengulas reopened this Feb 25, 2022

Urhengulas mentioned this issue Jun 28, 2022

Optimize stack usage measuring #327

Merged

bors bot closed this as completed in 5b71ea3 Aug 8, 2022
