Skip to content
This repository was archived by the owner on Jan 24, 2022. It is now read-only.

Allow GDB to unwind HardFault callstacks #131

Merged
merged 1 commit into from
Oct 23, 2018
Merged

Allow GDB to unwind HardFault callstacks #131

merged 1 commit into from
Oct 23, 2018

Conversation

adamgreen
Copy link
Contributor

When I currently request GDB to dump a hard fault stack, I see
something like this:

(gdb) bt
#0  UserHardFault_ (ef=0x10001fb8) at /depots/cortex-m-rt/src/lib.rs:537
#1  0x08003fe6 in HardFault ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

GDB can't unwind past HardFault since the current implementation of
this function overwrites the Link Register (LR) value. This change
pushes LR and R0 (to maintain 8-byte stack alignment) to the stack
before transferring execution to UserHardFault().

After this change, I see a callstack like this from GDB:

(gdb) bt
#0  UserHardFault_ (ef=0x10001fb0) at /depots/cortex-m-rt/src/lib.rs:537
#1  0x08003fe8 in HardFault ()
#2  <signal handler called>
#3  0x08002820 in core::ptr::read_volatile (src=0x48001800) at libcore/ptr.rs:472
#4  0x080001a2 in main () at src/07-registers/src/main.rs:14

Notes:

  • This code uses 8 more stack bytes.
  • Increases the size of the HardFault handler by 2 narrow instructions
    or 4 bytes. This could be decreased to 2 bytes by removing the pop
    since UserHardFault() doesn't currently return but it just looks too
    odd for me to do as an initial attempt.

When I currently request GDB to dump a hard fault stack, I see
something like this:
    (gdb) bt
    #0  UserHardFault_ (ef=0x10001fb8) at /depots/cortex-m-rt/src/lib.rs:537
    #1  0x08003fe6 in HardFault ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)

GDB can't unwind past HardFault since the current implementation of
this function overwrites the Link Register (LR) value. This change
pushes LR and R0 (to maintain 8-byte stack alignment) to the stack
before transferring execution to UserHardFault().

After this change, I see a callstack like this from GDB:
    (gdb) bt
    #0  UserHardFault_ (ef=0x10001fb0) at /depots/cortex-m-rt/src/lib.rs:537
    #1  0x08003fe8 in HardFault ()
    #2  <signal handler called>
    #3  0x08002820 in core::ptr::read_volatile (src=0x48001800) at libcore/ptr.rs:472
    #4  0x080001a2 in main () at src/07-registers/src/main.rs:14

Notes:
* This code uses 8 more stack bytes.
* Increases the size of the HardFault handler by 2 narrow instructions
  or 4 bytes. This could be decreased to 2 bytes by removing the pop
  since UserHardFault() doesn't currently return but it just looks too
  odd for me to do as an initial attempt.
@adamgreen adamgreen requested a review from a team as a code owner October 23, 2018 00:45
@thejpster
Copy link
Contributor

Oooh, having spent today doing a Rust workshop and regularly having to work out why I was in the fault handler, I approve of this.

Copy link
Contributor

@therealprof therealprof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, too.

@therealprof
Copy link
Contributor

bors r+

bors bot added a commit that referenced this pull request Oct 23, 2018
131: Allow GDB to unwind HardFault callstacks r=therealprof a=adamgreen

When I currently request GDB to dump a hard fault stack, I see
something like this:
```
(gdb) bt
#0  UserHardFault_ (ef=0x10001fb8) at /depots/cortex-m-rt/src/lib.rs:537
#1  0x08003fe6 in HardFault ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
```
GDB can't unwind past HardFault since the current implementation of
this function overwrites the Link Register (LR) value. This change
pushes LR and R0 (to maintain 8-byte stack alignment) to the stack
before transferring execution to UserHardFault().

After this change, I see a callstack like this from GDB:
```
(gdb) bt
#0  UserHardFault_ (ef=0x10001fb0) at /depots/cortex-m-rt/src/lib.rs:537
#1  0x08003fe8 in HardFault ()
#2  <signal handler called>
#3  0x08002820 in core::ptr::read_volatile (src=0x48001800) at libcore/ptr.rs:472
#4  0x080001a2 in main () at src/07-registers/src/main.rs:14
```
Notes:
* This code uses 8 more stack bytes.
* Increases the size of the HardFault handler by 2 narrow instructions
  or 4 bytes. This could be decreased to 2 bytes by removing the pop
  since UserHardFault() doesn't currently return but it just looks too
  odd for me to do as an initial attempt.

Co-authored-by: Adam Green <adamgreen@users.noreply.github.com>
@bors
Copy link
Contributor

bors bot commented Oct 23, 2018

Build succeeded

@bors bors bot merged commit f501d5a into rust-embedded:master Oct 23, 2018
@adamgreig
Copy link
Member

This is great but I really don't think we need the pop at the end there -- UserHardFault really won't ever return.

@thejpster
Copy link
Contributor

When is the next release scheduled? I'd love to be able to tell my Rust embedded training workshop when they can get backtraces in fault handlers.

@adamgreig
Copy link
Member

I'd like to do one soon to get #[interrupt] out there too. It might be worth revisiting some of the pending issues and PRs first.

@adamgreen
Copy link
Contributor Author

Thanks for reviewing and merging! Much appreciated!

Some of my notes on debugging faults on Cortex-M devices with GDB:

  • Even without this commit, it should be possible to place a breakpoint on HardFault to halt the CPU before LR is overwritten with the call to UserHardFault. The change in this commit is useful when you don't have a breakpoint set and you break into the UserHardFault loop and want to see what got you there. The Discovery book that I was reading through at the time when I hit this places the breakpoint on UserHardFault so that is why I saw it.
  • This commit should just work for Cortex-M0, Cortex-M3, and other Cortex-M devices without a FPU. Once you have a FPU, GDB itself doesn't know how to unwind exception stacks containing the volatile FPU registers. It also has issues with unrolling stacks when PSP is used by the faulting code. I have been patching GDB on my machine for a few years now to fix this issue. I recently sent a 64-bit Linux version of this patched GDB to the Smoothie CNC firmware guys. The diff I applied to the recent GDB sources to create this binary is also up on their GitHub repo.
  • GDB has had this broken Cortex-M exception unwinding code for almost a decade now. I would love to get a variant of my fixes rolled into the official GDB sources but there a few things stopping me:
    • I have no experience with submitting patches to GDB. Does anyone here have experience at upstreaming fixes to GDB that could help me through the process?
    • People have tried to submit patches for this in the past (mine is actually based on one of those past attempts) but they don't make it through. People don't like the use of the user_reg_map_name_to_regnum() function and it ends up turning into a conversation about how the required register set for Cortex-M devices should include MSP and PSP but don't currently. It's not an invalid argument but should really be addressed by someone more senior in the GDB community than the people like me who are just trying to fix this broken exception unwind functionality. It would be best if GDB just took a patch like this and then refactor it later once they make this architectural update (if they ever actually do).

@adamgreen adamgreen deleted the cleanupHardFaultStackFrame branch October 23, 2018 17:31
@adamgreig
Copy link
Member

@adamgreen I haven't had a chance to test yet but a couple of thoughts..

  • If we replaced bl UserHardFault with b UserHardFault, avoiding overwriting LR, would that also resolve the issue without needing to push onto the stack?
  • I still think we should remove the pop afterwards since it is dead code
  • Does the FPU problem happen if you're using the hard EABI where FP arguments are passed on the FPU registers instead?

@adamgreen
Copy link
Contributor Author

@adamgreig

  • I figured that bl UserHardFault was used instead of b UserHardFault since the latter doesn't have as large of a offset capability on ARMv6-M.
  • I can remove the the pop but it isn't that safe. While UserHardFault doesn't currently return, I don't see anything that would stop the linker from linking in a version which does return since it is using unmangled C function naming. Maybe there is something else enforcing this?
  • The FPU problem doesn't have anything to do with the calling convention. It is whether the volatile FPU registers are stacked upon exception entry or not. Usually they are so that ISRs can safely make use of floating point instructions.

@japaric
Copy link
Member

japaric commented Oct 23, 2018

@adamgreen Thanks for fixing this! I was about to file an issue about this today :-).

I figured that bl UserHardFault was used instead of b UserHardFault since the latter doesn't have as large of a offset capability on ARMv6-M.

Indeed that's the case. We had linker failures with b UserHardFault on ARMv6-M because the linker is free to put UserHardFault far away from HardFault (IDK if there's a way to put them close; perhaps using KEEP would be fine in this situation).

While UserHardFault doesn't currently return, I don't see anything that would stop the linker from linking in a version which does return since it is using unmangled C function naming.

The #[exception] attribute type checks that UserHardFault is a diverging function (fn() -> !) so it can't return.

Directly using no_mangle, export_name or link_section is very unsafe (though the compiler doesn't warn about it) so one should use the safe attributes instead of directly using those three.

If you directly use no_mangle making the function non-diverging is the least of your problems. You can also write:

#[no_mangle]
fn UserHardFault(i_like_boxes: Box<i32>, and_vectors: Vec<i32>) -> ! { .. }

and generate invalid references, types, etc.

@adamgreen
Copy link
Contributor Author

Cool! I will submit a PR today to remove the extraneous pop.

Thanks again all for the feedback.

bors bot added a commit that referenced this pull request Oct 23, 2018
132: Remove extraneous POP from my prev commit r=therealprof a=adamgreen

Based on the great feedback to PR #131, I have removed the POP
instruction that I added in my previous commit since UserHardFault
never returns.

Co-authored-by: Adam Green <adamgreen@users.noreply.github.com>
@therealprof
Copy link
Contributor

# for free to subscribe to this conversation on GitHub. Already have an account? #.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants