Skip to content

Fix segfault in JTC and simplify checks #423

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
wants to merge 8 commits into from

Conversation

JafarAbdi
Copy link
Contributor

This PR fixes a segfault in JTC, we were getting a segfault every time we deactivate and then re-activate the controllers.

Attaching a debugger to the node shows read_state_from_hardware(state_desired_) causing the segfault, for some reason the variable in empty, digging more I don't see this variable being modified much, but I suspect it's related to the call to sample which set it to the default value output_state = trajectory_msgs::msg::JointTrajectoryPoint()

Copy link
Member

@destogl destogl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you extend tests to catch this error, i.e., provoke this error so we see this is actually doing what it should?

Are you changing interfaces between reactivation cycles?

@@ -374,12 +374,17 @@ void JointTrajectoryController::read_state_from_hardware(JointTrajectoryPoint &
auto assign_point_from_interface =
[&](std::vector<double> & trajectory_point_interface, const auto & joint_interface)
{
trajectory_point_interface.resize(dof_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but this should already have the right size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, it's not (see the PR description for where I think the size is being changed) which is why it's segfaulting

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@destogl managed to find the cause? We could open an investigation ticket but I'm keen on merging this soon

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this info is useful or not, but I encountered this issue when doing the following (using the latest master):

  1. load the JTC into inactivate state from the CLI using the --set-state configured option
  2. unload the JTC immediately

The result was a segfault in every single case. No reactivation needed, as the controller was not activated at all. On the other hand, if we load it into unconfigured state, and then attempt to unload it, the segfault doesn't occur.
Neither when we're loading the JTC into inactive state -> activating it -> deactivating it (JTC is back into inactive state) and only then unloading it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VX792 Did this PR resolve the issue for you? If yes, do you mind creating a test that trigger the segfault? I no longer have access to the codebase that reproduce this issue 😿

Copy link
Contributor

@VX792 VX792 Feb 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it didn't. It's entirely possible that we're talking about two different issues, since mine was 'solved' by a really quick fix.

The problem was that the traj_home_point_ptr_ variable is initially a nullptr, and it only gets initialized in the on_activate() function. If we're calling the on_cleanup() fn through a unload_controller request, until the activation doesn't happen, the cleanup will attempt to call a traj_home_point_ptr_->update(), trying to access a member of a nullptr. And that's a huge segfault!

We can check if the pointer is null and simply return (it works!), but I know nothing about control theory so @destogl might come up with a better solution.

@JafarAbdi
Copy link
Contributor Author

Are you changing interfaces between reactivation cycles?

No, we don't, it's the same controller

@destogl
Copy link
Member

destogl commented Aug 26, 2022

I am not convinced that this actually solves the cause of the issue. I would say it's more about hiding the real issue. I would wait to reproduce this in the tests and then know what is causing this...

@mergify
Copy link
Contributor

mergify bot commented Sep 7, 2022

This pull request is in conflict. Could you fix it @JafarAbdi?

@codecov-commenter
Copy link

codecov-commenter commented Sep 7, 2022

Codecov Report

Merging #423 (154bbd1) into master (e7f9962) will decrease coverage by 3.31%.
The diff coverage is 26.60%.

@@            Coverage Diff             @@
##           master     #423      +/-   ##
==========================================
- Coverage   35.78%   32.48%   -3.31%     
==========================================
  Files         189        7     -182     
  Lines       17570      665   -16905     
  Branches    11592      357   -11235     
==========================================
- Hits         6287      216    -6071     
+ Misses        994      157     -837     
+ Partials    10289      292    -9997     
Flag Coverage Δ
unittests 32.48% <26.60%> (-3.31%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ontroller/test/test_load_diff_drive_controller.cpp 12.50% <0.00%> (ø)
diff_drive_controller/src/odometry.cpp 42.16% <11.11%> (ø)
diff_drive_controller/src/speed_limiter.cpp 46.55% <11.11%> (ø)
...ive_controller/test/test_diff_drive_controller.cpp 17.62% <12.08%> (ø)
...troller/include/diff_drive_controller/odometry.hpp 20.00% <20.00%> (ø)
...iff_drive_controller/src/diff_drive_controller.cpp 39.22% <35.50%> (ø)
...de/diff_drive_controller/diff_drive_controller.hpp 100.00% <100.00%> (ø)
...ontrollers/src/joint_group_velocity_controller.cpp
...test/test_load_joint_group_velocity_controller.cpp
...include/joint_trajectory_controller/tolerances.hpp
... and 191 more

@JafarAbdi JafarAbdi force-pushed the pr-segfault branch 2 times, most recently from 547ff4e to d7d1fd9 Compare October 12, 2022 16:57
Copy link
Member

@bmagyar bmagyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for following up w tests

@JafarAbdi
Copy link
Contributor Author

I'm sorry, I forgot to mention that the test I added has an issue. I had a call with Denis, and he asked me to push my WIP test, so he could take a look and see if he can help finish it!

@destogl
Copy link
Member

destogl commented Jan 24, 2023

I tried to merge this with the old reactivation test, where actual trajectory execution is done. The complexity there is that we change the controller to use input time and period in the update method, so their parameters should be adjusted.

First the test need to replicate the described issue and after that show that fix is working. The position, velocity and acceleration filed don't need to have the same size, i.e., they may be uninitialized if not used. But they should definitely be allocated in the configure method. The question here is when after reactivation we loose the initialization in the memory of those filed. That's confusing for me – this is the reason I think we have much significant issue then this fix proposes.

@VX792 VX792 mentioned this pull request Feb 4, 2023
@mechwiz
Copy link
Contributor

mechwiz commented Feb 6, 2023

@swiz23 noticed a segfault in the JTC specifically when aborting a motion, followed by deactivating, followed by activating where the segfault would occur. After looking at this PR, I think it very much has to do with the line
output_state = trajectory_msgs::msg::JointTrajectoryPoint().
I think the flow of events occur as follows:

  1. When aborting, set_hold_position() is called. That essentially adds and writes an empty trajectory message in order to disregard the current one.
  2. When sample() is called in the update() loop, state_desired_ gets overwritten with that empty trajectory point meaning it has an empty position, velocity, and acceleration array.
  3. The controllers get deactivated
  4. The controllers get reactivated. When this happens, read_state_frome_hardware is called on state_desired_. Within that function, we try to index the position array within state_deisred_ as if it has the same size as dof_ when it in fact no longer does. This causes the segault.

The solution is to simply move the line defining the output_state within the sample() function in trajectory.cpp to after the last possible place that the function can return false -> here. I tested this out at least for the case that @swiz23 had issues with and verified that there was no longer a segfault.

@mechwiz mechwiz mentioned this pull request Feb 6, 2023
@JafarAbdi
Copy link
Contributor Author

Thanks @mechwiz for the fix. I'm closing this since someone else who currently can reproduce the issue have a fix

@JafarAbdi JafarAbdi closed this Feb 7, 2023
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants