Heisenbug (crash) on macOS #949
Comments
A similar bug on Linux (somehow, with edm4eic and eicrecon's factory):
|
Any idea as to the frequency of this? Are you seeing it in production? |
We are seeing this in production, according to @rahmans1. Unfortunately, on Linux it is much less frequent. On macOS you can process fewer than 100 events and hit it. |
Sounds reasonably reproducible at least. Which architecture is your macOS machine running? Intel/M1/M2? |
I don't have apple silicon to test on. x86_64 is what I have. |
The bug seems to stem from the fact that there is, in fact, reference counting from a podio "Object"/"MutableObject" to its "ObjectObj", which is supposed to deallocate the underlying Obj when it is not owned by a collection. There is also some funky stuff going on with how JANA2 manages its memory. Basically, JEvent doesn't store any data directly, but has factories that are managed in JFactorySet. When it's time to reset, ClearData is called |
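To make the reference-counting idea above concrete, here is a minimal, generic sketch of a handle that shares a counted data object and frees it when the last handle goes away. This is not podio's actual code: `SketchObj`, `SketchHandle`, `g_liveObjs` and the two helper functions are hypothetical names invented for illustration, and the collection-owned case (where the handle must not deallocate) is deliberately left out.

```cpp
#include <utility>

// Instrumentation only: number of SketchObj instances currently alive.
int g_liveObjs = 0;

// Hypothetical stand-in for the data object behind a handle.
struct SketchObj {
  int refCount = 1;  // the creating handle holds the first reference
  float energy = 0.f;
  SketchObj() { ++g_liveObjs; }
  ~SketchObj() { --g_liveObjs; }
};

// Hypothetical stand-in for a user-facing handle.
class SketchHandle {
public:
  SketchHandle() : m_obj(new SketchObj()) {}
  // Copying a handle shares the data object and bumps its count.
  SketchHandle(const SketchHandle& other) : m_obj(other.m_obj) { ++m_obj->refCount; }
  // Copy-and-swap: the by-value parameter releases the old object on exit.
  SketchHandle& operator=(SketchHandle other) {
    std::swap(m_obj, other.m_obj);
    return *this;
  }
  // The last handle to go away deallocates the data object.
  ~SketchHandle() {
    if (--m_obj->refCount == 0) delete m_obj;
  }
  int useCount() const { return m_obj->refCount; }

private:
  SketchObj* m_obj;
};

// Two handles to one object: the count seen through either handle is 2.
int useCountAfterCopy() {
  SketchHandle a;
  SketchHandle b = a;
  return a.useCount();
}

// After every handle has gone out of scope, no data object is left alive.
int liveObjectsAfterScope() {
  {
    SketchHandle a;
    SketchHandle b = a;
    SketchHandle c;
    c = b;  // c's original object is released here
  }
  return g_liveObjs;
}
```

In a scheme like this, a crash in `delete` usually means that something outside the counting (for example a second owner calling `delete` directly) freed the same data object.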
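A basic reproducer that mimics our flow and triggers ASAN:

```cpp
#include <utility>  // std::move

#include "datamodel/ExampleClusterCollection.h"
#include "podio/Frame.h"

int main() {
  ExampleCluster *clone;
  {
    podio::Frame frame;
    {
      ExampleClusterCollection coll;
      coll.create();
      // The frame takes over the collection and the objects in it.
      const ExampleClusterCollection &moved = frame.put(std::move(coll), "mycoll");
      // Heap-allocated copy of a handle to an object held by the frame.
      clone = new ExampleCluster(moved[0]);
    }
  }  // the frame, and the collection it holds, are destroyed here
  delete clone;  // the handle outlives the frame
}
```

This may not be the reproducer, though. |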
We are observing inconsistent crashes in the delete operator invoked from ClearData() on certain events. These faults occur at a high rate on macOS, specifically for CalorimeterHitContribution. We also have reports of some percentage of reconstruction jobs (on Linux) crashing specifically on ReconstructedParticle.

There might be an issue with PODIO object clones: they should be pointing to the same object data, but there doesn't seem to be any facility for reference counting.

A slight complication in addressing this is that the failure occurs in the implementation of FactoryPodioT from the JANA2 libraries. That factory is implicitly instantiated as a result of EICrecon calling the `JEvent::InsertCollectionAlreadyInFrame` method in `JEventSourcePODIO`. This PR switches to an EICrecon variant of the same code to have a chance of handling the issue locally.

It is not clear whether mData is used or just populated for compatibility. This is an attempt at disabling the functionality in order to mitigate the issue.

Resolves: #949 cc @wdconinc @rahmans1
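As an illustration of the ownership question raised above, here is a minimal sketch of a factory-side cache that only holds non-owning pointers to objects owned elsewhere, so that clearing it never calls `delete`. It is an assumption about one possible mitigation, not the code in this PR or in JANA2: `SketchHit`, `SketchFrame`, `SketchFactory` and the helper functions are hypothetical names, with `m_data` playing the role of mData.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical data object.
struct SketchHit {
  float energy = 0.f;
};

// Hypothetical owner, playing the role of the frame and its collection.
struct SketchFrame {
  std::vector<std::unique_ptr<SketchHit>> hits;
};

// Hypothetical factory cache: it records views of objects it does not own.
class SketchFactory {
public:
  // Record pointers to objects that already live in the frame.
  void insertAlreadyInFrame(const SketchFrame& frame) {
    for (const auto& hit : frame.hits) {
      m_data.push_back(hit.get());
    }
  }
  // Drop the views only; the frame stays responsible for deallocation.
  void clearData() { m_data.clear(); }
  std::size_t size() const { return m_data.size(); }

private:
  std::vector<const SketchHit*> m_data;
};

// Build a frame holding n hits.
SketchFrame makeFrame(std::size_t n) {
  SketchFrame frame;
  for (std::size_t i = 0; i < n; ++i) {
    frame.hits.push_back(std::make_unique<SketchHit>());
  }
  return frame;
}

// Number of cached views right after insertion.
std::size_t cachedAfterInsert(std::size_t n) {
  SketchFrame frame = makeFrame(n);
  SketchFactory factory;
  factory.insertAlreadyInFrame(frame);
  return factory.size();
}

// Number of hits still owned by the frame after the cache is cleared.
std::size_t ownedAfterClear(std::size_t n) {
  SketchFrame frame = makeFrame(n);
  SketchFactory factory;
  factory.insertAlreadyInFrame(frame);
  factory.clearData();
  return frame.hits.size();
}
```

With a single owner, each object is freed exactly once, when the frame is destroyed, regardless of how often the cache is cleared.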
Crashes after several events
The only workaround is to compile with -DUSE_ASAN=ON; then it doesn't crash at all.
Originally posted by @veprbl in #944 (comment)