Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Heisenbug (crash) on macOS #949

Closed
veprbl opened this issue Sep 12, 2023 · 7 comments · Fixed by #1008
Closed

Heisenbug (crash) on macOS #949

veprbl opened this issue Sep 12, 2023 · 7 comments · Fixed by #1008
Assignees
Milestone

Comments

@veprbl
Copy link
Member

veprbl commented Sep 12, 2023

Crashes after several events

Thread model: pthreads
0x70000ecb2000: 
       `- JSignalHandler::handle_sigsegv(int, __siginfo*, void*) (0x1027e06ac)
        `- _sigtramp (0x7fff20776d7d)
         `- 4   ???                                 0x00007f89bc790e50 0x0 + 140229549297232 (0x7f89bc790e50)
          `- JFactoryPodioT<edm4hep::CaloHitContribution>::ClearData() (0x1072bccd9)
           `- JFactorySet::Release() (0x10277e042)
            `- JEventPool::get(unsigned long) (0x10278d727)
             `- JEventSourceArrow::execute(JArrowMetrics&, unsigned long) (0x10278cd5a)
              `- JWorker::loop() (0x102790a0c)
               `- void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (JWorker::*)(), JWorker*> >(void*) (0x102791c2e)
                `- _pthread_start (0x7fff207318fc)
                 `- thread_start (0x7fff2072d443)

The only workaround is to compile with -DUSE_ASAN=ON, then it doesn't crash at all.

Originally posted by @veprbl in #944 (comment)

@veprbl veprbl changed the title Heisenbug on macOS Heisenbug (crash) on macOS Sep 12, 2023
@veprbl veprbl added this to the 1.6.0 milestone Sep 18, 2023
@veprbl veprbl self-assigned this Sep 18, 2023
@veprbl
Copy link
Member Author

veprbl commented Sep 22, 2023

A similar bug on Linux (somehow, with edm4eic and eicrecon's factory):

Thread model: pthreads
140736862168640: 
       `- JSignalHandler::handle_sigsegv(int, siginfo_t*, void*) (0x7ffff7f6b721)
        `- /lib/libc.so.6 (0x7ffff563dbf0)
         `- edm4eic::ReconstructedParticle::~ReconstructedParticle() (0x7ffff4ef0d78)
          `- eicrecon::JFactoryPodioT<edm4eic::ReconstructedParticle>::ClearData() (0x7ffff379d3d0)
           `- JFactorySet::Release() (0x7ffff7eb5982)
            `- JEventSourceArrow::execute(JArrowMetrics&, unsigned long) (0x7ffff7edaf28)
             `- JWorker::loop() (0x7ffff7ee0614)
              `- /lib/libstdc++.so.6 (0x7ffff5adf3d4)
               `- /lib/libc.so.6 (0x7ffff5688e86)
                `- /lib/libc.so.6 (0x7ffff570fc60)

@nathanwbrei
Copy link
Contributor

Any idea as to the frequency of this? Are you seeing it in production?

@veprbl
Copy link
Member Author

veprbl commented Sep 22, 2023

We are seeing this in the production, according to @rahmans1 . Unfortunately, on Linux it's much more infrequent. On macOS you can process <100 events and get this.

@nathanwbrei
Copy link
Contributor

Sounds reasonably reproducible at least. Which architecture is your macOS machine running? Intel/M1/M2?

@veprbl
Copy link
Member Author

veprbl commented Sep 22, 2023

I don't have apple silicon to test on. x86_64 is what I have.

@veprbl
Copy link
Member Author

veprbl commented Oct 2, 2023

The nature of the bug seems to be with the fact that there is, in fact, reference counting from a podio "Object"/"MutableObject" to "ObjectObj" that is supposed to remove deallocate the underlying Obj in cases when it's not owned by a collection:
https://github.com/AIDASoft/podio/blob/babec745c12727cd5da590abbe1e3cf1ab8743ad/python/templates/macros/implementations.jinja2#L35-L39
https://github.com/AIDASoft/podio/blob/958ae51db7e105aca619a763536eaf28f9ec48ba/include/podio/ObjBase.h#L27-L33
What is likely happening is the collection gets deallocated together with all Obj's (they are not refcounted because they are owned), then we try to deallocate our Objects from mData and they either fail to check if they are owned, or try to occasionally decrease some memory. I don't know exactly why the crashes are rare.

There is also some funky stuff going with how JANA2 manages its memory. Basically, JEvent doesn't store any data directly, but has factories that are managed in JFactorySet. When it's time to reset, the ClearData is called
https://github.com/JeffersonLab/JANA2/blob/ea9ebf5aa2f0e00db587cf08ec106030188e54b7/src/libraries/JANA/JFactorySet.cc#L208-L214
in an order determined by an ordering of an std::map
https://github.com/JeffersonLab/JANA2/blob/ea9ebf5aa2f0e00db587cf08ec106030188e54b7/src/libraries/JANA/JFactorySet.h#L43
Mind you, in this map there is an entry for a factory that holds the global podio::Frame object with an empty string tag for the current written event
https://github.com/JeffersonLab/JANA2/blob/ea9ebf5aa2f0e00db587cf08ec106030188e54b7/src/libraries/JANA/Podio/JFactoryPodioT.cc#L8-L22
that factory gets cleared before some other factories but after some other factories.
One may ask: how PODIO manages to deallocate several cross-referncing collections without segfaulting? The answer is: what it does, is it uses codegen to unlink() all relations before removing the objects
https://github.com/AIDASoft/podio/blob/958ae51db7e105aca619a763536eaf28f9ec48ba/python/templates/CollectionData.cc.jinja2#L49-L81
https://github.com/AIDASoft/podio/blob/958ae51db7e105aca619a763536eaf28f9ec48ba/python/templates/macros/collections.jinja2#L14-L19

@veprbl
Copy link
Member Author

veprbl commented Oct 2, 2023

A basic reproducer that mimics our flow and triggers ASAN:

#include "datamodel/ExampleClusterCollection.h"
#include "podio/Frame.h"

int main() {
  ExampleCluster *clone;
  {
  podio::Frame frame;
  {
  ExampleClusterCollection coll;
  coll.create();
  const ExampleClusterCollection &moved = frame.put(std::move(coll), "mycoll");
  clone = new ExampleCluster(moved[0]);
  }
  }
  delete clone;
}

This may be not the reproducer, though.

github-merge-queue bot pushed a commit that referenced this issue Oct 3, 2023
We are observing inconsistent crashes in delete operator invocation in
ClearData() on certain events. These faults occur on macOS with a high
rate, specifically for CalorimeterHitContribution. We also have reports
of some percent of reconstruction jobs (on Linux) crashing specifically
on ReconstructedParticle.

There might be an issue with PODIO object clones, they should be
pointing to the same object data. There doesn't seem to be any facility
for reference counting.

A slight complication with addressing this is that the failure comes in
the implementation FactoryPodioT from JANA2 libraries. That factory is
implicitly instantiated as a result of EICrecon calling the
`JEvent::InsertCollectionAlreadyInFrame` method in `JEventSourcePODIO`.
This PR switches to EICrecon variant of the same code to have a chance
of handling the issue locally.

It is not clear if mData is used or just populated for compatibility.
This is an attempt at disabling the functionality in order to mitigate
the issue.

Resolves: #949
cc @wdconinc @rahmans1
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants