
C++ Client Crashes on ClientConductor::onInterServiceTimeout #371

Closed
goglusid opened this issue Jul 3, 2017 · 14 comments

@goglusid

goglusid commented Jul 3, 2017

When the following stack of functions is executed, if the C++ client still holds pointers into the log buffers, it crashes.

Following is how it can happen:

Thread#1: Call Publication::tryClaim
Thread#1: Use the BufferClaim...
Thread#2[ConductorThread]: Detects a timeout and executes the following stack.
Thread#1: Calls BufferClaim.commit();

Obviously, I'm debugging here, so I hit the 5-second timeout.

That being said, to be thread safe it seems the MemoryMappedFiles need to be managed as lingering resources, like the subscription's images.

aeron::util::MemoryMappedFile::cleanUp() Line 206
aeron::util::MemoryMappedFile::~MemoryMappedFile() Line 219
std::_Ref_count<aeron::util::MemoryMappedFile>::_Destroy() Line 578 + 0x23 bytes
std::_Ref_count_base::_Decref() Line 538
std::vector<std::shared_ptr<aeron::util::MemoryMappedFile>,std::allocator<std::shared_ptr<aeron::util::MemoryMappedFile> > >::_Destroy(std::shared_ptr<aeron::util::MemoryMappedFile> * _First=0x00549310, std::shared_ptr<aeron::util::MemoryMappedFile> * _Last=0x00549318) Line 1885 + 0x40 bytes
std::vector<std::shared_ptr<aeron::util::MemoryMappedFile>,std::allocator<std::shared_ptr<aeron::util::MemoryMappedFile> > >::_Tidy() Line 1952
aeron::LogBuffers::~LogBuffers() Line 84 + 0x56 bytes
aeron::LogBuffers::`scalar deleting destructor'() + 0xf bytes
std::_Ref_count_obj<aeron::LogBuffers>::_Destroy() Line 1327
std::_Ref_count_base::_Decref() Line 538
aeron::ClientConductor::PublicationStateDefn::~PublicationStateDefn() + 0x65 bytes
std::vector<aeron::ClientConductor::PublicationStateDefn,std::allocator<aeron::ClientConductor::PublicationStateDefn> >::clear() Line 1616 + 0x64 bytes
aeron::ClientConductor::onInterServiceTimeout(__int64 now=1499112928196) Line 548
aeron::ClientConductor::onHeartbeatCheckTimeouts() Line 303
aeron::concurrent::AgentRunner<aeron::ClientConductor,aeron::concurrent::SleepingIdleStrategy>::run() Line 64 + 0x2e bytes

@goglusid
Author

goglusid commented Jul 3, 2017

If we were to delay the call to MemoryMappedFile::cleanUp() by X ms after the actual ClientConductor::onInterServiceTimeout, we could avoid this crash.

This period (X ms) would represent the maximum execution time of a single call to Publication::offer, or the time elapsed between calling Publication::tryClaim and BufferClaim::commit.

@goglusid
Author

goglusid commented Jul 3, 2017

For reference, this is the code I mean when I talk about managing an Image's log buffers as lingering resources:

void ClientConductor::onUnavailableImage(
    std::int32_t streamId,
    std::int64_t correlationId)
{
    const long long now = m_epochClock();
    std::lock_guard<std::recursive_mutex> lock(m_adminLock);

    std::for_each(m_subscriptions.begin(), m_subscriptions.end(),
        [&](const SubscriptionStateDefn &entry)
        {
            if (streamId == entry.m_streamId)
            {
                std::shared_ptr<Subscription> subscription = entry.m_subscription.lock();

                if (nullptr != subscription)
                {
                    std::pair<Image*, int> result = subscription->removeImage(correlationId);
                    Image* oldArray = result.first;
                    const int index = result.second;

                    if (nullptr != oldArray)
                    {
                        lingerResource(now, oldArray[index].logBuffers());
                        lingerResource(now, oldArray);
                        entry.m_onUnavailableImageHandler(oldArray[index]);
                    }
                }
            }
        });
}

@tmontgomery
Contributor

cc @mjpt777

Lingering doesn't solve the underlying issue; I believe the same thing exists in the Java version. Lingering simply moves the time horizon. At its heart this is a race between the munmap due to the inter-service timeout and the BufferClaim commit/abort operations.

@mjpt777
Contributor

mjpt777 commented Jul 3, 2017

The Java code does not call the unavailable handlers when a forced close happens. I've also just pushed a change that will linger the resources for 1ms on a normal close and 1s on an inter service timeout.

@tmontgomery tmontgomery self-assigned this Jul 3, 2017
@tmontgomery
Contributor

I will reflect this in C++ in the next couple of days, if not sooner. Also, I want to add the agent invoker type option to the C++ API soon.

@goglusid
Author

goglusid commented Jul 3, 2017

I agree that lingering only reduces the probability of hitting this issue.

If we were to store a smart pointer in the Publication instance returned by findPublication, the application would control the lifetime of the log buffers without any possible race. Am I missing anything here?

@tmontgomery
Contributor

@goglusid Hmmm. Very very good point. That might work. Will give it a think. Yeah, that might be a nice way to handle it. Might also be usable for Java as well. Keep it around until Publication.close.

@goglusid
Author

goglusid commented Jul 4, 2017

@tmontgomery I meant keep it around until Publication::~Publication

@tmontgomery
Contributor

Agreed. I was thinking about Java as well, which would require an explicit close of the Publication instead of it simply going out of scope.

tmontgomery added a commit that referenced this issue Jul 5, 2017
…ivePublication to keep mapping around while in scope. For #371. Updated naming and layout for subcriber position in available image.
@tmontgomery
Contributor

@goglusid go ahead and try this now. The Publication (and ExclusivePublication) now holds a shared_ptr to the LogBuffers, so this should be cleaner.

@goglusid
Author

goglusid commented Jul 6, 2017

@tmontgomery Your awesomeness knows no bounds! ;p Problem solved. Thanks :D

@goglusid goglusid closed this as completed Jul 6, 2017
@tmontgomery
Contributor

Thanks! No worries! We'll be making some other changes in this area shortly as well.

@goglusid
Author

goglusid commented Jul 6, 2017

@tmontgomery Could you please elaborate a bit on the other changes in this area?

@tmontgomery
Contributor

Experimenting with reference counting the mappings for #365 so multiple mappings are not needed. I also want to add the agent-invoker style of thread control to C++, and to change the mapping flags.
