
C++ Client Crashes on ClientConductor::onInterServiceTimeout #371

Closed
goglusid opened this issue Jul 3, 2017 · 14 comments

@goglusid

goglusid commented Jul 3, 2017

When the following stack of functions is executed, if the C++ client still holds pointers into the log buffers, it crashes.

Following is how it can happen:

Thread#1: Call Publication::tryClaim
Thread#1: Use the BufferClaim...
Thread#2[ConductorThread]: Detects a timeout and executes the following stack.
Thread#1: Calls BufferClaim.commit();

Obviously, I'm debugging here, so I hit the 5-second timeout.

That being said, to be thread safe it seems the MemoryMappedFiles need to be managed as lingering resources, like the subscription's images.

aeron::util::MemoryMappedFile::cleanUp() Line 206
aeron::util::MemoryMappedFile::~MemoryMappedFile() Line 219
std::_Ref_count<aeron::util::MemoryMappedFile>::_Destroy() Line 578 + 0x23 bytes
std::_Ref_count_base::_Decref() Line 538
std::vector<std::shared_ptr<aeron::util::MemoryMappedFile>,std::allocator<std::shared_ptr<aeron::util::MemoryMappedFile> > >::_Destroy(std::shared_ptr<aeron::util::MemoryMappedFile> * _First=0x00549310, std::shared_ptr<aeron::util::MemoryMappedFile> * _Last=0x00549318) Line 1885 + 0x40 bytes
std::vector<std::shared_ptr<aeron::util::MemoryMappedFile>,std::allocator<std::shared_ptr<aeron::util::MemoryMappedFile> > >::_Tidy() Line 1952
aeron::LogBuffers::~LogBuffers() Line 84 + 0x56 bytes
aeron::LogBuffers::`scalar deleting destructor'() + 0xf bytes
std::_Ref_count_obj<aeron::LogBuffers>::_Destroy() Line 1327
std::_Ref_count_base::_Decref() Line 538
aeron::ClientConductor::PublicationStateDefn::~PublicationStateDefn() + 0x65 bytes
std::vector<aeron::ClientConductor::PublicationStateDefn,std::allocator<aeron::ClientConductor::PublicationStateDefn> >::clear() Line 1616 + 0x64 bytes
aeron::ClientConductor::onInterServiceTimeout(__int64 now=1499112928196) Line 548
aeron::ClientConductor::onHeartbeatCheckTimeouts() Line 303
aeron::concurrent::AgentRunner<aeron::ClientConductor,aeron::concurrent::SleepingIdleStrategy>::run() Line 64 + 0x2e bytes

@goglusid
Author

goglusid commented Jul 3, 2017

If we were to delay the call to MemoryMappedFile::cleanUp() by X ms after the actual ClientConductor::onInterServiceTimeout, we could avoid this crash.

This period (X ms) would represent the maximum execution time of a single call to Publication::offer, or the time elapsed between calling Publication::tryClaim and BufferClaim::commit.

@goglusid
Author

goglusid commented Jul 3, 2017

For reference, this is the code I mean when I talk about managing an Image's log buffers as lingering resources:

void ClientConductor::onUnavailableImage(
    std::int32_t streamId,
    std::int64_t correlationId)
{
    const long long now = m_epochClock();
    std::lock_guard<std::recursive_mutex> lock(m_adminLock);

    std::for_each(m_subscriptions.begin(), m_subscriptions.end(),
        [&](const SubscriptionStateDefn &entry)
        {
            if (streamId == entry.m_streamId)
            {
                std::shared_ptr<Subscription> subscription = entry.m_subscription.lock();

                if (nullptr != subscription)
                {
                    std::pair<Image*, int> result = subscription->removeImage(correlationId);
                    Image* oldArray = result.first;
                    const int index = result.second;

                    if (nullptr != oldArray)
                    {
                        lingerResource(now, oldArray[index].logBuffers());
                        lingerResource(now, oldArray);
                        entry.m_onUnavailableImageHandler(oldArray[index]);
                    }
                }
            }
        });
}

@tmontgomery
Contributor

cc @mjpt777

Lingering doesn't solve the underlying issue; I believe the same thing exists in the Java version. Lingering simply moves the time horizon. At its heart this is a race between the munmap due to the inter-service timeout and the BufferClaim commit/abort operations.

@mjpt777
Contributor

mjpt777 commented Jul 3, 2017

The Java code does not call the unavailable handlers when a forced close happens. I've also just pushed a change that will linger the resources for 1ms on a normal close and 1s on an inter service timeout.

@tmontgomery tmontgomery self-assigned this Jul 3, 2017
@tmontgomery
Contributor

I will reflect this in C++ in the next couple of days, if not sooner. Also, I want to add the agent invoker type option to the C++ API soon.

@goglusid
Author

goglusid commented Jul 3, 2017

I agree that lingering only reduces the probability of hitting this issue.

If we were to store a smart pointer in the Publication instance returned by findPublication, the application would control the lifetime of the log buffers without any possible race. Am I missing anything here?

@tmontgomery
Contributor

@goglusid Hmmm. Very very good point. That might work. Will give it a think. Yeah, that might be a nice way to handle it. Might also be usable for Java as well. Keep it around until Publication.close.

@goglusid
Author

goglusid commented Jul 4, 2017

@tmontgomery I meant keep it around until Publication::~Publication

@tmontgomery
Contributor

Agreed. I was thinking about Java as well, which would require an explicit close of the Publication instead of it simply going out of scope.

tmontgomery added a commit that referenced this issue Jul 5, 2017
…ivePublication to keep mapping around while in scope. For #371. Updated naming and layout for subcriber position in available image.
@tmontgomery
Contributor

@goglusid go ahead and try this now. The Publication (and ExclusivePublication) now holds a shared_ptr to the LogBuffers, so this should be cleaner.

@goglusid
Author

goglusid commented Jul 6, 2017

@tmontgomery Your awesomeness knows no bounds! ;p Problem solved. Thanks :D

@goglusid goglusid closed this as completed Jul 6, 2017
@tmontgomery
Contributor

Thanks! No worries! We'll be making some other changes in this area shortly as well.

@goglusid
Author

goglusid commented Jul 6, 2017

@tmontgomery Could you please elaborate a bit on the other changes in this area?

@tmontgomery
Contributor

Experimenting with reference counting the mappings for #365 so multiple mappings are not needed. I also want to add the agent-invoker style of thread control to C++, and to change the mapping flags.
