
C++ client: heap buffer overflow in Subscription::poll if conductor thread calls Subscription::removeImage concurrently #383

Closed
lukeocamden opened this issue Jul 28, 2017 · 8 comments

lukeocamden commented Jul 28, 2017

Hello, consider the following interleaving:

Subscription::poll

const int length = std::atomic_load(&m_imagesLength); // reads e.g. 5

Subscription::removeImage

std::atomic_store(&m_imagesLength, length - 1); // writes 4
std::atomic_store(&m_images, newArray); // writes pointer to array of length 4

Subscription::poll

Image *images = std::atomic_load(&m_images); // reads pointer to array of length 4
...
fragmentsRead += images[i].poll(fragmentHandler, fragmentLimit - fragmentsRead); // i still runs up to the stale length (5) while the new array has only 4 entries: undefined behavior

Does that make sense?

This would be simpler to reason about, and perhaps faster, if only one write (say, of a pointer to a vector) were required for the conductor to communicate the updated set of images. So the conductor would do:

// <updates m_conductorImageList>
delete m_latestImagesList.exchange(new std::vector<Image>(m_conductorImageList));

and the subscriber does

if (m_latestImagesList.load(std::memory_order_relaxed)) // rarely true
{
    std::unique_ptr<std::vector<Image>> latest(m_latestImagesList.exchange(nullptr));
    m_subscriberImagesList = *latest;
}
<uses m_subscriberImagesList>
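
For concreteness, here is a minimal self-contained sketch of that handoff; the two member names are the ones suggested above, everything else is illustrative:

#include <atomic>
#include <memory>
#include <vector>

struct Image { }; // stand-in for the real Image

class Subscription
{
public:
    // Conductor thread: publish a new snapshot, deleting any previous snapshot
    // the subscriber never picked up (exchange returns it, or nullptr).
    void updateImages(const std::vector<Image>& conductorImageList)
    {
        delete m_latestImagesList.exchange(new std::vector<Image>(conductorImageList));
    }

    // Subscriber thread: adopt a pending snapshot, then poll over a list that
    // only this thread ever touches.
    void poll()
    {
        if (m_latestImagesList.load(std::memory_order_relaxed)) // rarely true
        {
            std::unique_ptr<std::vector<Image>> latest(m_latestImagesList.exchange(nullptr));
            m_subscriberImagesList = *latest;
        }
        // <uses m_subscriberImagesList>
    }

private:
    std::atomic<std::vector<Image>*> m_latestImagesList{nullptr};
    std::vector<Image> m_subscriberImagesList;
};

The subscriber only ever dereferences the pointer returned by its exchange, never one it merely loaded, so the conductor deleting an unconsumed list cannot race with a reader.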

Of course, any change that fixes the bug is good for me!

Thanks!

tmontgomery (Contributor) commented

The case of a removeImage occurring while a poll is preempted between its two loads is possible.

Testing with std::vector did not come close to the needed performance on the iteration side compared to a raw array. So, a raw array does much better in this case.

A couple of potential fixes here, though:
1. Encapsulate the array into a dynamic array type. This was done originally and dropped because some optimizations were missed. It is well past time to revisit.
2. Re-read the array length after grabbing the array to make sure it has not changed. This might prevent some migration and hinder optimizations.
3. Add a change-number scheme such as the one used for the driver conductor/receiver handoff for SMs and NAKs (see the sketch after this list). This might also deal well with lingering the array and buffers.

All three need to be tried and looked at.
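
For the third option, a rough sketch of the two-counter change-number idea; the names and structure are illustrative, not the driver's actual code:

#include <atomic>
#include <cstddef>

struct Image { int poll() { return 0; } }; // stand-in for the real Image

std::atomic<long> changeBegin{0};          // incremented before a change starts
std::atomic<long> changeEnd{0};            // set to the same number once it completes
std::atomic<Image*> imagesArray{nullptr};
std::atomic<std::size_t> imagesLength{0};

// Conductor thread: mark the change in progress, swap in the new state, then
// mark it complete. The retired array must still be lingered, since a poller
// may briefly hold a reference to it.
void updateImages(Image* newArray, std::size_t newLength)
{
    const long changeNumber = changeBegin.load(std::memory_order_relaxed) + 1;
    changeBegin.store(changeNumber, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    imagesArray.store(newArray, std::memory_order_relaxed);
    imagesLength.store(newLength, std::memory_order_relaxed);
    changeEnd.store(changeNumber, std::memory_order_release);
}

// Polling thread: retry until no change began or completed while the pair was
// being read, so the observed array and length always belong together.
int pollImages()
{
    Image* array;
    std::size_t length;
    long changeNumber;

    do
    {
        changeNumber = changeEnd.load(std::memory_order_acquire);
        array = imagesArray.load(std::memory_order_relaxed);
        length = imagesLength.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
    }
    while (changeNumber != changeBegin.load(std::memory_order_relaxed));

    int fragments = 0;
    for (std::size_t i = 0; i < length; i++)
    {
        fragments += array[i].poll();
    }
    return fragments;
}

A poller that loses the race simply retries until the writer has finished, so the common case stays a pair of uncontended loads.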

tmontgomery self-assigned this Jul 31, 2017
goglusid commented Jul 31, 2017

@tmontgomery What about the following? You still get the performance of using an array, but you need only one atomic load, so there is no window for the interleaving above.

struct ImageArray
{
    size_t imageCount;
    Image* images[1];
};

You could also allocate the count and the array as one contiguous memory block, to avoid having an array of one image when the count is zero. It is the same idea anyway.

ImageArray* images = std::atomic_load(&m_images);

mjpt777 (Contributor) commented Jul 31, 2017

@tmontgomery Re-reading the array length could suffer from the ABA problem. The better solution is a change number applied to two counters, written before and after the change completes. We use this effectively in the PublicationImage for NAK signalling, which is your third suggestion.

@goglusid How do you ensure the lifetime of accesses to the pointed-to struct is appropriate?

goglusid commented Jul 31, 2017

@mjpt777 Well, in the current design you use a timeout to deal with the lifetime of these image arrays and logbuffers, so my proposed change still works with lingering... even if relying on timeouts is risky. Fair enough, though: a high enough timeout would be safe in practice.

That being said, in my super-duper version of Aeron ;) I implemented the following:

In the ClientConductor:

  • For every resource being lingered, I increment a long long m_lingeringPosition and take a snapshot of this position along with the actual resource pointer.
  • Every RESOURCE_TIMEOUT_MS I update an atomic m_conductorLingeringPosition like this:

m_conductorLingeringPosition.store(m_lingeringPosition, std::memory_order_release);

  • Once in a while the application calls a new function on the Aeron class, let's say Aeron::consumeLingerResources().

inline void ClientConductor::consumeLingerResources()
{
    const long long clientLingeringPosition(m_clientLingeringPosition.load(std::memory_order_acquire));
    const long long conductorLingeringPosition(m_conductorLingeringPosition.load(std::memory_order_acquire));

    if (clientLingeringPosition < conductorLingeringPosition)
    {
        m_clientLingeringPosition.store(conductorLingeringPosition, std::memory_order_release);
    }
}
  • The ClientConductor is then safe to delete lingering resources once m_clientLingeringPosition >= m_conductorLingeringPosition; a sketch of the reclaim follows below.

Note that this approach only works if all subscribers and ClientConductor::consumeLingerResources are called from a single thread.
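
For concreteness, the conductor-side reclaim could look like the following; the LingeredResource record and member names are illustrative:

#include <atomic>
#include <deque>
#include <memory>

struct Resource { /* image array, logbuffers, ... */ };

struct LingeredResource
{
    long long lingerPosition;            // snapshot of m_lingeringPosition at retirement
    std::unique_ptr<Resource> resource;
};

std::deque<LingeredResource> m_lingeredResources; // conductor-only, ordered by position
std::atomic<long long> m_clientLingeringPosition{0};

// Conductor thread: anything retired at or before the position the client has
// already consumed can no longer be referenced by the application, so delete it.
void reclaimLingeredResources()
{
    const long long clientPosition = m_clientLingeringPosition.load(std::memory_order_acquire);

    while (!m_lingeredResources.empty()
        && m_lingeredResources.front().lingerPosition <= clientPosition)
    {
        m_lingeredResources.pop_front(); // unique_ptr deletes the resource
    }
}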

I'll take a look at the PublicationImage's NAK signalling.

tmontgomery (Contributor) commented

@mjpt777 yes, re-reading the length suffers from ABA. I realized it late last night while thinking on it. Encapsulating the length and array into a dynamic array type (or @goglusid's struct) is also appealing if it lets the optimizations work themselves out. The lifetime is easy since we already linger the array itself anyway. But that can be cleaned up as well.

@goglusid one of the items we want to do is to refcnt the logbuffers. So, that complicates the lingering slightly as then logbuffers can be polled from multiple threads.

goglusid commented Aug 2, 2017

@tmontgomery In order to avoid separately allocating the ImageList and the actual Image array, how about the following?

#include <cstddef>
#include <cstdint>
#include <new>

class ImageList
{
public:
    // Allocate the header and the Image slots as one contiguous block, then
    // construct the Images in place.
    static ImageList* create(size_t length)
    {
        ImageList* imageList = new (new uint8_t[sizeof(ImageList) + (sizeof(Image) * length)]) ImageList(length);

        for (size_t i = 0; i < length; i++)
        {
            new (&(imageList->images()[i])) Image();
        }

        return imageList;
    }

    // Destroy the Images in place, then free the whole block.
    void deleteIt()
    {
        for (size_t i = 0; i < m_length; i++)
        {
            images()[i].~Image();
        }

        delete[] reinterpret_cast<uint8_t*>(this);
    }

    inline Image* images()
    {
        return reinterpret_cast<Image*>(reinterpret_cast<uint8_t*>(this) + sizeof(ImageList));
    }

    const size_t m_length;

private:
    ImageList(size_t length) : m_length(length)
    {
    }

    ~ImageList() {}
};
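
A hypothetical usage sketch: the conductor publishes a new list with a single atomic exchange and lingers the old one. The m_imageList member, newLength, and lingerResource call are illustrative, not the client's actual API:

// Conductor thread
ImageList* newList = ImageList::create(newLength);
// ... populate newList->images()[0 .. newLength - 1] ...
ImageList* oldList = m_imageList.exchange(newList); // m_imageList is std::atomic<ImageList*>
lingerResource(oldList);                            // later calls oldList->deleteIt()

// Polling thread: one load yields a consistent length and array
ImageList* list = m_imageList.load(std::memory_order_acquire);
for (size_t i = 0; i < list->m_length; i++)
{
    // poll list->images()[i]
}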

tmontgomery (Contributor) commented

@goglusid we could do this. Actually, I have a different but similar option in mind. We intend to make more changes soon, though. One is refcounting the logbuffers so that the footprint of the mappings can be lower. Another is a change-number technique, similar to what you proposed, for lingering resources. So, this is a temporary change until after the next release, when we make those further changes.

mjpt777 (Contributor) commented Aug 10, 2019

This was addressed some time ago, with the latest solution applied in commit 8031771.

mjpt777 closed this as completed Aug 10, 2019