.pop_front() causing memory corruption (macOS) #57

aldanor · 2018-12-03T00:43:26Z

This took me a long while to figure out, but I narrowed it down to just a single line:

println!("{:?}", queue);
queue.pop_front();
println!("{:?}", queue);

The elements are simple Clone/Copy structs; the first println outputs the queue in its normal state, whereas in the second all values are like 4294999990 or 123145302343606. This only happens in a quickcheck-like test suite where the queue is used super intensively, and pushes/pop are done thousands of times (it constantly fails at the same spot though).

I could try running it through valgrind if it helps, not sure how else I could help. Is this a known issue perhaps?

gnzlbg · 2018-12-03T16:50:02Z

Hi @aldanor thank you for the report and sorry for the slow reply (i'm travelling). Could you share a full snippet that reproduces the issue ? It's ok if it depends on quicktest or one of your libraries, I can use that as a starting point to investigate further.

Also it would be nice to know whether this happens with the latest version of the library or some older version (could you try with master ?). IIUC, this happens on MacOSX right ? Could you try enabling the unix_sysv cargo feature and see if you also reproduce it with that ?

aldanor · 2018-12-03T20:46:15Z

@gnzlbg I've managed to isolate this. Here's an example that always fails:

    #[test]
    fn test_deque() {
        #[derive(Clone, Copy, Debug, PartialEq)]
        pub struct Foo {
            a: i64,
            b: Option<(bool, i64)>,
        }

        use slice_deque::SliceDeque;
        use rand::{StdRng, SeedableRng, distributions::Uniform};

        let mut rng: StdRng = SeedableRng::seed_from_u64(0);
        let mut deque = SliceDeque::new();

        loop {
            let n = rng.sample(Uniform::new_inclusive(0, 1000));
            for i in 0..n {
                deque.push_front(Foo { a: 42, b: None });
            }
            let n = rng.sample(Uniform::new_inclusive(1, deque.len()));
            for i in 0..n {
                assert_eq!(deque.pop_front(), Some(Foo { a: 42, b: None }));
                if !deque.is_empty() {
                    // this assertion fails (becomes corrupt after pop_front())
                    assert_eq!(unsafe { *deque.get_unchecked(deque.len() - 1) },
                               Foo { a: 42, b: None });
                }
            }
        }
    }

fails like so:

thread 'test_deque' panicked at 'assertion failed: `(left == right)`
  left: `Foo { a: 1048576, b: Some((true, 2)) }`,
 right: `Foo { a: 42, b: None }`', test.rs:251:21

Note that the same example if you replace a struct with int works just fine. Could it have anything to do with alignment?

(Haven't tried master/unix_sysv yet, will do next.)

aldanor · 2018-12-03T20:52:42Z

@gnzlbg Just checked the master branch and unix_sysv enabled, same outcome.

gnzlbg · 2018-12-05T10:39:06Z

This is indeed a bug somewhere in SliceDeque, minimal working example:

const C: [i16; 3] = [42; 3];

let mut deque = SliceDeque::new();
for _ in 0..918 {
    deque.push_front(C);
}

for _ in 0..237 {
    assert_eq!(deque.pop_front(), Some(C));
    assert!(!deque.is_empty());
    assert_eq!(*deque.back().unwrap(), C); // fails B != C
}

aldanor · 2018-12-05T10:43:03Z

Indeed, your example is even more minimal.

Does this only occur on macOS?

gnzlbg · 2018-12-05T10:44:24Z

Does this only occur on macOS?

I have only tested this on macOS, working on a fix. I suspect the bug is platform independent, but I can't say for sure yet.

aldanor · 2018-12-05T11:17:25Z

Just checked on 64-bit Linux:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `Some([0, 42, 42])`,
 right: `Some([42, 42, 42])`', src/main.rs:14:9

aldanor · 2018-12-05T11:19:54Z

Another thing to note re: your example, if you replace 3 with any power of two (2/4/8), everything works. If you use an odd number though, like 3/5/7, it fails, deterministically. So something to do with alignment.

gnzlbg · 2018-12-05T11:20:00Z

It appears that this was caused by a "by 1 off"-error: 70c87a4#diff-b4aea3e418ccdb71239b96952d9cddb6R751

gnzlbg · 2018-12-05T11:20:44Z

Could you test if that branch solves the problem for you?

aldanor · 2018-12-05T11:25:11Z

It appears that this was caused by a "by 1 off"-error

That's usually the nastiest type of errors

Could you test if that branch solves the problem for you?

I've checked the branch on 64-bit Linux, seems to work fine so far, I think that fixed it.

gnzlbg · 2018-12-05T11:43:57Z

I thought about alignment too at first, but at the end the problem had nothing to do with that.

When the SliceDeque had all elements in the second mirrored region except for one element in the first one, a pop_front would remove that element, such that all elements are in the second mirrored region.

For simplicity, the whole implementation assumes that if all elements fit in a single mirrored region, they are always in the first one.

The job of the code with the bug is to make sure that this is the case. It worked in many cases, but it was missing all cases in which the elements of the second memory region lied exactly on the boundary between both regions.

This introduced a memory error that allows safe Rust code to read uninitialized memory if the deque was put in the state described above.

gnzlbg · 2018-12-05T12:13:55Z

Version 0.1.16 has been released with the fix. Sorry that it took so long to get to the bottom of this, i was travelling for the last couple of days.

aldanor · 2018-12-05T14:53:52Z

@gnzlbg Thanks a lot for the quick fix! Will have to switch from vecdeque back to slicedeque... again :)

gnzlbg · 2018-12-05T14:57:43Z

Will have to switch from vecdeque back to slicedeque... again :)

I've tried to make that as painless as possible by providing VecDeque API compatibility. If this isn't as simple as a use slice_deque::SliceDeque as VecDeque; please open an issue.

SliceDeque isn't always better than VecDeque, and I think that switching between them should be as painless as possible to allow users to measure and choose what's best for them.

Shnatsel · 2018-12-06T14:08:55Z

Reads from uninitialized memory can be exploited to obtain secret data, bypass exploit mitigations or even execute arbitrary code.

Please add this issue to the Rust security advisory database so that anyone depending on the crate has a way to check whether they depend on a vulnerable version.

See <gnzlbg/slice_deque#57>

zimond · 2018-12-15T14:57:14Z

Seems still hitting this problem even using 0.1.16. I cannot say it's the same bug, but the behavior is quite alike. At some time memory of the items in the deque is corrupted. If I turn on unix_sysvfeature, it gives an other (not out of memory) panic.

aldanor · 2018-12-15T15:29:43Z

@zimond Could you try isolating a minimal example?

aldanor · 2018-12-15T15:30:37Z

Seems like #59 would be helpful to have after all.

gnzlbg · 2018-12-15T16:25:26Z

@zimond do the tests pass on your system? which system are you on?

zimond · 2018-12-16T08:18:52Z

I'm on a 17 macbook pro. The tests pass. It's hard to isolate the problem as I run into this in a quite complex system i'm building. I created roughly 1000x SliceDeque and keeps .pop_front() and .remove(index) and .insert(index) between the dequeues, at generally the very same timepoint on each run, the memory corrupts. The items I used in the deque is a plain struct containing some vec attributes.

zimond · 2018-12-16T08:19:50Z

By the way I can confirm this is related to SliceDeque, as I just switched my code back to std VecDeque and the problem disappears.

gnzlbg · 2018-12-19T09:50:57Z

@zimond I haven't been able to reproduce this yet on macos x. It would be extremely helpful if you could come up with a minimal (or not so minimal, e.g. point me to a github repo where this fails) working example.

If your code is not available online, maybe you could provide a version where everything that doesn't have much to do with the deques is stripped out, at least as long as its reasonable to do so.

In the meantime, i'm going to start fuzzing the library and see if that finds it.

zimond · 2018-12-19T13:00:03Z

Ok I'll try... I will update in this thread once I get something

gnzlbg · 2018-12-19T13:01:26Z

Thanks! In the mean time I've set up fuzzing (see #59 (comment)) but it hasn't found any issues yet :/

zimond · 2018-12-19T14:08:07Z

Have you checked .as_mut_slice() and then mutating items? Or does it have the possibility to break memory ?

gnzlbg · 2018-12-19T14:16:47Z

So as_mut_slice is pretty trivial for SliceDeque, it cannot really corrupt anything.

Until now, the majority of bugs have been due to failure to update the head and tail of the deque properly. I don't see anything wrong with that code, and there are a lot of debug_assert!s enabled in debug mode to check that these work properly, but that does not mean that these are correct.

I didn't ask about this before, but I suppose that you are able to reproduce the memory corruption in debug builds with debug assertions turned on, and you do not get a panic coming up from any of the asserts, right?

zimond · 2018-12-19T15:36:47Z

Just tried debug build and you are right, the assertions did not catch this. It's so hard to reproduce this and I checked my code once again, I created a custom Iterator based on the slice returned by as_mut_slice (internally maintaining a usize index). The project always crashes when iterating the iterator.

zimond · 2018-12-21T06:22:03Z

I managed to narrow the bug down to pop_front(). In LLDB, I found a crash that before pop_front() call, a SliceDeque has two items. After the pop, it has one item. But:

the item id do not match either the remaining item or the removed item.
The vec field in the item is corrupted.

So I think maybe in certain situation, pop_front will move the pointer to some invalid memory, maybe?

(note: after pop, head = 0, tail = 1)

gnzlbg · 2018-12-21T08:34:59Z

So I think maybe in certain situation, pop_front will move the pointer to some invalid memory, maybe?

This is basically what was happening in this bug. That's very suspicious .

@zimond is your project in github ? or could you send me a reduced test per email if it isn't ?

zimond · 2018-12-21T09:18:53Z

Not on github. It's really hard to create a reduced test on this. So I just updated several replies here. I will try again this weekend. Sorry for the long wait.

gnzlbg · 2018-12-21T09:54:00Z

Don't be sorry, I am! I want to fix this, but without a program to reproduce it I really can't :/

gnzlbg · 2019-01-15T11:37:57Z

I haven't forgotten about this, a reproducer would still be appreciated.

zimond · 2019-01-16T03:59:50Z

@gnzlbg Hey I tried several times but it's still hard to reproduce. But today I reviewed the code and decided to replace .pop_front() with .remove(0), and it stopped crashing. Hope this helps, anyway.

Rafferty97 · 2019-01-22T16:54:53Z

I'm also encounterring a memory corruption issue when using SliceDeque. It always happens after a call to .push_front(), after which the program immediately crashes due to reading uninitialised memory. Unfortunately, my code is also quite complex like in @zimond's case, but if I can produce a simplified test case I will post it here.

gnzlbg · 2019-01-22T17:10:10Z

I've implemented an optimization on master that might break the ".remove(0) workaround", so we really do need to get to the bottom of this. At this point, any kind of working example, no matter how complex, would help.

whmountains · 2019-04-18T17:17:11Z

I think I have a working example for you. In my latest project using slice_deque I am getting a segmentation fault within 60 seconds of startup. Refactoring to use vecdeque instead of slice_deque makes the problem go away. When I run git stash to revert back to slice deque it comes back. I went back and forth to make sure it was stable.

Yesterday instead of a segmentation fault I was getting a weird error about "overflow while adding Duration". I'm guessing the VecDeque was still corrupting memory, but it belonged to the application so the OS didn't notice.

You can find the code here: https://gitlab.com/szaver/mate3/tree/master. I would recommend testing rev fc639cbca1df96dc7d2d61d0e22bf7d58425e07f to reproduce the problem and df5598b4c64ec6800a53f3b4daad1bf3b29dc34e if you want a fixed version.

NOTE: this is a separate project from the one I mentioned in #64. This time I can reproduce the crash on my MacBook.

whmountains · 2019-04-18T17:51:43Z

I found an even simpler reproduction! 🎉 Published it here: https://gitlab.com/szaver/slicedeque-crash

use slice_deque::SliceDeque;

fn main() {
    let mut deque = SliceDeque::new();

    loop {
        deque.push_back(String::from("test"));
        if deque.len() == 8 {
            deque.pop_front();
        }
    }
}

On my computer it will crash within milliseconds of running with the following error:

mate3(82392,0x1177be5c0) malloc: *** error for object 0x4: pointer being freed was not allocated
mate3(82392,0x1177be5c0) malloc: *** set a breakpoint in malloc_error_break to debug
fish: 'cargo run' terminated by signal SIGABRT (Abort)

I hope that helps!

EDIT: It even reproduces inside GitLab CI. See here: https://gitlab.com/szaver/slicedeque-crash/-/jobs/199158446. I think it's safe to say that this is not a MacOS specific bug.

gnzlbg · 2019-04-18T22:27:59Z

Wow thank you so much, I work on MacOS, will dig into this tomorrow!!

zimond · 2019-04-19T02:50:54Z

wow I'm so curious ! Waiting for the answer now 😂

… indices. This commit refactors the implementation use an internal slice instead of indices, which significantly simplifies the implementation and closes #57. The problem with #57 was that using a pair of indices to keep track of where the head and the tail of the slice are located only works correctly for `mem::size_of::<T>() % allocation_granularity() == 0` because in that case, there is a unique map from indices to elements in memory. consider a T that's 3 bytes wide, for which `mem::size_of::<T>() % allocation_granularity() != 0`, then we can have: ``` // first region mirrored region // [T02, T10, T11, T12, -, -, T00, T01] | [T02, T10, T11, T12, -, -, T00, T01] ``` such that poping the first element leaves ``` // first region mirrored region // [-, T10, T11, T12, -, -, -, -] | [-, T10, T11, T12, -, -, -, -] ``` An integer indexing scheme in multiples of `size_of::<T>()` that starts at the beginning of the first memory region cannot cope with this (e.g. we'd need to say that the head of the slice starts at index 0.5). One way to work around that would be to use indices to bytes instead (or equivalently, pointers). This PR does that, by changing the layout from a pair of indices to Ts, to a pointer to the first T, and a length (that is, a slice).

gnzlbg · 2019-05-03T17:29:25Z

I'm sorry that it took a while, but #66 should fix this for good.

The problem of the previous implementation is described in 8074e61

Basically, the previous implementation only always worked properly if size_of::<T>() % allocation_granularity() == 0 (alloc granularity is more or less the page size). If that was not the case, then depending on how the deque was being used, it might ran into this issue or not (there used to be code in the internal Buffer<T> type to detect this and panic with an unimplemented!, but for some reason that was removed at some point).

In a nutshell, before, a pair of indices was used to track the head and the tail of the slice within the ring buffer, e.g., suppose that the allocation granularity is 8, the size of T is 2, - is an unused byte, and Tij means the j-th byte of the i-th element in the deque, then we can end up with a layout like this:

// memory layout:
//          physical memory                     mirrored memory
// [-, -, -, -, T00, T01, T10, T11] |  [-, -, -, -, T00, T01, T10, T11] 
//    0    1       2         3

where 0, 1, 2, 3 are indices that can be used to refer to the memory of an element within the ring buffer. If size_of::<T>() % alloc_granularity() == 0, the indexing scheme based on 0 + i * size_of::<T>() always works.

Now consider what happens if the above does not hold, and we put 3-byte wide Ts in that memory:

// memory layout:
//          physical memory                     mirrored memory
// [T00, T01, T02, T10, T11, T12, -, -] | [T00, T01, T02, T10, T11, T12, -, -] 
//         0               1

Note that only 2 3-byte wide elements fit in the 8 bytes that we have, so we have 2 1-byte holes at the end of the physical memory region. The problem is that this indexing scheme is only valid as long as we don't wrap around the deque. For example, suppose that we pop-front T0, and push_back T2, then we get:

// memory layout:
//          physical memory                     mirrored memory
// [T22, -, -, T10, T11, T12, T20, T21] | [T22, -, -, T10, T11, T12, T20, T21] 
//         0               1

The index 1 still indexes T1 properly, and if we were to index T2 with index 2, that would also work due to how memory is mirrored, but the issue is that we can't index T2 with index 0. This case also used to work, because before we did not tried to access T2 with index 0, but if we break it a little bit more, then everything breaks. For example, let's pop front T1, push back T3, and pop front T2. Then we get:

// memory layout:
//          physical memory                     mirrored memory
// [-, T30, T31, T32, -, -, -, -] | [-, T30, T31, T32, -, -, -, -]
//         0               1

Now there is no real way to access T3 via or i*size_of::<T>() indexing scheme. When this happened, the old implementation wrapped the indices around, and incorrectly stated that T3 was at index 0, instead of index 0.3 or something. So loading it would essentially transmute (-, T30, T31) as a T, resulting in garbage and undefined behavior.

This is basically what the example provided by @whmountains was doing. It was pushing a String into a deque, which is 3 pointers wide, or 1 pointer and 2 usizes - doesn't matter. What matters is that when the allocation granularity is 4096 bytes, String is 24 bytes on 64-bit architectures, and then size_of::<T>() % alloc_granularity() != 0 and the assumption does not hold. What the example does, is push some strings into the deque, and then shift them by adding one at the end via push back and removing one via pop front till the issue above is triggered after ~4096/3 + 1 times. The commit adds another test ("non_fitting") that detects this memory corruption without having to try to incorrectly free a string.

To fix this, one could keep the i * size_of::<T>() based indexing, but then one needs to make sure that when things wrap around the problems above don't happen, e.g., by moving the memory appropriately.

Instead of doing that, I decided to switch to a "byte" based indexing, where the indices point to bytes in memory, and not to Ts. The simplest way of doing that was to just, instead of storing a pair of indices for the head or tail, just store a *mut [T], which is a pointer to the first T, and a length. This is something I wanted to do anyways, since I thought it might remove a lot of brittle code that handled updating the indices, and that's what it did. Now the implementation only has a single tricky function (move_head_unchecked), but everything else became simpler, particularly common operations like using the slice methods internally, because now the slice to the elements is always available.

So that's basically why it took me so long to fix this. I allocated half an hour to look into this, to discover that it would take me a bit more to fully understand the issue. Then a couple of days later I looked again 1 hour at this, and found the issue, and toyed with different ways to solve this. And then I decided that the refactor was the right thing to do, but I couldn't allocate 2 hours for this till today =/

gnzlbg · 2019-05-03T17:38:07Z

Uh, I forgot the most important part of the post.

Please try the PR branch, and see if that fixes your issues, and report here with the results!

gnzlbg · 2019-05-07T13:21:58Z

slice-deque 0.2 has been released with this hopefully fixed for everybody, thank you all involved!

Shnatsel · 2019-05-07T17:17:15Z

Since this looks like a memory corruption and could be a security issue, could you file an advisory at https://github.com/RustSec/advisory-db so that anyone interested could check their dependencies for versions with this bug?

Also, if the fixed version is 0.2, does that mean that there's no semver-compatible version with the fix?

gnzlbg · 2019-05-07T17:31:14Z

Also, if the fixed version is 0.2, does that mean that there's no semver-compatible version with the fix?

Yes. The fix required a semver API-breaking change to SliceDeque::from_raw_parts. This API is public, but rarely used (not part of Vec or VecDeque), so migration should not be hard.

could you file an advisory

Done: rustsec/advisory-db#95

zimond · 2019-05-08T02:40:43Z

Thanks to all the hard work. I'll try 0.2 soon and feedback

gnzlbg added the bug Something isn't working label Dec 3, 2018

gnzlbg mentioned this issue Dec 5, 2018

Fix issue 57 #58

Merged

gnzlbg closed this as completed in #58 Dec 5, 2018

aldanor mentioned this issue Dec 5, 2018

Fuzz the API #59

Open

azriel91 added a commit to azriel91/minimp3-rs that referenced this issue Dec 6, 2018

Bumped slice-deque version to avoid security issue.

dc2af3a

See <gnzlbg/slice_deque#57>

azriel91 mentioned this issue Dec 6, 2018

Bumped slice-deque version to avoid security issue. germangb/minimp3-rs#5

Merged

germangb mentioned this issue Dec 7, 2018

Bump minimp3 version RustAudio/rodio#213

Merged

gnzlbg reopened this Jan 15, 2019

gnzlbg mentioned this issue Apr 16, 2019

Possible bug on ARM #64

Closed

This was referenced Apr 19, 2019

Different API with VecDeque to get the first and last elements. #65

Closed

Possible bug in ARM. SiegeLord/RustGnuplot#42

Closed

gnzlbg closed this as completed in 365a7ea May 7, 2019

.pop_front() causing memory corruption (macOS) #57

.pop_front() causing memory corruption (macOS) #57

Comments

aldanor commented Dec 3, 2018

gnzlbg commented Dec 3, 2018

aldanor commented Dec 3, 2018 • edited Loading

aldanor commented Dec 3, 2018

gnzlbg commented Dec 5, 2018 • edited Loading

aldanor commented Dec 5, 2018

gnzlbg commented Dec 5, 2018

aldanor commented Dec 5, 2018

aldanor commented Dec 5, 2018

gnzlbg commented Dec 5, 2018

gnzlbg commented Dec 5, 2018

aldanor commented Dec 5, 2018

gnzlbg commented Dec 5, 2018

gnzlbg commented Dec 5, 2018

aldanor commented Dec 5, 2018

gnzlbg commented Dec 5, 2018

Shnatsel commented Dec 6, 2018

zimond commented Dec 15, 2018

aldanor commented Dec 15, 2018

aldanor commented Dec 15, 2018

gnzlbg commented Dec 15, 2018

zimond commented Dec 16, 2018

zimond commented Dec 16, 2018

gnzlbg commented Dec 19, 2018

zimond commented Dec 19, 2018

gnzlbg commented Dec 19, 2018

zimond commented Dec 19, 2018

gnzlbg commented Dec 19, 2018

zimond commented Dec 19, 2018 • edited Loading

zimond commented Dec 21, 2018 • edited Loading

gnzlbg commented Dec 21, 2018

zimond commented Dec 21, 2018

gnzlbg commented Dec 21, 2018

gnzlbg commented Jan 15, 2019

zimond commented Jan 16, 2019

Rafferty97 commented Jan 22, 2019

gnzlbg commented Jan 22, 2019

whmountains commented Apr 18, 2019 • edited Loading

whmountains commented Apr 18, 2019 • edited Loading

gnzlbg commented Apr 18, 2019

zimond commented Apr 19, 2019

gnzlbg commented May 3, 2019 • edited Loading

gnzlbg commented May 3, 2019

gnzlbg commented May 7, 2019

Shnatsel commented May 7, 2019

gnzlbg commented May 7, 2019 • edited Loading

zimond commented May 8, 2019

aldanor commented Dec 3, 2018 •

edited

Loading

gnzlbg commented Dec 5, 2018 •

edited

Loading

zimond commented Dec 19, 2018 •

edited

Loading

zimond commented Dec 21, 2018 •

edited

Loading

whmountains commented Apr 18, 2019 •

edited

Loading

whmountains commented Apr 18, 2019 •

edited

Loading

gnzlbg commented May 3, 2019 •

edited

Loading

gnzlbg commented May 7, 2019 •

edited

Loading