Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

.pop_front() causing memory corruption (macOS) #57

Closed
aldanor opened this issue Dec 3, 2018 · 46 comments
Closed

.pop_front() causing memory corruption (macOS) #57

aldanor opened this issue Dec 3, 2018 · 46 comments
Labels
bug Something isn't working

Comments

@aldanor
Copy link

aldanor commented Dec 3, 2018

This took me a long while to figure out, but I narrowed it down to just a single line:

println!("{:?}", queue);
queue.pop_front();
println!("{:?}", queue);

The elements are simple Clone/Copy structs; the first println outputs the queue in its normal state, whereas in the second all values are like 4294999990 or 123145302343606. This only happens in a quickcheck-like test suite where the queue is used super intensively, and pushes/pop are done thousands of times (it constantly fails at the same spot though).

I could try running it through valgrind if it helps, not sure how else I could help. Is this a known issue perhaps?

@gnzlbg
Copy link
Owner

gnzlbg commented Dec 3, 2018

Hi @aldanor thank you for the report and sorry for the slow reply (i'm travelling). Could you share a full snippet that reproduces the issue ? It's ok if it depends on quicktest or one of your libraries, I can use that as a starting point to investigate further.

Also it would be nice to know whether this happens with the latest version of the library or some older version (could you try with master ?). IIUC, this happens on MacOSX right ? Could you try enabling the unix_sysv cargo feature and see if you also reproduce it with that ?

@gnzlbg gnzlbg added the bug Something isn't working label Dec 3, 2018
@aldanor
Copy link
Author

aldanor commented Dec 3, 2018

@gnzlbg I've managed to isolate this. Here's an example that always fails:

    #[test]
    fn test_deque() {
        #[derive(Clone, Copy, Debug, PartialEq)]
        pub struct Foo {
            a: i64,
            b: Option<(bool, i64)>,
        }

        use slice_deque::SliceDeque;
        use rand::{StdRng, SeedableRng, distributions::Uniform};

        let mut rng: StdRng = SeedableRng::seed_from_u64(0);
        let mut deque = SliceDeque::new();

        loop {
            let n = rng.sample(Uniform::new_inclusive(0, 1000));
            for i in 0..n {
                deque.push_front(Foo { a: 42, b: None });
            }
            let n = rng.sample(Uniform::new_inclusive(1, deque.len()));
            for i in 0..n {
                assert_eq!(deque.pop_front(), Some(Foo { a: 42, b: None }));
                if !deque.is_empty() {
                    // this assertion fails (becomes corrupt after pop_front())
                    assert_eq!(unsafe { *deque.get_unchecked(deque.len() - 1) },
                               Foo { a: 42, b: None });
                }
            }
        }
    }

fails like so:

thread 'test_deque' panicked at 'assertion failed: `(left == right)`
  left: `Foo { a: 1048576, b: Some((true, 2)) }`,
 right: `Foo { a: 42, b: None }`', test.rs:251:21

Note that the same example if you replace a struct with int works just fine. Could it have anything to do with alignment?

(Haven't tried master/unix_sysv yet, will do next.)

@aldanor
Copy link
Author

aldanor commented Dec 3, 2018

@gnzlbg Just checked the master branch and unix_sysv enabled, same outcome.

@gnzlbg
Copy link
Owner

gnzlbg commented Dec 5, 2018

This is indeed a bug somewhere in SliceDeque, minimal working example:

const C: [i16; 3] = [42; 3];

let mut deque = SliceDeque::new();
for _ in 0..918 {
    deque.push_front(C);
}

for _ in 0..237 {
    assert_eq!(deque.pop_front(), Some(C));
    assert!(!deque.is_empty());
    assert_eq!(*deque.back().unwrap(), C); // fails B != C
}

@aldanor
Copy link
Author

aldanor commented Dec 5, 2018

Indeed, your example is even more minimal.

Does this only occur on macOS?

@gnzlbg
Copy link
Owner

gnzlbg commented Dec 5, 2018

Does this only occur on macOS?

I have only tested this on macOS, working on a fix. I suspect the bug is platform independent, but I can't say for sure yet.

@aldanor
Copy link
Author

aldanor commented Dec 5, 2018

Just checked on 64-bit Linux:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `Some([0, 42, 42])`,
 right: `Some([42, 42, 42])`', src/main.rs:14:9

@gnzlbg gnzlbg mentioned this issue Dec 5, 2018
@aldanor
Copy link
Author

aldanor commented Dec 5, 2018

Another thing to note re: your example, if you replace 3 with any power of two (2/4/8), everything works. If you use an odd number though, like 3/5/7, it fails, deterministically. So something to do with alignment.

@gnzlbg
Copy link
Owner

gnzlbg commented Dec 5, 2018

It appears that this was caused by a "by 1 off"-error: 70c87a4#diff-b4aea3e418ccdb71239b96952d9cddb6R751

@gnzlbg
Copy link
Owner

gnzlbg commented Dec 5, 2018

Could you test if that branch solves the problem for you?

@aldanor
Copy link
Author

aldanor commented Dec 5, 2018

It appears that this was caused by a "by 1 off"-error

That's usually the nastiest type of errors :trollface:

Could you test if that branch solves the problem for you?

I've checked the branch on 64-bit Linux, seems to work fine so far, I think that fixed it.

@gnzlbg
Copy link
Owner

gnzlbg commented Dec 5, 2018

I thought about alignment too at first, but at the end the problem had nothing to do with that.

When the SliceDeque had all elements in the second mirrored region except for one element in the first one, a pop_front would remove that element, such that all elements are in the second mirrored region.

For simplicity, the whole implementation assumes that if all elements fit in a single mirrored region, they are always in the first one.

The job of the code with the bug is to make sure that this is the case. It worked in many cases, but it was missing all cases in which the elements of the second memory region lied exactly on the boundary between both regions.

This introduced a memory error that allows safe Rust code to read uninitialized memory if the deque was put in the state described above.

@gnzlbg gnzlbg closed this as completed in #58 Dec 5, 2018
@gnzlbg
Copy link
Owner

gnzlbg commented Dec 5, 2018

Version 0.1.16 has been released with the fix. Sorry that it took so long to get to the bottom of this, i was travelling for the last couple of days.

@aldanor
Copy link
Author

aldanor commented Dec 5, 2018

@gnzlbg Thanks a lot for the quick fix! Will have to switch from vecdeque back to slicedeque... again :)

@aldanor aldanor mentioned this issue Dec 5, 2018
@gnzlbg
Copy link
Owner

gnzlbg commented Dec 5, 2018

Will have to switch from vecdeque back to slicedeque... again :)

I've tried to make that as painless as possible by providing VecDeque API compatibility. If this isn't as simple as a use slice_deque::SliceDeque as VecDeque; please open an issue.

SliceDeque isn't always better than VecDeque, and I think that switching between them should be as painless as possible to allow users to measure and choose what's best for them.

@Shnatsel
Copy link

Shnatsel commented Dec 6, 2018

Reads from uninitialized memory can be exploited to obtain secret data, bypass exploit mitigations or even execute arbitrary code.

Please add this issue to the Rust security advisory database so that anyone depending on the crate has a way to check whether they depend on a vulnerable version.

@zimond
Copy link

zimond commented Dec 15, 2018

Seems still hitting this problem even using 0.1.16. I cannot say it's the same bug, but the behavior is quite alike. At some time memory of the items in the deque is corrupted. If I turn on unix_sysvfeature, it gives an other (not out of memory) panic.

@aldanor
Copy link
Author

aldanor commented Dec 15, 2018

@zimond Could you try isolating a minimal example?

@aldanor
Copy link
Author

aldanor commented Dec 15, 2018

Seems like #59 would be helpful to have after all.

@gnzlbg
Copy link
Owner

gnzlbg commented Dec 15, 2018

@zimond do the tests pass on your system? which system are you on?

@zimond
Copy link

zimond commented Dec 16, 2018

I'm on a 17 macbook pro. The tests pass. It's hard to isolate the problem as I run into this in a quite complex system i'm building. I created roughly 1000x SliceDeque and keeps .pop_front() and .remove(index) and .insert(index) between the dequeues, at generally the very same timepoint on each run, the memory corrupts. The items I used in the deque is a plain struct containing some vec attributes.

@zimond
Copy link

zimond commented Dec 16, 2018

By the way I can confirm this is related to SliceDeque, as I just switched my code back to std VecDeque and the problem disappears.

@gnzlbg
Copy link
Owner

gnzlbg commented Dec 19, 2018

@zimond I haven't been able to reproduce this yet on macos x. It would be extremely helpful if you could come up with a minimal (or not so minimal, e.g. point me to a github repo where this fails) working example.

If your code is not available online, maybe you could provide a version where everything that doesn't have much to do with the deques is stripped out, at least as long as its reasonable to do so.

In the meantime, i'm going to start fuzzing the library and see if that finds it.

@zimond
Copy link

zimond commented Dec 19, 2018

Ok I'll try... I will update in this thread once I get something

@gnzlbg
Copy link
Owner

gnzlbg commented Dec 19, 2018

Thanks! In the mean time I've set up fuzzing (see #59 (comment)) but it hasn't found any issues yet :/

@zimond
Copy link

zimond commented Dec 19, 2018

Have you checked .as_mut_slice() and then mutating items? Or does it have the possibility to break memory ?

@gnzlbg
Copy link
Owner

gnzlbg commented Dec 19, 2018

So as_mut_slice is pretty trivial for SliceDeque, it cannot really corrupt anything.

Until now, the majority of bugs have been due to failure to update the head and tail of the deque properly. I don't see anything wrong with that code, and there are a lot of debug_assert!s enabled in debug mode to check that these work properly, but that does not mean that these are correct.

I didn't ask about this before, but I suppose that you are able to reproduce the memory corruption in debug builds with debug assertions turned on, and you do not get a panic coming up from any of the asserts, right?

@zimond
Copy link

zimond commented Dec 19, 2018

Just tried debug build and you are right, the assertions did not catch this. It's so hard to reproduce this and I checked my code once again, I created a custom Iterator based on the slice returned by as_mut_slice (internally maintaining a usize index). The project always crashes when iterating the iterator.

@zimond
Copy link

zimond commented Dec 21, 2018

I managed to narrow the bug down to pop_front(). In LLDB, I found a crash that before pop_front() call, a SliceDeque has two items. After the pop, it has one item. But:

  1. the item id do not match either the remaining item or the removed item.
  2. The vec field in the item is corrupted.

So I think maybe in certain situation, pop_front will move the pointer to some invalid memory, maybe?

image

(note: after pop, head = 0, tail = 1)

@gnzlbg
Copy link
Owner

gnzlbg commented Dec 21, 2018

So I think maybe in certain situation, pop_front will move the pointer to some invalid memory, maybe?

This is basically what was happening in this bug. That's very suspicious .

@zimond is your project in github ? or could you send me a reduced test per email if it isn't ?

@zimond
Copy link

zimond commented Dec 21, 2018

Not on github. It's really hard to create a reduced test on this. So I just updated several replies here. I will try again this weekend. Sorry for the long wait.

@gnzlbg
Copy link
Owner

gnzlbg commented Dec 21, 2018

Don't be sorry, I am! I want to fix this, but without a program to reproduce it I really can't :/

@gnzlbg
Copy link
Owner

gnzlbg commented Jan 15, 2019

I haven't forgotten about this, a reproducer would still be appreciated.

@gnzlbg gnzlbg reopened this Jan 15, 2019
@zimond
Copy link

zimond commented Jan 16, 2019

@gnzlbg Hey I tried several times but it's still hard to reproduce. But today I reviewed the code and decided to replace .pop_front() with .remove(0), and it stopped crashing. Hope this helps, anyway.

@Rafferty97
Copy link

I'm also encounterring a memory corruption issue when using SliceDeque. It always happens after a call to .push_front(), after which the program immediately crashes due to reading uninitialised memory. Unfortunately, my code is also quite complex like in @zimond's case, but if I can produce a simplified test case I will post it here.

@gnzlbg
Copy link
Owner

gnzlbg commented Jan 22, 2019

I've implemented an optimization on master that might break the ".remove(0) workaround", so we really do need to get to the bottom of this. At this point, any kind of working example, no matter how complex, would help.

@whmountains
Copy link

whmountains commented Apr 18, 2019

I think I have a working example for you. In my latest project using slice_deque I am getting a segmentation fault within 60 seconds of startup. Refactoring to use vecdeque instead of slice_deque makes the problem go away. When I run git stash to revert back to slice deque it comes back. I went back and forth to make sure it was stable.

Yesterday instead of a segmentation fault I was getting a weird error about "overflow while adding Duration". I'm guessing the VecDeque was still corrupting memory, but it belonged to the application so the OS didn't notice.

You can find the code here: https://gitlab.com/szaver/mate3/tree/master. I would recommend testing rev fc639cbca1df96dc7d2d61d0e22bf7d58425e07f to reproduce the problem and df5598b4c64ec6800a53f3b4daad1bf3b29dc34e if you want a fixed version.

NOTE: this is a separate project from the one I mentioned in #64. This time I can reproduce the crash on my MacBook.

@whmountains
Copy link

whmountains commented Apr 18, 2019

I found an even simpler reproduction! 🎉 Published it here: https://gitlab.com/szaver/slicedeque-crash

use slice_deque::SliceDeque;

fn main() {
    let mut deque = SliceDeque::new();

    loop {
        deque.push_back(String::from("test"));
        if deque.len() == 8 {
            deque.pop_front();
        }
    }
}

On my computer it will crash within milliseconds of running with the following error:

mate3(82392,0x1177be5c0) malloc: *** error for object 0x4: pointer being freed was not allocated
mate3(82392,0x1177be5c0) malloc: *** set a breakpoint in malloc_error_break to debug
fish: 'cargo run' terminated by signal SIGABRT (Abort)

I hope that helps!

EDIT: It even reproduces inside GitLab CI. See here: https://gitlab.com/szaver/slicedeque-crash/-/jobs/199158446. I think it's safe to say that this is not a MacOS specific bug.

@gnzlbg
Copy link
Owner

gnzlbg commented Apr 18, 2019

Wow thank you so much, I work on MacOS, will dig into this tomorrow!!

@zimond
Copy link

zimond commented Apr 19, 2019

wow I'm so curious ! Waiting for the answer now 😂

gnzlbg added a commit that referenced this issue May 3, 2019
… indices.

This commit refactors the implementation use an internal slice instead of
indices, which significantly simplifies the implementation and closes #57.

The problem with #57 was that using a pair of indices to keep track of where the
head and the tail of the slice are located only works correctly for
`mem::size_of::<T>() % allocation_granularity() == 0` because in that case,
there is a unique map from indices to elements in memory.

consider a T that's 3 bytes wide, for which `mem::size_of::<T>() %
allocation_granularity() != 0`, then we can have:

```
//           first region                      mirrored region
// [T02, T10, T11, T12, -, -, T00, T01] | [T02, T10, T11, T12, -, -, T00, T01]
```

such that poping the first element leaves

```
//           first region                      mirrored region
// [-, T10, T11, T12, -, -, -, -] | [-, T10, T11, T12, -, -, -, -]
```

An integer indexing scheme in multiples of `size_of::<T>()` that starts at the
beginning of the first memory region cannot cope with this (e.g. we'd need to
say that the head of the slice starts at index 0.5). One way to work around that
would be to use indices to bytes instead (or equivalently, pointers).

This PR does that, by changing the layout from a pair of indices to Ts, to a
pointer to the first T, and a length (that is, a slice).
@gnzlbg
Copy link
Owner

gnzlbg commented May 3, 2019

I'm sorry that it took a while, but #66 should fix this for good.

The problem of the previous implementation is described in 8074e61

Basically, the previous implementation only always worked properly if size_of::<T>() % allocation_granularity() == 0 (alloc granularity is more or less the page size). If that was not the case, then depending on how the deque was being used, it might ran into this issue or not (there used to be code in the internal Buffer<T> type to detect this and panic with an unimplemented!, but for some reason that was removed at some point).

In a nutshell, before, a pair of indices was used to track the head and the tail of the slice within the ring buffer, e.g., suppose that the allocation granularity is 8, the size of T is 2, - is an unused byte, and Tij means the j-th byte of the i-th element in the deque, then we can end up with a layout like this:

// memory layout:
//          physical memory                     mirrored memory
// [-, -, -, -, T00, T01, T10, T11] |  [-, -, -, -, T00, T01, T10, T11] 
//    0    1       2         3   

where 0, 1, 2, 3 are indices that can be used to refer to the memory of an element within the ring buffer. If size_of::<T>() % alloc_granularity() == 0, the indexing scheme based on 0 + i * size_of::<T>() always works.

Now consider what happens if the above does not hold, and we put 3-byte wide Ts in that memory:

// memory layout:
//          physical memory                     mirrored memory
// [T00, T01, T02, T10, T11, T12, -, -] | [T00, T01, T02, T10, T11, T12, -, -] 
//         0               1

Note that only 2 3-byte wide elements fit in the 8 bytes that we have, so we have 2 1-byte holes at the end of the physical memory region. The problem is that this indexing scheme is only valid as long as we don't wrap around the deque. For example, suppose that we pop-front T0, and push_back T2, then we get:

// memory layout:
//          physical memory                     mirrored memory
// [T22, -, -, T10, T11, T12, T20, T21] | [T22, -, -, T10, T11, T12, T20, T21] 
//         0               1

The index 1 still indexes T1 properly, and if we were to index T2 with index 2, that would also work due to how memory is mirrored, but the issue is that we can't index T2 with index 0. This case also used to work, because before we did not tried to access T2 with index 0, but if we break it a little bit more, then everything breaks. For example, let's pop front T1, push back T3, and pop front T2. Then we get:

// memory layout:
//          physical memory                     mirrored memory
// [-, T30, T31, T32, -, -, -, -] | [-, T30, T31, T32, -, -, -, -]
//         0               1

Now there is no real way to access T3 via or i*size_of::<T>() indexing scheme. When this happened, the old implementation wrapped the indices around, and incorrectly stated that T3 was at index 0, instead of index 0.3 or something. So loading it would essentially transmute (-, T30, T31) as a T, resulting in garbage and undefined behavior.

This is basically what the example provided by @whmountains was doing. It was pushing a String into a deque, which is 3 pointers wide, or 1 pointer and 2 usizes - doesn't matter. What matters is that when the allocation granularity is 4096 bytes, String is 24 bytes on 64-bit architectures, and then size_of::<T>() % alloc_granularity() != 0 and the assumption does not hold. What the example does, is push some strings into the deque, and then shift them by adding one at the end via push back and removing one via pop front till the issue above is triggered after ~4096/3 + 1 times. The commit adds another test ("non_fitting") that detects this memory corruption without having to try to incorrectly free a string.

To fix this, one could keep the i * size_of::<T>() based indexing, but then one needs to make sure that when things wrap around the problems above don't happen, e.g., by moving the memory appropriately.

Instead of doing that, I decided to switch to a "byte" based indexing, where the indices point to bytes in memory, and not to Ts. The simplest way of doing that was to just, instead of storing a pair of indices for the head or tail, just store a *mut [T], which is a pointer to the first T, and a length. This is something I wanted to do anyways, since I thought it might remove a lot of brittle code that handled updating the indices, and that's what it did. Now the implementation only has a single tricky function (move_head_unchecked), but everything else became simpler, particularly common operations like using the slice methods internally, because now the slice to the elements is always available.

So that's basically why it took me so long to fix this. I allocated half an hour to look into this, to discover that it would take me a bit more to fully understand the issue. Then a couple of days later I looked again 1 hour at this, and found the issue, and toyed with different ways to solve this. And then I decided that the refactor was the right thing to do, but I couldn't allocate 2 hours for this till today =/

@gnzlbg
Copy link
Owner

gnzlbg commented May 3, 2019

Uh, I forgot the most important part of the post.

Please try the PR branch, and see if that fixes your issues, and report here with the results!

@gnzlbg gnzlbg closed this as completed in 365a7ea May 7, 2019
@gnzlbg
Copy link
Owner

gnzlbg commented May 7, 2019

slice-deque 0.2 has been released with this hopefully fixed for everybody, thank you all involved!

@Shnatsel
Copy link

Shnatsel commented May 7, 2019

Since this looks like a memory corruption and could be a security issue, could you file an advisory at https://github.com/RustSec/advisory-db so that anyone interested could check their dependencies for versions with this bug?

Also, if the fixed version is 0.2, does that mean that there's no semver-compatible version with the fix?

@gnzlbg
Copy link
Owner

gnzlbg commented May 7, 2019

Also, if the fixed version is 0.2, does that mean that there's no semver-compatible version with the fix?

Yes. The fix required a semver API-breaking change to SliceDeque::from_raw_parts. This API is public, but rarely used (not part of Vec or VecDeque), so migration should not be hard.

could you file an advisory

Done: rustsec/advisory-db#95

@zimond
Copy link

zimond commented May 8, 2019

Thanks to all the hard work. I'll try 0.2 soon and feedback

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants