
Possible UB in Tensor #33

Open
Pratyush opened this issue May 4, 2019 · 21 comments
@Pratyush (Contributor) commented May 4, 2019

Hi,

I'm not super familiar with the design of the library, but I've been trying to use it in a project of mine. In doing so I found that Tensor::f_copy mutates a Tensor that's passed as an immutable reference, which is UB in Rust. I think one of the arguments should be changed to &mut to fix this. Sorry if I've missed some context.

Thanks!

@LaurentMazare (Owner)

Thanks for reporting this issue. Indeed, mutability is not handled properly. This is mostly due to a lot of the wrapping code being auto-generated from a description file that doesn't specify whether tensors are mutated or not.
That said, f_copy_ is not automatically generated, so I just changed it as per your suggestion to take a proper &mut self argument.
There is also a convention in PyTorch that methods ending with an underscore are mutating; I'll make further changes so that the automatically generated code handles this properly.
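The signature change can be illustrated with a simplified stand-in type (hypothetical; not the actual tch wrapper, which goes through a raw C tensor pointer):

```rust
// Simplified stand-in for the wrapper type; hypothetical, not tch's API.
struct Tensor {
    data: Vec<f32>,
}

impl Tensor {
    // Before the fix, the method took `&self` yet mutated the receiver,
    // misrepresenting the operation to the borrow checker. With `&mut self`
    // the mutation is visible in the signature and exclusive access is
    // enforced at compile time.
    fn f_copy_(&mut self, src: &Tensor) {
        self.data.clone_from(&src.data);
    }
}

fn main() {
    let src = Tensor { data: vec![1.0, 2.0] };
    let mut dst = Tensor { data: vec![0.0, 0.0] };
    dst.f_copy_(&src); // requires `dst` to be declared `mut`
    assert_eq!(dst.data, vec![1.0, 2.0]);
}
```

With this receiver, calling `f_copy_` on a tensor that is only available behind a shared reference is a compile error, which is exactly the guarantee the issue asks for.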

@Pratyush (Contributor, Author) commented May 4, 2019

Great, thanks for the prompt response!

@Pratyush changed the title from "Possible unsafe in Tensor" to "Possible UB in Tensor" on May 5, 2019
@LaurentMazare (Owner)

The code generation has been tweaked to reflect that functions ending with underscores can mutate self. I'm not sure whether this covers all possible scenarios but this should at least improve the current situation. Feel free to re-open if you see further inconsistencies.
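The naming rule the generator applies could be sketched like this (illustrative only; the real declaration files carry more information than just the op name):

```rust
// Sketch of the codegen rule: PyTorch's convention is that a trailing
// underscore marks an in-place (mutating) operation, so those methods
// should take `&mut self` rather than `&self`.
fn receiver_for(op_name: &str) -> &'static str {
    if op_name.ends_with('_') {
        "&mut self"
    } else {
        "&self"
    }
}

fn main() {
    assert_eq!(receiver_for("relu_"), "&mut self"); // in-place op
    assert_eq!(receiver_for("relu"), "&self");      // allocating op
}
```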

@Pratyush (Contributor, Author) commented May 5, 2019

This is slightly unrelated to this issue, but I see that a number of functions return a Tensor out that has different relations with the input Tensor in. For example, out might be a view into in, or out might be an entirely new Tensor.

In general, there are some lifetime relations between out and in which the current code doesn't quite capture, which could be UB. For example, if out is a view into in, at the moment I can drop in and still keep out around. I'm not sure if this would be UB or not as in is backed by storage allocated on the C side of things, but it is something that should be kept in mind.

ndarray gets around this by having distinct owned and view types (Array, ArrayView, etc.) that are instantiations of a core ArrayBase struct. Maybe something similar could be useful here?
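The ndarray pattern referenced above can be sketched roughly as follows (names simplified; this is not ndarray's actual API, just the shape of the idea):

```rust
// One base struct parameterized by a data-representation type, so that
// ownership vs. borrowing is visible in the type system.
struct ArrayBase<S> {
    data: S,
}

// Owned variant: the array owns its storage.
type ArrayOwned = ArrayBase<Vec<f32>>;
// View variant: borrows storage, so the borrow checker ties the view's
// lifetime to the owner and rejects use-after-drop or aliased mutation.
type ArrayView<'a> = ArrayBase<&'a [f32]>;

impl ArrayOwned {
    fn view(&self) -> ArrayView<'_> {
        ArrayBase { data: &self.data[..] }
    }
}

fn main() {
    let owned: ArrayOwned = ArrayBase { data: vec![1.0, 2.0, 3.0] };
    let view = owned.view();
    assert_eq!(view.data, &[1.0, 2.0, 3.0][..]);
    // drop(owned) here would be a compile error while `view` is alive.
}
```

Applied to tensors, the same trick would let a reshape that aliases its input return a view type whose lifetime is bound to the parent tensor.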

@LaurentMazare (Owner)

You're perfectly right. Again, the issue comes from the code being mostly generated from an ops definition file, and this file doesn't include much information about what is shared between the different inputs and outputs.
I don't think it's very straightforward to handle this properly. Maybe it would be worth writing a more type-safe API on top of these wrappers, but this is likely to be a bit involved if one wants to handle all the operations provided by torch. One pretty tough case is the backprop operation: when called on an op o, it could modify all the gradients on the computation graph leading to o.
Meanwhile, note that we still have some form of memory safety: a tensor's memory shouldn't be released too early (this is provided by the C++ API, which uses refcounting internally).

@Pratyush (Contributor, Author) commented May 5, 2019

In that case, could having Tensor wrap its contents in an Rc/Rc<RefCell<...>> help? Otherwise it might be prudent to mark these possibly unsafe operations as unsafe, to alert users of the issue. Either way, this should probably be a different issue.

@LaurentMazare (Owner)

Unsafe seems a bit too much to me in this case: it cannot cause a segfault but there may be some data sharing that is not represented in the type system. I would have thought that the only issues this creates are potential race conditions when using multiple threads - maybe not implementing the Send trait for tensors so that they cannot be shared between threads would be enough.

@Pratyush (Contributor, Author) commented May 5, 2019

Hmm, it can still cause problems. The following example (adapted from the swap example in this blog post) demonstrates some unexpected behavior:

// Say we have a mutable Tensor t1 (with integer elements)
let mut t2 = t1.reshape(t1.shape()); // t2 is now a view of t1
let swap = |x: &mut _, y: &mut _| {
    // classic XOR swap
    *x = *x ^ *y;
    *y = *x ^ *y;
    *x = *x ^ *y;
};
swap(&mut t1, &mut t2); // t1 and t2 alias, so both end up 0

Unless reshape() returns newly allocated storage for t2, and unless PyTorch enforces Rust-like borrowing in its refcounting, you'll get the above phenomenon, which should be impossible in safe Rust. (Note that if one tried to do the above via Rc in Rust, one would get a runtime panic because x and y would simultaneously be mutably borrowing the same underlying storage.)
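The runtime-panic behavior mentioned above can be shown directly with plain std types, no tch involved:

```rust
use std::cell::RefCell;
use std::rc::Rc;

fn main() {
    // Two handles to the same storage, analogous to a tensor and a view of it.
    let storage = Rc::new(RefCell::new(vec![1u8, 2, 3]));
    let alias = Rc::clone(&storage);

    let _guard = storage.borrow_mut(); // first mutable borrow is live
    // A second simultaneous mutable borrow is detected at runtime instead
    // of silently aliasing: borrow_mut() would panic here, and the
    // non-panicking try_borrow_mut() returns an error.
    assert!(alias.try_borrow_mut().is_err());
}
```

This is the dynamic enforcement the XOR-swap example would run into; the raw-pointer-backed Tensor performs no such check.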

@LaurentMazare (Owner)

Very interesting example, thanks; indeed there is more to it than race conditions. I'm still not sure whether fixing this is in the scope of this crate or of another layer on top of it. I have to think more about it, but meanwhile I reopened this issue so that the current behavior is clearer.

@LaurentMazare reopened this on May 6, 2019
@Pratyush (Contributor, Author) commented May 10, 2019

Also see PyO3/pyo3#342, this chapter in wasm-bindgen, and finally this stackoverflow question

@ExpHP commented Jul 12, 2019

I'll bump this just to point out the comment I recently added to the PyO3 issue, which is basically that rust-cpython uses &[Cell<T>] and &[ReadOnlyCell<T>] to fix this sort of UB:

PyO3/pyo3#342 (comment)

@andersk commented Aug 31, 2019

The title of this issue may be unnecessarily alarmist; I was certainly alarmed. It’s UB for a Rust reference to alias mutable data, and that would be very bad. But this library doesn’t appear to actually take Rust references to the underlying data inside a Tensor; there’s no operation like fn data(&Tensor) -> &[u8]. It may be confusing for the data accessible through a Tensor to change due to aliasing, and perhaps there are ways to reduce that confusion with better types, but it’s not UB if it’s not observed through Rust references.

@antimora

Linking burn's ticket #235, which uses tch-rs and may have the same issue; if we come up with a solution there, you can look it up too.

@LaurentMazare (Owner)

@antimora that's very interesting, do you have a small tch example that triggers the UB? I don't think we ever came up with one and it's unclear that it could actually happen but if you have an example that would be very helpful in order to propose a fix.

@nathanielsimard
The problem arises when using tch::Tensor wrapped inside an Arc for a pattern similar to "copy-on-write". However, not all operations return tensors with newly allocated memory.

For instance, reshaping a tensor followed by an in-place operation may or may not change the original tensor depending on the shape. So it seems like you have to treat tensors as unsafe since they are not really owners of their data.

let tensor = ...; // some existing tensor
let mut tensor_cloned = tensor.shallow_clone();
tensor_cloned.relu_(); // also mutates the original tensor

It's not a big problem when you explicitly call shallow_clone followed by an in-place operation, but some non-mutating operations (like narrow or reshape) sometimes return a tensor that uses the same memory location as their parent.

I fixed the problem by carefully tracking references to data_ptr, which is definitely an unsafe pattern. Since tch aims to provide bindings to libtorch in the most direct way possible, I'm not sure if it's something that can be fixed, but it can be documented somewhere.
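A self-contained analogue of that fix can be sketched with Rc::make_mut as a copy-on-write guard (illustrative only; the actual fix in burn tracks libtorch data_ptr values rather than Rust refcounts):

```rust
use std::rc::Rc;

// Stand-in for a tensor whose storage may be shared with views or
// shallow clones.
struct FakeTensor {
    storage: Rc<Vec<f32>>,
}

impl FakeTensor {
    fn shallow_clone(&self) -> FakeTensor {
        // Shares the same storage, like an aliasing tch clone.
        FakeTensor { storage: Rc::clone(&self.storage) }
    }

    fn relu_(&mut self) {
        // Rc::make_mut clones the storage only when it is aliased, so an
        // in-place op can never mutate another tensor's data.
        for x in Rc::make_mut(&mut self.storage).iter_mut() {
            if *x < 0.0 {
                *x = 0.0;
            }
        }
    }
}

fn main() {
    let original = FakeTensor { storage: Rc::new(vec![-1.0, 2.0]) };
    let mut cloned = original.shallow_clone();
    cloned.relu_();
    assert_eq!(*cloned.storage, vec![0.0, 2.0]);
    assert_eq!(*original.storage, vec![-1.0, 2.0]); // original untouched
}
```

The trade-off is an extra allocation whenever an aliased tensor is mutated, which is exactly the copy-on-write behavior described above.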

@andersk commented Mar 16, 2023

Do you have evidence that this is actually undefined behavior, or is it merely a confusing use of interior mutability? You can do confusing things with Arc<RefCell<T>> in completely safe Rust, too.

@antimora

> @antimora that's very interesting, do you have a small tch example that triggers the UB? I don't think we ever came up with one and it's unclear that it could actually happen but if you have an example that would be very helpful in order to propose a fix.

I asked @nathanielsimard to respond to your request, since he is the one who fixed a bug on our end.

@nathanielsimard
@andersk I would say this:

> Mutating immutable data. All data inside a const item is immutable. Moreover, all data reached through a shared reference or data owned by an immutable binding is immutable, unless that data is contained within an UnsafeCell.

The difference with RefCell is that it will panic if you try to take multiple mutable references to the same value, or a read-only reference while a mutable reference exists. However, I don't think shallow_clone validates that. Maybe making the function unsafe would better communicate that it's the caller's responsibility to ensure that multiple tensors will not mutate the same data.

From the same link:

> unsafe only means that avoiding undefined behavior is on the programmer; it does not change anything about the fact that Rust programs must never cause undefined behavior.
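The suggestion could look roughly like this, sketched with a dummy stand-in type (not a proposed tch patch):

```rust
// Dummy stand-in: a tensor handle that is just a raw pointer into
// possibly shared storage.
struct Tensor {
    ptr: *mut f32,
    len: usize,
}

impl Tensor {
    /// # Safety
    /// The caller must ensure that no two aliases of this tensor are
    /// mutated concurrently, since both handles share the same storage.
    unsafe fn shallow_clone(&self) -> Tensor {
        Tensor { ptr: self.ptr, len: self.len }
    }
}

fn main() {
    let mut buf = vec![1.0f32, 2.0];
    let t = Tensor { ptr: buf.as_mut_ptr(), len: buf.len() };
    // The `unsafe` block makes the aliasing obligation explicit at the
    // call site instead of hiding it behind a safe-looking API.
    let alias = unsafe { t.shallow_clone() };
    assert_eq!(alias.ptr, t.ptr); // both handles point at the same storage
    assert_eq!(alias.len, 2);
}
```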

@andersk commented Mar 16, 2023

That’s my point though—as far as I’m aware, the data in a Tensor is not reachable through a shared reference or owned by an immutable binding. It’s only reachable through a raw pointer, which is not subject to those rules.

pub struct Tensor {
    pub(super) c_tensor: *mut C_tensor,
}

If this library were to expose an operation that dereferences this raw pointer and returns a Rust reference, that could be used by safe code to trigger undefined behavior. But I don’t see an operation like that.

(Rc<Cell<T>> is a closer analogy in safe Rust, since there’s no safe way to get a &T out of that.)
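That analogy can be made concrete in entirely safe Rust, with no tch involved:

```rust
use std::cell::Cell;
use std::rc::Rc;

fn main() {
    // Shared, mutable through aliases, and yet fully safe: Cell never
    // hands out a &T into its contents, only copies values in and out,
    // so no Rust reference ever observes the aliased mutation.
    let a = Rc::new(Cell::new(1));
    let b = Rc::clone(&a);
    b.set(2);
    assert_eq!(a.get(), 2); // mutation is visible through the other alias
}
```

Surprising, perhaps, but not undefined behavior; which is the distinction being drawn for Tensor above.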

@andersk commented Mar 16, 2023

Hmm, the Send implementation is suspicious though:

unsafe impl Send for Tensor {}

based on this note from an upstream contributor that “Tensors are thread safe to read from multiple threads but they are not threadsafe to write”.

@nathanielsimard
I'm not qualified enough to say with certainty what is or is not undefined behavior; I'm just pointing out that in-place operations can have surprising side effects, especially when coupled with operations that may or may not return a newly allocated tensor.

I'm also not sure there is much to do here: LibTorch was not written in Rust, and if a view is modified, it may choose to modify the parent tensor as well! I don't see how this could be fixed while keeping an API as close as possible to the C++ one.

The crate aims at staying as close as possible to the original C++ API; more idiomatic Rust bindings could then be developed on top of it.
