Fully support OwnedFd #2417

anti-entropy123 · 2023-10-09T06:29:55Z

ref: issue #2317, PR #2369

Introducation
Since nix version 0.27.1, some APIs (e.g., nix::pty::openpty(), nix::sys::socket::socket()) have started using OwnedFd to manage FDs. However, it is not straightforward to pass OwnedFd between forks. So in PR #2369, we still use RawFd in the subsequent steps.

This issue try to do the change from RawFd to OwnedFd.

The text was updated successfully, but these errors were encountered:

anti-entropy123 · 2023-10-09T06:32:20Z

Some ideas
The main challenge with using OwnedFd is that ContainerArgs requires its fields to satisfy the Clone trait, but OwnedFd does not (and shouldn't, in my opinion) support it.
Maybe we can impl Clone for structs like Sender, in which we can use OwnedFd::try_clone() to safely share underlying sockets or connections. If we do this, we also need to check whether old FD will be leaked due to CloneCb.

oirom · 2023-12-21T01:22:07Z

Hi, can i try this issue??
I'm very newbie for rust and container runtime, so it will be take a bit long time I think...

utam0k · 2023-12-21T12:02:24Z

Don't worry ;)

I'm very newbie for rust and container runtime, so it will be take a bit long time I think...

zahash · 2024-03-25T15:29:05Z

@anti-entropy123 what should happen if OwnedFd::try_clone() returns Err?

say, when manually implementing the Clone trait for ContainerType

impl Clone for ContainerType {
    fn clone(&self) -> Self {
        match self {
            Self::InitContainer => Self::InitContainer,
            Self::TenantContainer { exec_notify_fd } => Self::TenantContainer { exec_notify_fd: exec_notify_fd.try_clone().unwrap() },
        }
    }
}

is unwrapping fine?

zahash · 2024-03-25T16:13:16Z

i have a question. lets say i try_clone() a OwnedFd a bunch of times.

let a = f.try_clone().unwrap();
let b = f.try_clone().unwrap();

what if one of those variables is dropped? will the file close?
if thats the case, then thats a big problem because having a OwnedFd instance assumes that the file is open

zahash · 2024-03-25T16:15:16Z

does it make sense to do this maybe?

pub enum ContainerType {
    InitContainer,
    TenantContainer { exec_notify_fd: Rc<OwnedFd> },
}

anti-entropy123 · 2024-03-25T16:16:23Z

what should happen if OwnedFd::try_clone() returns Err?

say, when manually implementing the Clone trait for ContainerType

Directly unwrapping indeed doesn't seem quite appropriate...

Could we just implement a try_clone method for structures like ContainerType (should also include ContainerArgs), rather than derive(Clone)? This way, errors can be passed, reusing the error handling logic from before. 👀

This suggestion might not be fully matured, but I hope it helps you. @zahash

zahash · 2024-03-25T16:19:11Z

i have a question. lets say i try_clone() a OwnedFd a bunch of times.
let a = f.try_clone().unwrap();
let b = f.try_clone().unwrap();
what if one of those variables is dropped? will the file close? if thats the case, then thats a big problem because having a OwnedFd instance assumes that the file is open

@anti-entropy123 what about this issue then?

anti-entropy123 · 2024-03-25T16:26:55Z

what about this issue then?

I think closing one of the file descriptors won't affect the usability of the other one.

zahash · 2024-03-25T16:30:27Z

@anti-entropy123

but the OwnedFd calls close on the wrapped RawFd when dropped.

#[stable(feature = "io_safety", since = "1.63.0")]
impl Drop for OwnedFd {
    #[inline]
    fn drop(&mut self) {
        unsafe {
            // Note that errors are ignored when closing a file descriptor. The
            // reason for this is that if an error occurs we don't actually know if
            // the file descriptor was closed or not, and if we retried (for
            // something like EINTR), we might close another valid file descriptor
            // opened after we closed ours.
            #[cfg(not(target_os = "hermit"))]
            let _ = libc::close(self.fd);
            #[cfg(target_os = "hermit")]
            let _ = hermit_abi::close(self.fd);
        }
    }
}

anti-entropy123 · 2024-03-25T16:35:13Z

@zahash Sorry, I don't have a suitable Linux machine with me right now. I can try verifying the issue tomorrow morning.

TenantContainer { exec_notify_fd: Rc },

I've discussed this idea before; you can refer to this link for more information: #2369 (comment)

zahash · 2024-03-25T16:43:22Z

@anti-entropy123 @utam0k does cargo test run all the available tests?

zahash · 2024-03-25T17:44:25Z

also, mmap no longer accepts f: Option<F> and instead just accepts f: F
and with in mmap, it tries to unwrap_or(-1) to get the raw fd.

but now, i can't find a way to create a invalid Fd OwnedFd::from_raw_fd(-1) panics at runtime because rust doesn't allow -1 (asserts that fd != -1)

        mman::mmap(
            None,
            NonZeroUsize::new(default_stack_size).ok_or(CloneError::ZeroStackSize)?,
            mman::ProtFlags::PROT_READ | mman::ProtFlags::PROT_WRITE,
            mman::MapFlags::MAP_PRIVATE | mman::MapFlags::MAP_ANONYMOUS | mman::MapFlags::MAP_STACK,
            OwnedFd::from_raw_fd(-1), // used to be None
            0,
        )
        .map_err(CloneError::StackAllocation)?

so i decided to be a little naughty and use OwnedFd::from_raw_fd(-2) and there are no runtime panics due to assertions and all the tests are passing.

is this fine?

OwnedFd::from_raw_fd(RawFd::MAX) also works

anti-entropy123 · 2024-03-26T01:35:57Z

@zahash I think we can try using mmap_anonymous https://docs.rs/nix/latest/nix/sys/mman/fn.mmap_anonymous.html

anti-entropy123 · 2024-03-26T01:41:02Z

what if one of those variables is dropped? will the file close? if thats the case, then thats a big problem because having a OwnedFd instance assumes that the file is open

I tested this code snippet, and it ran fine.

fn verify_try_clone(fd: OwnedFd) {
    let new_fd = fd.try_clone().unwrap();
    drop(fd);

    let mut new_file = unsafe { File::from_raw_fd(new_fd.as_raw_fd()) };
    new_file.write_all("hello world".as_bytes()).unwrap();
}

anti-entropy123 · 2024-03-26T01:47:23Z

but the OwnedFd calls close on the wrapped RawFd when dropped.

@zahash The try_clone method of OwnedFd ultimately calls a syscall similar to dup2, so in the end, there are indeed two separate file descriptors. https://man7.org/linux/man-pages/man2/dup.2.html

zahash · 2024-03-26T02:52:36Z

@anti-entropy123 @utam0k

i was also thinking about having a ref datatype

pub enum ContainerTypeRef<'fd> {
    InitContainer,
    TenantContainer { exec_notify_id: BorrowedFd<'fd> }
}

that way, we may not have to necessarily clone all the time.

but the CloneCb in container_intermediate_process requires cloning the args.

    let cb: CloneCb = {
        let args = args.clone();
        ...
    };

zahash · 2024-03-26T03:21:27Z

@anti-entropy123 @utam0k

i have made changes related to issues #2417 and #2723

all the tests pass. Please checkout the patch file. if you like where this is going and satisfied with most of the changes, i will create a pull request for further review.

diff --git a/crates/libcgroups/src/v1/memory.rs b/crates/libcgroups/src/v1/memory.rs
index 98fa6b1e..38b0dd07 100644
--- a/crates/libcgroups/src/v1/memory.rs
+++ b/crates/libcgroups/src/v1/memory.rs
@@ -332,7 +332,7 @@ impl Memory {
             Err(e) => {
                 // we need to look into the raw OS error for an EBUSY status
                 match e.inner().raw_os_error() {
-                    Some(code) => match Errno::from_i32(code) {
+                    Some(code) => match Errno::from_raw(code) {
                         Errno::EBUSY => {
                             let usage = Self::get_memory_usage(cgroup_root)?;
                             let max_usage = Self::get_memory_max_usage(cgroup_root)?;
diff --git a/crates/libcontainer/src/container/builder_impl.rs b/crates/libcontainer/src/container/builder_impl.rs
index 38f2f1cc..59d7a062 100644
--- a/crates/libcontainer/src/container/builder_impl.rs
+++ b/crates/libcontainer/src/container/builder_impl.rs
@@ -132,7 +132,7 @@ impl ContainerBuilderImpl {
             prctl::set_dumpable(false).map_err(|e| {
                 LibcontainerError::Other(format!(
                     "error in setting dumpable to false : {}",
-                    nix::errno::from_i32(e)
+                    nix::errno::Errno::from_raw(e)
                 ))
             })?;
         }
@@ -141,7 +141,7 @@ impl ContainerBuilderImpl {
         // therefore we will have to move all the variable by value. Since self
         // is a shared reference, we have to clone these variables here.
         let container_args = ContainerArgs {
-            container_type: self.container_type,
+            container_type: self.container_type.clone(),
             syscall: self.syscall,
             spec: Rc::clone(&self.spec),
             rootfs: self.rootfs.to_owned(),
diff --git a/crates/libcontainer/src/container/tenant_builder.rs b/crates/libcontainer/src/container/tenant_builder.rs
index 0d805a0d..2c2dbc88 100644
--- a/crates/libcontainer/src/container/tenant_builder.rs
+++ b/crates/libcontainer/src/container/tenant_builder.rs
@@ -1,6 +1,6 @@
 use caps::Capability;
 use nix::fcntl::OFlag;
-use nix::unistd::{self, close, pipe2, read, Pid};
+use nix::unistd::{self, pipe2, read, Pid};
 use oci_spec::runtime::{
     Capabilities as SpecCapabilities, Capability as SpecCapability, LinuxBuilder,
     LinuxCapabilities, LinuxCapabilitiesBuilder, LinuxNamespace, LinuxNamespaceBuilder,
@@ -8,6 +8,7 @@ use oci_spec::runtime::{
 };
 use procfs::process::Namespace;
 
+use std::os::fd::AsRawFd;
 use std::rc::Rc;
 use std::{
     collections::HashMap,
@@ -148,13 +149,13 @@ impl TenantContainerBuilder {
         let mut notify_socket = NotifySocket::new(notify_path);
         notify_socket.notify_container_start()?;
 
-        close(write_end).map_err(LibcontainerError::OtherSyscall)?;
+        drop(builder_impl); // this will close write_end fd
 
         let mut err_str_buf = Vec::new();
 
         loop {
             let mut buf = [0; 3];
-            match read(read_end, &mut buf).map_err(LibcontainerError::OtherSyscall)? {
+            match read(read_end.as_raw_fd(), &mut buf).map_err(LibcontainerError::OtherSyscall)? {
                 0 => {
                     if err_str_buf.is_empty() {
                         return Ok(pid);
diff --git a/crates/libcontainer/src/process/args.rs b/crates/libcontainer/src/process/args.rs
index ad37c07b..11c5d7f2 100644
--- a/crates/libcontainer/src/process/args.rs
+++ b/crates/libcontainer/src/process/args.rs
@@ -1,5 +1,6 @@
 use libcgroups::common::CgroupConfig;
 use oci_spec::runtime::Spec;
+use std::os::fd::OwnedFd;
 use std::os::unix::prelude::RawFd;
 use std::path::PathBuf;
 use std::rc::Rc;
@@ -9,10 +10,21 @@ use crate::notify_socket::NotifyListener;
 use crate::syscall::syscall::SyscallType;
 use crate::user_ns::UserNamespaceConfig;
 use crate::workload::Executor;
-#[derive(Debug, Copy, Clone)]
+#[derive(Debug)]
 pub enum ContainerType {
     InitContainer,
-    TenantContainer { exec_notify_fd: RawFd },
+    TenantContainer { exec_notify_fd: OwnedFd },
+}
+
+impl Clone for ContainerType {
+    fn clone(&self) -> Self {
+        match self {
+            Self::InitContainer => Self::InitContainer,
+            Self::TenantContainer { exec_notify_fd } => Self::TenantContainer {
+                exec_notify_fd: exec_notify_fd.try_clone().unwrap(/* is unwrap ok? */),
+            },
+        }
+    }
 }
 
 #[derive(Clone)]
diff --git a/crates/libcontainer/src/process/container_intermediate_process.rs b/crates/libcontainer/src/process/container_intermediate_process.rs
index 0085416b..7e5ac9c0 100644
--- a/crates/libcontainer/src/process/container_intermediate_process.rs
+++ b/crates/libcontainer/src/process/container_intermediate_process.rs
@@ -1,3 +1,5 @@
+use std::os::fd::AsRawFd;
+
 use crate::error::MissingSpecError;
 use crate::{namespaces::Namespaces, process::channel, process::fork};
 use libcgroups::common::CgroupManager;
@@ -128,12 +130,12 @@ pub fn container_intermediate_process(
                     if let Err(err) = main_sender.exec_failed(e.to_string()) {
                         tracing::error!(?err, "failed sending error to main sender");
                     }
-                    if let ContainerType::TenantContainer { exec_notify_fd } = args.container_type {
+                    if let ContainerType::TenantContainer { exec_notify_fd } = &args.container_type {
                         let buf = format!("{e}");
-                        if let Err(err) = write(exec_notify_fd, buf.as_bytes()) {
+                        if let Err(err) = write(&exec_notify_fd, buf.as_bytes()) {
                             tracing::error!(?err, "failed to write to exec notify fd");
                         }
-                        if let Err(err) = close(exec_notify_fd) {
+                        if let Err(err) = close(exec_notify_fd.as_raw_fd()) {
                             tracing::error!(?err, "failed to close exec notify fd");
                         }
                     }
@@ -157,8 +159,8 @@ pub fn container_intermediate_process(
     })?;
 
     // Close the exec_notify_fd in this process
-    if let ContainerType::TenantContainer { exec_notify_fd } = args.container_type {
-        close(exec_notify_fd).map_err(|err| {
+    if let ContainerType::TenantContainer { exec_notify_fd } = &args.container_type {
+        close(exec_notify_fd.as_raw_fd()).map_err(|err| {
             tracing::error!("failed to close exec notify fd: {}", err);
             IntermediateProcessError::ExecNotify(err)
         })?;
@@ -206,7 +208,7 @@ fn setup_userns(
     prctl::set_dumpable(true).map_err(|e| {
         IntermediateProcessError::Other(format!(
             "error in setting dumpable to true : {}",
-            nix::errno::from_i32(e)
+            nix::errno::Errno::from_raw(e)
         ))
     })?;
     sender.identifier_mapping_request().map_err(|err| {
@@ -220,7 +222,7 @@ fn setup_userns(
     prctl::set_dumpable(false).map_err(|e| {
         IntermediateProcessError::Other(format!(
             "error in setting dumplable to false : {}",
-            nix::errno::from_i32(e)
+            nix::errno::Errno::from_raw(e)
         ))
     })?;
     Ok(())
diff --git a/crates/libcontainer/src/process/fork.rs b/crates/libcontainer/src/process/fork.rs
index c0e05539..60a33f46 100644
--- a/crates/libcontainer/src/process/fork.rs
+++ b/crates/libcontainer/src/process/fork.rs
@@ -1,4 +1,4 @@
-use std::{ffi::c_int, fs::File, num::NonZeroUsize};
+use std::{ffi::c_int, num::NonZeroUsize};
 
 use libc::SIGCHLD;
 use nix::{
@@ -164,15 +164,17 @@ fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, Clone
     // do not use MAP_GROWSDOWN since it is not well supported.
     // Ref: https://man7.org/linux/man-pages/man2/mmap.2.html
     let child_stack = unsafe {
-        // Since nix = "0.27.1", `mmap()` requires a generic type `F: AsFd`.
-        // `::<File>` doesn't have any meaning because we won't use it.
-        mman::mmap::<File>(
+        // Since nix = "0.27.1", `mmap()` requires a generic type `F: AsFd` and takes f: Option<F>.
+        // if f: None, then fd = -1 is used; (-1 is just an invalid fd)
+        // let fd = f.map(|f| f.as_fd().as_raw_fd()).unwrap_or(-1);
+        // Since nix = "0.28.0" mmap function accepts f: F instead of Option<F>
+        // but it is not possible to manually create OwnedFd::from_raw_fd(-1) because rust will panic due to assert
+        // So, mmap_anonymous is used which achieves the exact same behaviour
+        mman::mmap_anonymous(
             None,
             NonZeroUsize::new(default_stack_size).ok_or(CloneError::ZeroStackSize)?,
             mman::ProtFlags::PROT_READ | mman::ProtFlags::PROT_WRITE,
             mman::MapFlags::MAP_PRIVATE | mman::MapFlags::MAP_ANONYMOUS | mman::MapFlags::MAP_STACK,
-            None,
-            0,
         )
         .map_err(CloneError::StackAllocation)?
     };
@@ -187,7 +189,7 @@ fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, Clone
 
     // Since the child stack for clone grows downward, we need to pass in
     // the top of the stack address.
-    let child_stack_top = unsafe { child_stack.add(default_stack_size) };
+    let child_stack_top = unsafe { child_stack.as_ptr().add(default_stack_size) };
 
     // Combine the clone flags with exit signals.
     let combined_flags = (flags | exit_signal.unwrap_or(0)) as c_int;
diff --git a/crates/libcontainer/src/seccomp/mod.rs b/crates/libcontainer/src/seccomp/mod.rs
index 2bfce4cb..a1acb131 100644
--- a/crates/libcontainer/src/seccomp/mod.rs
+++ b/crates/libcontainer/src/seccomp/mod.rs
@@ -332,7 +332,7 @@ mod tests {
             }
 
             if let Some(errno) = ret.err() {
-                if errno != nix::errno::from_i32(expect_error) {
+                if errno != nix::errno::Errno::from_raw(expect_error) {
                     Err(TestCallbackError::Custom(format!(
                         "getcwd failed but we didn't get the expected error from seccomp profile: {}",
                         errno
diff --git a/crates/libcontainer/src/syscall/linux.rs b/crates/libcontainer/src/syscall/linux.rs
index d01c4a54..35e5430c 100644
--- a/crates/libcontainer/src/syscall/linux.rs
+++ b/crates/libcontainer/src/syscall/linux.rs
@@ -314,7 +314,7 @@ impl Syscall for LinuxSyscall {
     fn set_id(&self, uid: Uid, gid: Gid) -> Result<()> {
         prctl::set_keep_capabilities(true).map_err(|errno| {
             tracing::error!(?errno, "failed to set keep capabilities to true");
-            nix::errno::from_i32(errno)
+            nix::errno::Errno::from_raw(errno)
         })?;
         // args : real *id, effective *id, saved set *id respectively
 
@@ -350,7 +350,7 @@ impl Syscall for LinuxSyscall {
         }
         prctl::set_keep_capabilities(false).map_err(|errno| {
             tracing::error!(?errno, "failed to set keep capabilities to false");
-            nix::errno::from_i32(errno)
+            nix::errno::Errno::from_raw(errno)
         })?;
         Ok(())
     }
diff --git a/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs b/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
index 10664b2e..60337c4d 100644
--- a/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
+++ b/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
@@ -1,7 +1,7 @@
 use anyhow::{bail, Context, Result};
 use libcontainer::container::ContainerProcessState;
 use nix::{
-    sys::socket::{self, UnixAddr},
+    sys::socket::{self, Backlog, UnixAddr},
     unistd,
 };
 use std::{
@@ -35,7 +35,11 @@ pub fn recv_seccomp_listener(seccomp_listener: &Path) -> SeccompAgentResult {
     socket::bind(socket.as_raw_fd(), &addr).context("failed to bind to seccomp listener socket")?;
     // Force the backlog to be 1 so in the case of an error, only one connection
     // from clients will be waiting.
-    socket::listen(&socket.as_fd(), 1).context("failed to listen on seccomp listener")?;
+    socket::listen(
+        &socket.as_fd(),
+        Backlog::new(1).unwrap( /* safe to unwrap because Backlog(1) is always valid */ ),
+    )
+    .context("failed to listen on seccomp listener")?;
     let conn = match socket::accept(socket.as_raw_fd()) {
         Ok(conn) => conn,
         Err(e) => {
diff --git a/tests/contest/runtimetest/src/tests.rs b/tests/contest/runtimetest/src/tests.rs
index dee3afc7..d1d2f2ce 100644
--- a/tests/contest/runtimetest/src/tests.rs
+++ b/tests/contest/runtimetest/src/tests.rs
@@ -34,7 +34,7 @@ pub fn validate_readonly_paths(spec: &Spec) {
     // change manual matching of i32 to e.kind() and match statement
     for path in ro_paths {
         if let std::io::Result::Err(e) = test_read_access(path) {
-            let errno = Errno::from_i32(e.raw_os_error().unwrap());
+            let errno = Errno::from_raw(e.raw_os_error().unwrap());
             // In the integration tests we test for both existing and non-existing readonly paths
             // to be specified in the spec, so we allow ENOENT here
             if errno == Errno::ENOENT {
@@ -50,7 +50,7 @@ pub fn validate_readonly_paths(spec: &Spec) {
         }
 
         if let std::io::Result::Err(e) = test_write_access(path) {
-            let errno = Errno::from_i32(e.raw_os_error().unwrap());
+            let errno = Errno::from_raw(e.raw_os_error().unwrap());
             // In the integration tests we test for both existing and non-existing readonly paths
             // being specified in the spec, so we allow ENOENT, and we expect EROFS as the paths
             // should be read-only

YJDoc2 · 2024-03-26T05:30:30Z

Hey @zahash , I'd suggest you wait till #2728 is merged, as that changes some of the things mentioned here, eg the mmap_anonymous. Also instead of using patch files, please either submit a PR, so link to a branch in your repo, as that would be easier for everyone to check and see.

zahash · 2024-03-26T05:36:18Z

@YJDoc2 oh, didn't realize there was already a pr to update nix. Fine then i will wait till it gets merged and then continue my work on the "Fully support OwnedFd" part

zahash · 2024-03-26T05:58:06Z

also, @utam0k can i also be assigned to this issue?

YJDoc2 · 2024-03-26T06:06:38Z

@oirom , do you still plan to work on this? Otherwise we'd probably re-assign this to @zahash

anti-entropy123 · 2024-03-26T08:23:52Z

@zahash, have you considered the possibility of a FD leak here?

container_main_process.rs

pub fn container_main_process(container_args: &ContainerArgs) -> Result<(Pid, bool)> {
    ...
    let cb: CloneCb = {
        let container_args = container_args.clone();
        ...
        Box::new(move || {
        ...
        })
    };
    ...
    let intermediate_pid = fork::container_clone(cb)

fork.rs

fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, CloneError> {
    ...
    extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
        unsafe { Box::from_raw(data as *mut CloneCb)() }
    }
    let ret = unsafe {
        libc::clone(
            main,
            child_stack_top,
            combined_flags,
            data as *mut libc::c_void,
        )
    };
    ...

It seems the child process doesn't have a chance to drop the OwnedFd outside of CloneCb. 🤔 (In the past, this wasn't a problem because they were all the same RawFd.)

oirom · 2024-03-26T08:45:09Z

@oirom , do you still plan to work on this? Otherwise we'd probably re-assign this to @zahash

Now this issue is too difficult to me, so please re-assign to @zahash

YJDoc2 · 2024-03-26T08:59:47Z

Now this issue is too difficult to me, so please re-assign to @zahash

No worries, feel free to take a look at some other issues such as #361 if you still want to contribute!

@zahash , assigning to you 👍

zahash · 2024-03-27T04:48:39Z

Also, cloning a OwnedFd only fails on wasm.

impl OwnedFd {
    /// Creates a new `OwnedFd` instance that shares the same underlying file
    /// description as the existing `OwnedFd` instance.
    #[stable(feature = "io_safety", since = "1.63.0")]
    pub fn try_clone(&self) -> crate::io::Result<Self> {
        self.as_fd().try_clone_to_owned()
    }
}

impl BorrowedFd<'_> {
    /// Creates a new `OwnedFd` instance that shares the same underlying file
    /// description as the existing `BorrowedFd` instance.
    #[cfg(not(any(target_arch = "wasm32", target_os = "hermit")))]
    #[stable(feature = "io_safety", since = "1.63.0")]
    pub fn try_clone_to_owned(&self) -> crate::io::Result<OwnedFd> {
        // We want to atomically duplicate this file descriptor and set the
        // CLOEXEC flag, and currently that's done via F_DUPFD_CLOEXEC. This
        // is a POSIX flag that was added to Linux in 2.6.24.
        #[cfg(not(target_os = "espidf"))]
        let cmd = libc::F_DUPFD_CLOEXEC;

        // For ESP-IDF, F_DUPFD is used instead, because the CLOEXEC semantics
        // will never be supported, as this is a bare metal framework with
        // no capabilities for multi-process execution. While F_DUPFD is also
        // not supported yet, it might be (currently it returns ENOSYS).
        #[cfg(target_os = "espidf")]
        let cmd = libc::F_DUPFD;

        // Avoid using file descriptors below 3 as they are used for stdio
        let fd = cvt(unsafe { libc::fcntl(self.as_raw_fd(), cmd, 3) })?;
        Ok(unsafe { OwnedFd::from_raw_fd(fd) })
    }

    /// Creates a new `OwnedFd` instance that shares the same underlying file
    /// description as the existing `BorrowedFd` instance.
    #[cfg(any(target_arch = "wasm32", target_os = "hermit"))]
    #[stable(feature = "io_safety", since = "1.63.0")]
    pub fn try_clone_to_owned(&self) -> crate::io::Result<OwnedFd> {
        Err(crate::io::const_io_error!(
            crate::io::ErrorKind::Unsupported,
            "operation not supported on WASI yet",
        ))
    }
}

anti-entropy123 · 2024-03-27T07:52:38Z

is there a "idiomatic" way to do this so rust can understand?

I think the main issue is that clone causes a "stack switch", which makes the child process lose all the local variables originally on the stack. So, at the moment, I don't have a good idea to solve this problem... (Apart from using Rc and manually adjusting the reference count.) @zahash

zahash · 2024-03-27T08:38:49Z

I think the main issue is that clone causes a "stack switch", which makes the child process lose all the local variables originally on the stack.

@anti-entropy123

i thought clone creates a process identical to the parent with all the local variables.
but they both will have separate memory spaces so modification done on the child won't affect parent.
besides, i don't think we are doing any modifications. (i might be wrong).

also, the CloneCb closure is boxed and we send the actual function pointer of the closure to libc. so shouldn't the closure retain all the references to the enclosed variables in the function body.

zahash · 2024-03-27T08:47:38Z

@anti-entropy123 i did this and all the tests pass. I basically removed #[derive(Clone)] from Senders and Receivers and added a lifetime to CloneCb

@YJDoc2 i can't create a PR right now because this hasn't merged yet #2728 and a lot of my changes depend on it. sorry for using a patch again.

diff --git a/crates/libcontainer/src/channel.rs b/crates/libcontainer/src/channel.rs
index 23d255f9..427912d6 100644
--- a/crates/libcontainer/src/channel.rs
+++ b/crates/libcontainer/src/channel.rs
@@ -18,13 +18,11 @@ pub enum ChannelError {
     #[error("channel connection broken")]
     BrokenChannel,
 }
-#[derive(Clone)]
 pub struct Receiver<T> {
     receiver: RawFd,
     phantom: PhantomData<T>,
 }
 
-#[derive(Clone)]
 pub struct Sender<T> {
     sender: RawFd,
     phantom: PhantomData<T>,
diff --git a/crates/libcontainer/src/process/channel.rs b/crates/libcontainer/src/process/channel.rs
index 9683e5ed..b1bd2e88 100644
--- a/crates/libcontainer/src/process/channel.rs
+++ b/crates/libcontainer/src/process/channel.rs
@@ -40,7 +40,6 @@ pub fn main_channel() -> Result<(MainSender, MainReceiver), ChannelError> {
     Ok((MainSender { sender }, MainReceiver { receiver }))
 }
 
-#[derive(Clone)]
 pub struct MainSender {
     sender: Sender<Message>,
 }
@@ -88,7 +87,6 @@ impl MainSender {
     }
 }
 
-#[derive(Clone)]
 pub struct MainReceiver {
     receiver: Receiver<Message>,
 }
@@ -199,7 +197,6 @@ pub fn intermediate_channel() -> Result<(IntermediateSender, IntermediateReceive
     ))
 }
 
-#[derive(Clone)]
 pub struct IntermediateSender {
     sender: Sender<Message>,
 }
@@ -219,7 +216,6 @@ impl IntermediateSender {
     }
 }
 
-#[derive(Clone)]
 pub struct IntermediateReceiver {
     receiver: Receiver<Message>,
 }
@@ -256,7 +252,6 @@ pub fn init_channel() -> Result<(InitSender, InitReceiver), ChannelError> {
     Ok((InitSender { sender }, InitReceiver { receiver }))
 }
 
-#[derive(Clone)]
 pub struct InitSender {
     sender: Sender<Message>,
 }
@@ -275,7 +270,6 @@ impl InitSender {
     }
 }
 
-#[derive(Clone)]
 pub struct InitReceiver {
     receiver: Receiver<Message>,
 }
diff --git a/crates/libcontainer/src/process/container_intermediate_process.rs b/crates/libcontainer/src/process/container_intermediate_process.rs
index 0085416b..8d9a015c 100644
--- a/crates/libcontainer/src/process/container_intermediate_process.rs
+++ b/crates/libcontainer/src/process/container_intermediate_process.rs
@@ -99,12 +99,7 @@ pub fn container_intermediate_process(
     }
 
     let cb: CloneCb = {
-        let args = args.clone();
-        let init_sender = init_sender.clone();
-        let inter_sender = inter_sender.clone();
-        let mut main_sender = main_sender.clone();
-        let mut init_receiver = init_receiver.clone();
-        Box::new(move || {
+        Box::new(|| {
             if let Err(ret) = prctl::set_name("youki:[2:INIT]") {
                 tracing::error!(?ret, "failed to set name for child process");
                 return ret;
@@ -121,7 +116,7 @@ pub fn container_intermediate_process(
                 tracing::error!(?err, "failed to close sender in the intermediate process");
                 return -1;
             }
-            match container_init_process(&args, &mut main_sender, &mut init_receiver) {
+            match container_init_process(&args, main_sender, init_receiver) {
                 Ok(_) => 0,
                 Err(e) => {
                     tracing::error!("failed to initialize container process: {e}");
diff --git a/crates/libcontainer/src/process/container_main_process.rs b/crates/libcontainer/src/process/container_main_process.rs
index f7066053..41fa96ab 100644
--- a/crates/libcontainer/src/process/container_main_process.rs
+++ b/crates/libcontainer/src/process/container_main_process.rs
@@ -42,16 +42,12 @@ pub fn container_main_process(container_args: &ContainerArgs) -> Result<(Pid, bo
     // cloned process, we have to be deligent about closing any unused channel.
     // At minimum, we have to close down any unused senders. The corresponding
     // receivers will be cleaned up once the senders are closed down.
-    let (main_sender, mut main_receiver) = channel::main_channel()?;
-    let inter_chan = channel::intermediate_channel()?;
-    let init_chan = channel::init_channel()?;
+    let (mut main_sender, mut main_receiver) = channel::main_channel()?;
+    let mut inter_chan = channel::intermediate_channel()?;
+    let mut init_chan = channel::init_channel()?;
 
     let cb: CloneCb = {
-        let container_args = container_args.clone();
-        let mut main_sender = main_sender.clone();
-        let mut inter_chan = inter_chan.clone();
-        let mut init_chan = init_chan.clone();
-        Box::new(move || {
+        Box::new(|| {
             if let Err(ret) = prctl::set_name("youki:[1:INTER]") {
                 tracing::error!(?ret, "failed to set name for child process");
                 return ret;
diff --git a/crates/libcontainer/src/process/fork.rs b/crates/libcontainer/src/process/fork.rs
index c0e05539..2a69d8d2 100644
--- a/crates/libcontainer/src/process/fork.rs
+++ b/crates/libcontainer/src/process/fork.rs
@@ -33,7 +33,7 @@ pub enum CloneError {
 /// pass in a child stack, which is empty. By storing the closure in heap, we
 /// can then in the new process to re-box the heap memory back to a closure
 /// correctly.
-pub type CloneCb = Box<dyn FnMut() -> i32>;
+pub type CloneCb<'a> = Box<dyn FnMut() -> i32 + 'a>;
 
 // Clone a sibling process that shares the same parent as the calling
 // process. This is used to launch the container init process so the parent

zahash · 2024-03-27T08:51:05Z

the lifetime in CloneCb<_> is only there to satisfy rust. i think the actual reason this is working is because we actually wait for the child process to finish.

    let pid = fork::container_clone_sibling(cb).map_err(|err| {
        tracing::error!("failed to fork init process: {}", err);
        IntermediateProcessError::InitProcess(err)
    })?;

    // ...

    main_sender.intermediate_ready(pid).map_err(|err| {
        tracing::error!("failed to wait on intermediate process: {}", err);
        err
    })?;

without this, the child might outlive the parent and become zombie. and rust lifetimes won't save us.

maybe there is a better way to do this so that rust understands the "waiting" part is important for the lifetime.

anti-entropy123 · 2024-03-27T10:46:24Z

OK, I agree with what you're saying, and since your code doesn't clone the container_args variable, there should be no copying of file descriptors. Therefore, fd leaks shouldn't occur. 👍

However, I'd still like to ask if you're aware of whether you're actually calling clone3() or clone()?

// An internal wrapper to manage the clone3 vs clone fallback logic.
fn clone_internal(
    mut cb: CloneCb,
    flags: u64,
    exit_signal: Option<u64>,
) -> Result<Pid, CloneError> {
    match clone3(&mut cb, flags, exit_signal) {
        ...
        // For now, we decide to only fallback on ENOSYS
        Err(CloneError::Clone(nix::Error::ENOSYS)) => {
            ...
            let pid = clone(cb, flags, exit_signal)?;

Since in clone(), a new stack is allocated for the child process,

fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, CloneError> {
    ...
    let ret = unsafe {
            libc::clone(
                main,
                child_stack_top,
                combined_flags,
                data as *mut libc::c_void,
            )
        };
    ...

wouldn't the closure's references to local variables (e.g., &container_args) become invalid? @zahash

zahash · 2024-03-27T10:55:46Z

@anti-entropy123

I have the exact same question 😅😅

do you know how clone3 vs clone differences are? and since the closure has references to variables in the paren'ts memory, do they also get copied to the child memory?

i can see that clone3 also takes args but clone doesn't. (but i don't think we are passing anything related in it)

but why do the tests pass? do they even check this part of the code?

anti-entropy123 · 2024-03-27T10:59:30Z

I think your test device should support clone3(), so it likely called clone3() in reality. And clone3() seems to completely copy the parent process's memory, which is why your test passed.

zahash · 2024-03-27T11:00:20Z

@anti-entropy123 is there a way to force it to use clone instead?

never mind. i'll just comment out the clone3 part

anti-entropy123 · 2024-03-27T11:01:45Z

But I'm not entirely sure why clone() necessitates switching to a new stack; it could be a requirement of the syscall itself (? 🤔 ).

zahash · 2024-03-27T11:03:10Z

@anti-entropy123 all the tests still pass

fn clone_internal(
    mut cb: CloneCb,
    flags: u64,
    exit_signal: Option<u64>,
) -> Result<Pid, CloneError> {
    clone(cb, flags, exit_signal)
}

anti-entropy123 · 2024-03-27T11:09:52Z

all the tests still pass

Amazing... I'm curious about the reason. @zahash

zahash · 2024-03-27T11:21:39Z

@anti-entropy123

Amazing... I'm curious about the reason

I'm equally curious. 😂

maybe this is undefined behavior.

anti-entropy123 · 2024-03-27T11:29:50Z

maybe this is undefined behavior

My expectation was that a segmentation fault would occur.

zahash · 2024-03-27T11:30:18Z

@anti-entropy123 if found this on the kernel man page

The clone() wrapper function
       When the child process is created with the clone() wrapper
       function, it commences execution by calling the function pointed
       to by the argument fn.  (This differs from [fork(2)](https://man7.org/linux/man-pages/man2/fork.2.html), where
       execution continues in the child from the point of the [fork(2)](https://man7.org/linux/man-pages/man2/fork.2.html)
       call.)  The arg argument is passed as the argument of the
       function fn.

       When the fn(arg) function returns, the child process terminates.
       The integer returned by fn is the exit status for the child
       process.  The child process may also terminate explicitly by
       calling [exit(2)](https://man7.org/linux/man-pages/man2/exit.2.html) or after receiving a fatal signal.

       The stack argument specifies the location of the stack used by
       the child process.  Since the child and calling process may share
       memory, it is not possible for the child process to execute in
       the same stack as the calling process.  The calling process must
       therefore set up memory space for the child stack and pass a
       pointer to this space to clone().  Stacks grow downward on all
       processors that run Linux (except the HP PA processors), so stack
       usually points to the topmost address of the memory space set up
       for the child stack.  Note that clone() does not provide a means
       whereby the caller can inform the kernel of the size of the stack
       area.

https://man7.org/linux/man-pages/man2/clone.2.html

zahash · 2024-03-27T11:33:28Z

what about others? what do they think? @utam0k @YJDoc2 @yihuaf

anti-entropy123 · 2024-03-27T11:35:39Z

I think I understand now. Even though clone() assigns a new stack to the child process, the original memory region (the old stack) may still be valid. So, those references remain effective.
Do you think so? @zahash

YJDoc2 · 2024-03-27T11:36:50Z

Hey, will need some time to read all the discussion above and consider it. Probably would get back on this around the weekend....

zahash · 2024-03-27T11:38:30Z

@anti-entropy123

maybe. but how can one process directly access the memory of another process? its not allowed right?

or maybe its allowed in "parent - child" processes?

or maybe its because, the child gets an exact copy of the parent's memory when clone/fork

anti-entropy123 · 2024-03-27T11:42:48Z

No, what I mean is that the old stack is also copied to the child process. The address spaces of the parent and child processes are also isolated. I'm planning to observe the maps of both processes. @zahash

zahash · 2024-03-27T11:44:51Z

@anti-entropy123

I'm planning to observe the maps of both processes

yes please. that would be very helpful 🥰

anti-entropy123 · 2024-03-27T12:20:44Z

@zahash It seems our guess was correct.
I made some changes to make testing easier for me:

When running youki create, the process prints logs:

Check the maps file for both the parent and child processes separately (using sudo cat /proc//maps):

Parent:
sudo cat /proc/26778/maps
...
5610cba8c000-5610cbacf000 rw-p 00000000 00:00 0 [heap]
7f7116e68000-7f7116e69000 ---p 00000000 00:00 0
7f7116e69000-7f711766b000 rw-p 00000000 00:00 0
...
7ffccdd65000-7ffccdd9b000 rw-p 00000000 00:00 0 [stack]
...
Child:
sudo cat /proc/26779/maps | grep rw
...
5610cba8c000-5610cbacf000 rw-p 00000000 00:00 0 [heap]
7f7116e69000-7f711766b000 rw-p 00000000 00:00 0
...
7f7117a6b000-7f7117a6c000 rw-p 00000000 00:00 0
7ffccdd65000-7ffccdd9b000 rw-p 00000000 00:00 0 [stack]

Both the memory region containing the local variables of the child process (0x7f71176678c4) and the region referencing the local variables of the parent process (located at 7ffccdd65000-7ffccdd9b000) appear to be valid. So now it all makes sense. 🎉

utam0k · 2024-03-30T07:05:52Z

@anti-entropy123 @utam0k does cargo test run all the available tests?

No, you can see our test coverage here: https://app.codecov.io/gh/containers/youki

utam0k · 2024-03-30T07:30:19Z

@anti-entropy123 @zahash
Sorry for the late coming. My understanding is a memory stack is copied from parent process's one when clone(2) is called. But be careful, it is just copy, so changes in the child process's memory don't affect the parent's one.

YJDoc2 · 2024-05-26T17:51:07Z

@anti-entropy123 , @zahash if you are still interested in this, the nix update to 0.28 is done, so this can be worked on.

zahash · 2024-05-27T10:32:39Z

@YJDoc2 yes I will start working on it, thanks for notifying me

zahash · 2024-06-05T12:17:30Z

@YJDoc2 @utam0k

i bought a windows laptop and using wsl to work on this project.
i'm getting this error when running cargo test

failures:

---- systemd::dbus_native::dbus::tests::test_dbus_function_calls stdout ----
Error: DBus(ConnectionError("ENOENT: No such file or directory"))

---- systemd::manager::tests::test_task_addition stdout ----
thread 'systemd::manager::tests::test_task_addition' panicked at crates/libcgroups/src/systemd/manager.rs:553:10:
called `Result::unwrap()` on an `Err` value: SystemdClient(DBus(BusAddressError("no valid bus path found in list unix:path=/run/user/1000/bus")))

---- systemd::dbus_native::dbus::tests::test_dbus_function_calls_errors stdout ----
thread 'systemd::dbus_native::dbus::tests::test_dbus_function_calls_errors' panicked at crates/libcgroups/src/systemd/dbus_native/dbus.rs:539:69:
called `Result::unwrap()` on an `Err` value: DBus(ConnectionError("ENOENT: No such file or directory"))

---- systemd::dbus_native::dbus::tests::test_dbus_connection_auth stdout ----
thread 'systemd::dbus_native::dbus::tests::test_dbus_connection_auth' panicked at crates/libcgroups/src/systemd/dbus_native/dbus.rs:486:9:
assertion failed: conn.is_ok()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    systemd::dbus_native::dbus::tests::test_dbus_connection_auth
    systemd::dbus_native::dbus::tests::test_dbus_function_calls
    systemd::dbus_native::dbus::tests::test_dbus_function_calls_errors
    systemd::manager::tests::test_task_addition

test result: FAILED. 139 passed; 4 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.11s

error: test failed, to rerun pass `-p libcgroups --lib`

zahash · 2024-06-05T12:35:50Z

here is the initial PR #2809

YJDoc2 · 2024-06-05T18:18:15Z

Hey @zahash

i bought a windows laptop and using wsl to work on this project.
i'm getting this error when running cargo test

I have never used wsl, so not 100% sure, but from error it seems like the dbus is not setup properly. You might check if wsl runs systemd+dbus in the first place, which is needed for these tests. In case it doesn't, it should be fine as long as you are not changing any code related to the dbus impl.

utam0k added help wanted Extra attention is needed good first issue Good for newcomers labels Oct 10, 2023

utam0k assigned oirom Dec 21, 2023

YJDoc2 assigned zahash and unassigned oirom Mar 26, 2024

zahash mentioned this issue Jun 8, 2024

initial progress on supporting OwnedFd #2809

Merged

Fully support OwnedFd #2417

Fully support OwnedFd #2417

Comments

anti-entropy123 commented Oct 9, 2023

anti-entropy123 commented Oct 9, 2023 • edited Loading

oirom commented Dec 21, 2023

utam0k commented Dec 21, 2023

zahash commented Mar 25, 2024

zahash commented Mar 25, 2024

zahash commented Mar 25, 2024

anti-entropy123 commented Mar 25, 2024

zahash commented Mar 25, 2024

anti-entropy123 commented Mar 25, 2024

zahash commented Mar 25, 2024

anti-entropy123 commented Mar 25, 2024

zahash commented Mar 25, 2024 • edited Loading

zahash commented Mar 25, 2024 • edited Loading

anti-entropy123 commented Mar 26, 2024

anti-entropy123 commented Mar 26, 2024

anti-entropy123 commented Mar 26, 2024

zahash commented Mar 26, 2024

zahash commented Mar 26, 2024 • edited Loading

YJDoc2 commented Mar 26, 2024

zahash commented Mar 26, 2024

zahash commented Mar 26, 2024

YJDoc2 commented Mar 26, 2024

anti-entropy123 commented Mar 26, 2024

oirom commented Mar 26, 2024

YJDoc2 commented Mar 26, 2024

zahash commented Mar 27, 2024

anti-entropy123 commented Mar 27, 2024

zahash commented Mar 27, 2024 • edited Loading

zahash commented Mar 27, 2024

zahash commented Mar 27, 2024 • edited Loading

anti-entropy123 commented Mar 27, 2024 • edited Loading

zahash commented Mar 27, 2024 • edited Loading

anti-entropy123 commented Mar 27, 2024

zahash commented Mar 27, 2024 • edited Loading

anti-entropy123 commented Mar 27, 2024

zahash commented Mar 27, 2024

anti-entropy123 commented Mar 27, 2024

zahash commented Mar 27, 2024 • edited Loading

anti-entropy123 commented Mar 27, 2024

zahash commented Mar 27, 2024 • edited Loading

zahash commented Mar 27, 2024

anti-entropy123 commented Mar 27, 2024

YJDoc2 commented Mar 27, 2024

zahash commented Mar 27, 2024 • edited Loading

anti-entropy123 commented Mar 27, 2024

zahash commented Mar 27, 2024

anti-entropy123 commented Mar 27, 2024

utam0k commented Mar 30, 2024

utam0k commented Mar 30, 2024

YJDoc2 commented May 26, 2024

zahash commented May 27, 2024

zahash commented Jun 5, 2024

zahash commented Jun 5, 2024

YJDoc2 commented Jun 5, 2024

anti-entropy123 commented Oct 9, 2023 •

edited

Loading

zahash commented Mar 25, 2024 •

edited

Loading

zahash commented Mar 25, 2024 •

edited

Loading

zahash commented Mar 26, 2024 •

edited

Loading

zahash commented Mar 27, 2024 •

edited

Loading

zahash commented Mar 27, 2024 •

edited

Loading

anti-entropy123 commented Mar 27, 2024 •

edited

Loading

zahash commented Mar 27, 2024 •

edited

Loading

zahash commented Mar 27, 2024 •

edited

Loading

zahash commented Mar 27, 2024 •

edited

Loading

zahash commented Mar 27, 2024 •

edited

Loading

zahash commented Mar 27, 2024 •

edited

Loading