libcontainer/cgroups/fs: fix OCI runtime pause failed #4388

Open
wants to merge 1 commit into
base: main

Conversation

botieking98

In some instances, runc pause still fails with
`ctr: OCI runtime pause failed: unable to freeze: unknown`.

We should let it sleep longer on very slow systems or machines.

if i%500 == 499 {
	// should sleep a longer time for
	// some really very slow machine.
	time.Sleep(5 * time.Second)
Contributor

This is a really very long sleep interval.

Author

> This is a really very long sleep interval.

Yes. On some machines, the freeze may fail unless we sleep longer.

Member

@lifubang lifubang Sep 14, 2024

In my opinion, this looks like a cat-and-mouse game: when the mouse runs faster, the cat needs more strength.
On another slow machine we may need 6 seconds or more. We already sleep many times and return an error to the caller on failure. I think the caller should have a retry mechanism for when it gets an error on a slow machine. That is the caller's job, not this function's. WDYT?

Author

> In my opinion, this looks like a cat-and-mouse game: when the mouse runs faster, the cat needs more strength. On another slow machine we may need 6 seconds or more. We already sleep many times and return an error to the caller on failure. I think the caller should have a retry mechanism for when it gets an error on a slow machine. That is the caller's job, not this function's. WDYT?

Indeed, as you mentioned, it's hard to determine the exact sleep time. In my case, I was running an LLM training container and needed to checkpoint it and export a snapshot. A relatively long sleep was required to read the correct value of freezer.state and ensure that the checkpoint succeeded. Of course, if necessary, I think we could add a retry mechanism to make the program foolproof.

@botieking98
Author

@kolyshkin

botieking98 commented Sep 23, 2024

@lifubang @kolyshkin I have added a retry mechanism, PTAL again :)

In some instances, runc pause still fails with
`ctr: OCI runtime pause failed: unable to freeze: unknown`.

We should let it sleep longer on very slow systems or machines.

Signed-off-by: Song Zhang <zhangsong34@huawei.com>
3 participants