libcontainer/cgroups/fs: fix OCI runtime pause failed #4388

botieking98 · 2024-08-31T03:06:44Z

In some cases, runc pause still fails with
ctr: OCI runtime pause failed: unable to freeze: unknown.

We should let it sleep longer to accommodate very slow systems or machines.

kolyshkin · 2024-09-03T17:55:28Z

libcontainer/cgroups/fs/freezer.go

+			if i%500 == 499 {
+				// should sleep a longer time for
+				// some really very slow machine.
+				time.Sleep(5 * time.Second)


This is a really very long sleep interval.


Yes, but on some machines the freeze may fail unless it sleeps longer.

In my opinion, this looks like a cat-and-mouse game: when the mouse runs faster, the cat needs more strength.
So on another slow machine we might need 6 seconds or more. We already sleep many times and return an error to the caller on failure. I think the caller should have a retry mechanism for when it gets an error on a slow machine. That is the caller's job, not this function's. WDYT?


Indeed, as you mentioned, it's hard to determine the exact sleep time. In my case, I was running an LLM training container and needed to checkpoint it and export a snapshot. A relatively long sleep was required to read the correct value of freezer.state so that the checkpoint succeeded. Of course, if necessary, I think we could add a retry mechanism to make the program more robust.

botieking98 · 2024-09-14T09:20:23Z

@kolyshkin

botieking98 · 2024-09-23T12:46:47Z

@lifubang @kolyshkin I have added a retry mechanism, PTAL again :)

In some cases, runc pause still fails with `ctr: OCI runtime pause failed: unable to freeze: unknown`. We should let it sleep longer to accommodate very slow systems or machines.

Signed-off-by: Song Zhang <zhangsong34@huawei.com>

botieking98 force-pushed the fix-set-freeze branch 2 times, most recently from 7ce71d3 to 296d0c8 Compare August 31, 2024 03:26

botieking98 mentioned this pull request Sep 2, 2024

Cannot pause container - OCI runtime pause failed: unable to freeze: unknown" error_type="*errors.errorString" module=api moby/moby#48205

Open

kolyshkin reviewed Sep 3, 2024


botieking98 requested a review from kolyshkin September 9, 2024 01:53

botieking98 force-pushed the fix-set-freeze branch from 296d0c8 to f0936b9 Compare September 14, 2024 07:03

botieking98 force-pushed the fix-set-freeze branch from f0936b9 to c6112ac Compare September 23, 2024 12:43

botieking98 force-pushed the fix-set-freeze branch from c6112ac to e12061a Compare September 23, 2024 12:49

botieking98 force-pushed the fix-set-freeze branch from e12061a to b2f8637 Compare September 23, 2024 12:57

botieking98 requested a review from lifubang October 11, 2024 09:23

kolyshkin mentioned this pull request Oct 28, 2024

flaky tests: TestUsernsCheckpoint, TestCheckpoint #4273

Open
