libcontainer/cgroups/fs: fix OCI runtime pause failed #4388
base: main
7ce71d3 to 296d0c8
libcontainer/cgroups/fs/freezer.go
if i%500 == 499 {
	// should sleep a longer time for
	// some really very slow machine.
	time.Sleep(5 * time.Second)
This is a really very long sleep interval.
> This is a really very long sleep interval.

Yes. On some machines, freezing can fail unless the sleep is longer.
In my opinion, this looks like a cat-and-mouse game: when the mouse runs faster, the cat needs more strength.
On another slow machine we may need 6 seconds or more. We already sleep many times, and we return an error to the caller on failure. I think the caller should have a retry mechanism for when it gets an error on a slow machine. That is the caller's job, not this function's. WDYT?
> In my opinion, this looks like a cat-and-mouse game: when the mouse runs faster, the cat needs more strength. On another slow machine we may need 6 seconds or more. We already sleep many times, and we return an error to the caller on failure. I think the caller should have a retry mechanism for when it gets an error on a slow machine. That is the caller's job, not this function's. WDYT?

Indeed, as you mentioned, it's hard to determine the exact sleep time. In my case, I was running an LLM training container and needed to checkpoint it and export a snapshot. A relatively long sleep was required before `freezer.state` read back the correct value, so that the checkpoint succeeded. Of course, if necessary, I think we could add a retry mechanism to make this more robust.
296d0c8 to f0936b9
f0936b9 to c6112ac
@lifubang @kolyshkin I have added a retry mechanism, PTAL again :)
c6112ac to e12061a
In some cases, runc pause still fails with `ctr: OCI runtime pause failed: unable to freeze: unknown`. We should let it sleep longer on really slow systems or machines.

Signed-off-by: Song Zhang <zhangsong34@huawei.com>
e12061a to b2f8637