virtcontainers : fix shared dir resource remaining #292
Conversation
virtcontainers/container.go
Outdated
} else if err := os.RemoveAll(m.HostPath); err != nil {
// since the mounts related to the shared dir is umounted
// we need to remove the host path to avoid resource remaining
c.Logger().WithFields(logrus.Fields{
This log call essentially simulates a call to WithError(), and although it is simpler than having to call WithError(), I think it's safer to explicitly call WithError() to protect against it changing its behaviour at some future point.
Since this is based on the log call code above, I'd do something like:
logger := c.Logger().WithField("host-path", m.HostPath)
if err := syscall.Unmount(m.HostPath, 0); err != nil {
logger.WithError(err).Warn("could not unmount")
} else if err := os.RemoveAll(m.HostPath); err != nil {
logger.WithError(err).Warn("could not remove path")
}
Going to change it now~
virtcontainers/container.go
Outdated
@@ -405,6 +405,14 @@ func (c *Container) unmountHostMounts() error {
"error": err,
}).Warn("Could not umount")
return err
} else if err := os.RemoveAll(m.HostPath); err != nil {
I think you should be able to create a unit test for this in container_test.go where you set c.mounts[0] to contain a double bind-mounted path. That way, the Unmount() will succeed but the RemoveAll() will fail (as there is still one more Unmount() call to make).
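Something along these lines could work as a starting point (only a sketch, untested: it assumes root privileges, the usual container_test.go imports (io/ioutil, os, syscall, testing), and that a Container can be built with just its mounts field populated — the real test will probably need the same setup helpers the surrounding tests use):
func TestUnmountHostMountsRemoveBindMountFailure(t *testing.T) {
	if os.Geteuid() != 0 {
		t.Skip("Test disabled as requires root privileges")
	}

	src, err := ioutil.TempDir("", "src")
	if err != nil {
		t.Fatal(err)
	}
	defer os.RemoveAll(src)

	dst, err := ioutil.TempDir("", "dst")
	if err != nil {
		t.Fatal(err)
	}
	defer os.RemoveAll(dst)

	// Bind-mount the source onto the destination twice, so that the first
	// Unmount() in unmountHostMounts() succeeds, but the path is still a
	// mount point and the following RemoveAll() fails with EBUSY.
	for i := 0; i < 2; i++ {
		if err := syscall.Mount(src, dst, "", syscall.MS_BIND, ""); err != nil {
			t.Fatal(err)
		}
	}
	defer syscall.Unmount(dst, 0)

	c := &Container{
		mounts: []Mount{
			{HostPath: dst},
		},
	}

	if err := c.unmountHostMounts(); err == nil {
		t.Fatal("unmountHostMounts should have failed on RemoveAll")
	}
}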
Codecov Report
@@ Coverage Diff @@
## master #292 +/- ##
==========================================
- Coverage 63.79% 63.77% -0.02%
==========================================
Files 87 87
Lines 8811 8823 +12
==========================================
+ Hits 5621 5627 +6
- Misses 2587 2592 +5
- Partials 603 604 +1
Continue to review full report at Codecov.
virtcontainers/sandbox.go
Outdated
@@ -1162,6 +1162,12 @@ func (s *Sandbox) stop() error {
return err
}

// vm is stopped remove the sandbox shared dir
sandboxSharedDir := filepath.Join(kataHostSharedDir, s.id)
This cannot be done here. kataHostSharedDir is specific to the kata_agent.go implementation, and should never be used by the generic sandbox implementation.
Sry for missing this. This should be done in a more generic way.
@jshachm You can introduce an agent interface method to clean up agent-specific artifacts after the VM has stopped.
But wait, any cleanup specific to an agent implementation should be part of either stopContainer() if it's related to a container, or stopSandbox() if it's related to the sandbox itself. I don't see the value (and actually this would imply duplication) of introducing a new method to the agent interface.
@sboeuf The kata host directory can only be deleted after the VM has stopped since it is shared with the VM through 9p. So this cannot be part of stopContainer or stopSandbox.
Oh yes good point, then in this case, an agent.cleanup() function as you suggested sounds like a good idea.
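For reference, a rough sketch of what the interface change could look like (the method name matches the cleanupSandbox() the PR ends up adding below; everything else here is illustrative):
// agent.go: the interface gains a hook for host-side resources that can
// only be removed once the VM is stopped (e.g. the 9p shared dir).
type agent interface {
	// ... existing methods ...

	// cleanupSandbox removes any agent-specific artifacts left on the
	// host once the sandbox VM has gone away.
	cleanupSandbox(sandbox *Sandbox) error
}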
Need to find a better way to remove sandbox work dirs and to handle the case where shared dirs are mounted to different destinations. So do-not-merge for now.
PSS Measurement: Memory inside container:
@sboeuf @amshinde got it. Maybe we need more unit tests as @jodh-intel suggested. Will push those later.
thanks @jshachm!
Looks good, just a few things to update.
virtcontainers/kata_agent.go
Outdated
@@ -528,6 +528,14 @@ func (k *kataAgent) stopSandbox(sandbox *Sandbox) error {
return k.proxy.stop(sandbox, k.state.ProxyPid)
}

func (k *kataAgent) cleanupSandbox(sandbox *Sandbox) error {
sandboxSharedDir := filepath.Join(kataHostSharedDir, sandbox.id)
if err := os.RemoveAll(sandboxSharedDir); err != nil {
Please simplify the function this way:
func (k *kataAgent) cleanupSandbox(sandbox *Sandbox) error {
return os.RemoveAll(filepath.Join(kataHostSharedDir, sandbox.id))
}
if err := s.agent.cleanupSandbox(s); err != nil {
// cleanup resource failed shouldn't block destroy sandbox
// just raise a warning
s.Logger().WithError(err).Warnf("cleanup sandbox failed")
Looks fair!
// cleanup resource failed shouldn't block destroy sandbox
// just raise a warning
s.Logger().WithError(err).Warnf("cleanup sandbox failed")
} |
Please add a blank line to let the code breathe :)
@@ -41,7 +41,7 @@ var testQemuPath = ""
var testHyperstartCtlSocket = ""
var testHyperstartTtySocket = ""

-// cleanUp Removes any stale sandbox/container state that can affect
+// cleanup Removes any stale sandbox/container state that can affect
This should not be part of this PR.
@sboeuf all changes done~
Thanks, I'll take a look!
PSS Measurement: Memory inside container:
@jshachm LGTM
@sboeuf Can't find the button in jenkins ... I can only add a
PSS Measurement: Memory inside container:
CI failures. This is from the 16.04 log:
I'm seeing the same in the f27 run as well, so I suspect this is not just spurious.
Otherwise, yes, forcing a change of the commit SHA is the other way to force a build - modding the commit message, or doing a rebase if something has been merged into the master branch, are the most popular ways ;-)
I'd really like to see a description in the commit message that details:
- what was wrong, and why it failed
- how and why we fixed it
The commit message right now doesn't really tell me if the dir was always left behind, or just under certain circumstances etc.
Having said that, I'm generally all for cleaning up :-)
lgtm
virtcontainers/container.go
Outdated
"error": err, | ||
}).Warn("Could not umount") | ||
logger.WithError(err).Warn("Could not umount") | ||
return err |
Not a change by your patch, but directly related - it seems if we fail one unmount then we quit the loop. Would it maybe be better for us to continue to try and unmount the remaining mounts in c.mounts, and clean up as much as we can (even though we know we will ultimately fail this function)? Same for the following RemoveAll as well.
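For illustration, the "carry on and collect the first error" variant could look roughly like this (a sketch only — the real loop has more per-mount handling than shown here):
var firstErr error

for _, m := range c.mounts {
	logger := c.Logger().WithField("host-path", m.HostPath)

	if err := syscall.Unmount(m.HostPath, 0); err != nil {
		logger.WithError(err).Warn("could not unmount")
		if firstErr == nil {
			firstErr = err
		}
		continue
	}

	if err := os.RemoveAll(m.HostPath); err != nil {
		logger.WithError(err).Warn("could not remove path")
		if firstErr == nil {
			firstErr = err
		}
	}
}

return firstErr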
@grahamwhaley great catch and suggestions~
When I coded this part, it also took me a lot of time to decide whether continuing to the next mount is better than breaking here. After thinking it over and discussing with teammates, plus some "sad" experience from our production environment, I think an umount failure is a very bad situation, possibly caused by a mount point leak or other bad things... So we need to break right here and raise a warning to the upper platform, like k8s.
In other words, if a resource has already been left behind, it may be better to alert the SRE department as quickly as possible so they can get involved in the problem.
What do you think about this? Glad to hear your suggestion in this situation~~~ ^_^
Hi @jshachm Yeah, it is a tricky call whether we quit or carry on. I can see your point that 'fail fast' and trying not to make things worse is perfectly valid. Maybe that is the safer way, which will:
- possibly cause 'less damage' if things are really broken
- give the admins more of a chance to diagnose the problem
So, I'm happy to leave it as is. If we do, can we add a big comment saying why we quit on the first failure ;-)
Probably best if we get some other input, maybe from @egernst and @gnawux, and others if anybody has a strong opinion on this?
I think the test failures are kata-containers/tests#427 (see #385 (comment)).
@grahamwhaley thx for the reminder from @jodh-intel. As you mentioned, CI will not be happy until we change the CNI ping scripts?
@jshachm if you rebase your PR on top of master (make sure you fetch), you should get the CI tests passing since the CNI issue got fixed.
@jshachm Yes, CI has been fixed now. Please rebase on top of master.
PSS Measurement: Memory inside container:
Before this patch the shared dir will remain when the sandbox has already been removed, especially for the kata-agent mode. Do clean up shared dirs after all mounts are umounted. Fixes: #291 Signed-off-by: Haomin <caihaomin@huawei.com>
PSS Measurement: Memory inside container:
logging: Add sandbox field
Before this patch the shared dir will remain when the sandbox
has already been removed.
Do clean up shared dirs after all mounts are umounted.

Fixes: #291

Signed-off-by: Haomin <caihaomin@huawei.com>