This repository has been archived by the owner on May 12, 2021. It is now read-only.

virtcontainers : fix shared dir resource remaining #292

Merged

merged 1 commit on Jun 19, 2018

Conversation

@jshachm (Member) commented May 9, 2018

Before this patch the shared dir would remain after the sandbox
had already been removed.

Clean up shared dirs after all mounts are unmounted.

Fix: #291

Signed-off-by: Haomin caihaomin@huawei.com

} else if err := os.RemoveAll(m.HostPath); err != nil {
// since the mounts related to the shared dir is umounted
// we need to remove the host path to avoid resource remaining
c.Logger().WithFields(logrus.Fields{
Contributor:

This log call essentially simulates a call to WithError() and although it is simpler than having to call WithError(), I think it's safer to call WithError() explicitly, to protect against its behaviour changing at some future point.

Since this is based on the log call code above, I'd do something like:

logger := c.Logger().WithField("host-path", m.HostPath)

if err := syscall.Unmount(m.HostPath, 0); err != nil {
    logger.WithError(err).Warn("could not unmount")
} else if err := os.RemoveAll(m.HostPath); err != nil {
    logger.WithError(err).Warn("could not remove path")
}

Member Author:

Going to change it now~

@@ -405,6 +405,14 @@ func (c *Container) unmountHostMounts() error {
"error": err,
}).Warn("Could not umount")
return err
} else if err := os.RemoveAll(m.HostPath); err != nil {
Contributor:

I think you should be able to create a unit test for this in container_test.go where you set c.mounts[0] to contain a double bind-mounted path. That way, the Unmount() will succeed but the RemoveAll() will fail (as there is still one more Unmount() call to make).
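A minimal standalone sketch of that scenario (illustrative only, not the PR's actual test: the function name and the fact that it drives the mount calls directly, rather than going through Container.unmountHostMounts() and c.mounts, are assumptions):

package virtcontainers

import (
    "io/ioutil"
    "os"
    "syscall"
    "testing"
)

func TestDoubleBindMountRemoveAll(t *testing.T) {
    if os.Geteuid() != 0 {
        t.Skip("needs root to bind-mount")
    }

    src, err := ioutil.TempDir("", "vc-src")
    if err != nil {
        t.Fatal(err)
    }
    defer os.RemoveAll(src)

    dst, err := ioutil.TempDir("", "vc-dst")
    if err != nil {
        t.Fatal(err)
    }

    // Bind-mount the same source onto the destination twice, stacking
    // two mounts on the same path.
    for i := 0; i < 2; i++ {
        if err := syscall.Mount(src, dst, "bind", syscall.MS_BIND, ""); err != nil {
            t.Fatal(err)
        }
    }

    // The first Unmount() succeeds, but one bind mount is still in place...
    if err := syscall.Unmount(dst, 0); err != nil {
        t.Fatal(err)
    }

    // ...so RemoveAll() is expected to fail with EBUSY, which is exactly
    // the error path the new code needs to handle.
    if err := os.RemoveAll(dst); err == nil {
        t.Fatal("expected RemoveAll to fail while a mount remains")
    }

    // Clean up the remaining mount and the directory.
    syscall.Unmount(dst, 0)
    os.RemoveAll(dst)
}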

@codecov bot commented May 9, 2018

Codecov Report

Merging #292 into master will decrease coverage by 0.01%.
The diff coverage is 35.29%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #292      +/-   ##
==========================================
- Coverage   63.79%   63.77%   -0.02%     
==========================================
  Files          87       87              
  Lines        8811     8823      +12     
==========================================
+ Hits         5621     5627       +6     
- Misses       2587     2592       +5     
- Partials      603      604       +1
Impacted Files Coverage Δ
virtcontainers/agent.go 92.15% <ø> (ø) ⬆️
virtcontainers/container.go 46.59% <0%> (-0.1%) ⬇️
virtcontainers/sandbox.go 66.83% <0%> (-0.23%) ⬇️
virtcontainers/noop_agent.go 92% <100%> (+0.33%) ⬆️
virtcontainers/hyperstart_agent.go 58.76% <100%> (+0.18%) ⬆️
virtcontainers/kata_agent.go 30.32% <33.33%> (+0.07%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6f409b9...8a6d383. Read the comment docs.

@@ -1162,6 +1162,12 @@ func (s *Sandbox) stop() error {
return err
}

// vm is stopped remove the sandbox shared dir
sandboxSharedDir := filepath.Join(kataHostSharedDir, s.id)

This cannot be done here. kataHostSharedDir is specific to the kata_agent.go implementation, and should never be used by the generic sandbox implementation.

Member Author:

Sorry for missing this. This should be done in a more generic way.

Member:

@jshachm You can introduce an agent interface to clean up agent-specific artifacts after the VM has stopped.
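A rough sketch of what that hook could look like (the interface body and the noopAgent implementation shown here are assumptions based on this discussion, not lines from the PR diff):

// In agent.go: extend the agent interface with a cleanup hook that runs
// after the VM has stopped.
type agent interface {
    // ...existing methods such as createSandbox/startSandbox/stopSandbox...

    // cleanupSandbox removes host-side artifacts (e.g. shared directories)
    // that can only be deleted once the VM is gone.
    cleanupSandbox(sandbox *Sandbox) error
}

// In noop_agent.go: the no-op agent has nothing to clean up.
func (n *noopAgent) cleanupSandbox(sandbox *Sandbox) error {
    return nil
}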


But wait, any cleanup specific to an agent implementation should be part of either stopContainer() if it's related to a container, or stopSandbox() if it's related to the sandbox itself.
I don't see the value (and actually this would imply duplication) of introducing a new method to the agent interface.

Member:

@sboeuf The kata host directory can only be deleted after the VM has stopped since it is shared with the VM through 9p. So this cannot be part of stopContainer or stopSandbox.


Oh yes, good point. In this case, an agent.cleanup() function as you suggested sounds like a good idea.

@jshachm (Member Author) commented May 15, 2018

Need to find a better way to remove sandbox work dirs and to handle the case where shared dirs are mounted to different destinations. So do-not-merge for now.

@jshachm (Member Author) commented Jun 1, 2018

@amshinde Good suggestions. Working on it, and maybe it will also help with #351.

@katabuilder

PSS Measurement:
Qemu: 159779 KB
Proxy: 8730 KB
Shim: 10810 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 1996996 KB

@jshachm (Member Author) commented Jun 5, 2018

@sboeuf @amshinde cleanupSandbox is done.

It may need more unit tests, as @jodh-intel suggested. Will push those later.

@amshinde (Member) commented Jun 5, 2018

thanks @jshachm!
lgtm

Approved with PullApprove

@sboeuf left a comment

Looks good, just a few things to update.

@@ -528,6 +528,14 @@ func (k *kataAgent) stopSandbox(sandbox *Sandbox) error {
return k.proxy.stop(sandbox, k.state.ProxyPid)
}

func (k *kataAgent) cleanupSandbox(sandbox *Sandbox) error {
sandboxSharedDir := filepath.Join(kataHostSharedDir, sandbox.id)
if err := os.RemoveAll(sandboxSharedDir); err != nil {

Please simplify the function this way:

func (k *kataAgent) cleanupSandbox(sandbox *Sandbox) error {
        return os.RemoveAll(filepath.Join(kataHostSharedDir, sandbox.id))
}

if err := s.agent.cleanupSandbox(s); err != nil {
// cleanup resource failed shouldn't block destroy sandbox
// just raise a warning
s.Logger().WithError(err).Warnf("cleanup sandbox failed")

Looks fair !

// cleanup resource failed shouldn't block destroy sandbox
// just raise a warning
s.Logger().WithError(err).Warnf("cleanup sandbox failed")
}

Please add a blank line to let the code breathe :)

@@ -41,7 +41,7 @@ var testQemuPath = ""
var testHyperstartCtlSocket = ""
var testHyperstartTtySocket = ""

// cleanUp Removes any stale sandbox/container state that can affect
// cleanup Removes any stale sandbox/container state that can affect

This should not be part of this PR.

@jshachm (Member Author) commented Jun 13, 2018

@sboeuf all changes done~

@sboeuf commented Jun 13, 2018

Thanks, I'll take a look!

@katacontainersbot (Contributor)

PSS Measurement:
Qemu: 141320 KB
Proxy: 4567 KB
Shim: 8750 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 2007284 KB

@sboeuf commented Jun 15, 2018

@jshachm LGTM
Could you please repush to retrigger the CI, now that we should not have glitches anymore?

@jshachm (Member Author) commented Jun 18, 2018

@sboeuf Can't find the button in Jenkins... I can only add a "." to my commit message to rebuild the CI...

@katacontainersbot (Contributor)

PSS Measurement:
Qemu: 143861 KB
Proxy: 4694 KB
Shim: 9144 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 2007308 KB

@grahamwhaley (Contributor)

CI failures. This is from the 16.04 log:

=== RUN   TestStartStopSandboxHyperstartAgentSuccessfulWithCNINetwork
--- FAIL: TestStartStopSandboxHyperstartAgentSuccessfulWithCNINetwork (0.01s)
	cc_proxy_mock.go:80: [CCProxyMock] listening on /tmp/cc-proxy-test799007796/cc-proxy-test.sock
	api_test.go:1425: missing network name
	cc_proxy_mock.go:80: [CCProxyMock] Client connection already closed
	cc_proxy_mock.go:80: [CCProxyMock] Stopped
=== RUN   TestStartStopSandboxHyperstartAgentSuccessfulWithCNMNetwork
--- FAIL: TestStartStopSandboxHyperstartAgentSuccessfulWithCNMNetwork (0.00s)
	cc_proxy_mock.go:80: [CCProxyMock] listening on /tmp/cc-proxy-test151597571/cc-proxy-test.sock
	api_test.go:1463: sandbox 7f49d00d-1995-4156-8c79-5f5ab24ce138 exists
	cc_proxy_mock.go:80: [CCProxyMock] Client connection already closed
	cc_proxy_mock.go:80: [CCProxyMock] Stopped

I'm seeing the same in the f27 run as well, so I suspect this is not just spurious.
(side note for @jshachm - you can only press the 'rebuild' button on the CI if:

  • you are logged in with your github ID
  • and you are a member of the kata github org

Otherwise, yes, forcing a change of the commit SHA is the other way to force a build - modifying the commit message, or doing a rebase if something has been merged into the master branch, are the most popular ways ;-) )

@grahamwhaley (Contributor) left a comment

I'd really like to see a description in the commit message that details:

  • what was wrong, and why it failed
  • how and why we fixed it

The commit message right now doesn't really tell me if the dir was always left behind, or just under certain circumstances etc.

Having said that, I'm generally in favour of cleaning up :-)
lgtm

"error": err,
}).Warn("Could not umount")
logger.WithError(err).Warn("Could not umount")
return err
Contributor:

Not a change by your patch, but directly related - it seems if we fail one unmount then we quit the loop. Would it maybe be better for us to continue to try and unmount the remaining mounts in c.mounts, and clean up as much as we can (even though we know we will ultimately fail this function)?
Same for the following RemoveAll as well.
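For reference, the alternative being suggested might look roughly like this (a sketch only; the loop body and the HostPath check are assumptions, and the real unmountHostMounts() has additional handling omitted here):

// Sketch (not the PR's code): keep going on failure, warn per mount, and
// report the first error at the end instead of returning immediately.
func (c *Container) unmountHostMounts() error {
    var firstErr error

    for _, m := range c.mounts {
        // Skip mounts without a host path (illustrative check).
        if m.HostPath == "" {
            continue
        }

        logger := c.Logger().WithField("host-path", m.HostPath)

        if err := syscall.Unmount(m.HostPath, 0); err != nil {
            logger.WithError(err).Warn("Could not umount")
            if firstErr == nil {
                firstErr = err
            }
            continue
        }

        if err := os.RemoveAll(m.HostPath); err != nil {
            logger.WithError(err).Warn("Could not remove path")
            if firstErr == nil {
                firstErr = err
            }
        }
    }

    return firstErr
}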

Member Author:

@grahamwhaley great catch and suggestions~

When I wrote this part, it also took me a long time to decide whether continuing to the next mount is better than breaking here. After thinking it over, discussing with teammates, and some "sad" experience from our production environment, I think an umount failure is a very bad situation, probably caused by a mount point leak or something else going wrong... So we need to break right here and raise a warning to the upper platform, such as k8s.

In other words, if leftover resources have already appeared, it is probably better to alert the SRE team as quickly as possible so they can get involved in the problem.

What do you think about this? Glad to hear your suggestions for this situation~~~ ^_^

Contributor:

Hi @jshachm Yeah, it is a tricky call whether we quit or carry on. I can see your point that 'fail fast' and trying not to make things worse is perfectly valid. Maybe that is the safer way, which will:

  • possibly cause 'less damage' if things are really broken
  • give the admins more of a chance to diagnose the problem

So, I'm happy to leave as is. If we do, can we add a big comment saying why we quit on the first failure ;-)
Probably best if we get some other input, maybe from @egernst and @gnawux, and from others if anybody has a strong opinion on this?
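A sketch of what such a comment could look like next to the early return (the wording and the surrounding code fragment are assumed, not taken from the PR):

if err := syscall.Unmount(m.HostPath, 0); err != nil {
    // Deliberately bail out on the first failure: an unmount error
    // usually means a leaked mount point or a more serious problem, so
    // stop touching the remaining mounts and surface the error to the
    // upper layer (e.g. k8s) rather than trying to limp along.
    logger.WithError(err).Warn("Could not umount")
    return err
}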

@jodh-intel (Contributor)

I think the test failures are kata-containers/tests#427 (see #385 (comment)).

@jshachm (Member Author) commented Jun 18, 2018

@grahamwhaley thanks for the reminder about the commit message. I will update it and make it clearer once I get to my workstation tomorrow morning... As for the CI, with the help of @chavafg I finally got my button back~

@jodh-intel As you mentioned, the CI will not be happy until we change the CNI ping scripts?

@sboeuf commented Jun 18, 2018

@jshachm if you rebase your PR on top of master (make sure you fetch first), you should get the CI tests passing, since the CNI issue has been fixed.

@amshinde (Member)

@jshachm Yes, CI has been fixed now. Please rebase on top of master.

@katacontainersbot (Contributor)

PSS Measurement:
Qemu: 150043 KB
Proxy: 4501 KB
Shim: 8932 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 2007308 KB

Before this patch the shared dir would remain after the sandbox
had already been removed, especially for the kata-agent mode.

Clean up shared dirs after all mounts are unmounted.

Fixes: #291

Signed-off-by: Haomin <caihaomin@huawei.com>
@katacontainersbot (Contributor)

PSS Measurement:
Qemu: 141670 KB
Proxy: 4710 KB
Shim: 8882 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 2007308 KB

zklei pushed a commit to zklei/runtime that referenced this pull request Jun 13, 2019