virtcontainers : fix shared dir resource remaining #292
Conversation
virtcontainers/container.go
Outdated
} else if err := os.RemoveAll(m.HostPath); err != nil {
// since the mounts related to the shared dir is umounted
// we need to remove the host path to avoid resource remaining
c.Logger().WithFields(logrus.Fields{
This log call essentially simulates a call to WithError(), and although it is simpler than having to call WithError(), I think it's safer to explicitly call WithError() to protect against it changing its behaviour at some future point.
Since this is based on the log call code above, I'd do something like:
logger := c.Logger().WithField("host-path", m.HostPath)
if err := syscall.Unmount(m.HostPath, 0); err != nil {
logger.WithError(err).Warn("could not unmount")
} else if err := os.RemoveAll(m.HostPath); err != nil {
logger.WithError(err).Warn("could not remove path")
}
Going to change it now~
virtcontainers/container.go
Outdated
@@ -405,6 +405,14 @@ func (c *Container) unmountHostMounts() error {
"error": err,
}).Warn("Could not umount")
return err
} else if err := os.RemoveAll(m.HostPath); err != nil {
I think you should be able to create a unit test for this in container_test.go where you set c.mounts[0] to contain a double bind-mounted path. That way, the Unmount() will succeed but the RemoveAll() will fail (as there is still one more Unmount() call to make).
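Something along these lines could work as a starting point (only a sketch, untested: it assumes root privileges, the usual container_test.go imports (io/ioutil, os, syscall, testing), and that a Container can be built with just its mounts field populated — the real test will probably need the same setup helpers the surrounding tests use):
func TestUnmountHostMountsRemoveBindMountFailure(t *testing.T) {
	if os.Geteuid() != 0 {
		t.Skip("Test disabled as requires root privileges")
	}

	src, err := ioutil.TempDir("", "src")
	if err != nil {
		t.Fatal(err)
	}
	defer os.RemoveAll(src)

	dst, err := ioutil.TempDir("", "dst")
	if err != nil {
		t.Fatal(err)
	}
	defer os.RemoveAll(dst)

	// Bind-mount the source onto the destination twice, so that the first
	// Unmount() in unmountHostMounts() succeeds, but the path is still a
	// mount point and the following RemoveAll() fails with EBUSY.
	for i := 0; i < 2; i++ {
		if err := syscall.Mount(src, dst, "", syscall.MS_BIND, ""); err != nil {
			t.Fatal(err)
		}
	}
	defer syscall.Unmount(dst, 0)

	c := &Container{
		mounts: []Mount{
			{HostPath: dst},
		},
	}

	if err := c.unmountHostMounts(); err == nil {
		t.Fatal("unmountHostMounts should have failed on RemoveAll")
	}
}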
Codecov Report
@@ Coverage Diff @@
## master #292 +/- ##
==========================================
- Coverage 63.79% 63.77% -0.02%
==========================================
Files 87 87
Lines 8811 8823 +12
==========================================
+ Hits 5621 5627 +6
- Misses 2587 2592 +5
- Partials 603 604 +1
Continue to review full report at Codecov.
virtcontainers/sandbox.go
Outdated
@@ -1162,6 +1162,12 @@ func (s *Sandbox) stop() error {
return err
}

// vm is stopped remove the sandbox shared dir
sandboxSharedDir := filepath.Join(kataHostSharedDir, s.id)
This cannot be done here. kataHostSharedDir is specific to the kata_agent.go implementation, and should never be used by the generic sandbox implementation.
Sry for missing this. This should be done in a more generic way.
@jshachm You can introduce an agent interface method to clean up agent-specific artifacts after the VM has stopped.
But wait, any cleanup specific to an agent implementation should be part of either stopContainer() if it's related to a container, or stopSandbox() if it's related to the sandbox itself. I don't see the value (and actually this would imply duplication) of introducing a new method to the agent interface.
@sboeuf The kata host directory can only be deleted after the VM has stopped since it is shared with the VM through 9p. So this cannot be part of stopContainer or stopSandbox.
Oh yes good point, then in this case, an agent.cleanup() function as you suggested sounds like a good idea.
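For reference, a rough sketch of what the interface change could look like (the method name matches the cleanupSandbox() the PR ends up adding below; everything else here is illustrative):
// agent.go: the interface gains a hook for host-side resources that can
// only be removed once the VM is stopped (e.g. the 9p shared dir).
type agent interface {
	// ... existing methods ...

	// cleanupSandbox removes any agent-specific artifacts left on the
	// host once the sandbox VM has gone away.
	cleanupSandbox(sandbox *Sandbox) error
}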
Need to find a better way to remove sandbox work dirs and to handle the case where shared dirs are mounted to different destinations. So do-not-merge for now.
PSS Measurement: Memory inside container:
@sboeuf @amshinde got it. Maybe we need more unit tests as @jodh-intel suggested. Will push those later.
thanks @jshachm!
Looks good, just a few things to update.
virtcontainers/kata_agent.go
Outdated
@@ -528,6 +528,14 @@ func (k *kataAgent) stopSandbox(sandbox *Sandbox) error {
return k.proxy.stop(sandbox, k.state.ProxyPid)
}

func (k *kataAgent) cleanupSandbox(sandbox *Sandbox) error {
sandboxSharedDir := filepath.Join(kataHostSharedDir, sandbox.id)
if err := os.RemoveAll(sandboxSharedDir); err != nil {
Please simplify the function this way:
func (k *kataAgent) cleanupSandbox(sandbox *Sandbox) error {
return os.RemoveAll(filepath.Join(kataHostSharedDir, sandbox.id))
}
if err := s.agent.cleanupSandbox(s); err != nil {
// cleanup resource failed shouldn't block destroy sandbox
// just raise a warning
s.Logger().WithError(err).Warnf("cleanup sandbox failed")
Looks fair!
// cleanup resource failed shouldn't block destroy sandbox
// just raise a warning
s.Logger().WithError(err).Warnf("cleanup sandbox failed")
} |
Please add a blank line to let the code breathe :)
@@ -41,7 +41,7 @@ var testQemuPath = ""
var testHyperstartCtlSocket = ""
var testHyperstartTtySocket = ""

-// cleanUp Removes any stale sandbox/container state that can affect
+// cleanup Removes any stale sandbox/container state that can affect
This should not be part of this PR.
@sboeuf all changes done~
Thanks, I'll take a look!
PSS Measurement: Memory inside container:
@jshachm LGTM
@sboeuf Can't find the button in jenkins ... I can only add a
PSS Measurement: Memory inside container:
CI failures. This is from the 16.04 log:
I'm seeing the same in the f27 run as well, so I suspect this is not just spurious.
Otherwise, yes, forcing a change of the commit SHA is the other way to force a build - modding the commit message, or doing a rebase if something has been merged into the master branch, are the most popular ways ;-)
I'd really like to see a description in the commit message that details:
- what was wrong, and why it failed
- how and why we fixed it
The commit message right now doesn't really tell me if the dir was always left behind, or just under certain circumstances etc.
Having said that, I'm generally all for cleaning up :-)
lgtm
virtcontainers/container.go
Outdated
"error": err, | ||
}).Warn("Could not umount") | ||
logger.WithError(err).Warn("Could not umount") | ||
return err |
Not a change by your patch, but directly related - it seems if we fail one unmount then we quit the loop. Would it maybe be better for us to continue to try and unmount the remaining mounts in c.mounts, and clean up as much as we can (even though we know we will ultimately fail this function)? Same for the following RemoveAll as well.
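For illustration, the "carry on and collect the first error" variant could look roughly like this (a sketch only — the real loop has more per-mount handling than shown here):
var firstErr error

for _, m := range c.mounts {
	logger := c.Logger().WithField("host-path", m.HostPath)

	if err := syscall.Unmount(m.HostPath, 0); err != nil {
		logger.WithError(err).Warn("could not unmount")
		if firstErr == nil {
			firstErr = err
		}
		continue
	}

	if err := os.RemoveAll(m.HostPath); err != nil {
		logger.WithError(err).Warn("could not remove path")
		if firstErr == nil {
			firstErr = err
		}
	}
}

return firstErr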
@grahamwhaley great catch and suggestions~
When I coded this part, it also took me a lot of time to decide whether continuing to the next mount is better than breaking here. After thinking it over and discussing with teammates, plus some "sad" experience from our production environment, I think an umount failure is a very bad situation, possibly caused by a mount point leak or other bad things... So we need to break right here and raise a warning to the upper platform, like k8s.
In other words, if a resource has already been left behind, it may be better to alert the SRE department as quickly as possible so they can get involved in the problem.
What do you think about this? Glad to hear your suggestion in this situation~~~ ^_^
Hi @jshachm Yeah, it is a tricky call whether we quit or carry on. I can see your point that 'fail fast' and trying not to make things worse is perfectly valid. Maybe that is the safer way, which will:
- possibly cause 'less damage' if things are really broken
- give the admins more of a chance to diagnose the problem
So, I'm happy to leave it as is. If we do, can we add a big comment saying why we quit on the first failure ;-)
Probably best if we get some other input, maybe from @egernst and @gnawux, and others if anybody has a strong opinion on this?
I think the test failures are kata-containers/tests#427 (see #385 (comment)).
@grahamwhaley thx for the reminder from @jodh-intel. As you mentioned, CI will not be happy until we change the CNI ping scripts?
@jshachm if you rebase your PR on top of master (make sure you fetch), you should get the CI tests passing since the CNI issue got fixed.
@jshachm Yes, CI has been fixed now. Please rebase on top of master.
PSS Measurement: Memory inside container:
Before this patch the shared dir will remain when the sandbox has already been removed, especially for the kata-agent mode. Do clean up shared dirs after all mounts are umounted. Fixes: #291 Signed-off-by: Haomin <caihaomin@huawei.com>
PSS Measurement: Memory inside container:
logging: Add sandbox field
Before this patch the shared dir will remain when the sandbox
has already been removed.
Do clean up shared dirs after all mounts are umounted.

Fixes: #291

Signed-off-by: Haomin <caihaomin@huawei.com>