This repository has been archived by the owner on May 12, 2021. It is now read-only.

virtcontainers: process the case that kata-agent doesn't start in VM #351

Merged
merged 4 commits on Jul 24, 2018

Conversation

flyflypeng
Contributor

virtcontainers: process the case that kata-agent doesn't start in VM

If the kata-agent process is not launched in the VM, kata-runtime exits on a timeout, but the kata-proxy and qemu processes remain resident on the host.

Call the agent's Check gRPC function to verify that the gRPC server is running inside the kata-agent process in the VM; if it is not, roll back by stopping the kata-proxy and qemu processes.

Fixes: #297

Signed-off-by: flyflypeng jiangpengfei9@huawei.com
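The idea, sketched against the names that appear later in this PR's diff (k.client, k.proxy, k.state.ProxyPid) — a rough illustration, not the exact merged code:

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if _, err := k.client.Check(ctx, &grpc.CheckRequest{}); err != nil {
	// The agent never came up inside the VM: roll back here so no
	// kata-proxy or qemu process is left resident on the host.
	k.proxy.stop(sandbox, k.state.ProxyPid)
	return err
}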

@katabuilder

PSS Measurement:
Qemu: 144708 KB
Proxy: 6742 KB
Shim: 8796 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 2013332 KB

@flyflypeng flyflypeng force-pushed the fix-no-kata-agent branch from f4152d6 to f4b92c2 Compare May 30, 2018 09:24
@katabuilder

PSS Measurement:
Qemu: 144050 KB
Proxy: 6795 KB
Shim: 10930 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 2011712 KB

@codecov

codecov bot commented May 30, 2018

Codecov Report

Merging #351 into master will decrease coverage by 0.07%.
The diff coverage is 28%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #351      +/-   ##
=========================================
- Coverage   66.88%   66.8%   -0.08%     
=========================================
  Files          93      93              
  Lines        9475    9495      +20     
=========================================
+ Hits         6337    6343       +6     
- Misses       2466    2474       +8     
- Partials      672     678       +6
Impacted Files Coverage Δ
virtcontainers/sandbox.go 66.18% <0%> (-0.22%) ⬇️
virtcontainers/kata_agent.go 60.86% <28.57%> (-0.23%) ⬇️
virtcontainers/api.go 63.51% <31.25%> (-0.99%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 99954d5...2993cb3. Read the comment docs.

@flyflypeng flyflypeng force-pushed the fix-no-kata-agent branch from f4b92c2 to f3028fe Compare May 30, 2018 11:25
@katabuilder

PSS Measurement:
Qemu: 142416 KB
Proxy: 6825 KB
Shim: 10819 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 2011464 KB

Contributor

@jodh-intel jodh-intel left a comment

Thanks for raising. I think we need @sboeuf's input on this one. If I'm understanding this, the scenario this PR covers is highly unlikely to occur in normal operation, unless perhaps the agent crashed on startup?

There is also the question of performance: what sort of hit do we take by connecting and then disconnecting every time, given the very small likelihood that the agent isn't running?

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if _, err := k.client.Check(ctx, &grpc.CheckRequest{}); err != nil {
k.Logger().Error("grpc server check failed")
Contributor

How about:

k.Logger().WithError(err).Error("grpc server check failed")

Contributor Author

This log output is used to locate the problem in the kata-runtime logfile when kata-agent has crashed in the VM.

Contributor

Sure, but your code is not logging the actual error, so I'm suggesting you keep your error message and also add the original error (which might include more detail).
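For reference, the only difference is the extra error field attached to the structured log entry (logrus stores it under an "error" key by default), for example:

k.Logger().WithError(err).Error("grpc server check failed")
// emits something like:
// level=error msg="grpc server check failed" error="rpc error: code = DeadlineExceeded desc = ..."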

Contributor Author

Thanks! This is a great suggestion about the error log. Back to your first comment, I agree that the kata-agent not running is very unlikely. But I have another question: do we have any measures to make sure that kata-agent will not fail in the VM?

@@ -940,7 +940,13 @@ func (s *Sandbox) startVM() error {
// Once startVM is done, we want to guarantee
// that the sandbox is manageable. For that we need
// to start the sandbox inside the VM.
return s.agent.startSandbox(s)
err := s.agent.startSandbox(s)
defer func() {

is defer needed? why not just

if err != nil {
	s.hypervisor.stopSandbox()
}

🤔


or

if err != nil {
	return s.hypervisor.stopSandbox()
}

in case stopSandbox fails

Contributor Author

Thanks! I agree with your last proposed code, which handles the case where stopSandbox fails.


This is not semantically correct. We should not call stopSandbox() after startSandbox() returned some error. We would be running stopSandbox(), which assumes the sandbox has been properly started, when this is not true.

The valid case for such a thing would be to run stopSandbox() if an error occurred further in the code and we wanted to roll back from there.

And if you want to do such a thing, I think it should be part of its own commit for more clarity ;)

Contributor Author

Maybe we have some misunderstanding about s.agent.stopSandbox() and s.hypervisor.stopSandbox().

s.agent.stopSandbox() is used to stop the sandbox and the containers running inside the VM; in other words, it assumes both the sandbox and the containers were started successfully in the VM.

s.hypervisor.stopSandbox() is used to stop the qemu process and release the related resources; in other words, it applies when the sandbox was not created successfully in the VM.

Here, if s.agent.startSandbox() returns an error, it indicates that the sandbox was not created successfully in the VM, so I want to call s.hypervisor.stopSandbox() to stop the VM process.

return err
}

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)

why 5? should that number be a constant?

Contributor Author

Good question. Maybe I need to consider which value is best for this scenario; for now, I think 5 seconds is an appropriate choice.


yep, just define a constant and add a comment, for example

// timeout to check grpc server is serving
const grpcCheckTimeout = 5 * time.Second
....
func (k *kataAgent) startSandbox(sandbox *Sandbox) error {
....
    ctx, cancel := context.WithTimeout(context.Background(), grpcCheckTimeout)

@sboeuf sboeuf left a comment

There are good ideas in this PR, but this needs to be refactored a bit.


k.Logger().Error("grpc server check failed")

//kill kata-proxy to release resources
k.proxy.stop(sandbox, k.state.ProxyPid)

Rolling back the proxy that has been started sounds like a good idea. But if we want this properly implemented, we should register a defer right after the proxy has been started correctly. You need to declare a function-scoped err error, and if err != nil when the defer is executed, then we should stop the proxy. This applies to any error that could happen in this function after the proxy has been started.
You also want to handle the case where the agent fails to start, which is great, but that is a separate thing IMO, which is why it should be split into 2 different commits: first you introduce a proper rollback for the proxy, and then you introduce the check for agent availability.
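In Go terms, that is a named error return plus a defer registered right after the proxy starts; a rough sketch (assuming the startSandbox signature shown earlier in this thread, with the error made a named return so the defer can observe it — not the merged diff):

func (k *kataAgent) startSandbox(sandbox *Sandbox) (err error) {
	// ... build proxyParams ...

	pid, uri, err := k.proxy.start(sandbox, proxyParams)
	if err != nil {
		return err
	}
	_ = uri // used further down to configure the agent connection (elided here)

	// From this point on, any error returned by the rest of the
	// function triggers the proxy rollback.
	defer func() {
		if err != nil && pid > 0 {
			k.proxy.stop(sandbox, pid)
		}
	}()

	// ... connect to the agent, check it is serving, create the sandbox,
	// assigning failures to err so the deferred rollback observes them ...
	return nil
}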

@flyflypeng flyflypeng force-pushed the fix-no-kata-agent branch 2 times, most recently from 9cb0a88 to f232158 Compare May 31, 2018 07:00
@katabuilder

PSS Measurement:
Qemu: 163900 KB
Proxy: 6792 KB
Shim: 10786 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 1996624 KB

@devimc

devimc commented May 31, 2018

@woshijpf do me a favour and run the following command with this patch:

docker run --runtime=kata-runtime -ti busybox bash

if you don't see a qemu process running, then this patch looks good to me

@jodh-intel
Contributor

You have to look closely at the previous command to notice that it won't work as bash isn't included with busybox. Equivalent:

$ docker run --runtime=kata-runtime -ti does-not-exist

@devimc

devimc commented May 31, 2018

@jodh-intel I think you mean

docker run --runtime=kata-runtime -ti busybox does-not-exist

@jodh-intel
Contributor

@devimc - I do! Thanks! 😄

@flyflypeng
Contributor Author

flyflypeng commented May 31, 2018

@devimc I tested this patch on my Ubuntu 16.04 x86_64 laptop and it works well in the following two cases:

  1. kata-agent is running in the VM
    This case works well; the container runs in the VM.
# need to run as root 
# docker run --runtime=kata-runtime -ti busybox sh
/ # ls
bin   dev   etc   home  proc  root  sys   tmp   usr   var
  2. no kata-agent in the VM
    First, we need to create a rootfs image without kata-agent; you can follow these steps:
$ cd <kata-containers>/osbuilder/rootfs-builder

# use euleros as example
$ ./rootfs.sh euleros

# remove kata-agent.service, so systemd can't launch the kata-agent process
$ cd ./rootfs-EulerOS/usr/lib/systemd/system/
$ rm kata-agent.service
$ rm kata-containers.target

# make rootfs-EulerOS into rootfs.img
$ cd <kata-containers>/osbuilder/image-builder

# we need to modify the rootfs image-builder shell script
$ cp image_builder.sh image_builder-no-agent.sh
$ vim image_builder-no-agent.sh
 # comment out the following lines
129 # [ "${AGENT_INIT}" == "yes" ] || [ -x "${ROOTFS}/usr/bin/${AGENT_BIN}" ] || \
130 #        die "/usr/bin/${AGENT_BIN} is not installed in ${ROOTFS}
131  #       use AGENT_BIN env variable to change the expected agent binary name"
132 # OK "Agent installed"

# next, we can run image_builder-no-agent.sh to make the rootfs image
$ ./image_builder-no-agent.sh ../rootfs-builder/rootfs-EulerOS/

# finally, we get the Euleros rootfs.img without kata-agent
$ ls
Dockerfile  image_builder-no-agent.sh  image_builder.sh  kata-containers.img   README.md

# we can copy kata-containers.img to anywhere, but you need to modify the "image = "  field 
# in the /usr/share/defaults/kata-containers/configuration.toml file

Now we can use the same command as in the previous case to test this patch:

# need to run as root
# docker run --runtime=kata-runtime -ti busybox sh
docker: Error response from daemon: OCI runtime create failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded: unknown.

# we get the error output as expected; now let's see whether any kata-proxy or qemu processes are resident in the host system
$ ps -auxww --sort=start_time

......
root     10857  0.0  0.0      0     0 ?        S    23:11   0:00 [kworker/0:1]
root     10858  0.0  0.0      0     0 ?        S    23:11   0:00 [kworker/1:3]
root     11118  0.0  0.0  44744  3752 pts/17   R+   23:14   0:00 ps -auxww --sort=start_time

We get the result as expected, so this patch works well.

@devimc

devimc commented May 31, 2018

@woshijpf nice, thanks. And what about docker run --runtime=kata-runtime -ti busybox does-not-exist? Is qemu still running?

@flyflypeng
Contributor Author

@devimc Yep, I just ran the command docker run --runtime=kata-runtime -ti busybox does-not-exist, and the result surprised me: the qemu and kata-proxy processes both still reside in the host system. It is a bug that needs to be fixed.
Maybe we need to do some rollback if the container doesn't run successfully.

if err != nil && pid > 0{
k.proxy.stop(sandbox, pid)
}
}()

Could you move this defer block after the check for err:

pid, uri, err := k.proxy.start(sandbox, proxyParams)
if err != nil {
	return err
}

// If error occurs after kata-proxy start, rollback to kill kata-proxy process
defer func() {
	if err != nil && pid > 0{
		k.proxy.stop(sandbox, pid)
	}
}()

Member

@sboeuf Will the defer function be called at all if it is moved after the check for err?


Yes, the defer is called all the time (when the function returns), and if you look further at the code in this function, there are numerous cases where we can return a non-nil error, meaning the proxy will get stopped.

Contributor Author

@flyflypeng flyflypeng Jun 4, 2018

@sboeuf Since I was not sure when the deferred function would be called, I wrote some simple code to test it. My conclusion is that a deferred function is not always called when the function ends: for example, if a return is executed before the defer statement in the function body, that deferred function will never run.
But here I will move the defer block after the check for err as you suggested, because it makes no difference: if k.proxy.start() returns an error, the kata-proxy process didn't start, so we don't need any kata-proxy rollback.
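The behaviour described here is standard Go defer semantics: a deferred call only runs if its defer statement was actually reached before the function returned. A minimal standalone example:

package main

import "fmt"

func run(skip bool) {
	if skip {
		// Returning here means the defer below is never registered,
		// so "cleanup" is not printed for this call.
		return
	}
	defer fmt.Println("cleanup")
	fmt.Println("work")
}

func main() {
	run(true)  // prints nothing
	run(false) // prints "work", then "cleanup"
}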


Yes exactly, if the defer statement is reached, then it will be called when the function returns.
And yes, I asked for the defer to be placed after the error check of k.proxy.start() because we don't need any rollback if err != nil in that case.

@@ -434,7 +437,7 @@ func (k *kataAgent) startSandbox(sandbox *Sandbox) error {

// If error occurs after kata-proxy start, rollback to kill kata-proxy process
defer func() {
if err != nil && pid > 0{
if err != nil && pid > 0 {

This added space (go formatting) should be part of the first commit.


ctx, cancel := context.WithTimeout(context.Background(), grpcCheckTimeout)
defer cancel()
if _, err = k.client.Check(ctx, &grpc.CheckRequest{}); err != nil {

This looks good, but I was thinking that, in order to avoid code duplication, the check with timeout should be available through the agent interface itself. Thus, you could enhance the check() function of the agent interface, and then rely on this new version when calling it from here.

Member

You can simply use k.check() here, and it handles grpc connect/disconnect properly.

Member

@bergwolf k.check doesn't have a timeout. We need to enhance it.

Member

@jshachm Take a look at installReqFunc(). The Check request has a builtin 5-second timeout, the same as you defined in this PR.

Member

@bergwolf Sorry for missing it. Yep, with the builtin timeout, k.check will do the job.

Contributor Author

@bergwolf Thanks, that's a great suggestion; it makes the code cleaner and simpler.

k.Logger().Error("grpc server check failed, get the error: ", err)
return err
}
k.disconnect()

You should check for err here.

if err != nil {
//VM is started but sandbox is not created successfully in VM
//so we need to rollback to stop VM qemu process
s.hypervisor.stopSandbox()

Sorry, I didn't realize we were talking about hypervisor.stopSandbox().
Could you isolate this into its own commit, as it addresses the rollback of the VM?
Also, you should follow the same model you used for the proxy rollback, meaning you defer the hypervisor.stopSandbox() after the VM has been properly started. This way you cover the error cases that could come from both waitSandbox() and agent.startSandbox().
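Following that model, startVM() would register the VM rollback as soon as the hypervisor is up, along these lines (a sketch using the method names mentioned in this thread; the timeout argument and its name are illustrative, and this is not the merged diff):

func (s *Sandbox) startVM() (err error) {
	if err = s.hypervisor.startSandbox(); err != nil {
		return err
	}

	// The VM process is running: from here on, any failure below
	// (waitSandbox, agent startSandbox, ...) rolls the VM back.
	defer func() {
		if err != nil {
			s.hypervisor.stopSandbox()
		}
	}()

	if err = s.hypervisor.waitSandbox(vmStartTimeout); err != nil {
		return err
	}

	// Once startVM is done, we want to guarantee that the sandbox is
	// manageable. For that we need to start the sandbox inside the VM.
	return s.agent.startSandbox(s)
}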

@sboeuf

sboeuf commented May 31, 2018

@woshijpf there is a lot of work related to rollback, and I'd be more than happy if you want to look into it. We need to be careful about not having rollbacks that would collide with each other.

@flyflypeng
Contributor Author

@sboeuf Thanks for your careful review. I'm glad to dig into it and solve the rollback-related problems.

@flyflypeng
Contributor Author

@sboeuf @jodh-intel @devimc Hi folks, I am fixing the problem of the qemu and kata-proxy processes remaining in the system when createSandboxFromConfig() fails, but I have run into a problem and would like your help.
The problem is: how can I know whether the qemu process is running in the system when err := s.startVM() returns an error in createSandboxFromConfig()? The error returned by s.startVM() may be caused by s.hypervisor.startSandbox(), s.hypervisor.waitSandbox() or s.agent.startSandbox() inside startVM(). Thanks!

@sboeuf

sboeuf commented Jun 4, 2018

@woshijpf you could move s.agent.startSandbox() out of the startVM() function, since it is not really part of starting the VM. As startVM() is only used inside virtcontainers/api.go, this should be okay. And then you need to call s.hypervisor.stopSandbox() for the rollback.
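Concretely, the caller in virtcontainers/api.go would then drive the agent step and its rollback; a rough sketch (the surrounding createSandboxFromConfig() code and its exact return values are assumed here, not quoted from the tree):

// In createSandboxFromConfig(), after the sandbox object has been created:
if err := s.startVM(); err != nil {
	return nil, err
}

// startVM() no longer starts the sandbox inside the VM, so do it here,
// and roll the VM back if the agent is unreachable or fails.
if err := s.agent.startSandbox(s); err != nil {
	s.hypervisor.stopSandbox()
	return nil, err
}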

@flyflypeng flyflypeng force-pushed the fix-no-kata-agent branch 2 times, most recently from 031b9c5 to 004b83b Compare June 5, 2018 09:31
@katabuilder

PSS Measurement:
Qemu: 164140 KB
Proxy: 6803 KB
Shim: 10818 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 1996136 KB

@flyflypeng
Contributor Author

@sboeuf Could you help me review the code again? I have modified the code as you suggested.
Finally, thanks for your patient and detailed reviews; I have learned a lot from you.

@sboeuf

sboeuf commented Jun 19, 2018

@woshijpf I have just added one last comment. Thanks for your quick fixes!

@katacontainersbot
Contributor

PSS Measurement:
Qemu: 144523 KB
Proxy: 4612 KB
Shim: 8819 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 2007308 KB

@sboeuf sboeuf left a comment

Looks good! Thanks

@jshachm
Member

jshachm commented Jun 21, 2018

Since the review has been done, I will rebuild the CI, wait for green, and then merge it.

@sboeuf

sboeuf commented Jul 2, 2018

@woshijpf we just fixed the CI. Could you rebase this PR on the latest master? It's good to be merged once the CI passes.

@sboeuf

sboeuf commented Jul 9, 2018

@woshijpf any updates on this?

@jshachm
Member

jshachm commented Jul 11, 2018

@sboeuf he is out of office for about a month and will be back next week, so updates will come next week.

@sboeuf

sboeuf commented Jul 11, 2018

@jshachm thanks for the info!

@egernst
Member

egernst commented Jul 23, 2018

@jshachm @woshijpf -- back yet? I want to get this contribution merged.

@flyflypeng
Contributor Author

@egernst I'm sorry for being away so long; I am going to rebase the code and update the pull request.

@katacontainersbot
Contributor

PSS Measurement:
Qemu: 163197 KB
Proxy: 5130 KB
Shim: 8826 KB

Memory inside container:
Total Memory: 2045968 KB
Free Memory: 2007556 KB

@egernst
Member

egernst commented Jul 24, 2018

no worries @woshijpf - hope it was a good break. Hope the CI goes through smoothly.

@opendev-zuul

opendev-zuul bot commented Jul 24, 2018

Build succeeded (third-party-check pipeline).

@opendev-zuul

opendev-zuul bot commented Jul 24, 2018

Build succeeded (third-party-check pipeline).

@katacontainersbot
Contributor

PSS Measurement:
Qemu: 164185 KB
Proxy: 5131 KB
Shim: 8944 KB

Memory inside container:
Total Memory: 2045968 KB
Free Memory: 2007212 KB

@jshachm jshachm merged commit 8939fd8 into kata-containers:master Jul 24, 2018
If an error occurs after the sandbox network has been created successfully, we need to
roll back and remove the created sandbox network.

Fixes: kata-containers#297

Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>
If some errors occur after kata-proxy starts, we need to
roll back and kill the kata-proxy process.

Fixes: kata-containers#297

Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>
If some errors occur after the qemu process starts, we need to
roll back and kill the qemu process.

Fixes: kata-containers#297

Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>
If kata-agent doesn't start in the VM, we need to do some rollback
operations to release the related resources.

Add a grpc check() to verify whether kata-agent is running.

Fixes: kata-containers#297

Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>
@sboeuf sboeuf added bug Incorrect behaviour stable-candidate labels Sep 12, 2018
zklei pushed a commit to zklei/runtime that referenced this pull request Jun 13, 2019
A host device's major-minor numbers are mapped to guest major-minor numbers.
For example, on the host the major-minor number for /dev/loop0p1 is 259:1;
when it's attached to the VM, the major-minor number becomes 8:0. This
conversion must be reflected in the devices and resources lists: the first list
is used to mount the device in the container and the second one is used to update
the devices cgroup.

fixes kata-containers#351

Signed-off-by: Julio Montes <julio.montes@intel.com>