virtcontainers: process the case that kata-agent doesn't start in VM #351
Conversation
Force-pushed f4152d6 to f4b92c2
Codecov Report
@@ Coverage Diff @@
## master #351 +/- ##
=========================================
- Coverage 66.88% 66.8% -0.08%
=========================================
Files 93 93
Lines 9475 9495 +20
=========================================
+ Hits 6337 6343 +6
- Misses 2466 2474 +8
- Partials 672 678 +6
Continue to review full report at Codecov.
Force-pushed f4b92c2 to f3028fe
Thanks for raising. I think we need @sboeuf's input on this one. If I'm understanding this, the scenario this PR covers is highly unlikely to occur in normal operation, unless maybe the agent crashed on startup?
There is also the question of performance: what sort of hit do we have to take by connecting and then disconnecting every time, given the very small likelihood that the agent isn't running, I wonder?
virtcontainers/kata_agent.go
Outdated
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if _, err := k.client.Check(ctx, &grpc.CheckRequest{}); err != nil {
	k.Logger().Error("grpc server check failed")
How about:
k.Logger().WithError(err).Error("grpc server check failed")
This log output is used to locate the problem from the kata-runtime logfile when kata-agent has crashed in the VM.
Sure, but your code is not logging the actual error, so I'm suggesting you keep your error message and also add the original error (which might include more detail).
Thanks! This is a great suggestion about the error log. Back to your first comment, I agree that the likelihood of kata-agent not running is very small. But I have another question: do we have any measures to make sure that kata-agent will not fail in the VM?
virtcontainers/sandbox.go
Outdated
@@ -940,7 +940,13 @@ func (s *Sandbox) startVM() error {
 	// Once startVM is done, we want to guarantee
 	// that the sandbox is manageable. For that we need
 	// to start the sandbox inside the VM.
-	return s.agent.startSandbox(s)
+	err := s.agent.startSandbox(s)
+	defer func() {
is defer needed? why not just
if err != nil {
s.hypervisor.stopSandbox()
}
🤔
or
if err != nil {
return s.hypervisor.stopSandbox()
}
in case stopSandbox fails
Thanks! I agree with your last proposed code, in case stopSandbox fails.
This is not semantically correct. We should not call stopSandbox() after startSandbox() returned some error. We would be running stopSandbox(), which assumes the sandbox has been properly started, when this is not true.
The valid case for such a thing would be to run stopSandbox() if, further in the code, an error occurred and we wanted to roll back from there.
And if you want to do such a thing, I think it should be part of its own commit for more clarity ;)
Maybe we have some misunderstanding about s.agent.stopSandbox() and s.hypervisor.stopSandbox().
s.agent.stopSandbox() is used to stop the sandbox and the containers running in the VM; in other words, both the sandbox and the containers were started successfully in the VM.
s.hypervisor.stopSandbox() is used to stop the qemu process and release related resources; in other words, the sandbox was not created successfully in the VM.
Here, if s.agent.startSandbox() returns an error, it indicates that the sandbox was not created successfully in the VM, so I want to call s.hypervisor.stopSandbox() to stop the VM process.
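To make the distinction concrete, here is a minimal sketch of the rollback being described (the method names come from the diff and discussion above; everything else is illustrative, not the actual virtcontainers code):

func (s *Sandbox) startVM() error {
	// ... the hypervisor has already started the VM at this point ...

	// Start the sandbox inside the VM through the agent.
	if err := s.agent.startSandbox(s); err != nil {
		// The sandbox was never created inside the VM, so only the
		// hypervisor-level teardown applies: stop the qemu process
		// and release its resources. The agent-level
		// s.agent.stopSandbox() is deliberately not called here,
		// since it assumes a running sandbox.
		s.hypervisor.stopSandbox()
		return err
	}

	return nil
}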
virtcontainers/kata_agent.go
Outdated
	return err
}

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
why 5? should that number be a constant?
Good question, maybe I need to consider which value is best for this scenario; here I think 5 seconds is an appropriate choice.
yep, just define a constant and add a comment, for example
// timeout to check grpc server is serving
const grpcCheckTimeout = 5*time.Second
....
func (k *kataAgent) startSandbox(sandbox *Sandbox) error {
....
ctx, cancel := context.WithTimeout(context.Background(), grpcCheckTimeout)
There are good ideas in this PR, but this needs to be refactored a bit.
virtcontainers/kata_agent.go
Outdated
k.Logger().Error("grpc server check failed")

// kill kata-proxy to release resources
k.proxy.stop(sandbox, k.state.ProxyPid)
The rollback of the proxy being started sounds like a good idea. But if we want this to be properly implemented, we should create a defer right after the proxy has been correctly started. You need to declare a function-scoped err error, and if, when the defer is executed, err != nil, then we should stop the proxy. This will apply to any error case that could happen in this function after the proxy has been started.
You also want to handle the case where the agent fails to start, which is great, but that is a separate thing IMO, which is why it should be split into 2 different commits. First you introduce proper rollback for the proxy, and then you introduce the check for the agent availability.
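A rough sketch of the pattern being asked for, i.e. a function-scoped err combined with a deferred rollback (the names are taken from the diffs above; the elided steps are placeholders, not the real function body):

func (k *kataAgent) startSandbox(sandbox *Sandbox) (err error) {
	// err is the function-scoped error inspected by the deferred rollback.
	pid, uri, err := k.proxy.start(sandbox, proxyParams)
	if err != nil {
		return err
	}

	// From this point on, any error returned by this function triggers
	// the proxy rollback.
	defer func() {
		if err != nil && pid > 0 {
			k.proxy.stop(sandbox, pid)
		}
	}()

	// ... store uri, connect to the agent, run the gRPC check, create
	// the sandbox ... Every later failure path assigns to err before
	// returning, so the deferred function sees it and stops the proxy.
	return nil
}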
Force-pushed 9cb0a88 to f232158
@woshijpf do me a favour, run the next command with this patch
if you don't see a qemu process running then this patch looks good to me
You have to look closely at the previous command to notice that it won't work as
@jodh-intel I think you mean
@devimc - I do! Thanks! 😄
@devimc I tested this patch on my Ubuntu 16.04 x86_64 laptop and it works well in the next two cases:
# need to run as root
# docker run --runtime=kata-runtime -ti busybox sh
/ # ls
bin dev etc home proc root sys tmp usr var
$ cd <kata-containers>/osbuilder/rootfs-builder
# use euleros as example
$ ./rootfs.sh euleros
# remove kata-agent.service, so systemd can't launch the kata-agent process
$ cd ./rootfs-EulerOS/usr/lib/systemd/system/
$ rm kata-agent.service
$ rm kata-containers.target
# make rootfs-EulerOS into rootfs.img
$ cd <kata-containers>/osbuilder/image-builder
# we need to modify the rootfs make shell script
$ cp image_builder.sh image_builder-no-agent.sh
$ vim image_builder-no-agent.sh
# comment out the following lines
129 # [ "${AGENT_INIT}" == "yes" ] || [ -x "${ROOTFS}/usr/bin/${AGENT_BIN}" ] || \
130 # die "/usr/bin/${AGENT_BIN} is not installed in ${ROOTFS}
131 # use AGENT_BIN env variable to change the expected agent binary name"
132 # OK "Agent installed"
# next, we can run image_builder-no-agent.sh to make rootfs image
$ ./image_builder-no-agent.sh ../rootfs-builder/rootfs-EulerOS/
# finally, we get the Euleros rootfs.img without kata-agent
$ ls
Dockerfile image_builder-no-agent.sh image_builder.sh kata-containers.img README.md
# we can copy kata-containers.img anywhere, but you need to modify the "image = " field
# in the /usr/share/defaults/kata-containers/configuration.toml file
Now, we can use the same command as in the previous case to test this patch:
# need to run as root
# docker run --runtime=kata-runtime -ti busybox sh
docker: Error response from daemon: OCI runtime create failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded: unknown.
# we get the error output as expected, now let's see if any kata-proxy or qemu processes are resident in the host system
$ ps -auxww --sort=start_time
......
root 10857 0.0 0.0 0 0 ? S 23:11 0:00 [kworker/0:1]
root 10858 0.0 0.0 0 0 ? S 23:11 0:00 [kworker/1:3]
root 11118 0.0 0.0 44744 3752 pts/17 R+ 23:14 0:00 ps -auxww --sort=start_time
We get the result as expected, so this patch works well.
@woshijpf nice and thanks, and what about
@devimc Yep, I just ran the command
virtcontainers/kata_agent.go
Outdated
if err != nil && pid > 0{
	k.proxy.stop(sandbox, pid)
}
}()
Could you move this defer block after the check for err:
pid, uri, err := k.proxy.start(sandbox, proxyParams)
if err != nil {
return err
}
// If error occurs after kata-proxy start, rollback to kill kata-proxy process
defer func() {
if err != nil && pid > 0{
k.proxy.stop(sandbox, pid)
}
}()
@sboeuf Will the defer function be called at all if it is moved after the check for err?
Yes, the defer is called all the time (when the function returns), and if you look further at the code in this function, there are numerous cases where we can return a non-nil error, meaning the proxy will get stopped.
@sboeuf Since I was not sure when the deferred function would be called, I wrote some simple code to test it. I came to the conclusion that the deferred function isn't always called when the function ends; for example, if a return is executed before the defer statement in the function body, then the deferred function will not be called.
But here I will move the defer block after the check for err as you suggested, because it makes no difference: if k.proxy.start() returns an error, that indicates the kata-proxy process didn't start, so we don't need to do the kata-proxy rollback operation.
Yes exactly, if the defer statement is reached, then it will be called when the function returns.
And yes, I have asked for the defer to be placed after the error check of k.proxy.start() because we don't need any rollback if err != nil in this case.
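A tiny self-contained Go example of the defer semantics discussed here (standard library only): the deferred function runs when the function returns, but only if the defer statement itself was reached first.

package main

import (
	"errors"
	"fmt"
)

func run(failEarly bool) error {
	if failEarly {
		// Returning here means the defer below is never registered,
		// so no rollback happens for this path.
		return errors.New("early failure")
	}

	var err error
	// Registered only if execution reaches this line.
	defer func() {
		if err != nil {
			fmt.Println("rollback executed")
		}
	}()

	err = errors.New("late failure")
	return err
}

func main() {
	run(true)  // prints nothing: the defer statement was never reached
	run(false) // prints "rollback executed": defer was reached and err != nil
}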
virtcontainers/kata_agent.go
Outdated
@@ -434,7 +437,7 @@ func (k *kataAgent) startSandbox(sandbox *Sandbox) error {

 	// If error occurs after kata-proxy start, rollback to kill kata-proxy process
 	defer func() {
-		if err != nil && pid > 0{
+		if err != nil && pid > 0 {
This added space (go formatting) should be part of the first commit.
virtcontainers/kata_agent.go
Outdated
ctx, cancel := context.WithTimeout(context.Background(), grpcCheckTimeout)
defer cancel()
if _, err = k.client.Check(ctx, &grpc.CheckRequest{}); err != nil {
This looks good, but I was thinking that in order to avoid duplication of code, the check with timeout should be available through the agent interface itself. Thus, you could enhance the agent interface regarding the check() function, and then rely on this new version to call it from here.
You can simply use k.check() here, and it handles grpc connect/disconnect properly.
@bergwolf k.check doesn't have a timeout. We need to enhance it.
@jshachm Take a look at installReqFunc(). The Check request has a builtin 5 second timeout, same as you defined in this PR.
@bergwolf sorry for missing it~~ Yep, with the builtin timeout, k.check will do the job.
@bergwolf Thanks, it's a great suggestion, it makes the code cleaner and simpler.
virtcontainers/kata_agent.go
Outdated
	k.Logger().Error("grpc server check failed, get the error: ", err)
	return err
}
k.disconnect()
You should check for err here.
virtcontainers/sandbox.go
Outdated
if err != nil {
	//VM is started but sandbox is not created successfully in VM
	//so we need to rollback to stop VM qemu process
	s.hypervisor.stopSandbox()
Sorry, I didn't realize we were talking about hypervisor.stopSandbox()...
Could you isolate this into its own commit, as this addresses the rollback of the VM?
Also, you should follow the same model you followed for the proxy rollback, meaning you defer the hypervisor.stopSandbox() after you have properly started the VM. This way you cover both the error case from waitSandbox() and from agent.startSandbox().
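A sketch of what that suggestion could look like in startVM (illustrative only; the waitSandbox signature and the vmStartTimeout constant are assumptions, not the merged code):

func (s *Sandbox) startVM() (err error) {
	if err = s.hypervisor.startSandbox(); err != nil {
		return err
	}

	// The VM is up: from here on, any error returned by this function
	// rolls the VM back by stopping the hypervisor.
	defer func() {
		if err != nil {
			s.hypervisor.stopSandbox()
		}
	}()

	if err = s.hypervisor.waitSandbox(vmStartTimeout); err != nil {
		return err
	}

	// Once startVM is done, we want to guarantee that the sandbox is
	// manageable. For that we need to start the sandbox inside the VM.
	return s.agent.startSandbox(s)
}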
@woshijpf there is a lot of work related to rollback, and I'd be more than happy if you want to look into it. We need to be careful about not having rollbacks that collide with each other.
@sboeuf Thanks for your careful review. I'm glad to dig into it to solve the rollback-related problems.
@sboeuf @jodh-intel @devimc Hi buddies~ I am fixing the problem of qemu and kata-proxy processes residing in the system while
@woshijpf you could move the
Force-pushed 031b9c5 to 004b83b
@sboeuf Could you help me review the code again? I modified the code as you suggested.
@woshijpf I have just added one last comment. Thanks for your quick fixes!
Force-pushed 2df2345 to 6670b5d
Looks good! Thanks
Since the review has been done, I will rebuild the CI, wait for green, and then merge it.
@woshijpf we just fixed the CI; could you rebase this PR on latest master? It's good to be merged once the CI passes.
@woshijpf any updates on this?
@sboeuf he is out of office for about a month and will be back next week, so updates will come next week~
@jshachm thx for the info!
@jshachm @woshijpf -- back yet? I want to get this contribution merged.
@egernst I'm sorry for being away so long; I am going to rebase the code and update the pull request.
Force-pushed 6670b5d to bd80266
no worries @woshijpf - hope it was a good break. Hope the CI goes through smoothly.
Build succeeded (third-party-check pipeline).
Force-pushed bd80266 to 2993cb3
Build succeeded (third-party-check pipeline).
If an error occurs after the sandbox network is created successfully, we need to roll back to remove the created sandbox network. Fixes: kata-containers#297 Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>
If some errors occur after kata-proxy starts, we need to roll back to kill the kata-proxy process. Fixes: kata-containers#297 Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>
If some errors occur after the qemu process starts, we need to roll back to kill the qemu process. Fixes: kata-containers#297 Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>
If kata-agent doesn't start in the VM, we need to do some rollback operations to release related resources. Add a grpc check() to verify whether kata-agent is running or not. Fixes: kata-containers#297 Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>
Host device's major-minor numbers are mapped to guest major-minor numbers, for example in the host the major-minor number for /dev/loop0p1 is 259:1, when it's attached to the VM now the major-minor number is 8:0, this conversion must be reflected in devices and resources lists, the first list is used to mount the device in the container and the second one is to update the devices cgroup. fixes kata-containers#351 Signed-off-by: Julio Montes <julio.montes@intel.com>
virtcontainers: process the case that kata-agent doesn't start in VM
If the kata-agent process is not launched in the VM, kata-runtime will exit because of a timeout;
however, the kata-proxy and qemu processes will still be resident on the host.
Call the Check grpc function to check whether the grpc server is running in the kata-agent process in the VM;
if not, do some rollback operations to stop the kata-proxy and qemu processes.
Fixes: #297
Signed-off-by: flyflypeng jiangpengfei9@huawei.com
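For reference, a condensed sketch of the flow this PR converges on inside kataAgent.startSandbox, pieced together from the review discussion above (simplified; the elided steps are placeholders, not the merged code):

func (k *kataAgent) startSandbox(sandbox *Sandbox) (err error) {
	// Start kata-proxy for this sandbox.
	pid, uri, err := k.proxy.start(sandbox, proxyParams)
	if err != nil {
		return err
	}

	// If an error occurs after kata-proxy has started, roll back by
	// killing the kata-proxy process.
	defer func() {
		if err != nil && pid > 0 {
			k.proxy.stop(sandbox, pid)
		}
	}()

	// ... store the proxy PID, set the agent URL from uri, etc. ...

	// Probe the agent. k.check() wraps the Check gRPC request, which
	// carries a builtin timeout, so if kata-agent never started inside
	// the VM this fails, the deferred block stops kata-proxy, and the
	// VM rollback in startVM() stops the qemu process.
	if err = k.check(); err != nil {
		k.Logger().WithError(err).Error("grpc server check failed")
		return err
	}

	// ... send the CreateSandbox request to the agent ...
	return nil
}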