Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Service network alias is not inmediately available #1977

Closed
jsoriano opened this issue Jul 16, 2024 · 4 comments
Closed

Service network alias is not inmediately available #1977

jsoriano opened this issue Jul 16, 2024 · 4 comments

Comments

@jsoriano
Copy link
Member

When a service is started, it is connected to the network where the agent is running, with an alias prefixed with svc-. It looks like adding this alias doesn't make the service inmediately available in the network of the agent, and the agent may fail to connect with the service if it tries soon enough.

We saw this in elastic/integrations#9926, where the agent could not connect to the service, but other commands executed manually could do it. In this particular case the input didn't have any retry mechanism, so after the first failure the test could not succeed. DNS could be resolved, because we saw the IP in the logs:

Input 'websocket' failed with: input.go:130: input websocket-websocket.websocket-d1bff21f-ea88-4732-8635-686c972c7019 failed (id=websocket-websocket.websocket-d1bff21f-ea88-4732-8635-686c972c7019)
	dial tcp 172.23.0.6:80: connect: connection refused

This might be related to some flaky issues caused by not found documents in system tests.

Service deployer setup should not return if the service is still unreachable in the agent network.

@cmacknz
Copy link
Member

cmacknz commented Jul 16, 2024

In the sense that this may slow down test execution, it is worth fixing.

From a testing perspective, IMO this is a feature, every network request we ever make should have retries on it. Outside of the system tests, real servers are not always going to be available when they should be.

@chrisberkhout
Copy link
Contributor

The system tests pass for elastic/integrations#9926 do pass for me locally on Linux.

Looking at this part of the logs of a failing CI build,

2024/07/16 15:11:03 DEBUG running command: /usr/bin/docker compose -f _dev/deploy/docker/docker-compose.yml --progress plain -p elastic-package-service-17350 ps -a -q
2024/07/16 15:11:03 DEBUG Wait for healthy containers: 55455e6e61f07df9def97a98e25781ff37b0b44397b88c7b95fad59b50417086
2024/07/16 15:11:03 DEBUG output command: /usr/bin/docker inspect 55455e6e61f07df9def97a98e25781ff37b0b44397b88c7b95fad59b50417086
2024/07/16 15:11:03 DEBUG Container 55455e6e61f07df9def97a98e25781ff37b0b44397b88c7b95fad59b50417086 status: running (no health status)
2024/07/16 15:11:03 DEBUG run command: /usr/bin/docker network connect elastic-package-agent-websocket-14811_default elastic-package-service-17350-websocket-mock-service-1 --alias svc-websocket-mock-service

I notice status: running (no health status), and I wonder if the adding a health check would be a good way to avoid connecting to the service before it's ready.

Health check options for docker-compose are described in the documentation here and here.

@jsoriano
Copy link
Member Author

The system tests pass for elastic/integrations#9926 do pass for me locally on Linux.

Yes, they pass in some machines. But for example it constantly fails for me locally, also on Linux, and they fail in CI too.

I notice status: running (no health status), and I wonder if the adding a health check would be a good way to avoid connecting to the service before it's ready.

Good point, during some time the container is unresponsive because it is compiling the binary. So this is probably all, and no issue on elastic-package side, I will close this issue.

Tested on my machine to add a healthcheck and it fixes the issue, @ShourieG could you add this in your PR to unblock it?

This is the patch I am testing with:

diff --git a/packages/websocket/_dev/deploy/docker/docker-compose.yml b/packages/websocket/_dev/deploy/docker/docker-compose.yml
index ba3545e54b..d7a9fdf7a9 100644
--- a/packages/websocket/_dev/deploy/docker/docker-compose.yml
+++ b/packages/websocket/_dev/deploy/docker/docker-compose.yml
@@ -7,5 +7,9 @@ services:
     volumes:
       - ./websocket-mock-service:/app
     ports:
-      - "80:80"
+      - "3000:3000"
     command: ["go", "run", "main.go"]
+    healthcheck:
+      test: "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"
+      interval: 10s
+      timeout: 5s
diff --git a/packages/websocket/_dev/deploy/docker/websocket-mock-service/main.go b/packages/websocket/_dev/deploy/docker/websocket-mock-service/main.go
index 86f5f829ce..44f7bfa53a 100644
--- a/packages/websocket/_dev/deploy/docker/websocket-mock-service/main.go
+++ b/packages/websocket/_dev/deploy/docker/websocket-mock-service/main.go
@@ -13,10 +13,16 @@ import (
 
 func main() {
        http.HandleFunc("/", handleWebSocket)
-       log.Fatal(http.ListenAndServe(":80", nil))
+       log.Fatal(http.ListenAndServe(":3000", nil))
 }
 
 func handleWebSocket(w http.ResponseWriter, r *http.Request) {
+       log.Println(r.Method, r.URL.String())
+
+       if r.URL.Path == "/health" {
+               return
+       }
+
        if r.URL.Path == "/testbasicauth" {
                // Check if the 'Authorization' header is set for basic authentication
                authHeader := r.Header.Get("Authorization")

@ShourieG
Copy link

Yup, the tests are now passing, this is awesome. Thanks a lot @chrisberkhout & @jsoriano. Till now I was under the impression that the health checks were automatically added downstream somewhere, but this clarified that my understanding was wrong. This knowledge will help a lot in future with similar mock services.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants