Fix zombie healthcheck threads #597
This should not be merged. The intention is only to document the bug.
Hi @da2ce7 @WarmBeer. The problem was solved after merging this PR. The healthcheck does not set a timeout on the request, so it waits indefinitely. As we discussed in our meeting, this is what happened:
The following is the latest version of the code, but in the previous one binding and running the server were also done in different tasks.

```rust
impl Udp {
    /// It starts the UDP server instance with graceful shutdown.
    ///
    /// # Panics
    ///
    /// It panics if unable to bind to the UDP socket or to get the address
    /// from the UDP socket. It also panics if unable to send the address of
    /// the socket.
    async fn start_with_graceful_shutdown(
        tracker: Arc<Tracker>,
        bind_to: SocketAddr,
        tx_start: Sender<Started>,
        rx_halt: Receiver<Halted>,
    ) -> JoinHandle<()> {
        let socket = Arc::new(UdpSocket::bind(bind_to).await.expect("Could not bind to {self.socket}."));
        let address = socket.local_addr().expect("Could not get local_addr from {binding}.");

        info!(target: "UDP Tracker", "Starting on: udp://{}", address);

        let running = tokio::task::spawn(async move {
            let halt = tokio::task::spawn(async move {
                debug!(target: "UDP Tracker", "Waiting for halt signal for socket address: udp://{address} ...");

                shutdown_signal_with_message(
                    rx_halt,
                    format!("Shutting down UDP server on socket address: udp://{address}"),
                )
                .await;
            });

            let listen = async move {
                debug!(target: "UDP Tracker", "Waiting for packets on socket address: udp://{address} ...");

                loop {
                    let mut data = [0; MAX_PACKET_SIZE];
                    let socket_clone = socket.clone();

                    match socket_clone.recv_from(&mut data).await {
                        Ok((valid_bytes, remote_addr)) => {
                            let payload = data[..valid_bytes].to_vec();

                            debug!(target: "UDP Tracker", "Received {} bytes", payload.len());
                            debug!(target: "UDP Tracker", "From: {}", &remote_addr);
                            debug!(target: "UDP Tracker", "Payload: {:?}", payload);

                            let response = handle_packet(remote_addr, payload, &tracker).await;

                            Udp::send_response(socket_clone, remote_addr, response).await;
                        }
                        Err(err) => {
                            error!("Error reading UDP datagram from socket. Error: {:?}", err);
                        }
                    }
                }
            };

            pin_mut!(halt);
            pin_mut!(listen);

            tx_start
                .send(Started { address })
                .expect("the UDP Tracker service should not be dropped");

            tokio::select! {
                _ = &mut halt => { debug!(target: "UDP Tracker", "Halt signal spawned task stopped on address: udp://{address}"); },
                () = &mut listen => { debug!(target: "UDP Tracker", "Socket listener stopped on address: udp://{address}"); },
            }
        });

        info!(target: "UDP Tracker", "Started on: udp://{}", address);

        running
    }
}
```

Just for the record, we did that because we wanted to send back the bound address when the configuration uses port 0 and a free port is assigned dynamically.
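As an aside, here is a minimal sketch of that pattern outside the tracker's own types (the `start` helper and the channel shape are hypothetical, for illustration only): bind to port 0, read the real address back from the socket, and hand it to the caller through a oneshot channel.

```rust
use std::net::SocketAddr;

use tokio::net::UdpSocket;
use tokio::sync::oneshot;

// Hypothetical helper: binds (possibly to port 0) and reports the address
// that was actually bound back to the caller before it starts serving.
async fn start(bind_to: SocketAddr, tx_start: oneshot::Sender<SocketAddr>) {
    let socket = UdpSocket::bind(bind_to).await.expect("could not bind");

    // When `bind_to` uses port 0, the OS assigns a free port; `local_addr`
    // returns the address with the dynamically assigned port filled in.
    let address = socket.local_addr().expect("could not get local_addr");

    tx_start.send(address).expect("receiver should not be dropped");

    // ... serve requests on `socket` ...
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();
    tokio::spawn(start("127.0.0.1:0".parse().unwrap(), tx));
    let bound = rx.await.expect("server should report its address");
    println!("UDP server bound to {bound}");
}
```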
I'm going to keep this issue open to add a timeout to the `http_health_check.rs` binary. When I implemented the healthcheck, I added a timeout at the Docker level:

```dockerfile
HEALTHCHECK --interval=5s --timeout=5s --start-period=3s --retries=3 \
    CMD /usr/bin/http_health_check http://localhost:${HEALTH_CHECK_API_PORT}/health_check \
    || exit 1
```

But Docker does not abort the command. The timeout only means Docker does not consider the service healthy if it does not respond within 5 seconds. So we still need the timeout in the `http_health_check` binary itself.
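The fix would look roughly like this, a minimal sketch assuming the binary uses the `reqwest` crate (the 5-second value mirrors the Docker timeout and is my assumption, not a decided value):

```rust
use std::time::Duration;

// Sketch of a health check binary whose HTTP client has an explicit
// timeout, so the request can never hang forever.
#[tokio::main]
async fn main() {
    let url = std::env::args().nth(1).expect("usage: http_health_check <url>");

    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(5)) // covers connect through end of body
        .build()
        .expect("client should build");

    match client.get(url.as_str()).send().await {
        Ok(response) if response.status().is_success() => {
            println!("OK");
        }
        Ok(response) => {
            eprintln!("Non-success status: {}", response.status());
            std::process::exit(1);
        }
        Err(err) => {
            // A timed-out request ends up here instead of blocking forever,
            // so Docker gets a failing exit code it can act on.
            eprintln!("Health check failed: {err}");
            std::process::exit(1);
        }
    }
}
```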
I've created a PR to reproduce and document the problem: #602

Finally, there are more places where missing timeouts could be a problem. Some of them are documented, like this issue in the Index, but there could be more. For example, we do not have a timeout for handling the UDP request, if I'm not wrong. I'm going to open a new issue to collect and track all the places where we could have problems due to missing timeouts.
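To illustrate the kind of guard I mean for the UDP case, here is a generic sketch wrapping a handler in `tokio::time::timeout` (the `handle_request` function and the 5-second deadline are assumptions for illustration, not the tracker's actual `handle_packet` signature):

```rust
use std::time::Duration;

use tokio::time::timeout;

// Hypothetical request handler standing in for something like
// `handle_packet`; the real signature in the tracker differs.
async fn handle_request(payload: Vec<u8>) -> Vec<u8> {
    // ... build the response ...
    payload
}

async fn handle_with_deadline(payload: Vec<u8>) -> Option<Vec<u8>> {
    // If the handler takes longer than the deadline, drop the request
    // instead of letting the task linger indefinitely.
    match timeout(Duration::from_secs(5), handle_request(payload)).await {
        Ok(response) => Some(response),
        Err(_elapsed) => {
            eprintln!("request handling timed out; dropping response");
            None
        }
    }
}
```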
Fixed via: #608
Parent issue: #582
The live demo is not working because the services crash periodically due to high memory consumption.
I'm not sure if the problem comes from having a lot of health check threads, but it's something we have to fix anyway.
live-demo-console-log.txt
In the past, we had a similar problem when the cronjob to remove peerless torrents was disabled. See these related issues: