Fix zombie healthcheck threads #597
This should not be merged. The intention is only to document the bug.
Hi @da2ce7 @WarmBeer. The problem was solved after merging this PR. The healthcheck does not set a timeout on the request, so it waits indefinitely. As we discussed in our meeting, this is what happened:
The following is the latest version of the code, but in the previous one binding and running the server were also done in different tasks.

```rust
impl Udp {
    /// It starts the UDP server instance with graceful shutdown.
    ///
    /// # Panics
    ///
    /// It panics if unable to bind to the UDP socket or to get the address
    /// from the UDP socket. It also panics if unable to send the address of
    /// the socket.
    async fn start_with_graceful_shutdown(
        tracker: Arc<Tracker>,
        bind_to: SocketAddr,
        tx_start: Sender<Started>,
        rx_halt: Receiver<Halted>,
    ) -> JoinHandle<()> {
        let socket = Arc::new(UdpSocket::bind(bind_to).await.expect("Could not bind to {self.socket}."));
        let address = socket.local_addr().expect("Could not get local_addr from {binding}.");

        info!(target: "UDP Tracker", "Starting on: udp://{}", address);

        let running = tokio::task::spawn(async move {
            let halt = tokio::task::spawn(async move {
                debug!(target: "UDP Tracker", "Waiting for halt signal for socket address: udp://{address} ...");

                shutdown_signal_with_message(
                    rx_halt,
                    format!("Shutting down UDP server on socket address: udp://{address}"),
                )
                .await;
            });

            let listen = async move {
                debug!(target: "UDP Tracker", "Waiting for packets on socket address: udp://{address} ...");

                loop {
                    let mut data = [0; MAX_PACKET_SIZE];
                    let socket_clone = socket.clone();

                    match socket_clone.recv_from(&mut data).await {
                        Ok((valid_bytes, remote_addr)) => {
                            let payload = data[..valid_bytes].to_vec();

                            debug!(target: "UDP Tracker", "Received {} bytes", payload.len());
                            debug!(target: "UDP Tracker", "From: {}", &remote_addr);
                            debug!(target: "UDP Tracker", "Payload: {:?}", payload);

                            let response = handle_packet(remote_addr, payload, &tracker).await;

                            Udp::send_response(socket_clone, remote_addr, response).await;
                        }
                        Err(err) => {
                            error!("Error reading UDP datagram from socket. Error: {:?}", err);
                        }
                    }
                }
            };

            pin_mut!(halt);
            pin_mut!(listen);

            tx_start
                .send(Started { address })
                .expect("the UDP Tracker service should not be dropped");

            tokio::select! {
                _ = &mut halt => { debug!(target: "UDP Tracker", "Halt signal spawned task stopped on address: udp://{address}"); },
                () = &mut listen => { debug!(target: "UDP Tracker", "Socket listener stopped on address: udp://{address}"); },
            }
        });

        info!(target: "UDP Tracker", "Started on: udp://{}", address);

        running
    }
}
```

Just for the record, we did that because we wanted to send back the bound address when the configuration uses port 0 and a free port is assigned dynamically.
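As an aside, here is a minimal sketch of that pattern outside the tracker's own types (the `start` helper and the channel shape are hypothetical, for illustration only): bind to port 0, read the real address back from the socket, and hand it to the caller through a oneshot channel.

```rust
use std::net::SocketAddr;

use tokio::net::UdpSocket;
use tokio::sync::oneshot;

// Hypothetical helper: binds (possibly to port 0) and reports the address
// that was actually bound back to the caller before it starts serving.
async fn start(bind_to: SocketAddr, tx_start: oneshot::Sender<SocketAddr>) {
    let socket = UdpSocket::bind(bind_to).await.expect("could not bind");

    // When `bind_to` uses port 0, the OS assigns a free port; `local_addr`
    // returns the address with the dynamically assigned port filled in.
    let address = socket.local_addr().expect("could not get local_addr");

    tx_start.send(address).expect("receiver should not be dropped");

    // ... serve requests on `socket` ...
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();
    tokio::spawn(start("127.0.0.1:0".parse().unwrap(), tx));
    let bound = rx.await.expect("server should report its address");
    println!("UDP server bound to {bound}");
}
```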
I'm going to keep this issue open to add a timeout to the `http_health_check.rs` binary. When I implemented the healthcheck, I added a timeout at the Docker level:

```dockerfile
HEALTHCHECK --interval=5s --timeout=5s --start-period=3s --retries=3 \
    CMD /usr/bin/http_health_check http://localhost:${HEALTH_CHECK_API_PORT}/health_check \
    || exit 1
```

But Docker does not abort the command. The timeout only means Docker does not consider the service healthy if it does not respond within 5 seconds. So we still need the timeout in the `http_health_check` binary itself.
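The fix would look roughly like this, a minimal sketch assuming the binary uses the `reqwest` crate (the 5-second value mirrors the Docker timeout and is my assumption, not a decided value):

```rust
use std::time::Duration;

// Sketch of a health check binary whose HTTP client has an explicit
// timeout, so the request can never hang forever.
#[tokio::main]
async fn main() {
    let url = std::env::args().nth(1).expect("usage: http_health_check <url>");

    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(5)) // covers connect through end of body
        .build()
        .expect("client should build");

    match client.get(url.as_str()).send().await {
        Ok(response) if response.status().is_success() => {
            println!("OK");
        }
        Ok(response) => {
            eprintln!("Non-success status: {}", response.status());
            std::process::exit(1);
        }
        Err(err) => {
            // A timed-out request ends up here instead of blocking forever,
            // so Docker gets a failing exit code it can act on.
            eprintln!("Health check failed: {err}");
            std::process::exit(1);
        }
    }
}
```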
I've created a PR to reproduce and document the problem: #602

Finally, there are more places where missing timeouts could be a problem. Some of them are documented, like this issue in the Index, but there could be more. For example, we do not have a timeout for handling the UDP request, if I'm not wrong. I'm going to open a new issue to collect and track all the places where we could have problems due to missing timeouts.
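To illustrate the kind of guard I mean for the UDP case, here is a generic sketch wrapping a handler in `tokio::time::timeout` (the `handle_request` function and the 5-second deadline are assumptions for illustration, not the tracker's actual `handle_packet` signature):

```rust
use std::time::Duration;

use tokio::time::timeout;

// Hypothetical request handler standing in for something like
// `handle_packet`; the real signature in the tracker differs.
async fn handle_request(payload: Vec<u8>) -> Vec<u8> {
    // ... build the response ...
    payload
}

async fn handle_with_deadline(payload: Vec<u8>) -> Option<Vec<u8>> {
    // If the handler takes longer than the deadline, drop the request
    // instead of letting the task linger indefinitely.
    match timeout(Duration::from_secs(5), handle_request(payload)).await {
        Ok(response) => Some(response),
        Err(_elapsed) => {
            eprintln!("request handling timed out; dropping response");
            None
        }
    }
}
```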
Fixed via: #608
Parent issue: #582
The live demo is not working because the services crash periodically due to high memory consumption.
I'm not sure if the problem comes from having a lot of health check threads, but it's something we have to fix anyway.
live-demo-console-log.txt
In the past, we had a similar problem when the cronjob to remove peerless torrents was disabled. See these related issues: