Celery worker/consumer loses connection to RabbitMQ broker after 20 to 30 minutes #434

Open
griffinschulte opened this issue May 29, 2024 · 0 comments


Hi folks,

I'm developing a Flask application that utilizes Celery with RabbitMQ as the broker. Below are the version details of my dependencies:

Flask=3.0.3
amqp=5.2.0
celery=5.4.0
kombu=5.3.7
RabbitMQ=3.12.13

I'm running into an issue where Celery workers handling long-running tasks (over 30 minutes) appear to lose their connection to RabbitMQ about 20-30 minutes into execution. The worker keeps executing the task; however, once the task completes, the worker logs the following error and stops responding to new tasks:

[2024-05-29 09:36:19,714: INFO/MainProcess] Task project.tasks.validation_workflow[0f2dda43-4a09-41a9-ad52-2edb22721d57] succeeded in 1800.813s: None
[2024-05-29 09:36:19,717: CRITICAL/MainProcess] Couldn't ack 1, reason:SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2396)')
Traceback (most recent call last):
  File "C:\Users\User_Name\project\venv\lib\site-packages\kombu\message.py", line 131, in ack_log_error
    self.ack(multiple=multiple)
  File "C:\Users\User_Name\project\venv\lib\site-packages\kombu\message.py", line 126, in ack
    self.channel.basic_ack(self.delivery_tag, multiple=multiple)
  File "C:\Users\User_Name\project\venv\lib\site-packages\amqp\channel.py", line 1407, in basic_ack
    return self.send_method(
  File "C:\Users\User_Name\project\venv\lib\site-packages\amqp\abstract_channel.py", line 70, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "C:\Users\User_Name\project\venv\lib\site-packages\amqp\method_framing.py", line 186, in write_frame
    write(buffer_store.view[:offset])
  File "C:\Users\User_Name\project\venv\lib\site-packages\amqp\transport.py", line 347, in write
    self._write(s)
  File "C:\Users\User_Name\project\venv\lib\site-packages\amqp\transport.py", line 597, in _write
    n = write(s)
  File "C:\Program Files\Python310\lib\ssl.py", line 1149, in write
    return self._sslobj.write(data)
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2396)

As a result, the task is returned to the queue and starts executing again after I restart the worker. Around the 20-30 minute mark of task execution, I can see the consumer drop from the Celery queue. The worker reports no errors until the task finishes. If the task succeeds, the result shows as successful in the Celery worker console and in Flower, but the message is never acknowledged; the error above appears immediately after the success message in the worker console. I initially thought this was a consumer_timeout issue, but after increasing the value from the 30-minute default to 10 hours, I still get the same error.
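For reference, here is a rough sketch of the arithmetic behind the consumer_timeout change. It assumes the value is set through rabbitmq.conf, where consumer_timeout is expressed in milliseconds. The helper names `hours_to_ms` and `consumer_timeout_line` are hypothetical and not part of RabbitMQ or Celery.

```python
# Sketch only: helper names are hypothetical, not part of RabbitMQ or Celery.
# Assumes consumer_timeout is configured in rabbitmq.conf, in milliseconds.


def hours_to_ms(hours: float) -> int:
    """Convert a duration in hours to whole milliseconds."""
    return int(round(hours * 60 * 60 * 1000))


def consumer_timeout_line(hours: float) -> str:
    """Render a rabbitmq.conf line for the given timeout in hours."""
    return f"consumer_timeout = {hours_to_ms(hours)}"


if __name__ == "__main__":
    print(consumer_timeout_line(0.5))  # the 30-minute default
    print(consumer_timeout_line(10))   # the raised value of 10 hours
```

The 10-hour line is the value I would expect to be in effect on the broker after the change.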

I'm having a hard time identifying what may be the issue here. Any help would be greatly appreciated.

Thank you!
