LISTEN channel stops working due to postgres bug #3147

steve-chavez · 2024-01-09T17:46:40Z

Problem

This just happened on a production instance. NOTIFY pgrst wasn't triggering a schema cache reload while according to pg_stat_activity the LISTEN channel was up for several days:

select * from pg_stat_activity where application_name ilike '%postgrest%' and query ilike '%listen%';
-[ RECORD 1 ]----+------------------------------
datid            | 12974
datname          | postgres
pid              | 1368652
leader_pid       | 
usesysid         | 16665
usename          | authenticator
application_name | postgrest
client_addr      | 127.0.0.1
client_hostname  | 
client_port      | 53768
backend_start    | 2023-12-28 11:23:50.967734+00
xact_start       | 
query_start      | 2023-12-28 11:23:50.985951+00
state_change     | 2023-12-28 11:23:50.986375+00
wait_event_type  | Client
wait_event       | ClientRead
state            | idle
backend_xid      | 
backend_xmin     | 
query_id         | 7372465010149019010
query            | LISTEN "pgrst"
backend_type     | client backend

Further investigation lead to finding this PostgreSQL bug:

postgres=# listen pgrst;
ERROR:  could not access status of transaction 1044524795
DETAIL:  Could not open file "pg_xact/03E4": No such file or directory.

postgres=# listen pgrst;
ERROR:  could not access status of transaction 1044524795
DETAIL:  Could not open file "pg_xact/03E4": No such file or directory.

postgres=# listen xx;
ERROR:  could not access status of transaction 1044524795
DETAIL:  Could not open file "pg_xact/03E4": No such file or directory.

postgres=# listen yy;
ERROR:  could not access status of transaction 1044524795
DETAIL:  Could not open file "pg_xact/03E4": No such file or directory.

More details:

So seems the LISTEN channel is broken. I can't reproduce this locally though.

The only way to repair this error is by restarting the PostgreSQL instance AFAICT.

Workaround

Killing the LISTEN channel will force PostgREST to reload.

select pg_terminate_backend(pid) from pg_stat_activity 
where application_name ilike '%postgrest%' and query ilike '%listen%';

Further measures

Do we have any way to prevent this? Would a shorter lifetime in the LISTEN channel help?

The text was updated successfully, but these errors were encountered:

steve-chavez · 2024-06-20T21:27:52Z

Just had this again. It seems the LISTEN on primary, conn pool on replicas thing makes it more likely to happen:

Jun 20 19:09:38 postgrest[2384]: 20/Jun/2024:19:09:38 +0000: {"code":"PGRST000","details":"connection to server at \"localhost\" (::1), port 5432 failed: FATAL:  the database system is not yet accepting connections\nDETAIL:  Consistent recovery state has not been yet reached.\nconnection to server at \"[redacted]\", port 5432 failed: session is not read-only\n","hint":null,"message":"Database connection error. Retrying the connection."}
Jun 20 19:09:38 postgrest[2384]: 20/Jun/2024:19:09:38 +0000: Attempting to reconnect to the database in 0 seconds...
Jun 20 19:09:38 postgrest[2384]: 20/Jun/2024:19:09:38 +0000: Failed listening for notifications on the "pgrst" channel. ERROR:  could not access status of transaction 2251772 DETAIL:  Could not open file "pg_xact/0002": No such file or directory.

The Listener just kept retrying while slowing down requests since the connection pool was normal, only the Listener failed.

Apparently, just running:

select pg_terminate_backend(pid)
from pg_stat_activity
where application_name = 'postgrest';

On the primary made things work again, without the need to restart the pg instance.

steve-chavez · 2024-06-20T21:32:17Z

Since it's too hard to reproduce, we need to stop retrying the schema cache whenever the LISTENER fails. That would help mitigate the damage. Also related to #3313 (comment).

steve-chavez · 2024-06-25T23:32:01Z

To mitigate this, I think the easiest way would be to revert 4beac10

acierto · 2025-02-21T15:07:18Z

I see this issue too by using Backstage's event-backend plugin which relies on this feature.

Any news to fix it?

steve-chavez added the QOS label Jan 9, 2024

steve-chavez changed the title ~~LISTEN channel stops working~~ LISTEN channel stops working due to postgres bug Jun 20, 2024

steve-chavez mentioned this issue Jun 26, 2024

fix: don't reload cache on every listener fail #3620

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LISTEN channel stops working due to postgres bug #3147

LISTEN channel stops working due to postgres bug #3147

steve-chavez commented Jan 9, 2024

steve-chavez commented Jun 20, 2024 •

edited

Loading

steve-chavez commented Jun 20, 2024

steve-chavez commented Jun 25, 2024 •

edited

Loading

acierto commented Feb 21, 2025

LISTEN channel stops working due to postgres bug #3147

LISTEN channel stops working due to postgres bug #3147

Comments

steve-chavez commented Jan 9, 2024

Problem

Workaround

Further measures

steve-chavez commented Jun 20, 2024 • edited Loading

steve-chavez commented Jun 20, 2024

steve-chavez commented Jun 25, 2024 • edited Loading

acierto commented Feb 21, 2025

steve-chavez commented Jun 20, 2024 •

edited

Loading

steve-chavez commented Jun 25, 2024 •

edited

Loading