Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

LISTEN channel stops working due to postgres bug #3147

Open
steve-chavez opened this issue Jan 9, 2024 · 4 comments
Open

LISTEN channel stops working due to postgres bug #3147

steve-chavez opened this issue Jan 9, 2024 · 4 comments
Labels

Comments

@steve-chavez
Copy link
Member

Problem

This just happened on a production instance. NOTIFY pgrst wasn't triggering a schema cache reload while according to pg_stat_activity the LISTEN channel was up for several days:

select * from pg_stat_activity where application_name ilike '%postgrest%' and query ilike '%listen%';
-[ RECORD 1 ]----+------------------------------
datid            | 12974
datname          | postgres
pid              | 1368652
leader_pid       | 
usesysid         | 16665
usename          | authenticator
application_name | postgrest
client_addr      | 127.0.0.1
client_hostname  | 
client_port      | 53768
backend_start    | 2023-12-28 11:23:50.967734+00
xact_start       | 
query_start      | 2023-12-28 11:23:50.985951+00
state_change     | 2023-12-28 11:23:50.986375+00
wait_event_type  | Client
wait_event       | ClientRead
state            | idle
backend_xid      | 
backend_xmin     | 
query_id         | 7372465010149019010
query            | LISTEN "pgrst"
backend_type     | client backend

Further investigation lead to finding this PostgreSQL bug:

postgres=# listen pgrst;
ERROR:  could not access status of transaction 1044524795
DETAIL:  Could not open file "pg_xact/03E4": No such file or directory.

postgres=# listen pgrst;
ERROR:  could not access status of transaction 1044524795
DETAIL:  Could not open file "pg_xact/03E4": No such file or directory.

postgres=# listen xx;
ERROR:  could not access status of transaction 1044524795
DETAIL:  Could not open file "pg_xact/03E4": No such file or directory.

postgres=# listen yy;
ERROR:  could not access status of transaction 1044524795
DETAIL:  Could not open file "pg_xact/03E4": No such file or directory.

More details:

So seems the LISTEN channel is broken. I can't reproduce this locally though.

The only way to repair this error is by restarting the PostgreSQL instance AFAICT.

Workaround

Killing the LISTEN channel will force PostgREST to reload.

select pg_terminate_backend(pid) from pg_stat_activity 
where application_name ilike '%postgrest%' and query ilike '%listen%';

Further measures

Do we have any way to prevent this? Would a shorter lifetime in the LISTEN channel help?

@steve-chavez steve-chavez changed the title LISTEN channel stops working LISTEN channel stops working due to postgres bug Jun 20, 2024
@steve-chavez
Copy link
Member Author

steve-chavez commented Jun 20, 2024

Just had this again. It seems the LISTEN on primary, conn pool on replicas thing makes it more likely to happen:

Jun 20 19:09:38 postgrest[2384]: 20/Jun/2024:19:09:38 +0000: {"code":"PGRST000","details":"connection to server at \"localhost\" (::1), port 5432 failed: FATAL:  the database system is not yet accepting connections\nDETAIL:  Consistent recovery state has not been yet reached.\nconnection to server at \"[redacted]\", port 5432 failed: session is not read-only\n","hint":null,"message":"Database connection error. Retrying the connection."}
Jun 20 19:09:38 postgrest[2384]: 20/Jun/2024:19:09:38 +0000: Attempting to reconnect to the database in 0 seconds...
Jun 20 19:09:38 postgrest[2384]: 20/Jun/2024:19:09:38 +0000: Failed listening for notifications on the "pgrst" channel. ERROR:  could not access status of transaction 2251772 DETAIL:  Could not open file "pg_xact/0002": No such file or directory.

The Listener just kept retrying while slowing down requests since the connection pool was normal, only the Listener failed.

Apparently, just running:

select pg_terminate_backend(pid)
from pg_stat_activity
where application_name = 'postgrest';

On the primary made things work again, without the need to restart the pg instance.

@steve-chavez
Copy link
Member Author

Since it's too hard to reproduce, we need to stop retrying the schema cache whenever the LISTENER fails. That would help mitigate the damage. Also related to #3313 (comment).

@steve-chavez
Copy link
Member Author

steve-chavez commented Jun 25, 2024

To mitigate this, I think the easiest way would be to revert 4beac10

@acierto
Copy link

acierto commented Feb 21, 2025

I see this issue too by using Backstage's event-backend plugin which relies on this feature.

Any news to fix it?

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Development

No branches or pull requests

2 participants