fix: listener retries with exponential backoff #3536

steve-chavez · 2024-05-19T19:05:53Z

Continues #3533

Now it works like:

$ postgrest-with-postgresql-16 --replica -f test/spec/fixtures/load.sql postgrest-run

19/May/2024:20:24:28 -0500: Retrying listening for notifications in 1 seconds...
19/May/2024:20:24:28 -0500: Failed listening for notifications on the "pgrst" channel. ERROR:  cannot execute LISTEN during recovery
19/May/2024:20:24:30 -0500: Retrying listening for notifications in 2 seconds...
19/May/2024:20:24:30 -0500: Failed listening for notifications on the "pgrst" channel. ERROR:  cannot execute LISTEN during recovery
19/May/2024:20:24:34 -0500: Retrying listening for notifications in 4 seconds...
19/May/2024:20:24:34 -0500: Failed listening for notifications on the "pgrst" channel. ERROR:  cannot execute LISTEN during recovery

Limitation

Once the listen channel is recovered, the retry status is not reset. So if the last backoff was 4 seconds, the next time recovery kicks in the backoff will be 8 seconds. If a recovery reaches 32 seconds, the backoff will stay there.

This is a problem with the interaction between hasql-notifications and retry. The Hasql.Notifications.waitForNotifications uses a forever loop that only finishes when it throws an exception, retry recovers on an exception and succeeds (here it restarts the retry status) when the function finishes .

I've left this as a TODO for now, it's still better than retrying without backoff.

steve-chavez · 2024-05-19T20:57:04Z

Really puzzled by this error that only happens on CI, not locally:

def set_statement_timeout(postgrest, role, milliseconds):
        """Set the statement timeout for the given role.
        For this to work reliably with low previous timeout settings,
        use a postgrest instance that doesn't use the affected role."""
    
        response = postgrest.session.post(
            "/rpc/set_statement_timeout", data={"role": role, "milliseconds": milliseconds}
        )
>       assert response.text == ""
E       assert '{"code":"57P01","details":null,"hint":null,"message":"terminating connection due to administrator command"}' == ''
E         
E         + {"code":"57P01","details":null,"hint":null,"message":"terminating connection due to administrator command"}

Whenever io tests fail, they're really hard to debug.

steve-chavez · 2024-05-19T21:34:45Z

It's like something is calling pg_terminate_backend. AFAICT there's no relationship between statement_timeout and LISTEN.

The only place that's done is here:

postgrest/test/io/test_io.py

Lines 1220 to 1222 in 1ad6dc6

    
           os.system( 
        
               f'psql -d {defaultenv["PGDATABASE"]} -U {defaultenv["PGUSER"]} -h {defaultenv["PGHOST"]} --set ON_ERROR_STOP=1 -a -c "SELECT terminate_pgrst()"' 
        
           )

Managed to reproduce the error locally, the only related db logs:

2024-05-19 21:16:48.888 UTC [908423] LOG:  statement: SELECT terminate_pgrst()
2024-05-19 21:16:48.889 UTC [907910] FATAL:  terminating connection due to administrator command
2024-05-19 21:16:48.889 UTC [908419] FATAL:  terminating connection due to administrator command
2024-05-19 21:16:48.889 UTC [908420] FATAL:  terminating connection due to administrator command

steve-chavez · 2024-05-19T23:22:18Z

It's like something is calling pg_terminate_backend

That was it, I just skipped this test:

postgrest/test/io/test_io.py

Lines 1212 to 1226 in 4416050

    
           def test_fail_with_automatic_recovery_disabled_and_terminated_using_query(defaultenv): 
        
               "Should fail without retries when automatic recovery is disabled and pg_terminate_backend(pid) is called" 
        
               env = { 
        
                   **defaultenv, 
        
                   "PGRST_DB_POOL_AUTOMATIC_RECOVERY": "false", 
        
               } 
        
               with run(env=env) as postgrest: 
        
                   os.system( 
        
                       f'psql -d {defaultenv["PGDATABASE"]} -U {defaultenv["PGUSER"]} -h {defaultenv["PGHOST"]} --set ON_ERROR_STOP=1 -a -c "SELECT terminate_pgrst()"' 
        
                   ) 
        
                   exitCode = wait_until_exit(postgrest) 
        
                   assert exitCode == 1

And the tests no longer fail.

The thing that has changed in this PR is that the listener no longer waits for the connectionWorker (since the listener now has its own backoff), so looks like the pg_terminate_backend was affecting the LISTEN connection too.

I think this can be fixed by isolating terminate_pgrst with a particular application name.

Fixed on #3537

Also corrects the admin ready response which now considers the listener state.

Clears the limitation mentioned on PostgREST#3536 The Listener no longer uses the https://hackage.haskell.org/package/retry package and instead uses a much simpler IORef in AppState for the delays. Additionally it no longer uses exception throwing/catching, which is rather messy and brings some concerns(PostgREST#3569 (comment)).

Clears the limitation mentioned on #3536 The Listener no longer uses the https://hackage.haskell.org/package/retry package and instead uses a much simpler IORef in AppState for the delays. Additionally it no longer uses exception throwing/catching, which is rather messy and brings some concerns(#3569 (comment)).

steve-chavez marked this pull request as draft May 19, 2024 20:14

steve-chavez mentioned this pull request May 20, 2024

test: isolate pg_terminate_backend to appname #3537

Merged

steve-chavez force-pushed the backoff-replica-fail branch from d3d8499 to b95f55c Compare May 20, 2024 01:23

steve-chavez marked this pull request as ready for review May 20, 2024 01:31

fix: listener retries with exponential backoff

5c51bd7

Also corrects the admin ready response which now considers the listener state.

steve-chavez force-pushed the backoff-replica-fail branch from b95f55c to 5c51bd7 Compare May 20, 2024 01:33

steve-chavez merged commit 3cf5656 into PostgREST:main May 20, 2024
24 of 25 checks passed

This was referenced May 20, 2024

feat: force listener to work on primary cluster #3462

Merged

The schema cache load is not retried with exponential backoff #3523

Closed

steve-chavez mentioned this pull request Jun 5, 2024

correct exponential backoff on Listener #3572

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: listener retries with exponential backoff #3536

fix: listener retries with exponential backoff #3536

steve-chavez commented May 19, 2024 •

edited

Loading

steve-chavez commented May 19, 2024 •

edited

Loading

steve-chavez commented May 19, 2024 •

edited

Loading

steve-chavez commented May 19, 2024 •

edited

Loading

fix: listener retries with exponential backoff #3536

fix: listener retries with exponential backoff #3536

Conversation

steve-chavez commented May 19, 2024 • edited Loading

Limitation

steve-chavez commented May 19, 2024 • edited Loading

steve-chavez commented May 19, 2024 • edited Loading

steve-chavez commented May 19, 2024 • edited Loading

steve-chavez commented May 19, 2024 •

edited

Loading

steve-chavez commented May 19, 2024 •

edited

Loading

steve-chavez commented May 19, 2024 •

edited

Loading

steve-chavez commented May 19, 2024 •

edited

Loading