
Failed master not replaced? #322

Closed
LordFPL opened this issue Jul 25, 2017 · 5 comments

Comments

LordFPL commented Jul 25, 2017

Hello,

Due to a power outage, all my nodes were halted... and powering them back on gives me a strange situation:

stolonctl status
=== Active sentinels ===

ID              LEADER
98ed7841        true
cdeaa3fc        false

=== Active proxies ===

ID
04cec75f
32f3e610
f13e3c8f
fac929c0

=== Keepers ===

UID     PG LISTENADDRESS        HEALTHY PGWANTEDGENERATION      PGCURRENTGENERATION
cald00  xx.xx.xx.xx:30394      true    5                       5
cald01  xx.xx.xx.xx:28013      true    1                       0
cald02  xx.xx.xx.xx:45050      true    1                       0
cald03  xx.xx.xx.xx:50395      false   7                       6

=== Cluster Info ===

Master: cald03

===== Keepers tree =====

cald03 (master)
├─cald00
├─cald02
└─cald01

My master is on an unhealthy node... and the logs on the sentinel are:

[E] 2017-07-25T13:48:58Z sentinel.go:268: no keeper info available db=99093732 keeper=cald03
[I] 2017-07-25T13:48:58Z sentinel.go:861: master db is failed db=99093732 keeper=cald03
[I] 2017-07-25T13:48:58Z sentinel.go:867: db not converged db=99093732 keeper=cald03
[I] 2017-07-25T13:48:58Z sentinel.go:872: trying to find a new master to replace failed master
[E] 2017-07-25T13:48:58Z sentinel.go:875: no eligible masters

Why is the sentinel not switching to cald00, which was the last master before the outage?
How can I force a new master?

Thanks in advance :)

LordFPL commented Jul 25, 2017

(As I need my cluster, I found a workaround: restart the unhealthy node... and kill it once all the others have resynced OK.)

sgotti commented Jul 25, 2017

@LordFPL Can you provide more info on the keepers' states (their logs)? You could also run the sentinel with --debug so we can see why it's not choosing cald00.
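
For example, something along these lines (a rough sketch; the cluster name, store backend and endpoints are placeholders for your actual setup, --debug is the only relevant addition):

stolon-sentinel --cluster-name stolon-cluster --store-backend etcd --store-endpoints http://127.0.0.1:2379 --debug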

From the above status it looks like cald01 and cald02 are not in a good state to be elected (not converged, since their wanted and current generations differ). Perhaps cald00 is too far behind given the maxStandbyLag defined in the cluster spec (defaults to 1MiB)?
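
If the cluster is still up, you can check what is currently configured with something like this (a rough sketch; cluster name and store flags are placeholders for your setup):

stolonctl --cluster-name stolon-cluster --store-backend etcd spec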

LordFPL commented Jul 25, 2017

On the keepers I have these logs. On cald00 (synced):

[I] 2017-07-25T13:53:01Z keeper.go:1269: our db requested role is standby followedDB=99093732
[I] 2017-07-25T13:53:01Z keeper.go:1288: already standby
[I] 2017-07-25T13:53:01Z keeper.go:1410: postgres parameters not changed
FATAL:  could not connect to the primary server: could not connect to server: Connection refused
		Is the server running on host "cald03.IP" and accepting
		TCP/IP connections on port 50395?

On cald01 (not synced):

[E] 2017-07-25T13:53:37Z keeper.go:853: db failed to initialize or resync
[I] 2017-07-25T13:53:37Z keeper.go:903: current db UID different than cluster data db UID db= cdDB=fd550dd6
[I] 2017-07-25T13:53:37Z keeper.go:1048: resyncing the database cluster
[I] 2017-07-25T13:53:37Z keeper.go:1077: database cluster not initialized
[I] 2017-07-25T13:53:37Z postgresql.go:642: running pg_basebackup
pg_basebackup: could not connect to server: could not connect to server: Connection refused
        Is the server running on host "cald03.IP" and accepting
        TCP/IP connections on port 50395?
[E] 2017-07-25T13:53:37Z keeper.go:1105: failed to resync from followed instance error=sync error: error: exit status 1

I can't add --debug, as the problem went away after restarting the failed master, sorry.

sgotti commented Jul 25, 2017

@LordFPL From these logs I can see that cald00 and cald01 weren't able to talk to cald03, probably because it died before them. It looks like your nodes died multiple times and at different times, since cald01 was instructed to resync, which usually happens when it's unelected as master. But I cannot see their current xlogpos (it's printed only at debug level, or can be shown with a stolonctl clusterdata command, but now it's too late). So I can only suppose that cald00 wasn't chosen as the new master because it was lagging behind the failed master by more than maxStandbyLag.

As a note:
Depending on how much data/how many transactions you accept to lose, you can increase maxStandbyLag (default 1MiB). If you don't want to lose any transactions, enable synchronous replication instead.
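
For example, something along these lines can patch the cluster spec (a rough sketch; the cluster name, store flags and the 16MiB value are only placeholders, adjust them to your setup):

stolonctl --cluster-name stolon-cluster --store-backend etcd update --patch '{ "maxStandbyLag": 16777216 }'
stolonctl --cluster-name stolon-cluster --store-backend etcd update --patch '{ "synchronousReplication": true }'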

Next time something like this happens, you could try saving a stolonctl clusterdata dump (just added a few days ago) before taking any action, so you can provide it here later.
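
For example (a rough sketch; cluster name and store flags are placeholders for your setup):

stolonctl --cluster-name stolon-cluster --store-backend etcd clusterdata > clusterdata.json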

I'll open a PR to move the sentinel logs that report why a standby was skipped in the new-master decision from debug to info level, since that makes more sense.

How can I force a new master?

If you want to force a db to become master (be sure that its state is good, or you'll end up with a master missing a lot of transactions), you can reinitialize the cluster using an existing keeper:
https://github.com/sorintlab/stolon/blob/master/doc/initialization.md#initialize-a-new-stolon-cluster-using-an-existing-keeper
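
Roughly, the procedure in that doc boils down to an init with initMode "existing" pointing at the keeper you trust (a sketch only; cluster name, store flags and the keeper UID are placeholders, check the linked doc for the exact steps):

stolonctl --cluster-name stolon-cluster --store-backend etcd init '{ "initMode": "existing", "existingConfig": { "keeperUID": "cald00" } }'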

LordFPL commented Jul 25, 2017

@sgotti Many thanks for this very complete answer. If this problem happens again, I will collect more logs before taking any action.
All the best :)

LordFPL closed this as completed Jul 25, 2017