
Intro to working with pacemaker

DrDaveD edited this page Dec 9, 2024 · 6 revisions

Log messages for pacemaker go into /var/log/pacemaker/pacemaker.log and those for pcsd go into /var/log/pcsd/pcsd.log. When there are problems, consult both of them first.
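As a starting point for reading those logs, here is a minimal sketch of a helper that pulls out the most recent lines mentioning errors or warnings. The function name recent_problems is made up for this page, and matching on the words "error" and "warning" is an assumption about what is worth looking at, not a complete filter.

```shell
# Print the most recent lines that mention errors or warnings in a log file.
# Usage: recent_problems LOGFILE [COUNT]   (COUNT defaults to 20)
recent_problems() {
    logfile=$1
    count=${2:-20}
    # -i: ignore case, -E: extended regex; the keywords are an assumption
    grep -iE 'error|warning' "$logfile" | tail -n "$count"
}
```

For example, recent_problems /var/log/pacemaker/pacemaker.log shows the last 20 matching lines, and recent_problems /var/log/pcsd/pcsd.log 50 shows the last 50 from pcsd.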

Learn how to set up a virtual IP address in pacemaker using one of the internet guides such as this one.

To find out which machine is the master, run the command pcs status. The name of the current master is shown on the line labeled "Current DC". If you have a service IP configured, you can also run 'ip addr show' or 'ifconfig' to see whether this machine is currently serving it.
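The two checks above can be scripted. This is a sketch, not part of pcs: the function names are made up here, and the parser assumes the "Current DC" line has the form "Current DC: NAME ...", possibly preceded by spaces or a "*" bullet as some pcs versions print it.

```shell
# Read `pcs status` output on stdin and print the node named on the
# "Current DC" line. Prints nothing if there is no such line.
parse_current_dc() {
    # Assumed shape: optional spaces or "*", then "Current DC: NAME ..."
    sed -n 's/^[ *]*Current DC:[ ]*\([^ ]*\).*/\1/p' | head -n 1
}

# Succeed if the given address is configured on any local interface.
# Usage: serves_ip ADDRESS
serves_ip() {
    # `ip -o` prints one line per address, e.g. "... inet 192.0.2.10/24 ..."
    ip -o addr show | grep -qF " $1/"
}
```

Typical use is pcs status | parse_current_dc to print the master's name, and serves_ip 192.0.2.10 && echo "serving the service IP" on each machine, where 192.0.2.10 is a placeholder for your own service IP.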

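To manually swap which machine is master, run pcs cluster stop HOSTNAME && pcs cluster start HOSTNAME where HOSTNAME is the name of the current master. Stopping the cluster on the current master makes the other machine take over, and starting it again brings the old master back as the backup.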
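If you do the swap often, the command pair can be wrapped in a small function. This is only a sketch: swap_master is a made-up name, and the PCS variable is a hypothetical hook that lets you substitute echo for a dry run.

```shell
# Stop and restart cluster services on the current master so that the
# other machine takes over.
# Usage: swap_master HOSTNAME
# PCS is a hypothetical override for dry runs, e.g. PCS=echo.
swap_master() {
    host=${1:-}
    if [ -z "$host" ]; then
        echo "usage: swap_master HOSTNAME" >&2
        return 2
    fi
    # Same sequence as the manual command: start runs only if stop succeeded
    ${PCS:-pcs} cluster stop "$host" && ${PCS:-pcs} cluster start "$host"
}
```

Running PCS=echo swap_master HOSTNAME prints the two pcs subcommands without executing them. Afterwards, run pcs status again to confirm that the "Current DC" line names the other machine.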

It's important to test the manual swaps, and also to simulate a crash of the master machine to make sure that the backup takes over automatically. If you have physical access to the machine, power-cycle it. If you don't have physical access but have console access, you can stop networking and then reboot. If you have neither, temporarily override the systemd restart setting on the machine under test by running

# systemctl edit pacemaker

and inserting the lines

[Service]
Restart=no

Then you can run systemctl kill --signal=9 pacemaker and reboot. Don't forget to remove the Restart=no when done testing. The backup should take over after 15 seconds and keep control even after the old master is fully back up. Do the test twice, once with each machine as master, to make sure it switches automatically in both directions.
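The override can also be written and removed without opening an editor. The sketch below assumes that systemctl edit pacemaker stores its drop-in at the systemd default location, /etc/systemd/system/pacemaker.service.d/override.conf. The function names and the DROPIN_DIR variable are made up for this page.

```shell
# Directory where `systemctl edit pacemaker` keeps its drop-in by default
# (assumed; override DROPIN_DIR if your system differs).
DROPIN_DIR=${DROPIN_DIR:-/etc/systemd/system/pacemaker.service.d}

# Write the Restart=no override without opening an editor.
write_restart_override() {
    mkdir -p "$DROPIN_DIR" &&
    printf '[Service]\nRestart=no\n' > "$DROPIN_DIR/override.conf"
}

# Remove the override once testing is done.
remove_restart_override() {
    rm -f "$DROPIN_DIR/override.conf"
}
```

Run systemctl daemon-reload as root after either function so that systemd picks up the change. You can check the active setting with systemctl show -p Restart pacemaker.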

Note that if networking is stopped on one machine for more than 15 seconds and then restarted, or one machine freezes for that long, or for some other reason the machines can't communicate for that long, then once communication resumes both machines will think they're master. This is known as a "split brain" situation, and pacemaker has a very hard time recovering from it automatically. It's best to avoid that situation completely. It helps to have a second point-to-point network between the machines so they can still communicate if there's a network outage. If the cluster does get into this situation, the logs will show there's a split brain and pcs status will show no Current DC. To recover manually, you may be able to run service pacemaker stop on one of the machines, but even that might not work. Sometimes you need to shut both of them down hard by disabling the Restart as above and killing pacemaker, and then restart them one at a time.
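The "no Current DC" symptom is easy to check from a script or a monitoring job. This is a sketch: has_current_dc is a made-up name, and it assumes that a cluster without a DC either omits the "Current DC" line or prints the literal NONE as the name.

```shell
# Read `pcs status` output on stdin; succeed only if it names a Current DC.
has_current_dc() {
    dc=$(sed -n 's/^[ *]*Current DC:[ ]*\([^ ]*\).*/\1/p' | head -n 1)
    # A missing line and the placeholder NONE (assumed) both count as no DC
    [ -n "$dc" ] && [ "$dc" != "NONE" ]
}
```

For example, pcs status | has_current_dc || echo "no Current DC, check the logs for split brain". A failing check only points at a possible split brain; the logs are still what confirms it.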

Note that in a split brain situation where both machines think they're serving the service IP, connections to the service IP from clients should continue to be served by the last machine that enabled the address. That is because when an address is enabled, a gratuitous ARP message is sent out to tell routers and other hosts on the network to replace their cached IP address to MAC address translation with the new MAC address. A machine that has been off the network and comes back, on the other hand, doesn't send out one of those gratuitous ARP messages. On a cvmfs-hastratum1 system the problem shows itself when cvmfs repository updates stop, because neither machine will be master.
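To see which machine the network currently associates with the service IP, you can look at the ARP cache of another Linux host on the same subnet. This is a sketch: mac_for_ip is a made-up helper, and it assumes the usual ip neigh show line format of "ADDRESS dev IFACE lladdr MAC STATE".

```shell
# Read `ip neigh show` output on stdin and print the MAC address cached
# for the given IP address, if there is one.
# Usage: ip neigh show | mac_for_ip ADDRESS
mac_for_ip() {
    awk -v ip="$1" '
        $1 == ip {
            # The MAC is the field right after the "lladdr" keyword
            for (i = 2; i < NF; i++)
                if ($i == "lladdr") { print $(i + 1); exit }
        }'
}
```

Compare the result with the MAC addresses shown by ip link show on the two cluster machines. The one that matches is the machine that clients on that subnet are being sent to.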
