From 1a982cf4ceb4d408e8458919b6ea857c1eb2d5fc Mon Sep 17 00:00:00 2001
From: toby lorne
Date: Fri, 18 Oct 2024 19:43:29 +0200
Subject: [PATCH] sites/www.toby.codes: split zookeeper post(s)

also fix a bug with the date regex which did not know about months ending in 0

Signed-off-by: toby lorne
---
 sites/www.toby.codes/main.go                  |  2 +-
 ...=> 2024-04-ZooKeeper-connection-limits.md} | 11 ++---
 .../posts/2024-10-ZooKeeper-and-quorum.md     | 46 +++++++++++++++++++
 3 files changed, 52 insertions(+), 7 deletions(-)
 rename sites/www.toby.codes/posts/{ZooKeeper.md => 2024-04-ZooKeeper-connection-limits.md} (89%)
 create mode 100644 sites/www.toby.codes/posts/2024-10-ZooKeeper-and-quorum.md

diff --git a/sites/www.toby.codes/main.go b/sites/www.toby.codes/main.go
index c21b0fd..7ae7261 100644
--- a/sites/www.toby.codes/main.go
+++ b/sites/www.toby.codes/main.go
@@ -25,7 +25,7 @@ import (
 
 var (
 	pathRx = regexp.MustCompile("^/posts/(?P[-_a-zA-Z0-9]+)$")
-	dateRx = regexp.MustCompile("^(2[0-9]{3}-[0-1][1-9])-(.*)")
+	dateRx = regexp.MustCompile("^(2[0-9]{3}-[0-1][0-9])-(.*)")
 
 	//go:embed posts/*
 	postsFS embed.FS
diff --git a/sites/www.toby.codes/posts/ZooKeeper.md b/sites/www.toby.codes/posts/2024-04-ZooKeeper-connection-limits.md
index 4227bfc..6aa40e6 100644
--- a/sites/www.toby.codes/posts/ZooKeeper.md
+++ b/sites/www.toby.codes/posts/2024-04-ZooKeeper-connection-limits.md
@@ -1,12 +1,11 @@
-# ZooKeeper
+# ZooKeeper connection limits
 
-_In this post I am collecting interesting things I have observed with
-[Apache ZooKeeper](https://zookeeper.apache.org). We run multiple large
-ZooKeeper clusters at [Booking.com](www.booking.com) and this makes for some
+_Written in 2024-04, split from a post collecting notes on
+[Apache ZooKeeper](https://zookeeper.apache.org).
+We run multiple large ZooKeeper clusters at
+[Booking.com](https://www.booking.com) and this makes for some
 interesting problem solving opportunities_
 
-## 2022
-
 ZooKeeper clients usually maintain an open connection (session) to ZooKeeper.
 In many cases this is the point. Especially important for
 [ephemeral nodes](https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#Ephemeral+Nodes).
diff --git a/sites/www.toby.codes/posts/2024-10-ZooKeeper-and-quorum.md b/sites/www.toby.codes/posts/2024-10-ZooKeeper-and-quorum.md
new file mode 100644
index 0000000..6fc46df
--- /dev/null
+++ b/sites/www.toby.codes/posts/2024-10-ZooKeeper-and-quorum.md
@@ -0,0 +1,46 @@
+# ZooKeeper and quorum
+
+_Written on 2024-10-18, split from a post collecting notes on
+[Apache ZooKeeper](https://zookeeper.apache.org).
+We run multiple large ZooKeeper clusters at
+[Booking.com](https://www.booking.com) and this makes for some
+interesting problem solving opportunities_
+
+ZooKeeper is (naturally) well suited to building service discovery and
+distributed configuration systems. However, due to its specialised nature, it
+is far less popular than Postgres, MySQL, Redis, etc.
+
+If you do not need a distributed system, do not build one. It is quite easy to
+(ab)use Redis as a very simple service discovery database, but it will be very
+difficult to get Redis to match the (distributed) uptime you can achieve with
+ZooKeeper.
+
+What is odd (at least in my experience with ZooKeeper elsewhere) about our
+ZooKeeper deployment is that we deploy a single cluster across multiple
+regions. We have multiple links between each region, and we size the regions
+that contain ZooKeeper primaries equally. ZooKeeper clients (servers and pods)
+are configured to talk only to the ZooKeeper servers in their own region.
+
+During outages and drills in which we have lost a single region, we still have
+enough ZooKeeper primaries in the remaining two regions for quorum, and enough
+that we can afford to lose multiple primaries in those remaining regions.
+
+I have grown to value this stability highly, but it comes at a cost in write
+performance: each write must be acknowledged, potentially across multiple
+regional links. This is not necessarily a problem, because service discovery
+workloads (should) see far more reads than writes. Links between (European)
+regions are private, so latency is small (<12ms) and predictable. Global
+quorum would not be possible, as latency would be too high (e.g. Australia/NZ).
+
+We scale out read capacity by using
+[ZooKeeper observers](https://zookeeper.apache.org/doc/r3.8.3/zookeeperObservers.html),
+as we saw degraded performance with more than 10k connections per ZK node
+(see the 2024-04 post on connection limits). Using observers has given us
+better write performance, because clients connect to the observers and the ZK
+primaries deal only with leader election, writes, and traffic from observers.
+
+With fewer than 30 ZooKeeper servers (not all primaries), we are able to handle
+many hundreds of thousands of simultaneous connections, split across >3
+regions of both private and public cloud workloads. With Kubernetes adoption
+growing into multiple thousands of services, there is high turnover of (pod)
+IP addresses, which has (thus far) proven the architecture.
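
For anyone wanting to sanity-check the regex change by hand, here is a minimal sketch (not part of the patch; the package layout, variable names, and sample inputs are illustrative only). It compiles both versions of dateRx and shows that only the fixed pattern accepts a month ending in 0, such as the 2024-10 post:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// The pattern before and after the fix: [1-9] cannot match the "0" in "10".
	oldDateRx := regexp.MustCompile("^(2[0-9]{3}-[0-1][1-9])-(.*)")
	newDateRx := regexp.MustCompile("^(2[0-9]{3}-[0-1][0-9])-(.*)")

	// Illustrative inputs, based on the new post filenames.
	for _, name := range []string{
		"2024-04-ZooKeeper-connection-limits", // month ending in 4: both patterns match
		"2024-10-ZooKeeper-and-quorum",        // month ending in 0: only the fixed pattern matches
	} {
		fmt.Printf("%-40s old=%v new=%v\n",
			name, oldDateRx.MatchString(name), newDateRx.MatchString(name))
	}

	// The two capture groups split a matching name into date and title parts:
	// [2024-10-ZooKeeper-and-quorum 2024-10 ZooKeeper-and-quorum]
	fmt.Println(newDateRx.FindStringSubmatch("2024-10-ZooKeeper-and-quorum"))
}

Running it should show the old pattern rejecting the 2024-10 name while the fixed one accepts both, and print the date/title split produced by the two capture groups.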