
restore-cluster (to different cluster) fails when pssh pool size is smaller than the cluster size #803

Open
serban21 opened this issue Sep 16, 2024 · 0 comments


When `--pssh-pool-size` is smaller than the cluster size, the list of target hosts is split into multiple batches (see https://github.com/thelastpickle/cassandra-medusa/blob/master/medusa/orchestration.py#L57). The problem is that the list of source hosts passed to pssh as `host_args` (the list of old hosts taken from `--host-list`) is not split accordingly. So when a 12-node cluster is restored to a different cluster with the same number of nodes and a pool size of 3, the first 3 target nodes get the correct data from the first 3 old nodes in the host list, but the next 3 targets receive the same data (token ranges and SSTables) from those same first 3 source nodes. The end result is quite strange: Cassandra 4 actually starts on all 12 nodes, with errors in the logs, and `nodetool status` reports only the first 3 nodes even though Cassandra is running on all of them.

The solution is simple: split the source host list into matching batches as well. I'll create a PR without tests today, and then see whether I can add tests too.
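A minimal sketch of the idea (not Medusa's actual code; `batch`, `restore_in_batches`, and the host names are hypothetical): split both lists into chunks of the pool size and zip the chunks, so each batch of targets only sees its own batch of sources.

```python
# Hedged illustration of the proposed fix: batch the target hosts AND the
# source hosts (--host-list) together, instead of passing the full source
# list as host_args to every pssh batch.

def batch(seq, size):
    """Yield consecutive chunks of `seq` with at most `size` elements each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def restore_in_batches(target_hosts, old_hosts, pssh_pool_size):
    """Pair each batch of target hosts with the corresponding source batch.

    Zipping the two batched lists keeps the target -> source mapping aligned.
    Passing the full `old_hosts` list to every batch (the bug reported above)
    makes every batch restore from the same first few source nodes.
    """
    pairs = []
    for targets, sources in zip(batch(target_hosts, pssh_pool_size),
                                batch(old_hosts, pssh_pool_size)):
        # In Medusa this is where pssh would run, with `sources` as host_args.
        pairs.append(list(zip(targets, sources)))
    return pairs

# With 6 nodes and a pool size of 3, each target maps to its own source:
print(restore_in_batches(["new1", "new2", "new3", "new4", "new5", "new6"],
                         ["old1", "old2", "old3", "old4", "old5", "old6"], 3))
```

With the buggy behavior, the second batch would be paired with `old1`–`old3` again; with the aligned batching it correctly gets `old4`–`old6`.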

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: MED-96
