| 1 | +--- |
| 2 | +CIP: 135 |
| 3 | +Title: Disaster Recovery Plan for Cardano networks |
| 4 | +Category: Tools |
| 5 | +Status: Active |
| 6 | +Authors: |
| 7 | + - Kevin Hammond <kevin.hammond@iohk.io> |
| 8 | + - Sam Leathers <samuel.leathers@iohk.io> |
| 9 | + - Alex Moser <alex.moser@cardanofoundation.org> |
| 10 | + - Steve Wagendorp <steve.wagendorp@cardanofoundation.org> |
| 11 | + - Andrew Westberg <andrewwestberg@gmail.com> |
| 12 | + - Nicholas Clarke <nicholas.clarke@tweag.io> |
| 13 | +Implementors: N/A |
| 14 | +Discussions: |
| 15 | + - https://github.com/cardano-foundation/CIPs/pull/893 |
| 16 | +Created: 2024-06-17 |
| 17 | +License: CC-BY-4.0 |
| 18 | +--- |
| 19 | + |
| 20 | +## Abstract |
| 21 | + |
| 22 | +While the Cardano mainnet and other networks have proven to be highly resilient, it is necessary to proactively |
| 23 | +consider the possible recovery mechanisms and procedures that may be required in the unlikely |
| 24 | +event of a major failure where the network is unable to recover itself. |
| 25 | + |
| 26 | +This CIP considers three representative scenarios and addresses specific considerations relevant |
| 27 | +in each case: |
| 28 | + |
| 29 | +- Scenario 1 - __Long-Lived Network Partition__ |
| 30 | +- Scenario 2 - __Failure to Make Blocks for an Extended Period of Time__ |
| 31 | +- Scenario 3 - __Bad Blocks Minted on Chain__ |
| 32 | + |
| 33 | +To ensure successful recovery in the event of a chain failure, it's crucial to establish effective |
| 34 | +communication channels and exercise recovery procedures in advance to familiarize the community and |
| 35 | +stake pool operators (SPOs) with the process. |
| 36 | + |
| 37 | +This CIP is based on an earlier IOHK technical report that is referenced below, supplemented by internal |
| 38 | +documentation and discussions that have not been publicly released. It should be considered to be a living |
| 39 | +document that is reviewed and revised on a regular basis. |
| 40 | + |
| 41 | +Note that although the focus of disaster recovery is on Cardano mainnet, since this is where the risk |
| 42 | +of loss of funds is greatest, the recovery procedures are generic and apply to other Cardano |
| 43 | +networks, including SanchoNet, Preview, PreProd or private networks. |
| 44 | +Appropriate adjustments may need to be made to reflect differences in timing or other concerns. |
| 45 | + |
| 46 | + |
| 47 | +## Motivation: why is this CIP necessary? |
| 48 | + |
| 49 | +This CIP is needed to familiarize stakeholders with the processes and procedures that should be |
| 50 | +followed in the unlikely event that the Cardano mainnet, or another Cardano network, encounters |
| 51 | +a situation where the built-in on-chain recovery mechanisms fail. |
| 52 | + |
| 53 | +## Specification |
| 54 | + |
| 55 | +While the exact recovery process will depend on the unique nature of the failure, there are three main scenarios we can consider. |
| 56 | + |
| 57 | +### Scenario 1: Long-Lived Network Partition |
| 58 | + |
| 59 | +Ouroboros Praos is designed to cope with real-world networking |
| 60 | +conditions, in which some nodes may temporarily be disconnected from |
| 61 | +the network. In this case, the network will continue to make blocks, |
| 62 | +perhaps at some lower chain density (reflecting the temporary loss of |
| 63 | +stake to the network as a whole). As nodes rejoin the network, they |
| 64 | +will then participate in normal block production once again. In this |
| 65 | +way, the network remains resilient to changes in connectivity. |
| 66 | + |
| 67 | +If many nodes become disconnected, the network could divide into two |
| 68 | +or more completely disconnected parts. Each part of the network could |
| 69 | +then form its own chain, backed by the stake that is participating in |
| 70 | +its own partition. Under normal conditions, Praos will also deal with |
| 71 | +this situation. When the partitioned group of nodes reconnects, the |
| 72 | +longest chain will dominate, and the shorter chain will be discarded. |
| 73 | +The nodes on the shorter chain will automatically roll back to the |
| 74 | +point where the fork occurred, and then rejoin the main chain. This |
| 75 | +is perfectly normal. Such forks will typically last only a few |
| 76 | +blocks. |
| 77 | + |
| 78 | +However, in an extreme situation, the partition may persist beyond the |
| 79 | +Praos rollback limit of *k* blocks (currently 2,160 blocks on mainnet). |
| 80 | +In this case, the nodes will not be able to roll back to rejoin the main chain, since this |
| 81 | +would violate the required Praos guarantees. |
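
The rollback rule above can be illustrated with a minimal sketch. It models each chain as a plain list of block hashes and checks whether a node could switch to a longer candidate chain without rolling back more than *k* blocks. The function names are hypothetical, and treating a rollback of exactly *k* blocks as permitted is an assumption of this sketch. It is not the node's actual chain selection code.

```python
from typing import Sequence

K_MAINNET = 2160  # rollback limit k quoted in the text for mainnet


def shared_prefix_length(local: Sequence[str], candidate: Sequence[str]) -> int:
    """Count the leading blocks that both chains have in common."""
    shared = 0
    for ours, theirs in zip(local, candidate):
        if ours != theirs:
            break
        shared += 1
    return shared


def rollback_depth(local: Sequence[str], candidate: Sequence[str]) -> int:
    """Number of local blocks that must be discarded to reach the fork point."""
    return len(local) - shared_prefix_length(local, candidate)


def can_switch_automatically(
    local: Sequence[str], candidate: Sequence[str], k: int = K_MAINNET
) -> bool:
    """True if the candidate is longer and the rollback stays within k blocks.

    Assumption: a rollback of exactly k blocks is still allowed.
    """
    return len(candidate) > len(local) and rollback_depth(local, candidate) <= k


# Toy example with a tiny k so the effect is visible.
local_chain = ["g", "a", "b", "c"]
main_chain = ["g", "a", "x", "y", "z"]
print(rollback_depth(local_chain, main_chain))
print(can_switch_automatically(local_chain, main_chain, k=2))
print(can_switch_automatically(local_chain, main_chain, k=1))
```

When the check fails, the node cannot rejoin by itself and operator intervention is needed.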
| 82 | + |
| 83 | + |
| 84 | +#### Remediations |
| 85 | + |
| 86 | +Disconnected nodes must be reconnected to the main chain by their operators. This can be done |
| 87 | +by truncating the local block database to a point before the chain fork and then resyncing |
| 88 | +against the main network, using the `db-truncator` tool, for example. |
| 89 | + |
| 90 | +Full node wallets can also be recovered in the same way, though this may require technical |
| 91 | +skills that the end users do not possess. It may be easier, if slower, for them to simply |
| 92 | +resynchronize their nodes from the start of the chain (i.e. from the genesis block). |
| 93 | + |
| 94 | +Ouroboros Genesis provides additional resilience when recovering from long-lived network partitions. |
| 95 | +In Praos, nodes resyncing from a point before the chain fork could still, in some cases, follow the |
| 96 | +alternative chain (if it is the first one seen), and extra mechanisms may be needed to avoid this |
| 97 | +possibility. For example, this may require that all participants on the alternative chain |
| 98 | +truncate the local block database prior to the partition being resolved. In Ouroboros Genesis, |
| 99 | +when resyncing from a point before the chain fork, the chain selection rules will ensure |
| 100 | +selection of the correct path for the main chain, assuming the partition has been resolved. |
| 101 | + |
| 102 | +Alternative methods to resynchronise the node to the main chain might |
| 103 | +include the use of Mithril or other signed snapshots. These would |
| 104 | +allow faster recovery. However, in this case, care needs to be taken |
| 105 | +to achieve the correct balance of trust against speed of recovery. |
| 106 | + |
| 107 | +#### Additional Effects on Cardano Users |
| 108 | + |
| 109 | +Although block producing nodes will rejoin the main network following the remediation |
| 110 | +described above, the blocks that they have |
| 111 | +minted while they were disconnected will not be included in the main |
| 112 | +chain. This may have real world effects that will not be |
| 113 | +automatically remedied when the nodes rejoin the main chain. For |
| 114 | +example, transactions may have been processed that have significant |
| 115 | +real world value, or assumptions may have been made about chains of |
| 116 | +evidence/validity, or the timing of transactions. End users should be |
| 117 | +aware of the possibility and include provisions in their contracts to |
| 118 | +cover this eventuality. It may be necessary to resubmit some or all of the |
| 119 | +transactions that were processed on the minority chain onto the main chain. |
| 120 | +To avoid unexpected effects, this should be done by the end users/applications, and not |
| 121 | +by block producers acting on their behalf. |
| 122 | + |
| 123 | +If they are not observant, stake pools, full node wallets and |
| 124 | +other node users (e.g. explorers) could continue indefinitely on the minority |
| 125 | +chain. Such users should take care to be aware of this situation and |
| 126 | +take steps to rejoin the main chain as quickly as possible. |
| 127 | +A reliable and trusted public warning system should be considered that can alert users |
| 128 | +and advise them on how to rejoin the main chain. |
| 129 | + |
| 130 | + |
| 131 | +#### Timing Considerations |
| 132 | + |
| 133 | +On Cardano mainnet, partitions of fewer than 2,160 blocks will automatically rejoin the main chain. With current Cardano mainnet settings, this represents |
| 134 | +a period of up to 12 hours during which automatic rollback will occur. If the partition exceeds 2,160 blocks, then the |
| 135 | +procedure described above will be necessary to allow nodes to rejoin the main chain. Other Cardano networks may have different |
| 136 | +timing characteristics. |
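
As a worked example of where the 12 hour figure comes from, the sketch below converts the *k*-block limit into wall-clock time. The value *k* = 2,160 is taken from the text. The active slot coefficient *f* = 1/20 and the 1 second slot length are mainnet parameters that this document does not state, so they are assumptions here. The same helper reproduces the 36 hour *3k/f* window used in Scenario 2.

```python
from fractions import Fraction

K = 2160                 # rollback limit in blocks (from the text)
F = Fraction(1, 20)      # assumed active slot coefficient f
SLOT_SECONDS = 1         # assumed slot length in seconds


def window_slots(k: int, f: Fraction, multiple: int = 1) -> Fraction:
    """Number of slots in which `multiple * k` blocks are expected: multiple * k / f."""
    return Fraction(multiple * k) / f


def window_hours(k: int, f: Fraction, slot_seconds: int, multiple: int = 1) -> Fraction:
    """Convert the slot window into hours."""
    return window_slots(k, f, multiple) * slot_seconds / 3600


print(window_slots(K, F))                            # k/f slots
print(window_hours(K, F, SLOT_SECONDS))              # rollback window in hours
print(window_hours(K, F, SLOT_SECONDS, multiple=3))  # 3k/f window in hours
```

Other Cardano networks can be checked the same way by substituting their own values for `K`, `F` and `SLOT_SECONDS`.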
| 137 | + |
| 138 | + |
| 139 | +### Scenario 2: Failure to Make Blocks for an Extended Period of Time |
| 140 | + |
| 141 | +Ouroboros Praos requires *at least* one block to be produced every *3k/f* slots. With the current Cardano mainnet |
| 142 | +settings, that is a 36 hour period. Such an event is extremely unlikely, but if it were to happen then the network |
| 143 | +would be unable to make any further blocks. |
| 144 | + |
| 145 | +#### Mitigation |
| 146 | + |
| 147 | +It is recommended to monitor the chain for block production. If a low density period is observed, then block producers |
| 148 | +should be notified, and efforts made to mint new blocks prior to the expiry of the *3k/f* window. If this is not possible |
| 149 | +then the remediation procedures should be followed. |
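
A monitoring check along these lines could be sketched as follows. It compares the gap since the last observed block with the *3k/f* window and returns an alert level. The warning and critical thresholds (1/36 and 1/2 of the window), the alert names, and the function name are illustrative assumptions rather than recommended values. The mainnet value *f* = 1/20 is also assumed.

```python
from enum import Enum
from fractions import Fraction


class Alert(Enum):
    OK = "ok"
    WARN = "warn"
    CRITICAL = "critical"
    EXPIRED = "expired"


def classify_gap(
    last_block_slot: int,
    current_slot: int,
    k: int = 2160,
    f: Fraction = Fraction(1, 20),               # assumed mainnet value
    warn_fraction: Fraction = Fraction(1, 36),    # hypothetical threshold
    critical_fraction: Fraction = Fraction(1, 2), # hypothetical threshold
) -> Alert:
    """Classify the time since the last block relative to the 3k/f window."""
    limit = Fraction(3 * k) / f          # 3k/f slots
    gap = current_slot - last_block_slot
    if gap >= limit:
        return Alert.EXPIRED             # window elapsed: remediation is needed
    if gap >= critical_fraction * limit:
        return Alert.CRITICAL            # notify block producers urgently
    if gap >= warn_fraction * limit:
        return Alert.WARN                # low density observed
    return Alert.OK


print(classify_gap(last_block_slot=1_000, current_slot=1_020))
print(classify_gap(last_block_slot=1_000, current_slot=1_000 + 70_000))
```

In practice the thresholds would be tuned so that operators have enough time to act before the window expires.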
| 150 | + |
| 151 | +#### Remediation |
| 152 | + |
| 153 | +Identify a small group of block producing nodes that will be used to recover the chain. For Cardano mainnet, this group should have |
| 154 | +sufficient delegated stake to be capable of generating at least 9 blocks in a 36-hour window. |
| 155 | +This group should be isolated from the rest of the network. |
| 156 | +The chain can then be recovered by resetting the wall clocks on the group of block producing nodes, |
| 157 | +restarting them from the last good block on the Cardano network, playing forward the chain production |
| 158 | +at high speed (10x usual speed is recommended), while inserting new empty blocks at the slots which |
| 159 | +are allocated to the block producers. The recovery nodes can then be restarted with normal settings, including |
| 160 | +connections to the network. Ouroboros Genesis then allows other nodes in the network to rapidly resynchronize |
| 161 | +with the newly restored chain. This would leave one or more gaps in the chain, interspersed with empty blocks. |
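
To get a rough feel for how much stake the recovery group needs, the sketch below estimates the expected number of blocks for a given stake fraction. It uses the Praos slot leadership probability 1 - (1 - f)^σ for relative stake σ. The value *f* = 0.05 and the 129,600-slot window (36 hours of 1 second slots) are assumed mainnet parameters, and the function names are hypothetical. An expected value is not a guarantee, so a real recovery group would need a comfortable margin above this estimate.

```python
import math

F = 0.05                # assumed active slot coefficient
WINDOW_SLOTS = 129_600  # assumed 3k/f window: 36 hours of 1-second slots


def leader_probability(stake_fraction: float, f: float = F) -> float:
    """Per-slot probability that a party with this relative stake leads the slot."""
    return 1.0 - (1.0 - f) ** stake_fraction


def expected_blocks(stake_fraction: float, slots: int = WINDOW_SLOTS, f: float = F) -> float:
    """Expected number of blocks minted by this stake over the window."""
    return slots * leader_probability(stake_fraction, f)


def stake_for_expected_blocks(target: float, slots: int = WINDOW_SLOTS, f: float = F) -> float:
    """Relative stake whose expected output over the window equals `target` blocks."""
    return math.log(1.0 - target / slots) / math.log(1.0 - f)


sigma = stake_for_expected_blocks(9)
print(f"stake fraction for 9 expected blocks: {sigma:.6f}")
print(f"expected blocks at that stake: {expected_blocks(sigma):.3f}")
```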
| 162 | + |
| 163 | +##### Rewards Donation by Recovery Block Producers |
| 164 | + |
| 165 | +In order to avoid allegations of unfair behaviour, block producing nodes that are used to recover the network should |
| 166 | +donate any rewards that they receive during recovery to the treasury. |
| 167 | + |
| 168 | + |
| 169 | +#### Additional Effects on Cardano Users |
| 170 | + |
| 171 | +Unlike Scenario 1, no transactions will have been processed on a minority chain, so none will need to be resubmitted. |
| 172 | +Users will, however, experience an extended period during which the chain is unavailable. |
| 173 | +Cardano applications and contracts should be designed with this possibility in mind. |
| 174 | +Full node wallets and other node users should recover quickly once the network is restarted |
| 175 | +but there may be a period of instability while network connections are re-established |
| 176 | +and the Ouroboros Genesis snapshot is distributed across all nodes. |
| 177 | + |
| 178 | + |
| 179 | +#### Timing Considerations |
| 180 | + |
| 181 | +The chain will tolerate a gap of up to *3k/f* slots (36 hours with current Cardano mainnet settings). |
| 182 | +A period of low chain density could have security implications that affect dynamic availability |
| 183 | +and leave open the possibility for future long range attacks. This may be particularly |
| 184 | +relevant should chain recovery be performed as described above (using less stake than is required |
| 185 | +for an honest majority). To mitigate the presence of an extended period of low chain density we may |
| 186 | +need to make use of the lightweight checkpointing mechanism in Ouroboros Genesis. Alternatively, Mithril |
| 187 | +could also be used to provide certified snapshots to stake pools as a means to verify the correct state of the ledger. |
| 188 | + |
| 189 | +The adoption of Mithril for fast bootstrapping by light clients and edge nodes should help to mitigate risks |
| 190 | +for the types of users on the network that do not participate in consensus. |
| 191 | + |
| 192 | +As described below, Ouroboros Genesis snapshots may also be useful as part of the recovery process. |
| 193 | + |
| 194 | + |
| 195 | +### Scenario 3: Bad Blocks Minted on Chain |
| 196 | + |
| 197 | +In the event that a bad block were to be minted on-chain, some or all validators might be unable to process the block. |
| 198 | +They would therefore stop, and be unable to restart. Wallet and other nodes might be unable to synchronise beyond the |
| 199 | +point of the bad block. |
| 200 | + |
| 201 | +#### Remediation |
| 202 | + |
| 203 | +Depending on the cause of the issue and its severity, alternative remediations might be possible. |
| 204 | + |
| 205 | +**Scenario 3.1**: if some existing node versions were able to process the block, but others were not, then |
| 206 | +the chain would continue to grow at a lower chain density. SPOs would need to be persuaded to upgrade (or downgrade) |
| 207 | +to a suitable node version that would allow the chain to continue. The chain density would then gradually recover to its normal level. |
| 208 | +Other users would need to upgrade (or downgrade) to a version of the node that could follow the full chain. |
| 209 | + |
| 210 | +**Scenario 3.2**: if no node version was able to process the block and a |
| 211 | +gap of less than *3k/f* slots existed, then the chain could be rolled |
| 212 | +back to immediately before the bad block was created, and nodes |
| 213 | +restarted from this point. The chain would then grow as normal, with a small gap around the bad block. |
| 214 | +In this case, care would need to be taken that the rogue transaction was not accidentally reinserted into the chain. |
| 215 | +This might involve clearing node mempools, applying filters on the transaction, or developing and deploying a new node version that |
| 216 | +rejected the bad block. |
| 217 | + |
| 218 | +**Scenario 3.3**: an alternative to rolling back would be to develop and deploy a "hot-fix" node that could |
| 219 | +accept the bad block, either as an exception, or as new acceptable behaviour. |
| 220 | +Nodes would then be able to incorporate the bad block as part of the chain, |
| 221 | +minting new blocks as usual, or following the chain. |
| 222 | +In this case, the bad block would persist on-chain indefinitely and future nodes |
| 223 | +would also need to accept the bad block. Such an approach is best used when the rejected block has behaviour |
| 224 | +that was unanticipated, but which is benign in nature. This will leave no abnormal gaps in the chain. |
| 225 | + |
| 226 | +**Scenario 3.4**: if more than *3k/f* slots have passed since the bad block was minted, then it will be necessary to roll back the chain to immediately |
| 227 | +prior to the bad block as in Scenario 3.2, and then proceed as described for Scenario 2. As with Scenario 2, this will leave |
| 228 | +a series of gaps in the chain that are interspersed with empty blocks. |
| 229 | + |
| 230 | +#### Timing Considerations |
| 231 | + |
| 232 | +If more than *3k/f* slots have passed since the bad block was minted on-chain (36 hours with current Cardano mainnet settings), |
| 233 | +then a mix of recovery techniques will be needed, as described in Scenario 3.4. When deciding on the correct recovery |
| 234 | +technique for Scenarios 3.1-3.3, consideration should be given as to whether the recovery can be successfully completed before *3k/f* slots |
| 235 | +have elapsed. In case of doubt, the procedure for Scenario 3.4 should be followed. |
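
The choice between Scenarios 3.1 to 3.4 can be summarised as a small decision function. This is one possible reading of the guidance above, not a prescribed procedure. The input names are hypothetical, the 129,600-slot window assumes mainnet settings, and preferring the hot-fix (3.3) over rollback (3.2) whenever the block is benign is an interpretation of this sketch.

```python
from enum import Enum

WINDOW_SLOTS = 129_600  # assumed 3k/f window on mainnet (36 hours of 1-second slots)


class Remediation(Enum):
    UPGRADE_OR_DOWNGRADE = "3.1"
    ROLL_BACK = "3.2"
    HOT_FIX = "3.3"
    ROLL_BACK_AND_REPLAY = "3.4"


def choose_remediation(
    slots_since_bad_block: int,
    estimated_recovery_slots: int,
    some_versions_accept: bool,
    block_is_benign: bool,
    window_slots: int = WINDOW_SLOTS,
) -> Remediation:
    """Pick a remediation, falling back to 3.4 when the window may be missed."""
    # If recovery cannot complete inside the 3k/f window, use Scenario 3.4.
    if slots_since_bad_block + estimated_recovery_slots >= window_slots:
        return Remediation.ROLL_BACK_AND_REPLAY
    # Some node versions can process the block: move operators to such a version.
    if some_versions_accept:
        return Remediation.UPGRADE_OR_DOWNGRADE
    # No version accepts the block; a hot-fix suits unanticipated but benign behaviour.
    if block_is_benign:
        return Remediation.HOT_FIX
    return Remediation.ROLL_BACK


print(choose_remediation(10_000, 20_000, some_versions_accept=False, block_is_benign=False))
print(choose_remediation(120_000, 20_000, some_versions_accept=True, block_is_benign=True))
```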
| 236 | + |
| 237 | +### Using Ouroboros Genesis Snapshots |
| 238 | + |
| 239 | +Any of the above conditions may result in a period of lower chain density. The |
| 240 | +updated consensus mechanism introduced in Ouroboros Genesis relies on making |
| 241 | +chain density comparisons to assist a node when catching up with the network, |
| 242 | +in order to reduce the reliance on having trusted peers when syncing. As |
| 243 | +such, low-density periods pose a potential security risk for the future; they |
| 244 | +are periods where a motivated adversary could perform a long-range attack by |
| 245 | +building a higher density chain. |
| 246 | + |
| 247 | +In order to mitigate this, Genesis introduces the concept of lightweight |
| 248 | +checkpoints. A lightweight checkpoint is effectively a block point - a |
| 249 | +combination of block number and hash - which can be distributed along with the |
| 250 | +node. Unlike Mithril snapshots (see below), Genesis lightweight checkpoints are not assured by any committee - rather, they form part of the trusted codebase distributed with the node, or by other parties. |
| 251 | + |
| 252 | +When syncing, a Genesis node will refuse to validate past the block number of any lightweight checkpoint if the chain does not contain the correct block at that point. |
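
The checkpoint rule can be sketched as a filter over a candidate chain. Here a chain is modelled as a list of `(block number, hash)` pairs and the checkpoints as a mapping from block number to expected hash. Both representations and the function name are assumptions for illustration and do not reflect the node's actual data structures.

```python
from typing import Dict, List, Sequence, Tuple

Block = Tuple[int, str]  # (block number, block hash)


def accepted_prefix(chain: Sequence[Block], checkpoints: Dict[int, str]) -> List[Block]:
    """Return the part of `chain` a syncing node would be willing to validate.

    Validation stops before the first checkpointed block number at which the
    chain does not contain the expected hash.
    """
    hashes_by_number = dict(chain)
    for number in sorted(checkpoints):
        if hashes_by_number.get(number) != checkpoints[number]:
            # Wrong or missing block at the checkpoint: do not go past it.
            return [block for block in chain if block[0] < number]
    return list(chain)


candidate = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
print(accepted_prefix(candidate, {3: "c"}))  # checkpoint matches
print(accepted_prefix(candidate, {3: "x"}))  # checkpoint mismatch at block 3
```

A chain that has not yet reached a checkpointed block number is kept in full, since nothing past the checkpoint has been validated.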
| 253 | + |
| 254 | +Genesis snapshots play two potential roles in disaster recovery: |
| 255 | + |
| 256 | +1. In scenarios where the network is split, a lightweight snapshot could guide |
| 257 | + a node from the abandoned partition in connecting to the main partition. In |
| 258 | + general this should not be needed, however, since the main partition should win |
| 259 | + out in any Genesis density comparisons. This usage also falls closer to |
| 260 | + scenario 2, in that it relies on an external source imposing a chain selection, |
| 261 | + which must then be trusted by all parties. |
| 262 | +2. Following a disaster recovery procedure, a sufficient number of blocks |
| 263 | + covering the low density period should be added to the list of lightweight |
| 264 | + checkpoints. These would serve the purpose of preventing a subsequent |
| 265 | + long-range attack. |
| 266 | + |
| 267 | +Note that, in this second scenario, concerns about the legitimacy of the |
| 268 | +checkpoint are much less salient. The checkpoint can be issued post disaster |
| 269 | +recovery, at such a time where the points it contains are in the past, and are |
| 270 | +both agreed upon and easy to verify for all honest parties. |
| 271 | + |
| 272 | + |
| 273 | +### Using Mithril Snapshots |
| 274 | + |
| 275 | +Mithril is a stake-based threshold multi-signature scheme. One of the applications of this protocol in Cardano |
| 276 | +is to create certified snapshots of the Cardano blockchain. Mithril snapshots allow nodes or applications |
| 277 | +to obtain a verified copy of the current state of the blockchain without having to download and verify the full history. |
| 278 | + |
| 279 | +SPOs that participate in the Mithril network provide signed snapshots to a Mithril aggregator that |
| 280 | +is responsible for collecting individual signatures from Mithril signers and aggregating them into a multi-signature. |
| 281 | +Using this capability, the Mithril aggregator can then provide certified snapshots of the Cardano blockchain that |
| 282 | +can potentially be used as a trusted source for recovery purposes. |
| 283 | + |
| 284 | +Provided that it gains sufficient adoption on the Cardano network and that |
| 285 | +snapshots continue to be signed by an honest majority of stake pools |
| 286 | +following a chain recovery event, Mithril may therefore provide an |
| 287 | +alternative solution to Ouroboros Genesis checkpoints as a way to |
| 288 | +verify the correct state of the ledger. |
| 289 | + |
| 290 | + |
| 291 | +### Recommended Actions for Cardano mainnet |
| 292 | + |
| 293 | +1. Monitor Cardano mainnet for periods of low density and take early action if an extended period is observed. |
| 294 | +2. Identify a collection of block producer nodes that has sufficient stake to mint at least 9 blocks in any 36 hour window. |
| 295 | +3. Set up emergency communication channels with stake pool operators and other community members. |
| 296 | +4. Practice disaster recovery procedures on a regular basis. |
| 297 | +5. Provide signed Mithril snapshots and a way for full node wallet users and others to recover from this snapshot. |
| 298 | +6. Determine how to employ Ouroboros Genesis snapshots as part of the disaster recovery process. |
| 299 | + |
| 300 | +#### Community Engagement |
| 301 | + |
| 302 | +One of the key requirements for successful disaster recovery will be proper engagement with the community. |
| 303 | + |
| 304 | +1. Identify stake pool operators (SPOs) who can assist with disaster recovery |
| 305 | +2. Discuss disaster recovery requirements with Intersect's Technical Working Groups and Security Council |
| 306 | +3. Identify and establish the right communications channels with the community, including Intersect |
| 307 | +4. Set up regular disaster recovery practice sessions |
| 308 | + |
| 309 | + |
| 310 | +## Rationale: how does this CIP achieve its goals? |
| 311 | + |
| 312 | +This CIP outlines key disaster recovery scenarios that the Cardano community should understand to mitigate |
| 313 | +potential network outages. As a living document, it will be regularly reviewed and updated to inform |
| 314 | +stakeholders and encourage more detailed contingency planning. The CIP aims to facilitate discussions, |
| 315 | +establish recovery procedures, and encourage regular recovery practice exercises to ensure preparedness |
| 316 | +and validation of recovery actions in the event of an outage. |
| 317 | + |
| 318 | +## Path to Active |
| 319 | + |
| 320 | +### Acceptance criteria |
| 321 | + |
| 322 | +- [x] The proposal has been reviewed by the community and sufficiently advertised on various channels. |
| 323 | + - [x] Intersect Technical Groups |
| 324 | + - [x] Intersect Discord Channels |
| 325 | + - [x] Cardano Forum |
| 326 | + |
| 327 | +- [x] All major concerns or feedback have been addressed. |
| 328 | + |
| 329 | +### Implementation Plan |
| 330 | + |
| 331 | +N/A |
| 332 | + |
| 333 | +## Change Log |
| 334 | + |
| 335 | +| Version | Date | Description | |
| 336 | +| -------- | -------- | ------- | |
| 337 | +| 0.1 | 2024-08-30 | Initial submitted version | |
| 338 | +| 0.2 | 2024-09-10 | Revised version to emphasize genericity of recovery techniques | |
| 339 | +| 0.3 | 2024-09-18 | Revised version following CIP editors meeting | |
| 340 | + |
| 341 | +## References |
| 342 | + |
| 343 | +[Cardano Disaster Recovery Plan (May 2021)](https://iohk.io/en/research/library/papers/cardano-disaster-recovery-plan/) |
| 344 | + |
| 345 | +[Cardano Incident Reports](https://updates.cardano.intersectmbo.org/tags/incident) |
| 346 | + |
| 347 | +[January 2023 Block Production Temporary Outage](https://updates.cardano.intersectmbo.org/2023-04-17-ledger) |
| 348 | + |
| 349 | +[DB Truncator Tool](https://github.com/IntersectMBO/ouroboros-consensus/tree/486753d0b7d6b0d09621d1ef8be85e5117ff3d1e/ouroboros-consensus-cardano/app) |
| 350 | + |
| 351 | +[DB Synthesizer Tool](https://github.com/IntersectMBO/ouroboros-consensus/tree/486753d0b7d6b0d09621d1ef8be85e5117ff3d1e/ouroboros-consensus-cardano/app) |
| 352 | + |
| 353 | +[Ouroboros Genesis](https://iohk.io/en/research/library/papers/ouroboros-genesis-composable-proof-of-stake-blockchains-with-dynamic-availability/) |
| 354 | + |
| 355 | +[Mithril](https://github.com/input-output-hk/mithril) |
| 356 | + |
| 357 | + |
| 358 | +## Copyright |
| 359 | + |
| 360 | + This CIP is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode). |