Commit 8f96366

kevinhammondmrwol5nc6rphairRyun1
authored
CIP-0135? | Disaster Recovery Plan for Cardano Networks (#893)
* Created CIP * addtions * minor edits * added Scenarios 3 * untabify * re-added Scenario 3.4 * minor edits * update abstract * update motivation * update scenario_1 * update scenario_2 * update mithril * update rationale * Small improvements following PR merges * Update README.md Small text change * Add section on Genesis checkpoints Expand on what the lightweight checkpoints introduced with Genesis are, and how they can assist with recovery from a disaster. * renamed ro CIP-911 * updated authors * updated authors * added change log * Update CIP-0911/README.md Co-authored-by: Robert Phair <rphair@cosd.com> * Update CIP-0911/README.md Co-authored-by: Ryan <44342099+Ryun1@users.noreply.github.com> * Update CIP-0911/README.md Co-authored-by: Robert Phair <rphair@cosd.com> * Update CIP-0911/README.md Co-authored-by: Robert Phair <rphair@cosd.com> * Update CIP-0911/README.md Co-authored-by: Robert Phair <rphair@cosd.com> * Update CIP-0911/README.md Co-authored-by: Robert Phair <rphair@cosd.com> * Update CIP-0911/README.md Co-authored-by: Robert Phair <rphair@cosd.com> * Update CIP-0911/README.md Co-authored-by: Robert Phair <rphair@cosd.com> * Update CIP-0911/README.md Co-authored-by: Robert Phair <rphair@cosd.com> * Update CIP-0911/README.md Co-authored-by: Robert Phair <rphair@cosd.com> * edited to make it clear which procedures are generic, and which apply to mainnet; removed TODO * renamed directory * added "path to active" plus small editing changes * tweaked path to active * Restructured to single section "Specification" * Update CIP-0135/README.md Co-authored-by: Ryan <44342099+Ryun1@users.noreply.github.com> * Update CIP-0135/README.md Co-authored-by: Thomas Vellekoop <107037423+perturbing@users.noreply.github.com> * Update CIP-0135/README.md Co-authored-by: Thomas Vellekoop <107037423+perturbing@users.noreply.github.com> * Update CIP-0135/README.md Co-authored-by: Thomas Vellekoop <107037423+perturbing@users.noreply.github.com> * Update CIP-0135/README.md 
Co-authored-by: Thomas Vellekoop <107037423+perturbing@users.noreply.github.com> * removed spaces * added some path to active checks * updated acceptance criteria * updated acceptance criteria * changed status to active --------- Co-authored-by: swagendorp <15338420+swagendorp@users.noreply.github.com> Co-authored-by: Nicholas Clarke <nick@topos.org.uk> Co-authored-by: Robert Phair <rphair@cosd.com> Co-authored-by: Ryan <44342099+Ryun1@users.noreply.github.com> Co-authored-by: Thomas Vellekoop <107037423+perturbing@users.noreply.github.com>
1 parent 53884aa commit 8f96366

1 file changed

+360
-0
lines changed

CIP-0135/README.md

+360
@@ -0,0 +1,360 @@
1+
---
2+
CIP: 135
3+
Title: Disaster Recovery Plan for Cardano networks
4+
Category: Tools
5+
Status: Active
6+
Authors:
7+
- Kevin Hammond <kevin.hammond@iohk.io>
8+
- Sam Leathers <samuel.leathers@iohk.io>
9+
- Alex Moser <alex.moser@cardanofoundation.org>
10+
- Steve Wagendorp <steve.wagendorp@cardanofoundation.org>
11+
- Andrew Westberg <andrewwestberg@gmail.com>
12+
- Nicholas Clarke <nicholas.clarke@tweag.io>
13+
Implementors: N/A
14+
Discussions:
15+
- https://github.com/cardano-foundation/CIPs/pull/893
16+
Created: 2024-06-17
17+
License: CC-BY-4.0
18+
---
19+
20+
## Abstract
21+
22+
While the Cardano mainnet and other networks have proven to be highly resilient, it is necessary to proactively
23+
consider the possible recovery mechanisms and procedures that may be required in the unlikely
24+
event of a major failure where the network is unable to recover itself.
25+
26+
This CIP considers three representative scenarios and addresses specific considerations relevant
27+
in each case:
28+
29+
Scenario 1 - __Long-Lived Network Partition__
30+
Scenario 2 - __Failure to Make Blocks for an Extended Period of Time__
31+
Scenario 3 - __Bad Blocks Minted on Chain__
32+
33+
To ensure successful recovery in the event of a chain failure, it's crucial to establish effective
34+
communication channels and exercise recovery procedures in advance to familiarize the community and
35+
stake pool operators (SPOs) with the process.
36+
37+
This CIP is based on an earlier IOHK technical report that is referenced below, supplemented by internal
38+
documentation and discussions that have not been publicly released. It should be considered to be a living
39+
document that is reviewed and revised on a regular basis.
40+
41+
Note that although the focus of disaster recovery is on Cardano mainnet, since this carries the greatest risk
42+
of loss of funds, the recovery procedures are generic and apply to other Cardano
43+
networks, including SanchoNet, Preview, PreProd or private networks.
44+
Appropriate adjustments may need to be made to reflect differences in timing or other concerns.
45+
46+
47+
## Motivation: why is this CIP necessary?
48+
49+
This CIP is needed to familiarize stakeholders with the processes and procedures that should be
50+
followed in the unlikely event that the Cardano mainnet, or another Cardano network, encounters
51+
a situation where the built-in on-chain recovery mechanisms fail.
52+
53+
## Specification
54+
55+
While the exact recovery process will depend on the unique nature of the failure, there are three main scenarios we can consider.
56+
57+
### Scenario 1: Long-Lived Network Partition
58+
59+
Ouroboros Praos is designed to cope with real-world networking
60+
conditions, in which some nodes may temporarily be disconnected from
61+
the network. In this case, the network will continue to make blocks,
62+
perhaps at some lower chain density (reflecting the temporary loss of
63+
stake to the network as a whole). As nodes rejoin the network, they
64+
will then participate in normal block production once again. In this
65+
way, the network remains resilient to changes in connectivity.
66+
67+
If many nodes become disconnected, the network could divide into two
68+
or more completely disconnected parts. Each part of the network could
69+
then form its own chain, backed by the stake that is participating in
70+
its own partition. Under normal conditions, Praos will also deal with
71+
this situation. When the partitioned group of nodes reconnects, the
72+
longest chain will dominate, and the shorter chain will be discarded.
73+
The nodes on the shorter chain will automatically roll back to the
74+
point where the fork occurred, and then rejoin the main chain. This
75+
is perfectly normal. Such forks will typically last only a few
76+
blocks.
77+
78+
However, in an extreme situation, the partition may persist beyond the
79+
Praos rollback limit of *k* blocks (currently 2,160 blocks on mainnet).
80+
In this case, the nodes will not be able to roll back to rejoin the main chain, since this
81+
would violate the required Praos guarantees.
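The rollback rule described above can be illustrated with a minimal sketch. It models chains as plain lists of block hashes and is a simplification, not the node's actual chain selection code. The function names and the `K` constant (set to the mainnet value quoted above) are illustrative assumptions.

```python
K = 2160  # Praos rollback limit in blocks (mainnet value quoted above)


def fork_depth(local, candidate):
    """Number of local blocks that must be rolled back to reach the common prefix."""
    common = 0
    for mine, theirs in zip(local, candidate):
        if mine != theirs:
            break
        common += 1
    return len(local) - common


def can_switch(local, candidate, k=K):
    """A node adopts a longer candidate only if the rollback is at most k blocks."""
    return len(candidate) > len(local) and fork_depth(local, candidate) <= k


shared = [f"b{i}" for i in range(10)]          # hypothetical common prefix
short_fork = shared + ["x1", "x2"]             # chain built inside a partition
main_chain = shared + ["y1", "y2", "y3"]       # longer chain on the main network
print(can_switch(short_fork, main_chain))      # True: only 2 blocks to roll back
print(can_switch(short_fork, main_chain, k=1)) # False: fork is deeper than k
```

With the default `k`, a partition that has grown by more than 2,160 blocks makes `can_switch` return `False`, which corresponds to the case where manual remediation is required.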
82+
83+
84+
#### Remediations
85+
86+
Disconnected nodes must be reconnected to the main chain by their operators. This can be done
87+
by truncating the local block database to a point before the chain fork and then resyncing
88+
against the main network, using the `db-truncator` tool, for example.
89+
90+
Full node wallets can also be recovered in the same way, though this may require technical
91+
skills that the end users do not possess. It may be easier, if slower, for them to simply
92+
resynchronize their nodes from the start of the chain (i.e. from the genesis block).
93+
94+
Ouroboros Genesis provides additional resilience when recovering from long-lived network partitions.
95+
In Praos, nodes resyncing from a point before the chain fork could still, in some cases, follow the
96+
alternative chain (if it is the first one seen) and extra mechanisms may be needed to avoid this
97+
possibility. In Praos, for example, this may require that all participants on the alternative chain
98+
truncate the local block database prior to the partition being resolved. In Ouroboros Genesis
99+
when resyncing from a point before the chain fork, the chain selection rules will ensure
100+
selection of the correct path for the main chain assuming the partition has been resolved.
101+
102+
Alternative methods to resynchronise the node to the main chain might
103+
include the use of Mithril or other signed snapshots. These would
104+
allow faster recovery. However, in this case, care needs to be taken
105+
to achieve the correct balance of trust against speed of recovery.
106+
107+
#### Additional Effects on Cardano Users
108+
109+
Although block producing nodes will rejoin the main network following the remediation
110+
described above, the blocks that they have
111+
minted while they were disconnected will not be included in the main
112+
chain. This may have real world effects that will not be
113+
automatically remedied when the nodes rejoin the main chain. For
114+
example, transactions may have been processed that have significant
115+
real world value, or assumptions may have been made about chains of
116+
evidence/validity, or the timing of transactions. End users should be
117+
aware of the possibility and include provisions in their contracts to
118+
cover this eventuality. It may be necessary to resubmit some or all of the
119+
transactions that were processed on the minority chain onto the main chain.
120+
To avoid unexpected effects, this should be done by the end users/applications, and not
121+
by block producers acting on their behalf.
122+
123+
If they are not observant, stake pools, full node wallets and
124+
other node users (e.g. explorers) could continue indefinitely on the minority
125+
chain. Such users should take care to be aware of this situation and
126+
take steps to rejoin the main chain as quickly as possible.
127+
A reliable and trusted public warning system should be considered that can alert users
128+
and advise them on how to rejoin the main chain.
129+
130+
131+
#### Timing Considerations
132+
133+
On Cardano mainnet, partitions of fewer than 2,160 blocks will automatically rejoin the main chain. With current Cardano mainnet settings, this represents
134+
a period of up to 12 hours during which automatic rollback will occur. If the partition exceeds 2,160 blocks, then the
135+
procedure described above will be necessary to allow nodes to rejoin the main chain. Other Cardano networks may have different
136+
timing characteristics.
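As a rough worked example of where these durations come from, the sketch below converts a block count into an expected wall-clock window. It assumes an active slot coefficient *f* of 0.05 and 1-second slots for mainnet. Neither value is stated in this document, so both should be checked against the genesis configuration of the network in question.

```python
def window_hours(blocks_k, active_slot_coeff, slot_seconds=1.0, multiplier=1):
    """Expected wall-clock span, in hours, of multiplier * k / f slots."""
    slots = multiplier * blocks_k / active_slot_coeff
    return slots * slot_seconds / 3600


# Assumed mainnet parameters: k = 2160, f = 0.05, 1-second slots.
print(window_hours(2160, 0.05))                # about 12 hours (k/f slots)
print(window_hours(2160, 0.05, multiplier=3))  # about 36 hours (3k/f slots)
```

Other networks can be checked by substituting their own `blocks_k`, `active_slot_coeff` and `slot_seconds` values.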
137+
138+
139+
### Scenario 2: Failure to Make Blocks for an Extended Period of Time
140+
141+
Ouroboros Praos requires *at least* one block to be produced every *3k/f* slots. With the current Cardano mainnet
142+
settings, that is a 36 hour period. Such an event is extremely unlikely, but if it were to happen then the network
143+
would be unable to make any further blocks.
144+
145+
#### Mitigation
146+
147+
It is recommended to monitor the chain for block production. If a low density period is observed, then block producers
148+
should be notified, and efforts made to mint new blocks prior to the expiry of the *3k/f* window. If this is not possible
149+
then the remediation procedures should be followed.
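A monitoring check along these lines might look like the following sketch. The warning and critical thresholds, the `GapStatus` type and the function name are hypothetical choices for illustration, and the defaults again assume *k* = 2160 and *f* = 0.05. How the current slot and the slot of the last block are obtained is left to the operator's own tooling.

```python
from dataclasses import dataclass


@dataclass
class GapStatus:
    slots_since_last_block: int
    slots_remaining: int
    level: str  # "ok", "warn" or "critical"


def check_block_gap(current_slot, last_block_slot, k=2160, f=0.05,
                    warn_fraction=0.25, critical_fraction=0.5):
    """Classify the current gap against the 3k/f window (thresholds are illustrative)."""
    window = round(3 * k / f)  # 129,600 slots with the assumed mainnet values
    gap = current_slot - last_block_slot
    if gap >= critical_fraction * window:
        level = "critical"
    elif gap >= warn_fraction * window:
        level = "warn"
    else:
        level = "ok"
    return GapStatus(gap, max(window - gap, 0), level)


status = check_block_gap(current_slot=1_040_000, last_block_slot=1_000_000)
print(status.level, status.slots_remaining)  # warn 89600
```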
150+
151+
#### Remediation
152+
153+
Identify a small group of block producing nodes that will be used to recover the chain. For Cardano mainnet, this group should have
154+
sufficient delegated stake to be capable of generating at least 9 blocks in a 36 hour window.
155+
It should be isolated from the rest of the network.
156+
The chain can then be recovered by resetting the wall clocks on the group of block producing nodes,
157+
restarting them from the last good block on the Cardano network, playing forward the chain production
158+
at high speed (10x usual speed is recommended), while inserting new empty blocks at the slots which
159+
are allocated to the block producers. The recovery nodes can then be restarted with normal settings, including
160+
connections to the network. Ouroboros Genesis then allows other nodes in the network to rapidly resynchronize
161+
with the newly restored chain. This would leave one or more gaps in the chain, interspersed with empty blocks.
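To get a feel for how much stake the recovery group needs, the following sketch estimates block production over a 36 hour window. It uses the Praos per-slot leader probability 1 - (1 - f)^σ, where σ is the group's combined share of active stake, and treats slots as independent trials. The values *f* = 0.05 and 129,600 slots per window are assumed mainnet values, and the function names are illustrative.

```python
from math import comb


def leader_probability(relative_stake, f=0.05):
    """Per-slot probability that the group holds a leader: 1 - (1 - f)^sigma."""
    return 1 - (1 - f) ** relative_stake


def expected_blocks(relative_stake, slots=129_600, f=0.05):
    """Expected number of slots in the window won by the group."""
    return slots * leader_probability(relative_stake, f)


def prob_at_least(n_blocks, relative_stake, slots=129_600, f=0.05):
    """Binomial probability of winning at least n_blocks slots in the window."""
    p = leader_probability(relative_stake, f)
    below = sum(comb(slots, i) * p**i * (1 - p) ** (slots - i) for i in range(n_blocks))
    return 1 - below


# Hypothetical stake shares: 0.1%, 0.2% and 0.5% of active stake.
for sigma in (0.001, 0.002, 0.005):
    print(sigma, round(expected_blocks(sigma), 1), round(prob_at_least(9, sigma), 3))
```

Because an expected count of 9 blocks still leaves a substantial chance of producing fewer, a group chosen with this kind of estimate would want a comfortable margin above the break-even stake share.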
162+
163+
##### Rewards Donation by Recovery Block Producers
164+
165+
In order to avoid allegations of unfair behaviour, block producing nodes that are used to recover the network should
166+
donate any rewards that they receive during recovery to the treasury.
167+
168+
169+
#### Additional Effects on Cardano Users
170+
171+
Unlike in Scenario 1, no transactions will have been processed on a chain that is later discarded, so none will need to be resubmitted.
172+
Users will, however, experience an extended period during which the chain is unavailable.
173+
Cardano applications and contracts should be designed with this possibility in mind.
174+
Full node wallets and other node users should recover quickly once the network is restarted
175+
but there may be a period of instability while network connections are re-established
176+
and the Ouroboros Genesis snapshot is distributed across all nodes.
177+
178+
179+
#### Timing Considerations
180+
181+
The chain will tolerate a gap of up to *3k/f* slots (36 hours with current Cardano mainnet settings).
182+
A period of low chain density could have security implications that affect dynamic availability
183+
and leave open the possibility for future long range attacks. This may be particularly
184+
relevant should chain recovery be performed as described above (using less stake than is required
185+
for an honest majority). To mitigate the presence of an extended period of low chain density we may
186+
need to make use of the lightweight checkpointing mechanism in Ouroboros Genesis. Alternatively, Mithril
187+
could also be used to provide certified snapshots to stake pools as a means to verify the correct state of the ledger.
188+
189+
The adoption of Mithril for fast bootstrapping by light clients and edge nodes should help to mitigate risks
190+
for the types of users on the network that do not participate in consensus.
191+
192+
As described below, Ouroboros Genesis snapshots may also be useful as part of the recovery process.
193+
194+
195+
### Scenario 3: Bad Blocks Minted on Chain
196+
197+
In the event that a bad block were to be minted on-chain, some or all validators might be unable to process the block.
198+
They would therefore stop, and be unable to restart. Wallet and other nodes might be unable to synchronise beyond the
199+
point of the bad block.
200+
201+
#### Remediation
202+
203+
Depending on the cause of the issue and its severity, alternative remediations might be possible.
204+
205+
**Scenario 3.1**: if some existing node versions were able to process the block, but others were not, then
206+
the chain would continue to grow at a lower chain density. SPOs would need to be persuaded to upgrade (or downgrade)
207+
to a suitable node version that would allow the chain to continue. The chain density would then gradually recover to its normal level.
208+
Other users would need to upgrade (or downgrade) to a version of the node that could follow the full chain.
209+
210+
**Scenario 3.2**: if no node version was able to process the block and a
211+
gap of less than *3k/f* slots existed, then the chain could be rolled
212+
back to the point immediately before the bad block was created, and nodes
213+
restarted from this point. The chain would then grow as normal, with a small gap around the bad block.
214+
In this case, care would need to be taken that the rogue transaction was not accidentally reinserted into the chain.
215+
This might involve clearing node mempools, applying filters on the transaction, or developing and deploying a new node version that
216+
rejected the bad block.
217+
218+
**Scenario 3.3**: an alternative to rolling back would be to develop and deploy a "hot-fix" node that could
219+
accept the bad block, either as an exception, or as new acceptable behaviour.
220+
Nodes would then be able to incorporate the bad block as part of the chain,
221+
minting new blocks as usual, or following the chain.
222+
In this case, the bad block would persist on-chain indefinitely and future nodes
223+
would also need to accept the bad block. Such an approach is best used when the rejected block has behaviour
224+
that was unanticipated, but which is benign in nature. This will leave no abnormal gaps in the chain.
225+
226+
**Scenario 3.4**: if more than *3k/f* slots have passed since the bad block was minted, then it will be necessary to roll back the chain to the point immediately
227+
prior to the bad block as in Scenario 3.2, and then proceed as described for Scenario 2. As with Scenario 2, this will leave
228+
a series of gaps in the chain that are interspersed with empty blocks.
229+
230+
#### Timing Considerations
231+
232+
If more than *3k/f* slots have passed since the bad block was minted on-chain (36 hours with current Cardano mainnet settings),
233+
then a mix of recovery techniques will be needed, as described in Scenario 3.4. When deciding on the correct recovery
234+
technique for Scenarios 3.1-3.3, consideration should be given as to whether the recovery can be successfully completed before *3k/f* slots
235+
have elapsed. In case of doubt, the procedure for Scenario 3.4 should be followed.
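The timing rule above can be sketched as a small decision helper. This is an illustration only: the function name and the idea of expressing the recovery estimate in slots are assumptions, and the defaults use the assumed mainnet values *k* = 2160 and *f* = 0.05.

```python
def choose_scenario3_procedure(slots_since_bad_block, estimated_recovery_slots,
                               k=2160, f=0.05):
    """Return which family of Scenario 3 procedures fits the remaining time budget."""
    window = round(3 * k / f)  # 129,600 slots with the assumed mainnet values
    if slots_since_bad_block >= window:
        return "3.4"  # the 3k/f window has already been exhausted
    if slots_since_bad_block + estimated_recovery_slots >= window:
        return "3.4"  # recovery would not finish in time; in case of doubt use 3.4
    return "3.1-3.3"


print(choose_scenario3_procedure(20_000, 30_000))   # 3.1-3.3
print(choose_scenario3_procedure(100_000, 40_000))  # 3.4
```

The helper deliberately errs towards Scenario 3.4 whenever the estimate reaches the edge of the window, matching the "in case of doubt" guidance.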
236+
237+
### Using Ouroboros Genesis Snapshots
238+
239+
Any of the above conditions may result in a period of lower chain density. The
240+
updated consensus mechanism introduced in Ouroboros Genesis relies on making
241+
chain density comparisons to assist a node when catching up with the network,
242+
in order to reduce the reliance on having trusted peers when syncing. As
243+
such, low-density periods pose a potential security risk for the future; they
244+
are periods where a motivated adversary could perform a long-range attack by
245+
building a higher density chain.
246+
247+
In order to mitigate this, Genesis introduces the concept of lightweight
248+
checkpoints. A lightweight checkpoint is effectively a block point - a
249+
combination of block number and hash - which can be distributed along with the
250+
node. Unlike Mithril snapshots (see below), Genesis lightweight checkpoints are not assured by any committee; rather, they form part of the trusted codebase distributed with the node, or by other parties.
251+
252+
When syncing, a Genesis node will refuse to validate past the block number of any lightweight checkpoint if the chain does not contain the correct block at that point.
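The checkpoint rule can be illustrated with a minimal sketch. The data shapes used here (a chain as a sequence of block number and hash pairs, checkpoints as a dictionary) and the function name are assumptions for illustration and do not reflect the node's actual data structures or configuration format.

```python
def first_checkpoint_violation(chain, checkpoints):
    """Return the block number of the first checkpoint the chain contradicts, or None.

    chain: iterable of (block_number, block_hash) pairs in ascending order.
    checkpoints: dict mapping block_number -> expected block_hash.
    """
    for block_number, block_hash in chain:
        expected = checkpoints.get(block_number)
        if expected is not None and expected != block_hash:
            return block_number  # syncing must not proceed past this point
    return None


checkpoints = {3: "hash-c"}  # hypothetical checkpoint list
good_chain = [(1, "hash-a"), (2, "hash-b"), (3, "hash-c"), (4, "hash-d")]
bad_chain = [(1, "hash-a"), (2, "hash-b"), (3, "hash-x"), (4, "hash-y")]
print(first_checkpoint_violation(good_chain, checkpoints))  # None
print(first_checkpoint_violation(bad_chain, checkpoints))   # 3
```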
253+
254+
Genesis snapshots play two potential roles in disaster recovery:
255+
256+
1. In scenarios where the network is split, a lightweight snapshot could guide
257+
a node from the abandoned partition in connecting to the main partition. In
258+
general this should not be needed, however, since the main partition should win
259+
out in any Genesis density comparisons. This usage also falls closer to
260+
scenario 2, in that it relies on an external source imposing a chain selection,
261+
which must then be trusted by all parties.
262+
2. Following a disaster recovery procedure, a sufficient number of blocks
263+
covering the low density period should be added to the list of lightweight
264+
checkpoints. These would serve the purpose of preventing a subsequent
265+
long-range attack.
266+
267+
Note that, in this second role, concerns about the legitimacy of the
268+
checkpoint are much less salient. The checkpoint can be issued post disaster
269+
recovery, at such a time where the points it contains are in the past, and are
270+
both agreed upon and easy to verify for all honest parties.
271+
272+
273+
### Using Mithril Snapshots
274+
275+
Mithril is a stake-based threshold multi-signature scheme. One of the applications of this protocol in Cardano
276+
is to create certified snapshots of the Cardano blockchain. Mithril snapshots allow nodes or applications
277+
to obtain a verified copy of the current state of the blockchain without having to download and verify the full history.
278+
279+
SPOs that participate in the Mithril network provide signed snapshots to a Mithril aggregator that
280+
is responsible for collecting individual signatures from Mithril signers and aggregating them into a multi-signature.
281+
Using this capability, the Mithril aggregator can then provide certified snapshots of the Cardano blockchain that
282+
can potentially be used as a trusted source for recovery purposes.
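The idea of a stake-based threshold can be shown with a deliberately simplified toy model. Mithril's actual protocol is more involved than this, so the sketch substitutes a plain stake-sum check for it and does not use the Mithril API. The pool names, stake units and `quorum` value are hypothetical.

```python
def has_quorum(signer_stakes, total_stake, quorum=0.5):
    """True if the stake behind the collected signatures exceeds the quorum fraction."""
    if total_stake <= 0:
        raise ValueError("total_stake must be positive")
    return sum(signer_stakes.values()) / total_stake > quorum


signers = {"pool_a": 30, "pool_b": 25}              # hypothetical pools and stake units
print(has_quorum(signers, total_stake=100))         # True: 55% of stake has signed
print(has_quorum({"pool_a": 30}, total_stake=100))  # False: only 30% has signed
```

The point of the toy model is only that a snapshot's trustworthiness depends on how much stake stands behind its signatures, which is why adoption matters for using Mithril in recovery.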
283+
284+
Provided that it gains sufficient adoption on the Cardano network and that
285+
snapshots continue to be signed by an honest majority of stake pools
286+
following a chain recovery event, Mithril may therefore provide an
287+
alternative solution to Ouroboros Genesis checkpoints as a way to
288+
verify the correct state of the ledger.
289+
290+
291+
### Recommended Actions for Cardano mainnet
292+
293+
1. Monitor Cardano mainnet for periods of low density and take early action if an extended period is observed.
294+
2. Identify a collection of block producer nodes that has sufficient stake to mint at least 9 blocks in any 36 hour window.
295+
3. Set up emergency communication channels with stake pool operators and other community members.
296+
4. Practice disaster recovery procedures on a regular basis.
297+
5. Provide signed Mithril snapshots and a way for full node wallet users and others to recover from this snapshot.
298+
6. Determine how to employ Ouroboros Genesis snapshots as part of the disaster recovery process.
299+
300+
#### Community Engagement
301+
302+
One of the key requirements for successful disaster recovery will be proper engagement with the community.
303+
304+
1. Identify stake pool operators (SPOs) who can assist with disaster recovery
305+
2. Discuss disaster recovery requirements with Intersect's Technical Working Groups and Security Council
306+
3. Identify and establish the right communications channels with the community, including Intersect
307+
4. Set up regular disaster recovery practice sessions
308+
309+
310+
## Rationale: how does this CIP achieve its goals?
311+
312+
This CIP outlines key disaster recovery scenarios that the Cardano community should understand to mitigate
313+
potential network outages. As a living document, it will be regularly reviewed and updated to inform
314+
stakeholders and encourage more detailed contingency planning. The CIP aims to facilitate discussions,
315+
establish recovery procedures, and encourage regular recovery practice exercises to ensure preparedness
316+
and validation of recovery actions in the event of an outage.
317+
318+
## Path to Active
319+
320+
### Acceptance criteria
321+
322+
- [x] The proposal has been reviewed by the community and sufficiently advertised on various channels.
323+
- [x] Intersect Technical Groups
324+
- [x] Intersect Discord Channels
325+
- [x] Cardano Forum
326+
327+
- [x] All major concerns or feedback have been addressed.
328+
329+
### Implementation Plan
330+
331+
N/A
332+
333+
## Change Log
334+
335+
| Version | Date | Description |
336+
| -------- | -------- | ------- |
337+
| 0.1 | 2024-08-30 | Initial submitted version |
338+
| 0.2 | 2024-09-10 | Revised version to emphasize genericity of recovery techniques |
339+
| 0.3 | 2024-09-18 | Revised version following CIP editors meeting |
340+
341+
## References
342+
343+
[Cardano Disaster Recovery Plan (May 2021)](https://iohk.io/en/research/library/papers/cardano-disaster-recovery-plan/)
344+
345+
[Cardano Incident Reports](https://updates.cardano.intersectmbo.org/tags/incident)
346+
347+
[January 2023 Block Production Temporary Outage](https://updates.cardano.intersectmbo.org/2023-04-17-ledger)
348+
349+
[DB Truncator Tool](https://github.com/IntersectMBO/ouroboros-consensus/tree/486753d0b7d6b0d09621d1ef8be85e5117ff3d1e/ouroboros-consensus-cardano/app)
350+
351+
[DB Synthesizer Tool](https://github.com/IntersectMBO/ouroboros-consensus/tree/486753d0b7d6b0d09621d1ef8be85e5117ff3d1e/ouroboros-consensus-cardano/app)
352+
353+
[Ouroboros Genesis](https://iohk.io/en/research/library/papers/ouroboros-genesis-composable-proof-of-stake-blockchains-with-dynamic-availability/)
354+
355+
[Mithril](https://github.com/input-output-hk/mithril)
356+
357+
358+
## Copyright
359+
360+
This CIP is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode).
