cip10
titleCoordinated network upgrades
descriptionProtocol for coordinating major network upgrades
authorCallum Waters (@cmwaters)
discussions-tohttps://forum.celestia.org/t/cip-coordinated-network-upgrades/1367
statusFinal
typeStandards Track
categoryCore
created2023-12-07

Abstract

Use a pre-programmed height for the next major upgrade. Following major upgrades will use an in-protocol signalling mechanism. Validators will submit messages to signal their ability and preference to use the next version. Once a quorum of 5/6 has signalled the same version, the network will migrate to that version.

Motivation

The Celestia network needs to be able to upgrade across different state machines so new features and bugs that are protocol breaking can be supported. Versions of the Celestia consensus node are required to support all prior state machines, thus nodes are able to upgrade any time ahead of the coordinated upgrade height and very little downtime is experienced in transition. The problem addressed in this CIP is to define a protocol for coordinating that upgrade height.

Specification

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119 and RFC 8174.

The next upgrade will be coordinated using a hard coded height that is passed as a flag when the node commences. This is an exception necessary to introduce the upgrading protocol which will come into affect for the following upgrade.

The network introduces two new message types MsgSignalVersion and MsgTryUpgrade

message MsgSignalVersion {
  string validator_address = 1;
  uint64 version = 2;
}

message MsgTryUpgrade { string signer = 1; }

Only validators can submit MsgSignalVersion. The Celestia state machine tracks which version each validator has signalled for. The signalled version MUST either be the same version or the next. There is no support for skipping versions or downgrading.

Clients may query the tally for each version as follows:

message QueryVersionTallyRequest { uint64 version = 1; }

message QueryVersionTallyResponse {
  uint64 voting_power = 1;
  uint64 threshold_power = 2;
  uint64 total_voting_power = 3;
}

When voting_power is greater or equal to theshold_power , the network MAY upgrade. This is done through a “crank” transaction, MsgTryUpgrade, which can be submitted by any account. The account covers the gas required for the calculation. If the quorum is met, the chain will update the AppVersion in the ConsensusParams returned in EndBlock. Celestia will reset the tally and perform all necessary migrations at the end of processing that block in Commit. The proposer of the following height will include the new version in the block. As is currently, nodes will only vote for blocks that match the same network version as theirs.

If the network agrees to move to a version that is not supported by the node, the node will gracefully shutdown.

The threshold_power is calcualted as 5/6ths of the total voting power. Rationale is provided below.

Rationale

When attempting a major upgrade, there is increased risk that the network halts. At least 2/3 in voting power of the network is needed to migrate and agree upon the next block. However this does not take into account the actions of byzantine validators. As an example, say there is 100 in total voting power. 66 have signalled v2 and 33 have yet to signal i.e. they are still signalling v1. It takes 1 byzantine voting power to pass the threshold, convincing the network to propose v2 blocks and then omitting their vote leaving the network failing to reach consensus until one of the 33 are able to upgrade. At the other end of the spectrum, increasing the necessary threshold means less voting power required to veto the upgrade. The middle point here is setting a quorum of 5/6ths which provides 1/6 byzantine fault tolerance to liveness and requiring at least 1/6th of the network to veto the upgrade.

Validators are permitted to signal for the current network version as a means of cancelling their prior decision to upgrade. This is important in the case that new information arises that convinces the network that the next version is not ready.

An on-chain tallying system was decided over an off-chain as it canonicalises the information which is important for coordination and adds accountability to validators. As a result of this decision, validators will need to pay fees to upgrade which is necessary to avoid spamming the chain.

Backwards Compatibility

This feature modifies the functionality of the state machine in a breaking way as the state machine can now dictate version changes. This will require a major upgrade to implement (thus the protocol won’t come into affect until the following major upgrade).

As the API is additive, there is no need to consider backwards compatibility for clients.

Test Cases

All implementations are advised to test the following scenarios:

  • A version x node can run on a version y network where x >= y.
  • A MsgTryUpgrade should not modify the app version if there has been less than 5/6th and set the new app version when that threshold has been reached.
  • A version x node should gracefully shutdown and not continue to validate blocks on a version y network when y > x.
  • MsgSignal should correctly tally the accounts voting power. Signalling multiple times by the same validator should not increase the tally. A validator should be able to resignal a different version at any time.

Reference Implementation

The golang implementation of the signal module can be found here

Security Considerations

See the section on rationale for understanding the network halting risk.

Copyright and related rights waived via CC0.