*Disclaimer: This is not intended as a criticism of any particular client. It is highly probable that each client and possibly even the specification has its own oversights and bugs. Eth2 is an intricate protocol, and the individuals implementing it are only human. This article aims to underscore how and why the risks are alleviated.*
With the launch of the Medalla testnet, people were encouraged to experiment with different clients. And right from genesis, we saw why: Nimbus and Lodestar nodes could not cope with the workload of a full testnet and got stuck. [0][1] As a result, Medalla failed to finalize for the first half hour of its existence.
On August 14th, Prysm nodes lost synchronization when one of the time servers they relied on suddenly jumped a day into the future. These nodes then began producing blocks and attestations as though they, too, were in the future. When the clocks on these nodes were corrected (either by updating the client or when the time server returned to the correct time), those who had disabled the default slashing protection were slashed.
What exactly happened is a little more subtle, and I strongly recommend reading Raul Jordan’s write-up of the incident.
Clock Failure – The worsening
At the moment Prysm nodes began time traveling, they made up ~62% of the network. This meant that the majority required to finalize blocks (>2/3 on a single chain) could not be reached. Even worse, these nodes could not find the chain they expected (there was a 4-hour “gap” in the history and they had all jumped to slightly different times), so they flooded the network with short forks as they guessed at the “missing” data.
Currently, Prysm constitutes 82% of Medalla nodes 😳 ! [ethernodes.org]
At this stage, the network was flooded with thousands of different guesses at what the head of the chain was, and all the clients began to buckle under the extra work of figuring out which chain was the right one. This led to nodes falling behind, needing to sync, running out of memory, and other forms of chaos, all of which made the situation worse.
In the end, this was beneficial, as it enabled us not only to resolve the underlying issue related to the clocks but also to stress test the clients under conditions of extensive node failure and network strain. Nevertheless, this failure need not have been so severe, and the root cause in this instance was Prysm’s predominance.
Promoting Decentralization – Part I, it’s beneficial for eth2
As I have mentioned before, 1/3 is the critical threshold when it comes to secure, asynchronous BFT algorithms. If over 1/3 of validators go offline, epochs can no longer be finalized. Therefore, while the chain continues to grow, it is no longer possible to reference a block and ensure that it will persist as part of the canonical chain.
Promoting Decentralization – Part II, it’s advantageous for you
As far as possible, validators are incentivized to act in the network's interest rather than simply being trusted to do the right thing.
If more than 1/3 of nodes are offline, penalties for those offline nodes begin to escalate. This is referred to as the inactivity penalty.
This implies that, as a validator, you should strive to ensure that if something is likely to take your node offline, it is unlikely to impact many other nodes concurrently.
The same principle applies to being slashed. Although there’s always a possibility that your validators are slashed due to a specification or software error/bug, the penalty for an isolated slashing is “only” 1 ETH.
Nevertheless, if many validators are slashed at the same time as you, the penalty can rise to as much as 32 ETH. The point at which this occurs is again the crucial 1/3 threshold. [A comprehensive explanation of why this is the case can be found here].
These incentives are called liveness anti-correlation and safety anti-correlation respectively, and are deliberate components of eth2’s architecture. Anti-correlation mechanisms encourage validators to make choices that benefit the network by tying individual penalties to how much each validator is affecting the network.
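To make safety anti-correlation concrete, here is a minimal sketch of how a slashing penalty could scale with the share of validators slashed at the same time. It is a simplified model built only from the numbers in this article (a 1 ETH floor, a 32 ETH ceiling, and the maximum reached at the 1/3 threshold). The function name, the constants, and the linear "3x the slashed share" rule are assumptions for illustration; the real specification works on effective balances in Gwei and its constants may differ.

```python
MAX_EFFECTIVE_BALANCE_ETH = 32.0  # full stake of one validator
MIN_SLASHING_PENALTY_ETH = 1.0    # the "only" 1 ETH for an isolated slashing


def slashing_penalty_eth(slashed_fraction):
    """Simplified penalty (in ETH) for one slashed validator.

    slashed_fraction: share of all validators slashed in the same period.
    Assumption: the penalty grows linearly with 3x that share, so it
    reaches the full 32 ETH at the 1/3 threshold and never falls below
    the 1 ETH minimum.
    """
    if not 0.0 <= slashed_fraction <= 1.0:
        raise ValueError("slashed_fraction must be between 0 and 1")
    correlated = MAX_EFFECTIVE_BALANCE_ETH * min(3.0 * slashed_fraction, 1.0)
    return max(MIN_SLASHING_PENALTY_ETH, correlated)


print(slashing_penalty_eth(0.0))   # isolated slashing -> 1.0
print(slashing_penalty_eth(0.25))  # a quarter of validators slashed -> 24.0
print(slashing_penalty_eth(0.5))   # past the 1/3 threshold -> 32.0
```

Under this model, a validator slashed alone loses the minimum, while one slashed alongside a large share of the network loses most or all of its stake, which is exactly why running a minority client limits your downside.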
Promoting Decentralization – Part III, the statistics
Eth2 is being developed by numerous independent teams, each crafting separate clients in accordance with the specifications primarily outlined by the eth2 research team. This guarantees that there are multiple beacon node & validator client implementations, each making distinct decisions regarding the technology, languages, optimizations, trade-offs, etc., necessary to construct an eth2 client. In this manner, a flaw in any layer of the system will only affect those operating a specific client and not the entire network.
If, in the case of the Prysm Medalla time-bug, only 20% of eth2 nodes were utilizing Prysm and 85% of individuals were online, then the inactivity penalty would not have been triggered for Prysm nodes, and the issue could have been resolved with merely minor penalties and some restless nights for the developers.
Instead, because so many participants were running the same client (many of whom had switched off slashing protection), between 3500 and 5000 validators were slashed in a short period of time.* The high degree of correlation meant that slashing penalties were approximately 16 ETH for these validators, because they were running a widely used client.
* At the time of writing, slashing actions are still being reported, so a final figure remains unavailable.
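The arithmetic behind the two scenarios above can be sketched in a few lines. This is an illustration under stated assumptions, not a model of the real network: it treats a client's share of nodes as its share of validators, assumes nodes that were already offline are spread evenly across clients, and, for the actual Medalla case, assumes every non-Prysm node stayed online (a best case). The function names are hypothetical.

```python
from fractions import Fraction

FINALITY_THRESHOLD = Fraction(2, 3)  # finality needs MORE than 2/3 on one chain


def online_after_client_failure(client_share, online_before):
    """Share of the network still online once every node of one client fails.

    Assumption: nodes that were already offline are spread evenly across
    clients, so the failing client removes client_share * online_before.
    """
    return online_before * (1 - client_share)


def can_finalize(online_share):
    """True if strictly more than 2/3 of the network is online."""
    return online_share > FINALITY_THRESHOLD


# Hypothetical scenario from the text: Prysm at 20%, 85% of nodes online.
hypothetical = online_after_client_failure(Fraction(20, 100), Fraction(85, 100))

# What happened on Medalla: Prysm at ~62%, everyone else assumed online.
actual = online_after_client_failure(Fraction(62, 100), Fraction(1))

print(float(hypothetical), can_finalize(hypothetical))  # 0.68 True
print(float(actual), can_finalize(actual))              # 0.38 False
```

With Prysm at 20%, about 68% of the network stays online, just above the 2/3 needed to keep finalizing. At ~62%, at most 38% remains, far short of it.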
Experiment with something new
Now is the ideal moment to experiment with various clients. Identify a client that a smaller fraction of validators are employing (you can check the distribution here). Lighthouse, Teku, Nimbus, and Prysm are all fairly stable at the moment, while Lodestar is rapidly catching up.
Most crucially, TRY A NEW CLIENT! We have a chance to cultivate a healthier distribution on Medalla in anticipation of a decentralized mainnet.