
Hive: Our Journey Towards a Pristine Fork

The DAO soft-fork attempt proved challenging. Not only did we misjudge its repercussions on the consensus protocol (i.e. the DoS vulnerability), but we also inadvertently introduced a data race into the rushed implementation, a disaster waiting to happen. It was far from optimal, and although mitigated at the last minute, with the hard-fork deadline rapidly approaching the outlook seemed grim, to say the least. We needed a fresh approach…

The first step toward that fresh approach was an idea borrowed from Google (courtesy of Nick Johnson): writing a thorough postmortem of the incident, aiming to evaluate the root causes of the problem, focusing exclusively on the technical aspects and on appropriate measures to prevent a recurrence.

Technical resolutions can scale and endure; attributing blame does not. ~ Nick

The postmortem surfaced one finding that is fascinating from the perspective of this blog post. The soft-fork code in [go-ethereum](https://github.com/ethereum/go-ethereum) looked robust from every angle: a) it was thoroughly covered by unit tests with a 3:1 test-to-code ratio; b) it was carefully reviewed by six foundation developers; and c) it was even manually live-tested on a private network… Yet a critical data race remained, one that could have caused severe network disruption.

It turned out that the defect could only manifest in a network of multiple nodes, with multiple miners and multiple blocks being minted concurrently; and even then, only with a small probability. Unit tests could not catch it, code reviewers could overlook it, and manual testing would most likely miss it too. Our conclusion was that development teams need more tools for running reproducible tests that cover the intricate interactions of multiple nodes in a concurrent networked environment. Without such a tool, manually checking the various edge cases is unwieldy; and without running those checks continuously as part of the development workflow, rare errors are all but impossible to discover in time.

Thus, hive was conceived…

What is hive?

Ethereum has grown to the point where testing implementations is a significant burden. Unit tests are well suited to checking individual implementation quirks, but validating that a client meets a certain baseline quality, or that clients can interoperate in a multi-client environment, is far from trivial.

Hive is designed as an easily extensible test harness to which anyone can add tests (be they simple validations or network simulations), in any programming language they prefer, and which hive can run against all clients simultaneously. The harness is meant for black-box testing: no client-specific internal details or state can be inspected; instead, the emphasis is on compliance with the official specifications and on behavior under varying circumstances.

Most critically, hive was crafted from the ground up to operate as part of any client’s CI workflow!

How does hive work?

Hive’s essence is [docker](https://www.docker.com/). Each client implementation is a docker image; every validation suite is a docker image; and each network simulation is a docker image. Hive itself is an all-encompassing docker image. This is a highly effective abstraction…

Since Ethereum clients are docker images within hive, client developers can construct the optimal environment for their clients to run in (in terms of dependencies, tooling, and configuration). Hive will spin up as many instances as needed, each running in its own Linux container.

Likewise, since the test suites that validate Ethereum clients are docker images, test authors can use whatever programming environment they are most proficient in. Hive ensures that a client is up and running before starting the tester, which can then check whether that client conforms to the desired behavior.

Finally, network simulations are once more defined by docker images, but unlike simple tests, simulators not only execute code against a functioning client but can also initiate and terminate clients as necessary. These clients operate within the same virtual network and can connect with one another freely (or as governed by the simulator container), forming an on-demand private Ethereum network.

How did hive assist the fork?

Hive is not a substitute for unit testing, nor for thorough reviews. All currently practiced methods remain essential to a clean implementation of any feature. What hive can offer is validation beyond what is feasible from an average developer's perspective: running extensive tests that may require complex execution environments, and checking networking corner cases that could take hours to set up.

In the context of the DAO hard-fork, beyond all the consensus and unit tests, the most important thing to ensure was that nodes partitioned cleanly into two subsets at the networking level: one supporting the fork and one opposing it. This was essential because it is impossible to predict what adverse effects running two competing chains within a single network might have, especially from the minority's perspective.

Consequently, we implemented three specific network simulations in hive:

  • The first, to verify that miners creating the full Ethash DAGs generate the correct block extra-data fields for both pro-forkers and no-forkers, even in the face of attempted spoofing.

  • The second to confirm that a network composed of mixed pro-fork and no-fork nodes/miners accurately splits into two upon the arrival of the fork block, also maintaining the separation thereafter.

  • The third, to verify that, given an already forked network, newly joining nodes can full sync, fast sync, and light sync with the chain of their choice.

The intriguing question, of course, is: did hive actually catch any errors, or did it merely serve as an extra confirmation that everything was fine? The answer is: both. Hive caught three fork-unrelated bugs in Geth, but it also aided Geth's hard-fork development considerably by continuously providing feedback on how changes affected network behavior.

There was some criticism of the go-ethereum team for taking its time with the hard-fork implementation. Hopefully people will now see what we were actually working on while implementing the fork itself. All in all, I believe hive ended up playing quite an important role in the cleanliness of this transition.

What does the future hold for hive?

The Ethereum GitHub organization already hosts [4 test tools](https://github.com/ethereum?utf8=%E2%9C%93&query=test), with at least one EVM benchmarking tool being developed in an external repository. They are not being used to their full potential: they have many dependencies, generate a lot of clutter, and are quite complex to operate.

With hive, we intend to consolidate all the diverse scattered tests under one universal client validator that has minimal dependencies, can be enhanced by anyone, and can function as part of the daily CI workflow of client developers.

We invite everyone to contribute to the project, whether by adding new clients to validate, validators to test with, or simulators to uncover interesting networking issues. In the meantime, we will keep refining hive itself, adding support for benchmarks as well as mixed-client simulations.

With some effort, perhaps we’ll even achieve support for executing hive in the cloud, enabling it to perform network simulations at a much more compelling scale.



