*This is installment #1 of a collection where anyone can pose inquiries about Geth and I’ll strive to respond to the most popular one each week with a brief writeup. This week’s top voted question was: Would you mind explaining how the flat database structure differs from the legacy structure?*
State in Ethereum
Before delving into an optimization structure, let’s review what Ethereum refers to as state and how it is currently maintained at its various layers of abstraction.
Ethereum holds two distinct types of state: the collection of accounts; and a series of storage slots for each contract account. From a purely conceptual viewpoint, both are straightforward key/value mappings. The collection of accounts maps addresses to their nonce, balance, etc. A storage section of a single contract maps arbitrary keys – defined and utilized by the contract – to arbitrary values.
Regrettably, while storing these key-value pairs as flat data would be quite efficient, validating their accuracy becomes computationally unmanageable. Each time a change is made, we’d need to hash all that data from the beginning.
Rather than hashing the entire dataset repeatedly, we could partition it into small contiguous segments and construct a tree on top! The original relevant data would reside in the leaves, and each internal node would represent a hash of all data beneath it. This would permit us to only recalculate a logarithmic number of hashes when modifications occur. This data structure actually has a designation, it is the renowned Merkle tree.
Unfortunately, we still experience a limitation in computational complexity. The aforementioned Merkle tree configuration is very effective for managing changes to existing data, but insertions and deletions shift the segment boundaries and invalidate all calculated hashes.
Instead of indiscriminately chunking the dataset, we could utilize the keys themselves to arrange the data into a tree structure based on shared prefixes! This method ensures that an insertion or deletion wouldn’t shift all nodes, but would only alter the logarithmic path from root to leaf. This data structure is referred to as a Patricia tree.
By merging the two concepts – the tree structure of a Patricia tree and the hashing mechanism of a Merkle tree – you arrive at a Merkle Patricia tree, the actual data structure utilized to represent state in Ethereum. It guarantees logarithmic complexity for modifications, insertions, deletions, and verifications! An additional minor detail is that keys are hashed prior to insertion to balance the tries.
State storage in Ethereum
The preceding explanation clarifies why Ethereum utilizes a Merkle Patricia tree for its state. However, as rapid as the desired operations are, every option entails a trade-off. The cost of logarithmic updates and logarithmic verification corresponds to logarithmic reads and logarithmic storage for every single key. This arises because every internal trie node must be individually stored on disk.
I don’t possess an exact figure for the depth of the account trie at this moment, but approximately a year ago, we were reaching the depth of 7. This suggests that every trie operation (e.g., reading balance, writing nonce) interacts with at least 7-8 internal nodes, thereby necessitating at least 7-8 persistent database access attempts. LevelDB also organizes its data across a maximum of 7 levels, resulting in an additional multiplier from that. The overall consequence is that a single state access is predicted to escalate into 25-50 random disk accesses. Multiply this by all the state read and write operations that every transaction in a block entails and you arrive at a daunting figure.
[Of course, all client implementations strive to minimize this burden. Geth employs large memory spaces for caching trie nodes; and also uses in-memory pruning to avoid writing to disk nodes that are deleted anyway after several blocks. That topic belongs to a different blog post, however.]
Despite how alarming these figures may be, these are the costs associated with operating an Ethereum node and maintaining the ability to cryptographically validate all state at all times. But can we improve?
Not all accesses are created equal
Ethereum relies on cryptographic proofs to establish its state. There is no way to circumvent the disk amplifications if we wish to maintain our capacity to verify all the data. That being said, we can – and do – trust the data we’ve previously verified.
There is no value in validating and re-verifying each state item, every time we retrieve it from disk. The Merkle Patricia tree is crucial for writes, but it presents an overhead for reads. We cannot eliminate it, nor can we slim it down; but that doesn’t imply we must necessarily employ it everywhere.
An Ethereum node accesses state through several different channels:
- When importing a new block, EVM code execution typically involves a roughly equal number of state reads and writes. A denial-of-service block might, however, produce significantly more reads than writes.
- When a node operator retrieves state (e.g., eth_call and family), EVM code execution primarily involves reads (it can write too, but those changes are discarded at the end and not persisted).
- During synchronization, a node requests state from remote nodes that need to retrieve and service it over the network.
Considering the above access patterns, if we can bypass reads from hitting the state trie, a multitude of node operations will become considerably faster. It might evenenable some innovative access patterns (such as state iteration) which were previously prohibitively costly.
Naturally, there’s always a compromise. Unless we eliminate the trie, any newly introduced acceleration structure incurs additional overhead. The core question is if the surplus overhead brings sufficient value to justify it?
Returning to the fundamentals
We constructed this extraordinary Merkle Patricia tree to address all our challenges, and now we seek ways to bypass it for readings. Which acceleration framework should we employ to expedite reads once more? Well, if we can forgo the trie, we can eliminate all the intricacy it introduced. We can revert all the way to the beginning.
As noted at the outset of this discussion, the theoretical perfect data storage for Ethereum’s state is a straightforward key-value store (distinct for accounts and each contract). However, without the restrictions of the Merkle Patricia tree, there’s “nothing” preventing us from executing the ideal remedy!
Some time ago, Geth launched its snapshot acceleration structure (not activated by default). A snapshot provides a complete view of the Ethereum state at a specific block. In abstract implementation terms, it is a dump of all accounts and storage slots, represented by a flat key-value repository.
Whenever we want to access an account or storage slot, we only incur 1 LevelDB lookup instead of 7-8 as dictated by the trie. Updating the snapshot is also theoretically straightforward—after processing a block, we perform one additional LevelDB write for each updated slot.
The snapshot fundamentally lowers read times from O(log n) to O(1) (multiplied by LevelDB overhead) while increasing writes from O(log n) to O(1 + log n) (again multiplied by LevelDB overhead) and boosting disk storage from O(n log n) to O(n + n log n).
The devil lies in the details
Maintaining a functional snapshot of the Ethereum state comes with its intricacies. As long as blocks arrive sequentially, each one building upon the previous, the simplistic method of integrating changes into the snapshot suffices. However, if a minor reorganization occurs (even just a single block), we are at risk because there’s no option for reversal. Persistent writes are a one-way action for a flat data presentation. To complicate matters, accessing older state (e.g. 3 blocks back for some DApp or 64+ for fast/snap sync) becomes unfeasible.
To mitigate this restriction, Geth’s snapshot comprises two components: a persistent disk layer that is a complete snapshot of a previous block (e.g. HEAD-128); and a tree of in-memory difference layers that accumulate the writes on top.
Whenever a new block is processed, we do not directly merge the writes into the disk layer; instead, we simply create a new in-memory difference layer to capture the changes. If numerous in-memory difference layers stack up, the oldest ones begin merging and are ultimately pushed to disk. Whenever a state entry is to be accessed, we start from the topmost difference layer and continue backward until we locate it or reach the disk.
This data representation is highly effective since it resolves numerous issues. Since the in-memory difference layers are structured as a tree, reorgs no deeper than 128 blocks can easily select the difference layer associated with the parent block and proceed from there. DApps and remote synchronizers requiring older state have access to 128 recent states. The cost does rise by 128 map lookups, but 128 in-memory lookups are exponentially faster than 8 disk reads amplified 4x-5x by LevelDB.
Of course, many intricacies and caveats exist. Without delving too deeply, a quick enumeration of the essential points includes:
- Self-destructs (and deletions) pose unique challenges as they require a shortcut through the difference layer descent.
- Should a reorg extend beyond the persistent disk layer, the snapshot must be entirely discarded and regenerated, which is highly resource-intensive.
- Upon shutdown, in-memory difference layers must be saved into a journal and restored, or the snapshot risks becoming ineffective on restart.
- Utilize the lowest difference layer as an accumulator and flush to disk only when it surpasses a certain memory threshold. This enables deduplication of writes for identical slots across blocks.
- Implement a read cache for the disk layer so that contracts accessing the same historical slot repeatedly do not incur disk traffic.
- Employ cumulative bloom filters within the in-memory difference layers to swiftly determine if an item might exist in the diffs, or if we can head to disk directly.
- The keys are not the raw data (account address, storage key), but rather the hashes of these, ensuring that the snapshot preserves the same iteration sequence as the Merkle Patricia tree.
- Creating the persistent disk layer requires significantly more time compared to the pruning window for the state tries, hence the generator must dynamically follow the chain.
The good, the bad, the ugly
Geth’s snapshot acceleration structure reduces the complexity of state reads by approximately an order of magnitude. This leads to read-based DoS attacks becoming an order of magnitude more challenging to execute; and eth_call invocations becoming an order of magnitude quicker (barring CPU constraints).
The snapshot also facilitates incredibly rapid state iteration for the most recent blocks. This was indeed the primary motivation for developing snapshots, as it enabled the establishment of the new snap sync algorithm. Detailing this is an entirely separate blog entry, but the latest benchmarks on Rinkeby speak volumes:
Naturally, the compromises are always prevalent. After the initial synchronization is completed, it takes approximately 9-10 hours on the mainnet to generate the initial snapshot (it’s maintained live thereafter) and around 15+GB of additional disk space is required.
As for the undesirable aspects? Well, it took us over 6 months to feel sufficiently assured about the snapshot to launch it, yet even now it remains under the –snapshot flag, and there is still fine-tuning and refinement needed regarding memory consumption and recovery from crashes.
Overall, we take great pride in this enhancement. It entailed an enormous amount of effort and it was quite a leap of faith to implement everything while hoping it would come together successfully. Just as a fun fact, the initial version of snap sync (leaf sync) was developed 2.5 years ago and had been stalled ever since due to our lack of the necessary acceleration to saturate it.
Epilogue
I hope you found this inaugural post of Ask about Geth. It took me approximately twice the time to complete it than I had planned, but I considered the subject justified the extra effort. See you next week.
[PS: I intentionally didn’t link to the asking/voting site in this post as I believe it is a temporary arrangement and I do not wish to leave any broken links for future reference; nor do I want someone to acquire the name and host something harmful later on. You can find it among my Twitter updates.]