Geth v1.13 arrives swiftly on the heels of the 1.12 release series, which is interesting, given that its main feature has been under construction for a remarkable 6 years now. 🤯
This article will delve into numerous technical and historical aspects, but if you prefer just the summary, Geth v1.13.0 introduces a novel database architecture for storing the Ethereum state, which is not only quicker than the former method but also incorporates effective pruning. No more clutter building up on disk and no more guerrilla (offline) pruning!
- ¹Excluding approximately 589GB of historical data, consistent across all configurations.
- ²Full sync using hash scheme surpassed our 1.8TB SSD at block approximately 15.43M.
- ³Variance in size compared to snap sync attributable to compaction overhead.
Before proceeding, a recognition must be given to Gary Rong, who has dedicated nearly 2 years to the core of this revamp! Extraordinary effort and remarkable perseverance to implement this substantial task!
Intricate technical details
So, what’s the story with this new data model, and what necessitated its creation?
In brief, our previous method of storing the Ethereum state did not permit efficient pruning. We employed various hacks and strategies to slow the accumulation of junk in the database, yet we still continued to gather it indefinitely. Users had the option to halt their node and prune it offline, or resynchronize the state to eliminate the junk. However, this was far from an ideal approach.
To deliver genuine pruning, one that leaves no debris behind, we had to break numerous constraints within Geth’s codebase. In terms of effort, we’d liken it to the Merge, albeit confined to Geth’s internal level:
- Storing state trie nodes keyed by hash introduces implicit deduplication (i.e., if two branches of the trie contain identical content, which is more likely for contract storage, they are stored once). This implicit deduplication means we can never know how many parents (i.e., different trie paths, distinct contracts) point to a given node; therefore, we cannot determine what is safe or unsafe to remove from disk.
- Any form of deduplication across different paths in the trie had to be removed before pruning could be applied. Our new data model stores state trie nodes keyed by their path rather than their hash. This seemingly minor change means that where two branches previously shared the same hash and were stored once, they now have different paths leading to them, so even though their content is identical, they are stored separately, twice.
- Storing multiple state tries within the database introduces a distinct form of deduplication. For our prior data model, where we keyed trie nodes by hash, the vast majority of trie nodes remain constant across successive blocks. This leads to the same problem, where we have no understanding of how many blocks reference the same state, hindering a pruner from effectively operating. Adjusting the data model to be path-based makes storing multiple tries entirely unfeasible: the same path-key (e.g., an empty path for the root node) will need to store different data for each block.
- The second constraint we had to break was the ability to store an arbitrary number of states on disk. The only viable method for effective pruning, as well as the sole way to represent trie nodes keyed by path, was to limit the database to contain precisely 1 state trie at any given moment. This trie initially is the genesis state, which must then follow the chain state as the head progresses.
- The most straightforward solution for maintaining 1 state trie on disk is to make it that of the head block. Regrettably, this is overly simplistic and causes two complications. First, modifying the trie on disk block by block involves a significant number of writes. This may not be noticeable while the node is in sync and following the head, but importing many blocks (e.g., full sync or catchup) becomes unwieldy. Second, prior to finality, the chain head may waver a bit due to minor reorgs. These are not frequent, but since they can occur, Geth needs to handle them gracefully. Having the persistent state tied to the head complicates switching to a different side-chain.
- The remedy is akin to the functionality of Geth’s snapshots. The persistent state does not follow the chain head; instead, it lags behind by a number of blocks. Geth will consistently maintain the trie modifications made in the last 128 blocks in memory. If multiple competing branches exist, all will be tracked in memory in a tree-like structure. As the chain progresses, the oldest (HEAD-128) diff layer gets flattened. This allows Geth to perform exceptionally quick reorgs within the top 128 blocks, making side-chain switches effectively seamless.
- However, the diff layers do not remove the need for the persistent state to advance with each block (the writes would merely be delayed). To avoid block-by-block disk writes, Geth also keeps a dirty cache between the persistent state and the diff layers, which accumulates writes. Since consecutive blocks frequently modify the same storage slots, and the top of the trie is overwritten continuously, the dirty buffer short-circuits these writes so that they never reach the disk. When the buffer reaches capacity, however, everything is flushed to disk.
- With diff layers in place, Geth can perform 128-block-deep reorgs instantaneously. Occasionally, however, a deeper reorg may be required. Perhaps the beacon chain is not finalizing, or perhaps there was a consensus error in Geth and an upgrade needs to “undo” a larger portion of the chain. Previously, Geth could simply revert to an old state it had on disk and reprocess blocks on top. With the new model of only ever having 1 state on disk, there is nothing to revert to.
- Our approach for this situation is the introduction of a concept known as reverse diffs. Each time a new block is imported, a diff is generated which can be used to revert the post-state of the block back to its pre-state. The last 90K of these reverse diffs are stored on disk. Whenever a very deep reorg is needed, Geth can take the persistent state on disk and apply reverse diffs to it until the state is reverted to a much earlier version. At that point, it can switch to a different side-chain and process blocks on top of that.
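The layering described in the list above can be sketched in a few lines of Python. This is a toy model under loud assumptions, not Geth's implementation: the class and method names (`LayeredState`, `shallow_reorg`, `deep_revert`) are hypothetical, state is a flat path-to-value dictionary rather than a trie, only a single chain is tracked instead of a tree of competing branches, the dirty write buffer is omitted, and reverse diffs are recorded when a layer is flattened to disk.

```python
from collections import deque


class LayeredState:
    """Toy model: one persistent path-keyed state plus in-memory diff layers."""

    def __init__(self, genesis, max_layers=128, max_history=90_000):
        self.disk = dict(genesis)   # the single persistent state: path -> value
        self.layers = deque()       # in-memory diffs, oldest first
        self.history = deque(maxlen=max_history)  # reverse diffs, oldest first
        self.max_layers = max_layers

    def get(self, path):
        # The newest layer wins; fall through to the persistent state.
        for diff in reversed(self.layers):
            if path in diff:
                return diff[path]
        return self.disk.get(path)

    def apply_block(self, diff):
        """Add a block's changes (path -> new value, None meaning deleted)."""
        self.layers.append(dict(diff))
        if len(self.layers) > self.max_layers:
            self._flatten(self.layers.popleft())

    def _flatten(self, diff):
        # Remember the previous values first so this write can be undone.
        self.history.append({path: self.disk.get(path) for path in diff})
        self._write(diff)

    def _write(self, diff):
        for path, value in diff.items():
            if value is None:
                self.disk.pop(path, None)
            else:
                self.disk[path] = value

    def shallow_reorg(self, depth):
        """Drop the newest `depth` in-memory layers; the disk is untouched."""
        for _ in range(depth):
            self.layers.pop()

    def deep_revert(self, blocks):
        """Discard all in-memory layers, then undo `blocks` persisted blocks."""
        self.layers.clear()
        for _ in range(blocks):
            self._write(self.history.pop())


state = LayeredState({"a": 1}, max_layers=2)
state.apply_block({"a": 2})
state.apply_block({"b": 5})
state.apply_block({"a": 3})  # a third layer pushes the oldest diff to disk
```

With `max_layers=2`, the third block forces the first diff onto disk, so `state.disk` becomes `{"a": 2}` while `state.get("a")` still answers `3` from memory. A shallow reorg only pops in-memory layers, whereas `deep_revert(1)` restores the disk to `{"a": 1}` by replaying the stored reverse diff.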
What is summarized above outlines the modifications we made to Geth’s internals to introduce our new pruner. As you can see, numerous invariants have changed, to such an extent that Geth now operates in a completely different manner from previous versions. Simply switching an existing database from one model to the other is not possible.
We certainly acknowledge that we cannot simply “stop working” because Geth has introduced a new data model, hence Geth v1.13.0 has two operational modes (let’s discuss the burden of maintaining open-source software). Geth will continue to support the previous data model (and it will remain the default for the time being), so your node won’t exhibit any “strange” behavior merely because you updated Geth. You can even force Geth to stick with the old operational mode for the long term via --state.scheme=hash.
If you intend to transition to our new operational mode, however, you’ll have to resync the state (you may retain the ancient data if it’s of any value). This can be done manually or via geth removedb (when prompted, delete the state database, but retain the ancient database). Then, start Geth with --state.scheme=path. Currently, the path model is not the default, but if a previous database exists and no state scheme is explicitly specified on the command line, Geth will use whatever is present in the database. Our recommendation is to always specify --state.scheme=path to be on the safe side. If no severe issues surface in our path scheme implementation, Geth v1.14.x is likely to adopt it as the default format.
A few reminders to consider:
- If you are operating private Geth networks using geth init, you need to specify --state.scheme during the initialization step too; otherwise, you will end up with an old-style database.
- For operators of archive nodes, the new data structure will be compatible with archive nodes (and will deliver the same impressive database sizes as Erigon or Reth), though it requires additional development before it can be activated.
Furthermore, a cautionary note: Geth’s new path-based storage is deemed stable and ready for production, but has obviously not undergone extensive testing outside of the core team. Everyone is invited to utilize it, but if significant risks are present in the event of your node crashing or falling out of consensus, it may be prudent to wait and observe if others with a lower risk profile encounter any challenges.
Now let’s address some unexpected side effects…
Semi-instant shutdowns
Head state missing, repairing chain… 😱
…the startup log message we’ve all been dreading, knowing our node will be down for hours… is disappearing!!! But before we bid farewell to it, let’s swiftly review what it was, why it occurred, and why it is becoming obsolete.
Before Geth v1.13.0, the Merkle Patricia trie of the Ethereum state was stored on disk as a hash-to-node mapping. This meant that each node in the trie was hashed, and the value of the node (whether a leaf or an internal node) was inserted into a key-value store, keyed by the computed hash. This was both mathematically elegant and had a clever optimization that if different parts of the state shared the same subtrie, those would be deduplicated on disk. Clever… yet disastrous.
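To make the difference between the two layouts concrete, here is a minimal sketch that stores the same three nodes once keyed by hash and once keyed by path. The node encodings and paths are invented for illustration, and SHA-256 stands in for the Keccak-256 hash that Ethereum actually uses.

```python
import hashlib


def node_hash(blob: bytes) -> bytes:
    # Stand-in hash for illustration; Ethereum hashes trie nodes with Keccak-256.
    return hashlib.sha256(blob).digest()


# Three trie nodes as (path, encoded node); two paths hold identical content.
nodes = [
    (b"\x01\x02", b"storage-leaf:42"),
    (b"\x07\x03", b"storage-leaf:42"),
    (b"\x07\x04", b"storage-leaf:7"),
]

hash_db = {}  # old layout: hash -> node
path_db = {}  # new layout: path -> node
for path, blob in nodes:
    hash_db[node_hash(blob)] = blob  # identical content collapses into one entry
    path_db[path] = blob             # every path keeps its own copy

print(len(hash_db), len(path_db))  # prints "2 3"

# Pruning the node at path 0x0102 is safe in the path-keyed store,
# because nothing else can live under that key.
del path_db[b"\x01\x02"]

# In the hash-keyed store, the same node is shared with path 0x0703.
# Deleting hash_db[shared] would silently break that other path, and the
# store keeps no record of how many paths still point at the entry.
shared = node_hash(b"storage-leaf:42")
```

The hash-keyed store saves one entry here, but only the path-keyed store can tell which entries are safe to delete.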
When Ethereum was launched, only archive mode was available. Every state trie of every block was preserved on disk. Simple and graceful. Naturally, it soon became evident that the storage demands of keeping all historical state saved indefinitely were not feasible. Fast sync proved helpful. By periodically resyncing, one could establish a node with only the latest state stored and subsequently add only new tries. Still, the rate of growth necessitated more frequent resyncs than what is acceptable in production environments.
What we required was a method to prune historical state that was no longer pertinent for the operation of a full node. There were several proposals, even 3-5 implementations in Geth, but each incurred such extensive overhead that we rejected them.
Geth ultimately adopted a very intricate in-memory pruner with reference counting. Rather than writing new states to disk immediately, we held them in memory. As blocks advanced, we accumulated new trie nodes and deleted older ones that weren’t referenced by the last 128 blocks. When this memory area filled up, we gradually wrote the oldest, still-referenced nodes to disk. While far from ideal, this solution was a significant improvement: disk growth was drastically reduced, and the more memory allocated, the better the pruning performance.
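The reference-counting idea can be illustrated with a heavily simplified sketch. It assumes each block is given as the full set of node hashes it references, whereas a real trie counts parent-to-child references; the class name `RefCountedCache` is hypothetical, and the step that writes old, still-referenced nodes to disk when memory fills up is left out.

```python
from collections import Counter, deque


class RefCountedCache:
    """Toy in-memory node cache that forgets nodes unused by recent blocks."""

    def __init__(self, window=128):
        self.nodes = {}        # hash -> encoded node, held in memory only
        self.refs = Counter()  # hash -> number of recent blocks referencing it
        self.recent = deque()  # per-block lists of referenced hashes, oldest first
        self.window = window

    def add_block(self, block_nodes):
        """Register the nodes (hash -> encoded node) referenced by a new block."""
        for node_hash, blob in block_nodes.items():
            self.nodes[node_hash] = blob
            self.refs[node_hash] += 1
        self.recent.append(list(block_nodes))
        if len(self.recent) > self.window:
            self._release(self.recent.popleft())

    def _release(self, hashes):
        # A block fell out of the window: drop its references and delete
        # every node that no remaining block points to.
        for node_hash in hashes:
            self.refs[node_hash] -= 1
            if self.refs[node_hash] == 0:
                del self.refs[node_hash]
                del self.nodes[node_hash]


cache = RefCountedCache(window=2)
cache.add_block({"a": b"1", "b": b"2"})
cache.add_block({"b": b"2", "c": b"3"})
cache.add_block({"c": b"3"})  # the first block leaves the window
```

After the third block, node `a` is gone because only the expired block referenced it, while `b` survives thanks to the second block. Everything in this cache lives in RAM, which is exactly why a shutdown had to flush so much data.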
However, the in-memory pruner had a limitation: it only ever persisted very old, still-live nodes, retaining anything relatively recent in RAM. When a user wanted to shut down Geth, the recent tries (all kept in memory) had to be written to disk. But due to the data layout of the state (hash-to-node mapping), inserting hundreds of thousands of trie nodes into the database took a long time (random insertion order due to hash keying). If Geth was killed before the flush completed, by the user or by a service manager (systemd, docker, etc.), the state kept in memory would be lost.
On the subsequent start-up, Geth would recognize that the state associated with the most recent block had never been saved. The only solution was to begin rewinding the chain until a block was located with the entire state available. Since the pruner only ever released nodes to disk, this rewind would typically reverse everything until the last successful shutdown. Geth occasionally flushed an entire dirty trie to disk to mitigate this rewind, but that still entailed lengthy processing time after a crash.
We dug ourselves a very deep hole:
- The pruner required as much memory as possible to be efficient. However, the more memory available, the higher the likelihood of a timeout on shutdown, resulting in data loss and chain rewind. Allocating less memory resulted in more unnecessary data being written to disk.
- State was stored on disk keyed by hash, which implicitly deduplicated trie nodes. Yet, deduplication rendered it impossible to prune from disk, as it was prohibitively costly to ensure no references to a node remained across all tries.
- Reduplicating trie nodes could be accomplished by employing a different database structure. However, altering the database structure would have rendered fast sync inoperable, as the protocol was specifically designed to be accommodated by this data model.
- Fast sync could be substituted by another synchronization algorithm that does not depend on the hash mapping. However, abandoning fast sync in favor of a different algorithm necessitated all clients to implement it first; otherwise, the network would fragment.
- A new synchronization algorithm based on state snapshots, rather than tries, is very effective, but it requires someone to manage and provide the snapshots. Essentially, it is a second consensus-critical version of the state.
It took us a considerable amount of time to escape the above dilemma (yes, these were the outlined steps all along):
- 2018: The initial designs for Snap sync are created, alongside the essential supporting data frameworks.
- 2019: Geth commences the generation and upkeep of the snapshot acceleration structures.
- 2020: Geth tests snap sync and establishes the definitive protocol specifications.
- 2021: Geth releases snap sync and transitions to it from fast sync.
- 2022: Other clients adopt the implementation of snap sync.
- 2023: Geth shifts from hash to path keying.
  - Geth becomes unable to support the previous fast sync.
  - Geth reduplicates persisted trie nodes to enable disk pruning.
  - Geth abandons in-memory pruning in favor of proper persistent disk pruning.
A request for other clients at this moment is to implement serving snap sync, not merely consuming it. Presently, Geth stands as the sole network participant that upholds the snapshot acceleration structure utilized by all other clients for synchronization.
Where do we arrive after this extensive diversion? With the fundamental data representation of Geth upgraded from hash-keys to path-keys, we can finally say farewell to our cherished in-memory pruner, substituting it with a sleek new on-disk pruner that consistently keeps the state on disk updated/recent. Naturally, our new pruner also incorporates an in-memory element for improved optimization, but primarily functions on disk, ensuring its efficacy is 100%, irrespective of the memory available for its operation.
Thanks to the new disk data architecture and redeveloped pruning process, the memory-resident data is compact enough to be written to disk within seconds during shutdown. However, even in cases of a system crash or user/process-manager abrupt termination, Geth will only need to backtrack and reprocess a few hundred blocks to recover its previous state.
Bid adieu to lengthy startup durations; Geth v1.13.0 unveils a bold new frontier (with --state.scheme=path, keep that in mind).
Eliminate the --cache flag
No, we haven’t removed the --cache flag, but chances are you should!
Geth’s --cache flag has a somewhat murky past, evolving from a basic (and ineffective) option into a rather intricate beast whose behavior is quite challenging to explain and to benchmark accurately.
During the Frontier era, Geth had few parameters to tune for speed. The sole optimization available was a memory allowance for LevelDB to keep more of the frequently accessed data in RAM. Interestingly, allocating RAM to LevelDB and letting the OS cache disk pages in RAM amount to much the same thing. The one case where explicitly assigning memory to the database helps is when multiple OS processes shuffle around large amounts of data and thrash each other’s OS caches.
At that time, letting users allocate memory for the database seemed like a harmless shot in the dark that might slightly improve performance. It turned out to be an effective way to cause harm, as Go’s garbage collector strongly dislikes large chunks of idle memory: the GC runs once it has accumulated as much garbage as there was useful data left after the previous run (i.e., it effectively doubles the RAM requirement). Thus began the saga of Killed and OOM crashes…
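The doubling effect follows from how Go paces its collector. As a rough sketch, assuming the default GOGC setting of 100 and ignoring stacks and other GC roots, the next collection starts once the heap has grown by 100% over the live data left after the previous one. The helper name below is hypothetical.

```python
def heap_before_next_gc(live_gb: float, gogc: int = 100) -> float:
    """Approximate heap size (GB) at which Go's collector runs again.

    Rough model only: live heap plus GOGC percent of growth on top of it.
    """
    return live_gb * (1 + gogc / 100)


# 4 GB of live cache data can grow to roughly 8 GB before the GC kicks in.
print(heap_before_next_gc(4.0))  # prints 8.0
```

So a cache that pins 4GB of live data inside the Go heap can push the process toward 8GB before any garbage is reclaimed.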
Fast-forward five years, and the --cache flag has, for better or worse, evolved:
- Depending on whether you are on mainnet or a testnet, --cache defaults to 4GB or 512MB, respectively.
- 50% of the total cache is assigned to the database as a simple disk cache.
- 25% of the cache is designated for in-memory pruning, with 0% allocated for archive nodes.
- 10% of the cache is allocated for snapshot caching, while 20% is for archive nodes.
- 15% of the cache is assigned for trie node caching, with 30% for archive nodes.
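As a worked example of the percentages above, the following sketch splits a total cache allowance into the four buckets. The function and bucket names are hypothetical labels for this illustration, not Geth identifiers, and the integer rounding is an assumption.

```python
def split_cache(total_mb: int, archive: bool = False) -> dict:
    """Split a total cache allowance (MB) using the percentages listed above."""
    shares = {
        "database": 50,
        "pruning": 0 if archive else 25,
        "snapshot": 20 if archive else 10,
        "trie": 30 if archive else 15,
    }
    assert sum(shares.values()) == 100  # both profiles use the whole allowance
    return {name: total_mb * pct // 100 for name, pct in shares.items()}


print(split_cache(4096))
# {'database': 2048, 'pruning': 1024, 'snapshot': 409, 'trie': 614}
```

With the 4GB mainnet default, a full quarter of the allowance (1024MB) went to in-memory pruning, which is the share most users were really tuning when they raised the flag.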
The total size and each percentage can be adjusted individually via flags, but let’s face it, very few people understand how to do that or what the consequences will be. Most users raised --cache because it led to less junk accumulating over time (that 25% share), but this also introduced potential OOM issues.
In the past two years, we have focused on various modifications to mitigate the complexity:
- Geth’s default database has transitioned to Pebble, which utilizes caching layers outside the Go runtime.
- Geth’s snapshot and trie node cache have begun employing fastcache, which also allocates resources outside the Go runtime.
- The new path schema prunes state dynamically, hence the prior pruning allocation was reassigned to the trie cache.
The resultant impact of all these amendments is that utilizing Geth’s new path database model should yield 100% of the cache being allocated outside of Go’s GC arena. Consequently, users raising or lowering this figure should not experience any negative repercussions on GC operations or how much memory the remainder of Geth consumes.
That being said, the --cache flag no longer has any influence on pruning or database size, so users who previously adjusted it for these reasons may drop the flag. Users who simply set it higher because they had RAM available should also consider removing the flag and observing how Geth behaves without it. The OS will still use any free memory for disk caching, so leaving the flag unset (i.e., at a lower value) may lead to a more robust system.
Conclusion
As with all our earlier releases, you can find the: