Blog | LF Decentralized Trust

Optimizing Besu Performance: Parallelizing State Root Computation

In the ever-evolving blockchain landscape, optimizing execution performance is essential. State root computation is one of the primary bottlenecks in Ethereum execution clients, and it has become critical only recently as scaling efforts pushed gas limits higher. Besu tackles this bottleneck by introducing parallel state root computation within its Bonsai Trie storage implementation. This approach fully exploits multi-core architectures and enables nodes to keep pace with increasing block capacity. In this article, we explore the technical aspects and performance benefits of this parallelization strategy.

1. The Sequential Processing Bottleneck

Historically, Besu optimized state root computation by parallelizing updates only across different contract accounts. Ethereum state is maintained through two types of data structures: a single account trie that tracks all accounts, and multiple storage tries (one for each contract) that store contract slots. When a block modifies multiple contracts, Besu processes their storage tries concurrently rather than sequentially. However, the account state root computation itself remained sequential, only beginning once all contract storage tries were fully updated.

A major bottleneck persisted: both the account trie (the top-level tree containing all accounts) and individual storage tries were still processed sequentially from the root down. Even when a single trie contained thousands of independent updates, the standard approach forced them into a narrow, serial execution path, failing to leverage multi-core hardware for massive state transitions within a single block.

2. Exploiting the 16-Child Trie Geometry

The solution lies in the inherent structure of the Merkle Patricia Trie. In Besu, a branch node is an RLP-encoded list with 16 distinct child slots, one for each hexadecimal nibble (0 through F).

Because these 16 paths are structurally isolated, updates within one subtree (e.g., prefix 0x1...) do not affect the cryptographic integrity of another (e.g., prefix 0xA...) until they merge at a common ancestor. This isolation enables a "divide and conquer" strategy: the internal workload of a single trie is decomposed into 16 independent units of work that can be processed concurrently without race conditions.
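The structural isolation described above can be illustrated with a minimal sketch (the helper names and toy data are hypothetical, not Besu's actual classes): each subtree's hash depends only on its own contents, so recomputing one subtree leaves the others' hashes valid for reuse, and the results only meet at the common ancestor.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class SubtreeIsolation {

  /** Toy SHA-256 over concatenated byte arrays (a stand-in for trie node hashing). */
  static byte[] sha256(byte[]... parts) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      for (byte[] p : parts) {
        md.update(p);
      }
      return md.digest();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Toy subtree contents, keyed by their leading nibble.
    byte[] subtree1 = "account-under-0x1".getBytes();
    byte[] subtreeA = "account-under-0xA".getBytes();

    byte[] h1 = sha256(subtree1); // hash of the subtree under nibble 0x1
    byte[] hA = sha256(subtreeA); // hash of the subtree under nibble 0xA
    byte[] parentBefore = sha256(h1, hA);

    // Update only the 0xA subtree: h1 is unchanged and can be reused as-is,
    // so the two subtrees can be recomputed by independent workers.
    byte[] hAUpdated = sha256("updated-account-under-0xA".getBytes());
    byte[] parentAfter = sha256(h1, hAUpdated);

    System.out.println(Arrays.equals(h1, sha256(subtree1))); // true: 0x1 untouched
    System.out.println(Arrays.equals(parentBefore, parentAfter)); // false: change surfaces at the ancestor
  }
}
```

In the real trie the parent combines 16 such child hashes rather than two, but the reuse property is the same: only subtrees touched by the block need recomputation.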


3. Distributed Batch Processing Architecture

There are three key steps to optimizing the state root computation.

  1. State Update Batching groups pending changes together before applying them, making the process more efficient when branches split.

  2. Adaptive Parallelization spreads the work across multiple threads, taking full advantage of multi-core systems.

  3. Finally, Persistence writes all modified data to disk, ensuring nothing is lost.


Let's now delve into the specifics of each of these steps:

Step 1: State Update Batching and Pre-Sorting

To improve performance and reduce lock contention, Besu reorganizes the work before processing the trie.

  1. Collects all state modifications from a block into one consolidated batch.
  2. Orders these changes by their key values (account and storage hashes).
  3. Groups the modifications based on their first hexadecimal digit (prefix).
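The three batching steps above can be sketched in a few lines (hypothetical helper, not Besu's actual code): consolidate the pending keys, sort them, and bucket them by leading hex digit so each bucket becomes a pre-sorted unit of work.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class UpdateBatching {

  /**
   * Consolidates pending update keys (step 1), orders them by key value
   * (step 2), and groups them by their first hexadecimal digit (step 3).
   */
  public static Map<Character, List<String>> batchAndGroup(List<String> pendingKeys) {
    return pendingKeys.stream()
        .sorted() // step 2: order by key value
        .collect(Collectors.groupingBy(
            k -> k.charAt(0), // step 3: group by leading nibble
            TreeMap::new,
            Collectors.toList()));
  }
}
```

Because each group arrives pre-sorted, a worker assigned to one nibble can walk its subtree in key order without re-sorting or contending with other workers.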

Step 2: Adaptive Parallelization Through Load Assessment

Creating threads indiscriminately for every branch node would overwhelm the system with memory and context switching overhead. Besu employs a load-based decision heuristic to determine when parallelization provides actual benefit:

Criterion 1: Multiple groups

The first test asks: Are updates arriving for multiple different destinations?

A branch node has 16 possible child positions (nibbles 0x0 through 0xF). Besu groups incoming updates by their destination nibble.

Example with a single destination:

  • Five updates all destined for child nibble 0xC
  • All updates must traverse the same path
  • Creating tasks adds pointless overhead
  • Decision: Process sequentially

Example with multiple destinations:

  • Three updates destined for child 0x2
  • Four updates destined for child 0x5
  • Two updates destined for child 0xA
  • Three different paths can be processed independently
  • Decision: Proceed to criterion 2

Criterion 2: Sufficient Work Volume

Even if updates are distributed across multiple destinations, Besu performs a second check: is there enough work in each group to justify creating a thread?

Thread creation and management involve overhead that can easily exceed the actual work being done. A single update processed in a new thread may be slower than handling it in the current thread.

Example with insufficient work:

  • Child 0x2 receives one update
  • Thread overhead exceeds the work cost
  • Decision: Process this update in the current thread

Example with sufficient work:

  • Child 0x5 receives 10 updates
  • Work volume justifies thread overhead
  • Decision: Create a task for this group


When both criteria are satisfied, the function dispatches subtree processing to the scheduler. This evaluation occurs recursively at each trie level, enabling adaptive parallelism based on data distribution: dense subtrees with many modifications spawn additional tasks, while sparse areas continue processing on the current thread to avoid unnecessary overhead.
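The two criteria can be condensed into a short decision sketch. The threshold value below is illustrative only; Besu's actual heuristic and class names differ.

```java
import java.util.List;
import java.util.Map;

public class ParallelizationHeuristic {

  // Assumed threshold: minimum updates in a group to justify a task (illustrative, not Besu's value).
  static final int MIN_UPDATES_PER_TASK = 4;

  /** Criterion 1: updates must target more than one child nibble. */
  static boolean hasMultipleDestinations(Map<Integer, List<String>> groups) {
    return groups.size() > 1;
  }

  /** Criterion 2: an individual group must carry enough work to amortize thread overhead. */
  static boolean justifiesTask(List<String> group) {
    return group.size() >= MIN_UPDATES_PER_TASK;
  }

  /** True when the group under this nibble should be dispatched to the scheduler. */
  static boolean shouldSpawnTask(Map<Integer, List<String>> groups, int nibble) {
    return hasMultipleDestinations(groups) && justifiesTask(groups.get(nibble));
  }
}
```

Applied recursively at each branch node, this check is what makes the parallelism adaptive: dense groups spawn tasks, sparse ones stay on the current thread.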


Handling Path Compression: Extension Nodes and Update Conflicts

Merkle Patricia Tries use extension nodes to compress sequential path segments, optimizing storage space. However, these nodes create serialization bottlenecks in Besu because updates require decompressing the shared path, applying changes iteratively, and rebuilding the extension node. We optimized this by parallelizing the update logic: instead of processing updates serially through the compressed path, we decompress the full path, apply all updates concurrently to the underlying nodes, and then reconstruct the extension node if needed.


To preserve parallelization opportunities, Besu implements a "puffing" mechanism (dynamic expansion):

  • Divergence Detection: When concurrent updates branch at different points within an Extension node's compressed path, we identify the divergence location.
  • Temporary Expansion: The Extension node is temporarily converted into a proper branch structure, creating separate child paths for conflicting updates.
  • Concurrent Processing: Each divergent path is processed independently within a different task. Those tasks can still be executed by the same thread.
  • Structure Recompression: After processing completes, a cleanup pass traverses the modified structure and re-compresses eligible paths back into Extension nodes.

This approach ensures that storage optimization doesn't compromise parallelization.
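The divergence-detection step of the puffing mechanism reduces to finding where an update key leaves an extension node's compressed path. A minimal sketch (hypothetical helper names, nibbles represented as hex characters):

```java
public class ExtensionDivergence {

  /** Length of the shared prefix between the extension node's path and an update key. */
  static int commonPrefixLength(String extensionPath, String updateKey) {
    int max = Math.min(extensionPath.length(), updateKey.length());
    int i = 0;
    while (i < max && extensionPath.charAt(i) == updateKey.charAt(i)) {
      i++;
    }
    return i;
  }

  /**
   * True when the update diverges inside the compressed path, i.e. the
   * extension node must be "puffed" into a branch at the divergence point
   * before the conflicting updates can proceed in parallel.
   */
  static boolean requiresExpansion(String extensionPath, String updateKey) {
    return commonPrefixLength(extensionPath, updateKey) < extensionPath.length();
  }
}
```

When `requiresExpansion` is true, the node is temporarily expanded so each divergent path gets its own task; the cleanup pass later re-compresses eligible paths.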


The same approach handles leaf nodes and empty slots. A leaf in the trie can only have one update path passing through it. When multiple concurrent updates target the same leaf or empty position, they all converge at that node, creating a bottleneck.

Why can a leaf have multiple modifications? Due to path compression, a leaf is not necessarily at maximum depth, so several keys can share its path prefix. When multiple keys share a common path prefix, the leaf must be converted into a branch node to accommodate them.

Example:
Initially:

  • 0x (branch)
      • 0x01 (leaf for key 0x0101...0)

When adding key 0x0102...0, both keys share the prefix 0x01, so the leaf must become a branch:

  • 0x (branch)
      • 0x01 (branch)
          • 0x0101 (leaf for key 0x0101...0)
          • 0x0102 (leaf for key 0x0102...0)

Enabling parallelization: To process concurrent changes at this level, we preemptively convert leaf nodes into branches before they become bottlenecks. This transformation maintains pipeline flow by ensuring that leaf-to-branch conversions don't block concurrent updates to the newly created child nodes.
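The conversion in the example above amounts to re-keying the colliding entries by the nibble at the divergence depth. A toy sketch (hypothetical helper, using a plain map in place of trie node objects):

```java
import java.util.HashMap;
import java.util.Map;

public class LeafToBranch {

  /**
   * Converts a leaf holding `existingKey` into a branch keyed by the nibble
   * at `depth`, inserting `newKey` alongside it. After the split, the two
   * children sit under different nibbles and can be updated concurrently.
   */
  static Map<Character, String> splitLeaf(String existingKey, String newKey, int depth) {
    Map<Character, String> branch = new HashMap<>();
    branch.put(existingKey.charAt(depth), existingKey); // e.g. '1' -> 0x0101...
    branch.put(newKey.charAt(depth), newKey);           // e.g. '2' -> 0x0102...
    return branch;
  }
}
```

Performing this split preemptively is what keeps the pipeline flowing: the newly created children immediately become independent units of work.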

Step 3: Persistence and Safe Finalization

The final pipeline stage ensures deterministic computation of the unique Root Hash through coordinated result aggregation and optimized persistence.

  • Future-based Synchronization: Besu spawns child tasks (Java CompletableFuture objects) for each parallel subtree operation. The parent branch hash calculation waits until all child tasks (covering nibbles 0x0 through 0xF) complete and return their results, following a divide-and-conquer pattern.

  • Batch-Optimized Persistence: To prevent I/O bottlenecks from undermining parallel processing gains, Besu employs a commit cache mechanism. Worker threads deposit their computed node updates into this shared buffer rather than writing directly to RocksDB. After Root Hash calculation completes, the system executes a single bulk write operation to RocksDB, minimizing disk contention and maximizing write throughput.
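The fan-out/fan-in shape of this final stage can be sketched with CompletableFuture directly. Everything below is a toy stand-in: the "hash" is just the nibble in hex, and a concurrent queue plays the role of the commit cache that would later be flushed to RocksDB in one bulk write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

public class RootAggregation {

  // Commit-cache stand-in: workers deposit node updates here instead of writing to disk.
  static final ConcurrentLinkedQueue<String> commitCache = new ConcurrentLinkedQueue<>();

  /** Toy subtree computation: buffers a node update and returns a stand-in hash. */
  static String hashSubtree(int nibble) {
    commitCache.add("node-update-" + nibble); // buffered for the later bulk write
    return Integer.toHexString(nibble);
  }

  /** Parent waits on one future per child nibble (0x0-0xF), then combines the results. */
  static String computeRoot() {
    List<CompletableFuture<String>> children = new ArrayList<>();
    for (int nibble = 0; nibble < 16; nibble++) {
      final int n = nibble;
      children.add(CompletableFuture.supplyAsync(() -> hashSubtree(n)));
    }
    // Fan-in: block until every subtree result is available.
    CompletableFuture.allOf(children.toArray(new CompletableFuture[0])).join();
    StringBuilder combined = new StringBuilder();
    for (CompletableFuture<String> c : children) {
      combined.append(c.join());
    }
    return combined.toString(); // stand-in for the parent branch hash
  }
}
```

Note that the combined result is deterministic regardless of which thread finishes first, because the children are joined in nibble order; this is what makes the final Root Hash reproducible despite parallel execution.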

4. Performance Impact

Based on empirical benchmarking, distributed intra-trie processing delivers measurable improvements in block processing time. On 8-core / 8-thread instances with non-NVMe storage, we observed up to a 40% reduction in block processing time on Ethereum mainnet blocks.

A significant portion of this improvement comes from better parallelization of storage slot reads, allowing I/O-related work to be distributed more effectively across cores. As a result, the performance gains are more pronounced on nodes backed by slower storage. On NVMe-backed nodes, where storage latency is already low, the relative impact is expected to be smaller.

The gains are less pronounced on smaller VMs (e.g., 4 cores / 8 threads), indicating that the optimization scales more effectively with higher physical core counts.

Our tests indicate that hyperthreading negatively impacts performance, as threads competing on the same core introduce resource contention and reduce effective throughput. 

The structural properties of the Merkle Patricia Trie's 16-way branching architecture mean that workload distribution should scale proportionally with the number of independent sub-tree operations, though actual performance gains will depend on real-world factors such as data distribution patterns, memory bandwidth, and I/O characteristics.


Activation:
This feature is enabled by default for the Bonsai storage format from the Besu 26.1.0 release onwards. If you have any questions or feedback, reach out to us on Discord.

Conclusion

The implementation of distributed intra-trie processing in Besu demonstrates significant performance improvements, particularly in environments with higher core counts and slower storage. By effectively parallelizing storage slot reads, Besu achieves up to a 40% reduction in block processing time on Ethereum mainnet blocks. While the benefits are most pronounced on systems with multiple cores, especially those with non-NVMe storage, the optimization also highlights the nuanced effects of hyperthreading and the importance of hardware characteristics. These findings underline the potential for substantial efficiency gains in Ethereum nodes, paving the way for more responsive and scalable network participation.

AI Disclosure: This post used artificial intelligence tools for research, structural assistance, or grammatical refinement. The final content was reviewed, edited, and validated by human contributors to LF Decentralized Trust to ensure accuracy and alignment with our community standards. We remain committed to transparency in the use of generative technologies within the open source ecosystem.