BLAKE3 slower than SHA-256 for small inputs

Summary

As of 2023-Dec, when comparing the fastest widely available BLAKE3 and SHA-256 implementations, BLAKE3 features worse throughput on x86_64 for message sizes up to ~4kB.

This is due to the fact that libraries for multi-message (batch) hashing exist for SHA-256, but not for BLAKE3. Multi-message hashing is possible on BLAKE3, but requires non-trivial scheduling of operations due to BLAKE3’s tree hash construction.

Theory

SHA-256 and BLAKE3 are hash functions used in the Solana protocol.
BLAKE3 is used for account hashes, SHA-256 for virtually everything else.

Hash functions are commonly compared in single-message benchmarks, which are single-core tests in which one message is hashed at a time. The BLAKE3 paper claims to feature better throughput than SHA-256 for all input sizes in this setting. I could reproduce this result when comparing the BLAKE3 C reference implementation against SHA-256.

Solana, however, uses both hash functions in a way that allows for a high degree of message-level SIMD and ILP parallelism, i.e. hashing multiple messages simultaneously on a single CPU core. In a multi-message benchmark, the hash implementation is given an asymptotically infinite number of messages that can be processed in parallel. We then measure the peak throughput of input bytes per second processed.

Multi-message hashing is the fastest approach for both hash functions for small message sizes (up to ~4kB).

Practice

Usually, the fastest approach is to store each hash state in one or more SIMD registers. The multi-block function executes SIMD instructions that applies each step of the computation to all in-flight hash states.

Ascending to the next layer, the multi-message engine then schedules inputs to the multi-block function. This scheduler can get more complicated than it seems, especially with variable-length messages. It involves initialization of new hash states, appending pieces of message inputs, finalizing output values, and so on.

The relatively simple Merkle–Damgård construction powering SHA-256 makes scheduling manageable. The block function is invoked exactly once per 64 bytes of input data, plus optionally one additional time during finalization.

Indeed, there exist open-source implementations of multi-message SHA-256: minio/sha256-simd in Go, and fd_sha256 in C. Both feature AVX2 and AVX512 backends, in the realm of ~10 Gbps (Zen 2, AVX2) and ~20 Gbps (Icelake Server, AVX512) per-core peak throughput respectively on recent x86_64 CPUs. Peak throughput is reached for message sizes (64*(n-1))+55.

The BLAKE3 hash function is implemented C, Rust, and Go. All three are highly optimized toward large message hashing. But currently, none support parallel hashing of small independent messages.

The following compares fd_sha256 vs BLAKE3-C throughput for multi-message hashing in AVX512 mode. The input size is (64*n)-9 for SHA-256 and 64*n for BLAKE3 (to account for SHA’s padding).

n SHA-256 (Gbps) BLAKE3 (Gbps) Delta
1 7.6 3.8 -50%
2 8.9 5.1 -43%
4 9.7 6.1 -37%
8 10.1 6.8 -33%
16 10.3 7.2 -30%
32 10.5 7.0 -33%
64 10.5 12.3 +17%
128 10.5 22.5 +114%

Plans

BLAKE3 is a core part of the Solana runtime and this is unlikely to change soon.

I will attempt to introduce multi-message hashing for BLAKE3 and will post findings in this thread. The scheduling required for multi-message mode is much more complicated in BLAKE3 due to its hash tree construction, so I don’t expect a linear speedup.

3 Likes

Python PoC of multi-message parallel scheduling: blake3-parallel-scheduler/batch.py at main · ripatel-fd/blake3-parallel-scheduler · GitHub

The work was worth it, it seems. First results show that fd_blake3 is 3.4x faster for the 1024 byte case.