If you care more about speed than compression ratio, I’d recommend lz4 [1]. I only recently started using it. Its speed is almost comparable to memcpy.
Yes, zstd always codes its matches at the bit level, whereas lz4 is byte-aligned with inline literals.
So lz4 has some baseline speed advantages that zstd is unlikely to match. But as you point out, that only matters for very high-speed operations.
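To make the "byte-aligned with inline literals" point concrete, here is a rough sketch of a decoder for the LZ4 block format as I understand it from the format description: every field is a whole byte (a token, optional length-extension bytes, the raw literals, a 2-byte little-endian offset), so the decoder never has to shift bits across byte boundaries. The function names are my own, and there is no hardening against malformed input, so treat it as an illustration rather than something to use on untrusted data.

```python
def _read_length(src: bytes, i: int, nibble: int):
    """Extend a 4-bit length: a nibble of 15 means 'keep adding the
    following bytes until one of them is below 255'."""
    length = nibble
    if nibble == 15:
        while True:
            b = src[i]
            i += 1
            length += b
            if b != 255:
                break
    return length, i


def lz4_block_decode(src: bytes) -> bytes:
    """Decode one raw LZ4 block (no frame header, no checksums)."""
    out = bytearray()
    i = 0
    while i < len(src):
        token = src[i]
        i += 1

        # High nibble: number of literal bytes stored inline, uncompressed.
        lit_len, i = _read_length(src, i, token >> 4)
        out += src[i:i + lit_len]
        i += lit_len

        # The final sequence carries literals only, with no match part.
        if i >= len(src):
            break

        # Match offset: two bytes, little-endian, counted back from the end.
        offset = src[i] | (src[i + 1] << 8)
        i += 2
        if offset == 0 or offset > len(out):
            raise ValueError("invalid match offset")

        # Low nibble: match length minus the minimum match length of 4.
        match_len, i = _read_length(src, i, token & 0x0F)
        match_len += 4

        # Copy byte by byte so overlapping matches (offset < length) work.
        start = len(out) - offset
        for k in range(match_len):
            out.append(out[start + k])
    return bytes(out)


# Hand-built block: 3 literals "abc", a 9-byte match at offset 3,
# then a final literal-only sequence "!!".
example_block = bytes([0x35]) + b"abc" + bytes([0x03, 0x00, 0x20]) + b"!!"
```

zstd, by contrast, runs its literals and sequence codes through entropy coders, which is where the extra ratio comes from and also why it has more work to do per byte.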
lz4 is still quite a bit faster than zstd -1, at least according to their GitHub page. The trade-off is a worse compression ratio (about 2.1 for lz4 vs. 2.8 for zstd -1), but in some cases that's a good trade-off.
Depending on the data, some people find experimentally that zstd --fast beats LZ4 on compression, and some find the opposite; my usual advice to anyone considering one or the other is to experiment on their own data and find out.
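For anyone who wants to run that experiment, a minimal harness might look like the following. It is only a sketch: `measure` and the candidate names are mine, and the demo uses two zlib levels from the standard library as stand-ins so that it runs anywhere. For a real comparison you would pass in callables wrapping LZ4 and zstd (for example from the third-party lz4 and zstandard packages) and feed it a sample of your actual data.

```python
import time
import zlib
from typing import Callable, Dict, Tuple

Compressor = Callable[[bytes], bytes]


def measure(data: bytes,
            compressors: Dict[str, Compressor],
            repeats: int = 3) -> Dict[str, Tuple[float, float]]:
    """Return {name: (compression ratio, MB/s)} for each candidate.

    Speed is taken from the fastest of `repeats` runs to reduce noise.
    """
    results = {}
    for name, compress in compressors.items():
        best = float("inf")
        out = b""
        for _ in range(repeats):
            t0 = time.perf_counter()
            out = compress(data)
            best = min(best, time.perf_counter() - t0)
        ratio = len(data) / len(out)
        mb_per_s = (len(data) / 1e6) / best if best > 0 else float("inf")
        results[name] = (ratio, mb_per_s)
    return results


# Stand-in candidates; replace these with LZ4 and zstd bindings.
candidates = {
    "zlib-1 (stand-in)": lambda d: zlib.compress(d, 1),
    "zlib-6 (stand-in)": lambda d: zlib.compress(d, 6),
}

if __name__ == "__main__":
    sample = b"timestamp=1700000000 level=info msg=request served\n" * 2000
    for name, (ratio, speed) in measure(sample, candidates).items():
        print(f"{name}: ratio {ratio:.2f}, {speed:.0f} MB/s")
```

Single-shot timings on a small buffer are noisy, so use a sample large enough to be representative and compare on the machine you will actually run on.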
(An interesting anecdote about the differing notions of compressibility: I recently wrote something to avoid burning a great deal of CPU on incompressible data at higher zstd levels. I ended up using LZ4 -> zstd-1 as a two-tier filter to catch incompressible data, because what each of them considered incompressible differed enough that using only LZ4 sometimes lost a significant amount of compression, while using only zstd-1 was comparatively expensive and also lost a significant fraction.)
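One plausible reading of that two-tier filter, sketched in code: try the cheap probe first, fall back to the costlier probe only if the cheap one gives up, and skip the expensive compression only when both agree the data is incompressible. This is my guess at the shape, not the commenter's actual implementation. The names and the 12.5% savings threshold are assumptions, and zlib levels stand in for LZ4 and zstd-1 so the example runs with only the standard library.

```python
import os
import zlib
from typing import Callable

Compressor = Callable[[bytes], bytes]


def saves_enough(data: bytes, compress: Compressor, min_saving: float) -> bool:
    """True if compress() shrinks data by at least the fraction min_saving."""
    if not data:
        return False
    return len(compress(data)) <= len(data) * (1.0 - min_saving)


def worth_compressing(data: bytes,
                      cheap: Compressor,
                      costlier: Compressor,
                      min_saving: float = 0.125) -> bool:
    """Two-tier check before spending CPU on a slow, high compression level.

    min_saving is a hypothetical threshold, not a value from the thread.
    """
    # Tier 1: the cheap probe (LZ4 in the anecdote). Most compressible
    # data is accepted here at very little cost.
    if saves_enough(data, cheap, min_saving):
        return True
    # Tier 2: the costlier probe (zstd-1 in the anecdote), run only on
    # what tier 1 rejected, to rescue data the cheap probe misjudged.
    return saves_enough(data, costlier, min_saving)


# Stand-ins for the real probes; swap in LZ4 and zstd-1 bindings.
def cheap_probe(data: bytes) -> bytes:
    return zlib.compress(data, 1)


def costlier_probe(data: bytes) -> bytes:
    return zlib.compress(data, 6)


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog\n" * 200
    noise = os.urandom(8192)
    print(worth_compressing(text, cheap_probe, costlier_probe))
    print(worth_compressing(noise, cheap_probe, costlier_probe))
```

The ordering is the point: the second probe runs only on the minority of inputs the first one rejects, so its extra cost is paid rarely.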
Note that for decompression speeds, they write: "We know LZ4 is significantly faster than ZSTD on standalone benchmarks: likely bottleneck is ROOT IO API"