I've been using the .NET port of Disruptor with good results. Once you understand the underlying pattern, you can apply it almost anywhere without pulling in a dependency.
You would be astonished at what will actually fit on a single x86 thread in 2023. Instruction-level parallelism can give you unbelievable throughput, assuming your batches are reasonably sized and everything fits neatly into the various caches.
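
For anyone wondering what "the underlying pattern" looks like without the library, here is a rough sketch in Java (the language of the original LMAX Disruptor). It is my own simplification and not the Disruptor API: the `SpscRing` class and its `tryPublish`, `drain` and `sumThroughRing` methods are hypothetical names, and it assumes exactly one producer thread and one consumer thread. The idea it illustrates is a preallocated power-of-two ring, two monotonically increasing sequence counters, and a consumer that processes everything published so far as one batch.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Hypothetical single-producer/single-consumer ring; a sketch, not the Disruptor API.
final class SpscRing {
    private final long[] slots;   // preallocated once, so no per-event allocation
    private final int mask;       // size is a power of two, so index = sequence & mask
    private final AtomicLong published = new AtomicLong(-1); // last sequence written
    private final AtomicLong consumed = new AtomicLong(-1);  // last sequence read

    SpscRing(int sizePowerOfTwo) {
        if (Integer.bitCount(sizePowerOfTwo) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        slots = new long[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    // Producer thread only. Returns false when the ring is full.
    boolean tryPublish(long value) {
        long next = published.get() + 1;
        if (next - consumed.get() > slots.length) {
            return false; // writing would overwrite a slot the consumer has not read
        }
        slots[(int) (next & mask)] = value;
        published.set(next); // volatile write makes the slot visible to the consumer
        return true;
    }

    // Consumer thread only. Handles everything published so far as one batch
    // and returns how many events were handled.
    int drain(LongConsumer handler) {
        long from = consumed.get() + 1;
        long to = published.get();
        for (long seq = from; seq <= to; seq++) {
            handler.accept(slots[(int) (seq & mask)]);
        }
        if (to >= from) {
            consumed.set(to); // frees the whole batch of slots in one write
        }
        return (int) (to - from + 1);
    }

    // Pushes 1..count through the ring from a second thread and sums on this one.
    static long sumThroughRing(int count, int ringSize) {
        SpscRing ring = new SpscRing(ringSize);
        Thread producer = new Thread(() -> {
            for (long v = 1; v <= count; v++) {
                while (!ring.tryPublish(v)) {
                    Thread.onSpinWait(); // busy-spin until the consumer frees a slot
                }
            }
        });
        producer.start();
        long[] sum = {0};
        int seen = 0;
        while (seen < count) {
            int n = ring.drain(v -> sum[0] += v);
            if (n == 0) {
                Thread.onSpinWait();
            }
            seen += n;
        }
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum[0];
    }

    public static void main(String[] args) {
        System.out.println(sumThroughRing(1_000_000, 1024));
    }
}
```

The real library layers more on top of this (cache-line padding on the sequences, pluggable wait strategies, multi-producer claiming), so treat the above as the skeleton only. The batching in `drain` is where the consumer gets a tight loop over contiguous slots, which is the part that tends to play well with the caches.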