I had the same question, wondering what sort of workflow would have this task in the critical path. Maybe if the Library of Congress needs to change their default text encoding it'll save a minute or two?
The benchmark result is cool, but I'm curious how well it works with smaller outputs. When I've played around with SIMD stuff in the past, I found you can't necessarily go off of metrics like "bytes generated per cycle", because of how much CPU frequency can vary when using SIMD instructions, context switching costs, and different thermal properties (e.g. maybe the work per cycle is higher with SIMD, but the CPU generates heat much more quickly and downclocks itself).
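One way to check the small-output case is to time wall-clock throughput at several input sizes rather than trusting a per-cycle figure, since wall-clock time absorbs frequency changes and call overhead. This is only a rough sketch: it uses Python's built-in codecs as a stand-in for the SIMD routine, and the names `transcode` and `throughput_by_size` are made up for illustration.

```python
import time

def transcode(utf8_bytes: bytes) -> bytes:
    # Stand-in for the routine under test: UTF-8 in, Latin-1 out.
    # (Assumption: swap in a binding to the real SIMD routine here.)
    return utf8_bytes.decode("utf-8").encode("latin-1")

def throughput_by_size(sizes, repeats=200):
    """Return {input size in characters: output bytes per second}.

    Time is measured across all repeats, so frequency scaling and
    per-call overhead show up in the result instead of being hidden.
    """
    results = {}
    for n in sizes:
        # 'é' takes 2 bytes in UTF-8 and 1 byte in Latin-1.
        payload = ("é" * n).encode("utf-8")
        produced = 0
        start = time.perf_counter()
        for _ in range(repeats):
            produced += len(transcode(payload))
        elapsed = time.perf_counter() - start
        # Guard against a zero reading from a coarse timer.
        results[n] = produced / elapsed if elapsed > 0 else float("inf")
    return results

if __name__ == "__main__":
    for n, rate in throughput_by_size([16, 256, 4096, 65536]).items():
        print(f"{n:>6} chars: {rate / 1e6:.1f} MB/s")
```

If the small sizes come out far slower per byte than the large ones, fixed per-call cost is dominating and the headline number won't carry over to short strings.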
But where do you find it? Almost the entire internet is UTF-8. You can always transcode to Latin-1 for testing purposes, but that raises the question of what practical benefit this algorithm has.
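For the testing case, building a Latin-1 corpus from existing UTF-8 text takes a few lines. A minimal sketch, assuming that dropping anything above U+00FF is acceptable for test data (the helper name `to_latin1_corpus` is hypothetical):

```python
def to_latin1_corpus(utf8_bytes: bytes) -> bytes:
    """Turn UTF-8 text into Latin-1 test data.

    Code points above U+00FF have no Latin-1 representation, so they
    are dropped here (an assumption that is fine for benchmarking
    input but not for real data).
    """
    text = utf8_bytes.decode("utf-8")
    kept = "".join(ch for ch in text if ord(ch) <= 0xFF)
    return kept.encode("latin-1")

if __name__ == "__main__":
    sample = "café €5".encode("utf-8")
    print(to_latin1_corpus(sample))
```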
It's not necessarily about sustained throughput spent only in this routine. It can be small bursts of processing text segments that are then handed off to other parts of the program.
Once a program is optimized to the point where no leaf method or hot loop takes up more than a few percent of runtime, and algorithmic improvements aren't available or are extremely hard to implement, the speed of all the basic routines (memcpy, allocations, string processing, data structures) starts to matter. The constant factors elided by Big-O notation start to matter.