Hacker News
by justin101 1036 days ago
Where does one even go about finding 12 GB of pure Latin text?
4 comments

I had the same question, wondering what sort of workflow would have this task in the critical path. Maybe if the Library of Congress needs to change their default text encoding it'll save a minute or two?

The benchmark result is cool, but I'm curious how well it works with smaller inputs. When I've played around with SIMD stuff in the past, I found you can't necessarily go off of metrics like "bytes generated per cycle", because of how much CPU frequency can vary when using SIMD instructions, context-switching costs, and different thermal properties (e.g. maybe the work per cycle is higher with SIMD, but the CPU generates heat much more quickly and downclocks itself).
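To make the point about small inputs concrete, here is a rough sketch of measuring wall-clock throughput (rather than bytes per cycle) across input sizes. It uses Python's built-in codecs as a stand-in, not the SIMD routine from the article; `transcode_throughput` and the sample string are made up for illustration, and the numbers will vary a lot by machine.

```python
import time

def transcode_throughput(utf8_bytes: bytes, repeats: int = 200) -> float:
    """Return wall-clock throughput in input bytes per second.

    Hypothetical helper: transcodes UTF-8 to Latin-1 with Python's
    built-in codecs, which is NOT the SIMD algorithm being discussed.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        utf8_bytes.decode("utf-8").encode("latin-1")
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return (len(utf8_bytes) * repeats) / max(elapsed, 1e-9)

sample = "Où est la bibliothèque? ".encode("utf-8")
for copies in (1, 64, 4096):
    data = sample * copies
    mb_per_s = transcode_throughput(data) / 1e6
    print(f"{len(data):>8} bytes: {mb_per_s:.1f} MB/s")
```

On short inputs the fixed per-call overhead tends to dominate, so the throughput figure for the smallest size is usually far below the large-input figure.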

Not sure whether that was sarcastic, but ISO-8859-1 (Latin-1) encodes most Western European languages, not just Latin.

https://en.wikipedia.org/wiki/ISO/IEC_8859-1

But where do you find it? Almost the entirety of the internet is UTF-8. You can always transcode to Latin-1 for testing purposes, but that raises the question of what practical benefit this algorithm has.
Older corpora are probably still in Latin-1 or some variant. That could include decades of newspaper publications.
All of Europe has written in Latin-1 for a decade. There are billions of files encoded in Latin-1 everywhere.
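For what it's worth, producing Latin-1 test data from UTF-8 is a couple of lines. A minimal sketch using Python's built-in codecs (`utf8_to_latin1` is just an illustrative name, and this is the plain scalar route, not the algorithm from the article):

```python
def utf8_to_latin1(utf8_bytes: bytes) -> bytes:
    """Transcode UTF-8 bytes to Latin-1 (ISO-8859-1).

    Raises UnicodeEncodeError if the text contains a character
    above U+00FF, which Latin-1 cannot represent.
    """
    return utf8_bytes.decode("utf-8").encode("latin-1")

utf8 = "café über señor".encode("utf-8")
latin1 = utf8_to_latin1(utf8)
print(len(utf8), len(latin1))  # prints "18 15"

# Characters outside Latin-1, such as the euro sign (U+20AC), are rejected.
try:
    utf8_to_latin1("€".encode("utf-8"))
except UnicodeEncodeError as exc:
    print("not representable in Latin-1:", exc.reason)
```

Each of the three accented characters takes two bytes in UTF-8 and one in Latin-1, which is where the 18 versus 15 comes from.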
Where?
It's not necessarily about sustained throughput spent only in this routine. It can be small bursts of processing text segments that are then handed off to other parts of the program.

Once a program is optimized to the point where no leaf method or hot loop takes up more than a few percent of runtime, and algorithmic improvements aren't available or are extremely hard to implement, the speed of all the basic routines (memcpy, allocations, string processing, data structures) starts to matter. The constant factors elided by Big-O notation start to matter.

The Vatican?
The "Latin" in Latin-1 refers to the alphabet, not the language. In fact, Latin-1 can encode many Western European languages.
I believe it was a joke.

But the humour may have been lost in translation. It's funnier in the original ASCII.

The high bit is generally used to indicate humour.