| HN Mirror



	by justin101 1036 days ago
	Where does one even go about finding 12 GB of pure Latin text?

4 comments

Rebelgecko 1036 days ago

I had the same question, wondering what sort of workflow would have this task in the critical path. Maybe if the Library of Congress needs to change their default text encoding it'll save a minute or two?

The benchmark result is cool, but I'm curious how well it works with smaller outputs. When I've played around with SIMD stuff in the past, you can't necessarily go off of metrics like "bytes generated per cycle", because of how much CPU frequency can vary when using SIMD instructions, context-switching costs, and different thermal properties (e.g. maybe the work per cycle is higher with SIMD, but the CPU generates heat much more quickly and downclocks itself).
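One rough way to look at the small-input question is to measure wall-clock time per byte at several input sizes instead of relying on a per-cycle figure. The sketch below is only an illustration: it uses Python's built-in codecs as a stand-in for the SIMD routine, and the function name `utf8_ns_per_byte`, the sizes, and the repeat count are all made up for this example.

```python
import time

def utf8_ns_per_byte(size: int, repeats: int = 200) -> float:
    """Best-of-N wall-clock nanoseconds per input byte for Latin-1 -> UTF-8."""
    data = bytes([0xE9]) * size  # 0xE9 is "é" in Latin-1; every byte needs expanding
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter_ns()
        data.decode("latin-1").encode("utf-8")  # stand-in for the routine under test
        elapsed = time.perf_counter_ns() - start
        best = min(best, elapsed)
    return best / size

# Hypothetical sizes: a short string, a page of text, a larger buffer.
results = {size: utf8_ns_per_byte(size) for size in (16, 1024, 65536)}
for size, ns in results.items():
    print(f"{size:>6} bytes: {ns:.3f} ns/byte")
```

Wall-clock time folds in frequency changes and call overhead, which a per-cycle metric hides; it still says nothing about thermal behaviour over long runs.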


lovasoa 1036 days ago

Not sure whether that was sarcastic, but ISO-8859-1 (Latin-1) encodes most European languages, not just Latin.

https://en.wikipedia.org/wiki/ISO/IEC_8859-1


ko27 1036 days ago

But where do you find it? Almost the entirety of the internet is UTF-8. You can always transcode to Latin-1 for testing purposes, but that raises the question of what practical benefit this algorithm has.
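Transcoding for test purposes could look roughly like the sketch below. It assumes the source text is already available as a Python string and that replacing characters outside Latin-1 with "?" is acceptable for test data; the helper names `to_latin1` and `latin1_to_utf8` are hypothetical.

```python
def to_latin1(text: str) -> bytes:
    # errors="replace" substitutes "?" for any character Latin-1 cannot represent.
    return text.encode("latin-1", errors="replace")

def latin1_to_utf8(data: bytes) -> bytes:
    # Every byte value is valid Latin-1, so this decode cannot fail.
    return data.decode("latin-1").encode("utf-8")

sample = "Ça coûte 5 € à Zürich"
latin1 = to_latin1(sample)     # one byte per character; "€" becomes "?"
utf8 = latin1_to_utf8(latin1)  # accented letters expand to two bytes each
```

The euro sign is not part of ISO-8859-1, which is why it is lost in the round trip here.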


tgv 1036 days ago

Older corpora are probably still in Latin-1 or some variant. That could include decades of newspaper publications.


lovasoa 1035 days ago

All of Europe has written in Latin-1 for a decade. There are billions of files encoded in Latin-1 everywhere.


ko27 1035 days ago

Where?


the8472 1036 days ago

It's not necessarily about sustained throughput spent only in this routine. It can be small bursts of processing text segments that are then handed off to other parts of the program.

Once a program is optimized to the point where no leaf method or hot loop takes up more than a few percent of runtime, and algorithmic improvements aren't available or are extremely hard to implement, the speed of all the basic routines (memcpy, allocations, string processing, data structures) starts to matter. The constant factors elided by Big-O notation start to matter.


martijnvds 1036 days ago

The Vatican?


ant6n 1036 days ago

The "Latin" in Latin-1 refers to the alphabet, not the language. In fact, Latin-1 can encode many Western European languages.
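For context on what the conversion involves: each Latin-1 byte value is the same number as its Unicode code point (U+0000 to U+00FF), so bytes below 0x80 pass through unchanged and bytes with the high bit set become two UTF-8 bytes. A minimal scalar sketch of that mapping follows; it is not the SIMD algorithm from the article, and the function name is hypothetical.

```python
def latin1_byte_to_utf8(b: int) -> bytes:
    """Encode one Latin-1 byte value (0-255) as UTF-8."""
    if b < 0x80:
        # ASCII range: identical in Latin-1 and UTF-8.
        return bytes([b])
    # High bit set: two-byte sequence 110xxxxx 10xxxxxx.
    return bytes([0xC0 | (b >> 6), 0x80 | (b & 0x3F)])

def latin1_to_utf8_scalar(data: bytes) -> bytes:
    return b"".join(latin1_byte_to_utf8(b) for b in data)
```

For example, 0xE9 ("é") maps to the two bytes 0xC3 0xA9.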


CoastalCoder 1036 days ago

I believe it was a joke.

But the humour may have been lost in translation. It's funnier in the original ASCII.


mmastrac 1036 days ago

The high bit is generally used to indicate humour.
