Hacker News
by whistle650 711 days ago
https://chatgpt.com/share/13f553a8-5cff-42a1-be95-4a9d33cd10...

May also be easy to correct a lot of it:

“For better safekeeping, Russia’s $24,000,000 collection of crown jewels, probably the finest array of gems ever assembled at one time,”

3 comments

But are you correcting the OCR or miscorrecting the originals?

I want original text, including misspellings, and original regional / historical spellings, including slang (which may look like another word, but is not, and isn't in a dictionary).

You cannot fix OCR text without looking at the original.

With the spelling having been fixed, even if imperfectly, you could much more easily search for content and find relevant results, and then go on to look at the originals. What you want is still possible, unless you unreasonably make it a requirement that the transcriptions should be perfect.
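A rough sketch of that workflow, assuming each page keeps its untouched OCR output next to a machine-corrected copy. The `Page` record, the `search` helper, and the sample strings are all made up for illustration and are not taken from any real archive or tool:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    page_id: str
    raw_ocr: str    # untouched OCR output, never overwritten
    corrected: str  # machine-corrected copy, used only to help search


def search(pages, query):
    """Return pages whose corrected or raw text contains the query (case-insensitive)."""
    q = query.casefold()
    return [
        p for p in pages
        if q in p.corrected.casefold() or q in p.raw_ocr.casefold()
    ]


# Hypothetical sample data: the "raw" strings are invented OCR-style errors.
pages = [
    Page("p1", "Russia's crovvn jevvels", "Russia's crown jewels"),
    Page("p2", "an attenpt to mitigate", "an attempt to mitigate"),
]

hits = search(pages, "crown jewels")
for p in hits:
    # The hit is found via the corrected text, but the reader is pointed at the raw text.
    print(p.page_id, "| raw:", p.raw_ocr)
```

Searching both fields means a query can match either the corrected spelling or the original one, and the raw text is still there to check against the scan.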
Proper digital transcription means accuracy, not "close enough".
To quote myself, "every interesting data set will have inaccuracies in it".
There is a vast difference between a rare, honest mistake, and an attempt to mitigate them...

vs willingly knowing you are introducing corrections that are ridiculously wrong.

Advocating and being a champion for inaccuracy really isn't a positive. You should find a new thing to quote about yourself.

This is not what this phrase is about. I came to it working on the structural data of just under 100k Chinese characters. I'd spend hours, days and weeks proofreading and correcting formulas, so your "advocating and being a champion for inaccuracy" doesn't stick. But absent an automated, complete coverage of all records against a known error-free data set, there will likely be a small percentage of errors and dubious cases.

And thanks, by the way, for the readiness to jump to conclusions and fire a salvo of allegations, viz. "willingly", "knowingly", "introducing", "ridiculous".

"$2¢4,000,000" should be "$204,000,000" rather than ChatGPT's "$24,000,000".
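One way to catch this kind of miscorrection, as a sketch only: flag any correction that touches a token containing digits and send it to a human who can check the scan. The helper names `numeric_tokens` and `needs_review` are hypothetical, and the raw sentence below is pieced together from the fragment quoted in this thread, so treat it as an assumption about what the OCR output looked like:

```python
import re


def numeric_tokens(text):
    """Return every whitespace-delimited token that contains at least one digit."""
    return re.findall(r"\S*\d\S*", text)


def needs_review(raw, corrected):
    """True if the correction changed any digit-bearing token."""
    return numeric_tokens(raw) != numeric_tokens(corrected)


# Assumed raw OCR line (reconstructed) and the model's corrected version.
raw = "Russia's $2¢4,000,000 collection of crown jewels"
fixed = "Russia's $24,000,000 collection of crown jewels"

print(needs_review(raw, fixed))
```

Comparing whole tokens rather than just the digits matters here: dropping the "¢" leaves the digits themselves unchanged, so a digits-only comparison would miss that a misread "0" was deleted.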
Are you aware of any models that perform as well as an LLM on this task at lower cost?
Self-hosted LLM?