

by saalweachter 3817 days ago

So I've got an HTML file and a PDF file; one was generated from the other. In the generated file, when I go to highlight some portions, I discover that it has been converted into a column format and it pastes wrong. It is easy to imagine that PDF being converted into a third format, and that conversion going slightly wrong because of the aforementioned weirdness.

So imagine you have generations of digital files, converted from format to format to format. Why? Because software engineers are assholes who keep inventing new formats, for stupid goddamn reasons. A hundred generations from now, you'll be looking at copies of copies of copies which aren't just checksummable bit-wise copies but transcriptions, with transcription errors. If our descendants are lucky, all of those different versions will be preserved as well, but that just means that if you notice a possible transcription error, you'll have the opportunity to dig into two-hundred-year-old character and file encodings to try to figure out what the original text was.

Which is not that different a situation from trying to guess whether a scribe three copies ago misread the scribe four copies ago's atrocious f for a t.
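To make the bit-wise copy vs. transcription distinction concrete, here is a minimal sketch using only the Python standard library. The sample text, the choice of UTF-8 and Latin-1, and the helper name `sha256_hex` are all assumptions picked for illustration, not anything from the files described above.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


# Assumed sample text; any text with non-ASCII characters would do.
text = "na\u00efve caf\u00e9"
original = text.encode("utf-8")

# A bit-wise copy: identical bytes, so the checksum verifies it.
bitwise_copy = bytes(original)

# A faithful "transcription" into another encoding: the text survives,
# but the bytes differ, so a checksum can no longer vouch for it.
transcribed = text.encode("latin-1")

# A faulty transcription: the UTF-8 bytes are misread as Latin-1,
# which silently garbles every non-ASCII character.
garbled = original.decode("latin-1")

# The "dig into old encodings" repair: undo the wrong decoding step.
repaired = garbled.encode("latin-1").decode("utf-8")

print(sha256_hex(original) == sha256_hex(bitwise_copy))
print(sha256_hex(original) == sha256_hex(transcribed))
print(garbled == text, repaired == text)
```

The checksum only helps for the first case. Once a file has been transcribed, correct or not, you are back to comparing the text itself and guessing which encoding some intermediate step assumed.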

1 comment

cooper12 3816 days ago

Yeah, converting from one format to another that isn't completely compatible might be an issue in the future. Even how to preserve websites isn't exactly intuitive: as you saw, the conversion to PDF was faulty, and doing "File" > "Save As" would not yield the same HTML because the browser modifies the DOM. We have to start using formats specifically designed for archiving, such as WARC for webpages: http://www.digitalpreservation.gov/formats/fdd/fdd000236.sht.... (a relevant and interesting site in general, run by the Library of Congress)
