by hilbert42 687 days ago
Exchanging information between different data formats is one of the biggest problems I've experienced in computing and IT and it's been thus from the earliest days.

Having so many formats is confusing and inefficient, and it leads to data loss. This article is right: CSV is king simply because it's essentially the lowest common denominator, and I, like most of us, use it for that reason. At least that's so for data that can be stored in database-type formats.

But take other data such as images, sound, video (AVI and the like), and even text. There are dozens of sound, image and other formats. It's all a first-class mess.

For example, we fall back on the antiquated, horrible JPG format because we can't agree on better ones such as, say, JPEG 2000. There are always excuses why we can't, such as speed, data size, inefficient algorithms, etc.

Take word processing, for instance: why is it so hard to convert Microsoft's confounded, nasty DOC format to, say, the OpenDocument ODT format without errors? It's almost impossible to get the layout in one format converted accurately into another. Similarly, information is lost converting from lossless TIF to, say, JPG, or from WAV to MP3, etc. What's worse is that so few seem to care about such things.

Every time a conversion is done from a lossless format to a lossy one, information is irreversibly discarded. That's not to say it shouldn't happen; it's just that, looking at the converted result in isolation, one has little or no idea about the quality of the original material. Even with ever-increasing speeds and more and more storage space, so many still have an obsession (in fact a fetish) with compressing data into smaller and smaller sizes using lossy formats, with little regard for what's actually lost.

It's not only in sound and image formats that data integrity loses out to convenience. Take the case of converting data fields from one format to another. How often has one experienced a field being truncated during conversion, where say 128 characters suddenly become 64 or so and the converter gives no indication that data has actually been truncated? Many times, I'd suggest.
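
To make the truncation complaint concrete, here's a rough sketch of what a converter could do instead of cutting a field silently. The function name `fit_field`, the field name and the 128/64 limits are made up for illustration and aren't taken from any real converter.

```python
def fit_field(name, value, max_len, warnings):
    """Fit value into max_len characters, recording a warning if anything is cut."""
    if len(value) > max_len:
        warnings.append(
            f"{name}: truncated from {len(value)} to {max_len} characters"
        )
        return value[:max_len]
    return value


# Hypothetical case: a 128-character field going into a 64-character column.
warnings = []
note = fit_field("note", "x" * 128, 64, warnings)

print(len(note))  # 64
for message in warnings:
    print("WARNING:", message)
```

The data is still cut, but at least the user is told, and a stricter converter could refuse to continue instead.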

Another instance is where fields in the original data don't exist in the converted format. For example, data is often lost from one's phone contacts when they're transferred from an old phone to a new one because the new phone doesn't accommodate all the fields of the old one.
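
The same goes for fields the target format doesn't have. A minimal sketch, assuming contacts are plain dictionaries and the new phone supports a known set of field names (`convert_contact` and all the field names here are hypothetical):

```python
def convert_contact(contact, supported_fields):
    """Return the fields the target keeps, plus the names of those it would drop."""
    kept = {k: v for k, v in contact.items() if k in supported_fields}
    dropped = sorted(k for k in contact if k not in supported_fields)
    return kept, dropped


# Hypothetical old-phone contact and a new phone that only knows two fields.
old_contact = {
    "name": "Ann",
    "mobile": "555-0100",
    "birthday": "1 May",
    "nickname": "Annie",
}
kept, dropped = convert_contact(old_contact, {"name", "mobile"})

if dropped:
    print("Warning: these fields were not carried over:", ", ".join(dropped))
```

Reporting the dropped field names costs a couple of lines, which is rather the point: the user gets a chance to save that data somewhere else before it's gone.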

Programmers really have a damn hide for not only allowing this to occur but for not even warning the poor hapless user that some of his/her data has been lost.

That programmers have so little regard and consideration for data integrity is, I reckon, a terrible situation and a blight on the whole IT industry.

Why doesn't computer science take these issues more seriously?


>Why doesn't computer science take these issues more seriously?

Simple, cost. A company is not going to approve a project just to move to a new standard. Plus you have new hires coming in with their favorite "Standard of the Day" and using it no matter what they are told.

Management only cares about the end result (i.e. the bottom line), not how it got there.

"Simple, cost."

That lack of consideration for users' data will ultimately lead to regulation. Much of a user's data is only machine-readable, so ordinary users shouldn't be expected to know when their data has been truncated after, say, a data conversion. They can't be held responsible for only realizing their data is corrupted long after the event, past the point where it can be corrected.

It's like everything else: originally there are the Wild West days when everything's a free-for-all, but regulations eventually kick in once the harm done is considered unacceptable. We've seen regulations introduced everywhere else: food (pure food acts), pharmaceuticals (FDA), transport (NTSB), water purity standards and so on. So eventually computing/IT will be no exception.

Unfortunately, computing/IT is still in the 'Wild West' days. Personally, I can hardly wait for those enforced regulations to become effective.