Hacker News new | ask | show | jobs
by otherme123 2140 days ago
In the case menctioned in the article, LibreOffice is on par with Excel. Don't believe it? Open up LibreOffice, type "MAR-1" or "SEP-2" in any cell, and look how easy is your data corrupted.

I have meet some other glitches: identifiers in the form of "1E123" get turned into scientific numbers. The column was something like "1A123", "1B123", etc. Those things are sneaky: you can have thousands of rows, and Excel/LO doesn't mind if only the 0.1% matches its rules for smartness. They just change without notice, leaving the others intact.

I'm a bioinformatic, and this kind of stuff is a daily issue. IMO, two problems colide:

1. People with very rudimentary knowledge of computers. I've meet some people way smarter than me, with thousands of papers written, that cannot open a CSV with R with a gun to their head. They can cut you, put three robotic arms in your heart, remove a tumour, sew you and send you home in three days. But R is just too much.

2. People that needs to collect and analyze data, and Excel is the easiest tool they know, but at the same time it's no lame toy: it sorts, it sums, it filters, it makes stats, it graphs... You cannot ask this people to use SQL, specially if it involves foreign keys. They just use Excel for data collection, storage and analysis.

Excel (and LibreOffice) are to blame. This shit could be avoided if they had a "scientific mode" to not be smart with your data, and be the default mode for any file in CSV/TSV mode. Unless you explicitly turn the file into a XLS/ODT or ask "turn this range of cells into dates for me" or "this cells are scientific numbers", they should not change a single dot.

2 comments

> This shit could be avoided if they had a "scientific mode" to not be smart with your data, and be the default mode for any file in CSV/TSV mode.

There's a whole lot of business use of CSV/TSV with Excel that benefits from the existing defaults. Heck, I've seen federal government websites distributing code lists that are actually in CSV format with .XLS extensions and that are expected to use the default “smart” conversion.

Even if Excel should arguably have had different defaults, the impact on other uses of changing it now would be enormous.

Then we are doomed to the lowest denominator. The bad actor here is the one that expects some software to magically change your data, and he should expect an error or at least a warning or asking for confirmation.

To good actor is people who carefully care and curate their data, just to be corrupted silently by the program.

Lets say then, keep the default but add at least some sort of "respect my data as it is, because this is a CSV with CSV extension" mode.

> Lets say then, keep the default but add at least some sort of "respect my data as it is, because this is a CSV with CSV extension" mode.

CSV doesn't have really data types (or even a hard and fast standard), so that doesn't really work. OTOH, Excel lets you specify data handling by column when importing from untyped text-based formats like CSV, which you should probably do if you care.

Or you should write up a standard for a CSV-like format with explicit semantics around data types (it could even be CSV-compatible so that you get no worse results than normal CSV handling if using software that is aware of CSV but not the new format) and start using it where those defined semantics are useful, and try to get Microsoft to add specific support to Excel for it.

> CSV-like format with explicit semantics around data types

You could encode the column names in row 1 and column types in row 2. JavaScript type names would be convenient for most CRUD work, but SQL type names might be more appropriate for more complicated data.

> if they had a "scientific mode" to not be smart with your data

That "scientific mode" is called text. Assign the column type Text during CSV import and nothing is converted to numeric or date.