Hacker News new | ask | show | jobs
by acdha 3447 days ago
You need a lot more than that to handle CSV in the wild (quoting, Unicode, line termination, etc.) but the real killer I see is when it's edited by humans. The special cases for errors and inconsistencies will add up quickly; in some cases you may be able to reject invalid data but you may not have that option or an easy way to tell whether any particular value is wrong.

Excel takes that, adds some fun things like people using color and formatting to store data, and things like Excel auto-corrupting values which look like dates and may not have been noticed before you do something with the data.

3 comments

I know of at least one company whose entire business is handling this stuff. They find growing companies as they hit critical mass and need to move their Excel data into a real database. The product is just "Your data is hideous and was entered by hand without validation or formatting; it'll never convert and it'll be wrong when it does. We can help."

They handle all kinds of theory and technical stuff, like normalization and processing Excel-corrupted dates. But they also handle a lot of easy-but-agonizing tasks like regularizing single quotes into apostrophes, which crop as soon as you let humans enter free-form data.

I used to use Google Refine (now OpenRefine [0]) for this. It lets you load up the data and then apply rules to see if they are mostly correct. It doesn't get you all the way, but it is better than going blind on manually revising a huge Excel "database".

[0] http://openrefine.org/

What do you use now?
Could you share the company name?
I'll try to remember. I ran into them at a career fair a few years ago, so it's not leaping to mind, but it seemed like they had good software and a great market niche.
Let's not forget Japan Post's CSV for all the Japanese Address data that contains some lines that are line-wrapped, that is, one record spans two or more lines in the CSV file. A line-wrapped CSV... I just can't even.
That's why ASCII was designed with record and field separators. Unfortunately, it's not used (de facto) for delimited files.
That is very interesting, thanks! I hadn't thought about Unicode or tolerating human error. Although the times I have worked with it have been when it is a transport medium between two computer programs.
That's definitely a less-aggravating situation by far. I've had a lot of cases where a significant amount of specialist human time was in a spreadsheet and it's really made me wish there was an Excel-for-data which acknowledges how many people are using it for semi-structured data like this.
Like Airtable?