| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by samwilliams 3447 days ago

Do you mean Excel to CSV? Or a CSV importer?

I totally agree about Excel importing, but CSV is trivial, no? Here is an Erlang version I happened to write yesterday:

  lists:map(
    fun(Row) -> string:tokens(Row, [SepChar]) end,
    string:tokens(InputStr, "\n")
  ).

EDIT: I know this version won't support escaped separator/newline characters, but I made it for a specific use case in which I knew that would not occur. Adding that functionality would make it a little messier, but still not too bad.

EDIT2: Thanks for the interesting comments! Not so trivial after all!

Perhaps a more accurate version of what I was attempting to say above is that 'it is often (not always) easy to build a CSV parser to interact with one specific program'. The four line version above works perfectly for reading the type of files I designed it for. If you want to work with human created, or more complex variants of CSV, all bets are off.

5 comments

acdha 3447 days ago

You need a lot more than that to handle CSV in the wild (quoting, Unicode, line termination, etc.) but the real killer I see is when it's edited by humans. The special cases for errors and inconsistencies will add up quickly; in some cases you may be able to reject invalid data but you may not have that option or an easy way to tell whether any particular value is wrong.

Excel takes that, adds some fun things like people using color and formatting to store data, and things like Excel auto-corrupting values which look like dates and may not have been noticed before you do something with the data.

Bartweiss 3447 days ago

I know of at least one company whose entire business is handling this stuff. They find growing companies as they hit critical mass and need to move their Excel data into a real database. The product is just "Your data is hideous and was entered by hand without validation or formatting; it'll never convert and it'll be wrong when it does. We can help."

They handle all kinds of theory and technical stuff, like normalization and processing Excel-corrupted dates. But they also handle a lot of easy-but-agonizing tasks like regularizing single quotes into apostrophes, which crop as soon as you let humans enter free-form data.

coredog64 3447 days ago

I used to use Google Refine (now OpenRefine [0]) for this. It lets you load up the data and then apply rules to see if they are mostly correct. It doesn't get you all the way, but it is better than going blind on manually revising a huge Excel "database".

[0] http://openrefine.org/

rabidrat 3447 days ago

What do you use now?

btown 3447 days ago

Could you share the company name?

Bartweiss 3447 days ago

I'll try to remember. I ran into them at a career fair a few years ago, so it's not leaping to mind, but it seemed like they had good software and a great market niche.

nicostouch 3447 days ago

Let's not forget Japan Post's CSV for all the Japanese Address data that contains some lines that are line-wrapped, that is, one record spans two or more lines in the CSV file. A line-wrapped CSV... I just can't even.

johnnyhillbilly 3447 days ago

That's why ASCII was designed with record and field separators. Unfortunately, it's not used (de facto) for delimited files.

samwilliams 3447 days ago

That is very interesting, thanks! I hadn't thought about Unicode or tolerating human error. Although the times I have worked with it have been when it is a transport medium between two computer programs.

acdha 3447 days ago

That's definitely a less-aggravating situation by far. I've had a lot of cases where a significant amount of specialist human time was in a spreadsheet and it's really made me wish there was an Excel-for-data which acknowledges how many people are using it for semi-structured data like this.

nicostouch 3447 days ago

Like Airtable?

byroot 3447 days ago

I'm not talking about parsing. It's a mess in it's own right of course (encoding, line terminators, etc as others mentioned).

I'm talking about the actual conversion from tabular data to relational. Most of the applications I've worked on had this in one form or another.

So you end up with users downloading an export of their data in CSV, editing it in Excel in various ways, and then reimporting it in the application.

Every company I worked for, this kind of feature was always in the top 3 in term of support load.

johnnyhillbilly 3447 days ago

"Relational" means "tabular". (A "relation" in relational theory is a table with a name, fields with names and types, and the data in the table.)

A "relationship" in an ER diagram maps to a "reference" in relational theory. This is part of the type safety/domain system of RDBMSs.

If these concepts are muddled, SQL will never quite make sense :)

dragonwriter 3447 days ago

> "Relational" means "tabular".

Relational database can be expressed in tabular form, but tabular data is not necessarily relational.

> (A "relation" in relational theory is a table with a name, fields with names and types, and the data in the table.)

A relation is a system of one or more functions (in the mathematical sense) each of which has a domain that is a candidate key of the relation and a range that is the composite of the non-key attributes.

johnnyhillbilly 3446 days ago

Interesting definition. Do you have a source for it. It seems ambiguous.

From the Wikipedia article on relational databases, subsection relational model. "This model organizes data into one or more tables (or "relations") of columns and rows, with a unique key identifying each row. Rows are also called records or tuples."

samwilliams 3447 days ago

Ah right, yes, I can imagine that would be extremely messy!

mattmanser 3447 days ago

Your edit is delightful.

"No, honest guys, I knew CSV was more complicated. I just didn't need to make my code safe."

Here's a csv parser in Erlang that actually attempts all that trivial stuff:

https://github.com/rcouch/ecsv/blob/master/src/ecsv_parser.e...

That's a lot more code than yours. And the notes even say it's not tolerant of badly formed CSVs.

samwilliams 3447 days ago

I submitted that edit before the post had any replies.

Also, I did try to make clear that the given code was created 'for a specific use case in which I knew' that the format of the input files was tightly defined.

dragonwriter 3447 days ago

> but CSV is trivial, no?

You can define a narrow subset or version of CSV that is trivial, but that doesn't reflect what one finds in the wild as "CSV", which was not systematically defined or described until well after many mutually incompatible things by that name.wdre well established.

pizza234 3447 days ago

CSV is trivial? You may have missed this:

    http://tburette.github.io/blog/2014/05/25/so-you-want-to-write-your-own-CSV-code/

:-)

samwilliams 3447 days ago

This is interesting, and as the other commenters have pointed out, creating a parser for /all/ variations of CSV can be very tricky.

always_good 3447 days ago

But your code doesn't even handle the trivial case.

    "She said, \"Hello, world!\""

You can drop the "meh, I know I didn't handle all the complicated cases" act.

We all recognize the classic developer I-could-build-that-in-a-weekend hubris when we see it. :)

samwilliams 3447 days ago

Hi,

Thanks for your thoughts. As I have stated elsewhere, the code handles all of the cases I needed it to handle, due to the stability of the input file format (which was emitted from another program). I don't see that this should be too hard to believe.

I also said in my second edit, on the top line, 'Not so trivial after all!'. If I was putting on some kind of act, wouldn't that have been dropping it? Further, I noted in my first edit, before I had received any replies, that I 'know this version won't support escaped separator/newline characters', so I am not sure what you were trying to add with your example?

I think that my central point (and I totally accept that I didn't express this well) is that depending on the specifications of your program, the required CSV parser /can be/ very short. When one compares this to other data exchange formats, for example JSON, it is clear that the barrier to /entry/ is much lower. The shortest JSON parser I could find with a cursory look was 200 lines of C.

I totally appreciate that to write a CSV parser that works for all cases would be extremely longwinded. It has been interesting to hear other people's experiences and opinions about that. But the fact remains true that /in some cases/, depending on the requirements of the program, the parser can be very short.

> We all recognize the classic developer I-could-build-that-in-a-weekend hubris when we see it. :)

It is funny you should say this. I needed the CSV parser because I thought it would be fun and interesting to see if I could build an anti-malware tool in a week (I am taking a malware detection class at the moment, I wanted it done before the next lecture). I did not expect I would be able to have anything good working in that time, but by the early hours of the next morning I had a perfectly functional anti-malware tool. It can use ClamAV signatures (so it can detect everything(?) that ClamAV can), runs in parallel, has a nice text console with DSL, and is fast enough (processing 210k small files in ~5 minutes, checking against ~60k sigs). It is about 650 lines of Erlang (including comments). I am saying this not to boast(!), but to make the point that I greatly underestimated how productive I could be, beat my expectations by many fold, then people comment about my hubris online the next day. It is funny how life goes!

Thanks,

Sam