Comparisons among Java-based CSV parsers (github.com)
12 points by kmoe 4276 days ago
4 comments

Taken with a grain of salt, of course, because the winner is the one running the competition.

Worth noting:

- All but the last-place finisher here are actually placed quite closely in performance, given that they're working on a 3 million record file.

- Other performance stats that could be more relevant (depending on what you're doing with CSV...): startup time, memory footprint, any differences in handling based on very long or very short rows

- Given similar performance on the above, what's actually more important (for most uses): elegance & consistency of API, support for various CSV formats (e.g., Excel vs. RFC-4180 etc. vs. flexibility for rolling your own format), and sensible error handling options (like: don't blow up if there's one row with a different number of columns).
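To make the "don't blow up" point concrete, here is a minimal sketch in plain Java. It is not taken from any of the benchmarked parsers: the `LenientRows` class, its `Result` holder, and the `filter` method are hypothetical names for illustration, and the sketch assumes the rows have already been split into fields by whatever parser you use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects rows with an unexpected column count
// instead of throwing on the first one.
final class LenientRows {

    /** Rows that matched the expected width, plus the 1-based numbers of rows that did not. */
    static final class Result {
        final List<String[]> accepted = new ArrayList<>();
        final List<Integer> rejectedRowNumbers = new ArrayList<>();
    }

    /** Keep rows with exactly expectedColumns fields; record the rest for later reporting. */
    static Result filter(List<String[]> rows, int expectedColumns) {
        Result result = new Result();
        for (int i = 0; i < rows.size(); i++) {
            String[] row = rows.get(i);
            if (row.length == expectedColumns) {
                result.accepted.add(row);
            } else {
                // Remember the row number (1-based) rather than aborting the whole import.
                result.rejectedRowNumbers.add(i + 1);
            }
        }
        return result;
    }
}
```

The caller can then decide whether a handful of rejected rows is worth a warning or a hard failure, which is the kind of option a parser API could expose directly.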

I've hardly reviewed any of these, so I can't really compare them usefully, but I've been using the Apache Commons CSV parser 1.0 version recently (finally released after who knows how many years in semi-hibernation!), and it's been pleasant to work with thus far.

I agree with much of this, especially API simplicity. I usually reach for openCSV for the same reason.

Definitely applaud the effort, and it would be good to extend the test corpus in terms of record length and escape complexity. I do think 3M records is on the low side. It would be good to see scale tests for 10M, 100M, and 1B records too.

They're mostly operating on streams, so at some point (based mostly on how GC is managing, I imagine) the speed will be constant per-row regardless of the record count.

Jackson has a CSV parser that is not in the list: https://github.com/FasterXML/jackson-dataformat-csv

Side note: CSV is boring and unsexy as formats go, but it's also dead easy for companies to provide and even automate with minimal technical staff on hand; this is one of the reasons I'm working with it at the moment ("CSV uploaded via SFTP" is on the list of interfaces we support for data integrations).

Think of the character escapes involved in something like XML or even JSON, for example; for CSV you escape only the double-quote, and you escape that by adding a second double quote -- so you don't need to mess with escaping your escape character. The main problem with CSV is more that there are several possible specs for it...
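A rough sketch of that quote-doubling rule in plain Java, for illustration only. The `CsvEscape` class and its method names are made up here, not part of any library mentioned in this thread, and the quoting condition (comma, double quote, CR, or LF) follows my reading of RFC 4180 rather than any Excel dialect.

```java
// Hypothetical helper showing RFC 4180 style field escaping.
final class CsvEscape {
    private CsvEscape() {}

    /** Quote a field only if it contains a comma, quote, CR, or LF; double any embedded quotes. */
    static String escapeField(String field) {
        boolean needsQuoting = field.indexOf(',') >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuoting) {
            return field;
        }
        // The only escape: each double quote becomes two double quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Reverse of escapeField for a single, already-isolated field. */
    static String unescapeField(String field) {
        if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
            return field.substring(1, field.length() - 1).replace("\"\"", "\"");
        }
        return field;
    }

    public static void main(String[] args) {
        System.out.println(escapeField("say \"hi\", please"));
    }
}
```

Because the quote character escapes itself, there is no separate escape character that would need escaping in turn.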

There's only one RFC though.
RFC 4180: http://tools.ietf.org/html/rfc4180

... which is compatible with the data I used to get out of Lotus 1-2-3 (and some other things) back in the 80s.

Only one RFC... but that doesn't mean you can just reject all Excel-exported CSV as "invalid" and force those people to figure out some other solution. :(

I think they missed this parser: http://ostermiller.org/utils/CSV.html