| HN Mirror

by jtheory 4276 days ago

Taken with a grain of salt, of course, because the winner is the one running the competition.

Worth noting:

- All but the last-place finisher here are actually placed quite closely in performance, given that they're working on a 3 million record file.

- Other performance stats that could be more relevant (depending on what you're doing with CSV...): startup time, memory footprint, any differences in handling based on very long or very short rows

- Given similar performance on the above, what's actually more important (for most uses): elegance & consistency of API, support for various CSV formats (e.g., Excel vs. RFC-4180 etc. vs. flexibility for rolling your own format), and sensible error handling options (like: don't blow up if there's one row with a different number of columns).
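To make the last point concrete, here is a minimal sketch of "don't blow up on one odd row", using Python's standard `csv` module rather than any of the benchmarked Java parsers. The function name `parse_lenient` and the `expected_columns` parameter are made up for illustration; a real library would likely expose this as a configuration option instead.

```python
import csv
import io


def parse_lenient(stream, expected_columns):
    """Return (good_rows, bad_rows) without aborting on a malformed row.

    Hypothetical helper for illustration; not part of any library.
    """
    good_rows, bad_rows = [], []
    reader = csv.reader(stream)
    for row in reader:
        if len(row) == expected_columns:
            good_rows.append(row)
        else:
            # Keep the line number and the raw fields so the caller can
            # log, repair, or reject the row later.
            bad_rows.append((reader.line_num, row))
    return good_rows, bad_rows


# The third line is missing a column; the quoted field contains a comma.
sample = io.StringIO('id,name,city\n1,"Smith, J",Paris\n2,Lee\n3,Ng,Oslo\n')
good, bad = parse_lenient(sample, expected_columns=3)
```

The short row ends up in `bad` together with its line number, and parsing continues with the rows after it.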

I've hardly reviewed any of these, so I can't really compare them usefully, but I've been using version 1.0 of the Apache Commons CSV parser recently (finally released after who knows how many years in semi-hibernation!), and it's been pleasant to work with thus far.

farmfood 4276 days ago

I agree with much of this, especially API simplicity. I usually reach for openCSV for the same reason.

I definitely applaud the effort, and it would be good to extend the test corpus in terms of record length and escape complexity. I do think 3M records is on the low side. It would be good to see scale tests for 10M, 100M, and 1BN records too.

jtheory 4276 days ago

They're mostly operating on streams, so at some point (based mostly on how GC is managing, I imagine) the speed will be constant per-row regardless of the record count.
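A rough sketch of what "operating on streams" means in practice, again in Python's standard `csv` module as a stand-in for the Java parsers under discussion. The names `synthetic_lines` and `stream_total` are hypothetical, and the generated lines simply stand in for a large file on disk. Only the current row is held at any time, so the work per row should not depend on how many rows came before it.

```python
import csv


def synthetic_lines(n):
    """Lazily generate n CSV lines (hypothetical stand-in for a big file)."""
    for i in range(n):
        yield f"{i},name{i},{i % 7}\n"


def stream_total(lines):
    """Consume rows one at a time, keeping only running aggregates."""
    rows = 0
    total = 0
    # csv.reader accepts any iterable of strings, so nothing is buffered
    # beyond the row currently being parsed.
    for row in csv.reader(lines):
        rows += 1
        total += int(row[2])
    return rows, total


rows, total = stream_total(synthetic_lines(100_000))
```

Scaling `n` up by 10x or 100x should scale the run time roughly linearly while memory stays flat, which is why very large record counts mostly tell you about throughput and GC behaviour rather than about the parser's design.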
