by mschuster91 1649 days ago
If you're doing user-supplied CSVs, definitely... but if you are ingesting CSVs from a known source with known format (<insert audible sigh here>) it can definitely make sense to use a high-speed optimized ingester.

One might wonder if it might be worth the time to look into optimising the runtimes of various languages. I took a look: all operate on naive byte-by-byte scanning, and all sans PHP are written in the respective language, which means any form of SIMD optimization is right off the table (okay, maybe something could be done in Java, but it seems incredibly complex, see https://www.morling.dev/blog/fizzbuzz-simd-style/):

- PHP isn't optimized anywhere, but at least it's C: https://github.com/php/php-src/blob/1c0e613cf1a24cdc159861e4...

- Python's C implementation is the same: https://github.com/python/cpython/blob/main/Modules/_csv.c

- Java doesn't have a "standard" way at all (https://www.baeldung.com/java-csv-file-array), and OpenCSV seems the usual object-oriented hell (https://sourceforge.net/p/opencsv/source/ci/master/tree/src/...).

- Ruby's CSV is native Ruby: https://github.com/ruby/ruby/blob/bd65757f394255ceeb2c958e87...
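
For readers unfamiliar with the term, "naive byte-by-byte scanning" looks roughly like the sketch below. It is not taken from any of the linked implementations; `split_record` and its parameters are hypothetical names, and it only handles a single record with RFC 4180-style doubled quotes. The point is that every byte goes through a branch, which is the pattern SIMD-based parsers try to avoid.

```python
def split_record(line: bytes, delim: int = ord(","), quote: int = ord('"')) -> list[bytes]:
    """Split one CSV record by inspecting it one byte at a time (illustrative only)."""
    fields: list[bytes] = []
    current = bytearray()
    in_quotes = False
    i = 0
    n = len(line)
    while i < n:
        b = line[i]
        if in_quotes:
            if b == quote:
                if i + 1 < n and line[i + 1] == quote:
                    # A doubled quote inside a quoted field is a literal quote.
                    current.append(quote)
                    i += 1
                else:
                    in_quotes = False
            else:
                current.append(b)
        elif b == quote:
            in_quotes = True
        elif b == delim:
            # Unquoted delimiter: the current field is complete.
            fields.append(bytes(current))
            current.clear()
        else:
            current.append(b)
        i += 1
    fields.append(bytes(current))
    return fields
```

A call such as `split_record(b'a,"b,c",d')` keeps the quoted comma inside the second field rather than splitting on it.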

3 comments

Python's csv imports _csv for core functionality, which is C: https://github.com/python/cpython/blob/main/Modules/_csv.c
Thanks! Updated accordingly.
You should update "all sans PHP" to reflect the update.
Perl's best-known library, Text::CSV, has both a pure-Perl and a C implementation.

Here is the C version:

https://github.com/Tux/Text-CSV_XS/blob/master/CSV_XS.xs

It’s funny, CSV files are so common and yet many mainstream languages don’t even attempt a decent parser baked in. I think dotnet has 3-4 different ones and as I recall they’re all pretty slow.
There are multiple dialects of CSV. Besides the more standardish dialect there are some weird ones that prevent some types of optimization. I remember Apple's "Enterprise Partner Feed" had a dialect I've never seen elsewhere so far. Columns were separated by 0x01, rows were separated by 0x02 0x0A.

The row separator being two bytes throws a wrench in most parsers.
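
To make the dialect concrete, here is a minimal sketch of splitting such a feed in Python. Only the two separators come from the description above; `parse_feed` is a hypothetical helper, and the sketch assumes there is no quoting or escaping, which may not match the real feed.

```python
FIELD_SEP = b"\x01"    # column separator, as described above
ROW_SEP = b"\x02\n"    # two-byte row separator: 0x02 followed by 0x0A

def parse_feed(data: bytes) -> list[list[bytes]]:
    """Split a feed into rows and fields (assumes no quoting or escaping)."""
    rows = data.split(ROW_SEP)
    # A trailing row separator leaves one empty chunk at the end; drop it.
    if rows and rows[-1] == b"":
        rows.pop()
    return [row.split(FIELD_SEP) for row in rows]
```

Because a bare 0x0A is not a row separator here, a field can contain a plain newline without any quoting, which is exactly what trips up parsers that split on a single byte.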

What a bizarre choice. If they're going to commit to weird ASCII control chars you'd think they could just use 0x1C to 0x1F, which are explicitly intended as delimiters/Separators... sigh. (I've always wondered why more people don't use the various Separators, but I admit human-readability is a big advantage)
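
For illustration, a small sketch of what using the ASCII separators could look like, with 0x1F (unit separator) between fields and 0x1E (record separator) between rows. `dump` and `load` are hypothetical helpers, not an existing library API, and the sketch simply rejects fields that contain a separator instead of escaping them.

```python
US = "\x1f"  # ASCII unit separator, used here between fields
RS = "\x1e"  # ASCII record separator, used here between rows

def dump(rows: list[list[str]]) -> str:
    """Serialize rows; refuses fields that contain a separator character."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in rows)

def load(text: str) -> list[list[str]]:
    """Parse text produced by dump(). Empty input is treated as zero rows."""
    if not text:
        return []
    return [record.split(US) for record in text.split(RS)]
```

Commas, quotes and newlines inside fields need no quoting at all in this scheme, at the cost of a file you cannot comfortably read or edit in a text editor.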
> The row separator being two bytes throws a wrench in most parsers.

Huh? Anything that ingests Windows-origin files needs to be able to handle \r\n by default.
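
As one example of a parser that copes with a two-byte row terminator, Python's standard `csv` module reads \r\n-terminated input when the underlying stream is opened with `newline=""`. `read_rows` below is just a hypothetical wrapper for demonstration.

```python
import csv
import io

def read_rows(text: str) -> list[list[str]]:
    """Parse CSV text whose rows may end in \\r\\n or \\n."""
    # newline="" stops the stream from translating line endings,
    # so the csv module sees them and handles them itself.
    return list(csv.reader(io.StringIO(text, newline="")))
```

With this, `read_rows("a,b\r\n1,2\r\n")` and `read_rows("a,b\n1,2\n")` produce the same rows.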