| HN Mirror

by mschuster91 1649 days ago

If you're doing user-supplied CSVs, definitely... but if you are ingesting CSVs from a known source with known format (<insert audible sigh here>) it can definitely make sense to use a high-speed optimized ingester.

One might wonder whether it would be worth the time to optimise the CSV parsers in the runtimes of various languages. I took a look: all of them scan naively byte by byte, and all sans PHP are written in the respective language, which means any form of SIMD optimization is right off the table (okay, maybe something could be done in Java, but it seems incredibly complex, see https://www.morling.dev/blog/fizzbuzz-simd-style/):

- PHP isn't optimized anywhere, but at least it's C: https://github.com/php/php-src/blob/1c0e613cf1a24cdc159861e4...

- Python's C implementation is the same: https://github.com/python/cpython/blob/main/Modules/_csv.c

- Java doesn't have a "standard" way at all (https://www.baeldung.com/java-csv-file-array), and OpenCSV seems the usual object-oriented hell (https://sourceforge.net/p/opencsv/source/ci/master/tree/src/...).

- Ruby's CSV is native Ruby: https://github.com/ruby/ruby/blob/bd65757f394255ceeb2c958e87...
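For anyone wondering what "naive byte-by-byte scanning" looks like in practice, here is a minimal sketch in Python. It is not taken from any of the implementations linked above; `split_line` is a made-up name, and it assumes a single line with no embedded newlines, comma delimiters, and doubled quotes as the escape.

```python
def split_line(line: bytes, delim: int = ord(","), quote: int = ord('"')):
    """Split one CSV line by inspecting every byte in turn (hypothetical helper)."""
    fields = []
    current = bytearray()
    in_quotes = False
    i = 0
    n = len(line)
    while i < n:
        b = line[i]
        if in_quotes:
            if b == quote:
                if i + 1 < n and line[i + 1] == quote:
                    # Doubled quote inside a quoted field is a literal quote.
                    current.append(quote)
                    i += 1
                else:
                    in_quotes = False
            else:
                current.append(b)
        elif b == quote:
            in_quotes = True
        elif b == delim:
            # Delimiter outside quotes ends the current field.
            fields.append(bytes(current))
            current.clear()
        else:
            current.append(b)
        i += 1
    fields.append(bytes(current))
    return fields
```

Every byte goes through the same branchy loop, which is the part a SIMD ingester would try to replace by classifying many bytes at once.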

3 comments

__s 1649 days ago

Python's csv imports _csv for core functionality, which is C: https://github.com/python/cpython/blob/main/Modules/_csv.c
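A quick way to check this yourself, assuming you are running CPython (other interpreters may ship a different `_csv`):

```python
import csv
import _csv  # C extension module in CPython

# csv.py re-exports the reader and writer from the C module,
# so these should be the very same objects.
same_reader = csv.reader is _csv.reader
same_writer = csv.writer is _csv.writer
print(same_reader, same_writer)
```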

mschuster91 1649 days ago

Thanks! Updated accordingly.

jwandborg 1648 days ago

You should update "all sans PHP" to reflect the update.

clscott 1649 days ago

Perl's best-known library, Text::CSV, has both a pure-Perl and a C implementation.

Here is the C version:

https://github.com/Tux/Text-CSV_XS/blob/master/CSV_XS.xs

nickpeterson 1649 days ago

It’s funny: CSV files are so common, and yet many mainstream languages don’t even attempt to ship a decent parser baked in. I think .NET has 3-4 different ones and, as I recall, they’re all pretty slow.

jwandborg 1648 days ago

There are multiple dialects of CSV. Besides the more or less standard dialect, there are some weird ones that prevent some types of optimization. I remember Apple's "Enterprise Partner Feed" had a dialect I've never seen elsewhere so far: columns were separated by 0x01, rows were separated by 0x02 0x0A.

The row separator being two bytes throws a wrench in most parsers.
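To make the dialect concrete, here is a rough sketch of splitting such a feed in Python. It only uses the separators described above; `parse_feed` is a made-up name, and it assumes there is no quoting or escaping and that the whole input fits in memory, neither of which I can confirm for the real feed.

```python
from typing import List

FIELD_SEP = b"\x01"   # column separator, as described above
ROW_SEP = b"\x02\n"   # two-byte row separator: 0x02 0x0A

def parse_feed(data: bytes) -> List[List[bytes]]:
    """Split a feed into rows and columns (hypothetical helper, no escaping)."""
    rows = data.split(ROW_SEP)
    # A trailing row separator leaves one empty chunk at the end; drop it.
    if rows and rows[-1] == b"":
        rows.pop()
    return [row.split(FIELD_SEP) for row in rows]

sample = b"id\x01name\x02\n1\x01foo\nbar\x02\n"
parsed = parse_feed(sample)
```

Note that a bare 0x0A inside a field (as in `foo\nbar`) does not end the row, which is exactly what trips up parsers that only look for a single-byte terminator.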

ipdashc 1648 days ago

What a bizarre choice. If they're going to commit to weird ASCII control chars you'd think they could just use 0x1C to 0x1F, which are explicitly intended as delimiters/Separators... sigh. (I've always wondered why more people don't use the various Separators, but I admit human-readability is a big advantage)
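For illustration, a minimal sketch of what using those separators could look like in Python, with 0x1F (Unit Separator) between fields and 0x1E (Record Separator) between rows. `dump` and `load` are made-up helper names, and the sketch assumes no field ever contains either separator.

```python
UNIT_SEP = "\x1f"    # ASCII Unit Separator, between fields
RECORD_SEP = "\x1e"  # ASCII Record Separator, between rows

def dump(rows):
    """Join rows of string fields using the ASCII separators (hypothetical helper)."""
    return RECORD_SEP.join(UNIT_SEP.join(fields) for fields in rows)

def load(text):
    """Inverse of dump for non-empty input; no quoting or escaping needed."""
    if not text:
        return []
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]

rows = [["a,b", 'say "hi"'], ["line\nbreak", ""]]
round_tripped = load(dump(rows))
```

Commas, quotes and newlines pass through untouched because none of them is special here; the price is that the file is no fun to read in a text editor.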

mschuster91 1648 days ago

> The row separator being two bytes throws a wrench in most parsers.

Huh? Anything that ingests Windows-origin files needs to be able to handle \r\n by default.
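As a small example, Python's csv module reads \r\n-terminated rows without any extra configuration, as long as the underlying stream is opened with newline="" so that it does not translate line endings itself:

```python
import csv
import io

# Windows-style input: every row ends with the two bytes \r\n.
windows_text = "a,b\r\n1,2\r\n"

# newline="" hands the raw line endings to the csv reader untouched.
crlf_rows = list(csv.reader(io.StringIO(windows_text, newline="")))
```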
