by djur 3581 days ago
It seems pretty common for languages to start out with a relatively unoptimized CSV parser (if they have one at all) and then get a faster one contributed by the community once there's enough interest. Ruby had that happen with FasterCSV.

The Java comparison here seems inapt, because it doesn't do as much as the other two. It's just a naive "split on commas" implementation that wouldn't handle quoted cells. Really, if Go's CSV reader is only 200% slower than that and 50% slower than Python's optimized C implementation, that's pretty good already.

4 comments

Indeed. Flip it around: Go-lang comes with CSV support in the standard library, whereas Java requires you to get something like the Ostermiller utilities or an Apache library.
ignore this, I somehow missed that the poster specifically mentioned Python 3, which does have an encoding-aware CSV module.

~~It's also unclear which version of Python is being used, the Python 2 csv module is byte-based and encoding-unaware which can lead to unexpected behaviours.~~

Go's CSV package apparently only does UTF-8, and the suggestion in the tracker for speeding it up is to just remove that and work on raw bytes (FFS)

> the suggestion in the tracker for speeding it up is to just remove that and work on raw bytes (FFS)

This is valid, because UTF-8 was designed to make this valid. The UTF-8 encoding of a comma, 0x2C (also the ASCII encoding of a comma), does not appear as a part of any other UTF-8 encodings. Same with the UTF-8 encoding of the double quote, 0x22. So scanning for 0x22 and 0x2C bytes, without stopping to decode other UTF-8 sequences along the way, will produce the correct result for a valid UTF-8 input string. Then you fully decode UTF-8 for the individual fields when needed (and if you're doing a string-compare for some target value that's already UTF-8, you never need to decode UTF-8 for that field at all).
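A minimal sketch of that byte-level scan, for illustration only. `splitOnCommaBytes` and `allValidUTF8` are hypothetical helpers, not anything from encoding/csv, and quoting is ignored entirely:

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// splitOnCommaBytes splits a line on the raw byte 0x2C without decoding
// any UTF-8 along the way. Hypothetical helper; it does not handle quotes.
func splitOnCommaBytes(line []byte) [][]byte {
	return bytes.Split(line, []byte{0x2C})
}

// allValidUTF8 reports whether every field is well-formed UTF-8.
func allValidUTF8(fields [][]byte) bool {
	for _, f := range fields {
		if !utf8.Valid(f) {
			return false
		}
	}
	return true
}

func main() {
	// The fields contain 2-byte and 3-byte sequences; none of them
	// contains the byte 0x2C, so the split cannot land mid-character.
	fields := splitOnCommaBytes([]byte("naïve,日本語,€5"))
	for _, f := range fields {
		fmt.Printf("%q\n", f)
	}
	fmt.Println("all fields valid UTF-8:", allValidUTF8(fields))
}
```

Decoding only happens in `allValidUTF8`, and only because the example wants to check the result; the split itself never looks past single bytes.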

> and if you're doing a string-compare for some target value that's already UTF-8, you never need to decode UTF-8 for that field at all

Is Go's internal representation of the target string UTF-8?

> Is Go's internal representation of the target string UTF-8?

Kinda but kinda not: a Go string is actually an arbitrary bag of bytes, but some APIs (such as unicode/utf8, or `range` for iterating over code points, called runes in Go parlance) assume it's proper UTF-8.
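A small sketch of what "bag of bytes" means in practice. `runesOf` is a made-up helper; the rest is standard library:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesOf decodes s with range, the way Go iterates over code points.
func runesOf(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	valid := "h\u00e9llo"  // é encoded as two bytes
	broken := "h\xffllo"   // 0xFF never appears in well-formed UTF-8

	// Both are perfectly legal Go strings; only one is valid UTF-8.
	fmt.Println(len(valid), utf8.ValidString(valid))
	fmt.Println(len(broken), utf8.ValidString(broken))

	// range substitutes utf8.RuneError (U+FFFD) for the undecodable byte.
	fmt.Printf("%q\n", runesOf(broken))
}
```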

Go's implementation allows the caller to designate any UTF-8 character as the delimiter, not just an ASCII one.
For string comparisons, wouldn't case sensitivity and normalization be an issue in some contexts?
That's cool about utf8 - what downsides are there to not treating utf-8 as raw bytes?
The big things are related to string length not matching byte count. strlen() is O(n) because you have to see how many sequences are actually in the string. More than that, splitting/slicing/indexing a string based on byte offsets doesn't work. For a 100-byte ASCII string, you're guaranteed that you can split it into two 50-byte strings and things will still work: you can output them separately, you can get the total length by adding strlen() on each half, you can find a character by doing strchr() on each half, etc. For a 100-byte valid UTF-8 string, splitting it into two 50-byte strings will possibly get you an invalid string, because a character could be split in half. So strlen() (even a UTF-8-correct strlen()) and strchr() don't compose. Outputting a string in two halves works properly as long as the receiver buffers its input, and is willing to wait to reconstruct a partial character.
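To make the splitting problem concrete, here is a rough Go sketch. The three helpers are hypothetical names for this example; `utf8.RuneStart` and `utf8.ValidString` are the real standard-library calls:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// splitAtByte cuts s at a raw byte offset, ignoring character boundaries.
func splitAtByte(s string, n int) (string, string) {
	return s[:n], s[n:]
}

// splitAtRuneBoundary backs up from n to the nearest preceding rune start,
// so neither half begins or ends inside a multibyte sequence.
func splitAtRuneBoundary(s string, n int) (string, string) {
	for n > 0 && n < len(s) && !utf8.RuneStart(s[n]) {
		n--
	}
	return s[:n], s[n:]
}

// bothValid reports whether both halves are well-formed UTF-8.
func bothValid(a, b string) bool {
	return utf8.ValidString(a) && utf8.ValidString(b)
}

func main() {
	s := "日本語" // three characters, three bytes each

	a, b := splitAtByte(s, 4) // offset 4 is inside the second character
	fmt.Printf("%q %q valid=%v\n", a, b, bothValid(a, b))

	c, d := splitAtRuneBoundary(s, 4)
	fmt.Printf("%q %q valid=%v\n", c, d, bothValid(c, d))
}
```

The second split gives up on the requested offset and backs off to a boundary, which is exactly the extra work a fixed-width encoding never needs.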

A related problem is that in older UNIX terminals, pressing backspace would delete one byte, not one character. Newer UNIX kernels have code in the terminal implementation to decode UTF-8 enough to backspace an entire character.

To clarify, getting the length of a UTF-8 string in Go is O(1); it's computed and stored in the string header at creation.
To clarify even more: that length is the number of bytes (UTF-8 code units) in the string. It doesn't correspond to the number of characters (which one may consider to be either Unicode code points or, more correctly, Unicode grapheme clusters).

If you want to count the number of codepoints in a string (called "rune" in Go), then you need to do so explicitly: https://golang.org/pkg/unicode/utf8/#RuneCountInString
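For example (a quick sketch; `byteAndRuneCount` is just a wrapper invented for this comment):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount returns the O(1) byte length and the O(n) code point count.
func byteAndRuneCount(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	samples := []string{
		"cafe",
		"caf\u00e9", // precomposed é: one code point, two bytes
		"e\u0301",   // "e" plus combining acute: one grapheme cluster, two code points
	}
	for _, s := range samples {
		b, r := byteAndRuneCount(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
}
```

Note that neither number is the grapheme cluster count for the last sample; the standard library has no function for that.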

How do you mean? The first comment specifies it’s Python 3.
Oh dear, I'm not sure how I managed to miss that.
It says "Python3 equivalent".
It also seems to support UTF-8, which is perhaps common now. (CSV doesn't tell you what encoding it is; in the old days it was 95% ASCII.)
How would a CSV parser even break UTF-8 encoding (by accident)? All CSV control characters (comma, double quote and newline) map to the same code points in ASCII and UTF-8, and no non-ASCII UTF-8 character uses any ASCII byte in its encoding.
I've seen one break because of the byte-order marker that sometimes gets added to UTF-8. I don't remember the details of why that broke it, just remember that it worked fine on everything except that.
UTF-8 doesn't have a "byte order", so I thought it wouldn't have a byte-order mark. But apparently some software adds one anyway. https://en.wikipedia.org/wiki/UTF-8#Byte_order_mark
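One plausible way a BOM causes trouble, sketched in Go. This assumes, as far as I know, that encoding/csv does not strip the mark itself, so it ends up glued to the first field; `firstField` and `stripBOM` are hypothetical helpers:

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// utf8BOM is U+FEFF encoded as UTF-8.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM removes a leading UTF-8 byte-order mark, if there is one.
func stripBOM(data []byte) []byte {
	return bytes.TrimPrefix(data, utf8BOM)
}

// firstField parses data as CSV and returns the first field of the first record.
func firstField(data []byte) (string, error) {
	rec, err := csv.NewReader(bytes.NewReader(data)).Read()
	if err != nil {
		return "", err
	}
	return rec[0], nil
}

func main() {
	data := []byte("\xef\xbb\xbfname,city\n")

	raw, err := firstField(data)
	fmt.Printf("%q %v\n", raw, err) // the header no longer compares equal to "name"

	clean, err := firstField(stripBOM(data))
	fmt.Printf("%q %v\n", clean, err)
}
```

A header lookup by name would miss the first column in the unstripped case, which matches the "works on everything except that" symptom, though that is a guess about the original failure.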
Go lets you use any code point as the delimiter.
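A short sketch of that with encoding/csv. `Reader.Comma` is the real field; `readWithDelimiter` is a wrapper made up for this example:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// readWithDelimiter parses input using an arbitrary rune as the field delimiter.
func readWithDelimiter(input string, delim rune) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = delim
	return r.ReadAll()
}

func main() {
	// U+00A7 (section sign) takes two bytes in UTF-8.
	// The quoted middle field contains the delimiter itself.
	recs, err := readWithDelimiter("a§\"b§c\"§d\n", '§')
	fmt.Printf("%q %v\n", recs, err)
}
```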
> CSV doesn't tell you what encoding it is; in the old days it was 95% ASCII

Or some non-ascii codepage you're not told about and have to guess (e.g. Excel generates CSV in CP1250 by default, with an option to export UTF-16)

All 3 support UTF8. The difference with the Go implementation is the mostly useless capability to have a utf-8 multibyte item as the delimiter.
1. there's nothing useless about it

2. the Python 3 CSV library supports arbitrary codepoints as delimiter, quote character and escape character (if applicable)

>1. there's nothing useless about it

Opportunity cost. It slows down the parsing for support of something that nobody has ever seen in the wild (not to mention it doesn't even match the name of the format, but let's get past that since we already use ; | and others).

Plus I can't even imagine a use case that would make it a good idea to use that over a simpler delimiter, or even the special purpose ASCII delimiter character. Can you?

> 1. there's nothing useless about it

Have you ever seen a "C"SV with a multibyte sequence as a delimiter? I haven't.

Even if such a thing exists, the feature is of negative utility if it slows down CSV parsing for everyone else. If you must, write two implementations, and use the slow path if your delimiter is multibyte.
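Roughly what the two-path idea could look like. This is a sketch under simplifying assumptions, not how encoding/csv is written: all three function names are invented here, quoting is ignored, and the delimiter is assumed not to be U+FFFD:

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// splitFields picks a path based on the delimiter's encoded size.
func splitFields(line []byte, delim rune) [][]byte {
	if delim < utf8.RuneSelf {
		return splitByte(line, byte(delim)) // fast path: single-byte delimiter
	}
	return splitRune(line, delim) // slow path: multibyte delimiter
}

// splitByte scans for one byte value and never decodes UTF-8.
func splitByte(line []byte, d byte) [][]byte {
	var fields [][]byte
	for {
		i := bytes.IndexByte(line, d)
		if i < 0 {
			return append(fields, line)
		}
		fields = append(fields, line[:i])
		line = line[i+1:]
	}
}

// splitRune decodes every rune in order to compare it with the delimiter.
func splitRune(line []byte, delim rune) [][]byte {
	var fields [][]byte
	start := 0
	for i := 0; i < len(line); {
		r, size := utf8.DecodeRune(line[i:])
		if r == delim {
			fields = append(fields, line[start:i])
			start = i + size
		}
		i += size
	}
	return append(fields, line[start:])
}

func main() {
	fmt.Printf("%q\n", splitFields([]byte("a,日本,c"), ','))
	fmt.Printf("%q\n", splitFields([]byte("a§日本§c"), '§'))
}
```

Everyone with a plain comma only pays for `bytes.IndexByte`; the per-rune decode cost lands only on callers who ask for a multibyte delimiter.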

That's precisely my suggestion https://github.com/golang/go/issues/16791#issuecomment-24209...

Moving from runes to bytes in reading gives us a nice speedup. It's not quite enough to eliminate the gap, but it's a start. The rest is likely all the memory copies: the data is first read into a buffer, then copied byte by byte into a slice, and only then converted into a string, which is another copy, because strings can't be built on pre-existing byte slices (not through the public API, that is).
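The last copy in that chain is easy to observe. A tiny sketch (`fieldFromBuffer` is a hypothetical stand-in for the reader's final step, not the actual code):

```go
package main

import "fmt"

// fieldFromBuffer converts part of a read buffer into a string.
// The conversion copies the bytes, so the result is independent of buf.
func fieldFromBuffer(buf []byte, start, end int) string {
	return string(buf[start:end])
}

func main() {
	buf := []byte("alpha,beta")
	field := fieldFromBuffer(buf, 0, 5)

	// Overwrite the buffer, as a reader would when it refills it.
	buf[0] = 'X'

	// The string is unaffected, which is only possible because it
	// holds its own copy of the bytes.
	fmt.Println(field, string(buf))
}
```

That independence is what makes strings safe to hand out, and it is also why every field costs an allocation plus a copy.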

I feel you're trying to defend Go without much objectivity. Such a performance gap needs to be addressed properly instead of saying it's pretty good already.

It doesn't sound right if Go takes 5 hours to finish a CSV parsing job while Python takes 2.5 hours.

Well, you are free to address it if it bothers you. The typical response to this of _I don't want to_ or _I shouldn't have to_ seems a bit naive when working with open source projects. The issue has only been on the tracker for two weeks, and it has the 'HelpWanted' tag, so it's not like they're opposed to improving the speed here.

If you're going to throw out specific numbers, you should probably get them, or at least the ratios, correct.

Numbers from the tracker are:

Go: avg 1.489 secs
Python: avg 0.933 secs

If you'd like to test this on a really large dataset to come up with how long it would take for Python to perform the same operation when Go requires 5 hours, that might be a bit more useful. If we just look at the available data, then the _extrapolation_ for Python would not be 2.5 hours. There's still a gap, but there's no need to exaggerate.
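For what it's worth, the linear extrapolation from those two averages works out like this (a back-of-the-envelope sketch; `extrapolatePython` is a made-up helper, and it assumes the ratio holds at scale, which nobody here has measured):

```go
package main

import "fmt"

// extrapolatePython scales a Go runtime by the Python/Go ratio of the
// measured averages. It assumes the ratio stays constant for larger inputs.
func extrapolatePython(goHours, goSecs, pySecs float64) float64 {
	return goHours * pySecs / goSecs
}

func main() {
	const goSecs, pySecs = 1.489, 0.933 // averages quoted from the tracker

	fmt.Printf("Go/Python ratio: %.2f\n", goSecs/pySecs)
	fmt.Printf("Python estimate for a 5-hour Go job: %.2f hours\n",
		extrapolatePython(5, goSecs, pySecs))
}
```

That comes to a ratio of about 1.6 and a bit over 3.1 hours for Python, not 2.5.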