Golang – encoding/csv: Reading is slow | HN Mirror


	Golang – encoding/csv: Reading is slow (github.com)
	105 points by chinmaymk 3581 days ago

13 comments

djur 3581 days ago

It seems pretty common for languages to start out with a relatively unoptimized CSV parser (if they have one at all) and then get a faster one contributed by the community once there's enough interest. Ruby had that happen with FasterCSV.

The Java comparison here seems inapt, because it doesn't do as much as the other two. It's just a naive "split on commas" implementation that wouldn't handle quoted cells. Really, if Go's CSV reader is only 200% slower than that and 50% slower than Python's optimized C implementation, that's pretty good already.

Roboprog 3580 days ago

Indeed. Flip it around: Go-lang comes with CSV support in the standard library, whereas Java requires you to get something like the Ostermiller utilities or an Apache library.

masklinn 3581 days ago

ignore this, I somehow missed that the poster specifically mentioned Python 3, which does have an encoding-aware CSV module.

~~It's also unclear which version of Python is being used, the Python 2 csv module is byte-based and encoding-unaware which can lead to unexpected behaviours.~~

Go's CSV package apparently only does UTF-8, and the suggestion for speeding it up in the tracker is to just remove that and work on raw bytes (FFS)

geofft 3581 days ago

> suggestions for speeding it up in the tracker is to just remove that and work on raw bytes (FFS)

This is valid, because UTF-8 was designed to make this valid. The UTF-8 encoding of a comma, 0x2C (also the ASCII encoding of a comma), does not appear as a part of any other UTF-8 encodings. Same with the UTF-8 encoding of the double quote, 0x22. So scanning for 0x22 and 0x2C bytes, without stopping to decode other UTF-8 sequences along the way, will produce the correct result for a valid UTF-8 input string. Then you fully decode UTF-8 for the individual fields when needed (and if you're doing a string-compare for some target value that's already UTF-8, you never need to decode UTF-8 for that field at all).
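A minimal Go sketch of the byte-scanning idea above. `splitOnComma` is a hypothetical helper, not anything from `encoding/csv`, and it ignores quoting entirely; it only illustrates that looking for the byte 0x2C cannot cut a multi-byte character in valid UTF-8 input.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitOnComma splits a line at every 0x2C byte without decoding UTF-8.
// It deliberately ignores quoting, so it is not a full CSV parser.
func splitOnComma(line []byte) [][]byte {
	var fields [][]byte
	for {
		i := bytes.IndexByte(line, ',')
		if i < 0 {
			return append(fields, line)
		}
		fields = append(fields, line[:i])
		line = line[i+1:]
	}
}

func main() {
	// Every byte of a multi-byte UTF-8 sequence is >= 0x80, so none of
	// them can be mistaken for a comma.
	line := []byte("naïve,日本語,€5")
	for _, f := range splitOnComma(line) {
		fmt.Printf("%q\n", f)
	}
}
```

The fields come back as byte slices; they only need decoding if the caller actually wants to work with individual code points.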

paulddraper 3581 days ago

> and if you're doing a string-compare for some target value that's already UTF-8, you never need to decode UTF-8 for that field at all

Is Go's internal representation of the target string UTF-8?

masklinn 3581 days ago

> Is Go's internal representation of the target string UTF-8?

Kinda but kinda not. A Go string is actually an arbitrary bag of bytes, but some APIs (such as unicode/utf8 or `range`, which iterates over code points, called runes in Go parlance) assume it's proper UTF-8.
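To illustrate the "bag of bytes" point, here is a small sketch; `describe` is a made-up helper name. It shows that a Go string can hold a byte that is never valid in UTF-8, and that `range` substitutes U+FFFD for it rather than failing.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports the byte length, UTF-8 validity and decoded runes of s.
func describe(s string) (nBytes int, valid bool, runes []rune) {
	for _, r := range s { // range decodes UTF-8; an invalid byte yields U+FFFD
		runes = append(runes, r)
	}
	return len(s), utf8.ValidString(s), runes
}

func main() {
	// 0xFF can never appear in valid UTF-8, yet the string holds it happily.
	n, ok, rs := describe("a\xffb")
	fmt.Println(n, ok, rs)
}
```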

weberc2 3580 days ago

Go's implementation allows the caller to designate any UTF-8 character as the delimiter, not just an ASCII one.

mook 3580 days ago

For string comparisons, wouldn't case sensitivity and normalization be an issue in some contexts?

snissn 3581 days ago

That's cool about utf8 - what downsides are there to treating utf-8 as raw bytes?

geofft 3581 days ago

The big things are related to string length not matching byte count. strlen() is O(n) because you have to see how many sequences are actually in the string. More than that, splitting/slicing/indexing a string based on byte offsets doesn't work. For a 100-byte ASCII string, you're guaranteed that you can split it into two 50-byte strings and things will still work: you can output them separately, you can get the total length by adding strlen() on each half, you can find a character by doing strchr() on each half, etc. For a 100-byte valid UTF-8 string, splitting it into two 50-byte strings will possibly get you an invalid string, because a character could be split in half. So strlen() (even a UTF-8-correct strlen()) and strchr() don't compose. Outputting a string in two halves works properly as long as the receiver buffers its input, and is willing to wait to reconstruct a partial character.

A related problem is that in older UNIX terminals, pressing backspace would delete one byte, not one character. Newer UNIX kernels have code in the terminal implementation to decode UTF-8 enough to backspace an entire character.
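The splitting problem described above can be sketched in Go as follows. `halves` is a hypothetical helper that cuts a string at its byte midpoint and checks each half with `utf8.ValidString`; the sample strings are invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// halves cuts s at the byte midpoint and reports whether each half is
// still valid UTF-8.
func halves(s string) (leftOK, rightOK bool) {
	mid := len(s) / 2
	return utf8.ValidString(s[:mid]), utf8.ValidString(s[mid:])
}

func main() {
	// 100 bytes of ASCII: any byte offset is a character boundary.
	fmt.Println(halves(strings.Repeat("a", 100)))

	// Also 100 bytes: 'a', then 49 two-byte 'é', then 'a'.
	// Byte 50 falls in the middle of an 'é', so both halves are damaged.
	fmt.Println(halves("a" + strings.Repeat("é", 49) + "a"))
}
```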

weberc2 3580 days ago

To clarify, getting the length of a UTF-8 string in Go is O(1); it's computed and stored in the string header at creation.

deno 3581 days ago

How do you mean? The first comment specifies it’s Python 3.

masklinn 3581 days ago

Oh dear, I'm not sure how I managed to miss that.

stesch 3581 days ago

It says "Python3 equivalent".

justincormack 3581 days ago

It also seems to support UTF8, which is perhaps common now. (CSV doesn't tell you what encoding it is; in the old days it was 95% ASCII.)

wongarsu 3581 days ago

How would a CSV parser even break UTF8 encoding (by accident)? All CSV control characters (comma, double quote and newline) map to the same codepoints in ASCII and UTF8, and no non-ASCII UTF8 character uses any ASCII codepoint in its encoding.

ktRolster 3581 days ago

I've seen one break because of the byte-order marker that sometimes gets added to UTF-8. I don't remember the details of why that broke it, just remember that it worked fine on everything except that.

sp332 3580 days ago

UTF-8 doesn't have a "byte order", so I thought it wouldn't have a byte-order mark. But apparently some software adds one anyway. https://en.wikipedia.org/wiki/UTF-8#Byte_order_mark
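One way the BOM can cause trouble, sketched in Go under the assumption that the reader does not strip it (which, as far as I know, is how `encoding/csv` behaves): the three BOM bytes end up glued to the first field, so a header lookup for `id` would miss. `firstField` and `stripBOM` are hypothetical helpers.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM removes a leading UTF-8 byte-order mark, if present.
func stripBOM(data []byte) []byte {
	return bytes.TrimPrefix(data, utf8BOM)
}

// firstField parses data as CSV and returns the first field of the first record.
func firstField(data []byte) (string, error) {
	rec, err := csv.NewReader(bytes.NewReader(data)).Read()
	if err != nil {
		return "", err
	}
	return rec[0], nil
}

func main() {
	raw := append(append([]byte{}, utf8BOM...), "id,name\n1,Ann\n"...)

	withBOM, _ := firstField(raw)
	clean, _ := firstField(stripBOM(raw))

	// %q makes the otherwise invisible U+FEFF show up as an escape.
	fmt.Printf("%q %q\n", withBOM, clean)
}
```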

weberc2 3580 days ago

Go lets you use any code point as the delimiter.
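For reference, the delimiter is the `Comma` field on `csv.Reader`. A short sketch with a two-byte delimiter; `parseWith` and the sample data are made up for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseWith reads all records from input using delim as the field separator.
func parseWith(input string, delim rune) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = delim // must be a valid rune, and not \r, \n or the quote character
	return r.ReadAll()
}

func main() {
	// '¦' (U+00A6) takes two bytes in UTF-8.
	recs, err := parseWith("a¦b¦c\n1¦2¦3\n", '¦')
	fmt.Println(recs, err)
}
```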

masklinn 3581 days ago

> CSV doesn't tell you what encoding it is; in the old days it was 95% ASCII

Or some non-ascii codepage you're not told about and have to guess (e.g. Excel generates CSV in CP1250 by default, with an option to export UTF-16)

coldtea 3581 days ago

All 3 support UTF8. The difference with the Go implementation is the mostly useless capability to have a utf-8 multibyte item as the delimiter.

masklinn 3581 days ago

1. there's nothing useless about it

2. the Python 3 CSV library supports arbitrary codepoints as delimiter, quote character and escape character (if applicable)

coldtea 3580 days ago

>1. there's nothing useless about it

Opportunity cost. It slows down the parsing for support of something that nobody has ever seen in the wild (not to mention it doesn't even match the name of the format, but let's get past that since we already use ; | and others).

Plus I can't even imagine a use case that would make it a good idea to use that over a simpler delimiter, or even the special purpose ASCII delimiter character. Can you?

geofft 3580 days ago

> 1. there's nothing useless about it

Have you ever seen a "C"SV with a multibyte sequence as a delimiter? I haven't.

Even if such a thing exists, the feature is of negative utility if it slows down CSV parsing for everyone else. If you must, write two implementations, and use the slow path if your delimiter is multibyte.
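A rough Go sketch of the "two paths" idea, not the approach actually taken in the tracker: pick a plain byte scan when the delimiter fits in one byte and fall back to a rune search otherwise. `nextDelim` and `splitFields` are hypothetical names, and quoting is left out to keep the example short.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// nextDelim returns the index of the first delimiter in line and the
// delimiter's width in bytes. The index is -1 if the delimiter is absent.
func nextDelim(line []byte, delim rune) (idx, width int) {
	if delim < utf8.RuneSelf {
		// Fast path: a single-byte delimiter needs no UTF-8 decoding.
		return bytes.IndexByte(line, byte(delim)), 1
	}
	// Slow path: search for the encoded form of a multi-byte delimiter.
	return bytes.IndexRune(line, delim), utf8.RuneLen(delim)
}

// splitFields splits line on delim; quoting is intentionally not handled.
func splitFields(line []byte, delim rune) []string {
	var out []string
	for {
		i, w := nextDelim(line, delim)
		if i < 0 {
			return append(out, string(line))
		}
		out = append(out, string(line[:i]))
		line = line[i+w:]
	}
}

func main() {
	fmt.Printf("%q\n", splitFields([]byte("a,b,c"), ','))
	fmt.Printf("%q\n", splitFields([]byte("x¦y¦z"), '¦'))
}
```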

drej 3580 days ago

That's precisely my suggestion https://github.com/golang/go/issues/16791#issuecomment-24209...

Moving from runes to bytes in reading gives us a nice speedup - not quite to eliminate the gap, but it's a start. The rest is likely all the memory copies - once the data is read in a buffer, then copied byte by byte into a slice and only then converted into a string, which is another copy, because strings can't be based on pre-existing byte slices (not in the public API that is).

halayli 3581 days ago

I feel you're trying to defend go without much objectivity. Such performance gap needs to be addressed properly instead of saying it's pretty good already.

It doesn't sound right if Go takes 5 hours to finish a CSV parsing job while Python takes 2.5 hours.

heroprotagonist 3580 days ago

Well, you are free to address it if it bothers you. The typical response to this of _I don't want to_ or _I shouldn't have to_ seems a bit naive when working with open source projects. The issue has only been on the tracker for two weeks, and it has the 'HelpWanted' tag, so it's not like they're opposed to improving the speed here.

If you're going to throw out specific numbers, you should probably get them, or at least the ratios, correct.

Numbers from the tracker are:

Go: avg 1.489 secs

Python: avg 0.933 secs

If you'd like to test this on a really large dataset to come up with how long it would take for Python to perform the same operation when Go requires 5 hours, that might be a bit more useful. If we just look at the available data, then the _extrapolation_ for Python would not be 2.5 hours. There's still a gap, but there's no need to exaggerate.
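For what it's worth, the extrapolation can be worked out from the two averages quoted above. This assumes run time scales linearly with input size, which a real large-file test might not bear out; `extrapolate` is just a helper name for the arithmetic.

```go
package main

import "fmt"

// extrapolate scales a Go run time by the ratio of the measured averages,
// assuming both implementations scale linearly with input size.
func extrapolate(goHours, goAvgSecs, pyAvgSecs float64) float64 {
	return goHours * pyAvgSecs / goAvgSecs
}

func main() {
	// Averages quoted from the tracker: Go 1.489 s, Python 0.933 s.
	fmt.Printf("%.2f hours\n", extrapolate(5, 1.489, 0.933)) // prints "3.13 hours"
}
```

So on these numbers a 5-hour Go job would correspond to a bit over 3 hours in Python, not 2.5.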

zephyrfalcon 3581 days ago

Python's csv module uses an internal module _csv which is written in C. So I'm not sure it's all that surprising that a Go implementation is a bit slower.

aartur 3581 days ago

I ran the benchmark using PyPy (which doesn't have this C extension) and got a result about 20% slower compared to CPython (i.e. still faster than Go).

EDIT: I also did a funny thing and replaced CPython's C _csv.so extension with the pure Python version, _csv.py, from PyPy. It ran about 80 (eighty) times slower. It shows what wonders a JIT does (at least for some code).

jjawssd 3581 days ago

Would be a great experiment to Cythonize PyPy's _csv.so

chrisseaton 3580 days ago

PyPy doesn't have a _csv.so - that was the main point of the comment you replied to.

masklinn 3580 days ago

That sounds completely worthless, PyPy doesn't need a Cython version, and its library is a Python version of CPython's native csv module.

chrisper 3581 days ago

You should contribute this knowledge to the GitHub issue.

jlarocco 3581 days ago

A good idea, but I think it's common knowledge.

Lots of modules in the stdlib are written in C.

chrisper 3581 days ago

Well, it was news to me. Maybe because I only use Python casually.

baq 3581 days ago

ultimately it's of academic interest - the end user doesn't care what precisely is going on under the hood.

chrisper 3581 days ago

I thought it was rather part of the solution/answer to the issue. It is expected to be slower, so it's not necessarily broken. Kind of an "it's a feature, not a bug" situation.

wongarsu 3581 days ago

If one language is 50% slower in file reading and a simple state machine that iterates over an array, that's not a feature.

Either the implementation or the compiler is lacking some optimization.

weberc2 3580 days ago

The implementation is suboptimal, and Go has some overhead compared to C.

rdtsc 3580 days ago

But not sure if it matters. Go is free to use a C module to load csv files as well. It can use assembly or other tricks as well perhaps.

weberc2 3580 days ago

There's a performance cost to calling between c and Go, and sharing memory between the two makes for hard to predict GC behavior. I doubt it would be faster than a pure Go implementation.

Cyph0n 3581 days ago

I don't think the answer is that simple. Perhaps there's a key optimization that the Go team didn't consider.

0xFFC 3581 days ago

How about Java? It is quite funny that the Java version is much faster than the Python version even though the Python version uses C. Something fishy is going on.

jerf 3581 days ago

There's nothing mysterious about the Java version... that's not a CSV parser. The other two things are truly CSV parsers. (Inasmuch as there is such a thing for such an ill-defined format. (No, the RFC is not determinative.))

It's easy to be faster if you do fundamentally less. Not necessarily wrong, depending on your task, but it's not comparable.

nradov 3581 days ago

The Java code is defective. It's not checking for double quotes. The CSV format allows for commas inside column values by surrounding with double quotes, and then you can also put double quotes within such values by escaping them as double double quotes. Fix those defects and the Java code will be a little slower.

With modern JVMs, Java can occasionally actually be faster than native compiled languages due to dynamic optimization at runtime.
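To make the quoting rules mentioned above concrete, here is a small Go comparison of a naive comma split with `encoding/csv` on the same line. The sample record and the helper names `naive` and `quoted` are invented for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// naive mimics the "split on commas" approach.
func naive(line string) []string {
	return strings.Split(line, ",")
}

// quoted parses one record with encoding/csv, which honours double quotes
// and the doubled-quote escape.
func quoted(line string) ([]string, error) {
	return csv.NewReader(strings.NewReader(line)).Read()
}

func main() {
	line := `1,"Smith, Ann","She said ""hi"""`

	// The naive split cuts the quoted name in two and keeps the quote marks.
	fmt.Printf("%d fields: %q\n", len(naive(line)), naive(line))

	rec, err := quoted(line)
	fmt.Printf("%d fields: %q (err: %v)\n", len(rec), rec, err)
}
```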

dalke 3581 days ago

A few minutes ago (and after your comment) one of the commenters of that issue tested against Apache Commons CSV and found that Java was 1.9x faster than Go, rather than the original 3x: https://github.com/golang/go/issues/16791#issuecomment-24456...

merb 3580 days ago

actually the java one is still amazing since it's a cold jvm. when it would be a big file I would think that java is far ahead of both. with an aggressive jit. maybe pypy is faster than all 3 :D

jsiep 3581 days ago

Besides the point that the Java example is not a good one: the JVM is actually a pretty mean piece of software with a lot of optimisation. So while Go could in theory produce faster code than Java, I doubt the Go compiler is clever enough to produce faster code than the JVM in a lot of scenarios (at the moment).

jakub_h 3581 days ago

But Java conceptually has a lot of drawbacks that require the JVM to have screaming performance to compensate for. Almost everything being a "headered" object being probably the worst offender. Even a slightly worse Go compiler is probably well compensated-for by denser data structure layout in the operating memory.

pjmlp 3580 days ago

Only true until value types get productified and there are already prototype versions to play with.

Also depending on which JVM SDK is being used (Oracle Hotspot, Oracle Graal, IBM J9, HP, PTG, JET,...), the quality of escape analysis differs but it all boils down to turning those headered objects into plain structs, if possible stack allocated.

eternalban 3580 days ago

> Java conceptually has a lot of drawbacks

In what sense is an articulated object a "conceptual drawback"? It is a richer object model and SMI and friends had the engineering chops to make it highly performant.

weberc2 3580 days ago

What do you mean by "a richer object model"? The drawbacks are principally that the programmer has less control over allocs and layout.

jakub_h 3576 days ago

If you want an actual rich object model, why aren't you using CLOS instead? Java's objects cost you a lot of potential performance with few of the benefits.

__s 3581 days ago

The java source is using str.split(',') which is completely different than Python's offering of dialect/delimiter/etc https://github.com/python/cpython/blob/master/Lib/csv.py#L24

masklinn 3581 days ago

The Java version is just reading lines and splitting them on commas, it's not actually a CSV parser.

justin66 3581 days ago

It's as if there's more to software quality than the choice of tools.

Anderkent 3581 days ago

The java one doesn't do as much as the python one: `line.split(',')` doesn't handle quoted commas, escape characters, different csv separators, etc.

andrepd 3581 days ago

Yeah, calling from python introduces a layer of inefficiency.

paulddraper 3581 days ago

The Java code is not a CSV parser.

I added the results of using Apache Commons CSV to the GitHub thread.

After using that, Python was actually by far the fastest. Hooray for performance sensitive code in C :)

paulddraper 3580 days ago

FYI, I later went back and ran PyPy, which implements csv in pure Python. Even including start-up time, it was nearly at CPython speed.

endymi0n 3581 days ago

On a related note, also the Go stdlib regex package is pretty naive and imperformant compared to a full blown and modern backtracking PCRE implementation (at 1/10 the LOC and complexity) - same thing goes for the reflection based JSON package (which is still kinda "fast enough").

The focus wasn't so much on performance but on initial completeness, good interface, versatility, clarity and simplicity - with faster or more specialized implementations left to the community.

There might be different opinions about that, but I personally like the approach of having a solid and ordered programming pocket knife - that also doesn't replace a Katana for cutting.

dsymonds 3580 days ago

The standard regexp package, unlike PCRE, is actually a proper regular expression parser/matcher. Anything doing backtracking is at risk of exponential blowup and isn't safe.

https://swtch.com/~rsc/regexp/regexp1.html
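A quick way to see the difference, sketched in Go: the nested-quantifier pattern below is a commonly cited worst case for backtracking engines, which can take exponential time on inputs that almost match. Go's `regexp` package documents a linear-time guarantee, so the call should return promptly. The pattern and the input length are illustrative choices, not taken from the linked article.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nested is a pattern with nested quantifiers, a typical trigger for
// catastrophic backtracking in engines that backtrack.
var nested = regexp.MustCompile(`^(a+)+$`)

// matches reports whether n 'a's followed by a single 'b' match the pattern.
// The trailing 'b' guarantees a mismatch, which is the expensive case for
// a backtracking matcher.
func matches(n int) bool {
	return nested.MatchString(strings.Repeat("a", n) + "b")
}

func main() {
	fmt.Println(matches(64))
}
```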

endymi0n 3580 days ago

which is exactly what I meant with less specialized but safe :)

jonlawlor 3580 days ago

It is also often much faster than pcre.

paulddraper 3580 days ago

On differing opinions: I prefer having a Katana, an automatic shotgun, and a spy drone, and leaving the pocket knife at home.

In other words....C/C++ :) I just try not to blow off my leg.

tmaly 3581 days ago

I wrote my own in Go that is blazing fast using bytes. I know the data is ascii so I was able to use that to my advantage.

piinbinary 3581 days ago

I did the same [0]. It runs almost as fast as the Java implementation.

I'd be interested to see how yours works if you are willing to share it.

Edit: Plus one that is ~2x faster than Java by avoiding allocations [1].

[0] https://gist.github.com/jmikkola/6ac96ad6d6f66e772c33ec41ed2...

[1] https://gist.github.com/jmikkola/7ded8392226b7659c881f5540be...

sorokod 3580 days ago

Note that internally Java represents strings as utf-16

grandinj 3580 days ago

Depends. The API for strings in java is mostly UTF16 but the latest JVM will magically use UTF8 as its internal representation.

pjmlp 3580 days ago

Only the Oracle one, it doesn't apply to other vendors.

SeanDav 3581 days ago

Effectively one is comparing library performance here and not language performance. Granted, that line can get very blurry indeed, but in this case this says very little about golang the language and far more about a current implementation of one of the golang libraries.

jonlawlor 3580 days ago

It is kind of both; go doesn't allow some approaches in native go code that can make it slower than other languages. (I love go, but that is my experience.)

In this case the choice to use utf-8 everywhere, including in the csv delimiters, is making it slower.

weberc2 3580 days ago

This isn't what makes CSV parsing slow, there's nothing about Go that requires you to deal in UTF8.

lcarlson 3580 days ago

This may be a bit off topic but I've found sqlite to be quite a powerful csv parser. Once imported you can manipulate the data in lots of ways. When you're working with reports that need to get back into some sort of table format, it's very intuitive and easy for SQL people.

WestCoastJustin 3580 days ago

Related to this, was a Reddit thread from a few days ago in /r/golang about improving a csv/reader. See: https://www.reddit.com/r/golang/comments/50ncer/implementing...

deno 3581 days ago

Better than node.js

    import * as csv from 'csv-parse';
    import * as fs from 'fs';
    
    type Line = [string,string,string,string,string,string];
    
    const parser = new csv.Parser({});
    
    parser.on('data', (line: Line) => {
        if (line[0] === '42') {
            console.dir(line);
        }
    });
    
    fs.createReadStream('mock_data.csv').pipe(parser);
    
    $ /usr/bin/time node parse_csv.js
    43.61user 0.85system 0:45.61elapsed 97%CPU (0avgtext+0avgdata 60076maxresident)k
    
    $ node --version
    v6.4.0

Edit: Using fast-csv

    24.28user 0.20system 0:24.58elapsed 99%CPU (0avgtext+0avgdata 91780maxresident)k

elmigranto 3580 days ago

Probably has to do with using `try/catch`, which is not optimized by V8. A different parser is 10 times faster on my machine.

https://www.npmjs.com/package/csv-parser

Edit: `fast-csv` seems to be using a lot of `RegExp`s on each iteration which can't be that fast compared to csv-parser which seems to simply go over each symbol (state machine?).

deno 3580 days ago

    4.95user 0.19system 0:05.22elapsed 98%CPU (0avgtext+0avgdata 29704maxresident)k

A lot better but that’s still 5× slower than Python.

masklinn 3580 days ago

csv-parse is hardly the only CSV parser for node, and it is by far the slowest: https://github.com/phihag/csv-speedtest (csv2json depends on csv-parse, so it's unsurprising that it's even slower)

deno 3580 days ago

I chose the most popular one on npm because Go and Python are using stdlib.

elmigranto 3580 days ago

But you still wrote "better than node.js" and not "better than the most popular npm module" (and those aren't always of great quality or performance-oriented).

deno 3580 days ago

Yup[1]. Sorry!

[1] https://meta.wikimedia.org/wiki/Cunningham%27s_Law

deno 3580 days ago

The fastest streaming example on that list (csv-parser) is still 5× slower than Python.

masklinn 3580 days ago

Which I'd assume has to do with the overhead of dispatching a ton of asynchronous events for relatively little parsed data rather than the intrinsic speed of node, the fastest synchronous parsers of the list are about on-par with Python.

weberc2 3580 days ago

You don't need to drop quote handling in order to stream, but handling quotes is a pain.

twotwotwo 3580 days ago

As someone notes on the bug, if you were rolling your own, there are some other things you could do--return a [][]byte that's a pointer to its internal buffer, only usable until the next row is read.

Making a version of encoding/csv that retains most of its features (custom delimiters, handling backslashes and quoting and \r) but streams like that would be a fun open source project for someone who likes Making Things Go Fast.
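A bare-bones sketch of that streaming idea, under heavy assumptions: comma delimiter only, no quoting, and lines short enough for `bufio.Scanner`'s default buffer. `rowReader` and `countFirst` are hypothetical names. The returned fields point into the scanner's buffer, so they are only usable until the next row is read.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// rowReader yields each line as fields that alias the scanner's buffer.
// The slices are only valid until the next call to Next.
type rowReader struct {
	sc     *bufio.Scanner
	fields [][]byte // reused between rows to avoid per-row allocations
}

func newRowReader(r io.Reader) *rowReader {
	return &rowReader{sc: bufio.NewScanner(r)}
}

// Next returns the fields of the next row, or nil at end of input.
func (rr *rowReader) Next() [][]byte {
	if !rr.sc.Scan() {
		return nil
	}
	line := rr.sc.Bytes() // no copy: points into the scanner's buffer
	rr.fields = rr.fields[:0]
	for {
		i := bytes.IndexByte(line, ',')
		if i < 0 {
			rr.fields = append(rr.fields, line)
			return rr.fields
		}
		rr.fields = append(rr.fields, line[:i])
		line = line[i+1:]
	}
}

// countFirst counts rows whose first field equals key.
func countFirst(input, key string) int {
	rr := newRowReader(strings.NewReader(input))
	n := 0
	for row := rr.Next(); row != nil; row = rr.Next() {
		if string(row[0]) == key {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countFirst("1,a\n42,b\n42,c\n", "42"))
}
```

A caller that needs to keep a field past the next `Next` call has to copy it first, which is exactly the trade-off being described.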

peterwaller 3580 days ago

I did this a couple of months ago and got a >5x speedup. It's at the expense of dropping quoting though, so no commas or newlines can be in the input data.

https://github.com/pwaller/usv

weberc2 3580 days ago

That guy here. I'm also interested in rolling a version that only supports standard delimiters so I can forego rune parsing. Rune parsing accounts for about 30% of the processing; not sure how much a bytes implementation could save, but I'm hopeful.

petters 3581 days ago

Any CSV reader should be limited only by disk access, right? I whipped together a C++ program solving this problem and got 0.124 seconds. But that does not do quotations etc.

Roboprog 3580 days ago

Good point. But assume if you run a series of tests on the same file, it's in RAM. (discard the time of the first run or two, assume data source - network, DB, file - makes access time moot)

_ph_ 3580 days ago

I would guess, if you directly translate your program to Go it would not be much slower. On low level code, Go 1.7 gets quite close to GCC. The point is, that doing the quotations right, especially if you allow non-ascii quotes, eats a lot of performance. So the benchmark is less about the languages involved, but rather the exact algorithms used and capabilities offered.

masklinn 3580 days ago

> Any CSV reader should be limited only by disk access, right?

Depends on the speed of your storage subsystem. If you're working from RAM or from a fast PCIe SSD, you'll probably bottleneck in the encoding validation and actual parsing.

burntsushi 3580 days ago

On the CSV Game benchmark, the Go csv reader is around 10x slower than the fastest: https://bitbucket.org/ewanhiggs/csv-game

Roboprog 3580 days ago

So, where's the Perl (+ CPAN...) version for comparison?

I guess "Nobody cares about your dead religion" :-)

Still, it would probably be faster, at the expense of being unreadable.