| HN Mirror



	by zephyrfalcon 3581 days ago
	Python's csv module uses an internal module _csv which is written in C. So I'm not sure it's all that surprising that a Go implementation is a bit slower.

5 comments

aartur 3581 days ago

I ran the benchmark using PyPy (which doesn't have this C extension) and got a result about 20% slower than CPython (i.e. still faster than Go).

EDIT: I also did a funny thing and replaced CPython's C extension _csv.so with the pure Python version, _csv.py, from PyPy. It ran about 80 (eighty) times slower. It shows what wonders a JIT does (at least for some code).

jjawssd 3581 days ago

Would be a great experiment to Cythonize PyPy's _csv.so

chrisseaton 3580 days ago

PyPy doesn't have a _csv.so - that was the main point of the comment you replied to.

masklinn 3581 days ago

That sounds completely worthless: PyPy doesn't need a Cython version, and its library is a Python version of CPython's native csv module.

chrisper 3581 days ago

You should contribute this knowledge to the GitHub issue.

jlarocco 3581 days ago

A good idea, but I think it's common knowledge.

Lots of modules in the stdlib are written in C.

chrisper 3581 days ago

Well, it was news to me. Maybe because I only use Python casually.

baq 3581 days ago

Ultimately it's of academic interest: the end user doesn't care what precisely is going on under the hood.

chrisper 3581 days ago

I thought it was rather part of the answer to the issue. It is expected to be slower, so it's not necessarily broken. Kind of "it's a feature, not a bug."

wongarsu 3581 days ago

If one language is 50% slower at file reading and a simple state machine that iterates over an array, that's not a feature.

Either the implementation or the compiler is lacking some optimization.

weberc2 3580 days ago

The implementation is suboptimal, and Go has some overhead compared to C.

rdtsc 3580 days ago

But I'm not sure it matters. Go is free to use a C module to load CSV files as well. Perhaps it can use assembly or other tricks too.

weberc2 3580 days ago

There's a performance cost to calling between C and Go, and sharing memory between the two makes for hard-to-predict GC behavior. I doubt it would be faster than a pure Go implementation.

Cyph0n 3581 days ago

I don't think the answer is that simple. Perhaps there's a key optimization that the Go team didn't consider.

0xFFC 3581 days ago

How about Java? It is quite funny that the Java version is much faster than the Python version, even though the Python version uses C. Something fishy is going on.

jerf 3581 days ago

There's nothing mysterious about the Java version... that's not a CSV parser. The other two things are truly CSV parsers. (Inasmuch as there is such a thing for such an ill-defined format. (No, the RFC is not determinative.))

It's easy to be faster if you do fundamentally less. Not necessarily wrong, depending on your task, but it's not comparable.

nradov 3581 days ago

The Java code is defective. It's not checking for double quotes. The CSV format allows for commas inside column values by surrounding with double quotes, and then you can also put double quotes within such values by escaping them as double double quotes. Fix those defects and the Java code will be a little slower.
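To make the quoting rules above concrete, here is a small sketch in Python (not the Java code from the issue). The sample line is made up for illustration. It compares a plain comma split with the standard csv module on a field containing a comma and doubled quotes.

```python
import csv
import io

# Made-up sample: second field contains a comma, third contains escaped quotes.
line = 'id,"Smith, John","He said ""hi"""'

# Naive approach: split on every comma, ignoring quotes entirely.
naive = line.split(",")

# csv.reader honours quoted fields and collapses doubled quotes.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # four pieces, with stray quote characters left in
print(parsed)  # three fields, quotes resolved
```

The naive split returns four pieces with the quote characters still embedded, while csv.reader returns the three intended fields. That extra bookkeeping is the work a split-only benchmark skips.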

With modern JVMs, Java can occasionally actually be faster than native compiled languages due to dynamic optimization at runtime.

dalke 3581 days ago

A few minutes ago (and after your comment) one of the commenters of that issue tested against Apache Commons CSV and found that Java was 1.9x faster than Go, rather than the original 3x: https://github.com/golang/go/issues/16791#issuecomment-24456...

merb 3581 days ago

Actually, the Java one is still amazing since it's a cold JVM. With a big file, I would think that Java would be far ahead of both, given an aggressive JIT. Maybe PyPy is faster than all 3 :D

jsiep 3581 days ago

Besides the point that the Java example is not a good one: the JVM is actually a pretty mean piece of software with a lot of optimisation. So while Go could in theory produce faster code than Java, I doubt the Go compiler is clever enough to produce faster code than the JVM in a lot of scenarios (at the moment).

jakub_h 3581 days ago

But Java conceptually has a lot of drawbacks that require the JVM to have screaming performance to compensate for. Almost everything being a "headered" object is probably the worst offender. Even a slightly worse Go compiler is probably well compensated for by denser data structure layout in memory.

pjmlp 3580 days ago

Only true until value types get productized, and there are already prototype versions to play with.

Also depending on which JVM SDK is being used (Oracle Hotspot, Oracle Graal, IBM J9, HP, PTG, JET,...), the quality of escape analysis differs but it all boils down to turning those headered objects into plain structs, if possible stack allocated.

eternalban 3580 days ago

> Java conceptually has a lot of drawbacks

In what sense is an articulated object a "conceptual drawback"? It is a richer object model and SMI and friends had the engineering chops to make it highly performant.

weberc2 3580 days ago

What do you mean by "a richer object model"? The drawbacks are principally that the programmer has less control over allocs and layout.

jakub_h 3576 days ago

If you want an actual rich object model, why aren't you using CLOS instead? Java's objects cost you a lot of potential performance with few of the benefits.

__s 3581 days ago

The Java source is using str.split(','), which is completely different from Python's offering of dialect/delimiter/etc.: https://github.com/python/cpython/blob/master/Lib/csv.py#L24
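As a rough illustration of what "dialect/delimiter/etc" covers, here is a sketch using Python's standard csv module. The sample data and the dialect name "semi" are invented for this example.

```python
import csv
import io

# Made-up input: semicolon-delimited, with single quotes as the quote character.
data = "a;b;'x;y'\n"

# Configure the parser per call with keyword arguments.
rows = list(csv.reader(io.StringIO(data), delimiter=";", quotechar="'"))
print(rows)

# Or package the same settings as a named, reusable dialect.
# "semi" is a hypothetical name chosen for this sketch.
csv.register_dialect("semi", delimiter=";", quotechar="'")
rows2 = list(csv.reader(io.StringIO(data), dialect="semi"))
print(rows2 == rows)
```

Both calls keep 'x;y' together as one field, which a fixed split on a single character cannot do.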

masklinn 3581 days ago

The Java version is just reading lines and splitting them on commas; it's not actually a CSV parser.

justin66 3581 days ago

It's as if there's more to software quality than the choice of tools.

Anderkent 3581 days ago

The Java one doesn't do as much as the Python one: `line.split(',')` doesn't handle quoted commas, escape characters, different CSV separators, etc.

andrepd 3581 days ago

Yeah, calling from Python introduces a layer of inefficiency.