HN Mirror

Hacker News

	by justincormack 3582 days ago
	It also seems to support UTF-8, which is perhaps common now. (CSV doesn't tell you what encoding it is; in the old days it was 95% ASCII.)

3 comments

wongarsu 3582 days ago

How would a CSV parser even break UTF-8 encoding (by accident)? All CSV control characters (comma, double quote, and newline) map to the same bytes in ASCII and UTF-8, and no non-ASCII UTF-8 character uses any ASCII byte in its encoding.
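
A minimal Python sketch of that property, for anyone who wants to check it. The helper name `multibyte_sequences_avoid_ascii` is made up for illustration. It relies only on the fact that every byte of a multibyte UTF-8 sequence has its high bit set, so splitting raw bytes on an ASCII comma cannot cut a character in half.

```python
def multibyte_sequences_avoid_ascii(start=0x80, stop=0x110000):
    """Return True if no code point in [start, stop) encodes to a byte below 0x80."""
    for cp in range(start, stop):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            return False
    return True


# Splitting the raw bytes on an ASCII comma and decoding each field afterwards
# gives the same fields as decoding first and splitting the text.
line = "naïve,日本語,Łódź"
by_bytes = [field.decode("utf-8") for field in line.encode("utf-8").split(b",")]
by_text = line.split(",")
```

The sketch ignores quoting; it only shows that the delimiter byte itself is never ambiguous.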

ktRolster 3582 days ago

I've seen one break because of the byte-order mark that sometimes gets added to UTF-8 files. I don't remember the details of why that broke it; I just remember that it worked fine on everything except that.

sp332 3581 days ago

UTF-8 doesn't have a "byte order", so I thought it wouldn't have a byte-order mark. But apparently some software adds one anyway. https://en.wikipedia.org/wiki/UTF-8#Byte_order_mark
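
One plausible way a BOM can trip up a reader, sketched in Python. This is a guess at the failure mode, not the case described above. If the bytes are decoded as plain `utf-8`, the BOM survives as U+FEFF and sticks to the first header name; Python's `utf-8-sig` codec strips it. The sample data and the helper `read_rows` are invented for the example.

```python
import codecs
import csv
import io

# A UTF-8 file as some tools write it: BOM (EF BB BF) followed by the data.
raw = codecs.BOM_UTF8 + "name,city\nAda,Łódź\n".encode("utf-8")


def read_rows(data, encoding):
    """Decode the bytes with the given codec and parse them as CSV with a header."""
    return list(csv.DictReader(io.StringIO(data.decode(encoding), newline="")))


# Plain "utf-8" keeps the BOM, so the first column is named "\ufeffname".
with_bom = read_rows(raw, "utf-8")

# "utf-8-sig" drops a leading BOM if there is one, so the column is "name".
without_bom = read_rows(raw, "utf-8-sig")
```

A lookup such as `with_bom[0]["name"]` then fails even though the file looks fine in an editor.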

weberc2 3582 days ago

Go lets you use any code point as the delimiter.

masklinn 3582 days ago

> CSV doesn't tell you what encoding it is, in the old days it was 95% ASCII

Or some non-ASCII code page you're not told about and have to guess (e.g. Excel generates CSV in CP1250 by default, with an option to export UTF-16).
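
A rough Python sketch of that guessing game. The function name `decode_csv_bytes` and the order of the checks are assumptions for illustration, and the legacy code page is a parameter because the right value depends on where the file came from. It tries a UTF-16 BOM first, then strict UTF-8, and only then falls back to the legacy code page.

```python
import codecs


def decode_csv_bytes(data, legacy="cp1250"):
    """Return (text, encoding_name) using a simple best-effort guess."""
    # UTF-16 exports normally start with a BOM, which also gives the byte order.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"
    try:
        # Strict UTF-8 rejects most legacy-encoded text that contains
        # non-ASCII bytes; "utf-8-sig" also drops a leading BOM.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything else is in the caller-supplied legacy code page.
        return data.decode(legacy), legacy
```

The fallback cannot tell one single-byte code page from another, so the result is only as good as the `legacy` guess.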

coldtea 3582 days ago

All three support UTF-8. The difference with the Go implementation is the mostly useless capability to use a multibyte UTF-8 character as the delimiter.

masklinn 3582 days ago

1. There's nothing useless about it.

2. The Python 3 CSV library supports arbitrary code points as delimiter, quote character, and escape character (if applicable).
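
A short Python 3 sketch of point 2, using the section sign `§` (U+00A7, two bytes in UTF-8) as the delimiter. The sample data is invented. The `csv` module works on `str`, so the delimiter only has to be a single character.

```python
import csv
import io

text = 'id§name§note\n1§Ada§"likes § signs"\n'

# Read with a non-ASCII delimiter; quoted fields may still contain it.
rows = list(csv.reader(io.StringIO(text, newline=""), delimiter="§"))

# Write the rows back out with the same delimiter.
out = io.StringIO(newline="")
csv.writer(out, delimiter="§", lineterminator="\n").writerows(rows)
```

The writer quotes only the field that contains the delimiter, so the round trip reproduces the input text.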

coldtea 3582 days ago

> 1. there's nothing useless about it

Opportunity cost. It slows down parsing in order to support something that nobody has ever seen in the wild (not to mention that it doesn't even match the name of the format, but let's get past that since we already use ; | and others).

Plus, I can't even imagine a use case where it would be a good idea to use that over a simpler delimiter, or even the special-purpose ASCII delimiter character. Can you?

geofft 3582 days ago

> 1. there's nothing useless about it

Have you ever seen a "C"SV with a multibyte sequence as a delimiter? I haven't.

Even if such a thing exists, the feature is of negative utility if it slows down CSV parsing for everyone else. If you must, write two implementations, and use the slow path if your delimiter is multibyte.
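
The two-path idea could be sketched like this in Python. The function `split_record` is hypothetical, and it ignores quoting and embedded newlines. It illustrates the dispatch only: in Python both branches run in C, so the sketch says nothing about the actual speed difference in Go.

```python
def split_record(line: bytes, delimiter: str = ",") -> list:
    """Split one unquoted UTF-8 record, picking a path based on the delimiter."""
    if len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")
    if delimiter.isascii():
        # Fast path: the delimiter is one byte, so scan the raw bytes and
        # decode each field only after it has been cut out.
        return [f.decode("utf-8") for f in line.split(delimiter.encode("ascii"))]
    # Slow path: the delimiter is multibyte in UTF-8, so decode the whole
    # line to code points first and split on the character.
    return line.decode("utf-8").split(delimiter)
```

Callers with an ordinary comma never touch the slow path, which is the point of keeping the two separate.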

drej 3582 days ago

That's precisely my suggestion https://github.com/golang/go/issues/16791#issuecomment-24209...

Moving from runes to bytes when reading gives us a nice speedup. It's not quite enough to eliminate the gap, but it's a start. The rest is likely all the memory copies: the data is first read into a buffer, then copied byte by byte into a slice, and only then converted into a string, which is another copy, because strings can't be backed by pre-existing byte slices (not in the public API, that is).
