by justincormack 3582 days ago
It also seems to support UTF8, which is perhaps common now. (CSV doesn't tell you what encoding it is, in the old days it was 95% ASCII).
3 comments

How would a CSV parser even break UTF8 encoding (by accident)? All CSV control characters (comma, double quote and newline) are encoded as the same single bytes in ASCII and UTF8, and no non-ASCII UTF8 character uses any ASCII byte value in its encoding.
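To make that argument concrete, here is a minimal Go sketch (the helper names `hasASCIIByteInMultibyte` and `splitOnComma` are made up for illustration). It checks that every byte of a multi-byte UTF-8 sequence is 0x80 or above, and then splits a line on the raw comma byte without decoding anything. It ignores quoting entirely, so it is not a CSV parser, just a demonstration of why byte-level splitting cannot cut a character in half.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// hasASCIIByteInMultibyte reports whether any multi-byte UTF-8 sequence in s
// contains a byte below 0x80. For valid UTF-8 this should never happen.
func hasASCIIByteInMultibyte(s string) bool {
	for i := 0; i < len(s); {
		_, size := utf8.DecodeRuneInString(s[i:])
		if size > 1 { // invalid bytes decode with size 1 and are skipped
			for j := i; j < i+size; j++ {
				if s[j] < utf8.RuneSelf {
					return true
				}
			}
		}
		i += size
	}
	return false
}

// splitOnComma splits on the raw byte 0x2C without decoding UTF-8.
// Quoted fields are deliberately not handled in this sketch.
func splitOnComma(line []byte) [][]byte {
	return bytes.Split(line, []byte{','})
}

func main() {
	line := "naïve,日本語,€5"
	fmt.Println("ASCII byte inside a multi-byte sequence:", hasASCIIByteInMultibyte(line))
	for _, f := range splitOnComma([]byte(line)) {
		fmt.Printf("%q still valid UTF-8: %v\n", f, utf8.Valid(f))
	}
}
```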
I've seen one break because of the byte-order marker that sometimes gets added to UTF-8. I don't remember the details of why that broke it, just remember that it worked fine on everything except that.
UTF-8 doesn't have a "byte order", so I thought it wouldn't have a byte-order mark. But apparently some software adds one anyway. https://en.wikipedia.org/wiki/UTF-8#Byte_order_mark
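The details of that breakage aren't given, but one plausible failure mode is that the three BOM bytes (EF BB BF) end up glued to the first field. A rough Go sketch of stripping them before parsing, assuming the whole input is already in memory (`stripBOM` and `firstHeader` are hypothetical helpers, not part of any library):

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF, which some software
// prepends to UTF-8 files.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM removes a leading UTF-8 BOM if one is present.
func stripBOM(data []byte) []byte {
	return bytes.TrimPrefix(data, utf8BOM)
}

// firstHeader parses the first record and returns its first field.
func firstHeader(data []byte) (string, error) {
	rec, err := csv.NewReader(bytes.NewReader(data)).Read()
	if err != nil {
		return "", err
	}
	return rec[0], nil
}

func main() {
	data := []byte("\xEF\xBB\xBFid,name\n1,Ada\n")

	raw, err := firstHeader(data)
	fmt.Printf("with BOM:    %q (err=%v)\n", raw, err)

	clean, err := firstHeader(stripBOM(data))
	fmt.Printf("without BOM: %q (err=%v)\n", clean, err)
}
```

A lookup for a column named `id` would quietly miss if the parser leaves the BOM attached to the first header, which fits the "works on everything except that" symptom.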
Go lets you use any code point as the delimiter.
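For reference, a small sketch of what that looks like with `encoding/csv`, whose `Reader.Comma` field is a `rune`. The `parseWithDelimiter` wrapper and the choice of `§` (two bytes in UTF-8) are just for illustration:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseWithDelimiter reads all records from input using delim as the
// field separator instead of the default comma.
func parseWithDelimiter(input string, delim rune) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = delim // any valid rune except \r, \n and U+FFFD
	return r.ReadAll()
}

func main() {
	recs, err := parseWithDelimiter("a§b§c\n1§2§3\n", '§')
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, rec := range recs {
		fmt.Printf("%q\n", rec)
	}
}
```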
> CSV doesn't tell you what encoding it is, in the old days it was 95% ASCII

Or some non-ASCII code page you're not told about and have to guess (e.g. Excel generates CSV in the system's ANSI code page, such as CP1250, by default, with an option to export UTF-16)

All 3 support UTF8. The difference with the Go implementation is the mostly useless capability to have a multibyte UTF-8 character as the delimiter.
1. there's nothing useless about it

2. the Python 3 CSV library supports arbitrary codepoints as delimiter, quote character and escape character (if applicable)

>1. there's nothing useless about it

Opportunity cost. It slows down the parsing for support of something that nobody has ever seen in the wild (not to mention it doesn't even match the name of the format, but let's get past that since we already use ; | and others).

Plus I can't even imagine a use case that would make it a good idea to use that over a simpler delimiter, or even the special purpose ASCII delimiter character. Can you?

> 1. there's nothing useless about it

Have you ever seen a "C"SV with a multibyte sequence as a delimiter? I haven't.

Even if such a thing exists, the feature is of negative utility if it slows down CSV parsing for everyone else. If you must, write two implementations, and use the slow path if your delimiter is multibyte.
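A rough sketch of that two-path idea in Go, assuming unquoted input and a valid delimiter rune. `splitFields` and `splitByte` are hypothetical names, and this is not how `encoding/csv` is actually structured:

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// splitFields splits an unquoted line on delim. Single-byte (ASCII)
// delimiters take a byte-scanning fast path; anything else falls back
// to searching for the delimiter's full UTF-8 encoding.
func splitFields(line []byte, delim rune) [][]byte {
	if delim >= 0 && delim < utf8.RuneSelf {
		return splitByte(line, byte(delim)) // fast path
	}
	return bytes.Split(line, []byte(string(delim))) // slow path
}

// splitByte scans for a single delimiter byte with bytes.IndexByte.
func splitByte(line []byte, d byte) [][]byte {
	var fields [][]byte
	for {
		i := bytes.IndexByte(line, d)
		if i < 0 {
			return append(fields, line)
		}
		fields = append(fields, line[:i])
		line = line[i+1:]
	}
}

func main() {
	fmt.Printf("%q\n", splitFields([]byte("a,b,c"), ','))
	fmt.Printf("%q\n", splitFields([]byte("a§b§c"), '§'))
}
```

The branch is taken once per line here; a real parser would more likely pick the path once when the reader is configured.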

That's precisely my suggestion https://github.com/golang/go/issues/16791#issuecomment-24209...

Moving from runes to bytes in reading gives us a nice speedup - not quite enough to eliminate the gap, but it's a start. The rest is likely all the memory copies: the data is first read into a buffer, then copied byte by byte into a slice, and only then converted into a string, which is another copy, because strings can't be based on pre-existing byte slices (not in the public API, that is).
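To illustrate the copy being described: converting a `[]byte` to a `string` gives the string its own memory, so later writes to the slice don't show through. One way to pay that cost once per record rather than once per field is to convert the whole record buffer in one go and hand out substrings, which share the backing memory. This is only a sketch under the assumption that field bytes are accumulated back to back with their end offsets recorded; `conversionCopies` and `sliceFields` are made-up names.

```go
package main

import "fmt"

// conversionCopies shows that string(buf) is independent of buf:
// mutating the slice afterwards does not change the string.
func conversionCopies() bool {
	buf := []byte("abc")
	s := string(buf) // the string gets its own copy of the bytes
	buf[0] = 'X'
	return s == "abc"
}

// sliceFields converts the record buffer to a string once and returns
// substrings of it. record holds the field bytes back to back (no
// delimiters); ends[i] is the end offset of field i.
func sliceFields(record []byte, ends []int) []string {
	s := string(record) // single conversion for the whole record
	fields := make([]string, len(ends))
	start := 0
	for i, end := range ends {
		fields[i] = s[start:end] // substrings share s's memory
		start = end
	}
	return fields
}

func main() {
	fmt.Println("conversion copied:", conversionCopies())
	fmt.Printf("%q\n", sliceFields([]byte("abcde"), []int{1, 3, 5}))
}
```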