by lolpanda 802 days ago
I reviewed the code in this project and it looks pretty reasonable. I thought CSV was a loosely specified format. In my experience, moving data from one system to another using CSV has never gone smoothly. I had a lot of trouble with Snowflake -> CSV -> Clickhouse. I now use JSONL for pretty much everything.
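To illustrate the kind of trouble I mean, here is a minimal sketch using only Python's standard library. The sample record and the helper names (`to_jsonl`, `from_jsonl`, `csv_round_trip`) are made up for illustration; this is not the Snowflake or Clickhouse behavior, just what a plain CSV round trip does to types and nulls compared with JSONL.

```python
import csv
import io
import json

# Hypothetical record: an integer, text with a quote, comma and newline, and a null.
rows = [{"id": 1, "note": 'He said "hi",\nthen left', "score": None}]

def to_jsonl(records):
    # One JSON document per line; json.dumps escapes embedded newlines.
    return "".join(json.dumps(r) + "\n" for r in records)

def from_jsonl(text):
    return [json.loads(line) for line in text.splitlines() if line]

def csv_round_trip(records):
    # Write with the csv module's default dialect, then read it straight back.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    buf.seek(0)
    return list(csv.DictReader(buf))

jsonl_rows = from_jsonl(to_jsonl(rows))   # types and None survive
csv_rows = csv_round_trip(rows)           # everything comes back as a string
```

After the CSV round trip the integer is the string `"1"` and the null has turned into an empty string, so the receiving system has to guess what was meant. The JSONL round trip gives back the original record.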
4 comments

There are the specs, and then there is the real world. More often than not the specs are on the opposite side of the moon; you never see them followed in real life. Oh, so many hours of my life were wasted on that. Real-world CSVs are as loosely specified as any free text in a notepad.
Yes, loosely specified in practice. I made a CSV parser that tries to do something reasonable for many variants by default. When that's not enough, you can specify options. https://www.neilvandyke.org/racket/csv-reading/
This is the problem… CSV isn’t specified in sufficient detail; it is just too loose in the real world. So the question “can you make a small parser” isn’t a real issue. And then, the problem with such a small parser is: which edge cases are you missing/ignoring?

I just don’t see the flex in having a small csv parser.

There's an RFC that specifies a standard format for CSV. If you're smart you'd use it ^W^W… well, you'd probably not use CSV to start with.

The problem is that often, what you have to ingest is more properly described as "malformed CSV / bytes that loosely resemble CSV in some manner that I have no choice but to either try to shove into a parser, or write some custom junk for this hot garbage because it comes from a source that I cannot control".

A lot of parsers are fairly configurable precisely to account for the situation of "the other end is sending me ill-defined jank" and to be flexible enough that maybe, just maybe, it'll mostly work. But it's hardly "engineering" at that point.
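As a rough sketch of what "configurable" tends to mean, here is a thin wrapper over Python's standard `csv` module. The function name `read_rows` and the sample inputs are hypothetical; `delimiter`, `quotechar` and `skipinitialspace` are real options of `csv.reader`.

```python
import csv
import io

def read_rows(text, delimiter=",", quotechar='"', skipinitialspace=False):
    """Parse CSV-ish text with a few knobs for common real-world variants."""
    reader = csv.reader(
        io.StringIO(text, newline=""),   # let the csv module handle line endings
        delimiter=delimiter,
        quotechar=quotechar,
        skipinitialspace=skipinitialspace,
    )
    # Drop completely empty lines, which sloppy exports often contain.
    return [row for row in reader if row]

# Semicolon-separated export, as some spreadsheet locales produce.
semicolons = read_rows('a;b\n"x;y";z\n', delimiter=";")

# Single-quoted fields with a space after each comma.
single_quoted = read_rows("a, 'b, c'\n", quotechar="'", skipinitialspace=True)
```

Each knob covers one known variant. None of them helps when the sender is inconsistent within a single file, which is where the custom junk starts.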

It doesn’t specify a standard format.

It describes a common format.

https://datatracker.ietf.org/doc/html/rfc4180
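For reference, the format that RFC describes boils down to: comma-separated fields, CRLF line breaks, fields containing commas, quotes or line breaks wrapped in double quotes, and embedded double quotes doubled. A small sketch with Python's standard `csv` module, whose default writer settings happen to produce that shape (the helper name `rfc4180_line` is made up for illustration):

```python
import csv
import io

def rfc4180_line(fields):
    """Serialize one record using the csv module's default dialect."""
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)   # minimal quoting, doubled quotes, CRLF
    return buf.getvalue()

line = rfc4180_line(["a", 'say "hi"', "x,y"])
# line == 'a,"say ""hi""","x,y"\r\n'
```

Plenty of producers deviate from this in at least one of those rules, which is why it is only a common format.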