Hacker News new | ask | show | jobs
by tanin 692 days ago
What surprised me the most about CSVs is that:

- To escape the delimiter, we should enclose the value with double quotes. Ok, makes sense.

- To escape double quotes within the enclosing double quotes, we need to use 2 double quotes.

Many tools are getting it wrong. Meanwhile some tools like pgadmin, justifiably, allows you to configure the escaping character to be double quote or single quote because CSV standard is often not respected.

Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.

7 comments

> Many tools are getting it wrong.

They're not getting it wrong, they're just assuming a different variant.

There is no "standard" for CSV. Yes, there's an RFC, published in 2005, about 30 years after everyone was already using CSV. That's too late. You can't expect people to drop all compatibility just because someone published some document somewhere. RFC 4180 explicitly says that "it does not specify an Internet standard of any kind", although many people do take it as a "standard". But even if it did call itself a standard: it's still just some document someone published somewhere.

They should have just created a new "Comma Separated Data" (file.csd) standard or something instead of trying to retroactively redefine something that already exists. Then applications could add that as a new option, rather than "CSV, but different from what we already support". That was always going to be an uphill battle.

Never mind that RFC 4180 is just insufficient by not specifying character encodings in the file itself, as well as some other things such as delimiters. If someone were to write a decent standard and market it a bit, then I could totally see this taking off, just as TOML "standardized INI files" took off.

RFC 4180 says it "documents the format that seems to be followed by most implementations" and in practice I find that to be true, though my CSVs don't interact with a lot of very old software. You get very far by treating "RFC 4180, UTF-8" as a standard and considering every implementation that doesn't follow it to be broken. I'm not sure I have ever seen software that simultaneousy doesn't follow the RFC, but does consistently support escaping.
Did TOML take off? As much as I love it, it seems really rare to see in the wild. I still see YAML everywhere and despair.
It's in the standard library for Python, Rust, Julia, and maybe some other languages. It's also widely used in those ecosystems (pyproject.toml, cargo.toml). I think it's fair to say it took off, even though YAML is also popular.
The tomllib library in Python 3.11+ can only read TOML files, not write them.
I don't believe its in the standard library for Rust, even if it is very popular in the Rust ecosystem.
Right; I'm not super-familiar with Rust and how exactly they organise things, but it's in more or less every Rust project due to Cargo.toml.
Rust uses it, and Rust seems pretty popular.

I know Alire, the Ada crate manager uses it too.

I use it for some personal projects. It's really nice!

Which is hilarious when you consider that the spec is that complex.
Toml is both great and terrible. I'm not a fan of how it handles some deeper arrays
> someone were to write a decent standard and market it a bit, then I could totally see this taking off, just as TOML "standardized INI files" took off.

Why? We have xlsx for the office crowd and arrow for the HPC crowd. In no universe does anyone actually have to invent another tabular data format using delimiters.

Neither are a universal replacements for CSV. They're not even text formats (well, technically xlsx is if you expect the XML from the zip, but practically: no really.). The article already explains why, as the title says, "CSV is still king": it's widely used, it's simple, it's used all over the place, it's universal, it's human-readable-y.
I can't tell you how to run your business, but subscriptions for offline apps aren't going to be popular here.

Charge me more upfront for a perpetual license, or just version the software. Say 40$ today for V3, and every year charge a reasonable fee to upgrade, but allow me to use the software I purchased...

I recently saw a license that was based on a monthly subscription, but once you paid for a year you got a perpetual license to the version you started with. Every year, your perpetual license was updated to the next year's version. I find that to be a reasonable middle ground.
I think you mean perpetual license, unless you really do mean a license that covers the clitoris or penis.
Thank you for your feedback. I think your opinion is super valid here.

I've been thinking about pricing, and a lot of people did complain about it. However, many people expense their software cost, so they don't mind the yearly subscription.

I'm improving the pricing right now and a perpetual license is what I'm going with.

> Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.

Looks like SQL is the main selling point for your tool. For other simpler needs, Modern CSV [1] seems suitable (and it’s cheaper too, with a one time purchase compared to a yearly subscription fee). But Modern CSV does not support SQL or other ways to create complex queries.

[1]: https://www.moderncsv.com/

https://www.ietf.org/rfc/rfc4180.txt

Works for SQLite at least, but not sure about other software.

It would be more useful if every RFC had a test suite of input/output and input/error.

Yes, those are potentially infinite, but a core set would be useful. As ambiguities come up, publish an addendum for clarification, and eventually, as the exceptions accumulate, a version step.

I don't understand how anyone can write a spec without concrete examples of pass/fail in their head. Perhaps there could be an informal example/counterexample syntax for those writing RFCs, which could be extracted into the 1.0 test suite.

The test suite must be a single open source repo, that accumulates acceptable edge cases until the relevant informed adults can make a call about revising the spec.

There has to be one approved, sanctioned, well-known and monitored test suite repo. It cannot be shrugged off into a free-for-all that makes it impossible to find a single canonical test suite. The interwebs are big and conflicted.

See Imre Lakatos 'Proofs and Refutations' for how this evolves.

RFCs sometimes have pseudocode. It would be nice to have a "pseudocode translator" that translates it to some actual programming language.

With few exceptions, I have gven up on documentation. Whether it is specifications or software. Now I just read source code instead.

I think in the 60s and 70s documentation used to be better and did focus more on input/output. For example, I still use spitbol and icon.

Maybe it is controversial view, but I fail to comprehend how any RFC can be considered a "specification". In truth an RFC is only a "proposed specification" at best, literally a "request for comments". (Where are the comments?) In fact, often RFCs simply document some internet practice that already exists. (Meanwhile the number of "BCPs" is relatively small.) RFCs can be anything.

As is the case with Markdown, many parsers have prioritized ease of implementation over formal rigor.
I agree about markdown, but the only awkward implementation issue is nested syntax: what markup is parsed inside various other outer markup forms?

Italic headings? Bold links? Nested lists - how many levels? Code in list? How do paragraphs interact with lists? There are many opinions and many leaky implementations of those opinions. Newlines? Embedding HTML in Markdown !?!?

It all seems so sad, because (X)HTML nailed most of these issues a very long time ago. But HTML implementations were sloppy from the outset. And XML was born with inherited bloat, then got ever more complex over time (modular specs, XLink, XPath, XSLT, DTD -> XML Schema, ...)

With Markdown, it is relatively easy to introduce some recursion into the parser, but for what spec? In what contextual cases? At what cost?

One classic example is JSON.

It is possible to just treat commas as whitespace. It makes implementation so much easier. It accepts missing, trailing and repeated commas. It makes elements uniform. It ignores many common errors that arise from typos or cut'n'paste. It makes JSON writers simpler, by removing the first/last special case.

A JSON parser that treats commas as whitespace can be two dozen lines in most programming languages - if you do not want line/column, chapter and verse, for the remaining error messages.

I wish there was a text format that used the ascii unit separator and record separator. It would have solved so many problematic edge cases.
The one tools get the most wrong is that there is no escaping of the new line character.
Oh yes, but I encountered it on the parsing side. A CSV parsing algorithm that does parallel processing would have this issue.

DuckDB has this problem when the parallel processing of CSV is enabled.

Understandably though because they want to process many lines in parallel.