Hacker News new | ask | show | jobs
by ndsipa_pomu 1284 days ago
As much as I like and use CSV for database work, it has a problem with being poorly specified. The most common problems are when processing CSVs produced elsewhere which might not enclose text fields with quotes and thus have issues with data that includes commas and multi-line data.
6 comments

There is a spec (RFC 4180 [1]) but it's definitely not widely followed. Worse, for a lot of data there's no problems for potentially years, until your numbers get too big or the first time a quote or comma gets in the data.

In my experience one of the biggest barriers I run into -- and the primary reason I hate using CSV -- is Microsoft Excel. It misinterprets numbers as dates, it convers big numeric identifiers to exponents, and more. Even merely opening a RFC4180-compliant file and saving it changes the data, and even Excel itself will often have a different misinterpretation of the de file.

If humans never used Excel for CSV, it would be a viable format. At the same time in most cases where humans aren't in the loop (machine-to-machine communications), there's better formats. You could spec "RFC4180 CSV" and hope no developer just sees the "CSV" and assumes they understand. Or specify something like a JSON streaming format and avoid a whole lot of headache.

[1] https://www.ietf.org/rfc/rfc4180.txt

Regarding excel and rfc4180…

I avoided CSV for quite a while because I had excel-vs-CSV compatibility concerns like this.

However, when I tested this for myself a few years back, Excel output to my surprise was rfc4180 or darn near it (it might use CRLF rather than LF?) It emitted commas and quotes the same way as the rfc for all the test cases I checked.

That said, I agree with you Excel is problematic as an input source. Usually the problems are the humans who touch the data in excel, but what I’ve found is the automation problems tend to be with Excel parsing and interpreting incoming data (before it goes to CSV.) Exponents, trimming leading zeros, etc. as you say. But if the data is confirmed good in excel before being emitted, the CSV it emits is decent.

Counterexamples welcome.

Yes, I feel like this would've been more helpful generalized as "Consider DSV" (delimiter-separated values) than CSV specifically, because of the interop issues that often come up. I'd have also mentioned using Parquet.
I generally find Avro to be a better replacement for CSV than Parquet. It’s a better drop in for the typical CSV use case of “process this file row by row”.

Parquet is great, don’t get me wrong.

Avro is a row-based format that supports schema evolution, but unfortunately it is not a very popular format in general and there's much less support for it than Parquet. My team evaluated several formats (ORC, Parquet, Avro) but landed on Parquet due to columnar performance and wider support, though you're right, it's less performant for row-wise operations (but not by that much). If you really want row-wise performance, better to go with sqlite than Avro.
Parquet has the opposite problem of CSV though. It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.

To read a parquet file in Python, you need Apache Arrow and Pandas. And literally the second result for "parquet python libraries" is an article titled "How To Read Parquet Files In Python Without a Distributed Cluster".

I remember dealing with Parquet file for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to csv/json/read rows? Is is a limitation of the format itself?

We data scientists are well-known for our exclusive mastery data wrangling arcana, like…

  df = pandas.read_parquet(‘foo.parquet’)
  df.to_csv(‘foo.csv’)
  df.to_json(‘foo.json’)
(no sarcasm)—how could it be simpler than that? What problems have you encountered that make it unusable?
Arrow and pandas are massive dependencies.
Not really. Depends on your use case but most of the time you’re trading off disk space for a specialized efficient library.

Pandas and Arrow are dependencies like any other. Pandas is like a DSL for working with tabular data, much like numpy is a DSL for working with arrays and numerical algebra. No one working with linear algebra will insist on using the Python standard library built ins.

If you’re distributing a smallish Python app that only needs to read and manipulate smallish amounts of data, then I agree there are easier solves like SQLite.

But if you’re doing consulting work and dealing with large tabular datasets and need to do SQL type window functions and aggregations then Parquet is a better fit and the disk space required for adding a Pandas dependency is trivial. If one is using Anaconda, Pandas is batteries included. It really depends on what is being optimized for.

> It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.

FWIW, in my experience at a "data analytics platform" company, it's reasonably popular for data-heavy workflows since Parquet is well-defined, and file sizes (especially as the amount of data grows) are a fraction of their CSV equivalents.

> Is it a limitation of the format itself?

I don't think so. In other languages, you can generally read/write Parquet files without a ton of dependencies (e.g. https://github.com/xitongsys/parquet-go).

> It's so complex to work with

This is the opposite of my experience.

> To read a parquet file in Python, you need Apache Arrow and Pandas.

Or DuckDB.

    import duckdb
    df = duckdb.query("select * from 'a.parquet'")
Want to look inside a Parquet file? Use Visidata.

    vd a.parquet
> I remember dealing with Parquet file for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to csv/json/read rows? Is is a limitation of the format itself?

Do you consider Pandas a "data science" stack? To me, it's just a library like any other that makes it easy to work with tabular data. Even for CSV, there is csvreader (usually not a good idea to deal with CSV by hand). Outputting to CSV is literally a one liner in Pandas or DuckDB.

   import pandas as pd

   # output to CSV
   pd.read_parquet("a.parquet").to_csv("a.csv") 

   # output to JSON (choose from any number of orientations)
   pd.read_parquet("a.parquet").to_json(orient="table")

   # read rows
   for row in pd.read_parquet("a.parquet").itertuples():
       print(row)
I want to use parquet more frequently, but it creates new problems that do not exist if I dump to CSV. Last I looked, there were not any good GUIs that would let someone quickly browse the data. Now it is just a blob lacking introspection. CSV has issues, but it is universal.
Not a GUI tool but try Visidata for looking inside Parquet files (and other tabular formats)

https://www.visidata.org/

A bit round-about, but the slick way I discovered is to take a detour through DuckDB. DuckDB offers parquet bindings which you can link through a kind of foreign data interface and then query through SQL. Using this, you can then just browse parquet files through DBeaver or your IDE of choice. Hardly an out of the box solution I can offer to a random collaborator, but fantastic for your savvy analyst.
That is interesting to hear. Parquet input and output is on the wishlist for our Easy Data Transform software (currently we support CSV, Excel, XML, JSON and a few others). Anyone have any experience integrating Parquet read/write into a C++ application?
but which delimiter.

if you choose pipe ok, now you have to make sure nobody typed a pipe into the input field or spreadsheet, and you cannot store unix commands

if you choose tab, ok, now people will get confused when they try to edit the text file to replace tabs with spaces, and now you have trouble putting code snippets into data fields because they have tabs.

this is the problem and it's why xml/json exist.

in my particular domain, tab separated works pretty well but in a general context of the world at large, i feel like JSON has reasons it exists.

Well the obvious solution would be ASCII 0x1D (Group Separator)! Accept, no one actually uses those ASCII characters. Kind of bums me out that UNIX basically skipped out on them.
I agree. Lots of discussion related to proposing that here: https://news.ycombinator.com/item?id=31220841
It's not a separator character, but at least vim and emacs acknowledge the page feed character. A pittance, I suppose.
> It's not a separator character,

Isn't it? I thought all the separator characters (0x1e, 0x1f, 0x1c) were specifically for delimiting records, fields and units.

What are they for?

They’re saying that “page feed character” (I’m guessing form feed) is acknowledged by Emacs, in contrast to those separator characters.

I think it’s used to mark sections in Emacs Lisp code.

Both pipe and tab are infinitely better for so-called human-readable data compared to comma. Comma doesn’t even work well for numbers since some locales use comma as the decimal separator. And a data format can’t be “human-readable” if you’re not allowed to write numbers in the way that you’re used to write them.
Pipes are quite common, but for tricky data, I'd recommend ¬. It's on most keyboards and I can't think of any other use of it.
This symbol is not present on US keyboards.
I did not know that. It's on most UK keyboards
What is it called? What's it for?
> but which delimiter

Control characters. Like ctrl-A and stuff. Almost nobody has them in their data.

In the context of an API so long as using a "real" CSV library a lot of those inconsistencies do not appear. Problems happen when you have to interface with humans and desktop software (ie Excel) which has its own rules.
Yes, CSV is superior to JSON for tabular data, but has it's own issues. One issue is that the standard is not consistently applied. Another is the approach to escaping means that it is hard to parse a CSV file with multiple threads. You have to parse the entire file before you can be sure which " characters escape other characters.

I wrote an article about tabular formats and their strengths and weaknesses here: https://successfulsoftware.net/2022/04/30/why-isnt-there-a-d...

The resulting HN discussion is here: https://news.ycombinator.com/item?id=31220841

>CSV is superior to JSON for tabular data

Can't you just do this?

  {
    "columns": ["col1", "col2", "col3"],
    "data": [
               [1,      2,      3],
               [4,      5,      6],
               [7,      8,      9]
    ]
  }
That's valid JSON but it's human-readable and human-editable rows of comma-separated data, just like CSV.
You can. But I don't see how that is superior to the equivalent CSV.
Until you try to either cram it into Excel, work with different encodings or pass it around different software platforms (even your lang vs JS), no difference.

CSV has been abused a lot to make it work on conflicting use-cases, JSON handles a lot of misshaps happened with delimiter-separated record formats, like new-lines or bring-your-own-character encoding.

> The most common problems are when processing CSVs produced elsewhere [...]

The limitations of CSV are certainly worth considering and, in the instances you mentioned, it may be not be worth using CSV. (If you are going to be using a more complex parser anyway, you may as well using a format that is better defined and where you are less likely to encounter edge cases.) That being said, there remain many cases where CSV is far more efficient and far less error prone.

It tends to be a lowest common denominator or a choice between CSV and Excel documents which are trickier to automate.
Which works until you have a quote in your data. It also prevents chunking out the file to process in parallel since you could have new lines between the quotes.

What I've found to work well is to just % encode your delimiter, the new line character, and the '%' character. Basically every language has utilities for this.

Doesn't solve the issue with accepting outside files though. You have to be pessimistic with those regardless.