Hacker News new | ask | show | jobs
by CharlesW 1290 days ago
Yes, I feel like this would've been more helpful generalized as "Consider DSV" (delimiter-separated values) than CSV specifically, because of the interop issues that often come up. I'd have also mentioned using Parquet.
3 comments

I generally find Avro to be a better replacement for CSV than Parquet. It’s a better drop in for the typical CSV use case of “process this file row by row”.

Parquet is great, don’t get me wrong.

Avro is a row-based format that supports schema evolution, but unfortunately it is not a very popular format in general and there's much less support for it than Parquet. My team evaluated several formats (ORC, Parquet, Avro) but landed on Parquet due to columnar performance and wider support, though you're right, it's less performant for row-wise operations (but not by that much). If you really want row-wise performance, better to go with sqlite than Avro.
Parquet has the opposite problem of CSV though. It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.

To read a parquet file in Python, you need Apache Arrow and Pandas. And literally the second result for "parquet python libraries" is an article titled "How To Read Parquet Files In Python Without a Distributed Cluster".

I remember dealing with Parquet file for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to csv/json/read rows? Is is a limitation of the format itself?

We data scientists are well-known for our exclusive mastery data wrangling arcana, like…

  df = pandas.read_parquet(‘foo.parquet’)
  df.to_csv(‘foo.csv’)
  df.to_json(‘foo.json’)
(no sarcasm)—how could it be simpler than that? What problems have you encountered that make it unusable?
Arrow and pandas are massive dependencies.
Not really. Depends on your use case but most of the time you’re trading off disk space for a specialized efficient library.

Pandas and Arrow are dependencies like any other. Pandas is like a DSL for working with tabular data, much like numpy is a DSL for working with arrays and numerical algebra. No one working with linear algebra will insist on using the Python standard library built ins.

If you’re distributing a smallish Python app that only needs to read and manipulate smallish amounts of data, then I agree there are easier solves like SQLite.

But if you’re doing consulting work and dealing with large tabular datasets and need to do SQL type window functions and aggregations then Parquet is a better fit and the disk space required for adding a Pandas dependency is trivial. If one is using Anaconda, Pandas is batteries included. It really depends on what is being optimized for.

> It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.

FWIW, in my experience at a "data analytics platform" company, it's reasonably popular for data-heavy workflows since Parquet is well-defined, and file sizes (especially as the amount of data grows) are a fraction of their CSV equivalents.

> Is it a limitation of the format itself?

I don't think so. In other languages, you can generally read/write Parquet files without a ton of dependencies (e.g. https://github.com/xitongsys/parquet-go).

> It's so complex to work with

This is the opposite of my experience.

> To read a parquet file in Python, you need Apache Arrow and Pandas.

Or DuckDB.

    import duckdb
    df = duckdb.query("select * from 'a.parquet'")
Want to look inside a Parquet file? Use Visidata.

    vd a.parquet
> I remember dealing with Parquet file for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to csv/json/read rows? Is is a limitation of the format itself?

Do you consider Pandas a "data science" stack? To me, it's just a library like any other that makes it easy to work with tabular data. Even for CSV, there is csvreader (usually not a good idea to deal with CSV by hand). Outputting to CSV is literally a one liner in Pandas or DuckDB.

   import pandas as pd

   # output to CSV
   pd.read_parquet("a.parquet").to_csv("a.csv") 

   # output to JSON (choose from any number of orientations)
   pd.read_parquet("a.parquet").to_json(orient="table")

   # read rows
   for row in pd.read_parquet("a.parquet").itertuples():
       print(row)
I want to use parquet more frequently, but it creates new problems that do not exist if I dump to CSV. Last I looked, there were not any good GUIs that would let someone quickly browse the data. Now it is just a blob lacking introspection. CSV has issues, but it is universal.
Not a GUI tool but try Visidata for looking inside Parquet files (and other tabular formats)

https://www.visidata.org/

A bit round-about, but the slick way I discovered is to take a detour through DuckDB. DuckDB offers parquet bindings which you can link through a kind of foreign data interface and then query through SQL. Using this, you can then just browse parquet files through DBeaver or your IDE of choice. Hardly an out of the box solution I can offer to a random collaborator, but fantastic for your savvy analyst.
That is interesting to hear. Parquet input and output is on the wishlist for our Easy Data Transform software (currently we support CSV, Excel, XML, JSON and a few others). Anyone have any experience integrating Parquet read/write into a C++ application?
but which delimiter.

if you choose pipe ok, now you have to make sure nobody typed a pipe into the input field or spreadsheet, and you cannot store unix commands

if you choose tab, ok, now people will get confused when they try to edit the text file to replace tabs with spaces, and now you have trouble putting code snippets into data fields because they have tabs.

this is the problem and it's why xml/json exist.

in my particular domain, tab separated works pretty well but in a general context of the world at large, i feel like JSON has reasons it exists.

Well the obvious solution would be ASCII 0x1D (Group Separator)! Accept, no one actually uses those ASCII characters. Kind of bums me out that UNIX basically skipped out on them.
I agree. Lots of discussion related to proposing that here: https://news.ycombinator.com/item?id=31220841
It's not a separator character, but at least vim and emacs acknowledge the page feed character. A pittance, I suppose.
> It's not a separator character,

Isn't it? I thought all the separator characters (0x1e, 0x1f, 0x1c) were specifically for delimiting records, fields and units.

What are they for?

They’re saying that “page feed character” (I’m guessing form feed) is acknowledged by Emacs, in contrast to those separator characters.

I think it’s used to mark sections in Emacs Lisp code.

Both pipe and tab are infinitely better for so-called human-readable data compared to comma. Comma doesn’t even work well for numbers since some locales use comma as the decimal separator. And a data format can’t be “human-readable” if you’re not allowed to write numbers in the way that you’re used to write them.
Pipes are quite common, but for tricky data, I'd recommend ¬. It's on most keyboards and I can't think of any other use of it.
This symbol is not present on US keyboards.
I did not know that. It's on most UK keyboards
What is it called? What's it for?
I had to look it up as I don't know what it's called and it's a negation symbol https://en.wikipedia.org/wiki/List_of_logic_symbols

As to what it's for, I'd say it's great as a delimiter. I've never used it for its intended purpose.

> but which delimiter

Control characters. Like ctrl-A and stuff. Almost nobody has them in their data.