by scrollaway 1284 days ago
Parquet has the opposite problem of CSV though. It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.

To read a parquet file in Python, you need Apache Arrow and Pandas. And literally the second result for "parquet python libraries" is an article titled "How To Read Parquet Files In Python Without a Distributed Cluster".

I remember dealing with a Parquet file for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a Parquet file to CSV/JSON or read rows? Is it a limitation of the format itself?

4 comments

We data scientists are well-known for our exclusive mastery of data wrangling arcana, like…

  import pandas

  df = pandas.read_parquet('foo.parquet')
  df.to_csv('foo.csv')
  df.to_json('foo.json')
(no sarcasm)—how could it be simpler than that? What problems have you encountered that make it unusable?
Arrow and pandas are massive dependencies.
Not really. Depends on your use case but most of the time you’re trading off disk space for a specialized efficient library.

Pandas and Arrow are dependencies like any other. Pandas is like a DSL for working with tabular data, much like numpy is a DSL for working with arrays and numerical algebra. No one working with linear algebra will insist on using only the Python standard library built-ins.

If you’re distributing a smallish Python app that only needs to read and manipulate smallish amounts of data, then I agree there are easier solves like SQLite.

But if you’re doing consulting work and dealing with large tabular datasets and need to do SQL type window functions and aggregations then Parquet is a better fit and the disk space required for adding a Pandas dependency is trivial. If one is using Anaconda, Pandas is batteries included. It really depends on what is being optimized for.

> It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.

FWIW, in my experience at a "data analytics platform" company, it's reasonably popular for data-heavy workflows since Parquet is well-defined, and file sizes (especially as the amount of data grows) are a fraction of their CSV equivalents.

> Is it a limitation of the format itself?

I don't think so. In other languages, you can generally read/write Parquet files without a ton of dependencies (e.g. https://github.com/xitongsys/parquet-go).

> It's so complex to work with

This is the opposite of my experience.

> To read a parquet file in Python, you need Apache Arrow and Pandas.

Or DuckDB.

    import duckdb
    rel = duckdb.query("select * from 'a.parquet'")  # a relation; rel.fetchall() gives row tuples, rel.df() a pandas DataFrame
Want to look inside a Parquet file? Use Visidata.

    vd a.parquet
> I remember dealing with a Parquet file for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a Parquet file to CSV/JSON or read rows? Is it a limitation of the format itself?

Do you consider Pandas a "data science" stack? To me, it's just a library like any other that makes it easy to work with tabular data. Even for CSV, there is csv.reader in the standard library (it's usually not a good idea to parse CSV by hand). Outputting to CSV is literally a one-liner in Pandas or DuckDB.

   import pandas as pd

   # output to CSV
   pd.read_parquet("a.parquet").to_csv("a.csv") 

   # output to JSON (choose from any number of orientations)
   pd.read_parquet("a.parquet").to_json("a.json", orient="table")

   # read rows
   for row in pd.read_parquet("a.parquet").itertuples():
       print(row)
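For the DuckDB side of that one-liner claim, here is a rough sketch that avoids pandas entirely. The helper names (`copy_parquet_to_csv_sql`, `parquet_to_csv`) are made up for illustration, and it assumes the `duckdb` package is installed and that its `COPY ... TO` statement is used for the CSV output.

```python
def copy_parquet_to_csv_sql(src: str, dst: str) -> str:
    """Build a DuckDB COPY statement that rewrites a Parquet file as CSV.

    Hypothetical helper, not part of DuckDB itself.
    """
    # Double single quotes so the paths are safe inside SQL string literals.
    src_q = src.replace("'", "''")
    dst_q = dst.replace("'", "''")
    return f"COPY (SELECT * FROM '{src_q}') TO '{dst_q}' (HEADER, DELIMITER ',')"


def parquet_to_csv(src: str, dst: str) -> None:
    """Run the COPY statement on DuckDB's default in-memory connection."""
    import duckdb  # imported lazily so only this function needs the package

    duckdb.execute(copy_parquet_to_csv_sql(src, dst))
```

Calling `parquet_to_csv("a.parquet", "a.csv")` should write the CSV without ever building a DataFrame, which may matter if the pandas dependency is the sticking point.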
I want to use parquet more frequently, but it creates new problems that do not exist if I dump to CSV. Last I looked, there were not any good GUIs that would let someone quickly browse the data. Now it is just a blob lacking introspection. CSV has issues, but it is universal.
Not a GUI tool, but try Visidata for looking inside Parquet files (and other tabular formats).

https://www.visidata.org/

A bit roundabout, but the slick way I discovered is to take a detour through DuckDB. DuckDB offers Parquet bindings which you can link through a kind of foreign data interface and then query through SQL. Using this, you can then just browse Parquet files through DBeaver or your IDE of choice. Hardly an out-of-the-box solution I can offer to a random collaborator, but fantastic for your savvy analyst.
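One way that detour might look, as a sketch rather than the exact setup described above: create a view over the Parquet file inside a persistent DuckDB database file, then point DBeaver (or any client with a DuckDB driver) at that file. The helper names and the `browse.duckdb` / `events` names are assumptions for illustration; `read_parquet` is DuckDB's table function for Parquet files.

```python
def parquet_view_sql(view_name: str, parquet_path: str) -> str:
    """Build a CREATE VIEW statement exposing a Parquet file as a table.

    Hypothetical helper, not part of DuckDB itself.
    """
    # Escape quotes for the SQL identifier and the string literal respectively.
    name_q = view_name.replace('"', '""')
    path_q = parquet_path.replace("'", "''")
    return (
        f'CREATE OR REPLACE VIEW "{name_q}" AS '
        f"SELECT * FROM read_parquet('{path_q}')"
    )


def register_parquet(db_file: str, view_name: str, parquet_path: str) -> None:
    """Store the view in a DuckDB database file that a GUI client can open."""
    import duckdb  # imported lazily so only this function needs the package

    con = duckdb.connect(db_file)  # creates the file if it does not exist
    try:
        con.execute(parquet_view_sql(view_name, parquet_path))
    finally:
        con.close()
```

After something like `register_parquet("browse.duckdb", "events", "/data/events.parquet")`, opening `browse.duckdb` in the client should show `events` as a browsable view. The view only stores the query, so an absolute path to the Parquet file is probably the safer choice.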
That is interesting to hear. Parquet input and output is on the wishlist for our Easy Data Transform software (currently we support CSV, Excel, XML, JSON and a few others). Anyone have any experience integrating Parquet read/write into a C++ application?