
by wenc 1330 days ago

> It's so complex to work with

This is the opposite of my experience.

> To read a parquet file in Python, you need Apache Arrow and Pandas.

Or DuckDB.

    import duckdb
    # query() returns a relation; call .fetchall() for tuples or .df() for a pandas DataFrame
    rel = duckdb.query("select * from 'a.parquet'")

Want to look inside a Parquet file? Use Visidata.

    vd a.parquet

> I remember dealing with Parquet files for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a Parquet file to CSV/JSON or read rows? Is it a limitation of the format itself?

Do you consider Pandas a "data science" stack? To me, it's just a library like any other that makes it easy to work with tabular data. Even for CSV, there is Python's csv.reader (parsing CSV by hand is usually not a good idea). Outputting to CSV is literally a one-liner in Pandas or DuckDB.

    import pandas as pd

    # output to CSV
    pd.read_parquet("a.parquet").to_csv("a.csv")

    # output to JSON (choose from any number of orientations)
    pd.read_parquet("a.parquet").to_json("a.json", orient="table")

    # read rows
    for row in pd.read_parquet("a.parquet").itertuples():
        print(row)