HN Mirror


by meehai 902 days ago

Can confirm that Parquet files are nice to work with. Before this, we've worked for ~1 year with CSVs (I know the horror) and we made an effort to port all the 'legacy' code to Parquet files.

We mostly interface with BigQuery (via Airflow), and except for one very annoying situation it's a big improvement in terms of speed (parsing floats after querying the DB is NEVER a good option).

---

In case anyone's wondering, the annoying situation is basically storing and loading native numpy arrays in BigQuery via the python client(s).

You have a bunch of options (assume you have one or more cols with float32 numpy arrays):

- dataframe -> to_parquet -> upload to GCS -> GCSToBigQueryOperator (https://airflow.apache.org/docs/apache-airflow-providers-goo...)

  -> instead of being stored as `FLOAT, REPEATED`, the column is stored as a STRUCT with a structure of `list>item` OR `list>element` (pyarrow==11 OR pyarrow==13). You then have to manually parse this 'json structure' back into an np.array after querying the DB, which is slow; basically you are using CSVs again.

  -> Read more: https://stackoverflow.com/questions/68303327/unnecessary-list-item-nesting-in-bigquery-schemas-from-pyarrow-upload-dataframe

  -> set the schema before uploading? Nope, all values will be uploaded as null in BQ.

- dataframe -> bigquery.Client -> upload the dataframe from python

  - very slow, and you need to batch your data (think 24h vs 5 minutes kind of slow as dataframe sizes increase, plus you either keep all the data in memory or batch it, which means an extra save/load of each batch before uploading)

  - arrays are stored properly

- solution: you must do 2 things, one on the pyarrow side and one on the BigQuery side

  - `df.to_parquet(..., use_compliant_nested_type=True)` (in pyarrow==14 it's True by default, but airflow needs pyarrow==11, where it's False by default)

  - use `enable_list_inference=True` (link: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#list_logical_type)

  - when both of these are set (i.e. save parquet files [to GCS] using the first flag and load parquet files [from GCS to BQ] using the second), arrays can be stored as (FLOAT, REPEATED) and queried as numpy arrays out of the box, without any manual management.

This took me like 1 week of debugging and reading source code, obscure SO comments and GH issues etc.

1 comment

hermitcrab 902 days ago

>Before this, we've worked for ~1 year with CSVs (I know the horror)

The CSV format (or lack thereof) is such a mess. You don't appreciate how much until you have to write a CSV parser and do real-world support for it. Ugh.
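A tiny illustration of the kind of thing that bites you, using only the Python standard library. The sample data is invented; it just shows a quoted comma and a quoted newline defeating a naive split.

```python
import csv
import io

# Invented sample: one field contains a comma, another contains a newline.
raw = 'id,comment\n1,"hello, world"\n2,"line one\nline two"\n'

# Naive approach: split on newlines, then on commas.
naive_rows = [line.split(",") for line in raw.splitlines()]

# csv.reader honours the quoting, so quoted commas and newlines stay in the field.
parsed_rows = list(csv.reader(io.StringIO(raw, newline="")))

if __name__ == "__main__":
    print(len(naive_rows), "rows from naive splitting")
    print(len(parsed_rows), "rows from csv.reader")
```

And that is before you get to delimiters, encodings and quoting styles that differ between producers.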
