Hacker News new | ask | show | jobs
by aynyc 121 days ago
I read that. But afaik, feather format is stable now. Hence my confusion. I use parquet at work a lot, where we store a lot of time series financial data. We like it. Creating the Parquet data is a pain since it's not append-able.
2 comments

Generally Parquet files are combined in an LSM style, compacting smaller files into larger ones. Parquet isn't really meant for the "journal" of level-0 append-one-record style storage, it's meant for the levels that follow.
So feather for journaling and parquet for long term processing?
I still don't understand what happened to using Apache Avro [1] for row-oriented fast write use cases.

I think by now a lot of people know you can write to Avro and compact to Parquet, and that is a key area of development. I'm not sure of a great solution yet.

Apache Iceberg tables can sit on top of Avro files as one of the storage engines/formats, in addition to Parquet or even the old ORC format.

Apache Hudi[2] was looking into HTAP capabilities - writing in row store, and compacting or merge on read into column store in the background so you can get the best of both worlds. I don't know where they've ended up.

[1] https://avro.apache.org/

[2] https://hudi.apache.org/

You basically can't do row by row appends to any columnar format stored in a single file. You could kludge around it by allocating arenas inside the file but that's still a huge write amplification, instead of writing a row in a single block you'd have to write a block per column.
You can do row by row appends to a Feather (Arrow IPC — the naming is confusing). It works fine. The main problem is that the per-append overhead is kind of silly — it costs over 300 bytes (IIRC) per append.

I wish there was an industry standard format, schema-compatible with Parquet, that was actually optimized for this use case.

Creating a new record batch for a single row is also a huge kludge leading to lot of write amplification. At that point, you're better off storing rows than pretending it's columnar.

I actually wrote a row storage format reusing Arrow data types (not Feather), just laying them out row-wise not columnar. Validity bits of the different columns collected into a shared per-row bitmap, fixed offsets within a record allow extracting any field in a zerocopy fashion. I store those in RocksDB, for now.

https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...

https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...

https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...

> Creating a new record batch for a single row is also a huge kludge leading to lot of write amplification.

Sure, except insofar as I didn’t want to pretend to be columnar. There just doesn’t seem to be something out there that met my (experimental) needs better. I wanted to stream out rows, event sourcing style, and snarf them up in batches in a separate process into Parquet. Using Feather like it’s a row store can do this.

> kantodb

Neat project. I would seriously consider using that in a project of mine, especially now that LLMs can help out with the exceedingly tedious parts. (The current stack is regrettable, but a prompt like “keep exactly the same queries but change the API from X to Y” is well within current capabilities.)

Agreed.

There is room still for an open source HTAP storage format to be designed and built. :-)

Have you considered something like iceberg tables?
Yes, but parquet hates small files.
You can't compact? i.e. iceberg maintenance
We might be doing something wrong, but we saw significant performance degradation for both ingestion and query when doing compaction when it comes to finance data during trading hours.