Hacker News
by aynyc 129 days ago
What's the difference between feather and parquet in terms of usage? I get the design philosophy, but how would you use them differently?
3 comments

parquet is optimized for storage and compresses well (=> smaller files)

feather is optimized for fast reading

Given the cost of storage is getting cheaper, wouldn't most firms want to use feather for analytic performance? But everyone uses parquet.
You can still gain a lot of performance by doing less I/O.
There's definitely an "everyone uses it because everyone uses it" effect.

Feather might be a better fit for some use cases, but parquet has fantastic support and is still a pretty good choice for things that feather does.

Unless they're really focused on eking out every bit of read performance, people often opt for the well supported path instead.

What people have done in the face of cheaper storage is store more data.
Storage is cheap, but bandwidth is not.
Storage getting cheaper did not really reach the cloud providers and for self-hosting it has recently gotten even more expensive due to AI bs.
And now there's Lance! https://lance.org/
I read that. But afaik, feather format is stable now. Hence my confusion. I use parquet at work a lot, where we store a lot of time series financial data. We like it. Creating the Parquet data is a pain since it's not append-able.
Generally Parquet files are combined in an LSM style, compacting smaller files into larger ones. Parquet isn't really meant for the "journal" of level-0 append-one-record style storage, it's meant for the levels that follow.
So feather for journaling and parquet for long term processing?
I still don't understand what happened to using Apache Avro [1] for row-oriented fast write use cases.

I think by now a lot of people know you can write to Avro and compact to Parquet, and that is a key area of development. I'm not sure of a great solution yet.

Apache Iceberg tables can sit on top of Avro files as one of the storage engines/formats, in addition to Parquet or even the old ORC format.

Apache Hudi[2] was looking into HTAP capabilities - writing in row store, and compacting or merge on read into column store in the background so you can get the best of both worlds. I don't know where they've ended up.

[1] https://avro.apache.org/

[2] https://hudi.apache.org/

You basically can't do row by row appends to any columnar format stored in a single file. You could kludge around it by allocating arenas inside the file, but that's still a huge write amplification: instead of writing a row in a single block, you'd have to write a block per column.
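
To put rough numbers on the block-per-column point, here is a back-of-the-envelope sketch in plain Python. The 4 KiB block size, 8-byte values, 20 columns, and 1 MiB per-column arenas are all assumptions picked for illustration, not properties of any real format.

```python
BLOCK_SIZE = 4096  # assumed device/filesystem block size in bytes


def blocks_touched(offset: int, length: int, block_size: int = BLOCK_SIZE) -> int:
    """Count the distinct blocks covered by a write of `length` bytes at `offset`."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1


def row_append_blocks(num_columns: int, value_size: int, rows_written: int) -> int:
    """Row layout: the whole row is one contiguous write at the end of the file."""
    row_size = num_columns * value_size
    return blocks_touched(rows_written * row_size, row_size)


def columnar_append_blocks(num_columns: int, value_size: int,
                           rows_written: int, arena_size: int) -> int:
    """Columnar layout with one preallocated arena per column:
    each column's new value lands in a different region of the file."""
    total = 0
    for col in range(num_columns):
        offset = col * arena_size + rows_written * value_size
        total += blocks_touched(offset, value_size)
    return total


row_blocks = row_append_blocks(num_columns=20, value_size=8, rows_written=100)
col_blocks = columnar_append_blocks(num_columns=20, value_size=8,
                                    rows_written=100, arena_size=1 << 20)
print(f"row layout: {row_blocks} block(s), columnar layout: {col_blocks} block(s)")
```

Under these assumptions, a 160-byte row dirties one block in the row layout and twenty blocks in the columnar layout.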
You can do row by row appends to a Feather (Arrow IPC — the naming is confusing). It works fine. The main problem is that the per-append overhead is kind of silly — it costs over 300 bytes (IIRC) per append.

I wish there was an industry standard format, schema-compatible with Parquet, that was actually optimized for this use case.

Agreed.

There is room still for an open source HTAP storage format to be designed and built. :-)

Have you considered something like iceberg tables?
Yes, but parquet hates small files.
You can't compact? i.e. iceberg maintenance
We might be doing something wrong, but we saw significant performance degradation for both ingestion and queries when running compaction on finance data during trading hours.
Feather (Arrow IPC) is zero copy and an order of magnitude simpler. Parquet has a lot of compatibility issues between readers and writers.

Arrow is also directly usable as the application memory model. It’s pretty common to read Parquet into Arrow for transport.

When you say compatibility issues, you mean they are more problematic or less?

> It’s pretty common to read Parquet into Arrow for transport.

I'm confused by this. Are you referring to Arrow Flight RPC? Or are you saying distributed analytic engines use Arrow to transport Parquet data between queries?

Not the OP, but Parquet compatibility issues are usually due to the varying support of features across implementations. You have to take that into account when writing Parquet data (unless you go with the defaults which can be conservative and suboptimal).

Recently we have started documenting this to better inform choices: https://parquet.apache.org/docs/file-format/implementationst...