HN Mirror



	by aynyc 129 days ago
	What's the difference between feather and parquet in terms of usage? I get the design philosophy, but how would you use them differently?

3 comments

tosh 129 days ago

parquet is optimized for storage and compresses well (=> smaller files)

feather is optimized for fast reading


aynyc 129 days ago

Given that storage keeps getting cheaper, wouldn't most firms want to use feather for analytic performance? But everyone uses parquet.


yencabulator 129 days ago

You can, still, gain a lot of performance by doing less I/O.

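A minimal sketch of the "less I/O" point, using only the Python standard library rather than real Parquet or Feather files. The record shape, the symbols, and all function names are made up for illustration. It stores the same records row by row and column by column, then counts how many bytes have to be read to sum a single field. It also compresses the price column with zlib as a rough stand-in for the compression a columnar format can apply per column.

```python
import struct
import zlib

# Hypothetical record: 4-byte symbol, float64 price, int32 volume (16 bytes).
ROW = struct.Struct("<4sdi")
SYMBOLS = [b"AAPL", b"MSFT", b"GOOG", b"AMZN"]


def make_rows(n):
    """Build n deterministic, made-up trade records."""
    return [(SYMBOLS[i % 4], 100.0 + (i % 50) * 0.25, (i * 7) % 1000)
            for i in range(n)]


def row_layout(rows):
    """Serialize records one after another, as a row store would."""
    return b"".join(ROW.pack(*r) for r in rows)


def column_layout(rows):
    """Serialize each field as its own contiguous chunk."""
    n = len(rows)
    return {
        "symbol": b"".join(r[0] for r in rows),
        "price": struct.pack(f"<{n}d", *(r[1] for r in rows)),
        "volume": struct.pack(f"<{n}i", *(r[2] for r in rows)),
    }


def sum_prices_from_rows(buf):
    """Every row has to be scanned to reach its price field."""
    total = sum(price for _, price, _ in ROW.iter_unpack(buf))
    return total, len(buf)


def sum_prices_from_columns(chunks):
    """Only the price chunk is read; the other columns are skipped."""
    price = chunks["price"]
    total = sum(struct.unpack(f"<{len(price) // 8}d", price))
    return total, len(price)


rows = make_rows(10_000)
row_total, row_bytes = sum_prices_from_rows(row_layout(rows))
col_total, col_bytes = sum_prices_from_columns(column_layout(rows))
compressed_price_bytes = len(zlib.compress(column_layout(rows)["price"]))
print(row_bytes, col_bytes, compressed_price_bytes)
```

Both paths compute the same sum, but the columnar path reads half the bytes in this toy layout, and fewer still if the column is stored compressed. How much real formats save depends on the data and the encodings chosen.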

benrutter 129 days ago

There's definitely an "everyone uses it because everyone uses it" effect.

Feather might be a better fit for some use cases, but parquet has fantastic support and is still a pretty good choice for the things that feather does.

Unless they're really focussed on eking out every bit of read performance, people often opt for the well supported path instead.


outside1234 129 days ago

What people have done in the face of cheaper storage is store more data.


vb-8448 128 days ago

Storage is cheap, but bandwidth is not.


farsa 129 days ago

Storage getting cheaper did not really reach the cloud providers and for self-hosting it has recently gotten even more expensive due to AI bs.


twic 129 days ago

And now there's Lance! https://lance.org/


dionian 129 days ago

https://stackoverflow.com/questions/48083405/what-are-the-di...


aynyc 129 days ago

I read that. But afaik, feather format is stable now. Hence my confusion. I use parquet at work a lot, where we store a lot of time series financial data. We like it. Creating the Parquet data is a pain since it's not append-able.


yencabulator 129 days ago

Generally Parquet files are combined in an LSM style, compacting smaller files into larger ones. Parquet isn't really meant for the "journal" of level-0 append-one-record style storage, it's meant for the levels that follow.

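A rough sketch of the LSM-style compaction described above, in plain Python. Sorted lists stand in for files; in a real system level 0 would be a row-oriented journal and the higher levels would be Parquet files. The class name, the `fanout` parameter, and the record shape are all hypothetical.

```python
import heapq


def compact(files):
    """Merge several small sorted files into one larger sorted file."""
    return list(heapq.merge(*files))


class TinyLSM:
    def __init__(self, fanout=4):
        self.fanout = fanout  # files per level that trigger a compaction
        self.levels = [[]]    # levels[0] holds the smallest, newest files

    def flush(self, records):
        """Write one small sorted file to level 0, then compact if needed."""
        self.levels[0].append(sorted(records))
        self._maybe_compact(0)

    def _maybe_compact(self, level):
        if len(self.levels[level]) < self.fanout:
            return
        merged = compact(self.levels[level])
        self.levels[level] = []
        if level + 1 == len(self.levels):
            self.levels.append([])
        self.levels[level + 1].append(merged)
        # The merged file may fill up the next level in turn.
        self._maybe_compact(level + 1)


lsm = TinyLSM(fanout=4)
for batch in range(16):
    lsm.flush([(batch * 10 + i, f"tick-{batch}-{i}") for i in range(10)])
print([len(level) for level in lsm.levels])  # prints [0, 0, 1]
```

After sixteen flushes of ten records each, everything has been folded into a single 160-record file at level 2, which is the kind of large file Parquet is comfortable with.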

aynyc 129 days ago

So feather for journaling and parquet for long term processing?


sixdimensional 129 days ago

I still don't understand what happened to using Apache Avro [1] for row-oriented fast write use cases.

I think by now a lot of people know you can write to Avro and compact to Parquet, and that is a key area of development. I'm not sure of a great solution yet.

Apache Iceberg tables can sit on top of Avro files as one of the storage engines/formats, in addition to Parquet or even the old ORC format.

Apache Hudi[2] was looking into HTAP capabilities - writing in row store, and compacting or merge on read into column store in the background so you can get the best of both worlds. I don't know where they've ended up.

[1] https://avro.apache.org/

[2] https://hudi.apache.org/


yencabulator 129 days ago

You basically can't do row by row appends to any columnar format stored in a single file. You could kludge around it by allocating arenas inside the file but that's still a huge write amplification, instead of writing a row in a single block you'd have to write a block per column.

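To put a number on the write amplification argument, here is a small sketch that counts which disk blocks one appended row touches in a row layout versus a hypothetical "arena per column" layout inside a single file. The 4096-byte block size, the column widths, and the 1 MiB arena size are assumptions picked for illustration, not properties of any real format.

```python
def touched_blocks(writes, block_size=4096):
    """Return the set of block indices covered by (offset, length) writes."""
    blocks = set()
    for offset, length in writes:
        first = offset // block_size
        last = (offset + length - 1) // block_size
        blocks.update(range(first, last + 1))
    return blocks


def row_append_writes(row_index, col_sizes):
    """Row layout: the whole row is one contiguous write."""
    row_size = sum(col_sizes)
    return [(row_index * row_size, row_size)]


def columnar_append_writes(row_index, col_sizes, arena_size):
    """Columnar layout: one small write at the tail of each column's arena."""
    return [(c * arena_size + row_index * size, size)
            for c, size in enumerate(col_sizes)]


col_sizes = [8, 4, 8, 4]  # hypothetical fixed-width columns, 24 bytes per row
arena_size = 1 << 20      # hypothetical 1 MiB arena reserved per column

row_blocks = touched_blocks(row_append_writes(100, col_sizes))
col_blocks = touched_blocks(columnar_append_writes(100, col_sizes, arena_size))
print(len(row_blocks), len(col_blocks))  # prints 1 4
```

Appending the 24-byte row dirties one block in the row layout and one block per column in the arena layout, so the device writes four blocks instead of one. A row that straddles a block boundary would touch two blocks in the row layout, but the gap still grows with the number of columns.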

amluto 129 days ago

You can do row by row appends to a Feather file (Arrow IPC; the naming is confusing). It works fine. The main problem is that the per-append overhead is kind of silly: it costs over 300 bytes (IIRC) per append.

I wish there was an industry standard format, schema-compatible with Parquet, that was actually optimized for this use case.

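A back-of-the-envelope sketch of why that per-append cost matters. The 300-byte figure is the commenter's recollection, not a measured value, and the 24-byte row size is a made-up example.

```python
def overhead_fraction(rows_per_append, row_bytes=24, per_append_overhead=300):
    """Share of written bytes that is framing overhead rather than row data."""
    payload = rows_per_append * row_bytes
    return per_append_overhead / (payload + per_append_overhead)


for rows_per_append in (1, 10, 100, 1000):
    share = overhead_fraction(rows_per_append)
    print(f"{rows_per_append:>5} rows per append: {share:.1%} overhead")
```

Under these assumptions, appending one row at a time means most of the bytes written are overhead, while batching a thousand rows per append brings the share down to roughly one percent.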

gregw2 129 days ago

Agreed.

There is room still for an open source HTAP storage format to be designed and built. :-)


dionian 129 days ago

Have you considered something like iceberg tables?


aynyc 129 days ago

Yes, but parquet hates small files.


dionian 129 days ago

You can't compact? i.e. iceberg maintenance


aynyc 128 days ago

We might be doing something wrong, but we saw significant performance degradation in both ingestion and queries when running compaction on finance data during trading hours.


willtemperley 128 days ago

Feather (Arrow IPC) is zero copy and an order of magnitude simpler. Parquet has a lot of compatibility issues between readers and writers.

Arrow is also directly usable as the application memory model. It’s pretty common to read Parquet into Arrow for transport.


aynyc 128 days ago

When you say compatibility issues, you mean they are more problematic or less?

> It’s pretty common to read Parquet into Arrow for transport.

I'm confused by this. Are you referring to Arrow Flight RPC? Or are you saying that distributed analytic engines use Arrow to transport Parquet data between queries?


pitrou 128 days ago

Not the OP, but Parquet compatibility issues are usually due to the varying support of features across implementations. You have to take that into account when writing Parquet data (unless you go with the defaults which can be conservative and suboptimal).

Recently we have started documenting this to better inform choices: https://parquet.apache.org/docs/file-format/implementationst...
