by snidane 1703 days ago

Partially.

The problem with Apache Arrow and Parquet is that you have two formats, one for storage and one for computation, but in the end you only want one for both. You want to run fast algorithms on memory-mapped compressed columns, not do this stupid deserialization from Parquet to Arrow.
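To make the two paths concrete, here is a minimal sketch using only the Python standard library. It is a toy layout, not Arrow or Parquet: the file names and the single int64 column are made up for illustration, and the "storage" file is just zlib-compressed bytes. It only shows the difference between decoding a column into a fresh buffer and computing on a memory-mapped one in place.

```python
import mmap
import os
import tempfile
import zlib
from array import array

# Hypothetical column of 1,000 int64 values.
values = array("q", range(1_000))

with tempfile.TemporaryDirectory() as tmp:
    # Storage-style file: compact on disk, but must be decoded before use.
    packed_path = os.path.join(tmp, "column.z")
    with open(packed_path, "wb") as f:
        f.write(zlib.compress(values.tobytes()))

    # Compute-style file: raw fixed-width values, usable as they sit on disk.
    raw_path = os.path.join(tmp, "column.raw")
    with open(raw_path, "wb") as f:
        f.write(values.tobytes())

    # Path 1: read, decompress, and copy into a new in-memory column.
    with open(packed_path, "rb") as f:
        decoded = array("q")
        decoded.frombytes(zlib.decompress(f.read()))
    total_decoded = sum(decoded)

    # Path 2: memory-map the file and view its bytes as int64 without copying.
    with open(raw_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            with memoryview(mm) as base, base.cast("q") as view:
                total_mapped = sum(view)

print(total_decoded == total_mapped)
```

The second path skips the decode step but works on an uncompressed file, which is exactly the trade-off: a single format for both jobs would need column encodings that algorithms can run on directly.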

Parquet and Arrow are designed by committee and, as a result, try to accomplish too much. While that's good for some cases, my prediction is that a data processing system will exist in the future whose file format supports both storage and computation and is good enough for most data-intensive applications. Like JSON, it will not be feature complete, but it will be good enough. From then on, some devs will complain about adding this or that feature to the format, but the majority will be happy, as they are now with JSON. Such a format can only come from industry, not from a committee.

1 comment

liuliu 1703 days ago

Right. That's why I am more interested in Arrow than Parquet. Going from a pure compressed storage format to one that incorporates computation would be more difficult than going from a memory-mapped computation format to long-term storage. Arrow already made some good choices regarding data exchange over the wire, and these are translatable to data exchange over time.

Of course, I am only dealing with a few hundred GiB of data, so I'm not sure whether Arrow fails at larger scale.
