Hacker News
by filereaper 3573 days ago
Asking the stupid question here, but why create a new Apache project for this?

Apache Arrow seems to be targeting the use of SIMD, which is a very JVM/runtime-dependent feature. If the runtime can't detect this out of the box, then create a recognized method or some sort of intrinsic to coax the runtime into SIMD-izing the operation.

I understand the performance gains of this, but why not add this functionality to existing projects like Parquet, HTable, etc.?

This just comes to mind: https://xkcd.com/927/

2 comments

The idea behind Apache Arrow (you can see this in the list of people supporting it) is to provide a common serialization/exchange format among different data science tools, languages, and platforms (Hadoop, Spark, pandas, R's data.table). Data scientists typically cobble together a pipeline across various tools to leverage their strengths (for example, using Spark to clean up data and then pandas for time-series analysis), and this often involves an expensive serialization/deserialization step at the boundaries. The goal of Arrow is to provide a near zero-cost format that all tools can support.
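
To make the "expensive step at the boundaries" point concrete, here is a rough sketch using only Python's standard library. It is not Arrow's API; the `prices` column and the use of `pickle` and `memoryview` are stand-ins I picked to contrast a copy-on-handoff boundary with a shared contiguous buffer.

```python
import pickle
from array import array

# Hypothetical column of float64 values, stored contiguously (columnar layout).
prices = array("d", [10.0, 10.5, 11.25, 9.75])

# Expensive boundary: serialize to bytes, then rebuild a fresh copy on the other side.
copied = pickle.loads(pickle.dumps(prices))

# Cheap boundary: a second consumer views the same buffer; no bytes are copied.
shared = memoryview(prices)

# The producer updates the column in place.
prices[0] = 99.0

# The deserialized copy is detached from the producer; the shared view is not.
print(copied[0], shared[0])
```

The copy costs time and memory proportional to the data size at every handoff, while the shared view costs roughly nothing as long as both sides agree on the layout. Agreeing on that layout across tools is the part a common format is meant to solve.
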
I don't know the answer, but in this case does "columnar store" imply that it is a collection of arrays, perhaps for a scientific database, and a bit different from HBase?

Here's someone else's blog post from 2010 on different categories of columnar store DBs:

http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-m...

That "someone else" is Daniel Abadi, one of the researchers who re-popularized the idea of column stores during his graduate work at MIT (in addition to researchers at CWI).