Apache Arrow – Powering Columnar In-Memory Analytics


	Apache Arrow – Powering Columnar In-Memory Analytics (arrow.apache.org)
	48 points by bertzzie 3617 days ago

5 comments

PDoyle 3616 days ago

Oops... The first sentence in the "Fast" section says "SIMD (Single input multiple data)".


filereaper 3616 days ago

Asking the stupid question here, but why create a new Apache project for this?

Apache Arrow seems to be targeting the use of SIMD, which is a very JVM/runtime-dependent feature. If the runtime can't detect this out of the box, then create a recognized method or some sort of intrinsic to coax the runtime into SIMD-izing the operation.

I understand the performance gains of this, but why not add this functionality to existing projects like Parquet or HTable?

This just comes to mind: https://xkcd.com/927/


infinite8s 3614 days ago

The idea behind Apache Arrow (you can see this in the list of people supporting it) is to provide a common serialization/exchange format among different data science tools, languages, and platforms (Hadoop, Spark, pandas, R's datatable). Data scientists typically cobble together a pipeline across various tools to leverage their strengths (for example, using Spark to clean up data and then pandas for time-series analysis), and this often involves an expensive serialization/deserialization step at the boundaries. The goal of Arrow is to provide a near-zero-cost format that all tools can support.
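
To make the boundary cost concrete, here is a minimal sketch using only the Python standard library. It does not use Arrow or its actual format; the `prices` column and the two "tools" are hypothetical. It contrasts a serialize/deserialize hand-off (which copies the data) with a hand-off where both sides agree on the byte layout and read the same buffer in place.

```python
import pickle
from array import array

# Hypothetical column of float64 values produced by "tool A".
prices = array("d", [10.5, 11.0, 9.75, 12.25])

# Hand-off via a private format: serialize, then deserialize.
# The consumer ends up with an independent copy of the data.
payload = pickle.dumps(prices)
restored = pickle.loads(payload)

# Hand-off via an agreed layout: the consumer reads the same bytes in place.
view = memoryview(prices)          # no copy, same underlying buffer
shared = view.cast("B").cast("d")  # raw bytes reinterpreted as float64

# A write by the producer is visible through the view but not in the copy.
prices[0] = 99.0
print(restored[0])
print(shared[0])
```

The copy made by `pickle` grows with the data size, while the `memoryview` costs the same regardless of how many values the column holds. That difference is the kind of saving the comment describes, under the assumption that both tools understand the same layout.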


rz2k 3615 days ago

I don't know the answer, but in this case does columnar store imply that it is a collection of arrays, perhaps for a scientific database, and a bit different than HBase?

Here's someone else's blog post from 2010 on different categories of columnar store DBs:

http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-m...


infinite8s 3614 days ago

That "someone else" is Daniel Abadi, one of the researchers who re-popularized the idea of column stores during his graduate work at MIT (in addition to researchers at CWI).


ljoshua 3616 days ago

Is this similar to how QlikView's in-memory engine works?


jandrewrogers 3616 days ago

Yes, almost all BI cache database engines are designed this way.


threeseed 3616 days ago

It really is a confusing title for the project. It's more of a high-speed interchange format, e.g. for sending data to Cassandra from Spark or Storm.

Nothing that end users will ever really have to know anything about.


axman6 3616 days ago

I'm confused, is this just Structure of Arrays as a service for columnar data? It's not clear to me what this actually does.


tveita 3615 days ago

It's not really a service at all; it is an in-memory data format intended to be shareable between processes. The project also includes libraries for C++, Java and Python.
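
As a rough illustration of the "structure of arrays" idea raised in the question, the sketch below lays the same records out row-wise and column-wise using only the Python standard library. It is not Arrow's actual format, and the field names `id` and `price` are made up for the example.

```python
from array import array

# Row-oriented ("array of structures"): one record object per row.
rows = [
    {"id": 1, "price": 10.5},
    {"id": 2, "price": 11.0},
    {"id": 3, "price": 9.75},
]

# Column-oriented ("structure of arrays"): one contiguous typed buffer per field.
columns = {
    "id": array("q", (r["id"] for r in rows)),       # signed 64-bit integers
    "price": array("d", (r["price"] for r in rows)),  # 64-bit floats
}

# An aggregate over one field scans a single contiguous buffer
# and never touches the other fields.
total = sum(columns["price"])

# Each column is a flat run of fixed-width values, so its raw bytes
# can be handed to another reader that knows the element type.
price_bytes = columns["price"].tobytes()
print(total, len(price_bytes))
```

Because each column is a flat run of fixed-width values, a reader that knows only the element type and length can interpret the bytes directly, which is what makes this kind of layout a plausible candidate for sharing between processes.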

This post explains the intention better than the project webpage:

http://blog.cloudera.com/blog/2016/02/introducing-apache-arr...
