by wesm 2152 days ago
This isn't accurate -- there are multiple query engine subprojects within Apache Arrow.

I am eternally indebted to you for Pandas. Many thanks for that.

Are you talking about there being support for multiple language libraries like PyArrow or about there being multiple Apache projects that utilize Arrow like Parquet and Spark?

If not, I'm not following what sub-projects you are speaking about. As far as I know, Arrow is principally the Arrow Columnar Format and Arrow Flight with some other potentially interesting interfaces for compute kernels and CUDA devices.

Am I missing something?

The Arrow project contains implementations in multiple languages. Some of these implementations contain code that can evaluate expressions against Arrow data, or even execute full queries. For example, the C++ and Rust implementations contain query capabilities, and the Java implementation contains the Gandiva library, which can delegate to C++ via JNI to evaluate expressions.

Is there some documentation for this on the Arrow website somewhere? I've been looking for info on the "compute engine" that's mentioned in this 1.0 announcement but haven't found much.

In general, where's the best place to learn more about Arrow? I've approached it several times and can find a lot about how to integrate it into other products, but little about tools like the query engines, which I would find very useful.