Hacker News new | ask | show | jobs
by RobinL 1968 days ago
Yes.

A second important point is the recognition that data tooling often re-implements the same algorithms again and again, often in ways which are not particularly optimised, because the in-memory representation of data is different between tools. Arrow offers the potential to do this once, and do it well. That way, future data analysis libraries (e.g. a hypothetical pandas 2) can concentrate on good API design without having to re-invent the wheel.

And a third is that Arrow allows data to be chunked and batched (within a particular tool), meaning that computations can be streamed through memory rather than the whole dataframe needing to be stored in memory. A little bit like how Spark partitions data and sends it to different nodes for computation, except all on the same machine. This also enables parallelisation by default. With the core count of CPUS this means Arrow is likely to be extremely fast.

1 comments

Re this second point: Arrow opens up a great deal of language and framework flexibility for data engineering-type tasks. Pre-Arrow, common kinds of data warehouse ETL tasks like writing Parquet files with explicit control over column types, compression, etc. often meant you needed to use Python, probably with PySpark, or maybe one of the other Spark API languages. With Arrow now there are a bunch more languages where you can code up tasks like this, with consistent results. Less code switching, lower complexity, less cognitive overhead.