by RobinL
1968 days ago
Yes. A second important point is that data tooling re-implements the same algorithms again and again, often in ways that are not particularly optimised, because each tool uses a different in-memory representation of data. Arrow offers the potential to do this once, and do it well. That way, future data analysis libraries (e.g. a hypothetical pandas 2) can concentrate on good API design without having to re-invent the wheel. A third is that Arrow allows data to be chunked and batched (within a particular tool), so computations can be streamed through memory rather than the whole dataframe needing to be held in memory at once. It's a little like how Spark partitions data and sends it to different nodes for computation, except all on the same machine. This also enables parallelisation by default. Given the core counts of modern CPUs, this means Arrow is likely to be extremely fast.
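To make the chunking point concrete, here is a rough pure-Python sketch of the idea. It is not Arrow's API: `batches`, `partial_sum`, `streamed_mean` and `parallel_mean` are hypothetical names made up for illustration, and plain lists stand in for Arrow's record batches. It only shows that a computation which reduces each batch to a small summary can be streamed, and that the same per-batch function can be handed to a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(values, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    it = iter(values)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def partial_sum(batch):
    """Reduce one batch to a small (sum, count) summary."""
    return sum(batch), len(batch)


def streamed_mean(values, batch_size=1000):
    """Mean of a non-empty input, holding only one batch in memory at a time."""
    total, count = 0, 0
    for batch in batches(values, batch_size):
        s, n = partial_sum(batch)
        total += s
        count += n
    return total / count


def parallel_mean(values, batch_size=1000, workers=4):
    """Same result, with the per-batch work handed to a thread pool.

    Note: pool.map consumes the whole generator up front, so this version
    gives up the streaming property in exchange for concurrency.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(partial_sum, batches(values, batch_size)))
    total = sum(s for s, _ in parts)
    count = sum(n for _, n in parts)
    return total / count


if __name__ == "__main__":
    print(streamed_mean(range(1, 101), batch_size=7))
    print(parallel_mean(range(1, 101), batch_size=7, workers=3))
```

Threads in CPython won't actually speed up this pure-Python arithmetic because of the GIL; the sketch is about the shape of the computation. A native library working on batches can run the per-batch step on separate cores for real.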