by Kalanos 1792 days ago
Arrow is just an intermediary

What is your point? (I honestly do not understand.)

Blackbear's comment was that writing libraries in C allows those libraries to be deployed broadly in many compute environments.

Jakob's reply (as I understood it) was that outside of the big deep learning libraries, this has not really happened. There is no C implementation of Pandas that allows for redeployment in other non-Python compute contexts.

My point was that, with Arrow, this type of cross-platform compatibility is coming to Python dataframe libraries. You can prototype Dask code that runs on your laptop, then deploy it to a production Spark cluster, knowing the same Arrow engine underpins both. Or at least that's the vision. Obviously, Arrow is still relatively young. But the point is that sticking with "all libraries are written in C" may well turn out to be the long-term global optimum for the ecosystem.

In response to "rise of," I too was excited about Arrow until I played with it and realized it didn't even provide a shape attribute. Anyways, people shouldn't be dependent on a low-level language like C to write fast code.

Fair. I agree Arrow is still more of a vision than anything else.

> it didn't even provide a shape attribute

I suspect this has to do with the project's focus. I think they aspire to be a back-end to DataFrame libraries, which are generally 2d. I think they (correctly) are ceding the "n-dimensional tensor computation" space to the current incumbents.

Arrow is getting support for N-d arrays, so if anything they're expanding in that area (which is exciting). I don't think they're interested in creating a universal libarrow, though; the point of the data format and C data interface is to have languages define their own implementations.

I may be wrong. It happens a lot! But I think Arrow's vision encompasses compute, not just a data format and data interface.

https://www.slideshare.net/wesm/pycon-colombia-2020-python-f...

Slide 43: The "Arrow C++ Platform" encompasses a "Multi-core Work Scheduler" and a "Query Engine"

Slide 38: "It would be more productive (long-term) to have a reusable computational foundation for data frames"

Again, I agree that, today, it's more data format, and the shared compute stuff is more a vision.

EDIT: See also https://ursalabs.org/tech/

For sure, I didn't mean to imply they weren't looking at compute too! https://github.com/apache/arrow-datafusion is another example of the shared compute vision. What I was trying to point out is that (at least for Arrow core) they seem to eschew FFI and generating shared libraries in favour of from-scratch implementations in other compiled languages and direct bindings in interpreted ones.
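
To make the "shared format, independent implementations" idea concrete, here is a minimal toy sketch in plain Python. It is not Arrow's API and not a complete rendering of its spec. It only borrows the general shape of a nullable int32 column (a validity bitmap with least-significant-bit ordering plus a contiguous little-endian values buffer). The function names `write_int32_column` and `read_int32_column` are hypothetical, made up for illustration. The point is that the reader shares no code with the writer, only the agreed byte layout.

```python
import struct


def write_int32_column(values):
    """Producer side: encode ints/None as (validity bitmap, values buffer)."""
    n = len(values)
    bitmap = bytearray((n + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            # Bit i set means slot i holds a valid value (LSB-first order).
            bitmap[i // 8] |= 1 << (i % 8)
    # Null slots still occupy 4 bytes; their content is arbitrary (0 here).
    data = struct.pack(f"<{n}i", *(0 if v is None else v for v in values))
    return bytes(bitmap), data


def read_int32_column(bitmap, data):
    """Consumer side: relies only on the byte layout, not on the writer's code."""
    out = []
    for i, (v,) in enumerate(struct.iter_unpack("<i", data)):
        valid = (bitmap[i // 8] >> (i % 8)) & 1
        out.append(v if valid else None)
    return out


bitmap, data = write_int32_column([1, None, 3])
print(read_int32_column(bitmap, data))  # [1, None, 3]
```

In this toy, the reader could just as well be written in Rust or Java with no FFI involved, which is roughly the trade-off being discussed: agree on bytes rather than on a shared library.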