


	by Kalanos 1792 days ago
	Arrow is just an intermediary

2 comments

awaythrowact 1792 days ago

What is your point? (I honestly do not understand.)

Blackbear's comment was that writing libraries in C allows those libraries to be deployed broadly in many compute environments.

Jakob's reply (as I understood it) was that outside of the big deep learning libraries, this has not really happened. There is no C implementation of Pandas that allows for redeployment in other non-Python compute contexts.

My point was that, with Arrow, this type of cross-platform compatibility is coming to Python dataframe libraries. You can prototype Dask code that runs on your laptop, then deploy it to a production Spark cluster, knowing the same Arrow engine is underpinning both. Or at least that's the vision. Obviously Arrow is still relatively young. But the point is, sticking with "all libraries are written in C" may well turn out to be the long-term global optimum for the ecosystem.


Kalanos 1792 days ago

In response to "rise of," I too was excited about Arrow until I played with it and realized it didn't even provide a shape attribute. Anyway, people shouldn't be dependent on a low-level language like C to write fast code.


awaythrowact 1792 days ago

Fair. I agree Arrow is still more of a vision than anything else.

> it didn't even provide a shape attribute

I suspect this has to do with the project's focus. I think they aspire to be a back-end to DataFrame libraries, which are generally 2-D. I think they (correctly) are ceding the "n-dimensional tensor computation" space to the current incumbents.
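
To make the "2-D back-end" idea concrete, here is a minimal sketch of a table stored as equal-length 1-D columns, with the shape derived from them. `ColumnarTable` is a hypothetical class written for illustration; it is not pyarrow's API and says nothing about what pyarrow does or does not expose.

```python
from array import array
from dataclasses import dataclass


@dataclass
class ColumnarTable:
    """Hypothetical 2-D table stored as named, equal-length 1-D columns."""

    columns: dict

    def __post_init__(self):
        # A columnar table is only rectangular if every column has the same length.
        lengths = {len(col) for col in self.columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")

    @property
    def shape(self):
        # (rows, columns): rows come from any one column, columns from the mapping.
        num_rows = len(next(iter(self.columns.values()), ()))
        return (num_rows, len(self.columns))


table = ColumnarTable({
    "id": array("q", [1, 2, 3]),
    "score": array("d", [0.5, 0.25, 1.0]),
})
print(table.shape)  # (3, 2)
```

In a layout like this, a shape is a thin convenience over the column count and row count, which is one way to read the focus on dataframes rather than general tensors.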


BadInformatics 1792 days ago

Arrow is getting support for N-D arrays, so if anything they're expanding in that area (which is exciting). I don't think they're interested in creating a universal libarrow though; the point of the data format and C data interface is to have languages define their own implementations.
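
As a rough illustration of what "define their own implementation" can look like, the sketch below mirrors the `ArrowArray` struct from the Arrow C data interface using only Python's `ctypes`, then fills it for a small int32 column. The field order follows the published struct definition; the helpers `make_int32_array` and `read_int32_array` are hypothetical names, and the `release` callback is left unset, so this is not a complete, spec-compliant producer.

```python
import ctypes


class ArrowArray(ctypes.Structure):
    """ctypes mirror of the ArrowArray struct (fields assigned below)."""


# Signature of the release callback: void (*release)(struct ArrowArray*)
ReleaseFn = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ReleaseFn),
    ("private_data", ctypes.c_void_p),
]


def make_int32_array(values):
    """Hypothetical helper: describe a non-nullable int32 column in an ArrowArray."""
    data = (ctypes.c_int32 * len(values))(*values)
    # Primitive layout: buffers[0] is the validity bitmap (NULL when there are
    # no nulls), buffers[1] is the contiguous data buffer.
    buffers = (ctypes.c_void_p * 2)(None, ctypes.addressof(data))
    arr = ArrowArray()
    arr.length = len(values)
    arr.null_count = 0
    arr.offset = 0
    arr.n_buffers = 2
    arr.n_children = 0
    arr.buffers = ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p))
    # Keep the Python-owned memory alive for as long as the struct is alive.
    arr._keepalive = (data, buffers)
    return arr


def read_int32_array(arr):
    """Hypothetical consumer: read the values back through the raw pointers."""
    ptr = ctypes.cast(arr.buffers[1], ctypes.POINTER(ctypes.c_int32))
    return [ptr[arr.offset + i] for i in range(arr.length)]


arr = make_int32_array([1, 2, 3])
print(read_int32_array(arr))  # [1, 2, 3]
```

The consumer side only needs the struct layout and the buffer conventions, not a shared library, which is the property the comment above is pointing at.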


awaythrowact 1792 days ago

I may be wrong. It happens a lot! But I think Arrow's vision encompasses compute, not just a data format and data interface.

https://www.slideshare.net/wesm/pycon-colombia-2020-python-f...

Slide 43: The "Arrow C++ Platform" encompasses a "Multi-core Work Scheduler" and a "Query Engine"

Slide 38: "It would be more productive (long-term) to have a reusable computational foundation for data frames"

Again, I agree that, today, it's more data format, and the shared compute stuff is more a vision.

EDIT: See also https://ursalabs.org/tech/


BadInformatics 1792 days ago

For sure, I didn't mean to imply they weren't looking at compute too! https://github.com/apache/arrow-datafusion is another example of the shared compute vision. What I was trying to point out is that (at least for Arrow core) they seem to eschew FFI and generating shared libraries in favour of from-scratch implementations in other compiled languages and direct bindings in interpreted ones.
