by derriz 2147 days ago
I'm not expecting anything really and I do appreciate your work and effort. And it's a specific use case for arrow, I guess.

But on your landing page, it's claimed "Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs." and that "Libraries are available for C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust." This certainly gave me the impression that more than just Python, C++ and R would be well supported.

The JVM isn't completely irrelevant in data science given the position of Spark/Scala. This also raised my expectations of arrow/parquet because it seems to be the de facto standard for table storage on this JVM platform. And I experienced no issues on that platform.

To be clear, I'm not blaming you for my design decision (I'm a software engineer, not a data scientist, btw), and I still think parquet/arrow rocks for Python, but in my experience it doesn't really deliver a usable "cross-language" file format at the moment.

4 comments

Again, I have to object to your use of “arrow/parquet”. These are not the same open source projects and while people use them together it isn’t fair to the developers of each project for you to discuss them like a single project.

FWIW, while the JVM isn't completely irrelevant in data, I will say, even as a big user of Spark via Scala, that JVM languages are quickly becoming irrelevant in data. Spark's Scala API is simultaneously the core of the platform, and also very much a second-class citizen that lacks a lot of important features that the Python API has. Easy interop with a good math library, for example.

Similarly, the reference implementation of Parquet may be in Java, but consuming it from a JVM language, outside of a Spark cluster, is still a royal pain. Whereas doing it from Python isn't too bad.

Long story short, I think that expecting a project that's just trying to implement a columnar memory format to also muck out the world's filthiest elephant pen is perhaps asking too much. Though perhaps a project like Arrow could serve as the cornerstone of an effort to douse it all with kerosene and make a fresh start.

I spent a couple of years doing consultancy for life sciences research labs; most people were just using Excel and Tableau, plugged into OLAP and SQL servers, alongside Java and .NET based stores.

Stuff like Arrow doesn't even come onto the radar of IT.

You do raise a very important point. At my organisation, Apache Avro was selected by the Java devs due to the "cross-platform" marketing. However, they found out, after it was too late, that the C/C++ implementations were too buggy/incomplete to effectively interoperate with the Java versions.

Keep in mind that Arrow Java<->C++/Python interop has been in production use in Apache Spark and elsewhere for multiple years now. We have avoided some of the mistakes of past projects by really emphasizing protocol integration tests across the implementations.

I believe that the Spark parquet library is available to be used in plain old Java: https://www.arm64.ca/post/reading-parquet-files-java/