by mulmen 1120 days ago
I’m confused. Parquet is a file format. The reference implementation is in Java, but Rust implementations exist. Is this faster because of Rust or because of the file format? Could this format offer benefits in a Java environment?

Hey, co-author of Lance here. Lance is faster at random access because its layout and encodings were designed to be fast in both the scan and random-access cases. We borrowed many ideas from Google's Procella paper and Arrow's in-memory layout. We also added a bunch of I/O exec plan optimizations that assume large-blob columns (e.g., images, lidar point clouds) are present during scanning. Such columns do not exist in traditional OLAP systems, because their workloads are different from ML training.

Re-implementing Lance in Java should have very similar I/O characteristics. There are actually some efforts to support Lance in JVM / Spark data sources.

Hey Eddy, Arrow also allows you to serialize to disk and then utilize mmap. Compared to Parquet, the downside is that Arrow's design increases storage requirements. If you're borrowing elements of Arrow's layout, does that not come with all the same downsides as directly utilizing Arrow's serialization? And at that point, why not just use Arrow?
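
For readers following the mmap point, here is a minimal sketch of the trade-off in question. It uses only Python's standard library, not Arrow or Lance, and the helper names (`write_fixed_width`, `read_at`, `demo`) are hypothetical. It assumes a single int64 column stored uncompressed at 8 bytes per value, which is a simplification and not Arrow's actual file format: any row can be fetched by offset arithmetic, but every value costs the full 8 bytes on disk.

```python
import mmap
import os
import struct
import tempfile

VALUE_SIZE = 8  # bytes per little-endian int64 (assumed fixed-width layout)


def write_fixed_width(path, values):
    """Write int64 values back to back, with no compression or encoding."""
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack("<q", v))


def read_at(path, index):
    """Fetch one value by row index via mmap, using offset arithmetic only."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return struct.unpack_from("<q", mm, index * VALUE_SIZE)[0]


def demo():
    """Write a small column, then return (value at row 100, file size in bytes)."""
    values = list(range(0, 1000, 7))  # 143 small integers
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_fixed_width(path, values)
        return read_at(path, 100), os.path.getsize(path)
    finally:
        os.remove(path)
```

Here `demo()` reads row 100 without decoding any other row, while the file takes 8 bytes per value even though every value would fit in 2 bytes. That is the storage cost the question above refers to.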
Thanks for the clarification! That’s very exciting. A JVM implementation that can drop in to Spark/JVM would be great because there’s so much inertia built up around the Apache ecosystem.
Reminded me of deeplake. What is the comparative analysis?
We have not done benchmarks against deeplake yet. Deeplake has some interesting concepts in their design, and I'd be very interested in doing a benchmark soon.