by eddyxu 1113 days ago
Hey, co-author of Lance here. Lance is faster at random access because its layout and encodings were designed to be fast for both scans and random access. We borrowed many ideas from Google's Procella paper and from Arrow's in-memory layout. We also added a number of I/O execution-plan optimizations that assume the data contains large-blob columns (e.g., images, lidar point clouds) during scanning. Traditional OLAP systems do not have such columns, because their workloads are different from ML training.

A Java re-implementation of Lance should have very similar I/O characteristics. There are actually some efforts underway to support Lance as a JVM / Spark data source.

3 comments

Hey Eddy, Arrow also allows you to serialize to disk and then use mmap. Compared to Parquet, the downside is that Arrow's design increases storage requirements. If you're borrowing elements of Arrow's layout, doesn't that come with the same downsides as directly using Arrow's serialization? And at that point, why not just use Arrow?
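To make the trade-off in this question concrete, here is a minimal sketch using only the Python standard library. It is not Arrow's or Lance's actual file format; the helpers `write_column`, `read_at`, and `demo` are hypothetical names for illustration. It assumes a single int64 column stored uncompressed at a fixed width, which is the property that makes mmap-based random access cheap and also the reason the file is larger than a compressed encoding would be.

```python
import mmap
import os
import struct
import tempfile

# Assumption for illustration: one little-endian int64 per value, no compression.
ITEM = struct.Struct("<q")


def write_column(path, values):
    """Write values back to back at a fixed width of ITEM.size bytes each."""
    with open(path, "wb") as f:
        for v in values:
            f.write(ITEM.pack(v))


def read_at(path, index):
    """Fetch one value by computing its byte offset in the memory-mapped file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return ITEM.unpack_from(m, index * ITEM.size)[0]


def demo():
    """Write 1000 values, read one back, and report the file size in bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_column(path, range(1000))
        return read_at(path, 742), os.path.getsize(path)
    finally:
        os.remove(path)
```

Calling `demo()` reads a single value without decoding its neighbours, and the file takes 8 bytes per value no matter how compressible the data is. A compressed columnar encoding would typically store the same sequence in less space but would need to decode a whole block to return one row.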
Thanks for the clarification! That’s very exciting. A JVM implementation that can drop in to Spark/JVM would be great because there’s so much inertia built up around the Apache ecosystem.
This reminded me of deeplake. How do the two compare?
We have not run benchmarks against deeplake yet. Deeplake has some interesting concepts in its design, and I'd be very interested in running a benchmark soon.