by taeric 1117 days ago
Meanwhile, far too many shops still use CSV. Hard not to see a new entry as slowing the move.

I am curious about the random access benefits. It seems most ML workloads will naturally scan their datasets fairly linearly. Does this maintain parity on that?


The random-access use cases came from our experience maintaining large-scale training data, where we needed to debug model performance against the data. That requires fast slicing and dicing of the dataset, filtering down to subsets that satisfy certain distributions, and visualizing them interactively in an internal debugging UI.

A few users also use random access for shuffling and for training on subsets.
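
To make the shuffling point concrete, here is a minimal sketch in plain Python. It is not Lance's API or file layout; `pack`, `read_at`, and `shuffled_subset` are hypothetical helpers, and the "file" is an in-memory buffer. The idea it illustrates: if records live in one blob alongside an (offset, length) table, a shuffled subset is just a handful of seeks rather than a full linear scan.

```python
import io
import random

def pack(records):
    """Concatenate byte records into one blob; return (blob, offsets).

    offsets[i] is the (start, length) of record i inside the blob.
    """
    buf = io.BytesIO()
    offsets = []
    for rec in records:
        offsets.append((buf.tell(), len(rec)))
        buf.write(rec)
    return buf.getvalue(), offsets

def read_at(f, offsets, i):
    """Random access: seek straight to record i and read only its bytes."""
    start, length = offsets[i]
    f.seek(start)
    return f.read(length)

def shuffled_subset(f, offsets, k, seed=0):
    """Pick k distinct record ids at random and fetch them in that order."""
    rng = random.Random(seed)
    ids = rng.sample(range(len(offsets)), k)
    return ids, [read_at(f, offsets, i) for i in ids]

# Variable-length toy records standing in for images / tensors (assumption).
records = [f"row-{i}".encode() * (i + 1) for i in range(100)]
blob, offsets = pack(records)

f = io.BytesIO(blob)  # stands in for a file or object-store handle
ids, rows = shuffled_subset(f, offsets, k=10)
```

The offset table is the extra space you pay for this: one small entry per record, on top of the data itself.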

Many of them migrated to Lance from a system where asset URLs are stored in your format of choice (say Parquet or TFRecord). Part of the reason is that physically co-locating the assets (e.g., images, large tensors, lidar point clouds) can give better scan performance than loading lots of small files from S3 or an on-prem object store or file system, because it puts much less load on the metadata server for the directory or key-value structures. (Some of our users hit issues similar to those in this decade-old article: https://blog.cloudera.com/the-small-files-problem/)

One motivation for designing Lance was to avoid creating a new copy of the training dataset just for one model or training iteration. The goal is a single copy of the data for maintenance (updates, schema evolution, deletion), analytics and visualization, and training.

For scanning, Lance has proven slightly faster than TFRecord and Parquet over object stores (S3 and GCS). I'd attribute the scan performance simply to Rust and tokio async I/O. It is not necessarily a better design for scans, since we only tried to scan "no more data than Parquet" when we designed the layouts and encoding algorithms.

Apologies for not getting back on this yesterday. Made the mistake of posting from my phone right before evening plans took over.

Looking at this, it seems you build a few indexes, such that I'm guessing those are the main drivers on the benefits? Makes sense, does it add to the space at all? As I said, most teams I work with are still on CSV, so even if this adds, I'm sure it is well below that.

At any rate, thanks for the response. Looks really nice!

No worries at all. For teams that are happy with CSV/JSON, I'd admit that Lance is not an ideal alternative.

> It seems you build a few indexes, such that I'm guessing those are the main drivers on the benefits?

Yeah, we are building different indices into this columnar storage format, which is actually a happy side effect of its good random-access performance. The indices do incur extra space.

Thanks for your kind words too!

I'd hazard that many of the teams aren't so much happy with CSV as they are ignorant of its costs. I fought for a bit to get them to move to Parquet, but all too often they insist on having it in a format that Excel can open.