by taeric
1117 days ago
Meanwhile, far too many shops still use CSV. Hard not to see a new entry as slowing the move. I am curious about the random access benefits. It seems most ML workloads will naturally scan their data sets fairly linearly. Does this maintain parity on that?

A few of our users use random access for shuffling and for training on subsets as well.
Many of them migrated to Lance from a system where you store asset URLs in your format of choice (say Parquet/TFRecord). That is partially because putting assets (e.g., images, large tensors, lidar point clouds) physically together can lead to better scan performance than loading a lot of small files from S3 or an on-prem object store / file system, due to much less metadata server load over the directory / key-value structures. (Some of our users see issues similar to the ones in this decade-old article: https://blog.cloudera.com/the-small-files-problem/)
One motivation for designing Lance is to avoid creating a new copy of the training dataset just for one model / training iteration. There is one copy of the data for maintenance (updates / schema evolution / deletion), analytics & visualization, and training.
For scanning, Lance has proven slightly faster than TFRecord and Parquet over object stores (S3 and GCS). I'd attribute the scan performance just to Rust and tokio async I/O. It is not necessarily a better design for scans, as we just tried to scan "no more data than Parquet" when we designed the layouts / encoding algorithms.
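
To make the random-access point a bit more concrete, here is a minimal sketch of the general idea, not Lance's actual format or API: many small assets are packed into one buffer with an offset index, and a shuffled training pass reads rows by seeking instead of opening one small file per asset. The names `pack`, `take`, and `shuffled_batches` are hypothetical and exist only for this illustration.

```python
import io
import random


def pack(blobs):
    """Concatenate blobs into one buffer; return the buffer and row offsets."""
    buf = io.BytesIO()
    offsets = [0]
    for blob in blobs:
        buf.write(blob)
        offsets.append(buf.tell())  # end of this row == start of the next
    return buf, offsets


def take(buf, offsets, indices):
    """Random access: seek to each requested row and read only its bytes."""
    rows = []
    for i in indices:
        buf.seek(offsets[i])
        rows.append(buf.read(offsets[i + 1] - offsets[i]))
    return rows


def shuffled_batches(buf, offsets, batch_size, seed=0):
    """Yield batches in shuffled row order without making a shuffled copy."""
    order = list(range(len(offsets) - 1))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield take(buf, offsets, order[start:start + batch_size])


# Ten variable-sized "assets" stored physically together.
blobs = [bytes([i]) * (i + 1) for i in range(10)]
buf, offsets = pack(blobs)
subset = take(buf, offsets, [7, 2])  # training on a subset
batches = list(shuffled_batches(buf, offsets, batch_size=4))
```

The same single copy serves a linear scan (read the buffer front to back) and a shuffled or subset read (seek by offset), which is the property described above.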