by eddyxu
1122 days ago
The use cases for random access came from our experience maintaining large-scale training data, where we need to debug model performance against the data. That requires fast slicing and dicing over the dataset, filtering it into subsets that satisfy certain distributions, and visualizing them interactively in an internal debugging UI. A few users also use random access for shuffling and for training on subsets.

Many of them migrated to Lance from a system where you store asset URLs in your format of choice (say Parquet/TFRecord), partly because putting assets (e.g., images, large tensors, lidar point clouds) physically together can give better scan performance than loading lots of small files from S3 or an on-prem object store / file system, due to much lower metadata server load over the directory / key-value structures. (Some of our users see issues similar to those in this decade-old article: https://blog.cloudera.com/the-small-files-problem/)

One motivation for designing Lance is to avoid creating a new copy of the training dataset just for one model / training iteration. There is one copy of the data for maintenance (updates / schema evolution / deletion), analytics & visualization, and training.

For scanning, Lance has proven slightly faster than TFRecord and Parquet over object stores (S3 and GCS). I'd attribute the scan performance just to Rust and tokio async I/O. It is not necessarily a better design for scans, as we only tried to scan "no more data than Parquet" when we designed the layouts / encoding algorithms.
|
Looking at this, it seems you build a few indexes, so I'm guessing those are the main drivers of the benefits? Makes sense. Do they add to the storage space at all? As I said, most teams I work with are still on CSV, so even if they do, I'm sure the total is still well below that.
At any rate, thanks for the response. Looks really nice!