by hxzhao 2237 days ago
Stavros, thanks for the explanation. How does TileDB avoid downloading the entire matrix and doing the slicing locally? Is this achieved by breaking a big matrix down into a set of smaller ones, so that you only download the subset that the current query needs? If so, what measures are currently in place to avoid mismatches in the metadata that links them together (e.g. after an error while uploading them to S3)? Thanks.
1 comment

Efficient slicing happens because of "tiling", hence the name TileDB. A tile is similar to an HDF5 or Zarr "chunk", or more loosely to a Parquet page. Tiling is fully configurable, but TileDB handles it entirely on its own, so the user doesn't need to know about it. A tile is the atomic unit of IO and compression. TileDB builds all the necessary metadata and indexing into its format and, given a query, knows how to fetch only the tiles that might contain results. Those tiles are decompressed in memory and filtered further to produce the actual results.

The dense array case is rather straightforward. The sparse case is a big differentiator for TileDB and is quite challenging, especially in the presence of updates. TileDB handles it via bulk-loaded R-trees for multi-dimensional indexing, and via an LSM-tree-like approach with immutable objects, which also allows time traveling.
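
To make the dense case concrete, here is a rough sketch in plain Python of how a reader could work out which tiles a slice touches and fetch only those. This is not TileDB's API or its actual implementation: `TILE`, `overlapping_tiles`, `read_slice` and the `fetch_tile` callback are hypothetical names made up for illustration, and the sketch assumes a 2D array cut into fixed-size tiles with inclusive query ranges.

```python
from itertools import product

# Hypothetical tile extents (rows, cols); a real array would configure these.
TILE = (100, 100)


def overlapping_tiles(query, tile_extents=TILE):
    """Return the coordinates of every tile that intersects the query box.

    query: one inclusive (start, end) index range per dimension.
    """
    per_dim = [
        range(lo // ext, hi // ext + 1)
        for (lo, hi), ext in zip(query, tile_extents)
    ]
    return list(product(*per_dim))


def read_slice(fetch_tile, query, tile_extents=TILE):
    """Assemble a 2D slice by fetching only the tiles that overlap it.

    fetch_tile(coords) stands in for "download and decompress one tile"
    and must return that tile as a list of rows.
    """
    (r0, r1), (c0, c1) = query
    th, tw = tile_extents
    out = [[None] * (c1 - c0 + 1) for _ in range(r1 - r0 + 1)]
    for tr, tc in overlapping_tiles(query, tile_extents):
        tile = fetch_tile((tr, tc))
        # Keep only the cells of this tile that fall inside the query box.
        for r in range(max(r0, tr * th), min(r1, tr * th + th - 1) + 1):
            for c in range(max(c0, tc * tw), min(c1, tc * tw + tw - 1) + 1):
                out[r - r0][c - c0] = tile[r - tr * th][c - tc * tw]
    return out
```

With 100x100 tiles, a query for rows 150 to 249 and columns 90 to 110 touches only four tiles, no matter how large the full matrix is. The final filtering loop corresponds to the "filtered further for the actual results" step: whole tiles are fetched, then trimmed to the requested cells.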

Concerning your point about potential errors on S3: this is addressed by TileDB's immutable object approach. If an error occurs during a write, the array is not corrupted. Happy to discuss this topic in a separate thread.
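
As a rough illustration of why write-once objects protect against partial uploads, here is a generic sketch of the pattern in Python. It is an assumption-laden toy and not TileDB's actual on-disk format or API: the `frag_<timestamp>` directory naming, the `ok` commit marker, the JSON data file, and the functions `write_fragment` and `read_array` are all hypothetical. The idea is that each write goes into its own new directory, existing data is never modified, and readers ignore any write that did not finish.

```python
import json
import os


def write_fragment(array_dir, timestamp, cells, fail_before_commit=False):
    """Write one batch of cells as a new, never-modified directory.

    cells: dict mapping a coordinate string such as "0,1" to a value.
    fail_before_commit simulates an upload error partway through.
    """
    frag = os.path.join(array_dir, f"frag_{timestamp:020d}")
    os.makedirs(frag)
    with open(os.path.join(frag, "data.json"), "w") as f:
        json.dump(cells, f)
    if fail_before_commit:
        raise IOError("simulated upload failure")
    # The marker is written last, so its presence means the write completed.
    open(os.path.join(frag, "ok"), "w").close()


def read_array(array_dir, as_of=None):
    """Merge all completed fragments, oldest first; newer values win.

    as_of: if given, ignore fragments written after this timestamp.
    """
    result = {}
    for name in sorted(os.listdir(array_dir)):
        frag = os.path.join(array_dir, name)
        ts = int(name.split("_")[1])
        if not os.path.exists(os.path.join(frag, "ok")):
            continue  # incomplete write: skip it, nothing else was touched
        if as_of is not None and ts > as_of:
            continue
        with open(os.path.join(frag, "data.json")) as f:
            result.update(json.load(f))
    return result
```

A failed write leaves behind a directory without a marker, which readers skip, so earlier data stays readable. Passing `as_of` gives a simple form of the time traveling mentioned above, since older fragments are never overwritten.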

Some related docs:

https://docs.tiledb.com/main/performance-tips/choosing-tilin...

https://docs.tiledb.com/main/basic-concepts/tile-filters#til...

https://docs.tiledb.com/main/basic-concepts/definitions/frag...