by stavrospap 2237 days ago

Stavros from TileDB here. TL;DR, it's all about fast slicing on multiple columns while supporting updates, locally or in the cloud.

Suppose you serialize your dataframe in HDF5 or Redis. Let the dataframe have schema (Date, Stock, Price), and assume it is 1 TB in size and stored on S3, GCS or Azure (as they are cheap). How would you efficiently compute the average Price for a specific Date range and Stock symbol? With HDF5 you would have to download the full 1 TB (there is no notion of "fast slicing on variable predicates") and apply the predicates locally. If you stored the dataframe in Parquet (a better choice for this use case), you could build some logic into your code that uses the Parquet metadata/indexes to prune a lot of unnecessary data (as Spark does). However, Parquet is "one-dimensional", i.e., your pruning would be efficient on Date but not on Stock (you'd have to "partition" your Parquet files with Spark or Hive, and things could get quite complicated). Most importantly, you wouldn't be able to update the Parquet files in place; you would have to generate new files and build a catalog on top (or use services like Delta Lake) to manage them, which is an extremely cumbersome task.
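To make the "one-dimensional" point concrete, here is a toy sketch in plain Python. It is not Parquet's API: `RowGroup`, `build_row_groups` and `groups_to_read` are hypothetical names, and the model simply assumes the rows are sorted by Date and that each group keeps min/max statistics per column, which is roughly the kind of metadata a reader can prune on.

```python
from dataclasses import dataclass
from datetime import date, timedelta

DATE, STOCK, PRICE = 0, 1, 2  # column positions in each row tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a block of rows with min/max column stats."""
    rows: list  # list of (date, stock, price) tuples

    def stats(self, col):
        values = [row[col] for row in self.rows]
        return min(values), max(values)


def build_row_groups(rows, group_size):
    # Assumption: data is sorted by Date first (then Stock) before being cut
    # into fixed-size groups.
    ordered = sorted(rows)
    return [RowGroup(ordered[i:i + group_size])
            for i in range(0, len(ordered), group_size)]


def groups_to_read(groups, col, lo, hi):
    """Keep only groups whose [min, max] range on `col` overlaps [lo, hi]."""
    selected = []
    for group in groups:
        col_min, col_max = group.stats(col)
        if col_max >= lo and col_min <= hi:
            selected.append(group)
    return selected


stocks = ["AAPL", "GOOG", "MSFT", "TSLA"]
start = date(2020, 1, 1)
rows = [(start + timedelta(days=d), s, 100.0 + d)
        for d in range(8) for s in stocks]
groups = build_row_groups(rows, group_size=4)  # one group per day here

by_date = groups_to_read(groups, DATE, date(2020, 1, 2), date(2020, 1, 3))
by_stock = groups_to_read(groups, STOCK, "MSFT", "MSFT")
print(len(groups), len(by_date), len(by_stock))  # prints: 8 2 8
```

The Date predicate narrows the read to 2 of 8 groups, while the Stock predicate prunes nothing, because every group spans the whole range of symbols.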

TileDB abstracts everything for you, while allowing you to slice fast on any number of columns. You just define Date and Stock as "dimensions", and slicing on both those columns becomes uber efficient locally or in the cloud. Effectively, you turn this dataframe into a sparse 2D array. Updates and time traveling are handled by TileDB. You get to use Spark, Dask, MariaDB and PrestoDB as you did before, but there is no need for Hive, Delta Lake or any other cataloging service. Thank you for pointing out the confusion though. We just launched and we have tons of examples coming up.

hxzhao 2237 days ago

Stavros, thanks for the explanation. How does TileDB avoid downloading the entire matrix while still doing the slicing (locally)? Is this achieved by breaking a big matrix down into a set of smaller ones, so that you only download the subset that the current query needs? If so, what measures are currently in place to avoid mismatches in the metadata that links them together (e.g. after an error while uploading them to S3)? Thanks.

stavrospap 2236 days ago

Efficient slicing happens because of "tiling", hence the name TileDB. A tile is similar to an HDF5 or Zarr "chunk", or more loosely to a Parquet page. Although totally configurable, tiling is handled solely by TileDB; the user doesn't need to know about it. A tile is the atomic unit of IO and compression. TileDB maintains all the necessary metadata and indexing built into its format and, given a query, it knows how to fetch only the tiles that might include results. Those tiles are decompressed in memory and filtered further for the actual results. The dense array case is rather straightforward. The sparse case is a big differentiator in TileDB and it is quite challenging, especially in the presence of updates. TileDB handles the sparse case via bulk-loaded R-trees for multi-dimensional indexing, and via an LSM-tree-like approach with immutable objects that allows time traveling.
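As a rough illustration of "fetch only the tiles that might include results", here is a minimal sketch in plain Python. It is not TileDB's implementation or API: the tile extents, the integer coordinates (day offset and stock id) and the function names are all assumptions, and a flat list of bounding boxes stands in for the R-tree.

```python
from collections import defaultdict

TILE_DAYS, TILE_STOCKS = 4, 2  # hypothetical tile extents per dimension


def build_tiles(cells):
    """Group sparse (day, stock_id, price) cells into tiles with bounding boxes."""
    buckets = defaultdict(list)
    for day, stock, price in cells:
        buckets[(day // TILE_DAYS, stock // TILE_STOCKS)].append((day, stock, price))
    tiles = []
    for members in buckets.values():
        days = [c[0] for c in members]
        stocks = [c[1] for c in members]
        # Minimum bounding rectangle of the non-empty cells in this tile.
        mbr = (min(days), max(days), min(stocks), max(stocks))
        tiles.append((mbr, members))
    return tiles


def query(tiles, day_lo, day_hi, stock_lo, stock_hi):
    """Return (number of tiles fetched, matching cells) for a 2D range query."""
    fetched = 0
    results = []
    for (d0, d1, s0, s1), members in tiles:
        # Skip tiles whose bounding box cannot intersect the query rectangle.
        if d1 < day_lo or d0 > day_hi or s1 < stock_lo or s0 > stock_hi:
            continue
        fetched += 1  # in a real system this would be one IO + decompression
        results.extend(c for c in members
                       if day_lo <= c[0] <= day_hi and stock_lo <= c[1] <= stock_hi)
    return fetched, results


# A sparse 16-day x 8-stock array where only half of the cells are populated.
cells = [(day, stock, 100.0 + day + stock)
         for day in range(16) for stock in range(8)
         if (day + stock) % 2 == 0]
tiles = build_tiles(cells)

fetched, results = query(tiles, day_lo=5, day_hi=10, stock_lo=3, stock_hi=3)
avg = sum(price for _, _, price in results) / len(results)
print(len(tiles), fetched, len(results), avg)
```

In this toy setup the query touches 2 of the 16 tiles, and the cells inside those two tiles are then filtered down to the actual matches before averaging.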

Concerning your point on potential errors occurring on S3, this is addressed by TileDB's immutable object approach. If an error occurs during a write, there will be no array corruption. Happy to discuss this topic in a separate thread.
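One way to picture why immutable objects make a failed write harmless is the sketch below. It is only an illustration under stated assumptions, not TileDB's actual on-disk format: `ObjectStore` is an in-memory stand-in for S3, the `fragments/<timestamp>/...` key layout is made up, and the "commit marker written last" convention is an assumption chosen for the example.

```python
class ObjectStore:
    """In-memory stand-in for S3: each key is written once and never changed."""

    def __init__(self):
        self.objects = {}

    def put(self, key, value):
        if key in self.objects:
            raise ValueError(f"{key} already exists; objects are never overwritten")
        self.objects[key] = value


def write_fragment(store, timestamp, cells, fail_before_commit=False):
    """Write one batch of updates as a new, timestamped, immutable fragment."""
    prefix = f"fragments/{timestamp:010d}"
    store.put(f"{prefix}/data", dict(cells))
    if fail_before_commit:
        raise IOError("simulated upload failure")
    # Assumption: a marker object written last makes the fragment visible.
    store.put(f"{prefix}/ok", True)


def read(store, at=None):
    """Merge committed fragments in timestamp order, optionally up to `at`."""
    committed = sorted(int(key.split("/")[1])
                       for key in store.objects if key.endswith("/ok"))
    view = {}
    for ts in committed:
        if at is not None and ts > at:
            break
        view.update(store.objects[f"fragments/{ts:010d}/data"])  # newer wins
    return view


store = ObjectStore()
write_fragment(store, 1, {("2020-01-02", "AAPL"): 75.0,
                          ("2020-01-02", "MSFT"): 160.0})
write_fragment(store, 2, {("2020-01-02", "AAPL"): 75.5})  # an update
try:
    write_fragment(store, 3, {("2020-01-02", "MSFT"): -1.0},
                   fail_before_commit=True)
except IOError:
    pass  # the half-written fragment is simply never seen by readers

latest = read(store)
as_of_1 = read(store, at=1)
print(latest[("2020-01-02", "AAPL")], as_of_1[("2020-01-02", "AAPL")])
```

Because earlier fragments are never modified, the interrupted third write cannot damage them, and reading "as of" an older timestamp is just a matter of ignoring newer fragments.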

Some related docs:

https://docs.tiledb.com/main/performance-tips/choosing-tilin...

https://docs.tiledb.com/main/basic-concepts/tile-filters#til...

https://docs.tiledb.com/main/basic-concepts/definitions/frag...