by stavrospap
2237 days ago
Stavros from TileDB here. TL;DR: it's all about fast slicing on multiple columns while supporting updates, locally or in the cloud.

Suppose you serialize your dataframe in HDF5 or Redis. Let your dataframe have schema (Date, Stock, Price). Assume this dataframe is 1 TB in size and stored on S3, GCS or Azure (as they are cheap). How would you efficiently compute the average Price for a specific Date range and Stock symbol? With HDF5 you would have to download the full 1 TB (there is no notion of "fast slicing on variable predicates") and apply the predicates locally.

If you stored the dataframe in Parquet (a better choice for this use case), you could build some logic in your code that uses the Parquet metadata/indexes to prune a lot of unnecessary data (as Spark does). However, Parquet is "one-dimensional", i.e., your pruning would be efficient on Date, but not on Stock (you'd have to "partition" your Parquet files with Spark or Hive, and things could get quite complicated). Most importantly, you wouldn't be able to update the Parquet files in place; you would have to generate new files and build a catalog on top (or use services like Delta Lake) to manage them. That is an extremely cumbersome task.

TileDB abstracts all of this for you, while allowing you to slice fast on any number of columns. You just define Date and Stock as "dimensions", and slicing on both of those columns becomes uber efficient, locally or in the cloud. Effectively, you turn this dataframe into a sparse 2D array. Updates and time traveling are handled by TileDB. You get to use Spark, Dask, MariaDB and PrestoDB as you did before, but there is no need for Hive, Delta Lake or any other cataloging service.
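To make the "sparse 2D array" idea a bit more concrete, here is a toy sketch in plain Python. It is not TileDB's API or its actual storage layout; the class `ToyTiledFrame`, the tile extents, and the sample prices are all made up for illustration. It only shows why treating both Date and Stock as dimensions lets a query skip most of the data: cells are grouped into tiles by both coordinates, and a query touches only the tiles that overlap the requested Date range and Stock.

```python
from collections import defaultdict
from datetime import date

# Assumed tile extents, chosen only for this illustration.
DATE_TILE_DAYS = 30   # days covered by one tile along the Date dimension
STOCK_TILE_SIZE = 2   # symbols covered by one tile along the Stock dimension


class ToyTiledFrame:
    """Toy sparse 2D 'array' over (Date, Stock). Hypothetical, not TileDB's layout."""

    def __init__(self, symbols):
        # Map each stock symbol to an integer coordinate on the Stock dimension.
        self.stock_index = {s: i for i, s in enumerate(sorted(symbols))}
        # (date_tile, stock_tile) -> list of (day, symbol, price) cells
        self.tiles = defaultdict(list)

    def write(self, day, symbol, price):
        key = (day.toordinal() // DATE_TILE_DAYS,
               self.stock_index[symbol] // STOCK_TILE_SIZE)
        self.tiles[key].append((day, symbol, price))

    def average_price(self, start, end, symbol):
        """Average Price for one Stock over [start, end], reading only overlapping tiles."""
        stock_tile = self.stock_index[symbol] // STOCK_TILE_SIZE
        first = start.toordinal() // DATE_TILE_DAYS
        last = end.toordinal() // DATE_TILE_DAYS
        prices, tiles_read = [], 0
        for date_tile in range(first, last + 1):
            cells = self.tiles.get((date_tile, stock_tile))
            if cells is None:
                continue  # empty region of the sparse array: nothing to fetch
            tiles_read += 1
            # Apply the exact predicates only to cells in the fetched tile.
            prices += [p for d, s, p in cells if start <= d <= end and s == symbol]
        avg = sum(prices) / len(prices) if prices else None
        return avg, tiles_read


frame = ToyTiledFrame(["AAPL", "GOOG", "MSFT", "TSLA"])
frame.write(date(2020, 1, 2), "AAPL", 100.0)
frame.write(date(2020, 1, 3), "AAPL", 110.0)
frame.write(date(2020, 1, 3), "MSFT", 200.0)
frame.write(date(2020, 6, 1), "AAPL", 300.0)

avg, tiles_read = frame.average_price(date(2020, 1, 1), date(2020, 1, 31), "AAPL")
print(avg, tiles_read, len(frame.tiles))
```

Here the January query for AAPL averages to 105.0 and reads fewer tiles than the frame holds, because the MSFT tile and the June tile are never touched. With a Date-only layout, the MSFT rows from January would have to be read and filtered out afterwards.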
Thank you for pointing out the confusion though. We just launched and we have tons of examples coming up.