by gas9S9zw3P9c
2165 days ago
I'm in the same camp. I'm quite interested since it mentions asset data as an example, but I have no idea what this does from looking at the landing page. Does someone have an end-to-end example? Since this stores arrays, is this kind of like Apache Arrow but with a persistence layer? Is this suited for large amounts (~1TB) of time series data?
Compared to Apache Arrow, we have some similarities but also some significant differences. Arrow as a project has many components; the most directly comparable are the in-memory data structure and Parquet for on-disk storage. For the in-memory data structure, TileDB has the similar goal of zero-copy data movement between libraries and applications. In fact, we even use Arrow in our TileDB-VCF[2] genomics project for zero-copy transfer into Spark and Python (pyarrow). We are looking to expand Arrow support to other integrations where appropriate.
For Parquet, a brief comparison is that Parquet is a one-dimensional columnar storage format, whereas TileDB is multi-dimensional. TileDB subsumes Parquet in that we include all of its functionality and more. TileDB natively handles the eventual consistency of cloud object stores and natively handles updates through its MVCC design. TileDB is a complete storage engine, not only a file format. That said, Parquet does have some advantages; a primary one is that it has a head start of several years on TileDB in being integrated into many tools, so it's better known.
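To make the MVCC point a bit more concrete, here is a toy sketch of the general idea: each write lands as an immutable, timestamped fragment, and a read merges fragments up to a chosen timestamp with the newest value winning. The `FragmentStore` class and its methods are hypothetical names for illustration only; this is plain Python, not the TileDB API or its actual implementation.

```python
# Toy model of MVCC-style updates (conceptual only, not TileDB's code or API).
# Every write is appended as an immutable fragment tagged with a timestamp.
# A read merges all fragments up to `as_of`, so later writes override earlier ones.

class FragmentStore:
    def __init__(self):
        # Each entry is (write_timestamp, {coordinate: value}).
        self.fragments = []

    def write(self, write_timestamp, cells):
        # Earlier fragments are never modified; an update is just a new fragment.
        self.fragments.append((write_timestamp, dict(cells)))

    def read(self, as_of=None):
        merged = {}
        # Apply fragments oldest first so the newest value for a cell wins.
        for ts, cells in sorted(self.fragments, key=lambda f: f[0]):
            if as_of is None or ts <= as_of:
                merged.update(cells)
        return merged


store = FragmentStore()
store.write(1, {(0, 0): 10, (0, 1): 20})
store.write(2, {(0, 1): 25})           # update a single cell
latest = store.read()                  # sees the update
before_update = store.read(as_of=1)    # view of the array as of timestamp 1
```

Because nothing is overwritten in place, a writer only ever adds new objects, which is the property that tends to fit eventually consistent object stores well.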
> Is this suited for large amounts (~1TB) of time series data?
Yes, it's well suited for time series data. We natively support timestamp/datetime fields in the core library and in many of our integrations (TileDB-Py, TileDB-R, Spark, MariaDB, to name a few). We allow for fast sub-slicing on the time dimension. You also have configurable tiling[3], so you can shape the array to fit your timestamp granularity and volume. The support for updates can also help if your time series data ever gets updated. Many time series databases discourage updates to records, or they recommend using no primary keys and allowing duplicates. TileDB supports fast and efficient updates (and duplicates), so you have full control of your design and implementation.
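As a rough illustration of why tiling helps with sub-slicing on a time dimension, here is a small stdlib-only sketch. It groups cells into fixed-extent tiles and answers a range query by visiting only the tiles that overlap the range. `TiledSeries`, `tile_extent`, and `slice` are hypothetical names chosen for this example; it models the concept and is not how TileDB is implemented or called.

```python
# Conceptual sketch of tiling on a datetime dimension (not the TileDB API).
from datetime import datetime, timedelta


class TiledSeries:
    def __init__(self, origin, tile_extent):
        self.origin = origin            # start of the time domain
        self.tile_extent = tile_extent  # width of one tile, e.g. one day
        self.tiles = {}                 # tile index -> list of (timestamp, value)

    def _tile_index(self, ts):
        # timedelta // timedelta yields an integer tile number.
        return (ts - self.origin) // self.tile_extent

    def write(self, ts, value):
        self.tiles.setdefault(self._tile_index(ts), []).append((ts, value))

    def slice(self, start, end):
        """Return cells with start <= ts < end and the number of tiles visited."""
        hits, touched = [], 0
        for idx in range(self._tile_index(start), self._tile_index(end) + 1):
            cells = self.tiles.get(idx)
            if cells is None:
                continue
            touched += 1
            hits.extend((ts, v) for ts, v in cells if start <= ts < end)
        return sorted(hits), touched


origin = datetime(2020, 1, 1)
series = TiledSeries(origin, tile_extent=timedelta(days=1))
for hour in range(240):  # ten days of hourly samples
    series.write(origin + timedelta(hours=hour), hour)

# One day of data: only the tiles overlapping the range are scanned.
rows, tiles_touched = series.slice(datetime(2020, 1, 3), datetime(2020, 1, 4))
```

Picking a tile extent close to your typical query window keeps the number of tiles visited small, which is the trade-off the tiling guide[3] is about.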
[1] https://news.ycombinator.com/item?id=23897086
[2] https://github.com/TileDB-Inc/TileDB-VCF
[3] https://docs.tiledb.com/main/performance-tips/choosing-tilin...