by lowsenberg 3154 days ago
This looks very interesting. Currently we are storing our dense simulation (and experimental) data in NetCDF/HDF5. Given correct chunking, this seems to be pretty efficient in terms of both performance and compression. What would we gain by using TileDB? How does its performance compare with HDF5?
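To make the "correct chunking" point concrete, here is a minimal sketch of how chunk shape changes the number of chunks a read has to touch. It is not HDF5 or TileDB code: `chunks_touched` is a hypothetical helper, and the array and chunk shapes are made-up examples.

```python
from math import prod

def chunks_touched(start, stop, chunk_shape):
    """Count the chunks intersected by the half-open box [start, stop).

    start, stop and chunk_shape are per-dimension tuples of equal length.
    Hypothetical helper for illustration only; not part of any library.
    """
    per_dim = []
    for lo, hi, c in zip(start, stop, chunk_shape):
        first = lo // c          # index of the first chunk hit in this dimension
        last = (hi - 1) // c     # index of the last chunk hit in this dimension
        per_dim.append(last - first + 1)
    return prod(per_dim)

# Reading one column of an assumed 1000 x 1000 array:
square = chunks_touched((0, 0), (1000, 1), (100, 100))  # square chunks
rows = chunks_touched((0, 0), (1000, 1), (1, 1000))     # one chunk per row
print(square, rows)
```

With square chunks the column read touches 10 chunks; with row-shaped chunks it touches 1000, each of which would have to be read and decompressed in full.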
2 comments

Stavros from TileDB, Inc. here: HDF5 is great software, and TileDB was heavily inspired by it. HDF5 probably works great for your use case. TileDB matches HDF5 performance in the dense case, but it also addresses some important limitations of HDF5, which may or may not be relevant to you. These include: sparse array support (not relevant to you); multiple readers and multiple writers through thread- and process-safety (HDF5 does not have full thread-safety, and it does not support parallel writes with compression; I am assuming you are using MPI and a single writer though, so HDF5 should still work well for you); and efficient writes in a log-structured manner that enables multi-versioning and fault tolerance (HDF5 may suffer from file corruption upon error and from file fragmentation; you are probably not updating, so this is still not very relevant to you). Having said that, and echoing Jake's comment, we would love to hear from you how TileDB could be adapted to serve your case better.
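As a rough illustration of what "log-structured writes that enable multi-versioning" means, here is a toy sketch in plain Python. It is not TileDB's implementation or API: the `FragmentLog` class, its methods, and the integer timestamps are all assumptions made up for this example. The idea shown is only that each write is appended as an immutable fragment and a read replays fragments up to a chosen time.

```python
class FragmentLog:
    """Toy 1-D array where every write is an append-only, immutable fragment.

    Hypothetical illustration only; names and behaviour are assumptions,
    not the TileDB on-disk format or API.
    """

    def __init__(self, length, fill=0):
        self.length = length
        self.fill = fill
        self.fragments = []  # list of (timestamp, start, values)

    def write(self, timestamp, start, values):
        values = tuple(values)
        if start < 0 or start + len(values) > self.length:
            raise IndexError("write falls outside the array")
        # Existing fragments are never modified; a new one is appended.
        self.fragments.append((timestamp, start, values))

    def read(self, at=None):
        """Return the array as of time `at` (latest state if None)."""
        out = [self.fill] * self.length
        # Replay fragments oldest first so later writes win.
        for ts, start, values in sorted(self.fragments, key=lambda f: f[0]):
            if at is not None and ts > at:
                break
            out[start:start + len(values)] = values
        return out


log = FragmentLog(5)
log.write(1, 0, [1, 1, 1, 1, 1])
log.write(2, 1, [9, 9])
print(log.read())      # latest version
print(log.read(at=1))  # version as of timestamp 1
```

Because earlier fragments are never rewritten in this sketch, a failed write can at worst leave behind one incomplete fragment, which is the intuition behind the fault-tolerance claim.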

A general comment: TileDB’s vision goes beyond that of the HDF5 (or any scientific) format. Considering though the quantities of HDF5 data out there (and the fact that we like the software), we are thinking about building some integration with HDF5 (and NetCDF). For instance, you may be able to create a TileDB array by “pointing” to an HDF5 dataset, without unnecessarily ingesting the HDF5 files but still enjoying the TileDB API and extra features.

Jake from TileDB, Inc. here. Performance-wise, I would look at the paper referenced in this thread, which provides benchmarks for various workloads. What advantages TileDB may offer you is problem-dependent, especially for dense simulation output data, which is the use case HDF5 was designed for. If you have specific suggestions for ways to improve HDF5 for your use case we would love to hear about them.