Hacker News
by monk_the_dog 4852 days ago
I use HDF5 as the file format for my project. I'm not at all an HDF5 expert, but I know it supports partial reads/writes, extending data sets, and transparent compression. It's a nice file format and it's very easy to use from Python (and MATLAB, C++, R, and others).
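For readers who have not used those three features, here is a minimal sketch with h5py and NumPy. The dataset name `ticks`, the shapes, the chunk size, and the helper `demo_hdf5_features` are all made up for illustration, not taken from the poster's project.

```python
import os
import tempfile

import h5py
import numpy as np


def demo_hdf5_features(path):
    """Write, extend, and partially read a compressed, resizable dataset."""
    with h5py.File(path, "w") as f:
        # maxshape=(None, 4) makes the first axis extendable;
        # compression="gzip" is applied per chunk, transparently to the caller.
        dset = f.create_dataset(
            "ticks",                # hypothetical dataset name
            shape=(1000, 4),
            maxshape=(None, 4),
            chunks=(100, 4),
            dtype="f8",
            compression="gzip",
        )
        dset[:] = np.arange(4000, dtype="f8").reshape(1000, 4)

        # Partial write: only rows 10..19 are modified.
        dset[10:20, :] = -1.0

        # Extend the dataset by another 1000 rows and fill them.
        dset.resize((2000, 4))
        dset[1000:, :] = 0.5

    with h5py.File(path, "r") as f:
        dset = f["ticks"]
        # Partial read: only the chunks overlapping this slice are read.
        block = dset[995:1005, 1]
        return dset.shape, block


path = os.path.join(tempfile.mkdtemp(), "demo.h5")
shape, block = demo_hdf5_features(path)
```

Resizing only works on chunked datasets created with a `maxshape`, which is why both are set up front here.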

I read the article, and BLZ looks interesting. What is it aiming to provide that is missing from HDF5? Speed?


HDF5 is a very nice format indeed, and in fact, BLZ is borrowing a lot of good ideas from it. However, HDF5 has its own drawbacks, like not being able to compress variable-length datasets, the lack of a query/computational kernel, and its fragility during updates. Also, its approach to distributing data among nodes diverges from our goals.

Finally, you are right, speed is pretty important for us, and we think that our approach can make better use of current computer architectures.

I use HDF5 for storage and analytics of tick data; in my experience, storing large sparse matrices is expensive in both storage space and speed.

I know that a lot of my peers in HFT have an illicit love for column stores, but a lot of work requires converting to 'wide' format, which can quickly turn a 10-million-row matrix with a few columns into one with a few thousand columns. (So tools like KX, FastBit, etc. become somewhat suboptimal.)
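To make the long-to-wide blow-up concrete, here is a small pure-Python sketch. The `(timestamp, symbol, price)` layout, the symbols, and the helper `long_to_wide` are assumptions for illustration only; real tick data would have far more rows and symbols.

```python
def long_to_wide(rows):
    """Pivot (timestamp, symbol, price) rows into one row per timestamp.

    Returns (symbols, table), where table maps timestamp -> list of prices
    aligned with `symbols`; missing observations are None.
    """
    symbols = sorted({sym for _, sym, _ in rows})
    col = {sym: i for i, sym in enumerate(symbols)}
    table = {}
    for ts, sym, price in rows:
        wide_row = table.setdefault(ts, [None] * len(symbols))
        wide_row[col[sym]] = price
    return symbols, table


# Hypothetical long-format ticks: one observation per row.
long_rows = [
    (1, "AAA", 10.0),
    (2, "BBB", 20.0),
    (3, "AAA", 10.1),
    (4, "CCC", 30.0),
]
symbols, wide = long_to_wide(long_rows)

# Count how many cells of the wide table actually hold data.
cells = len(wide) * len(symbols)
filled = sum(v is not None for row in wide.values() for v in row)
```

Each distinct symbol becomes a column, so the wide table has rows times symbols cells while the number of filled cells stays equal to the number of original observations. With a few thousand symbols the result is the large, mostly empty matrix described above.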

The need for massive 'last-of' lookups on time series basically leads to abandoning python/pandas/numpy for C primitives and doing a lot more 'online' than you'd typically like, but a lot of this could really happen behind the scenes with intelligent out-of-memory ops.
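A 'last-of' lookup here is read as "the most recent value at or before a given time". A minimal standard-library sketch of both the offline and the 'online' flavour follows. The function names and sample data are hypothetical, and timestamps are assumed to be sorted in ascending order.

```python
from bisect import bisect_right


def last_of(timestamps, values, query_ts):
    """Return the most recent value at or before query_ts, or None."""
    i = bisect_right(timestamps, query_ts)
    return values[i - 1] if i else None


def running_last(events):
    """Online variant: yield the latest value per symbol after each event."""
    last = {}
    for ts, sym, price in events:
        last[sym] = price
        yield ts, dict(last)  # copy, so earlier snapshots stay unchanged


# Hypothetical sorted series for one symbol.
ts = [100, 105, 130]
px = [10.0, 10.2, 9.9]

# Hypothetical interleaved event stream for two symbols.
snapshots = list(running_last([(1, "AAA", 10.0), (2, "BBB", 20.0), (3, "AAA", 10.1)]))
```

The binary search costs O(log n) per query on data already in memory. The online version keeps only the current state, which is the kind of bookkeeping the comment suggests a smarter data store could do behind the scenes.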

So...I'm pretty excited for innovation in data stores -- I look forward to seeing more!