by shoyer 3693 days ago
Thanks for sharing your story!

Let me comment on file formats as someone familiar with both netCDF and deep learning.

I agree that netCDF is a sane binary file format for this application. It's designed for efficient serialization of large arrays of numbers. One downside is that netCDF does not support streaming without writing the data to intermediate files on disk.

Keep in mind that netCDF v4 is itself just a thin wrapper around HDF5. Given that your input format is basically a custom file format written in netCDF, I would have just used HDF5 directly. The API is about as convenient, and this would skip one layer of indirection.

The native file format for TensorFlow is its own TFRecords format, but it also supports a number of other file formats. TFRecords is a much simpler technology than netCDF/HDF5. It's basically just a bunch of serialized protocol buffers [1]. About all you can do with a TFRecords file is pull out examples -- it doesn't support the fancy multi-dimensional indexing or hierarchical structure of netCDF/HDF5. But pulling out examples is also most of what you need for building machine learning models, and it's quite straightforward to read/write them in a streaming fashion, which makes it a natural fit for technologies like map-reduce.
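
To make the "bunch of serialized records" point concrete, here's a minimal sketch of a length-prefixed record stream in plain Python. It is a simplified illustration and not the actual TFRecords layout (which, as far as I know, also stores checksums with each record), and the payloads are plain bytes standing in for serialized protocol buffers. The names write_records and read_records are made up for this example.

```python
import io
import struct


def write_records(stream, payloads):
    """Append each bytes payload as an 8-byte little-endian length, then the data."""
    for payload in payloads:
        stream.write(struct.pack("<Q", len(payload)))
        stream.write(payload)


def read_records(stream):
    """Yield payloads one at a time, never holding more than one record in memory."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of stream
        if len(header) < 8:
            raise EOFError("truncated length header")
        (length,) = struct.unpack("<Q", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated record")
        yield payload


# Round trip through an in-memory buffer; a file or socket would work the same way.
buf = io.BytesIO()
write_records(buf, [b"example-0", b"", b"example-2"])
buf.seek(0)
records = list(read_records(buf))
```

The reader only ever moves forward, which is why this kind of layout streams so easily. The flip side is that there is no index: getting at record N means reading past everything before it.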

[1] https://www.tensorflow.org/versions/r0.8/api_docs/python/pyt...

1 comment

Thanks for that! And boy, I wish I had the resources the TensorFlow team has to build standards like this and also to write their own custom CUDA compiler.

I do want the multi-dimensional indexing for RNN data though. Maybe supporting HDF5 directly is the path forward.

Thanks again!