Hacker News
by sfsylvester 3035 days ago
This is exactly my go-to move as well.

pandas.read_hdf was faster than ray.dataframe.read_csv on the few files I've tested so far. But I imagine the flexibility CSVs have over HDF files (I've never used a Unix command to edit an HDF file, for example) is why this new approach could get some traction.

1 comment

Try Parquet if your data is tabular. pyarrow and related tools are getting Parquet up to speeds pretty comparable to HDF5, with arguably more flexibility and a better multithreading story.