| HN Mirror



	by sfsylvester 3035 days ago
	This is exactly my go-to move as well. pandas.read_hdf has beaten ray.dataframe.read_csv on speed for the few files I've tested so far. But I imagine the programmable flexibility CSVs have over HDF files (I've never used a Unix command to edit an HDF file, for example) is why this new approach could get some traction.
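To make the flexibility point concrete, here is a minimal sketch of a text-level edit on a CSV, roughly what a grep or sed one-liner would do. The sample data and column names are hypothetical, and the substring match is deliberately naive; nothing here comes from the files tested above.

```python
import csv
import io

# Hypothetical sample data standing in for a CSV file on disk.
raw = "id,city,temp_c\n1,Oslo,3.5\n2,Lima,19.0\n3,Oslo,4.1\n"

# A CSV is plain text, so a line-level filter (what grep or sed would do)
# works without any format-specific library.
lines = raw.splitlines()
header, rows = lines[0], lines[1:]
oslo_only = [header] + [row for row in rows if ",Oslo," in row]
filtered = "\n".join(oslo_only) + "\n"

# The filtered text is still valid CSV and parses with the standard csv module.
records = list(csv.DictReader(io.StringIO(filtered)))
print(records)
```

An HDF5 file is a binary container, so the same kind of edit would have to go through a library that understands the format.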

1 comment

tavert 3035 days ago

Try Parquet if your data is tabular. pyarrow and related tools are getting Parquet up to a speed pretty comparable to HDF5, with arguably more flexibility and a better multithreading story.
