| HN Mirror



	by hendiatris 474 days ago
	You may be able to get close with sufficiently small row groups, but you will have to run some tests. It should only take a few hours of work: take some sensor data, sort it by the identifier, and write it to Parquet with one row group per sensor. You can do this with the ParquetWriter class in PyArrow, or with anything else that gives you fine-grained control over how the file is written. I just checked and saw that you can have around 7 million row groups per file, so you should be fine. Then spin up DuckDB and run some performance tests. I’m not sure this will work: there is some overhead in reading Parquet, which is why small files and small row groups are discouraged.