

by jaychia 842 days ago

Hello! Daft developer here. The benchmarks we ran aren’t directly comparable to the results on TPC-H’s own page because of differences in hardware, storage, etc.

For hardware, we used AWS i3.2xlarge machines in a distributed cluster. On the storage side, we read Parquet files over the network from AWS S3. This is most representative of how users run query engines like Daft.

The TPC-H benchmarks are usually performed on databases which have pre-ingested the data into a single-node server-grade machine that’s running the database.

Note that Daft isn’t really a “database”, because we don’t have proprietary storage. Part of the appeal of using query engines like Daft and Spark is to be able to read data “at rest” (as Parquet, CSV, JSON, etc.). However, this will definitely be slower than a database which has pre-ingested the data into indexed storage and proprietary formats!
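
To make that trade-off concrete, here is a toy sketch using only the Python standard library. It is not Daft code and not how the benchmark was run: the data, the table name `orders`, and the helper functions are all hypothetical. It only contrasts scanning a raw file on every query with paying a one-time ingest cost to build indexed storage.

```python
import csv
import io
import sqlite3

# Hypothetical toy data standing in for a file "at rest" (e.g. a CSV in object storage).
RAW_CSV = (
    "order_id,customer,amount\n"
    "1,alice,10.0\n"
    "2,bob,25.5\n"
    "3,alice,4.5\n"
    "4,carol,8.0\n"
)


def total_at_rest(raw: str, customer: str) -> float:
    """Query-engine style: parse and scan the raw file on every query."""
    reader = csv.DictReader(io.StringIO(raw))
    return sum(
        (float(row["amount"]) for row in reader if row["customer"] == customer),
        0.0,
    )


def ingest(raw: str) -> sqlite3.Connection:
    """Database style: pay a one-time cost to load the data and build an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
    rows = [
        (int(r["order_id"]), r["customer"], float(r["amount"]))
        for r in csv.DictReader(io.StringIO(raw))
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_customer ON orders (customer)")
    conn.commit()
    return conn


def total_ingested(conn: sqlite3.Connection, customer: str) -> float:
    """Later queries can use the index instead of re-reading the raw file."""
    cur = conn.execute(
        "SELECT COALESCE(SUM(amount), 0.0) FROM orders WHERE customer = ?",
        (customer,),
    )
    return cur.fetchone()[0]
```

Both paths return the same answer; the difference is where the work happens. The at-rest path needs no setup but re-parses the file on each query, while the ingested path pays up front and then answers from indexed storage.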

Hope that helps explain the discrepancies!

1 comment

pletnes 842 days ago

I’m sure you mean slow = high latency, but I do hope you’re «high throughput», like Spark/…?
