by desk_minion 1492 days ago
This is a killer idea. However, I do not see anything in the README about distributed querying. Is that something you wish to tackle?

Also, any benchmarks comparing this to Apache Arrow or Apache Presto?

2 comments

Hi! One of the authors here. We do have support for distributed querying, but it's not implemented in the command-line tool. (It makes for a much more complicated demo if you need multiple machines.) The query planner is happy to use as many machines as you can throw at it.

We don't yet have good comparative benchmarks against Arrow or Presto, although I'm hoping we can get those.

Sneller head of product here. Arrow is a data exchange format; are you referring to benchmarking against DataFusion or Ballista? Also, on Presto: we did early benchmarks against Amazon's Athena (Presto under the covers) running on Parquet, and will rerun these benchmarks shortly. The interesting thing to note vs. Presto is that it is clunky to use with raw JSON; see https://prestodb.io/docs/current/functions/json.html. While benchmarking against Athena we actually used AWS Glue (Spark under the hood) to transform JSON into Parquet, but that adds both complexity and latency to the overall pipeline, which doesn't show up in query timings alone.
If you check out the Kubernetes folder in the repo, you'll find the Kubernetes setup for running in a distributed environment (which is also highly available).