HN Mirror

by bicubic 2680 days ago

The ingest time is due to updating the merge tree. You don't need a merge tree for ETL... It's like the worst backing store you could possibly choose. You're also comparing an intentionally horizontally distributed query to a purely vertical one on a single node. Just slightly below, you can see that the same query takes 0.2 seconds on a single node.
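
To make the ingest-cost point concrete, here is a toy sketch in Python. It is not ClickHouse's actual MergeTree code; the function names and the `max_parts` threshold are made up for illustration. It only shows the general idea that a merge-tree-style store sorts each incoming batch into a part and later merges parts, while a plain append-only load does neither.

```python
import heapq

def ingest_append_only(batches):
    """Plain append: rows are stored in arrival order, no sorting work."""
    log = []
    for batch in batches:
        log.extend(batch)
    return log

def ingest_sorted_parts(batches, max_parts=2):
    """Toy merge-tree-style ingest (hypothetical, for illustration only).

    Each batch becomes a sorted part; whenever more than max_parts parts
    exist, the two oldest parts are merged into one sorted part.
    Returns the final parts and the number of merges performed.
    """
    parts = []
    merges = 0
    for batch in batches:
        parts.append(sorted(batch))
        while len(parts) > max_parts:
            oldest = parts.pop(0)
            next_oldest = parts.pop(0)
            parts.insert(0, list(heapq.merge(oldest, next_oldest)))
            merges += 1
    return parts, merges
```

The extra sorting and merging is what buys fast reads later, which is why it is wasted effort if the data is only being staged for a one-off load.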

I was hoping to see some serious consideration given to these kinds of benchmarks, considering ClickHouse is one of the most cost-effective tools I've used in the real world and occasionally outperforms things like MapD.

I was expecting your solution to outperform ClickHouse at least in some aspects, and a benchmark showing where it wins. Instead you reveal ignorance of ClickHouse and even of the benchmarks you linked.

Your comment comes off as incredibly arrogant and at the same time incredibly misinformed. Disappointing to see this attitude from the team.

1 comments

felipe_aramburu 2680 days ago

I am ignorant of ClickHouse. It doesn't really compete in the workloads we are interested in. Sorry you feel this way, but we are a small team and need to consider tools that integrate with Apache Arrow and cuDF natively.

If it doesn't take input from Arrow and cuDF, and it doesn't produce output that is Arrow, cuDF, or one of the file formats we are decompressing on the GPU, then we don't care unless one of our users asks us for it.

We are 16 people and a year ago were 5. We can't test everything out, just the tools our users need to replace in their stacks. I apologize if I came off as arrogant. I have Tourette's syndrome and a few other things that make it difficult for me to communicate, particularly when discussing technical matters. If I have offended you I do apologize, but not a single one of our users has said to me, "I am using ClickHouse and want to speed up my GPU workloads." Maybe it's so fast they don't mind paying a serialization cost going from ClickHouse to a GPU workload, and if so that's great for them!

bicubic 2680 days ago

Understood.

I do suggest you seriously benchmark against ClickHouse, because where single-node performance is concerned, it is the tool to beat outside arcane proprietary stuff like kdb+ and BrytlytDB. I have used single-node ClickHouse and seen interactive query times where a >10 node Spark cluster was recommended by supposed experts.

ClickHouse is not a mainstream tool (and I have discussed its limitations in other threads) but it is certainly rising in popularity, and in my view it comes pretty close to 1st place for general-purpose perf short of Google-scale datasets.

felipe_aramburu 2680 days ago

Ok. Right now we are in tunnel vision mode to get our distributed version out by GTC in mid-March. We will benchmark against ClickHouse sometime in March. Do you know of any benchmark tests that are a bit more involved in terms of query complexity? We are most interested in queries where you can't be clever and use things like indexing and precomputed materializations.

The more complex the query, the less you can rely on being clever and the more the guts need to be performant, and that is more important to us right now.
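
As a rough illustration of that distinction, here is a minimal Python sketch. The rows, column names, and function names are all hypothetical. A rollup precomputed at load time answers the one query it was built for without touching the raw rows, while an ad hoc filter it never anticipated has to scan everything, so raw scan speed is what matters.

```python
from collections import defaultdict

# Hypothetical rows: (carrier, distance_miles, delay_minutes)
ROWS = [
    ("AA", 300, 10),
    ("AA", 1200, 30),
    ("UA", 500, 0),
    ("UA", 2000, 20),
]

def build_rollup(rows):
    """Precompute [sum, count] of delay per carrier at load time."""
    rollup = defaultdict(lambda: [0, 0])
    for carrier, _distance, delay in rows:
        rollup[carrier][0] += delay
        rollup[carrier][1] += 1
    return rollup

def avg_delay_from_rollup(rollup, carrier):
    """Anticipated query: answered from the rollup, no raw scan."""
    total, count = rollup[carrier]
    return total / count

def avg_delay_long_haul(rows, carrier, min_distance):
    """Ad hoc query the rollup cannot answer: scans every raw row."""
    delays = [
        delay
        for c, distance, delay in rows
        if c == carrier and distance >= min_distance
    ]
    return sum(delays) / len(delays) if delays else None
```

For example, `avg_delay_from_rollup(build_rollup(ROWS), "AA")` reads two numbers, while `avg_delay_long_haul(ROWS, "AA", 1000)` has to look at every row.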

hodgesrm 2677 days ago

I work for Altinity, which offers commercial support for ClickHouse. We like benchmarks. :)

We use the DOT airline on-time performance dataset (https://www.transtats.bts.gov/tables.asp?DB_ID=120) and Yellow Taxi trip data from NYC Open Data (https://data.cityofnewyork.us/browse?q=yellow%20taxi%20data&...) for benchmarking real-time query performance on ClickHouse. I'm working on publishing both datasets in a form that makes it easy to load them quickly. Queries are an exercise for the reader, but see Mark Litwintschik's blog for good examples of queries: https://tech.marksblogg.com/billion-nyc-taxi-clickhouse.html.

We've also done head-to-head comparisons on time series using the TSBS benchmark developed by the Timescale team. See https://www.altinity.com/blog/clickhouse-timeseries-scalabil... for a description of our tests as well as a link to the TSBS Github project.

einpoklum 2673 days ago

On an unrelated note: Oh, if you guys are using the OnTime data, have a look at this: https://github.com/eyalroz/usdt-ontime-tools

hodgesrm 2677 days ago

BTW, I think you do need to consider materialized views. ClickHouse materialized views function like projections in Vertica. They can apply different indexing and sorting to data. Unless your query patterns are very rigid it's hard to get high performance in any DBMS without some ability to implement different clustering patterns in storage.
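
A small Python sketch of the clustering idea, under stated assumptions: the rows and names below are hypothetical, and this is not how ClickHouse or Vertica implement it internally. It keeps the same rows in two sort orders so that a range lookup on either key can use binary search instead of a full scan.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows: (timestamp, user_id, value)
ROWS = [(3, "b", 30), (1, "c", 10), (2, "a", 20), (4, "a", 40)]

# Base copy sorted by timestamp, plus a second copy sorted by user_id
# (playing the role of a projection / materialized view).
BY_TIME = sorted(ROWS, key=lambda r: r[0])
BY_USER = sorted(ROWS, key=lambda r: r[1])

# Sort keys extracted once so bisect can search them directly.
TIME_KEYS = [r[0] for r in BY_TIME]
USER_KEYS = [r[1] for r in BY_USER]

def range_lookup(sorted_rows, keys, lo, hi):
    """Return rows whose sort key lies in [lo, hi] via binary search."""
    start = bisect_left(keys, lo)
    end = bisect_right(keys, hi)
    return sorted_rows[start:end]
```

A time-range query goes to `BY_TIME` and a per-user query goes to `BY_USER`. The cost is storing and maintaining the data twice, which is the trade-off a materialized view makes.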