| HN Mirror


by felipe_aramburu 2684 days ago

And you think https://tech.marksblogg.com/billion-nyc-taxi-rides-clickhous..., for example, can be considered fast? It takes the user 55 minutes just to load the data into a state where it is "queryable".

After importing, they spend 34 more minutes converting the data into a columnar representation. Alright, so 89 minutes in and we still haven't run a query.

Oh, but it's not distributed yet. Darn, I have to run some non-standard SQL commands like

CREATE TABLE trips_mergetree_x3 AS trips_mergetree_third ENGINE = Distributed(perftest_3shards, default, trips_mergetree_third, rand());

OK, can I query my data yet? No, you have to move it into this distributed representation, and that takes 15 more minutes. Oh, OK...
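For readers unfamiliar with what a `Distributed(..., rand())` table does conceptually, here is a minimal Python sketch. It is not ClickHouse code and says nothing about how ClickHouse implements this internally: rows are assigned to a shard at random, each shard computes a partial count per `cab_type`, and the partial results are merged. The shard count of 3 is an assumption taken from the `perftest_3shards` name; the sample rows and function names are hypothetical.

```python
import random
from collections import Counter

NUM_SHARDS = 3  # assumption: inferred from the "perftest_3shards" cluster name


def shard_rows(rows, num_shards=NUM_SHARDS, seed=0):
    """Assign each row to a random shard, loosely mimicking a rand() sharding key."""
    rng = random.Random(seed)
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[rng.randrange(num_shards)].append(row)
    return shards


def distributed_count(shards):
    """Each shard counts locally; a coordinator merges the partial counts."""
    total = Counter()
    for shard in shards:
        partial = Counter(row["cab_type"] for row in shard)
        total.update(partial)
    return dict(total)


# Hypothetical sample data, not the real taxi dataset.
rows = [{"cab_type": "yellow"}] * 7 + [{"cab_type": "green"}] * 3
shards = shard_rows(rows)
result = distributed_count(shards)
print(result)
```

Whichever shard a row lands on, the merged counts match what a single-node GROUP BY would return; only the placement of the work changes.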

And now? Yes you can run your queries but they aren't really very fast.

SELECT cab_type, count(*) FROM trips_mergetree_x3 GROUP BY cab_type;

can take 2.5 seconds on a 108-CPU-core cluster for only 1.1BN rows? That's not fast. That's particularly slow given that it requires you to ingest and optimize your data first.

Maybe you want to show us an example of some simple tests you have run with Blazing and ClickHouse. As I read it now, it's not worth our time to look into because it's so very different from what we are trying to offer, which is:

Connect to your files wherever you have them, ETL quickly, Train / Classify, Move on!
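To put the quoted figures side by side, here is a small back-of-the-envelope calculation in Python. It uses only the numbers stated in this comment (55, 34, and 15 minutes; 1.1BN rows; 2.5 seconds; 108 cores); the variable names are made up for illustration.

```python
# Preparation steps quoted in the comment, in minutes.
load_min = 55        # initial import
columnar_min = 34    # conversion to a columnar representation
distribute_min = 15  # move into the distributed table

total_min = load_min + columnar_min + distribute_min

rows = 1.1e9   # "1.1BN rows"
query_s = 2.5  # quoted query time on the cluster
cores = 108    # quoted core count

rows_per_s = rows / query_s
rows_per_core_s = rows_per_s / cores

print(f"total prep time: {total_min} min")
print(f"cluster throughput: {rows_per_s / 1e6:.0f} M rows/s")
print(f"per-core throughput: {rows_per_core_s / 1e6:.2f} M rows/s/core")
```

That works out to 104 minutes of preparation and roughly 440 million rows per second across the cluster, or about 4 million rows per second per core.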

1 comments

bicubic 2684 days ago

The ingest time is due to updating the merge tree. You don't need a merge tree for ETL... It's like the worst backing store you could possibly choose. You're also comparing an intentionally horizontally distributed query to a purely vertical one on a single node. You can see just slightly below that the same query takes 0.2 seconds on a single node.
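Taking the two quoted timings at face value (2.5 seconds for the distributed query, 0.2 seconds on a single node), the gap can be computed directly. This is only arithmetic on the numbers in the thread; the hardware behind the two runs differs, so it is not an apples-to-apples measurement.

```python
rows = 1.1e9         # row count quoted in the parent comment
cluster_s = 2.5      # distributed query time quoted in the parent comment
single_node_s = 0.2  # single-node time cited in this comment

ratio = cluster_s / single_node_s
single_node_rows_per_s = rows / single_node_s

print(f"single node is about {ratio:.1f}x faster on this query")
print(f"single-node throughput: {single_node_rows_per_s / 1e9:.1f} B rows/s")
```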

I was hoping to see some serious consideration given to these kinds of benchmarks, considering Clickhouse is one of the most cost effective tools I've used in the real world and occasionally outperforms things like mapd.

I was expecting your solution to outperform Clickhouse at least in some aspects, and a benchmark showing where it wins. Instead you reveal ignorance of Clickhouse and even the benchmarks you linked.

Your comment comes off as incredibly arrogant and at the same time incredibly misinformed. Disappointing to see this attitude from the team.


felipe_aramburu 2684 days ago

I am ignorant of ClickHouse. It doesn't really compete in the workloads we are interested in. Sorry you feel this way, but we are a small team and need to consider tools that integrate with Apache Arrow and cuDF natively.

If it doesn't take input from Arrow and cuDF, and it doesn't produce output that is Arrow, cuDF, or one of the file formats we are decompressing on the GPU, then we don't care unless one of our users asks us for it.

We are 16 people and a year ago were 5. We can't test everything out, just the tools our users need to replace in their stacks. I apologize if I came off as arrogant. I have Tourette's syndrome and a few other things that make it difficult for me to communicate, particularly when discussing technical matters. If I have offended you, I do apologize, but not a single one of our users has said to me, "I am using ClickHouse and want to speed up my GPU workloads." Maybe it's so fast they don't mind paying a serialization cost going from ClickHouse to a GPU workload, and if so, that's great for them!


bicubic 2684 days ago

Understood.

I do suggest you seriously benchmark against ClickHouse, because where single-node performance is concerned, it is the tool to beat outside arcane proprietary stuff like kdb+ and brytlytdb. I have used single-node ClickHouse and seen interactive query times where a >10-node Spark cluster was recommended by supposed experts.

Clickhouse is not a mainstream tool (and I have discussed its limitations in other threads) but it is certainly rising in popularity, and in my view it comes pretty close to 1st place for general purpose perf short of Google scale datasets.


felipe_aramburu 2684 days ago

OK. Right now we are in tunnel-vision mode to get our distributed version out by GTC in mid-March. We will benchmark against ClickHouse sometime in March. Do you know of any benchmark tests that are a bit more involved in terms of query complexity? We are most interested in queries where you can't be clever and use things like indexing and precomputed materializations.

The more complex the query, the less you can rely on being clever and the more the guts need to be performant, and that is more important to us right now.


hodgesrm 2681 days ago

I work for Altinity, which offers commercial support for ClickHouse. We like benchmarks. :)

We use the DTC airline on time performance dataset (https://www.transtats.bts.gov/tables.asp?DB_ID=120) and Yellow Taxi trip data from NYC Open Data (https://data.cityofnewyork.us/browse?q=yellow%20taxi%20data&...) for benchmarking real-time query performance on ClickHouse. I'm working on publishing both datasets in a form that makes it easy to load them quickly. Queries are an exercise for the reader but see Mark Litwintschik's blog for good examples of queries: https://tech.marksblogg.com/billion-nyc-taxi-clickhouse.html.

We've also done head-to-head comparisons on time series using the TSBS benchmark developed by the Timescale team. See https://www.altinity.com/blog/clickhouse-timeseries-scalabil... for a description of our tests as well as a link to the TSBS Github project.


einpoklum 2677 days ago

On an unrelated note: Oh, if you guys are using the OnTime data, have a look at this: https://github.com/eyalroz/usdt-ontime-tools


hodgesrm 2681 days ago

BTW, I think you do need to consider materialized views. ClickHouse materialized views function like projections in Vertica. They can apply different indexing and sorting to data. Unless your query patterns are very rigid it's hard to get high performance in any DBMS without some ability to implement different clustering patterns in storage.
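As a rough illustration of why a second clustering of the same data helps, here is a minimal Python sketch. It is not how ClickHouse or Vertica implement materialized views or projections; it only shows the idea of keeping the same rows in two sort orders so that lookups on either column can use binary search instead of a full scan. The rows, column layout, and the `SortedCopy` class are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows: (pickup_date, cab_type, fare)
rows = [
    ("2015-01-03", "yellow", 12.5),
    ("2015-01-01", "green", 7.0),
    ("2015-01-02", "yellow", 9.0),
    ("2015-01-01", "yellow", 20.0),
    ("2015-01-02", "green", 5.5),
]


class SortedCopy:
    """A copy of the rows kept sorted on one column, with its keys alongside."""

    def __init__(self, rows, col):
        self.rows = sorted(rows, key=lambda r: r[col])
        self.keys = [r[col] for r in self.rows]

    def equal_to(self, value):
        # Binary search for the contiguous run of rows whose key equals value.
        lo = bisect_left(self.keys, value)
        hi = bisect_right(self.keys, value)
        return self.rows[lo:hi]


by_date = SortedCopy(rows, 0)  # "base table" clustered by pickup_date
by_cab = SortedCopy(rows, 1)   # second copy clustered by cab_type

green = by_cab.equal_to("green")
jan2 = by_date.equal_to("2015-01-02")
print(green)
print(jan2)
```

The trade-off is the usual one: every extra sort order costs storage and write work, in exchange for fast reads on query patterns the base ordering does not serve.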
