by jayleeg 2354 days ago
I've worked with MPP DBs, Hadoop, Spark, Elasticsearch, Druid, kdb, DolphinDB and now ClickHouse, and performance-wise it's all true: in our case ClickHouse was 10-20x faster than Spark and used 4x less memory. I've seen it outperform the fastest commercial timeseries stores by 2x.

This will make me unpopular, but my conclusion is that the file-based data lake, splitting data from compute, is not the right approach in many (not all) cases and that Spark was not really that revolutionary. I would go as far as to say that the direction data has taken has been a failure, and that ClickHouse and the like come closer to solving the real problem of 'BigData'.

So two things here about 'loading'...

1) ClickHouse table/data files are completely portable (like Parquet) and can be moved from one server to another, copied, cloned, etc. There is even a mechanism to allow remote execution or to pull just the files from a remote server, an S3 store, etc. Just because the CH native file format isn't spoken about in the same circles as Parquet and ORC doesn't mean it can't be treated the same way, if that's your thing. The CH native format is far more performant/compressible than Parquet or ORC and the specification is open source. Someone could implement a CH native file format SerDe for Hive, for example.

2) In this instance they were generating the data, so it's no different to running Spark, writing to a Parquet file and running analytics on it later. Spark can't generate this amount of data in this amount of time on these resources and also write it out compressed to Parquet or whatever other format you prefer. I've tried.

ClickHouse isn't perfect, and I'm not affiliated with the Altinity guys, but I can tell you this is the real deal.

1 comments

I would like to see comparisons between CH files and the alternatives. I would specifically challenge their compressibility vs ORC, which pretty much maxes out current compression techniques.
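
For what it's worth, a like-for-like comparison could be as simple as exporting the same dataset into each format and comparing bytes on disk against the raw input. Here is a minimal sketch, assuming the copies already exist on disk; the function names and the labels passed in are hypothetical, and directories are summed recursively because a ClickHouse table is stored as a directory of parts rather than a single file.

```python
import os


def total_bytes(path):
    """Size of a file, or the summed size of every file under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def compression_ratios(raw_path, encoded_paths):
    """Return {label: raw_size / encoded_size} for each encoded copy.

    raw_path      -- the uncompressed source (e.g. the original CSV)
    encoded_paths -- hypothetical mapping such as {"orc": ..., "clickhouse": ...}
    """
    raw_size = total_bytes(raw_path)
    ratios = {}
    for label, path in encoded_paths.items():
        encoded_size = total_bytes(path)
        if encoded_size == 0:
            raise ValueError(f"{label}: no data found at {path}")
        ratios[label] = raw_size / encoded_size
    return ratios
```

A higher ratio means better compression. This only measures size on disk, so it says nothing about read or write speed, and the result depends heavily on the codec and settings chosen for each format.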

As soon as I see the CH format being widespread enough to interact with the multitude of other tools that are available, I would consider getting on board. For now a "loadable" data warehouse does little for the kind of workflows we deal with, as the loading would take longer than the processing.

With regard to item two: we use a standard consumer GPU (GTX 1060) to handle the conversion from CSV to ORC/Parquet, and it is much, much faster and cheaper than a 20+ node Spark cluster, hence the preference to work on files.

As everything else runs off these files, it is kind of integral to our workload.