by zX41ZdbW 238 days ago
This and similar tasks can be solved efficiently with clickhouse-local [1]. Example:

    ch --input-format LineAsString --query "SELECT line, count() AS c GROUP BY line ORDER BY c DESC" < data.txt
I've tested it and it is faster than both sort and this Rust code:

    time LC_ALL=C sort data.txt | uniq -c | sort -rn > /dev/null
    32 sec.

    time hist data.txt > /dev/null
    14 sec.

    time ch --input-format LineAsString --query "SELECT line, count() AS c GROUP BY line ORDER BY c DESC" < data.txt > /dev/null
    2.7 sec.
It is like a Swiss Army knife for data processing: it can solve various tasks, such as joining data from multiple files and data sources, processing various binary and text formats, converting between them, and accessing external databases.

[1] https://clickhouse.com/docs/operations/utilities/clickhouse-...
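
One plausible reason a GROUP BY can beat `sort | uniq -c` is that counting with a hash table needs a single pass over the input and then sorts only the distinct lines, instead of sorting every line first. A minimal sketch of that idea with awk follows. It illustrates hash aggregation in general, not ClickHouse's actual implementation, and `line_hist` is a hypothetical helper name that does not appear in the thread.

```shell
# Hypothetical helper: count identical lines with an in-memory hash table,
# then sort only the distinct lines by count (descending), ties by line text.
line_hist() {
    awk '{ count[$0]++ }
         END { for (line in count) printf "%d\t%s\n", count[line], line }' "$1" |
        LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2
}
```

It could be timed on the same input as the commands above with `time line_hist data.txt > /dev/null`; no timing for it is claimed here.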

Disclaimer: the author of the comment is the founder and CTO of ClickHouse
And all their comments are shilling ClickHouse either directly or via a project built on top of it, without disclosure.
Considering that it's an open source tool, I don't know if it's that bad to be shilling for the commons, basically.
I've edited my profile to provide a link to GitHub.
Disclosure, not disclaimer.

They want to own the claims made.

Yes, sorry, it should be "disclosure"
When using clickhouse-local like this, does it build a logical plan and run the optimizer on it? Does it have any kind of code generation, since it knows the query (and physical data layout) ahead of time?
Exactly. I love this and DuckDB and other such amazing tools.
Just noting that in your benchmark (which we know nothing about), your "naive" data point is just 2.29x slower than hist. In their testing it was 27x slower! And it's not quite the same naive shell command, which isn't helpful.
I'd not heard of clickhouse before. It does seem interesting, but I just can't get behind a project that says:

> The easiest way to download the latest version is with the following command:

> curl https://clickhouse.com/ | sh

Like, sure, there is some risk downloading a binary or running an arbitrary installer. But this is just nuts.

It's Apache licenced and you could also install it via your favourite package installer. Given all the crazy supply chain attacks going on, I don't really feel this is any worse than downloading a binary from a distro archive, and specifically this pipe | sh doesn't expect you to run it as root (which a lot of other cut-and-paste installers do).
> I don't really feel this is any worse than downloading a binary from a distro archive

Please don't say that. It denigrates the work of all the packagers that actually keep our supply chains clean. At least in the major distributions such as Red Hat/Fedora and Debian/Ubuntu.

The distro model is far from perfect and there are still plenty of ways to insert malware into the process, but it certainly is far better than running binaries directly from a web page. You have no idea who has access to that page and its mirrors and what their motives are. The binary isn't even signed, let alone reviewed by anyone!
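
For readers who would rather not pipe a download straight into a shell, one common alternative is to save the script, read it, and compare it against a hash obtained through a separate channel before running it. A minimal sketch, assuming GNU `sha256sum` is installed. `verify_sha256` is a hypothetical helper, and the thread does not say whether ClickHouse publishes such a hash, so the expected value below is only a placeholder.

```shell
# Hypothetical helper: succeed only if FILE's SHA-256 matches EXPECTED.
verify_sha256() {
    file=$1
    expected=$2
    actual=$(sha256sum "$file" | cut -d ' ' -f 1)
    [ "$actual" = "$expected" ]
}

# Intended use (network access needed, so left commented out here):
#   curl -fsSL https://clickhouse.com/ -o install.sh   # URL from the quoted command
#   less install.sh                                     # read it before running it
#   verify_sha256 install.sh "<hash obtained out of band>" && sh install.sh
```

This only shows that the file matches what the hash's publisher intended. It says nothing about whether the script itself is trustworthy.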

I’m not sure how much better this is than blindly running “npm i thing”, where I have no real assurance I’m not downloading a giant piece of malware either.
That's exactly why it's insane. People remember the left-pad fiasco.

Previously discussed here: https://news.ycombinator.com/item?id=11348798

Article now resides here: https://www.davidhaney.io/npm-left-pad-have-we-forgotten-how...

>Like, sure, there is some risk downloading a binary or running an arbitrary installer. But this is just nuts.

It's literally exactly the same thing

chDB is just a binary. You can just grab that. Also, piping to sh is used by a ton of projects.
it's used by many projects but still regarded as an anti-pattern and security issue
it's really exactly the same as wget file;./file and not a real anti-pattern in any way
A ton of people drink and drive too, doesn't make it any more fine.
Y’all are so pure. Just don’t install it that way. Sheesh.
how is this any less secure than running a binary/installer? the binary could run this inside?
To be more fair you could also add SETTINGS max_threads=1 though?
How is that “more fair”?
Well, fair in the sense that we'd compare which implementation is more efficient. Surely, ClickHouse is faster, but is it because it's actually using superior algorithms, or is it just that it executes stuff in parallel by default? I'd like to believe it's both, but without "user%" it's hard to tell.
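
The single-threaded comparison suggested above could be set up roughly like this. It is a sketch, not a measurement: `hist_query` is a hypothetical helper that only prints the query from the top comment and, when given a thread count, appends a `SETTINGS max_threads=N` clause. Nothing here has been benchmarked.

```shell
# Hypothetical helper: print the histogram query from the top comment,
# optionally capped to N threads via ClickHouse's max_threads setting.
hist_query() {
    query='SELECT line, count() AS c GROUP BY line ORDER BY c DESC'
    if [ -n "${1:-}" ]; then
        query="$query SETTINGS max_threads=$1"
    fi
    printf '%s\n' "$query"
}

# Intended use (requires clickhouse-local, so left commented out here):
#   time ch --input-format LineAsString --query "$(hist_query 1)" < data.txt > /dev/null
```

In the output of `time`, a `user` plus `sys` total well above `real` suggests the work was spread over several cores; with one thread the two should be roughly equal, which is the "user%" information the comment asks for.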
Last time I checked, writing efficient, contention-free and correct parallel code is hard and often harder than pulling an algorithm out of a book.
Would you take half the wheels off a car to compare it to a motorcycle?
Motorcycles are faster than cars though
Not necessarily; that really depends on what you mean by fast. Cars definitely reach higher top speeds than bikes do, for example. If I'm not mistaken, racing electric cars also accelerate comparably to or faster than bikes. A bike can generally go around a track faster than a car, but that only holds true in dry conditions. There are many ways to define fast and what you actually mean.
Musk's roadster is currently going in excess of 10,000 mph. Which bike is faster than that? :)