| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by AlexClickHouse 428 days ago

Thanks for creating this issue, it is worth investigating!

I see you also created similar issues in Polars: https://github.com/pola-rs/polars/issues/17932 and DuckDB: https://github.com/duckdb/duckdb/issues/17066

ClickHouse has a built-in memory tracker, so even if there is not enough memory, it will stop the query and send an exception to the client, instead of crashing. It also allows fair sharing of memory between different workloads.

You need to provide more info on the issue for reproduction, e.g., how to fill the tables. 16 GB of memory should be enough even for a CROSS JOIN between a 10 billion-row and a 100-row table, because it is processed in a streaming fashion without accumulating a large amount of data in memory. The same should be true for a merge join.

However, there are places when a large buffer might be needed. For example, if you insert data into a table backed by S3 storage, it requires a buffer that can be in the order of 500 MB.

There is a possibility that your machine has 16 GB of memory, but most of it is consumed by Chrome, Slack, or Safari, and not much is left for ClickHouse server.

1 comments

Boxxed 424 days ago

Yeah I feel like I'm on crazy pills, I'm OoM'ing all these big data tools that everyone loves very trivially -- duckdb OoM'd just loading a CSV file, and Polars OoM'd just reading the first couple rows of a parquet file?

I do want to get a better reproduction on CH because it seems like it's an interplay between the INSERT INTO...SELECT. It's just a bit of work to generate synthetic data with the same profile as my production data (for what it's worth I did put quite a bit of effort into following the doc guidelines for dealing with low-memory machines).

link