by mytherin 1352 days ago
The particular way in which the data is loaded into DuckDB, combined with the particular machine configuration on which it is run, triggers a problem in DuckDB related to memory management. Essentially, the standard Linux memory allocator does not like our allocation pattern when doing this load, which causes the system to run out of memory despite freeing more memory than we allocate. More info is provided here [1].

As it is right now the benchmark is not particularly representative of DuckDB's performance. Check back in a few months :)

[1] https://github.com/duckdb/duckdb/issues/3969#issuecomment-11...


Thanks. Btw, we use DuckDB (via Node/Deno) for analytics (on Parquet/JSON), and so I must point out that despite the dizzying variation among various language bindings (cpp and python seem more complete), the pace of progress, given the team size, is god-like. It has been super rewarding to follow the project. Also, thanks for permissively licensing it (unlike most other source-available databases).

Goes without saying, if there are cost advantages to be had due to DuckDB's unique strengths, then a serverless DuckDB Cloud couldn't come soon enough.

> despite freeing more memory than we allocate

> despite DuckDB freeing more buffers than it is allocating

Can you please clarify how that is even possible?

We are allocating and freeing buffers repeatedly. Despite freeing more buffers than we allocate, memory usage might still increase because of fragmentation inside the allocator. Essentially, fragmentation can leave "unused" gaps that still count toward the memory usage of the process. This phenomenon is called heap fragmentation [1].

[1] https://cpp4arduino.com/2018/11/06/what-is-heap-fragmentatio...
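
To make the idea concrete, here is a minimal sketch of fragmentation in a toy allocator. It is not how glibc malloc works internally; `ToyArena`, its first-fit policy, and the slot counts are all made up for illustration. It only shows that the total free space can be large while no single free region is big enough for a request.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a heap: a fixed arena made of equally sized slots.
// Hypothetical illustration only; real allocators are far more elaborate.
struct ToyArena {
    std::vector<bool> used;  // used[i] is true when slot i is occupied

    explicit ToyArena(std::size_t slots) : used(slots, false) {}

    // First-fit: claim the first run of n contiguous free slots.
    // Returns the index of the first slot, or -1 if no such run exists.
    long allocate(std::size_t n) {
        assert(n > 0);
        std::size_t run = 0;
        for (std::size_t i = 0; i < used.size(); ++i) {
            run = used[i] ? 0 : run + 1;
            if (run == n) {
                std::size_t start = i + 1 - n;
                for (std::size_t j = start; j <= i; ++j) used[j] = true;
                return static_cast<long>(start);
            }
        }
        return -1;
    }

    // Mark n slots starting at `start` as free again.
    void release(std::size_t start, std::size_t n) {
        assert(start + n <= used.size());
        for (std::size_t j = start; j < start + n; ++j) used[j] = false;
    }

    // Total number of free slots, contiguous or not.
    std::size_t free_slots() const {
        std::size_t count = 0;
        for (bool u : used) {
            if (!u) ++count;
        }
        return count;
    }
};

// Fill a 16-slot arena with 1-slot blocks, free every other block,
// then ask for one 2-slot block.
inline bool fragmented_request_fails() {
    ToyArena arena(16);
    for (int i = 0; i < 16; ++i) arena.allocate(1);
    for (std::size_t i = 0; i < 16; i += 2) arena.release(i, 1);
    // Half of the arena is free, but no two free slots are adjacent.
    return arena.free_slots() == 8 && arena.allocate(2) == -1;
}
```

In this toy model `fragmented_request_fails()` returns true: half of the arena is free, yet a 2-slot request cannot be satisfied because the free slots are scattered.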

> Despite freeing more buffers than we allocate

Technically, I hope you understand that this isn't possible but maybe I am misinterpreting what you're trying to say.

  auto buff = malloc(N);
  free(buff);
  free(buff);
is one way to free "more" buffers than were allocated, but a double free is undefined behavior (UB), and depending on the underlying system allocator implementation it may or may not crash.

However, given how silly this would be I believe this is not what you're trying to convey?

Here's what mytherin wrote: "...we are allocating and freeing buffers repeatedly. Despite freeing more buffers than we allocate..."

So, I assume the context is this: DuckDB allocates x buffers, frees x - m of them at some point later, then allocates n buffers where n << m, and yet malloc fails.

In the GitHub thread mytherin linked to above, Alexey Milovidov, ClickHouse CTO, points out that ClickHouse uses jemalloc, which makes for a better choice than glibc malloc given the fragmentation issue. It is likely that DuckDB will switch to jemalloc, too.

You are misinterpreting it indeed.

The scenario I am describing is roughly the following:

Suppose we allocate 100K buffers that all have an equal size, and our memory usage is now 10GB. After that point we free 20K buffers, but allocate 10K more. In other words, from that point on we are freeing more buffers than we are allocating.

Now, since we are freeing more than we are allocating, you would expect our memory usage to go down. However, when using the standard glibc malloc on Linux, our memory usage unexpectedly goes up. After this happens several times in a row, the system runs out of memory and new calls to malloc fail.
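
For reference, the expected side of that scenario can be written down as simple bookkeeping. This is only a sketch of the arithmetic, assuming decimal units (10 GB spread over 100K buffers is about 100 KB per buffer); `LiveBuffers` and `expected_after_rounds` are hypothetical names, not DuckDB code. It tracks what memory usage would be if every freed buffer were immediately reusable, which is the baseline that glibc malloc fails to match here.

```cpp
#include <cassert>
#include <cstdint>

// Bookkeeping for equally sized buffers. This models the EXPECTED usage,
// not what glibc malloc actually reports.
struct LiveBuffers {
    std::int64_t count;         // buffers currently allocated
    std::int64_t buffer_bytes;  // size of each buffer in bytes

    // One round of the scenario: free some buffers, then allocate some.
    void round(std::int64_t freed, std::int64_t allocated) {
        assert(freed <= count);
        count += allocated - freed;
    }

    // Bytes in use if every freed buffer could be reused or returned.
    std::int64_t expected_bytes() const { return count * buffer_bytes; }
};

// Start from 100K buffers of 100 KB each (10 GB in total), then repeat
// "free 20K, allocate 10K" for the given number of rounds.
inline std::int64_t expected_after_rounds(int rounds) {
    LiveBuffers live{100000, 100000};
    for (int i = 0; i < rounds; ++i) {
        live.round(20000, 10000);
    }
    return live.expected_bytes();
}
```

Under these assumptions each round should lower usage by about 1 GB: `expected_after_rounds(1)` gives 9 GB and `expected_after_rounds(5)` gives 5 GB. The observed behavior with glibc malloc goes in the opposite direction.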