| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by kragen 529 days ago

This is a very nice article! The diagrams in particular are very clear. (They seem to have been done in draw.io: https://github.com/8dcc/8dcc.github.io/blob/main/img/pool-al... which seems to be a pretty decent Apache-2.0 free software licensed diagram editor: https://github.com/jgraph/drawio/)

I think it would be improved further if it began with the calling interface callers use to invoke the allocator, because it's easier to understand the implementation of some service such as pool allocation when you begin by knowing what service, exactly, is required. I began with the assumption that the author was describing an arena allocator under another name, rather than a fixed-size block allocator, both because the APR's influential arena allocator is called "apr_pool_t" and because one of the main headings in the table of contents (a greatly appreciated feature, like all the typography of the article) was "Closing the pool". It took me quite a while to figure out that I was mistaken. It's plausible that I only had this problem because I am an unusually stupid, ignorant, or pigheaded person (some would say all three), but, if not, many other people will have the same difficulty I had of taking several minutes to figure out what the topic of the article is. (I could have saved myself by clicking either of the first two links in the Introduction, but I didn't because I thought I already knew what it was.)

The current top-voted comment on this thread is by someone who had the same reading comprehension problem I did, but never realized their erorr: https://news.ycombinator.com/item?id=42643951 and nobody had pointed out their mistaek until I did just now. So I think we can say with some confidence that many HN readers will have the same difficulty.

I think that the article would also be improved by a few other small additions:

- It says, "The pool allocator, however, is much faster than malloc", but doesn't provide any benchmarks or even an order of magnitude.

- Probably when one benchmarks the implementation given, one will find that it is only slightly faster than jemalloc or even gnumalloc, because they both dispatch small allocations to a fixed-block allocator similar to this one, and the dispatch isn't very expensive. This can be fixed by inlining the allocation fast path and the deallocation function, which will then compile down to a few instructions each. A simple demonstration of how to do this is in my arena allocator kmregion in http://canonical.org/~kragen/sw/dev3/kmregion.h and http://canonical.org/~kragen/sw/dev3/kmregion.c (GPLv3+). The current implementation of 8dcc libpool is guaranteed to be unable to do this unless you're using link-time optimization or a so-called "unity" or "amalgamation" build process.

- It's worth distinguishing between average-case performance (in which this allocator is slightly better than malloc) and worst-case performance (in which case malloc, generally speaking, isn't even in the running).

- It may be worth mentioning that often such pool allocators are used in conjunction with initializing the objects in the pool to a valid state when the pool is created; that way you don't have to reinitialize objects to a valid state every time you deallocate and reallocate them, which is usually more work than malloc does to find the right free list. This does require not overwriting the user data with your freelist pointer, so it's a space/time tradeoff.

- Maybe you should mention alignment, especially now that even on amd64 GCC has taken to using new instructions that require alignment.

Even as it is, though, it's highly educational, visually pleasant, and very understandable (once I got over my initial misconception of having a totally wrong idea of what the article was about).

1 comments

8dcc 529 days ago

> The diagrams in particular are very clear. (They seem to have been done in draw.io: https://github.com/8dcc/8dcc.github.io/blob/main/img/pool-al... which seems to be a pretty decent Apache-2.0 free software licensed diagram editor: https://github.com/jgraph/drawio/)

Yes, this is true. Furthermore, the diagram information for draw.io is included in the SVG images themselves, so one could edit them there.

> It says, "The pool allocator, however, is much faster than malloc", but doesn't provide any benchmarks or even an order of magnitude.

You are right, I was referring to "malloc-like" allocators (i.e. for variable-size blocks). It would be a good idea to benchmark them.

link

kragen 529 days ago

In general when you're working on an optimization (and this is an optimization) benchmarking it is essential. About a third of the optimizations I try end up making my code slower instead of faster, often because of performance considerations I hadn't thought of. You can't just say "it's much faster" without measuring.

link

cb321 529 days ago

Modern CPUs (with all their buffers & caches & predictor warm-ups & spin-ups) can make benchmarking really tricky unless you have a very specific workload in mind and even then it requires care (like looking at both the edge of[1] and full timing distributions[2] for best-case, average/median/typical case, worst case analysis, etc.). Briefly, the number of ways to get (un)lucky has proliferated over the decades, as has the variety of deployment platforms.

Given the depth of that rabbit hole, as writing advice, I might suggest instead motivating via "more space efficient" and refer to exactly fitting fixed-size objects where more general purpose allocators will (usually)¹ waste space to the next power of two. Measuring less dynamic space is also easier (very dynamic space like %cache used can be tricky, though). Controlling scope and pedagogy in general is also tricky, but I think this rhetorical advice is common practice and might also double as a way to address your initial misread of the topic. EDIT: he can always do refinements / more analysis in follow up posts/something if this library has any legs for him.

[1] https://github.com/c-blake/bu/blob/main/doc/tim.md

[2] https://github.com/c-blake/bu/blob/main/doc/edplot.md

¹ Sure, fancy any-size allocators can tack on a histogram of maybe somewhat rounded lg sizes (to economize histo space) and spawn fixed-size object sub-arenas if there are enough of them, say >= enough to fill a VMem page or 2, but this is rare in my experience and also complicates the dispatch to subarenas that you mention above to slightly fancier search. This kind of topic rapidly spirals into allocator-complete discussion and even systems-complete - e.g. a filesystem is largely an allocator with good old MS-DOS FAT being close to a "file is a pool of disk blocks" model.

link

kragen 529 days ago

This is an excellent comment. Thank you.

link

8dcc 529 days ago

Yes, you are right. Do you have any suggestions on how I should do the benchmarking? I am not sure how this is usually done, but if the output is some kind of graph, I would also like to know how these graphs are usually produced.

link

kragen 529 days ago

I'm no expert on this, so probably someone else can give you better advice, but I'm happy to share what I do know. Here are the most important points.

Benchmarking is a full-fledged scientific experiment. You have a hypothesis that a particular change that you can make, such as using a different allocator, will have a particular effect, such as making your program run in less time. So you make just that one single change and measure the results, while carefully controlling all other conditions that could also affect your program's run time. You have to do the test more than once in order to find out whether you've succeeded in controlling them. If you do a good enough job controlling the conditions of the experiment, it becomes reproducible and therefore falsifiable. Programming usually isn't science, but benchmarking is. A great thing about it is that each run of the experiment can take a fraction of a second, so you can do the whole scientific method from beginning to end in a few minutes and get a reproducible, falsifiable result.

Benchmarking can get arbitrarily sophisticated, but the most important thing is to dodge the bullets that can turn benchmarks into nonsense, rather than produce very sophisticated graphs or get very high precision.

The simplest thing on Linux or macOS is to compile a command-line executable you can run in a terminal and run

    time ./testprogram

at least three times to get an idea of how long the wallclock time is and how variable it is. (The user and system times shown are not very reliable anymore.) Presumably there is also some way to do the same thing on Microsoft Windows. If you're on a laptop, check to see if it makes a difference if you're plugged in. Also, check whether the result from

    time ./testprogram; time ./testprogram

is consistent; sometimes lags in CPU speed scaling can produce unpredictable results.

Linux is not very consistent, so you should expect variations in the range of 1–10% from run to run when your program takes on the order of a second to run.

If you get consistently shorter wallclock times after recompiling the program with a different allocator and the same compiler options (probably make sure optimization isn't completely turned off), you can probably attribute that difference to the different allocator. It's a good idea to switch back to the original allocator and verify that you can reproduce your original results to make sure you didn't accidentally change something else at the same time (maybe your laptop went into low-power mode because the battery is getting low, or maybe you changed the number of iterations in a loop and forgot you changed it, or maybe Firefox loaded a YouTube video in the background and is making the machine swap, that kind of thing).

It's useful to ensure that you aren't swapping and that your CPU load other than the benchmark is minimal. `top` (or, better, `htop`) and `vmstat 5` (or, better, `dstat 5`) can be helpful here. Unless your program is heavily multithreaded, the CPU load is not as crucial as it used to be now that we all have multicore processors.

It's helpful to check the particular version of the program you're benchmarking into Git, including the Makefile or whatever specifies the compiler options. That way when you write down your results you can refer to the specific Git commit hash. Also, record your compiler version, your operating system version, and your CPU. Ideally someone who isn't you should be able to read your notes on the benchmark, run it again, and get the same results, if they have access to the same hardware.

The actual benchmark program itself can be as simple as allocating and deallocating in a loop, ideally inside another loop to run the whole benchmark multiple times; but on processors with cache (which includes every PC since the early 90s) the working set size can be very important, and you may get significantly different results for allocating 20 kilobytes 64 bytes at a time and 20 gigabytes 64 bytes at a time. If you pass in the number of iterations for one of the loops on the command line, instead of hardcoding it in the program code, it's easy to start with a small iteration count and work your way up a factor of 10 at a time until the program takes a few seconds to run. In this case, also be sure to record the specific command line you were measuring the performance of in your notes; knowing that a program took 2.205–2.216 seconds to run is of no value if you don't know if it was running a thousand iterations or a million.

It's a good idea to actually compute something that you output using the memory you're allocating and deallocating, just in case GCC's optimizer decides that your benchmark loop has no effect and therefore can be safely removed from your program. (In http://canonical.org/~kragen/sw/dev3/kmregion_example.c I computed the sum of the data value stored in each node of the 24th of all the 5000-node linked lists I allocated and deallocated. That's because, when I read the disassembly of my test code with `objdump -d kmregion_example`, I found out that GCC had removed most of the code inside the test loop because it could prove that nobody ever read the struct fields I was initializing. Reading disassembly is neither necessary nor sufficient for dodging all bullets that invalidate benchmarks.)

Varying the number of iterations in both the innermost and outermost loop should produce roughly proportional increases and decreases in wallclock time. (A helpful technique is to subtract the time for 1000 iterations from the time for 1001000 iterations to get the time for the last million iterations.) If it doesn't, you're not measuring what you think you're measuring. For example, especially with JIT compilers, you may be measuring startup overhead. That bit me this week with http://canonical.org/~kragen/sw/dev3/hadamardsin.js, which I erroneously reported was slower to run than the LuaJIT version. It turns out that it's slightly faster to run, just much slower to compile.

Running the benchmark on more than one computer is helpful because sometimes changes that speed things up on one system slow them down on another. I remember a recursive backtracking search program for logic circuit optimization I wrote in C long ago which used longjmp() to terminate the search when it found a solution; it ran reasonably fast on Linux but very slowly on my friend's Macintosh because MacOS implemented setjmp() as sigsetjmp(), transforming it from a very cheap operation into a very expensive one.

That's all you need to know to reliably do optimizations. In fact, that's probably more than you need to know. Everything below this point is extra.

— ⁂ —

Allocators are especially tricky to benchmark because often, by changing where data is stored in memory, they profoundly affect the speed of the rest of your program, in ways that vary from program to program. More memory fragmentation, for example, can increase how many TLB misses your program runs into. I don't know what to do about that except try to benchmark some real programs as well as simple nested loops.

In cases where you're trying to measure an effect that's roughly the size of your measurement precision, statistical tests can be useful. https://github.com/c-blake/bu/blob/main/doc/edplot.md linked by cb321 above discusses using the 2-sample Kolmogorov-Smirnov test, which is not nearly as terrifying as it sounds. But most effects will be either much larger than your measurement error, in which case the result is obvious enough that you don't need the statistics, or much smaller, in which case all they'll tell you is that you can't be sure about the direction of the effect, much less its magnitude. In medicine, where your experiments take years and people die in them, using sophisticated statistics to make the most of your precious data is very important. In performance benchmarking, where your experiments take milliseconds, it's usually better to improve your measurement precision instead. Sometimes just running the same benchmark more times is enough, but there are more powerful techniques available.

A way to get higher precision than `time` can give you is to run the program under `valgrind` (on Linux). Even raw valgrind will tell you exactly how many instructions it ran, and in most cases that number is the same from run to run, so instead of a measurement with ±3% precision like `time` usually gives you when you've controlled everything properly, you get a measurement typically with ±0.00000001% precision. This is fantastic, far beyond most things you can do in a physics, electronics, or chemistry lab. It means that even with just two runs you can tell if your optimization affected that number by even 0.0000001%, or, more interestingly, 0.1%. And it isn't affected by things like which CPU you run it on or whether your laptop is unplugged, and it only slows the program down about 10×.

Unfortunately, it measures the wrong thing, because a program that runs more instructions can still be faster. `valgrind --tool=cachegrind` tries to simulate your CPU's cache hierarchy to give you a better idea of how long the program will actually take; the `kcachegrind` GUI can give you a "cycle estimation" number based on the numbers it spits out. But it can still be unrelated to how fast the program actually runs for a variety of reasons; system calls are the biggest one. `kcachegrind` is also a "profiler": it shows you how much time (simulated instructions, simulated cache misses, cycle estimation, whatever) is used by each subroutine in your program.

You might wonder why an optimization that speeds your program up by 0.1% matters. The answer is that if you do 106 such optimizations you have sped your program up by 10%. SQLite has improved its performance by more than a factor of 3 over a 10-year period by using this approach: https://www.sqlite.org/cpu.html

There's a more basic profiler called gprof built into GCC; building your program with `-pg` will add profiling instrumentation in to it, so that it writes profiling information into a file called `gmon.out`. After running a program built with gprof, you can run `gprof testprogram` to analyze `gmon.out`; it will tell you how much time each subroutine took (including and excluding its transitive callees) and how many times it was called.

Linux also has a built-in profiler called `perf_events` https://perfwiki.github.io/main/ which uses hardware performance counters to get numbers like cachegrind's, but using the real hardware instead of simulation, and including a lot more detail; it can also profile the time your program spends in system calls. I've only used it in very basic ways, so I don't know very much about it. Intel also has a proprietary thing called VTune which can do things perf can't and runs on OSes that aren't Linux, but doesn't support AMD's processors.

— ⁂ —

I hope this is of some value to you! I'd be happy to answer any other questions. Hopefully correctly.

link

8dcc 529 days ago

Wow. This is probably the best comment I have ever seen on HN. Thank you so much for the information, I will add benchmarking to my 'libpool' project and mention it in the article. I can't think of any questions right now, but I will let you know if anything comes up. Thank you again. :)

Aww, thanks! *blush*

Actually, I have one question. You said:

> The actual benchmark program itself can be as simple as allocating and deallocating in a loop [...]

What do you mean, exactly? Deallocating immediately after allocating? I feel like there would be a big difference between that and allocating multiple blocks before freeing them.

link

cb321 529 days ago

You are right - there will be a big difference for these two different workloads. As I am 100% sure @kragen already knows, this is arguably the first, smallest step in thinking for you to realize just how many possible workloads you might want to model. :-) And to realize why senior engineers will say pithy things like "the best benchmark is your actual application".

While that is true (and I've said it myself), one reason I like the design of a little "workload interpreter shell" as in https://github.com/c-blake/bst/blob/main/test/bst-interact.c (or maybe better https://github.com/c-blake/adix/blob/master/tests/btshell.ni... which has a preprocessor) is that you can use them to run "isolated tests" as a kind of "half step" between a pure hot loop dumb thing and a full-on application with all its attendant noise and self-interaction/self-disturbance.

So, for example, in this allocation problem setting you could write a similar 30-line "shell/interpreter" that interprets a binary stream of "commands" or allocation requests. The origin a of said binary stream could be a recording/log from some actual app you have doing something actually useful. Then you could apply that one recording in the "shell" in different configurations as @kragen mentions to study the allocator performance (both space & speed) in isolation. At that point you could start to say pretty meaningful things from replaying logs many times about this or that tradeoff on this|that workload.

EDIT: Of course, at that point you start to get into debates about "how representative" different workloads are, but at least you could potentially share your log with someone who had a different computer and have them run benchmarks, too, and at least the workload is not synthetic but relates to some actual possible program. Moving from synthetic benchmarks to log-based ones was a big thing in OS research on filesystems in the 1970s & 80s and it helped the "generalizability of results" (I think, anyway..I'm sure many still do only synthetic benchmarking due to its convenience).

EDIT2: I just thought of a cute name for the log replay half-step between synthetic microbenchmarks and a full application - "The millibenchmark". ;-) Maybe someone else has thought of this before, though.

link

kragen 529 days ago

You are sure right about that! The example I linked allocates 5000 blocks and makes a linked list out of them, then deallocates them.

link