The time complexity of the libc++ deque push_front implementation is O(log n) | HN Mirror

Y	Hacker News new \| ask \| show \| jobs

	The time complexity of the libc++ deque push_front implementation is O(log n) (1f604.blogspot.com)
	80 points by x1f604 1666 days ago

9 comments

colanderman 1666 days ago

The best thing I learned in algorithms class is:

For all realizable n, ln(n) is less than 40.

So when amortizing over a large window (say, a block size of 512), it can be the case that, for any realizable n, the logarithmic amortized costs are completely dominated by the constant non-amortized costs (adding an element to a block in this case)... making the operation effectively constant-time.

tomerv 1665 days ago

Practically speaking, it is not a good advice to just treat log(n) as if it is constant. While it's true that log(n) is always small in practice, the conclusion should not be to ignore it, but rather to notice the constant factor. And in practice, usually data structures with O(log n) complexity also have a bigger hidden constant. For example, std::unordered_map is much faster in practice than std::map. Of course, this is not strictly correct, it's just a heuristics. Quicksort with its O(log n) [Edit: O(n log n)] complexity is a counter-example to this.

colanderman 1665 days ago

To be clear, that is not the advice I'm giving -- but rather, when your performance looks like p*log n + q, if q is much greater than p/40 -- that is, the constant term dwarfs the logarithmic term -- then it is safe to consider it constant.

xdavidliu 1665 days ago

> p*log n + q, if q is much greater than p/40 -- that is, the constant term dwarfs the logarithmic term

I think you meant to say "if q is much greater than p TIMES 40".

colanderman 1665 days ago

Ah good catch, yes you are correct.

kadoban 1665 days ago

That seems like a pretty good argument _for_ treating it as constant though and just shifting your focus to how large the constants actually are.

jltsiren 1665 days ago

In a virtual memory system, random access to an array of size n takes O(log n) time, and the constant factors in that O(log n) are also nontrivial. Algorithms that do O(log n) computation with O(log n) independent elements tend to take O(log^2 n) time, while those that do O(log n) computation with O(log n) contiguous elements or O(log n) iterations with O(1) elements still take O(log n) time. If the constant factors are small enough, it can be hard to distinguish the latter two from algorithms doing O(1) computation with O(1) elements.

HALtheWise 1665 days ago

In practice, for any memory system with caches and limited by the speed of light, random (unpredictable) access to an array of size n takes much closer to O(sqrt(n)), not O(log(n)). There's an excellent article discussing this that you can search for, and it holds both in emperical tests on modern hardware and in the theoretical physical limit.

jltsiren 1665 days ago

That depends on your perspective. If multiple levels of memory hierarchy are relevant (such as when scaling from 1 MiB to 1 GiB), you will see something resembling O(sqrt(n)). If you remain within the same level (e.g. from 1 GiB to 1 TiB), the scaling resembles O(log n) more closely. Or, in other words, it depends on whether you assume that cache size grows with n or is independent of it.

kwertyoowiyop 1665 days ago

And of course, nothing is important until you’ve profiled your code and measured that it is.

gpderetta 1665 days ago

Random fact of the day: The currently best known upper bound for the complexity of the Union-Find algorithm is the reverse Ackermann function, which can be treated as a constant (4) for all remotely practical (and most of the impractical) values of n.

tylerhou 1665 days ago

Tarjan's proof of inverse* Ackermann bound is hard to understand; it's much easier to understand the proof of the log*(n) bound, which is also an extremely slow growing function: https://people.eecs.berkeley.edu/~vazirani/algorithms/chap5....

where log* is the iterated log function; i.e. the number of times you have to apply log repeatedly until you reach a value of 1 or smaller. For reference, log_2*(10^10000) = 5.

https://en.wikipedia.org/wiki/Iterated_logarithm

maple3142 1665 days ago

I wonder if it is true that everything is O(1) if there is an upper bound? Even for an algorithm with O(n^n) complexity, it is still O(1) if n is bounded, just with a extremely large constant.

colanderman 1665 days ago

n has to be bounded much much smaller, and the constant overhead of the algorithm in question much much larger, for any of those approximations to be valid. The approximation works for log n only because it's not uncommon to have constant overheads which dwarf log n for all plausible n (and only works in such cases!) This is very unlikely to be true even for a linear algorithm.

blahgeek 1666 days ago

While I very much appreciate and respect author's detailed analysis, I am still not 100% convinced without a corresponding benchmark result. If it's really O(log n) vs O(1), it should be very easy to verify in some micro-benchmark.

Instead of that, the author mentioned that:

> The tests that I did run, initially the libstdc++ and the libc++ versions have roughly the same performance, but the performance diverges after a number of insertions somewhere between 6,400 and 12,800. After that the runtime of the libc++ version is roughly 2-3 times that of the libstdc++ version. (...) the outcomes of which depend on the compiler version anyway.

This does not seem right.

x1f604 1666 days ago

Here are the results of the benchmark that I did: https://imgur.com/a/2lCktTO

It didn't look quite right to me so I didn't post it.

Here's another benchmark that I did where you can clearly see that something has gone wrong: https://imgur.com/a/s2rA8qE

logicchains 1666 days ago

Anyone who programs low-latency software knows std::deque is an absolutely horrible-performance deque implementation due to how much it allocates and the pointer chasing involved. In most cases it's better to use a vector-backed deque (ring buffer) for anything performance-sensitive.

mgaunard 1666 days ago

That is not true.

Anyone who programs low-latency C++ knows that the libstdc++ implementation is great (which is what 99.9% of people use) while others tend to be less stellar.

It's just a segmented vector. The libstdc++ implementation always allocates one segment even if empty, and while I've seen low-latency guidelines arguing for empty things not allocating, my personal guideline is to always allocate reasonable capacity on construction rather than on first insertion.

A ring buffer is a completely different data structure, it only works for fixed-capacity queues.

logicchains 1666 days ago

>which is what 99.9% of people use

Maybe we have different ideas about what constitutes "low-latency" but in HFT std::deque is rarely used. Much like std::unordered_map, which allocates every insert potentially costing up to a microsecond for each insert.

>It's just a segmented vector.

https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USER... we can look at the source. This "segmented vector" is `_Tp** _M_map`, essentially a vector of vectors. This means potentially twice the pointer-chasing, as we have to do two lookups. It also means allocation/deallocation each time a new segment is added/destroyed, which leads to more allocation than when using a vector (first need to allocate the segment, then potentially allocate entries within that).

>A ring buffer is a completely different data structure, it only works for fixed-capacity queues.

Where possible it's better to use a fixed capacity queue, especially one where the capacity is a power of 2 so wrap-around calculation can be done more efficiently. But the same kind of thing can be done for a dynamic-capacity queue, by resizing the whole backing array when capacity is reached.

dan-robertson 1666 days ago

HFT needn’t mean low latency is required but of course the two tend to be related.

If trivial allocations might cost you a microsecond then you can’t allocate at all on a latency-sensitive critical path and should probably get a better allocator. Also, be wary of power-of-two sized things because if they are aligned they are likely to have poor cache performance (too many things compete for the same set of cache lines).

mgaunard 1666 days ago

I've worked in many HFT shops and that's not my experience. A lot of those firms have poor C++ code though, but std::deque was typically used in the better ones.

It's not one allocation per element, as said previously it's a segmented vector. It works similarly to virtual memory: fixed-size segments are allocated for contiguous storage, while another parent array points to the segments. So appending/prepending either needs to append/prepend to the last/first segment or to allocate a new segment if full, which is what makes those operations more efficient than on a vector. Yes there is indirection, that's what segmented means. You also get stability of the memory location which is important for many data structures. The article even tells you how big the segments are in the two implementations being looked at.

Going back to HFT, a similar data structure that's definitely heavily used there is a pool, which has a configurable segment size (potentially variable) and doesn't have order, enabling faster deallocation with a freelist.

monocasa 1665 days ago

I thought a resizable ring buffer was a valid implementation, but maybe I'm thinking of Rust's VecDeque and there's something about the semantics of std::deque that doesn't allow that.

kadoban 1665 days ago

I don't think insert at front/back is allowed an amortized bound, it has to be flat constant. But then I don't see how libc++'s is conforming if this article is true, so I'm not entirely sure what's going on there.

monocasa 1665 days ago

Ohhhh, I think I found the issue with a ring buffer implementation. Would edit my post if I was in the edit window.

resize() doesn't invalidate existing references in std::deque.

mgaunard 1665 days ago

How could you make any operation O(1) with resizing?

CoolGuySteve 1665 days ago

Most STL data structures take an allocator template argument. If you fight the compiler enough to make your own std::allocator with preallocation then the STL is actually pretty nice.

IshKebab 1665 days ago

I think he might be talking about a segmented ring buffer, which is similar to a deque but each segment is a ring buffer. It gives much better insertion complexity.

Anyway I have to echo his point that I've never found a situation where deque performs better than vector, even when you really might expect it to. I guess the extra allocations are too slow. Maybe it would be more useful with arenas or something.

ComputerGuru 1666 days ago

> my personal guideline is to always allocate reasonable capacity on construction rather than on first insertion.

That works until probabilistically there is a decent chance the capacity will always be zero.

323 1666 days ago

How do you feel about the absl containers? Are they also slow for low-latency?

ncmncm 1665 days ago

They are OK. But "slow for low latency" is not really defined; for many low-latency cases, only static or at-startup allocation is permitted, thus not deque at all. If you might need to build for production with MSVC, definitely do not use the std::deque that comes with it.

krona 1666 days ago

CppReference defines the time complexity is constant: https://en.cppreference.com/w/cpp/container/deque/push_front

So does this mean: - We're talking about different complexities (Amortized vs something else) - The libc++ implementation isn't standards conformant - The analysis in the c++ standard is incorrect. - Something else

Kranar 1665 days ago

Neither is correct. Generally when complexity is specified in the standard, it's with respect to a specific operation. In this case the standard specifies that it's the number of calls to the constructor of T that must be constant, which means that push_back and push_front can not copy or move its elements.

There is never a way to guarantee that the physical amount of time is bounded by O(1) in the worst case. You could always have pathological scenarios or data structures written so that performing a move or copy of a single T could require O(n^n) time for all anyone knows.

gpderetta 1665 days ago

Memory allocation in particular is very hard to analyze.

im3w1l 1666 days ago

It's not conformant.

mkl 1666 days ago

To make bullet points on HN, put a blank line between the paragraphs.

beached_whale 1666 days ago

std::deque is odd in that there's no std size of the blocks used. We have libstdc++ that is about 512bytes, libc++ at 4096bytes, and MS STL at around 64bytes. This leaves a bunch of room for optimizing for certain cases. Smaller is more insert friendly, larger more op[] reading friendly. But push_front after a pop_front should be O(1) on all of them. Otherwise it is moving/copying data to make room. Another nice/similar data structure is something like Boosts devector that adds a pointer to vector for the front capacity.

nly 1666 days ago

Boosts deque let's you decide the the block size.

brandmeyer 1665 days ago

> MS STL at around 64bytes.

Citation? Are you sure it isn't 64 size-independent elements?

MauranKilom 1665 days ago

It's worse actually. The MSVC STL uses 16 byte (or however large your element is, if greater) chunks. Source:

https://github.com/microsoft/STL/blob/main/stl/inc/deque#L56...

They have made clear that this won't be changed, for ABI stability reasons.

That makes std::dequeue basically unusable in a portable context. In virtually any situation, "allocate each element separately" and "allocate elements in 4k chunks" are on opposite ends of the performance spectrum.

brandmeyer 1665 days ago

That is quite unfortunate, and I freely admit that I'm unfamiliar with MSVC's current quirks.

> They have made clear that this won't be changed, for ABI stability reasons.

Also unfortunate. Once upon a time, users had to be extra careful to control which version of the runtime DLL their program transitively linked to since MS would issue a new ABI-breaking version with every major compiler release. Sounds like they got ABI stability religion just in time to lock in some unfortunate decisions.

beached_whale 1665 days ago

Thanks, I remembered incorrectly.

williamkuszmaul 1665 days ago

Very nice post!

Small comment: Ideally, big-O notation is for upper bounds. If you are doing lower bounds, you should ideally use big-Omega notation. But Omegas are harder to format in plain text, so it may be better to abuse notation and use big-O...

im3w1l 1666 days ago

So iiuc:

The libc++ implementation is bad. The libstdc++ implementation is fine. The issue with the former is that it doesn't have enough spare capacity so it has to shift elements around too often.

Actually I think the push_front is even worse than stated: O(n). Consider an array with capacity N+1, contains N elements. Now if you alternate push_front and pop_back then every push_front will cause memmove of N elements.

Oh and to make like a 4th addition to this comment: It's kind of funny that the code is so unreadable that the author found it more helpful to look at the generated assembler code.

rightbyte 1666 days ago

Reading those C or C++ standard libraries is like a joke. Almost nothing is done in a direct or clear way and internal methods have cryptic names hidden in macros.

Maybe for a good reason I dunno. But it would be nice if the code was clearer so you could make sense of it when gdb or callgrind jumps into an inlined chunk ...

brandmeyer 1665 days ago

> internal methods have cryptic names

They choose names like _Capitalized and __lowercase because those identifiers are reserved for the implementation. Its a consequence of the preprocessor's lack of hygiene.

So where you might see a convention of naming members like m_thingish_foo, in the implementation's library headers they would be named _M_thingish_foo or __m_thingish_foo.

rightbyte 1665 days ago

Ye true. Maybe I am just whining because the inlined chunks in e.g. callgrind or gdb looks so random. I should use "-Og" more often ...

brandmeyer 1665 days ago

GDB has a pluggable pretty-printing system.

http://sourceware.org/gdb/wiki/STLSupport

That doesn't necessarily help if you are backtracing from your own lambda up through some <algorithm>. But it does help many other use cases with the standard library.

ncmncm 1665 days ago

The most important thing to know about std::deque is that the implementation in MSVC is really just awfully bad. So, if your code might need to be built there, you are should consider using an implementation from somewhere other than std::.

Labo333 1666 days ago

Interesting!

It reminds me of the hash maps that are said to be amortized O(1) but can be O(n) for some sequences of operations in various languages like Python and JS: https://blog.toit.io/hash-maps-that-dont-hate-you-1a96150b49...