So when amortizing over a large window (say, a block size of 512), it can be the case that, for any realizable n, the logarithmic amortized costs are completely dominated by the constant non-amortized costs (adding an element to a block in this case)... making the operation effectively constant-time.
Practically speaking, it is not a good advice to just treat log(n) as if it is constant. While it's true that log(n) is always small in practice, the conclusion should not be to ignore it, but rather to notice the constant factor. And in practice, usually data structures with O(log n) complexity also have a bigger hidden constant. For example, std::unordered_map is much faster in practice than std::map. Of course, this is not strictly correct, it's just a heuristics. Quicksort with its O(log n) [Edit: O(n log n)] complexity is a counter-example to this.
To be clear, that is not the advice I'm giving -- but rather, when your performance looks like p*log n + q, if q is much greater than p/40 -- that is, the constant term dwarfs the logarithmic term -- then it is safe to consider it constant.
In a virtual memory system, random access to an array of size n takes O(log n) time, and the constant factors in that O(log n) are also nontrivial. Algorithms that do O(log n) computation with O(log n) independent elements tend to take O(log^2 n) time, while those that do O(log n) computation with O(log n) contiguous elements or O(log n) iterations with O(1) elements still take O(log n) time. If the constant factors are small enough, it can be hard to distinguish the latter two from algorithms doing O(1) computation with O(1) elements.
In practice, for any memory system with caches and limited by the speed of light, random (unpredictable) access to an array of size n takes much closer to O(sqrt(n)), not O(log(n)). There's an excellent article discussing this that you can search for, and it holds both in emperical tests on modern hardware and in the theoretical physical limit.
That depends on your perspective. If multiple levels of memory hierarchy are relevant (such as when scaling from 1 MiB to 1 GiB), you will see something resembling O(sqrt(n)). If you remain within the same level (e.g. from 1 GiB to 1 TiB), the scaling resembles O(log n) more closely. Or, in other words, it depends on whether you assume that cache size grows with n or is independent of it.
Random fact of the day: The currently best known upper bound for the complexity of the Union-Find algorithm is the reverse Ackermann function, which can be treated as a constant (4) for all remotely practical (and most of the impractical) values of n.
where log* is the iterated log function; i.e. the number of times you have to apply log repeatedly until you reach a value of 1 or smaller. For reference, log_2*(10^10000) = 5.
I wonder if it is true that everything is O(1) if there is an upper bound? Even for an algorithm with O(n^n) complexity, it is still O(1) if n is bounded, just with a extremely large constant.
n has to be bounded much much smaller, and the constant overhead of the algorithm in question much much larger, for any of those approximations to be valid. The approximation works for log n only because it's not uncommon to have constant overheads which dwarf log n for all plausible n (and only works in such cases!) This is very unlikely to be true even for a linear algorithm.
While I very much appreciate and respect author's detailed analysis, I am still not 100% convinced without a corresponding benchmark result. If it's really O(log n) vs O(1), it should be very easy to verify in some micro-benchmark.
Instead of that, the author mentioned that:
> The tests that I did run, initially the libstdc++ and the libc++ versions have roughly the same performance, but the performance diverges after a number of insertions somewhere between 6,400 and 12,800. After that the runtime of the libc++ version is roughly 2-3 times that of the libstdc++ version. (...) the outcomes of which depend on the compiler version anyway.
Anyone who programs low-latency software knows std::deque is an absolutely horrible-performance deque implementation due to how much it allocates and the pointer chasing involved. In most cases it's better to use a vector-backed deque (ring buffer) for anything performance-sensitive.
Anyone who programs low-latency C++ knows that the libstdc++ implementation is great (which is what 99.9% of people use) while others tend to be less stellar.
It's just a segmented vector. The libstdc++ implementation always allocates one segment even if empty, and while I've seen low-latency guidelines arguing for empty things not allocating, my personal guideline is to always allocate reasonable capacity on construction rather than on first insertion.
A ring buffer is a completely different data structure, it only works for fixed-capacity queues.
Maybe we have different ideas about what constitutes "low-latency" but in HFT std::deque is rarely used. Much like std::unordered_map, which allocates every insert potentially costing up to a microsecond for each insert.
>It's just a segmented vector.
https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USER... we can look at the source. This "segmented vector" is `_Tp** _M_map`, essentially a vector of vectors. This means potentially twice the pointer-chasing, as we have to do two lookups. It also means allocation/deallocation each time a new segment is added/destroyed, which leads to more allocation than when using a vector (first need to allocate the segment, then potentially allocate entries within that).
>A ring buffer is a completely different data structure, it only works for fixed-capacity queues.
Where possible it's better to use a fixed capacity queue, especially one where the capacity is a power of 2 so wrap-around calculation can be done more efficiently. But the same kind of thing can be done for a dynamic-capacity queue, by resizing the whole backing array when capacity is reached.
HFT needn’t mean low latency is required but of course the two tend to be related.
If trivial allocations might cost you a microsecond then you can’t allocate at all on a latency-sensitive critical path and should probably get a better allocator. Also, be wary of power-of-two sized things because if they are aligned they are likely to have poor cache performance (too many things compete for the same set of cache lines).
I've worked in many HFT shops and that's not my experience.
A lot of those firms have poor C++ code though, but std::deque was typically used in the better ones.
It's not one allocation per element, as said previously it's a segmented vector. It works similarly to virtual memory: fixed-size segments are allocated for contiguous storage, while another parent array points to the segments. So appending/prepending either needs to append/prepend to the last/first segment or to allocate a new segment if full, which is what makes those operations more efficient than on a vector. Yes there is indirection, that's what segmented means. You also get stability of the memory location which is important for many data structures.
The article even tells you how big the segments are in the two implementations being looked at.
Going back to HFT, a similar data structure that's definitely heavily used there is a pool, which has a configurable segment size (potentially variable) and doesn't have order, enabling faster deallocation with a freelist.
I thought a resizable ring buffer was a valid implementation, but maybe I'm thinking of Rust's VecDeque and there's something about the semantics of std::deque that doesn't allow that.
I don't think insert at front/back is allowed an amortized bound, it has to be flat constant. But then I don't see how libc++'s is conforming if this article is true, so I'm not entirely sure what's going on there.
Most STL data structures take an allocator template argument. If you fight the compiler enough to make your own std::allocator with preallocation then the STL is actually pretty nice.
I think he might be talking about a segmented ring buffer, which is similar to a deque but each segment is a ring buffer. It gives much better insertion complexity.
Anyway I have to echo his point that I've never found a situation where deque performs better than vector, even when you really might expect it to. I guess the extra allocations are too slow. Maybe it would be more useful with arenas or something.
They are OK. But "slow for low latency" is not really defined; for many low-latency cases, only static or at-startup allocation is permitted, thus not deque at all. If you might need to build for production with MSVC, definitely do not use the std::deque that comes with it.
So does this mean:
- We're talking about different complexities (Amortized vs something else)
- The libc++ implementation isn't standards conformant
- The analysis in the c++ standard is incorrect.
- Something else
Neither is correct. Generally when complexity is specified in the standard, it's with respect to a specific operation. In this case the standard specifies that it's the number of calls to the constructor of T that must be constant, which means that push_back and push_front can not copy or move its elements.
There is never a way to guarantee that the physical amount of time is bounded by O(1) in the worst case. You could always have pathological scenarios or data structures written so that performing a move or copy of a single T could require O(n^n) time for all anyone knows.
std::deque is odd in that there's no std size of the blocks used. We have libstdc++ that is about 512bytes, libc++ at 4096bytes, and MS STL at around 64bytes. This leaves a bunch of room for optimizing for certain cases. Smaller is more insert friendly, larger more op[] reading friendly. But push_front after a pop_front should be O(1) on all of them. Otherwise it is moving/copying data to make room. Another nice/similar data structure is something like Boosts devector that adds a pointer to vector for the front capacity.
They have made clear that this won't be changed, for ABI stability reasons.
That makes std::dequeue basically unusable in a portable context. In virtually any situation, "allocate each element separately" and "allocate elements in 4k chunks" are on opposite ends of the performance spectrum.
That is quite unfortunate, and I freely admit that I'm unfamiliar with MSVC's current quirks.
> They have made clear that this won't be changed, for ABI stability reasons.
Also unfortunate. Once upon a time, users had to be extra careful to control which version of the runtime DLL their program transitively linked to since MS would issue a new ABI-breaking version with every major compiler release. Sounds like they got ABI stability religion just in time to lock in some unfortunate decisions.
Small comment: Ideally, big-O notation is for upper bounds. If you are doing lower bounds, you should ideally use big-Omega notation. But Omegas are harder to format in plain text, so it may be better to abuse notation and use big-O...
The libc++ implementation is bad. The libstdc++ implementation is fine. The issue with the former is that it doesn't have enough spare capacity so it has to shift elements around too often.
Actually I think the push_front is even worse than stated: O(n). Consider an array with capacity N+1, contains N elements. Now if you alternate push_front and pop_back then every push_front will cause memmove of N elements.
Oh and to make like a 4th addition to this comment: It's kind of funny that the code is so unreadable that the author found it more helpful to look at the generated assembler code.
Reading those C or C++ standard libraries is like a joke. Almost nothing is done in a direct or clear way and internal methods have cryptic names hidden in macros.
Maybe for a good reason I dunno. But it would be nice if the code was clearer so you could make sense of it when gdb or callgrind jumps into an inlined chunk ...
They choose names like _Capitalized and __lowercase because those identifiers are reserved for the implementation. Its a consequence of the preprocessor's lack of hygiene.
So where you might see a convention of naming members like m_thingish_foo, in the implementation's library headers they would be named _M_thingish_foo or __m_thingish_foo.
That doesn't necessarily help if you are backtracing from your own lambda up through some <algorithm>. But it does help many other use cases with the standard library.
The most important thing to know about std::deque is that the implementation in MSVC is really just awfully bad. So, if your code might need to be built there, you are should consider using an implementation from somewhere other than std::.
For all realizable n, ln(n) is less than 40.
So when amortizing over a large window (say, a block size of 512), it can be the case that, for any realizable n, the logarithmic amortized costs are completely dominated by the constant non-amortized costs (adding an element to a block in this case)... making the operation effectively constant-time.