Hacker News new | ask | show | jobs
by harrid 942 days ago
Yes... But you don't need to memorize all this, and it leaves out a simple rule: vectors (or arrays) outperform everything else if the dataset is small. Small is usually in the order of 100-300, but can vary wildly.

Also note that all <algorithm>s have built-in fully automatic parallelism via <execution>, a massively underused feature. In typical CPP fashion though, their newer views:: counterparts lack those overloads for the moment.

8 comments

Never heard of `execution`, great tip. Is there some big caveat or just generally unknown?
You need to be aware that these invocations are going to blast your cores full throttle as you obviously don't have fine grained control. But as long as your data is easily parallelized on a vector with computations that don't depend on each other, it's a game changer. I use it all the time to multithread things with literally a single line of code.
I looked into it (my video, probably too long https://youtu.be/9oh66SF91LA?si=azDCSOAJKA9Gpzim), and the general result was that they make sense for non-small datasets and are a solid way to to parallelize something without having to pull in OpenMP or something.
They're only supported in MSVC and GCC (for the latter you need to link against Intel's TBB to make it work). Support in libc++ (Clang) is work in progress.
Clang does support parallel stl already (requires either TBB or OpenMP). Our project https://github.com/elalish/manifold made use of this to speed up mesh processing algorithms a lot.
Does it? You mean if you link against libstdc++ instead of libc++?
I remember it works for libc++ (partially, see https://libcxx.llvm.org/Status/PSTL.html), but forgot when I linked against libc++ last time...
Incomplete support from Clang’s STL (especially in Apple Clang).
Yes, but actually small can often be much larger then 100-300 depending on the specifics. Programmers often vastly underestimate how fast cache, and prefetch can go compared to complicated data structures.
... Or much smaller, I remember a benchmark for a case I had a few years ago, the cutoff I had for a map being faster than std::vector/array & linear probing was closer to N=10 that 100.

(Not std::map, at the time it must have been something like tsl:: hopscotch_map).

Note also that nowadays for instance boost comes with state-of-the-art flat_map and flat_unordered_map which gives both the cache coherency for small sizes and the algorithmic characteristics of various kinds of maps

Exactly. I measured vector vs an efficient linear probing hash map very recently and the cutoff was single digit. Even against unordered_map or plain std::map the cutoff was surprisingly low (although in this case I would trust a synthetic benchmark significantly less).
Sure, the point here is with all the specific context specific details, your most likely comparing apples versus oranges. So simple complexity analysis or a general rule without benchmarks and a good understanding of the system and how it interacts with your details is not going to solve your problem.
So you're saying that if I had had to store 100 elements in the memory, I would be better off using hashmap instead of vector/array? What type of elements did you use in your experiment or how large they were and what was your access pattern?
A successful search in a vector will do, on average, 50 comparisons, while the hash map version would hash the key, look up the bucket, typically find a single-item in that bucket (with only a 100 items in the hashtag, hash collisions will be highly unlikely), and do a single comparison.

For an unsuccessful search, the vector version would do 100 key comparisons, and the hashtag would do a single hash, lookup the bucket, and almost certainly find it empty.

So, if you make the comparison function relatively expensive, I can see the hash map being faster at search.

Even relatively short string keys might be sufficient here, if the string data isn’t stored inline. Then, the key comparisons are likely to cause more cache misses than accessing a single the bucket.

Of course, the moment you start iterating over all items often, the picture will change.

Searching the vector is literally incrementing a pointer over the data. The number of instructions needed to do the search is very small - e.g. ~15. This means that it can very easily fit into the CPU uOp cache but also makes it a candidate for the LSD cache. Both of those will be a major factor in hiding the latencies or getting rid of them in the CPU frontend fetch-decode pipeline, effectively making all the for-loop iterations only left to be bound by the CPU backend, or more specifically, branch-mispredictions (aka ROB flushing) and memory latencies.

Given the predictable access nature of vectors and their contiguous layout in the memory, the CPU backend will be able to take advantage of those facts and will be able to hide the memory latency (even within the L1+L2+L3 cache) by pre-fetching the data on consecutive cache-lines just as you go through the loop. Accessing the data that resides in L1 cache is ~4 cycles.

The "non-branchiness" of such code will make it predictable and as such will make it a good use of BTB buffers. Predictability will prevent the CPU from having to flush the ROB and hence flushing the whole pipeline and starting all over again. The cost of this is one of the largest there are within the CPU and it is ~15 cycles.

OTOH searching the open-addressing hashmap is the super-set of that - e.g. almost as if you're searching over an array of vectors. So, only the search code is: (1) By several factors larger, (2) Much more branchy, (3) Less predictable and (4) Less cache-friendly.

Algorithmically speaking, yes, what you're saying makes sense, but I think the whole picture can only be made once the hardware details are also taken into account. Vector approach will literally be only bound by the number of cycles it takes to fetch the data from L1 cache and I don't see that happening for a hash-map.

No, benchmark it for your particular type, and decide based on that.
I think you're missing my point. I'm highly suspicious, or let's say intrigued, under what conditions one can come up with such conclusion. Therefore I asked for a clarification.
vectors (or arrays) outperform everything else if the dataset is small. Small is usually in the order of 100-300, but can vary wildly.

This is a very poor way to choose a data structure. How many items you want to store is not what someone should be thinking about.

How you are going to access it is what is important. Looping through it - vector. Random access - hash map. These two data structures are what people need 90% of the time.

If you are putting data on the heap it is already because you don't know how many items you want to store.

Isn't that because view performs the procedure and access when the data is accessed? It's mainly meant so that you don't have to load all this memory in or stall when accessing the view, and if you don't need that capability, the regular algorithm stl is better.
iirc, on msvc, execution parallel delegates to the OS (windows) to decide how many threads to create, at it is usually more than the total number of vcpu contrary to the usual recommendation.
It's very much implementation defined yes. I'm currently using this for something that runs for about 10 seconds, and even music playback and mouse cursor movement gets affected.

(But I'm about to move it to GPU)

Not quite. Many other datastructures can be shoehorned into contiguous, cache efficient representations.
I rarely meet a CS major who will accept that small things will always stay small. They will frequently talk you into more complex data structures with the assurance that you are just too stupid to realize your problem will suddenly need to scale several orders of magnitude. Do they teach you in CS-101 to interpret '+' as the exponential operator? It often feels that way.
Are they fresh graduates? It is very important to understand the workload distribution for any optimization. Even if small things can sometimes get large, optimizing for the small case can often yield large gain as they may occur frequently. And complex data structures are usually worse in the small case...
There’s computer science and there’s software engineering. The best developers are good at both.

…but in order for this to really matter, communication is required, since even the best developers don’t scale.

It's even more true for larger collections. Stroustrup gave a talk on this back at GoingNative2012. Tl;Dr: "Use a Vectah" -Bjarne

https://youtu.be/YQs6IC-vgmo

A vector should be people's default data structure, but this presentation is bizarre, because it is based on looping through every element of a vector or a list to find the element you want.

This is never a scenario that should happen, because if you are going to retrieve an arbitrary element it should be in a hash map or sorted map.