Hacker News new | ask | show | jobs
by mehrdadn 1926 days ago
If you're wondering whether this is a theoretical or practical problem: I actually observed some of this effect in practice first, and only after thinking about it for a while did the larger issue (and the complexity implications) dawn on me.

I had something like a set<string> or a set<set<string>> (or map... I can't remember which) somewhere in my program a few years ago, and I was trying to improve the program's performance. I tried breaking into it several times and found it quite perplexing that the bottleneck appeared to be the set<> container. I mean, I get that cache locality and all has an effect, but it seemed to be having a little too much of an effect. Why did I find this odd? Because other languages (like C#) have tree-based sets and maps too, but I'd never felt they were quite as slow as I was seeing them in C++. So I felt something weird must be going on.

I tried to step through the program for a while and observe what's going on, and at some point, I realized (perhaps I hit the same breakpoint twice? I don't recall) that these functions were being called more often than I intuitively thought would be necessary. Which was obvious in hindsight, as less() needs to be called multiple times on the same objects (4 times at level 2). Now, this hadn't resulted in quadratic behavior, but that was only because my data structures weren't arbitrarily deep—the slowdown was nevertheless a significant constant factor at the core of my program, only linear because the data structure's depth was bounded.

So once I had realized the implications of this, including that constant-factor differences can actually turn into polynomial ones, I eventually decided to post an article about it on arXiv... hence this article. Aside from the complexity issue illustrated in the article, one of my (unexpected) higher-level takeaways was that you can't just take any utility function in a program and blindly put it into a library: even if it's correct, you probably have hidden assumptions about its use cases that won't hold true in general. You really need to think about how it could be used in ways you didn't initially expect, and this may well require a different kind of code review with an entirely different mindset than before. It really drove home the point for me that writing a library is quite a bit different (and in some ways more difficult) than writing an application.

It's possible there is more theory lying underneath this that I haven't touched on—it would be interesting if someone can take this and find more interesting consequences of it. For example, maybe when analyzing algorithms we need to consider something akin to a "recursive time complexity", too? Is there another useful notion of complexity that we can generalize here?

Anyway, hope people enjoy reading it and take something useful away from it. :)

3 comments

By the way for the hashing example in count() - Rust has the Entry API for HashMaps to avoid exactly this problem: https://doc.rust-lang.org/std/collections/struct.HashMap.htm...

The PartialOrd trait also uses 3-way comparisons so I think the other issue is mitigated too, but it'd be interesting to check: https://doc.rust-lang.org/std/cmp/trait.PartialOrd.html#tyme...

I'm failing to understand some things, maybe because I glanced through the paper with morning coffee. Apology if my tone ends up a little bit harsh, but these are just constructive criticism :)

The first one is lt2 and lt3 implementations. You are implementing cmp2 through lt2, but lt3 through cmp3 (is this omission?). Both of them are stack-sensitive. Without being too harsh, I'm getting the impression that the intention was to write the most horrible comparison possible, which is different than worst-case time complexity.

In the paper, lt2 (actually cmp2 in the paper) will always be at least two passes, and lt3 is at least one pass. I would not say they are two/single pass algorithms because complexity increases depending on list depth when lists are involved.

Maybe I'm wrong, but both Python and C++ comparison operators are designed to be general-purpose comparison functions (and I'm more sure about C++, because this was touted through hundreds of books). As such, they should be good enough for most average cases. If you want speed, you go with balanced trees or something funkier.

Also, for C++ Tree implementation, you are again using probably the worst approach - appending to vector recursively. Use list for this. Python list implements all sorts of tricks, comparing to C++ vector.

And the last thing, but not the least: C++ containers depend on implementation (gcc libstdc++, stlport, msvc whatever), and I've seen substantial speed differences in standard operations. Hell, my old (almost conforming) list implementation was much faster than libstdc++ implementation because it wasn't trying to be too clever with slices and other magic.

I'm sad you haven't used a more scientific approach with much more rigor here: what C++ compiler was used, what version, what assembly output was produced, on what processor, after how many runs, etc... Claiming "Python is faster than C++" sounds like a clickbait title.

I'm not trying to write the most horrible comparison at all. Perhaps the most important thing to keep in mind here is that this is a general computer science paper, and my comparison of C++ and Python is just intended to serve as a familiar (and vivid) illustrative example of the general phenomenon I'm trying to describe. The paper is emphatically not intended to be a "Python vs. C++" paper. Everything you see there that is "concrete" (the language, the running times, etc.) is intended to be a mere illustration of the overarching concept (design decisions & their consequences) being discussed, and it could manifest itself in any language.

The context to keep in mind when reading the paper is: When designing a programming language & its standard library (or any API), we need to define an interface we can use as a building block, and we're analyzing the consequences of our choice of building blocks. In particular, we first examine the case of comparison-based data structures, which requires defining ordering primitives. In C++, the primitive is the < operator. In Python 2, it's cmp(). (In Python 3, it's a mix of < and ==, whose implications I discuss as well.) We assume user-provided types implement that basic interface, and we implement everything else we need on top of that.

So the question I'm analyzing in that example is: What happens if my primitive comparison operation is a 2-way comparison (lt(), like in C++) and then I implement 3-way comparison in terms of that (such as when I need it for for a binary search tree)? Now, what if we do the opposite: what happens if instead my primitive comparison operation is a 3-way comparison (cmp(), like in Python 2) and I only need to perform a 2-way comparison later? What are the trade-offs?

To do this, I take both approaches, implementing each in terms of the other, and compare how they behave complexity-wise. The conclusion I come to is that the choice of the primitive (which is often made by the language designers) isn't merely a question of aesthetics, but rather, it actually affects the time complexity of common operations (like traversing search trees). Similarly, the decision to cache a hash code doesn't just result in a constant-factor improvement, but it can actually change the time complexity of a program. And so on.

I think if you re-read the paper with these in mind, it should hopefully make more sense. The rest of what you said doesn't enter the picture at all... these are already balanced binary trees, the decision to use less<> is fundamentally independent of what C++ stdlib implementation you use, and the time of the vector concatenation isn't even being measured. Those things are unrelated to the point of the paper entirely. I was just trying to minimize the extraneous lines of code so we can focus on the heart of the problem instead of getting distracted by boilerplate.

Thanks for the interesting case. I guess many readers have read too much into lt2 and lt3 but overlooked the code under 2.4.2: C++ could actually be that bad. In that code, you only showed the timing in comments. The result may be worth a table. Perhaps another way to structure the manuscript is to give the surprising result of 2.4.2 first and then to explain what makes that happen.
Yeah—I think I laid out the paper more like a story, but I might indeed need to change that as it appears it leaves people confused before they get to the punchline. Thanks for the feedback! Glad you found it interesting.
1. People often use set instead of unordered_set (and same for map) despite not needing order. This slows things down.

2. The C++ standard library's maps and sets are known to be rather slow. See, for example:

https://stackoverflow.com/q/42588264/1593077

when you have string values, it's even worse, as you describe. But it's not clear that an overly-clever implementation, which caches numeric ranks of strings etc., is a good idea to have.

Ordering is not the only concern here. std::set actually provides a logarithmic worst-case guarantee, whereas std::unordered_set does not. This is a factor to consider depending on the application, regardless of whether ordering is necessary. Whichever one prefers in any case, though, is beside my point—I'm merely trying to use trees and hashtables to illustrate a far more general CS phenomenon that can occur in lots of data structures and languages.
If performance is a concern then you should still avoid std::set though by default. Logarithmic worst case when it's just always slow isn't really useful.

There may be a benchmark out there where std::set can beat std::unordered_set, but you'll be really hard pressed to find it

Sure, but this isn't a benchmarking paper.
Imho "but technically..." is not a valid opinion while in practice the access is O(1) on average. Yea sure, it becomes linear if your hash is "return 42;", it can't grantee that you supplied a good hasher.
It's not technically. Worst-case guarantee is required for some applications. Hash Maps have better amortized complexity but some implementations have bad worst case time.

Java uses balanced trees instead of linked list in their chained-hashtable implementation, if I recall correctly.

There are lots of reasons to prefer trees (and, correspondingly, lots of reasons to prefer hashtables); I just pointed out ordering isn't the only one, and I merely gave another (worst-case performance guarantee). For example, yet another one is iterator stability; insertion can invalidate iterators for unordered containers, making them useless for some applications that need to analyze data efficiently as they insert.

I could go on, but it's very much detracting from the point of the paper to argue about the data structure choice when the paper isn't on this topic at all or trying to analyze any particular application to begin with.

Technically is a very valid opinion if you're writing software that needs to be robust and secure in the face of user input. std::set is guaranteed to be logarithmic strictly in the number of comparisons in the worst case without exception.

std::unordered_set has some very difficult to reason about worst case scenarios that must factor in both the size of the collection and the hash function. Unless you are very careful about the type of hashing you use, you can easily be vulnerable to DDOS attacks that scale on the order of O(k * n) for key size k and container size n.

Note that this isn't just some user-supplied bad hashers. Hash for std::string in GNU library is not much harder to hack for collisions than your example. And also it uses the whole string, so is slow on long strings. I didn't investigate if other implementations of STL are any better.
> People often use set instead of unordered_set (and same for map) despite not needing order. This slows things down.

Aren't unordered_set and unordered_map quite new (IIRC, they came only with C++0x)? For most of C++'s history, if you preferred to use the standard library, what you had was only ordered sets and maps.

Depends what’s your definition of “quite new”. That’s over 10 years go now :)
They've been in the language for about a decade now, and before that we had the tr1 hash_map classes that were available in most environments.
C++: the language that's been around for 40 years and is completely new!
Actually, C++ has barely been around at all. It is only with C++20, IIANM, that they key features Bjarne Stroustrup wanted to imbue it with are in place (IIRC; see his "Design & Evolution of C++" book from 1994). Some might even argue that the language C++ _should_ be is not even all here; it's still gradually arriving.
By now, this is no longer "quite new". A recent C++ community survey suggests that under 10% of developers currently use C++98/C++03 the most. Naturally this is not a valid sample of the whole userbase, but it's a good indication.

Also, in 2005 IIANM, Google published their dense and sparse hash map implementations, which were fast-ish and quite usable.

They were in TR1 in 2005, if I am not mistaken.
Why is the default set implementation ordered in the first place? The formal data structure is unordered, which probably informs people's assumptions about its performance characteristics. Should it not be "set" and "ordered_set"?
One reason would be because std::set guarantees O(log n) complexity for each operation on the worst case. std::unordered_set has average complexity O(1) and O(size) for the worst case (it being a hash table) which can be unexpected to debug in those rare cases.
Probably because the underlying implementation is a binary search tree which requires a comparison operator. Haskell likewise has Data.Set which requires an Ord instance. There's HashSet which requires a Hashable instance which can be automatically derived for any data type, so in principle you don't need ordering, but I would imagine in C++ you would have to provide your own hashing function from whatever data type you're trying to put in a hash set.
> Why is the default set implementation ordered in the first place?

"Sorted" rather than ordered (an order can be arbitrary and I'd personally associate the word with insertion order).

> The formal data structure is unordered

The formal data structure doesn't have complexity bounds, the STL does.

It's a mis-feature. And since C++ is very committed to backwards compatibility, they didn't change it later on.

If/when `std2::` happens, this is one of the things I assume would change.