Hacker News new | ask | show | jobs
by blovescoffee 877 days ago
Because people tend to write c++ for performance reasons and the perf profiles of std::vector are not always sufficient
2 comments

The only thing you can really quibble about with std::vector is whether your library has made an optimal choice of growth strategy, which you can often hack around by reserving. Aside from that, access via `operator[]` and growth via `emplace_back` will compile down to optimal code that is going to be close to impossible to beat. After the compiler gets done with it, it looks the same as if you had hand-coded it with arrays in a C-style, but without the plethora of bugs that often results from that approach.
> The only thing you can really quibble about with std::vector is [...]

That's actually not true, though I certainly don't fault you for believing it :-) but there are definitely more things to quibble about around vector if you're serious about performance. As an example, try writing a can_fit(n) function, which tells you whether the vector can fit n elements without reallocating. Observe the performance difference between (a) a smart manual version, (b) a naive manual version, and (c) the only STL version: https://godbolt.org/z/88sfM1sxW

  #include <vector>

  template<class T>
  struct Vec { T *b, *e, *f; };

  template<class T>
  bool can_fit_fast(Vec<T> const &v, size_t n) {
    return reinterpret_cast<char*>(v.f) - reinterpret_cast<char*>(v.e) >= n * sizeof(T);
  }

  template<class T>
  bool can_fit(Vec<T> const &v, size_t n) {
    return v.f - v.e >= n;
  }

  template<class T>
  bool can_fit(std::vector<T> const &v, size_t n) {
    return v.capacity() - v.size() >= n;
  }

  struct S { size_t a[3]; };

  template bool can_fit_fast(Vec<S> const &, size_t);
  template bool can_fit(Vec<S> const &, size_t);
  template bool can_fit(std::vector<S> const &, size_t);
I was curious why the std::vector version produced so much more assembly, so I tried to dig a little. At least in libc++, the relevant code is (cleaned up):

    template<typename T, 
    class vector {
    private:
        pointer __begin_;
        pointer __end_;
        __compressed_pair<pointer, allocator_type> __end_cap_;
    public:
        constexpr const pointer& __end_cap() const noexcept {
            return this->__end_cap_.first();
        }
        constexpr size_type size() const noexcept {
            return static_cast<size_type>(this->__end_ - this->__begin_);
        }
        constexpr size_type capacity() const noexcept {
            return static_cast<size_type>(__end_cap() - this->__begin_);
        }
    };
So if the functions are fully inlined we should end up with

    template<class T>
    bool can_fit(std::vector<T> const &v, size_t n) {
        return static_cast<size_t>(v.__end_cap_.first() - v.__begin_) - static_cast<size_t>(v.__end_ - v.__begin_);
    }
At least algebraically (and ignoring casts) it should be equivalent to v.__end_cap_.first() - v.__end_, which is more or less what the manual implementations do. Maybe the optimizer can't make that transformation for some reason or another (overflow and/or not knowing the relationship between the pointers involved, maybe)?

If you change can_fit(Vec<S>) to:

    return (v.f - v.b) - (v.e - v.b) >= n;
You end up with code that looks pretty similar to the can_fit(std::vector<S>) overload (the same for clang, a bit different for GCC), so it does seem it might be something about the extra pointer math that can't be safely reduced, and the casts aren't really relevant.

(I'm also a bit surprised that can_fit_fast produces different assembly than the Vec can_fit overload)

Kind of, yeah. The main thing to realize here is that pointer subtraction is not just arithmetic subtraction of the addresses; it's also followed by a division by sizeof(T). (This is not obvious in this context unless you've seen it before.) Thus for the compiler to prove that (f - b) - (e - b) == f - e, it has to prove the remainders of each address subtraction (mod sizeof(T)) don't affect the result. This is certainly possible, but it requires compilers to actually do some extra legwork (e.g.: assume these come from the same array, then prove that implies the remainders don't affect the result, etc.) prior to reducing these to arithmetic operations. For whatever reason they don't do that. My guess is the reason is something between "we believe the general case would break too much existing code" and "we believe this special case isn't common enough to be worth it". But who knows; maybe if someone files this as a missed optimization, they would try to implement it. I haven't tried.

(There's also a secondary issue here, which is that sizeof(S) isn't a power of 2. That's what's introducing multiplications, instead of bit shifts. You might not see as drastic of a difference if sizeof(S) == alignof(S).)

Yeah, the extra instructions from the division helped clue me into what might be going on, since the individual (f - b) and (e - b) calculations are visible in Clang's output.

I feel the division by sizeof(T) shouldn't matter that much, since the compiler knows it has pointers to T so I don't think the divisions would have remainders. I want to say pointer overflow and arithmetic on pointers to different objects (allocations?) should also be UB, so I suppose that might clear up most obstacles? I think I'm still missing something...

Does make me wonder how frequently this pattern might pop up elsewhere if it does turn out to be optimizable.

My 1st paragraph was directly answering your 2nd paragraph here (starting from "This is certainly possible..." to the end). I was saying, compilers can optimize this if they want to, but it requires work to implement, and I can only guess (the reasons I listed) as to why they might not have done so yet.

> Does make me wonder how frequently this pattern might pop up elsewhere if it does turn out to be optimizable.

Probably a fair bit, but as I mentioned, it might break a lot of code too, because there's too much code in the wild doing illegal things with pointers (like shoving random state into the lower bits, etc.). Or not... the Clang folks would probably know better.

I wonder if you actually measured it with proper compiler options? I saw some folks insist that STL is somehow slow blaming unnecessary layers but many of them are pretty much groundless. There are some genuine spots where STL has a real performance problems, but I don't think your example illustrates it well.
Click the link in my post, I added it so you can see everything in action for yourself.

(Although, note I didn't claim "the STL is slow"... that's painting with a much broader stroke than I made.)

Why would that be faster? Those function calls are going to be inlined.
It's not the function calls they are alluding to, it's the way the compiler generates a bunch of shifts and multiplies except in the can_fit_2 case.
Turns out GCC is more clever than Clang here, and MSVC is just horrendous. (See update, I posted a link.)
I somewhat agree with your point (esp. that MSVC is hideous) but I also stand by mine. I don't feel that checking the capacity is something that would be in my tight loop, because checking for capacity to hold N is something you do before adding N items, which amortizes the cost, meaning the capacity check is just off the real hot path. So it doesn't feel like a realistic use case.

Speaking of realism, putting these in quickbench seems to confirm that the differences between them are not material, and that the STL version is in fact the quickest, but they are all essentially free. There's not a way to make a realistic microbenchmark for this, for the same reason that it doesn't feel like a real-world performance issue.

By the way clang does a much better job here: https://quick-bench.com/q/XcKK782d-7A6YHbiBRTlOnIRnPY

There's also the issue that std::vector<bool> is required by the standard to be specialized as a bitset, which is a footgun in generic code since you can normally take the address of a vector element but not if it's a vector of bool. Having a bitset in the standard library is fine but it should have been a seperate type.

Admittedly that's not a performance issue, but it's annoying.

That's true but I think everyone knows about vector<bool> being quirky. By the way, the standard does not require vector<bool> to be implemented as a bitset. Instead, it relaxes some of the details of vector, in a way that allows the implementation to do it that way. But these choices are implementation details.

Vector<bool> is a little weird if you are just starting with C++, but it does have major performance benefits in its niche, and it came from the 1990s so we can be generous in overlooking its rough edges.

It came from the 1990s, but it overstayed its welcome. I remember seeing proposals to remove the vector<bool> special case, what is the situation?
What specifically is wrong with vector? There have been a lot of hash maps done with flat memory to minimize allocations and pointer hopping over the STL but vector doesn't have those problems.