Hacker News new | ask | show | jobs
by aw1621107 877 days ago
I was curious why the std::vector version produced so much more assembly, so I tried to dig a little. At least in libc++, the relevant code is (cleaned up):

    template<typename T, 
    class vector {
    private:
        pointer __begin_;
        pointer __end_;
        __compressed_pair<pointer, allocator_type> __end_cap_;
    public:
        constexpr const pointer& __end_cap() const noexcept {
            return this->__end_cap_.first();
        }
        constexpr size_type size() const noexcept {
            return static_cast<size_type>(this->__end_ - this->__begin_);
        }
        constexpr size_type capacity() const noexcept {
            return static_cast<size_type>(__end_cap() - this->__begin_);
        }
    };
So if the functions are fully inlined we should end up with

    template<class T>
    bool can_fit(std::vector<T> const &v, size_t n) {
        return static_cast<size_t>(v.__end_cap_.first() - v.__begin_) - static_cast<size_t>(v.__end_ - v.__begin_);
    }
At least algebraically (and ignoring casts) it should be equivalent to v.__end_cap_.first() - v.__end_, which is more or less what the manual implementations do. Maybe the optimizer can't make that transformation for some reason or another (overflow and/or not knowing the relationship between the pointers involved, maybe)?

If you change can_fit(Vec<S>) to:

    return (v.f - v.b) - (v.e - v.b) >= n;
You end up with code that looks pretty similar to the can_fit(std::vector<S>) overload (the same for clang, a bit different for GCC), so it does seem it might be something about the extra pointer math that can't be safely reduced, and the casts aren't really relevant.

(I'm also a bit surprised that can_fit_fast produces different assembly than the Vec can_fit overload)

1 comments

Kind of, yeah. The main thing to realize here is that pointer subtraction is not just arithmetic subtraction of the addresses; it's also followed by a division by sizeof(T). (This is not obvious in this context unless you've seen it before.) Thus for the compiler to prove that (f - b) - (e - b) == f - e, it has to prove the remainders of each address subtraction (mod sizeof(T)) don't affect the result. This is certainly possible, but it requires compilers to actually do some extra legwork (e.g.: assume these come from the same array, then prove that implies the remainders don't affect the result, etc.) prior to reducing these to arithmetic operations. For whatever reason they don't do that. My guess is the reason is something between "we believe the general case would break too much existing code" and "we believe this special case isn't common enough to be worth it". But who knows; maybe if someone files this as a missed optimization, they would try to implement it. I haven't tried.

(There's also a secondary issue here, which is that sizeof(S) isn't a power of 2. That's what's introducing multiplications, instead of bit shifts. You might not see as drastic of a difference if sizeof(S) == alignof(S).)

Yeah, the extra instructions from the division helped clue me into what might be going on, since the individual (f - b) and (e - b) calculations are visible in Clang's output.

I feel the division by sizeof(T) shouldn't matter that much, since the compiler knows it has pointers to T so I don't think the divisions would have remainders. I want to say pointer overflow and arithmetic on pointers to different objects (allocations?) should also be UB, so I suppose that might clear up most obstacles? I think I'm still missing something...

Does make me wonder how frequently this pattern might pop up elsewhere if it does turn out to be optimizable.

My 1st paragraph was directly answering your 2nd paragraph here (starting from "This is certainly possible..." to the end). I was saying, compilers can optimize this if they want to, but it requires work to implement, and I can only guess (the reasons I listed) as to why they might not have done so yet.

> Does make me wonder how frequently this pattern might pop up elsewhere if it does turn out to be optimizable.

Probably a fair bit, but as I mentioned, it might break a lot of code too, because there's too much code in the wild doing illegal things with pointers (like shoving random state into the lower bits, etc.). Or not... the Clang folks would probably know better.

Right, I suppose my second paragraph was waffling on the extra legwork the compiler would have to do, but I know validating optimizations is just the start of the work.

Maybe this could be a good way to jump into messing with LLVM...

Out of curiosity, how much of a performance difference did you observe in practice when you made this optimization?

I don't recall unfortunately, it's been a few years.