by nanolith 2723 days ago
I recommend three things for wrangling compile times in C++: precompiled headers, using forward headers when possible (e.g. iosfwd and friends), and implementing an aggressive compiler firewall strategy when not.
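A rough sketch of the forward-header idea, assuming a hypothetical `Widget` type and a hypothetical `describe` helper (neither is from the comment above). The "header" part only needs `<iosfwd>` to declare the stream operator; the full stream headers are included only where the operator is defined.

```cpp
// ---- widget.h (hypothetical header): <iosfwd> only forward-declares std::ostream ----
#include <iosfwd>
#include <string>

struct Widget {
    int id;
};

// Declaring the operator needs only the forward declaration of std::ostream.
std::ostream& operator<<(std::ostream& os, const Widget& w);

// Hypothetical helper, declared without pulling in any stream header.
std::string describe(const Widget& w);

// ---- widget.cpp (hypothetical source): the heavy stream headers live here ----
#include <ostream>
#include <sstream>

std::ostream& operator<<(std::ostream& os, const Widget& w) {
    return os << "Widget(" << w.id << ")";
}

std::string describe(const Widget& w) {
    std::ostringstream out;
    out << w;
    return out.str();
}
```

Clients that include only the header part never parse `<ostream>` or `<sstream>` on its account.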

The compiler firewall strategy works fairly well in C++11 and even better in C++14. Create a public interface with minimal dependencies, and encapsulate the details for this interface in a pImpl (pointer to implementation). The latter can be defined in implementation source files, and it can use unique_ptr for simple resource management. C++14 added the missing make_unique, which eases the pImpl pattern.
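A minimal sketch of that firewall, assuming a hypothetical `Widget` class (the names and members are made up for illustration). The header part sees only an incomplete `Impl`; the special members are declared there and defined in the source part, where `Impl` is complete. Note that the public member functions are ordinary direct calls; only the data sits behind one pointer.

```cpp
// ---- widget.h (hypothetical public header): minimal dependencies ----
#include <memory>
#include <string>

class Widget {
public:
    explicit Widget(std::string name);
    ~Widget();                               // defined where Impl is complete
    Widget(Widget&&) noexcept;
    Widget& operator=(Widget&&) noexcept;

    void add(int value);
    int total() const;
    const std::string& name() const;

private:
    struct Impl;                             // incomplete type in the header
    std::unique_ptr<Impl> impl_;
};

// ---- widget.cpp (hypothetical source): heavier includes stay here ----
#include <numeric>
#include <utility>
#include <vector>

struct Widget::Impl {
    explicit Impl(std::string n) : name(std::move(n)) {}
    std::string name;
    std::vector<int> values;
};

Widget::Widget(std::string name)
    : impl_(std::make_unique<Impl>(std::move(name))) {}   // C++14 make_unique

// Defaulted out of line, so clients never need the definition of Impl.
Widget::~Widget() = default;
Widget::Widget(Widget&&) noexcept = default;
Widget& Widget::operator=(Widget&&) noexcept = default;

void Widget::add(int value) { impl_->values.push_back(value); }

int Widget::total() const {
    return std::accumulate(impl_->values.begin(), impl_->values.end(), 0);
}

const std::string& Widget::name() const { return impl_->name; }
```

Changing `Impl` (say, swapping the vector for another container) then only recompiles the source file, not every client of the header.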

That being said, compile times in C++ are typically going to be terrible if you are used to compiling in C, Go, and other languages known for fast compilation times. A build system with accurate dependency tracking and on-demand compilation (e.g. a directory watcher or, if you prefer IDEs, continuous compilation in the background) will eliminate a lot of this pain.

4 comments

pImpl pattern is great for those who don’t care about performance but it’s inappropriate for most header libraries. You wouldn’t want a library that hides the implementation of std::vector for example. With a visible implementation the compiler can compile e.g. operator[] down to one x86 instruction. With a pImpl pattern it will in all likelihood be an indirect function call that will be hundreds of times slower. It can make sense for libraries where every function is really expensive anyway, but it’s ruinous for STL and the like.
For cases when performance matters one can replace the members with a stub of the same size and alignment and cast the stub to the real definition in the implementation.
Ugh, that violates strict aliasing, does it not?
I suppose it can, but in practice it works with any sane compiler that reasonably deals with reinterpret_cast and aliasing as long as aliasing requirements for the stub and the real thing are the same. The latter can be enforced with static asserts.
Using the pimpl pattern doesn't mean an indirect function call. The function to be called is always known. It's just an extra indirection in the data member. It's cheap. Think of it as Java style memory layout: everything that's not primitive stored in an object is a reference and therefore behind one level of indirection. The performance of Java is acceptable in the vast majority of use cases. Using pimpl will be the same.
“It's just an extra indirection in the data member. It's cheap”

That extra indirection often means a cache miss. That isn’t cheap. Accessing each item traversed through a pointer can easily halve program speed.

Java tries hard to prevent the indirections (local objects may live on the stack, their memory layout need not follow what the source code says, objects may even exist only in CPU registers).

Hmm... if you were a horrible person you could declare a `char[n]` member instead of a pointer. Then you could placement-new the impl in the constructor, and static-assert that `sizeof(impl)<=n`... No more cache misses :-).

:-(

This doesn't take into account the alignment of the type though (you'd want to use std::aligned_storage<sizeof(T), alignof(T)>), but that requires knowing enough about T to be able to use sizeof() and alignof(), which means no incomplete types, bringing us back to where we started.
When you need this, use aligned storage: https://en.cppreference.com/w/cpp/types/aligned_storage
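A hedged sketch of the in-place variant discussed in this subthread, using an `alignas` byte buffer rather than std::aligned_storage. The `Counter` class and the 16-byte / 8-byte budget are assumptions for illustration: the header hard-codes the size and alignment, and the source file static-asserts that the real `Impl` fits, so the header never needs the complete type.

```cpp
// ---- counter.h (hypothetical public header) ----
#include <cstddef>

class Counter {
public:
    Counter();
    ~Counter();
    Counter(const Counter&) = delete;             // keep the sketch simple
    Counter& operator=(const Counter&) = delete;

    void bump();
    long value() const;

private:
    struct Impl;                                  // still incomplete here
    Impl* impl();
    const Impl* impl() const;

    static constexpr std::size_t kSize = 16;      // assumed size budget
    static constexpr std::size_t kAlign = 8;      // assumed alignment budget
    alignas(kAlign) unsigned char storage_[kSize];
};

// ---- counter.cpp (hypothetical source) ----
#include <new>

struct Counter::Impl {
    long count = 0;
};

Counter::Counter() {
    // Checked where Impl is complete; a mismatch becomes a compile error.
    static_assert(sizeof(Impl) <= kSize, "storage too small for Impl");
    static_assert(alignof(Impl) <= kAlign, "storage under-aligned for Impl");
    ::new (static_cast<void*>(storage_)) Impl();  // placement-new into the buffer
}

Counter::~Counter() { impl()->~Impl(); }

// Strictly conforming C++17 code would pass these through std::launder.
Counter::Impl* Counter::impl() { return reinterpret_cast<Impl*>(storage_); }
const Counter::Impl* Counter::impl() const {
    return reinterpret_cast<const Impl*>(storage_);
}

void Counter::bump() { ++impl()->count; }
long Counter::value() const { return impl()->count; }
```

The cost is that growing `Impl` past the budget forces a header change, which is the recompile (and ABI break) the firewall was meant to avoid.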
That’s not that gross. There are types in Abseil that do it.
That's basically how modules would work, at least if you ignore LTCG.
The Java JIT compiler inlines short method calls whenever possible. Though a C++ compiler should be able to do the same.
And they do, when given PGO data, or when doing LTO.
Given this is a topic of slow c++ builds, mentioning LTCG should come with the caveat that it will absolutely destroy your compile times.

It's also not infallible and you might find it difficult to track a regression if introduced by someone silently breaking a heuristic in the optimiser.

Sure, I was only mentioning that it is possible.

However with VC++ it doesn't seem to be that bad, when incremental compilation and linking are enabled.

The only advantage of c++ is max perf. If we could skip a beat we couldn't justify using c++.
It still wins in "portability + expressiveness + safer than C" areas.

There are still more platforms with a C++ compiler available than Ada, Java or C# ones, let alone Go, D, Rust, Swift.

So if the goal is to make the code available to all platforms, without having to deal with C's lack of safety, then C++ it is.

This depends on what you are building. Don't commit the sin of early optimization.

Does a client of the framework you are writing -- which is probably using STL internally -- need a single instruction operation for adding a value for a call that you make less than 0.001% of the time?

Optimization is about end results. Apply the Pareto Principle, and don't forget that your users also need to compile your code in a reasonable amount of time.

That only makes sense if you are planning to offer two implementations of your library. Which I of course urge you to not do. This article is about the STL headers. The reason std::sort beats the pants off all other languages’ sort routines is because the iterators of every collection, all specializations of swap, and the comparator can all be visible to the compiler. If they weren’t, it would be a lot slower.

Premature optimization is not really a thing but foreclosing future avenues of optimization definitely can be.

> Premature optimization is not really a thing

Okay. I'm going to stop this thread right there and take some opportunity to provide some mentoring. I hope you accept this, as it will help in your career.

Read this paper. It is a classic.

https://pic.plover.com/knuth-GOTO.pdf

You should read the paper a bit more closely.

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3 %. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified"

We know that cache misses are not a small inefficiency. This has been measured & observed on many real systems as a real, systemic problem. It's why data-oriented design is currently the king of the game engine world, because cache misses kill performance. It is not premature to warn against it as a general practice as a result, as that's systemic de-optimization that will likely impact the critical 3%.

I think you may want to read the quote from the paper a bit more carefully. "...he will be wise to look carefully at the critical code; but only after that code has been identified"

I was told that "premature optimization is not really a thing" as a response to a reply I received that pImpls should be avoided at all costs.

When we analyze the performance impact of software, we don't shotgun change things because of a generalized fear of cache misses. We examine critical code paths and make changes there based on profile feedback. That is the spirit of what Knuth is saying in this quote. Look carefully at critical code, BUT ONLY AFTER that code has been identified.

A cache miss is critical when it is in a critical path. So, we write interfaces with this in mind. Compilation time matters, as does runtime performance. Either way, we identify performance bottlenecks as they come up and we optimize them. Avoiding clearer coding style, such as encapsulation, because it MIGHT create faster code, is counter-productive.

We can apply the Pareto Principle to understand that 80% of the performance overhead can be found in 20% of the code. The remaining 80% of the code can use pImpls or could be rewritten in Haskell for all it matters for performance. But, that code still needs to be compiled, and often by clients who only have headers and library files. Subjecting them to long compiles to gain an unimportant improvement in performance in code that only rarely gets called is a bad trade. Spend that time optimizing the parts of the code that matter, which, as Knuth says, should only be done after this code has been identified.

EDIT: the downvoting on this comment is amusing, given that "avoid pImpls" is exactly the sort of 97% cruft that Knuth was addressing.

You may not be aware, but your comment came off as somewhat condescending, given that you don't really have any idea where the parent poster is coming from or what their background is.
If someone says that premature optimization isn't a thing, I don't think it is condescending to point out that it is by posting original source material. :-)
Software still needs to be architected for performance from the start. Trying to micro-optimize a loop before you know you need it is what Knuth was saying to avoid.
Right. Much like avoiding pImpl because it might make a function call that occurs 0.001% of the time faster. That is the basis of the thread I was replying to.

Understanding what you are optimizing FOR and where the most attention should be spent is the crux of Knuth's argument. Trying to be clever up-front is often counter-productive.

There is nothing wrong with making some architectural decisions up front, but that is much different than avoiding pImpls at all costs because indirection is slower. Indirection doesn't always matter, and it should only be tackled when and where it does.

Knuth wasn't saying "just ignore performance altogether", he was saying "stop making things needlessly complicated for the last bit of juice".
For instance, pulling in the STL for an interface header instead of encapsulating these details. :-)

No one is claiming that we should ignore performance altogether. But, understanding through profiling where performance issues are and designing toward a faster implementation is more important than trying to inline definitions up front.

Overall user experience, mobile battery life, and many other metrics are really hard to fix by micro-optimizing a few functions. The key to a system that doesn't feel sluggish is being conscious of performance issues when making design decisions.

This pendulum swings back and forth, and we went from "every bit counts" madness of the early days, to the polar opposite of "just burn cycles, whatever".

Systems where every interaction feels sluggish are a pain to use, and often nearly impossible to refactor for better performance.

One hiccup is that with unique_ptr you now have a rule of 5 thing: you need to declare a destructor, which means you need to declare the copy/move constructor and assignment too. Not usually a big deal, but it is extra code.
This is why the rule of zero advocates are getting louder.
Rule of zero classes are awesome. It forces a separation of concerns too, generally a good thing :), as the handling of special things is done by a class that does that(e.g. unique_ptr, vector...) and your class describes only what is in it and how to interact with it. But no more detailed than that.
Totally agree. Recently used unique_ptr with a custom deleter to consume a C library that requires heap allocation with its own alloc/free functions. No destructor!
I have a class, boringly called value_ptr, that is like unique_ptr but, if the underlying value supports copying, provides a copy constructor and assignment too. Then I don’t have to write those in classes like this where I am using a pointer for other reasons. It also has const correctness.
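Roughly what that looks like, as a sketch. The C API here (`c_buffer`, `c_buffer_alloc`, `c_buffer_free`, and the `live_buffers` counter) is a made-up stand-in defined inline so the example is self-contained; a real library would supply its own functions.

```cpp
#include <cstdlib>
#include <memory>

// ---- Stand-in for a C library with its own alloc/free pair (hypothetical) ----
struct c_buffer {
    int length;
};

int live_buffers = 0;   // only here so the example can show the free happened

c_buffer* c_buffer_alloc(int length) {
    c_buffer* b = static_cast<c_buffer*>(std::malloc(sizeof(c_buffer)));
    if (b) {
        b->length = length;
        ++live_buffers;
    }
    return b;
}

void c_buffer_free(c_buffer* b) {
    if (b) {
        --live_buffers;
        std::free(b);
    }
}

// ---- C++ side: a stateless deleter that calls the library's free function ----
struct CBufferDeleter {
    void operator()(c_buffer* b) const noexcept { c_buffer_free(b); }
};

using CBufferPtr = std::unique_ptr<c_buffer, CBufferDeleter>;

// Rule of zero: no destructor, no copy/move members written by hand.
// The class is movable and non-copyable because its member is.
class Packet {
public:
    explicit Packet(int length) : buf_(c_buffer_alloc(length)) {}
    int length() const { return buf_ ? buf_->length : 0; }

private:
    CBufferPtr buf_;
};
```

Because the deleter is an empty struct, a typical implementation keeps `CBufferPtr` the size of a raw pointer.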
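A guess at what such a class could look like, not the commenter's actual code: a thin wrapper over unique_ptr that deep-copies the pointee and propagates const. The `Document` struct is a hypothetical rule-of-zero user.

```cpp
#include <memory>
#include <string>
#include <utility>

// Sketch of a deep-copying owning pointer (interface is an assumption).
template <typename T>
class value_ptr {
public:
    value_ptr() : ptr_(std::make_unique<T>()) {}
    explicit value_ptr(T value) : ptr_(std::make_unique<T>(std::move(value))) {}

    // Copying clones the pointee instead of sharing or failing to compile.
    value_ptr(const value_ptr& other) : ptr_(clone(other.ptr_)) {}
    value_ptr& operator=(const value_ptr& other) {
        if (this != &other) ptr_ = clone(other.ptr_);
        return *this;
    }

    value_ptr(value_ptr&&) noexcept = default;
    value_ptr& operator=(value_ptr&&) noexcept = default;

    // Const correctness: a const value_ptr only hands out const access.
    T& operator*() { return *ptr_; }
    const T& operator*() const { return *ptr_; }
    T* operator->() { return ptr_.get(); }
    const T* operator->() const { return ptr_.get(); }

private:
    static std::unique_ptr<T> clone(const std::unique_ptr<T>& p) {
        if (!p) return nullptr;              // source was moved from
        return std::make_unique<T>(*p);
    }

    std::unique_ptr<T> ptr_;
};

// Rule-of-zero user: copy, move and destruction are all compiler-generated.
struct Document {
    value_ptr<std::string> title;
    int revision = 0;
};
```

With this, copying a `Document` copies the string rather than requiring hand-written special members.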
This is true. Fortunately, these do not need to be inlined, which can still free client code of compile time overhead.

It's a tradeoff between compile time and complexity.

Another option is something like pimpl but keep the state in the public class. Now you get stack allocation but the private part is still private and firewalling all the other headers and details not needed for the public interface. Just pass this to the private methods.

Edit: Something like https://github.com/beached/stack_pimpl

I don’t get the example, can you explain? What does this buy that you don’t have by simply putting the content of private.cpp into stack_pimpl.cpp? The data members are already in stack_pimpl.h (so nothing private or hidden about that) and the methods are already declared in both and implemented in both cpp files, so what are you buying over just putting the declarations in one header and implementation in one source and not delegating from one to the other?

Was it just an oversimplistic example and the benefit is actually if priv_t has a bunch of internal methods that you want to keep out of the stack_pimpl.h interface?

So PIMPL is a compile firewall to keep the compile times and changes in one section from cascading and impacting your whole project with a recompile. It is not going to keep things secret, as I can look at the binary and pick it apart.

So with that, it keeps the data (state) in the public facing class. This allows one to keep everything stack allocated instead of defaulting to the heap. So for something that is created en masse (a vector of them) or created and destroyed often, this is a runtime win.

What it does is have a proxy that mirrors the public interface and is passed the this pointer. That proxy is a friend class. Because only the proxy's header (in this case private.h, which I should probably rename firewalled.h) has static members that mirror the public members on the public class, that limits the interaction between your class's users and its implementation, as is also the case with unique_ptr (or whatever pointer/heap way) based PIMPL designs. So changes in private.cpp, which does all the work, are only reflected in that one file. This file also brings in the heavy templates or algorithm code that may have large compile times.
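A single-file sketch of that layout as I read the description; it is not taken from the linked repository. The `Accumulator` class, the `AccumulatorDetail` proxy and the file split in the comments are all hypothetical.

```cpp
// ---- stack_pimpl.h (hypothetical public header): state lives in the class ----
struct AccumulatorDetail;              // proxy, only forward-declared here

class Accumulator {
public:
    void add(int value);
    int sum() const;
    int largest() const;

private:
    friend struct AccumulatorDetail;   // the proxy may touch private state
    int sum_ = 0;
    int largest_ = 0;
};

// ---- private.h (hypothetical): included only by the two source files ----
struct AccumulatorDetail {
    // Static members mirror the public interface and receive *this.
    static void add(Accumulator& self, int value);
    static int sum(const Accumulator& self);
    static int largest(const Accumulator& self);
};

// ---- private.cpp (hypothetical): the heavy includes and real work ----
#include <algorithm>

void AccumulatorDetail::add(Accumulator& self, int value) {
    self.sum_ += value;
    self.largest_ = std::max(self.largest_, value);
}
int AccumulatorDetail::sum(const Accumulator& self) { return self.sum_; }
int AccumulatorDetail::largest(const Accumulator& self) { return self.largest_; }

// ---- stack_pimpl.cpp (hypothetical): thin forwarding layer ----
void Accumulator::add(int value) { AccumulatorDetail::add(*this, value); }
int Accumulator::sum() const { return AccumulatorDetail::sum(*this); }
int Accumulator::largest() const { return AccumulatorDetail::largest(*this); }
```

An `Accumulator` is just two ints with no heap allocation, and clients of the public header never see `<algorithm>` or whatever else the private source pulls in.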

So, you mean that changing private.h only requires private.cpp and stack_pimpl.cpp to be recompiled, whereas changing stack_pimpl.h would require anything that uses it to be recompiled too? Ok, that makes sense.

However...

> So with that, it keeps the data(state) in the public facing class. This allows one to keep everything stack allocated instead of defaulting to the heap.

Ok, having it on the stack is useful, but in my personal experience, the state is exactly the thing that I find changes the most (typically together with the code), so by keeping the state in the public header, changes will still require recompilation of anything that includes the header, so I’m not sure this really wins much (at least, based on my own C++ projects).

Never mind, I was thinking wrong and neglected the ABI stability
Additionally, actually make use of binary libraries across modules, and extern templates for common type parameters.
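For the extern template part, a small sketch with a hypothetical `Matrix` template. The explicit instantiation declaration in the header tells client translation units not to instantiate `Matrix<double>` themselves; exactly one source file provides the explicit instantiation definition. Both are shown in one file here only to keep the example self-contained.

```cpp
// ---- matrix.h (hypothetical header) ----
#include <cstddef>
#include <vector>

template <typename T>
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : cols_(cols), data_(rows * cols) {}

    T& at(std::size_t r, std::size_t c) { return data_[r * cols_ + c]; }
    T sum() const;

private:
    std::size_t cols_;
    std::vector<T> data_;
};

template <typename T>
T Matrix<T>::sum() const {
    T total{};
    for (const T& v : data_) total += v;
    return total;
}

// Explicit instantiation declaration: clients using Matrix<double> rely on
// the instantiation provided elsewhere instead of generating their own.
extern template class Matrix<double>;

// ---- matrix.cpp (hypothetical source) ----
// Explicit instantiation definition: the one place Matrix<double> is built.
template class Matrix<double>;
```

Less common element types still work through ordinary implicit instantiation; only the pre-built ones skip the repeated work.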
Any advice for reducing link times?
Try the gold or lld linkers.