Adding on to this, and related somewhat, binary addition even with look ahead is a multi clock cycle process for non trivial word sizes. Arithmetic is not trivial by any means. We all just take it for granted.
By "non-trivial word sizes," I presume you mean larger than a single word of most computers? My understanding is that modern BDD/ZDD techniques have gone a long way to showing how to do the entire operation in a single pass for most sized numbers. Such that propagating the carry information isn't really how it is done, anymore. (This is a genuine question, btw. I have been a long way from my VLSI classes back in college.)
I'm talking about constructing boolean circuits. The usual parallel prefix algorithm yields a circuit of depth O(log n) with O(n log n) gates. Using some trickery, however, this can be reduced to O(n) gates, with a constant factor increase in depth.