Actually changing the DOM has a lot of implications: recalculating CSS, layout, possible side effects that generate events, repainting, and so on. That's a lot of work to go through for an intermediate state that may not even last long enough to be visible to the user. Buffering and batching those changes can save much of that effort. I suppose in theory the browser could optimize this as well, but in practice it seems most are still optimized for rendering static pages very quickly rather than for handling rapid changes.
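To make the buffering idea concrete, here is a minimal sketch of a write buffer in which a later write to the same property overwrites an earlier one, so intermediate states never reach the real DOM. `WriteBuffer`, `Write`, and the `apply` callback are hypothetical names made up for this illustration, not a browser or React API; `apply` stands in for whatever would actually touch the DOM.

```typescript
// Hypothetical write buffer; names are illustrative, not a browser or React API.
type Write = { target: string; prop: string; value: string };

class WriteBuffer {
  // Keyed by "target/prop" so a later write replaces an earlier one.
  private pending = new Map<string, Write>();

  set(target: string, prop: string, value: string): void {
    this.pending.set(`${target}/${prop}`, { target, prop, value });
  }

  // Hand the surviving writes to `apply` (which would touch the real DOM), then reset.
  flush(apply: (w: Write) => void): number {
    const writes = Array.from(this.pending.values());
    this.pending.clear();
    writes.forEach((w) => apply(w));
    return writes.length;
  }
}

const buffer = new WriteBuffer();
buffer.set("#counter", "textContent", "1"); // intermediate state, never applied
buffer.set("#counter", "textContent", "2"); // intermediate state, never applied
buffer.set("#counter", "textContent", "3");
buffer.set("#status", "className", "done");

const applied: Write[] = [];
const count = buffer.flush((w) => { applied.push(w); });
```

Four buffered writes collapse to two applied ones here, because only the last value per target and property survives until the flush.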
Because when your code changes the DOM, the browser does not know whether you are going to stop there for now or do something else, so it has to re-render. In React you explicitly request that the virtual DOM changes be applied to the real DOM once you are done.
React simply stores a /representation/ of what the DOM looks like (just the tree of nodes). When you actually hit the DOM, the browser recalculates styles and layout, and may preemptively calculate other things for speedier rendering. Additionally, a real DOM node is a lot "heavier", so walking through a node's children, children's children, etc. is much more expensive than walking plain nested objects. Therefore, it's most efficient to wait until the browser is actually about to render (requestAnimationFrame) and update only what is different in one go.
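The "wait until the browser is about to render" part can be sketched as a small helper that collapses many render requests into one callback per frame. This is only an illustration under stated assumptions: `createFrameBatcher`, `Schedule`, and `fakeSchedule` are hypothetical names, and the scheduler is passed in so the sketch runs outside a browser. It is not a description of how React schedules its own updates.

```typescript
// Hypothetical helper; `schedule` stands in for requestAnimationFrame.
type Schedule = (cb: () => void) => void;

function createFrameBatcher(render: () => void, schedule: Schedule): () => void {
  let queued = false;
  return function requestRender(): void {
    if (queued) return; // a flush is already scheduled for the upcoming frame
    queued = true;
    schedule(() => {
      queued = false;
      render();
    });
  };
}

// Fake scheduler for the example: collects callbacks so we can run them on demand.
const frameCallbacks: Array<() => void> = [];
const fakeSchedule: Schedule = (cb) => { frameCallbacks.push(cb); };

let renderCount = 0;
const requestRender = createFrameBatcher(() => { renderCount++; }, fakeSchedule);

// Three state changes before the next frame...
requestRender();
requestRender();
requestRender();

// ...result in a single render when the simulated frame arrives.
frameCallbacks.splice(0).forEach((cb) => cb());
```

In a browser, the scheduler argument would presumably be `(cb) => requestAnimationFrame(cb)`, and `render` would be the function that diffs and patches the real DOM.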
When I read the article, I understood it to mean that individual operations on the actual DOM are expensive compared to operations on the virtual DOM. So if you have N operations, you can apply them to the virtual DOM and then do a diff, and you may end up with only M <= N operations that must be applied to the actual DOM to get it to an equivalent state.
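A rough sketch of that N-to-M reduction, assuming a deliberately simplified node shape and a purely positional diff: `VNode`, `Patch`, and `diff` are hypothetical names for this example. Real React reconciliation uses keys and other heuristics, so treat this as an illustration of the idea rather than React's algorithm.

```typescript
// Hypothetical virtual node shape; not React's actual element structure.
type VNode = { tag: string; text?: string; children?: VNode[] };

type Patch =
  | { kind: "replace"; path: number[]; node: VNode }
  | { kind: "setText"; path: number[]; text: string }
  | { kind: "append"; path: number[]; node: VNode }
  | { kind: "remove"; path: number[] };

// Compare two trees position by position and list the real-DOM operations needed.
function diff(prev: VNode, next: VNode, path: number[] = []): Patch[] {
  if (prev.tag !== next.tag) return [{ kind: "replace", path, node: next }];
  const patches: Patch[] = [];
  if ((prev.text ?? "") !== (next.text ?? "")) {
    patches.push({ kind: "setText", path, text: next.text ?? "" });
  }
  const a = prev.children ?? [];
  const b = next.children ?? [];
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    patches.push(...diff(a[i], b[i], [...path, i]));
  }
  // Extra children in `next` are appended to the node at `path`.
  for (let i = shared; i < b.length; i++) {
    patches.push({ kind: "append", path, node: b[i] });
  }
  // Extra children in `prev` are removed from the end so earlier indices stay valid.
  for (let i = a.length - 1; i >= shared; i--) {
    patches.push({ kind: "remove", path: [...path, i] });
  }
  return patches;
}

const before: VNode = {
  tag: "ul",
  children: [{ tag: "li", text: "a" }, { tag: "li", text: "b" }],
};

// Final virtual state after any number of intermediate edits to the second item.
const after: VNode = {
  tag: "ul",
  children: [{ tag: "li", text: "a" }, { tag: "li", text: "B!" }, { tag: "li", text: "c" }],
};

const patches = diff(before, after);
```

However many times the second item's text was rewritten on the virtual tree before reaching `after`, the diff yields just two operations for the real DOM: one text update and one append.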