I think a lot of it is the massive amount of compute we've got in the last decade. While inference may have been possible on the hardware the training would have taken lifetimes.
Once you have three layers (i.e. one "hidden" layer) then you can map to arbitrary functions, so a three layer network has the same "power" as an arbitrarily large network.
I'm sure that's what the text book meant, rather than any point about the expense of computing power.
People believe that more parameters would lead to overfit instead generalization. The various regularization methods we use today to avoid overfit hadn't been discovered yet. Your statement is mostly likely about this.
I think the problems with big network were diminishing gradients, which is why we now use the ReLU activation function, and training stability, which were solved with residual connections.
Overfitting is the problem of having too little training data for your network size.
Compute was just too expensive to have neural networks big enough for this not to be true.