There are a bunch of articles about it: National Geographic [1], Harvard Business Review [2], BBC [3]. One of the theories is that we have to work harder to pick up on non-verbal cues, which consumes energy.
A data point from the BBC article: "One 2014 study by German academics showed that delays on phone or conferencing systems shaped our views of people negatively: even delays of 1.2 seconds made people perceive the responder as less friendly or focused."