Vision language models are an incredible achievement in the generality and usability. But they pay a hefty price in fidelity and speed