That's a tough one to answer right now, but to be perfectly honest, we're off by 2-3 orders of magnitude in terms of chars/W.
That said, VLMs are extremely powerful visual learners with LLM-like reasoning capabilities, which makes them more versatile than OCR for practically all imaging domains.
In a matter of a few years, I think we'll essentially see models that are more cost-performant via distillation, quantization and the multitude of tricks you can do to reduce the inference overhead.
A lot worse. But higher-quality OCR will reduce the amount of human post-processing needed and, in turn, allow us to reduce the number of humans. Since humans are relatively expensive in energy use, this can be expected to save a lot of energy.
> Since humans are relatively expensive in energy use
Are they? I'm seeing figures around 80 watts at rest, and 150 when exercising. The brain itself only uses about 20 watts [1]. That's 1/35 of a single H100's power consumption (700 watts - which doesn't even take into account the energy required to cool the data center, the humans who build and maintain it, ...).
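For anyone who wants to check the ratios, here is a quick sketch of the arithmetic in Python. It only uses the figures quoted above (80 W at rest, 150 W exercising, 20 W for the brain, 700 W for an H100); the constant names are made up for illustration, and the H100 number is board power only, with no cooling or facility overhead included.

```python
# Power figures as quoted in the comment above; nothing here is measured.
HUMAN_REST_W = 80        # whole body at rest
HUMAN_EXERCISE_W = 150   # whole body while exercising
BRAIN_W = 20             # brain alone
H100_W = 700             # assumed board power, excludes cooling and facility overhead


def times_more_power(device_watts: float, baseline_watts: float) -> float:
    """Return how many times more power the device draws than the baseline."""
    return device_watts / baseline_watts


h100_vs_brain = times_more_power(H100_W, BRAIN_W)
h100_vs_rest = times_more_power(H100_W, HUMAN_REST_W)
h100_vs_exercise = times_more_power(H100_W, HUMAN_EXERCISE_W)

print(f"H100 vs brain:             {h100_vs_brain:.2f}x")
print(f"H100 vs human at rest:     {h100_vs_rest:.2f}x")
print(f"H100 vs human exercising:  {h100_vs_exercise:.2f}x")
```

So the 1/35 figure holds for the brain alone, but against a whole human the gap shrinks to somewhere between roughly 5x and 9x, before counting any overhead on either side.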
The PUE of humans for that 80 watts is terrible, though. Ridiculous multiples of additional energy are needed to convert solar power to a form of energy that they can use, and even the manufacturing lifecycle and transport of humans to the datacenter is energy inefficient.
People really only started talking about the cost of running things when LLMs came out. Most everything before that was too cheap to be a serious consideration.