This is amazing, I feel like these vision models are going to make everything so much more accessible. Between the Be My Eyes app integration and now this, I'm really excited for how this transforms the web.
I agree, and I think we're a year or two away from a full end-to-end trained screen reader. The ground truth from existing systems would provide great training material.
As a technical blind person, my only concern is the inherent loss of privacy while sharing stuff with the big models.
As a technical blind person, my only concern is the inherent loss of privacy while sharing stuff with the big models.