My initial testing was with charts. I've been waiting for local vision models to get good enough to feed technical documents to, and so far the results look very good. Example:
I've tried some PPT images instead of CLEVR ones and it does much better. It can count circles and triangles and tells them apart quite well. It can recognise the colours of the objects as well.
I think the faux 3D of CLEVR images is too much for the model. That's interesting, because much smaller pre-transformer specialist models were very good at CLEVR.