| HN Mirror

I've tried with some ppt images rather than Clevr ones and it does much better. It can count circles and triangles and differentiates between them quite well. It can recognise the colours of the objects as well.

I think that the faux 3d of clevr images is too much for the model, it's interesting because much smaller pre-transformer specialist models were very good at clevr.