Hacker News new | ask | show | jobs
by bigmadshoe 7 days ago
I agree, but since we're talking about imagine understanding with text output, clearly a CNN is unsuitable. My previous comment was overly reductive and CNNs can still be SoTA depending on your performance metrics. I spent the earlier part of my career training CNNs, and they are very pleasant to work with.
1 comments

You can run a CNN and use the downsampled feature map the same way as patch tokens.