Hacker News (HN Mirror)

A Visual Guide to Vision Transformers
by md2rp 800 days ago

This is a visual guide to Vision Transformers (ViTs), a class of deep learning models that have achieved state-of-the-art performance on image classification tasks. Vision Transformers apply the transformer architecture, originally designed for natural language processing (NLP), to image data. This guide walks you through the key components of Vision Transformers in a scroll-story format, using visualizations and simple explanations to help you understand how these models work and what the flow of data through the model looks like.

4 comments

bArray 800 days ago

Nice! A small piece of feedback: I would have the dimensions mentioned in the text also annotated on the diagram. It wasn't exactly clear how the input data was flattened, for example.

byteknight 800 days ago

I would also add, as a 100% math idiot, that linear transformations, and how the model performs them, are not explained.

Entirely plausible this is intended for someone more "mathematical" than myself, but I appreciate the work regardless.

md2rp 800 days ago

Thanks for the feedback! I left it out intentionally, but it's probably worth thinking about doing a more fundamental guide!

md2rp 800 days ago

Thanks for the feedback! Will add it in the revision!

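
For anyone with the same two questions raised in the thread (how the input gets flattened, and what a linear transformation actually does), here is a minimal sketch in plain Python. It is not taken from the guide: the function names `image_to_patches` and `linear_projection`, the 4 x 4 toy image, the patch size of 2, and the hand-set weights are all assumptions chosen for illustration. Real ViT implementations typically use a tensor library and learned weights, but the arithmetic is roughly the same idea.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches.

    Patches are read left to right, top to bottom; each one is flattened
    into a single list of patch_size * patch_size * C numbers.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ])
    return patches


def linear_projection(patches, weights, bias):
    """Apply y = x @ W + b to every flattened patch.

    weights has shape (patch_dim, embed_dim); bias has length embed_dim.
    """
    columns = list(zip(*weights))  # one tuple of patch_dim weights per output
    return [
        [sum(x * w for x, w in zip(patch, column)) + b
         for column, b in zip(columns, bias)]
        for patch in patches
    ]


# Toy input: a 4 x 4 RGB image whose pixel at (row, col) is [row, col, 0].
image = [[[row, col, 0] for col in range(4)] for row in range(4)]
patches = image_to_patches(image, patch_size=2)
print(len(patches), len(patches[0]))  # 4 patches, 12 values each

# Hypothetical weights: in a trained model these are learned, not hand-set.
weights = [[1, 0] for _ in range(12)]  # shape (12, 2)
bias = [0, 5]
embeddings = linear_projection(patches, weights, bias)
print(embeddings[0])  # [4, 5]
```

So the dimensions go from (4, 4, 3) for the image, to (4, 12) after flattening into patches, to (4, 2) after the linear transformation. Each output number is just a weighted sum of the 12 input numbers plus a bias; the first output here sums the patch values and the second ignores them and returns the bias.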