Hacker News
by avisoori1x 792 days ago
I implemented a vision language model in pure PyTorch, consisting of an image encoder, a multimodal projection module, and a decoder language model. Think of it as a simplified version of the vision capabilities you see in language models like GPT-4 or Claude 3. The name ‘seemore’ is my homage to Andrej Karpathy’s ‘makemore’, because the decoder here is a character-level autoregressive language model, much like his basic transformer implementation. Everything lives in a single file, seemore.py, in the repo.
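To make the three-part layout concrete, here is a rough sketch of how an image encoder, a projection module, and a character-level decoder can be wired together. This is not the code from seemore.py: the class name `TinyVLM`, every dimension, and the choice to lean on `nn.TransformerEncoderLayer` instead of hand-written attention blocks are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class TinyVLM(nn.Module):
    """Hypothetical toy model: patch encoder -> projector -> causal char decoder."""

    def __init__(self, vocab_size=65, img_size=32, patch=8,
                 vis_dim=64, txt_dim=48, n_head=4, max_text_len=32):
        super().__init__()
        self.n_patches = (img_size // patch) ** 2
        # Image encoder: cut the image into patches, embed them, apply one transformer block.
        self.patch_embed = nn.Conv2d(3, vis_dim, kernel_size=patch, stride=patch)
        self.vis_pos = nn.Parameter(torch.zeros(1, self.n_patches, vis_dim))
        self.vis_block = nn.TransformerEncoderLayer(
            vis_dim, n_head, 4 * vis_dim, dropout=0.0, batch_first=True)
        # Multimodal projector: map vision features into the decoder's embedding space.
        self.project = nn.Sequential(
            nn.Linear(vis_dim, txt_dim), nn.GELU(), nn.Linear(txt_dim, txt_dim))
        # Decoder: character embeddings plus one causally masked transformer block.
        self.tok_embed = nn.Embedding(vocab_size, txt_dim)
        self.pos_embed = nn.Embedding(self.n_patches + max_text_len, txt_dim)
        self.dec_block = nn.TransformerEncoderLayer(
            txt_dim, n_head, 4 * txt_dim, dropout=0.0, batch_first=True)
        self.lm_head = nn.Linear(txt_dim, vocab_size)

    def forward(self, images, char_ids):
        # (B, 3, H, W) -> (B, n_patches, vis_dim)
        v = self.patch_embed(images).flatten(2).transpose(1, 2)
        v = self.vis_block(v + self.vis_pos)
        v = self.project(v)                      # (B, n_patches, txt_dim)
        t = self.tok_embed(char_ids)             # (B, T, txt_dim)
        # Image tokens are placed in front of the text tokens as a prefix.
        x = torch.cat([v, t], dim=1)
        seq_len = x.size(1)
        x = x + self.pos_embed(torch.arange(seq_len, device=x.device))
        # Causal mask over the whole sequence: each position sees only itself and earlier ones.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device), diagonal=1)
        x = self.dec_block(x, src_mask=mask)
        # Return next-character logits for the text positions only.
        return self.lm_head(x[:, self.n_patches:])


model = TinyVLM()
images = torch.randn(2, 3, 32, 32)           # two random 32x32 RGB "images"
chars = torch.randint(0, 65, (2, 10))        # two sequences of 10 character ids
logits = model(images, chars)                # shape: (2, 10, 65)
```

The part worth noticing is the projector: it is the only piece that has to know about both widths, so the encoder and decoder can be swapped independently as long as it maps one to the other.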