by lxe 1180 days ago
Keep in mind that image transformer models like Stable Diffusion are generally smaller than language models, so they are easier to fit in wasm space.
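To put rough numbers on the size point, here is a minimal back-of-the-envelope sketch. It assumes "wasm space" means the 4 GiB address space of 32-bit WebAssembly linear memory, that weights are stored as fp16 (2 bytes each), and that Stable Diffusion is on the order of 1 billion parameters (an approximation, not a figure from this thread). It counts weights only and ignores activations and runtime overhead.

```python
# Rough weight-memory estimate versus the 32-bit WebAssembly address space.
# Assumptions (not from the thread): fp16 weights, ~1e9 params for Stable Diffusion.

WASM32_LIMIT_BYTES = 4 * 1024**3  # 4 GiB: the most a 32-bit address can reach


def weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Bytes needed to hold the weights alone (no activations, no overhead)."""
    return num_params * bytes_per_param


def fits_in_wasm32(num_params: int, bytes_per_param: int = 2) -> bool:
    """True if the weights alone fit inside a 4 GiB address space."""
    return weight_bytes(num_params, bytes_per_param) <= WASM32_LIMIT_BYTES


models = {
    "stable-diffusion (approx., assumed ~1e9 params)": 1_000_000_000,
    "llama-7b (7e9 params)": 7_000_000_000,
}

for name, n in models.items():
    gib = weight_bytes(n) / 1024**3
    print(f"{name}: {gib:.1f} GiB in fp16, fits in wasm32: {fits_in_wasm32(n)}")
```

Under these assumptions the image model's weights fit with room to spare, while a 7B language model would need aggressive quantization or weights held outside linear memory.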

Also, you can fine-tune llama-7b on a 3090 for about $3 using LoRA.
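Why this is cheap: LoRA freezes the base weights and trains only a pair of low-rank matrices per adapted layer. A minimal sketch of the parameter count follows, assuming rank 8 and adapters on only the query and value projections (a common configuration, but my assumption here, not something stated above). The hidden size of 4096 and the 32 layers are llama-7b's published dimensions.

```python
# Count trainable LoRA parameters for a llama-7b-sized model.
# Assumptions (not from the thread): rank r = 8, adapters on q and v projections only.

HIDDEN_SIZE = 4096        # llama-7b hidden dimension
NUM_LAYERS = 32           # llama-7b transformer layers
BASE_PARAMS = 7_000_000_000  # nominal size implied by the model name


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen d_out x d_in weight."""
    return rank * d_in + d_out * rank


def total_lora_params(rank: int = 8, adapted_per_layer: int = 2) -> int:
    """Trainable parameters when adapting square hidden-size projections in every layer."""
    per_matrix = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, rank)
    return per_matrix * adapted_per_layer * NUM_LAYERS


trainable = total_lora_params()
print(f"trainable LoRA params: {trainable:,}")
print(f"fraction of base model: {trainable / BASE_PARAMS:.4%}")
```

With so few trainable parameters, gradients and optimizer state stay small, which is a large part of why a single consumer GPU can be enough.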


Only for images. People want to generate videos next, and those models will likely be GPT-sized.
There is a video model making the rounds on /r/stablediffusion and it is just a tiny bit larger than Stable Diffusion.
You're not kidding! It's far from perfect, but pretty funny still...

https://www.reddit.com/r/StableDiffusion/comments/126xsxu/ni...

Too bad SD learned the Shutterstock watermark so well, lol

It's cool, though the details are not very stable along the temporal axis.
Of course the quality is horrible relative to a proper video; it just illustrates that txt2vid might not need 100B+ parameters.
Generative image models don't use transformers, they're diffusion models. LLMs are transformers.
Diffusion models can use a transformer architecture, for example DiT. Stable Diffusion uses a U-Net architecture with transformer blocks.
Ah yes, that's right. Well, they technically do use a transformer for the CLIP text encoder, as I understand it.