by magicalhippo
47 days ago
Modern smaller LLMs like Qwen3.6 27B are quite good at visual tasks like describing images. I wouldn't trust them on receipts unless you're fine with a bit less than 100% accuracy, say 90-ish%. For descriptions of images and such I've found they do quite well indeed. A key change was the introduction of more, or even dynamic, visual tokens, which really helped the models "see" more details. Generating cat videos is the domain of diffusion models. If you have at least a 16GB GPU and a fair bit of patience you can get quite good results; check out the ComfyUI subreddit, for example.
Here's the output:
[1]: https://i.pinimg.com/originals/41/08/dc/4108dcf51f15af464bb6...