by icyfox
197 days ago
We talked about this model in some depth on the last Pretrained episode:
https://youtu.be/5weFerGhO84?si=Eh_92_9PPKyiTU_h&t=1743

Some interesting takeaways imo:

- Uses existing model backbones for text encoding & semantic tokens (why reinvent the wheel if you don't need to?)

- Trains on a whole lot of synthetic captions of different lengths, ostensibly generated using some existing vision LLM

- Solid text generation support is facilitated by training on all OCR'd text from the ground truth image. This seems to match how Nano Banana Pro got so good as well; I've seen its thinking tokens sketch out exactly what text to say in the image before it renders.