by GaggiX
747 days ago
> fine-tuned using outputs from Llama 3.

Llama 3 outputs text and can only see text; this is a vision model.

> that would make it Llama-2-based.

It's based on Llama 3; Llama 2 has nothing to do with it. They took Llama 3 Instruct and CLIP-ViT-Large-patch14-336, trained the projection layer first, and then fine-tuned the Llama 3 checkpoint and trained a LoRA for the ViT.
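To make the two-stage recipe concrete, here is a rough sketch of which parts get updated when. It is only a bookkeeping table in plain Python, not the authors' training code: the stage names, dictionary keys, and the `updated_components` helper are all made up for illustration. Treating the ViT and Llama 3 as frozen in stage 1 is my reading of "train the projection layer first", and whether the projector keeps training in stage 2 isn't stated, so it is marked as unspecified.

```python
# Hypothetical names throughout; this mirrors the description above,
# not the authors' actual code.
STAGES = [
    {
        # Stage 1: only the projection layer is trained.
        "name": "projector_pretraining",
        "vision_encoder": "frozen",   # CLIP-ViT-Large-patch14-336 (assumed frozen)
        "projector": "full",
        "language_model": "frozen",   # Llama 3 Instruct (assumed frozen)
    },
    {
        # Stage 2: the Llama 3 checkpoint is fine-tuned, the ViT gets a LoRA.
        "name": "finetuning",
        "vision_encoder": "lora",
        "projector": "unspecified",   # not stated in the description
        "language_model": "full",
    },
]


def updated_components(stage):
    """Return the components whose weights or adapters change in a stage."""
    return sorted(
        key
        for key, mode in stage.items()
        if key != "name" and mode in ("full", "lora")
    )


for stage in STAGES:
    print(stage["name"], "->", updated_components(stage))
```

The point is just that Llama 3 itself is the language backbone being fine-tuned in the second stage, with the vision side bolted on through the projector.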