by reaperman
1111 days ago
Extremely curious that PaLM-E, PaLI, and GPT-4 were trained to be multimodal (accepting non-text inputs such as images), yet the released APIs are text-only. In GCP's case here, they've released PaLM 2, which is not multimodal like PaLM-E and PaLI, so it can't be used for visual reasoning[0]. I'm just wondering why multiple parties seem reluctant to let the public use this capability.

[0]: https://visualqa.org
Basically, multimodal functionality should mean an order-of-magnitude increase in compute, traffic, and storage requirements for anyone providing it, compared to a text-only model (or a multimodal model restricted to text input).