by lucubratory
1105 days ago
The image compression/decompression from their special token system wouldn't be free. It would be just as expensive as any other per-pixel transformation on an image file, and it would be entirely custom software that they would have to run on their own servers. Image upload and download is a very significant increase in network traffic compared to just text, and could make the whole venture cost a lot more. Finally, an image, even when downsized, is going to be composed of a lot of tokens, so running inference on it carries a lot of computational cost. If they haven't implemented statefulness (which many haven't right now despite the simplicity of the technique; the field is still very new), that computational cost must be repeated with every fresh API call. Basically, multi-modal functionality should be an order-of-magnitude (OOM) increase in compute, traffic, and storage requirements for anyone providing it, compared to a text-only model (or an only-text-allowed model).
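
To put rough numbers on the token and statefulness points, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration, not how any particular provider does it: the image is assumed to be cut into non-overlapping 16x16 patches with one token per patch, each call is assumed to add a fixed amount of new text, and growth of the conversation history is ignored. The function names `image_tokens` and `tokens_processed` are made up for this example.

```python
def image_tokens(width: int, height: int, patch: int = 16) -> int:
    """Tokens for an image split into non-overlapping patch x patch tiles.

    Assumes one token per tile (hypothetical scheme); partial tiles at the
    edges are rounded up via ceiling division.
    """
    cols = -(-width // patch)
    rows = -(-height // patch)
    return cols * rows


def tokens_processed(image_toks: int, text_toks_per_call: int,
                     calls: int, stateful: bool) -> int:
    """Total tokens the model processes over `calls` requests about one image.

    Simplification: each call adds the same amount of new text, and earlier
    text turns are not counted again.
    """
    if stateful:
        # Image is processed once; later calls only pay for their new text.
        return image_toks + text_toks_per_call * calls
    # Stateless: the image is re-processed on every fresh call.
    return (image_toks + text_toks_per_call) * calls


toks = image_tokens(512, 512)                       # 32 * 32 patches
stateless = tokens_processed(toks, 50, 10, stateful=False)
cached = tokens_processed(toks, 50, 10, stateful=True)
print(toks, stateless, cached)
```

Under these assumptions a 512x512 image is 1024 tokens, and ten calls with 50 text tokens each process 10740 tokens without state versus 1524 with it, so the repeated image cost dominates.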