by reaperman 1111 days ago
Extremely curious that PaLM-E, PaLI, and GPT-4 were trained to be multimodal (accept non-text inputs, such as images) but the released APIs are text-only. In GCP's case, here, they've released PaLM-2, which is not multimodal like PaLM-E and PaLI. This prevents using it for visual reasoning[0].

I'm just wondering why multiple parties seem reluctant to allow the public to use this.

0: https://visualqa.org

3 comments

The image compression/decompression from their special token system wouldn't be free. It would be just as expensive as any other per-pixel transformation on an image file, and it would be entirely custom software that they would have to run on their servers. Image upload and download is a very significant increase in network traffic compared to just text and could make the whole venture cost a lot more. And finally, an image, even when downsized, is going to be composed of a lot of tokens, so that's going to be a lot of computational cost just to run inference on it. If they haven't implemented statefulness (which many haven't right now despite the simplicity of the technique; the field is still very new), that computational cost must be repeated with every fresh API call.

Basically, multi-modal functionality should be an order-of-magnitude (OOM) increase in compute, traffic, and storage requirements for anyone providing it compared to a text-only model (or an only-text-allowed model).
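
To put rough numbers on the token count, here is a back-of-the-envelope sketch. It assumes a ViT-style encoder that cuts the image into 16x16-pixel patches and emits one token per patch. The patch size, the image sizes, and the function names are assumptions for illustration only, not how PaLM-E, PaLI, or GPT-4 actually tokenize images.

```python
import math


def patch_tokens(width: int, height: int, patch: int = 16) -> int:
    """Tokens for an image split into non-overlapping patch x patch squares.

    Assumes one token per patch (ViT-style); partial patches at the edges
    are counted as full patches.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


def stateless_image_tokens(tokens_per_image: int, calls: int) -> int:
    """Image tokens processed when every API call re-encodes the same image,
    i.e. when nothing is cached between calls."""
    return tokens_per_image * calls


small = patch_tokens(224, 224)      # 14 * 14 patches
large = patch_tokens(1024, 1024)    # 64 * 64 patches

print(small, large)
print(stateless_image_tokens(large, calls=10))
```

Under these assumptions, a single 1024x1024 image costs thousands of tokens before any text is added, and without state that cost is paid again on every call.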

The voice of the people is sometimes a bit raucous
Plus, there is a frenzy over how to maximally exploit these as fast as possible, from all angles and by all parties.

Anyone who acts all casual, as if there is not a constellation of vultures circling AI right now, should consider themselves 'off-grid'.

I wish they would just open the floodgates. The vultures will realize that their extractive problems won't be solved by a generative model, no matter how "multimodal" its inputs are. Of course, that won't happen, because that would require certain charlatans admitting that their models won't hold up in half the places the even more greedy vultures are vying for.
Presumably they're harder to censor or enforce ideological constraints on. I can't see any reason other than them being worried about bad press because someone made the model do something that they want to play up as bad.
I can think of two very important reasons just off the top of my head.

1. It will kill captchas for good. Half of the internet is protected by Cloudflare or Google captchas at this point. Spam, fraud, and other trouble have a maximum possible volume because you can only pay a human in India so little to solve them for you. If you have an algorithm that can complete it, the game is up. Sites may as well not have a captcha at all. Prevention then becomes much more Orwellian, with hardware TPM attestation solutions, and the internet as we know it is forever changed.

2. It will show corporations and governments just how all-seeing video surveillance could be. Human-level (or, by some reports, above-human-level) computer vision is a Pandora's box all by itself.

OpenAI might simply want to avoid opening any more family-size cans of worms than there already are.

> If you have an algorithm that can complete it, the game is up.

This is very much already a thing, I'm sad to say.