derac, 16 days ago
Look at the table of supported modalities. It can take image, video, text, and action inputs, and it can output image, video, text, and actions.
causal, 16 days ago
That just raises more questions. What kind of "observation or action" image does an input generate? What is an action output if it's not text?