| HN Mirror



	by ilaksh 174 days ago
	I wonder if someday there will be a video codec that is essentially a standard distribution of a very precise and extremely fast text-to-video model (like SmartTurboDiffusion-2027 or something). Surely there are limits to text, but even the example you gave does not seem to me to be beyond the reach of a text description, given a certain level of precision and capability in the model. And we now have faster-than-realtime text-to-video.

2 comments

zephen 172 days ago

Maybe?

To the extent that that could work, I would imagine that I, personally, would be happy reading the textual description instead of watching the video, and for me, we'd now be even closer to "text wins 100% of the time."

In other words, it's not that you _can't_ give excellent descriptions that would obviate the need for video, it's just that people _don't_, even, or perhaps even especially, when they think they do.

If someone writes text that creates a video that shows exactly how to get something apart, then _presumably_ they also watch the video to make sure it works.

So the video becomes a debugging tool for their instructions. Perhaps not as good as watching 100 people do it, but maybe even better in some ways.

So the video codec you describe could be a useful tool to help create more programmers.

https://www.commitstrip.com/en/2016/08/25/a-very-comprehensi...

tsimionescu 171 days ago

I think it's quite obvious that any textual description that had any hope of being converted to video in this way would be entirely useless for a human mind. It wouldn't say something like "the fastener is on the underside of the chair about 3/5ths of the way", it would say something like "there is a square-shaped object in view 5cm from the top of the view and 120cm from the right; the object is 2cm x 2.2cm, color 0x7F325A".
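
To make the contrast concrete, here is a rough Python sketch of what such a machine-oriented description might look like as a record. The `ObjectSpec` class and its field names are hypothetical, invented only to mirror the quoted example; no real text-to-video model is known to take this format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectSpec:
    """Hypothetical record for one object in a frame (illustrative only)."""
    shape: str
    top_cm: float      # distance from the top of the view
    right_cm: float    # distance from the right of the view
    width_cm: float
    height_cm: float
    color: int         # 24-bit RGB value

    def describe(self) -> str:
        # Render the record as the kind of exhaustive sentence quoted above.
        return (
            f"there is a {self.shape} object in view {self.top_cm:g}cm from the top "
            f"of the view and {self.right_cm:g}cm from the right; the object is "
            f"{self.width_cm:g}cm x {self.height_cm:g}cm, color 0x{self.color:06X}"
        )


spec = ObjectSpec("square-shaped", 5, 120, 2, 2.2, 0x7F325A)
print(spec.describe())
```

A full scene would need one such record per object per frame, which is the readability problem being pointed out.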

zephen 171 days ago

> entirely useless for a human mind.

You may be right, although, of course, current LLMs often do the right thing with "about 3/5ths of the way."

OTOH, as someone who has done CAD and schematic drawings by programming, I am not 100% convinced about the inevitability of unreadability.

In any case, though, the bar is not really whether any human can interpret the text, but whether the average human will interpret the text or video faster, and here, to your point, yes, the video probably still wins handily.

The closest analogy I can think of is animated math GIFs like these:

https://en.wikipedia.org/wiki/User:LucasVB/Gallery

Which can be a huge aid in learning.

But this leads to another conundrum. Where do animated GIFs end and video begin? Because I could see a simple line-drawing style animated GIF being sufficient for most purposes.

egypturnash 174 days ago

This sounds incredibly precarious and prone to breaking when you update to a new model.

ilaksh 174 days ago

It would be impossible to change the model. It would be like a codec, like H.264 but with 1-2GB of fixed data attached to that code name. Changing the model is like going to H.265. Different codec.
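
One way to picture "fixed data attached to that code name": the codec name is bound to a digest of the exact weights, so any change to the weights fails the check and has to ship under a new name. A minimal Python sketch of that idea; `ModelCodec`, `weights_digest`, and `can_decode` are hypothetical names, not part of any existing codec standard.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCodec:
    """Hypothetical codec identity: a name pinned to one exact set of weights."""
    name: str
    weights_sha256: str


def weights_digest(weights: bytes) -> str:
    # SHA-256 of the raw weight bytes; any single changed byte alters the digest.
    return hashlib.sha256(weights).hexdigest()


def can_decode(codec: ModelCodec, stream_codec_name: str, local_weights: bytes) -> bool:
    # A stream is decodable only if both the name and the weights match exactly.
    return (
        stream_codec_name == codec.name
        and weights_digest(local_weights) == codec.weights_sha256
    )


weights = b"stand-in for 1-2GB of fixed model data"
codec = ModelCodec("SmartTurboDiffusion-2027", weights_digest(weights))
print(can_decode(codec, "SmartTurboDiffusion-2027", weights))
```

Under this assumption, "updating the model" is not an upgrade in place: retrained weights produce a different digest and therefore a different codec.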
