by largbae 111 days ago
I think this article just speaks to the immaturity of our use of AI at this "moment."

Production-grade systems might be written by agents running on filesystem skills, but the production systems themselves will run on consistent and scalable data structures.

Meanwhile, the UI of AI agents will almost certainly evolve away from desktop computers and toward audio/visual interfaces. An agent might get more context from a Zoom call with you, once tone and body language can be used to increase the bandwidth between you.

2 comments

I don't think written prompting will ever go away. Writing helps you organize your thoughts in a way that speaking, umm, ah, wait no, hang on, does not. In writing, I can go back and change what I've already written before I hit send. Anybody who's prompted with speech for any length of time has said "wait, no, never mind, start over". So STT will get better, sure; it's already quite good. I just don't see text entry going away entirely, because Human Intelligence (HI) doesn't work in a way that would let speech be the only interface.
Totally agree. Speech is powerful and it will always have its place. It will continue to evolve and become far more useful than it is today. But at its core, it remains a highly lossy medium compared with text, especially when it comes to expressing ideas (and taking in ideas others have expressed). Even the best voice memo cannot rival a clear, well-structured email when it comes to explaining something even moderately complicated.

Voice assistants, AI pins, and whatever other speech-based interfaces they come up with next will always be "nice to have", but I don't think anybody should be throwing away their keyboards anytime soon. We may have transformed how we make computers work for us, yet the ways we interact with them are much harder to revolutionize, because they are grounded in the physical, neurological, and habitual constraints of human existence. All of which is to say, when I look at the future, I still see a lot of typing.

https://www.youtube.com/watch?v=GH9-EmgtABw

Saw this video recently, by an AI company working to get contextual cues from tone and body language. I think they're converting it to text and feeding it into an LLM, so it's not natively multimodal, but I still thought it was really cool.