Do you think Pollen is applicable to distributed AI inference? I think it could work for realtime voice agents running directly on mobile hardware.
There are speech-to-speech LLMs that are big and do pure audio in, audio out. But you can also make voice agents that cascade multiple smaller models: ASR for transcription, an LLM for the response text, TTS for speech, plus interrupt detection. If you try to load ASR, LLM, and TTS models that actually do a good job onto the same mobile device all at once, you can't get it to be realtime. But if you run them in a distributed setup, where each device has only one model loaded and streams its output to the device running the next task, you might achieve realtime performance while using stronger models for each task.
Does this sound possible, or am I misunderstanding how Pollen works?
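To make the cascaded idea above concrete, here is a minimal sketch in plain Python. It is not Pollen code and does not use any Pollen API: each "device" is simulated by a thread that holds exactly one model and streams its output to the next stage through a queue. The names `fake_asr`, `fake_llm`, `fake_tts`, `run_stage`, and `run_pipeline` are hypothetical stand-ins made up for illustration.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def run_stage(model, inbox, outbox):
    """Simulate one device: pull chunks, apply its single model, push downstream."""
    while True:
        chunk = inbox.get()
        if chunk is SENTINEL:
            outbox.put(SENTINEL)  # propagate end-of-stream to the next stage
            return
        outbox.put(model(chunk))


# Hypothetical stand-ins for real models (assumptions, not real libraries).
def fake_asr(audio_chunk):
    return f"text({audio_chunk})"


def fake_llm(text):
    return f"reply({text})"


def fake_tts(text):
    return f"audio({text})"


def run_pipeline(audio_chunks):
    """Stream chunks through ASR -> LLM -> TTS, one stage per simulated device."""
    stages = [fake_asr, fake_llm, fake_tts]
    # One queue in front of each stage, plus one for the final output.
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(model, queues[i], queues[i + 1]))
        for i, model in enumerate(stages)
    ]
    for t in threads:
        t.start()

    for chunk in audio_chunks:
        queues[0].put(chunk)
    queues[0].put(SENTINEL)

    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)

    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["a0", "a1"]))
```

Because every stage starts working as soon as its first chunk arrives, the stages overlap in time instead of running back to back. In a real deployment the queues would be network streams between devices, so the added hop latency would have to be weighed against the gain from running a stronger model at each stage.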
From a conceptual, workload-deployment perspective, I'd say yes--this is largely what I'm trying to achieve with Pollen. In fact I'd go so far as to say that it would be the recommended way of deploying workloads. Pollen's placement model responds better to single functions per seed rather than a single module with multiple, disparate functions, because you'd get a natural balancing of compute; heavy functions scale more aggressively, light functions less so.
I wonder if the limiting factor would be _which_ models can actually be compiled into a reasonably sized WASM module (I'm not familiar with this right now--are you aware of efforts in this space?). If there are genuinely effective models that fit into reasonably sized WASM modules, then it would fit nicely.
All this with the previously acknowledged limitation that it's not yet on mobile (but perhaps a number of edge Pollen nodes could act as ingresses into the cluster in the interim).
I'm super interested to hear how you might employ it though, if you did start experimenting. I'd be interested to learn where it's useful and where it falls short. Please feel free to hit me up on Github or by email (in my profile)!
Just emailed you but I'll reply here as well in case anyone comes across this thread and finds it useful later.
- TTS: I am actively working on this at Wfloat and just released a 30M param model with 20 voices, emotion, and intensity control that runs even on legacy 2017 phones.
- ASR: I think this is in a relatively good spot. The current models small enough to fit on-device just make more transcription errors.
- LLM: For sure the main bottleneck. I know a bunch of people are working on this one. The problem with LLMs is just that they have to be so big to actually know how to do anything.