| Great write-up! It is a pleasure to see more people explore this area. You can make it even more lean and frugal, if you want. Here is how we built a voice assistant box for Bashkir language. It is currently deployed at ~10 kindergartens/schools: 1. Run speech recognition and speech generation on server CPU. You need just 3 cores (AMD/Intel) to have fast enough responses. Same for the SBERT embedding models (if your assistant needs to find songs, tales or other resources). 2. Use SaaS LLM for prototyping (e.g. mistral.ai has Mistral small and mistral medium LLMs available via API) or run LLMs on your server via llama.cpp. You'll need more than 3 cores, then. 3. Use ESP32-S3 for the voice box. It is powerful enough to run wake-word model and connect to the server via web sockets. 4. If you want to shape responses in a specific format, review Prompting Guide (especially few-shot prompts) and also apply guidance (e.g. as in Microsoft/Guidance framework). However, normally few-shot samples with good prompts are good enough to produce stable responses on many local LLMs. NB: We have built that with custom languages that aren't supported by the mainstream models, this involved a bit of fine-tuning and custom training. For the main-steam languages like English, things are way more easy. This topic fascinates me (also about personal assistants that learn over time). I'm always glad to answer any questions! |