That's essentially what an embedding model is - a smaller, faster model that's good at finding information quickly. Then you feed that to a larger, more powerful reasoning model to synthesize and you've invented RAG.
In my limited weekend testing with just a CPU, the 1.5B model is the larger and more powerful model at the end!
I'm definitely excited to see what new applications are possible with NPUs, when we can run this stuff for real on stuff anyone other than enthusiasts can afford, without waiting 40 seconds.
I'm definitely excited to see what new applications are possible with NPUs, when we can run this stuff for real on stuff anyone other than enthusiasts can afford, without waiting 40 seconds.