by llmsolutions
1032 days ago
This is great, we've built eerily similar tooling for our internal projects. Unfortunately, in our experiments with OpenAI chat completions, the "reproducible" part has proven to just not be possible. There's nothing more frustrating than spending an hour debugging your chain, only to realize a binary classification prompt has decided to flip after hundreds of consistent executions!
Instead, I think composing LLMs needs to be done in a way that degrades gracefully, with resilience to failure being a fundamental consideration. Biology has similar properties; complex biological systems (ecosystems, cells, etc) have feedback loops, redundancy, and most of all diversity. If we take a similar approach to building LLM apps, we'll end up with things like:
- multiple different prompts used in parallel, with results joined e.g. with voting. A change in how one prompt behaves can thus only have a bounded effect on the system as a whole.
- some way for an LLM to productively express 'this thing you're asking me to do is nonsense', with monitoring and continuous evaluation hooked up to that signal, and maybe runtime retry behavior as well. This helps when prompt A gets an "I'm afraid I can't do that" response, which is then passed to prompt B as if it were valid output, and the error cascades through the rest of the application.
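To make the two bullets above concrete, here's a rough sketch of voting across several prompt variants with an explicit refusal signal. Everything in it is made up for illustration: the `Classifier` type, the `vote` helper and the three stub classifiers are hypothetical, not llmtaskgraph's API, and the stubs stand in for real LLM calls.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical type: one classifier wraps one prompt variant and returns
# a label, or None when the model signals "this request is nonsense".
Classifier = Callable[[str], Optional[str]]


def vote(classifiers: Sequence[Classifier], text: str,
         min_votes: int = 2) -> Optional[str]:
    """Return the strict-majority label, or None if there is no consensus."""
    answers = [classify(text) for classify in classifiers]
    valid = [a for a in answers if a is not None]  # drop refusals
    if len(valid) < min_votes:
        return None  # too many refusals to trust any answer
    label, count = Counter(valid).most_common(1)[0]
    # Require more than half of the valid answers to agree.
    if count * 2 <= len(valid):
        return None
    return label


# Stub classifiers standing in for differently worded prompts.
def strict(text: str) -> Optional[str]:
    return "spam" if "buy now" in text.lower() else "ham"


def lenient(text: str) -> Optional[str]:
    return "spam" if "buy" in text.lower() else "ham"


def refuser(text: str) -> Optional[str]:
    return None  # always refuses, like a prompt that has gone bad


pool = [strict, lenient, refuser]
clear_case = vote(pool, "BUY NOW, limited offer")  # two valid votes agree
split_case = vote(pool, "I will buy milk later")   # valid votes disagree
```

The point is that a `None` result never gets handed to the next prompt as if it were a label. The caller can retry, fall back, or log it, and a single misbehaving prompt (here `refuser`) can't flip the outcome by itself.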
llmtaskgraph is designed to make building, operating and maintaining systems with these sorts of features easier. Without good observability, it's impossible to know whether a feedback loop is doing its job, or which prompts in a pool are behaving well and which poorly, much less what effect they are having on the rest of the system.
Sorry for the wall of text, I got a bit nerd-sniped. :)