by raw_anon_1111 82 days ago
And this is another problem that is easily solved by someone who knows what they are doing…

Voice -> speech to text engine -> LLM creates JSON that the orchestrator understands -> JSON -> regular code as the orchestration -> text based response -> text to speech

Notice that I am not using the LLM to produce output to the user, and if the orchestrator (again, regular old code) doesn’t get valid input, it’s going to error. Sure, you can jailbreak my LLM interpretation. But my orchestrator is going to have the same role-based permissions as if I were using the same API as a backend for a website. Because I probably am.
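A minimal sketch of that orchestrator boundary, assuming the LLM emits a JSON object with an `intent` field. The intent names, the roles, and `handle_llm_output` are all hypothetical, made up for illustration, and are not from any real Amazon Connect API. The point is only that anything outside the expected shape or the caller's permissions raises instead of running:

```python
import json

# Hypothetical intents and the roles allowed to invoke them (illustrative only).
ALLOWED_INTENTS = {
    "book_appointment": {"customer", "agent"},
    "cancel_appointment": {"customer", "agent"},
    "refund_order": {"agent"},
}


class OrchestratorError(Exception):
    """Raised when the LLM output is not a command the orchestrator accepts."""


def handle_llm_output(raw: str, caller_role: str) -> str:
    """Validate the LLM's JSON and return a programmatically generated reply."""
    try:
        command = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OrchestratorError("LLM output is not valid JSON") from exc
    if not isinstance(command, dict):
        raise OrchestratorError("expected a JSON object")

    intent = command.get("intent")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise OrchestratorError(f"unknown intent: {intent!r}")

    # Same permission check a website backend would apply to this caller.
    if caller_role not in ALLOWED_INTENTS[intent]:
        raise OrchestratorError(f"role {caller_role!r} may not run {intent!r}")

    # The reply text comes from regular code, never from the LLM.
    return f"Okay, I have started your request: {intent.replace('_', ' ')}."
```

Even if a jailbreak convinces the model to emit `{"intent": "refund_order"}`, a caller with the `customer` role still gets an error, because the check lives in the orchestrator and not in the prompt.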

Source: creating call centers with Amazon Connect is one of my specialties


> Notice that I am not using the LLM to produce output to the user

So what output does the user get?

The programmatically generated response from the orchestrator, which could be either a confirmation or a request for more information.

Sure - but does this have the context of the original question that the user asked? If not, it seems that it isn’t really conversational and is more of a “compiler”.

How would something like “I want an appointment either on Monday afternoon after 4pm or one on Tuesday before 11am” work?

Unless all the parameters given by the user fit within the constraints of the JSON format, the LLM would need the context of the request and the results to answer properly, would it not?

For reference, my last discussion about this

https://news.ycombinator.com/item?id=47241412

This is a constrained space. I would do the naive implementation first, then talk to the humans (like you), and then extend my JSON definition with a timespan-type field.

My orchestrator would then say “I have these times available [list of times]. What time would you like?” and then return a specific LLM prompt to parse the information I need once the user responds. But I would send that exact text to the user. Yes, I’m purposefully constraining the implementation so that the LLM is never used for output and never directly controls the backend.
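A rough sketch of how the timespan idea could look for the “Monday afternoon after 4pm or Tuesday before 11am” request. The `timespans` field name, the `TimeSpan` type, and both functions are assumptions for illustration; the real JSON definition is not shown in this thread. The LLM only fills in the windows, and the reply text is assembled by ordinary code:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimeSpan:
    """One acceptable window, as the LLM might emit it in the JSON payload."""
    start: datetime
    end: datetime


def parse_timespans(payload: dict) -> list[TimeSpan]:
    """Turn the hypothetical "timespans" field into typed windows; raise on bad input."""
    spans = []
    for item in payload["timespans"]:
        span = TimeSpan(
            datetime.fromisoformat(item["start"]),
            datetime.fromisoformat(item["end"]),
        )
        if span.end <= span.start:
            raise ValueError("timespan must end after it starts")
        spans.append(span)
    return spans


def offer_slots(spans: list[TimeSpan], open_slots: list[datetime]) -> str:
    """Build the exact text sent to the caller from slots inside any window."""
    matches = [s for s in open_slots if any(w.start <= s < w.end for w in spans)]
    if not matches:
        return "I have no times available in those windows. What other time works for you?"
    listed = ", ".join(s.strftime("%A %H:%M") for s in matches)
    return f"I have these times available: {listed}. What time would you like?"


# Example payload the LLM might produce for the request above (dates are made up).
payload = {
    "timespans": [
        {"start": "2025-01-06T16:00", "end": "2025-01-06T18:00"},  # Monday after 4pm
        {"start": "2025-01-07T08:00", "end": "2025-01-07T11:00"},  # Tuesday before 11am
    ]
}
open_slots = [
    datetime(2025, 1, 6, 15, 0),
    datetime(2025, 1, 6, 16, 30),
    datetime(2025, 1, 7, 10, 0),
    datetime(2025, 1, 7, 11, 0),
]
print(offer_slots(parse_timespans(payload), open_slots))
```

The closing hours of each window (18:00 and 08:00) are guesses a real system would take from business hours rather than from the LLM.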

There is also the concept of “semantic alignment”, where you ask the LLM to generically answer the question “does the user’s answer make sense with regard to the question?” as a first-level filter that only returns true or false. This is again a constrained function that you pass in the question and answer to the LLM and if you get something besides true or false your code errors.
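Something like the following would be one way to write that filter. `ask_llm` is a stand-in for whatever model call is actually used (a hypothetical callable that takes a prompt and returns text), and the prompt wording is only a guess:

```python
from typing import Callable


def semantic_alignment(question: str, answer: str, ask_llm: Callable[[str], str]) -> bool:
    """Return True/False from the LLM's verdict; raise on anything else."""
    prompt = (
        "Does the user's answer make sense with regard to the question? "
        "Reply with exactly true or false.\n"
        f"Question: {question}\n"
        f"Answer: {answer}"
    )
    reply = ask_llm(prompt).strip().lower()
    if reply == "true":
        return True
    if reply == "false":
        return False
    # Anything other than true/false is treated as a failure, not interpreted.
    raise ValueError(f"expected true or false, got {reply!r}")
```

Because the function only ever returns a boolean or raises, a chatty or jailbroken reply from the model cannot leak into the conversation.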

The purpose of an LLM, or even before that an old-school intent-based system (see my link), isn’t perfection, it’s “deflection”. The more that you can handle through automation, the less you have to bring a human in. A call handled by a human agent in an American-based call center costs $3–$7 fully allocated. An automated call can cost tenths of a penny.

Of course, that doesn’t include the cost of accepting a call in the first place over a 1-800 number and, in my case, the price that AWS charges per minute for Amazon Connect.
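To put rough numbers on deflection, here is a back-of-the-envelope calculation. The call volume, the 40% deflection rate, the $5.00 midpoint, and the $0.005 automated cost are all made-up inputs within the ranges mentioned above, and the telephony and Amazon Connect per-minute charges are left out:

```python
from decimal import Decimal


def monthly_savings(calls: int, deflection_rate: Decimal,
                    human_cost: Decimal, automated_cost: Decimal) -> Decimal:
    """Savings from calls handled by automation instead of a human agent."""
    deflected = Decimal(calls) * deflection_rate
    return deflected * (human_cost - automated_cost)


# Hypothetical inputs: 10,000 calls a month, 40% deflected.
savings = monthly_savings(10_000, Decimal("0.40"), Decimal("5.00"), Decimal("0.005"))
print(f"${savings:,.2f}")
```

With those inputs, 4,000 deflected calls at a $4.995 difference each come to $19,980 a month before telephony costs.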

> This is again a constrained function that you pass in the question and answer to the LLM and if you get something besides true or false your code errors.

Code erroring is fine for code, but what is the user experience here? Some sort of “computer says no” generic response, or something more contextual?

I’m trying to picture what the user says and hears as a response to an off-the-beaten-path question. Is it just “I don’t understand, here’s how to phrase it”?

If there is an issue, they are transferred to a human operator: “I’m having trouble understanding you, let me transfer you to someone who can help”. On the CSR’s screen, they will see the conversation that has taken place so far.

There is also sentiment analysis built into the prompt, so it can detect a negative sentiment and automatically short-circuit the process and transfer to a human.
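The handoff could be sketched roughly like this. `interpret` and `respond` are hypothetical callables standing in for the LLM parsing step and the orchestrator, and the `sentiment` and `command` keys are assumed names, not a documented format:

```python
from dataclasses import dataclass, field
from typing import Callable

TRANSFER_MESSAGE = (
    "I'm having trouble understanding you, "
    "let me transfer you to someone who can help."
)


@dataclass
class Conversation:
    """Transcript kept so the human agent can see what has happened so far."""
    transcript: list[tuple[str, str]] = field(default_factory=list)
    transferred: bool = False


def take_turn(conv: Conversation, user_text: str,
              interpret: Callable[[str], dict],
              respond: Callable[[dict], str]) -> str:
    """Run one turn; on any parsing problem or negative sentiment, hand off."""
    conv.transcript.append(("caller", user_text))
    try:
        parsed = interpret(user_text)
        if parsed["sentiment"] == "negative":
            raise ValueError("negative sentiment")  # short-circuit to a human
        reply = respond(parsed["command"])
    except (ValueError, KeyError, TypeError):
        conv.transferred = True
        reply = TRANSFER_MESSAGE
    conv.transcript.append(("system", reply))
    return reply
```

The caller never hears an error message. Every failure path collapses into the same transfer line, and the transcript travels with the call.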

Could just have used NLP

NLP doesn’t have world knowledge, and with one prompt I can support almost any language. Of course, the speech-to-text engine is specific to the language.