Hacker News new | ask | show | jobs
by hubraumhugo 622 days ago
Has anyone seen AI agents working in production at scale? It doesn't matter if you're using Swarm, langchain, or any other orchestration framework if the underlying issue is that AI agents too slow, too expensive, and too unreliable. I wrote about AI agent hype vs. reality[0] a while ago, and I don't think it has changed yet.

[0] https://www.kadoa.com/blog/ai-agents-hype-vs-reality

10 comments

Yes we use agents in a human support agent facing application that has many sub agents used to summarize and analyze a lot of different models, prior support cases, knowledge base information, third party data sets, etc, to form an expert in a specific customer and their unique situation in detected potential fraud and other cases. The goal of the expert is to reduce the cognitive load of our support agent in analyzing some often complex situation with lots of information more rapidly and reliably. Because there is no right answer and the goal is error reduction not elimination it’s not necessary to have determinism, just do better than a human at understanding a lot of divergent information rapidly and answering various queries. Cost isn’t an issue because the decisions are high value. Speed isn’t an issue because the alternative is a human attempting to make sense of an enormous amount of information in many systems. It has dramatically improved our precision and recall over pure humans.
Isn’t the best customer service:

    Cost to Solve < Remaining LTV * Profit Margin
In other words, do the details matter? If the customer leaves because you don’t take a fraudulent $10 return, but he’s worth $1,000 in the long term, that’s dumb.

You might think that such a user doesn’t exist. Then you’d be getting the details wrong again! Example: Should ISPs disconnect users for piracy? Should Apple close your iCloud sub for pirating Apple TV? Should Amazon lose accounts for rejecting returns? Etc etc.

A business that makes CS more details oriented is 200% the wrong solution.

The fraud we deal with is a lot more than $10.
Do you find that the entities committing fraud are using generative AI tools to facilitate the crimes?
They use every tool you can imagine. Most are not imaginative but many are profoundly smart. They could do anything they set their minds to and for some reason this is what they do.
The problem with agents is divergence. Very quickly, an ensemble of agents will start doing their own things and it’s impossible to get something that consistently gets to your desired state.

There are a whole class of problems that do not require low-latency. But not having consistency makes them pretty useless.

Frameworks don’t solve that. You’ll probably need some sort of ground-truth injection at every sub-agent level. Ie: you just need data.

Totally agree with you. Unreliability is the thing that needs solving first.

> The problem with agents is divergence. Very quickly, an ensemble of agents will start doing their own things and it’s impossible to get something that consistently gets to your desired state.

Sounds like management to me.

Sticky goals and reevaluation of tasks is one way to keep the end result on track.

How does gpt o1 solve this?

I use my own agent all day, every day. Here is one example: https://x.com/xundecidability/status/1835085853506650269

I've been using the general agent to build specialised sub-agents. Here's an example search agent beating perplexity: https://x.com/xundecidability/status/1835059091506450493

Do you have any code to share?

I'm failing to see the point in the example, unless the agents can do things on multiple threads. For example let's say we have Boss Agent.

I can ask Boss agent to organize a trip for five people to the Netherlands.

Boss agent can ask some basic questions, about where my Friends are traveling from, and what our budget is .

Then travel agent can go and look up how we each can get there, hotel agent can search for hotel prices, weather agent can make sure it's nice out, sightseeing agent can suggest things for us to do. And I guess correspondence agent can send out emails to my actual friends.

If this is multi-threaded, you could get a ton of work done much faster. But if it's all running on a single thread anyway, then couldn't boss agent just switch functionality after completing each job ?

That particular task didn't need parallel agents or any of the advanced features.

The prompt was: <prompt> Research claude pricing with caching and then review a conversation history to calculate the cost. First, search online for pricing for anthropic api with and without caching enabled for all of the models: claude-3-haiku, claude-3-opus and claude-3.5-sonnet (sonnet 3.5). Create a json file with ALL the pricing data.

from the llm history db, fetch the response.response_json.usage for each result under conversation_id=01j7jzcbxzrspg7qz9h8xbq1ww llm_db=$(llm logs path) schema=$(sqlite3 $llm_db '.schema') example usage: { "input_tokens": 1086, "output_tokens": 1154, "cache_creation_input_tokens": 2364, "cache_read_input_tokens": 0 }

Calculate the actual costs of each prompt by using the usage object for each response based the actual token usage cached or not. Also calculate/simulate what it would have cost if the tokens where not cached. Create interactive graphs of different kinds to show the real cost of conversation, the cache usage, and a comparison to what it would have costed without caching.

Write to intermediary files along the way.

Ask me if anything is unclear. </prompt>

I just gave it your task and I'll share the results tomorrow (I'm off to bed).

True. In the classic form of automation, reasoning is externalized into rules. In the case of AI agents, reasoning is internalized within a language model. This is a fundamental difference. The problem is that language models are not designed to reason. They are designed to predict the next most likely word. They mimic human skills but possess no general intelligence. They are not ready to function without a human in the loop. So, what are the implications of this new form of automation that AI agents represent? https://www.lycee.ai/blog/ai-agents-automation-eng
I want hear more about this. I'm playing with langroid, crew.ai, and dspy and they all layer so many abstractions on top of a shifting LLM landscape. I can't believe anyone is really using them in the way their readme goals profess.
Not you in particular, but I hear this common refrain that the "LLM landscape is shifting", but what exactly is shifting? Yes new models are constantly announced, but at the end of the day, interacting with the LLMs involves making calls to an API, and the OpenAI API (and perhaps Anthropic's variant) has become fairly established, and this API will obviously not change significantly any time soon.

Given that there is (a fairly standard) API to interact with LLMs, the next question is, what abstractions and primitives help easily build applications on top of these, while giving enough flexibility for complex use cases.

The features in Langroid have evolved in response to the requirements of various use-cases that arose while building applications for clients, or companies that have requested them.

Sonnet 3.5 and other large context models made context management approaches irrelevant and will continue to do so.

o1 (and likely sonnet 3.5) made chain of through and other complex prompt engineering irrelevant.

Realtime API (and others that will soon follow) will made the best VTT > LLM > TTV irrelevant.

VLMs will likely make LLMs irrelevant. Who knows what Google has planned for Gemini 2.

The point is building these complex agents has been proven a waste of time over and over again until, at least until we see a plateau in models. It's much easier to swap in a single API call and modify one or two prompts than to rework a convoluted agentic approach. Especially when it's very clear that the same prompts can't be reused reliably between different models.

I encourage you to run evals on result quality for real b2b tasks before making these claims. Almost all of your post is measurably wrong in ways that cause customers to churn an AI product same-day.
I appreciate your comment.

I suppose my comment is reserved more for the documentation than the actual models in the wild?

I do worry that LLM service providers won't do any better than rest API providers in versioning their backend. Even if we specify the model in the call to the API, it feels like it will silently be upgraded behind the scenes. There are so many parameters that could be adjusted to "improve" the experience for users even if the weights don't change.

I prefer to use open weight models when possible. But so many agentic frameworks, like this one (to be fair, I would not expect OpenAI to offer a framework that work local first), treat the local LLM experience as second class, at best.

Years ago we complained about the speed with which new JavaScript frameworks were popping into existence. Today it goes one order of magnitude faster, and the quality of the outputs can only be suffering. Yes there's code but so and so, interfaces and APIs change dramatically, and the documentation is a few versions behind. Who has time to compare simply cannot do it in depth, and ideas get also dropped on the way. I don't want to call it a mess because it's too negative, to have many ideas is great but I feel we're still in the brainstorming phase.
> The underlying issue is that AI agents too slow,

Inference speed is being rapidly optimized, especially for edge devices.

> too expensive,

The half-life of OpenAI's API pricing is a couple of months. While the bleeding edge model is always costly, the cost of API's are becoming rapidly available to the public.

> and too unreliable

Out of the 3 points raised, this is probably the most up in the air. Personally I chalk this up to sideeffects of OpenAI's rapid growth over the last few years. I think this gets solved, especially once price and latency have been figured out.

IMO, the biggest unknown here isn't a technical one, but rather a business one- I don't think it's certain that products built on multi-agent architectures will be addressing a need for end users. Most of the talk I see in this space are by people excited by building with LLM's, not by people who are asking to pay for these products.

Frankly, what you are describing is a money-printing machine. You should expect anyone who has figured out such a thing to keep it as a trade secret, until the FOSS community figures out and publishes something comparable.

I don’t think the tech is ready yet for other reasons, but absence of anyone publishing is not good evidence against.

Agents can work in production, but usually only when they are closer to "workflows" that are very targeted to a specific use case.
If the solution the agents create is immediately useful, then waiting a few minutes or longer for the answer is fine.
Yes I built a lot of stuff (at batch, not to respond to user queries). Mostly large scale code generation and testing tasks.