| HN Mirror


by findingMeaning 694 days ago

Everything is LLMs these days. LLMs this, LLMs that. Am I really missing out on something from these muted models? Back when they were released, they were so much more capable, but now everything is muted to the point that they are mostly autocomplete on steroids.

How can adding analytics to a system that is designed to act like humans produce any good? What is the goal here? Could you clarify why someone would need to analyze LLMs, of all things?

> Rich text data makes LLM traces unique, so we let you track “semantic metrics” (like what your AI agent is actually saying) and connect those metrics to where they happen in the trace

But why does it matter? In their current state these are muted LLMs overseen by big companies. We have very little control over the behavior, and whatever we give it, the output will mostly be 'politically' correct.

> One thing missing from all LLM observability platforms right now is an adequate search over traces.

Again, why do we need to evaluate LLMs? Unless you are working in security, I see no purpose, because these models aren't as capable as they used to be. Everything is muted.

For context: I don't even need to prompt engineer these days because it gives similar results with the default prompt. My prompts these days are literally three words, because that gets more of the job done than giving an elaborate prompt with precise examples and context.

5 comments

otabdeveloper4 694 days ago

They're not "muted". You just got used to them and figured out that they don't actually generate new knowledge or information, they only give a statistically average summary of the top Google query. (I.e., they are super bland, boring and predictable.)

sswatson 693 days ago

LLMs are pretty bland but they don’t just summarize the top Google result. They can generate correct SQL queries to answer complex questions about novel datasets. Summarizing a search engine result does not get you anywhere close to that.

It may be fair to characterize what they’re doing as interpolative retrieval, but there’s no reason to deny that the “interpolative” part pulls a lot of weight.

P.S. Yes, reliability is a major problem for many potential LLM applications, but that is immaterial to the question of whether they're doing something qualitatively different from point lookups followed by summarization.

otabdeveloper4 693 days ago

> They can generate correct SQL queries to answer complex questions about novel datasets.

"Correct" is a big overstatement, unless by "SQL" you mean something extremely basic and ubiquitous.

coderaptor 693 days ago

The output can be explicitly constrained to a formal syntax (see outlines.dev).

For many cases this is more than enough to solve some hard problems well enough.
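To make "constrained to a formal syntax" concrete, here is a toy sketch of the general idea, not the API of outlines.dev or any other library. It assumes the model proposes ranked candidate tokens at each step (the hypothetical `ranked_tokens_per_step`), and the decoder only accepts a token if the text so far is still a prefix of some allowed output.

```python
def constrained_decode(ranked_tokens_per_step, allowed_outputs):
    """Greedy decoding that rejects tokens which break the constraint.

    ranked_tokens_per_step: one list per step, best candidate first
    (a stand-in for model output; hypothetical).
    allowed_outputs: the set of strings the "grammar" permits.
    """
    text = ""
    for ranked in ranked_tokens_per_step:
        for token in ranked:
            candidate = text + token
            # Keep the token only if some allowed output starts with it.
            if any(allowed.startswith(candidate) for allowed in allowed_outputs):
                text = candidate
                break
        else:
            raise ValueError("no candidate token satisfies the constraint")
    return text


# The model "prefers" DROP, but only SELECT or INSERT are permitted.
steps = [["DROP", "SEL", "INS"], ["ERT", "ECT"]]
result = constrained_decode(steps, {"SELECT", "INSERT"})
```

Real implementations work on token probabilities and full grammars or regular expressions, and this sketch has no backtracking, so treat it as an illustration only.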

cruffle_duffle 693 days ago

Honestly I think the reason it is “extremely basic” is because while it has been trained on “the entire internet” it doesn’t know anything about your specific database schema beyond what you provided in your prompts.

If these LLMs were cheap and easy to train (or is it fine tune?) using your own schema and code base on top of its existing “whole internet” training data… it could almost certainly do more than just provide “basic stuff”.

Of course I think the training for your own personal stuff would need to be “different” somehow so it knows that while most of its training is generalistic the stuff you feed it is special and it needs to apply the generalist training as a means for understanding your personal stuff.

Or something like that. Whatever the case is it would need to be cheap, quick and easy to pick up a generalist LLM and supplement it with the entirety of your own personal corpus.

thelittleone 693 days ago

I found a LOT more value with personal python based API tools once I employed well described JSON schemas.

One of my clients must comply with a cyber risk framework with ~350 security requirements, many of which are so poorly written that misinterpretation is both common and costly.

But there are other, more well-written and described frameworks that include "mappings" between the two frameworks.

In the past I would take one of the vague security requirements, read the mapping to the well described framework to understand the underlying risk, the intent of the question, as well as likely mitigating measures (security controls). On average, that would take between 45 and 60 minutes per question. Multiplied out, that's ~350 * 45 minutes, or around 262 hours at the low end.
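The estimate above can be checked with a few lines of arithmetic. The figures come from the comment (about 350 requirements, 45 to 60 minutes each); the variable names are just for illustration.

```python
questions = 350                  # approximate count from the comment
low_minutes, high_minutes = 45, 60

# Total manual effort in hours at each end of the per-question range.
low_hours = questions * low_minutes / 60
high_hours = questions * high_minutes / 60

print(f"{low_hours} to {high_hours} hours")
```

So the quoted ~262 hours is the low end of the range; at 60 minutes per question it would be closer to 350 hours.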

My first attempts to use AI for this yielded results that had some value, but lacked the quality to provide to the client.

This past weekend, using python, Sonnet 3.5, and JSON schemas, I managed to get the entire ~350 questions documented with a quality level exceeding what I could achieve manually.

It cost $10 in API credits and approx 14 hrs of my time (I'm sure a pro could easily achieve this in under 1 hour). The code itself was easy enough, but the big improvements came from the schema descriptions. That was the change that gave me the 'aha' moment.
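For readers wondering what "well described JSON schemas" might look like, here is a minimal sketch using only the standard library. The field names, descriptions, and helper functions are hypothetical, not the ones from the actual project, and the model call itself is left out.

```python
import json

# Hypothetical schema: the descriptions tell the model what each field means.
REQUIREMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "underlying_risk": {
            "type": "string",
            "description": "The risk the requirement is meant to address, "
                           "stated in one or two plain sentences.",
        },
        "intent": {
            "type": "string",
            "description": "What the requirement is really asking the "
                           "organisation to demonstrate.",
        },
        "mitigating_controls": {
            "type": "array",
            "description": "Security controls that would likely satisfy "
                           "the requirement, one per item.",
        },
    },
    "required": ["underlying_risk", "intent", "mitigating_controls"],
}

PYTHON_TYPES = {"string": str, "array": list}


def build_prompt(requirement_text, schema):
    """Embed the schema, descriptions included, in the prompt text."""
    return (
        "Explain the security requirement below. Reply with JSON only, "
        "matching this schema:\n"
        + json.dumps(schema, indent=2)
        + "\n\nRequirement: " + requirement_text
    )


def check_reply(reply_text, schema):
    """Parse a reply and check required keys and their basic types."""
    data = json.loads(reply_text)
    for key in schema["required"]:
        if key not in data:
            raise ValueError(f"missing field: {key}")
        expected = PYTHON_TYPES[schema["properties"][key]["type"]]
        if not isinstance(data[key], expected):
            raise ValueError(f"wrong type for field: {key}")
    return data
```

The point of the sketch is that the `description` strings travel with the prompt, and that every reply is checked before it is trusted.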

I read over the final results for dangerous errors (and ended up changing nothing at all), but just in case, I ran the results through GPT-4o, which also found no issues that would prevent sending it to the client.

I would never get that job done manually, it's simply too much of a grind for a human to do cheaply or reliably.

skull8888888 693 days ago

Have you tried BAML (https://github.com/boundaryml/baml)? It's really good at structured output parsing. We integrated it directly into our pipeline builder.

thelittleone 692 days ago

Not yet, but the weekend is just beginning, thanks for the tip.

aaronvg 692 days ago

(BAML founder here) feel free to jump on our Discord or email us if you have any issues with BAML! Here's our repo (with docs links) https://github.com/BoundaryML/baml and a demo: https://boundaryml.wistia.com/medias/5fxpquglde

People have used it to do anything from simple classifications to extracting giant schemas.

skull8888888 692 days ago

You are welcome! The easiest way to get started with BAML on Laminar is with our pipeline builder and Structured Output template. Check out the docs here (https://docs.lmnr.ai/pipeline/introduction)

skull8888888 694 days ago

Hey there, apologies for the late reply.

> Could you clarify why someone would need to analyze LLMs, of all things?

When you want to understand trends in the output of your Agent / RAG at scale, without manually looking at each trace, you need another LLM to process the output. For instance, say you want to understand what the most common topic discussed with your agent is. You can prompt another LLM to extract this info, and Laminar will host everything and turn this data into metrics.
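As a rough illustration of that flow (not Laminar's API), the sketch below labels each agent output with a topic and aggregates the labels into counts. The `extract_topic` function is a keyword-based stand-in for the second LLM call, and the sample outputs are made up.

```python
from collections import Counter


def extract_topic(agent_output):
    """Stand-in for the second LLM call: one topic label per output."""
    text = agent_output.lower()
    if "refund" in text:
        return "refund"
    if "delivery" in text or "shipping" in text:
        return "shipping"
    return "other"


def topic_metrics(agent_outputs, extract=extract_topic):
    """Turn per-trace topic labels into a count per topic."""
    return Counter(extract(output) for output in agent_outputs)


# Made-up agent outputs, one per trace.
outputs = [
    "Your refund was issued yesterday.",
    "Shipping usually takes three days.",
    "I can start a refund for that order.",
]
metrics = topic_metrics(outputs)
```

In a real setup the `extract` argument would wrap a model call, and the counts would be stored alongside the trace they came from.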

> Why do we need to evaluate LLMs?

You're right, devs who want to evaluate the output of their LLM apps truly care about quality or some other metric. For these kinds of cases, evals are invaluable. Good examples would be AI drive-through agents or AI voice agents for mortgages (use cases we've seen on Laminar).

Oras 694 days ago

Topic modelling and classifications are real problems in LLM observability and evaluation, glad to see a platform doing this.

I see that you have chained prompts, does that mean I can define agents and functions inside the platform without having it in the code?

skull8888888 694 days ago

Yes! Our pipeline builder is pretty versatile. You can define conditional routing, parallel branches, and cycles. Right now we support LLM nodes and util nodes (JSON extractor). If you can define your logic purely from those nodes (and in the majority of cases you will be able to), then great, you can host everything on Laminar! You can follow this guide (https://docs.lmnr.ai/tutorials/control-flow-with-LLM); it's a bit outdated but gives you a good idea of how to create and run pipelines.
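For anyone unfamiliar with the idea, conditional routing between nodes can be sketched in a few lines of plain Python. This is a generic illustration, not Laminar's pipeline format; the node names and the `run_pipeline` helper are hypothetical, and the lambdas stand in for LLM or util nodes.

```python
def run_pipeline(nodes, start, value, max_steps=20):
    """Run nodes until one returns None as the next node name.

    Each node takes a value and returns (new_value, next_node_name).
    max_steps guards against cycles that never terminate.
    """
    name = start
    for _ in range(max_steps):
        value, name = nodes[name](value)
        if name is None:
            return value
    raise RuntimeError("pipeline did not finish within max_steps")


NODES = {
    # Router: picks the next node based on the input.
    "classify": lambda text: (text, "short" if len(text) < 20 else "long"),
    # Two branches, each ending the pipeline.
    "short": lambda text: (text.upper(), None),
    "long": lambda text: (text[:20] + "...", None),
}

short_result = run_pipeline(NODES, "classify", "hi")
long_result = run_pipeline(NODES, "classify", "a" * 30)
```

A node that routes back to an earlier node gives you a cycle, which is why the step limit is there.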

baq 694 days ago

> Everything is LLMs these days. LLMs this, LLMs that. Am I really missing out on something from these muted models? Back when they were released, they were so much more capable, but now everything is muted to the point that they are mostly autocomplete on steroids.

it was my experience, too, then I tried out that cursor thing and turns out a well designed UX around claude 3.5 is the bees knees. it really does work, highly recommend the free trial. YMMV of course depending on what you work on; I tested it strictly on Python.

not_a_dane 694 days ago

LLMs and python don't sound good together.

phillipcarter 693 days ago

You're thinking about consumer use cases. Commercial use cases are not "muted" by any means. The goal is to produce domain-specific JSON when fed some contextual data. And LLMs have only gotten better at doing so over time.