Hacker News
by JoeCortopassi 736 days ago
A lot of people here haven't integrated GPT into a customer facing production system, and it shows

gpt-4, gpt-4-turbo, and gpt-4o are not the same models. They are mostly close enough when you have a human in the loop and loose constraints. But if you are building systems off of the (already fragile) prompt-based output, you will have to go through a very manual process of tuning your prompts to get the same/similar output out of the new model. It will break in weird ways that make you feel like you are trying to nail Jello to a tree

There are software tools/services that help with this, and a ton more that merely promise to, but most of the tooling around LLMs these days gives the illusion of a reliable tool rather than the results of one. It's still the early days of the gold rush, and everyone wants to be seen as one of the first


Maybe we shouldn't be selling products built on such a shaky foundation? Like Health Insurance products for example.

[2]: https://insurtechdigital.com/articles/chatgpt-the-risks-and-...

--- please disregard [1]; it was a terrible initial source I pulled off Google

[1]: https://medium.com/artivatic/use-of-chatgpt-4-in-health-insu...

Building products on shaky foundations is a tried-and-true approach in IT business.
For a different point of view from someone with extremely credible credentials (learned this stuff from Hinton among many other things) and a much more sober and balanced take on all this I recommend the following interview with Nick Frosst (don’t be put off by the clickbait YouTube title, that’s a very silly caption):

https://youtu.be/4JF1V2hzGKE

Minimum Viable Products are pretty much by definition built on shaky foundations. At least with software written by humans the failure modes are somewhat bounded by the architecture of the system as opposed to the who-knows-what-the-model-will-hallucinate of AI.
I think that is the key problem: a traditional MVP is a mostly known entity. It may be missing some features, have some bugs, etc. But it is an MVP not because it was necessarily rushed out the door (I mean... it was, but differently) but because it has some rough edges and is likely missing major features.

Whereas what we seem to be getting from a lot of these companies shoving AI into something and calling it a product is an MVP that is an MVP because of its unknown and untested nature.

The term MVP was cover for shoving poor quality software out on the market long before AI became involved. This is unfortunate, but it was inevitable once the term was popularized. AI is incredibly easy to tack on now, so people are doing that too.
That is true, but I think rushing to add AI features made it a completely different situation.

We got a lot of MVP crap before, don't get me wrong. But at least it was understood crap. Sure, it may have had bugs in it, and that is to be expected. But there was a limit to how wrong it could go, since at the end of the day it was still limited to the code within the application and the server (if there was one).

Meanwhile, when an over-reliance on an LLM goes wrong, the result, depending on how it goes wrong, could be catastrophic.

As we have seen time and time again just in the last couple of months, when LLMs are shoved into something we seem to get a serious lack of testing under the guise of "beta".

But ultimately we have to test and release things to see what works and what doesn't. Very many use cases don't require perfect accuracy.
I’m not really sure this is an entirely fair argument.

If you rely on third party packages of any type, you have dependencies that can rapidly and unexpectedly break with an update. Semantic versioning is supposed to help with this, but it doesn’t always help.

> It will break in weird ways that make you feel like you are trying to nail Jello to a tree

Probably the best description of working with LLM agents I've read

It gets more interesting when you get to benchmarking your prompts for accuracy. If you don't have an evaluation set, you are flying blind. Any model update or small fix could break edge cases without you knowing.
We are using benchmarking on our own eval sets, which makes it easier to measure the variance that I’ve found impossible to eliminate.
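For anyone wondering what that looks like in practice, here is a minimal sketch of an eval harness, not anyone's actual setup. It assumes the model is wrapped in a plain `prompt -> str` callable. `EVAL_SET`, `stub_model`, `always_yes_model`, and the `tolerance` value are all made up for illustration; in real use you would swap the stubs for whatever client call you make and pick a tolerance that fits your own noise level.

```python
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical eval set: (prompt, expected answer) pairs. Illustrative only.
EVAL_SET: List[Tuple[str, str]] = [
    ("Is 'refund my order' a billing request? Answer yes or no.", "yes"),
    ("Is 'what time is it' a billing request? Answer yes or no.", "no"),
]


def accuracy(model: Callable[[str], str],
             eval_set: Sequence[Tuple[str, str]]) -> float:
    """Fraction of prompts whose normalized output matches the expected answer."""
    correct = sum(
        1 for prompt, expected in eval_set
        if model(prompt).strip().lower() == expected
    )
    return correct / len(eval_set)


def benchmark(model: Callable[[str], str],
              eval_set: Sequence[Tuple[str, str]],
              runs: int = 5) -> Dict[str, object]:
    """Run the eval set several times, since LLM output varies run to run."""
    scores = [accuracy(model, eval_set) for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),  # spread across repeated runs
        "min": min(scores),
        "scores": scores,
    }


def regressed(baseline: Dict[str, object],
              candidate: Dict[str, object],
              tolerance: float = 0.02) -> bool:
    """True if the candidate's mean accuracy drops below baseline by more than tolerance."""
    return candidate["mean"] < baseline["mean"] - tolerance


# Stand-ins for real model calls (assumptions, not a real API).
def stub_model(prompt: str) -> str:
    return "yes" if "refund" in prompt else "no"


def always_yes_model(prompt: str) -> str:
    return "Yes"


if __name__ == "__main__":
    old = benchmark(stub_model, EVAL_SET, runs=3)
    new = benchmark(always_yes_model, EVAL_SET, runs=3)
    print(old, new, "regressed:", regressed(old, new))
```

Running the same set several times and looking at the spread, not just the mean, is what lets you tell a real regression after a model swap from ordinary run-to-run noise.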
Make sure you don’t upload that evaluation set to any service that resells data (or gets scraped) for LLM training!
Came here to say the same thing, it sums it up perfectly
Hopefully you built a solid eval system around the core of your GenAI usage, otherwise, yes, this is going to be very painful :)
My naive answer: turn away from Silicon Valley modernity with its unicorns and runways and “”marketing””, and embrace the boring stuffy academics! https://dspy-docs.vercel.app/
I never got DSPy. I only tried a brief example, but can someone explain why it's better than alternatives? Not that I hold LangChain in particularly high regard...
hosted on Vercel and Github...
Is it Winter already?
Oh god, what I would give for an AI winter right now… I think we’ve officially hit AI global warming
It can't be, it's too big to die.
I've seen people mention this lib before, and I have a hard time understanding the use cases and how it's used.