Hacker News
by ynniv 677 days ago
My take requires a lot of salt, but… this time it’s different.

Try writing a single-page web app or a command-line Python app using the Claude 3.5 chat. Interact with it like you might in a pair programming session where you don’t have the keyboard. When you’ve got something interesting, have it rewrite it in another language. Complain about the bugs. Ask it what new features might make it better. Ask it to write tests. Ask it to write bash scripts to manage running it. Ask it how to deploy and monitor it. Run llama 3.1 on your laptop with ollama. Run phi3-mini on your phone.

The problem is that everyone says they aren’t going to get better, but no one has any data to back that up. If you listen carefully it's almost always based on a lack of imagination. Data is what matters, and we have been inventing new benchmarking problems because they're too good at the old ones. Ignore the hype, both for and against: none of that matters. Spend some time using them and decide for yourself. This time is different.

3 comments

The question is what does programming with an LLM get you over batteries-included frameworks with scaffolding like Rails or Django? If the problem only requires a generic infra solution put together by an LLM instead of a bespoke setup, why not look into low-code/no-code PaaS solutions to start with? Unless the LLM is going to provide you with some uniquely better results than existing tools designed to solve the same problems, it feels like a waste of resources to employ GPUs to do what templates/convention-over-configuration/autocomplete/etc already did.

The point isn't that LLMs are useless, or that they aren't interesting technology in the abstract. The point is that aside from the very real entertainment value of being able to conjure artwork apparently out of thin air, when it comes to solving practical problems in the tech space, it's not clear that they are achieving significantly more - faster or cheaper - than existing tools and methods already did.

You're right that it's probably too early to have data to prove their utility either way, but given how much time, money and energy many companies have already sunk into this - precisely without any evidence to prove it's worthwhile - it does come across rather more like a hype cycle at the moment.

> The question is what does programming with an LLM get you over batteries-included frameworks with scaffolding like Rails or Django?

Three years ago an LLM would conversationally describe what the code would look like.

Two years ago it might crib common examples with minor typos.

Last year it could do something that isn't on StackOverflow at the level of an intern.

Earlier this year it could do something that isn't on StackOverflow at the level of a junior engineer.

Last week I had a conversation with Claude 3.5 that went something like this:

  Write an interactive command-line battleship game
  Write a mouse interactive TUI for it
  Add a cli flag to connect to `ollama` and ask it to make guesses
  There's a bug: write the AI conversation to a file so I can show you
  Try some other models: make options for calling OpenAI and Anthropic
  GPT and Anthropic are throwing this error (it needed to switch APIs)
  The models aren't doing as well as they can: engage them more conversationally
Elapsed time: a few hours. I didn't write any code. Keep in mind that unlike ChatGPT, Claude can't search the net for documentation - this was all "from memory".

What will LLMs do next year?

I read these stories about using LLMs and I always wonder if it's survivorship bias. I believe your experience, and I've also had impressive results. But a lot of the time the AI gets lost and doesn't know what to do. So I'm willing to see it as a developer tool, but it's hard to see it becoming more general purpose in the six-month time frame people have been promising for the last two years.

I played with it a year ago and it really hasn't improved much since then. I even had it produce a few things similar to your battleship demo.

And next year I don't see it improving much either if the best idea anybody has is just to give it more data, which seems to be the mantra in ML circles. There's not an infinite supply of data to give it.

Absolutely. I posted a similar experience developing a Chrome extension with GPT-4o in an hour or so when it would have taken me at least a day to do on my own. I have no idea how people are hand-waving LLMs away as no big deal.

I think the only justification for such a position is if you are a graybeard with full mastery of a stack and that's all you work in. I've dealt with these guys over the years and they are indeed wizards at Rails or Django or what have you. In those cases, I could see the argument that they are actually more efficient than an LLM when working on their specialty.

Which I guess is the difference. I'm a generalist and I'm often working in technologies that I have little experience in. To me LLMs are invaluable for this. They're like pair programming with somebody who has memorized all of Stack Overflow.

Where did you get that it can figure out things that were not fed into it (e.g. not on Stack Overflow)? In the past year, none of them could answer, in any reasonable way, any of my questions for which I couldn’t find anything on Google. They failed very badly when there was no answer to my question and the question itself should have been changed.

> The question is what does programming with an LLM get you over batteries-included frameworks with scaffolding like Rails or Django?

You can use them on top of those frameworks. The point is, you + LLM is generally a way faster you no matter what tech you're using.

We’re going to have AI building Drupal sites soon. The platform is well architected for this. Most of the work is generating configuration files that scaffold the site. There are already AI integrations for content. The surface area is relatively small, and the options are well defined in code and documentation. I would not be surprised if we pull this off first. It’s one of the current project initiatives.

The coding part is still a hard problem. AI for front end and module code is still pretty primitive. LLMs are getting more helpful with that over time.

I have not seen evidence of LLM use making programming way faster, either in my own work or in the work of others who make this claim.

I’ve noticed that LLMs speed me up when working with languages I’m bad at, but slow me down when working in languages I’m good at.

When I hear people saying they use them for 80-90% of their code it kind of blows my mind. Like how? Making crazy intricate specs in English seems way more of a pain in the ass to me than just writing code.

How are you judging others who make this claim?

I'm a FAANG Sr. Software Engineer, use it both in my company and personal projects, and it has made me much faster, but now I'm just "some other person who made this claim".

Can you publish your workflow? I'm on the hunt for resources from people who make the claim. Publishing their workflow in a repeatable way would go a long way.

I'm skeptical because we aren't inundated with tutorials that prove these extraordinary claims.

What do you mean by "publish my workflow"? Do you want a blog post, a github md file? It's pretty simple.

Most recently I use Claude 3.5 projects with this workflow: https://www.youtube.com/watch?v=zNkw5K2W8AQ

Quick example, I wanted to make a clickable visible piano keyboard. I told it I was using Vue and asked for the HTML/CSS to do this (after looking at several github and other examples that looked fairly complicated). It spat out code that worked out of the box in about 1m.

I gave it a package.json file that had gotten messed up, with many dependency versions out of sync with each other, and it immediately fixed them up.

I asked it to give me a specific way using BigQuery SQL to delete duplicate rows while avoiding a certain function, again, 1 minute, done.

I have given it broken code and said "this is the error" and it immediately highlights the error and shows a fix.

All I can surmise from comments like this is that you must have invented some completely unreasonable bar for "evidence" that nothing can possibly pass. Either that, or you simply haven't looked at all.
Could just be different workflows.

I didn’t get anything from messing with LLMs, but I also don’t get much use out of Stack Overflow even as some people spend hours a week on that site. It’s not a question of skill, just the nature of the work.

Then you don't understand how to use the tools. LLMs are an accelerator for people who learn how to work with the prompts correctly and already have a good grasp of the domain in which they are asking questions.
Can you point me to a tutorial or tutorials that clearly show the claimed effectiveness?
I spent some time trying to get ChatGPT to write a front end in JS. It would plot using a library, and then when I complained about a bug it would say "Oh you're right, this library does not implement that commonly implemented method, instead use this code." It would then get stuck in a loop of spitting out buggy code, fixing a bug, and then reintroducing an old bug.

It was okay, but kind of annoying. I understand JS well enough to just debug the code myself, but I wanted it to spit out some boilerplate that worked. I can't remember if I was using ChatGPT omni or if it was still 3.5. It's been a short while.

Anyways, it is cool tech, but I don't feel like it offers the same predictive abilities as classical ML involving fits, validation, model selection, etc. for very specific feature sets.

What you described was the exact same experience I had. I got so far off track in one of my conversations with corrections that I started all over again. It is neat that this technology can do it, but I probably would have been better off doing it manually to save time.

The other thing I’ve noticed is something you alluded to: the LLM being “confidently incorrect”. It speaks so authoritatively about things and when I call it out it agrees and corrects.

The more I use these things (I try to ask multiple LLMs the same question) the more I am wary of the output. And it seems that companies over the past year rushed to jam chatbots into any orifice of their application where they could. I’m curious to see if the incorrectness of them will start to have a real impact.

One thing I noticed about this behavior of LLMs "seeing" their error when you correct them is that sometimes I'm not even correcting them, just asking follow up questions that they interpret as me pointing out some mistake. Example:

Me: - Write a Go function that will iterate over the characters of a string and print them individually.

~Claude spits out code that works as intended.~

Me: - Do you think we should iterate over runes instead?

Claude: – You are absolutely right! Sorry for my oversight, here's the fixed version of the code:

I just wanted to reason about possibilities, but it always takes my question as if I'm pointing out mistakes. This makes me feel not very confident in their answers.

>If you listen carefully it's almost always based on a lack of imagination.

I actually find things to be the opposite. My skepticism comes from understanding that what LLMs do is token prediction. If the output that I want can be solved by the most likely next token, then sure, that’s a good use case. I’m perfectly capable of imagining those cases. People who are all in on AI seem to not get this and go wild.

There’s a difference between imagination and magical thinking.

My disappointment comes from understanding that what humans do is keystroke prediction. If the output that I want can be solved by the most likely next keystroke, then sure, that’s a good use case. I’m perfectly capable of imagining those cases. People who are all in on humanity seem to not get this and go wild.

Don't mistake the "what" for the "how". What we ask LLMs to do is predict tokens. How they're any good at doing that is a more difficult question to answer, and how they are getting better at it, even with the same training data and model size, is even less clear. We don't program them, we have them train themselves. And there are a huge number of hidden variables that could be encoding things in weird ways.

These aren't n-gram models, and you're not going to make good predictions treating them as such.

> Like previous GPT models, the GPT-4 base model was trained to predict the next word in a document…

https://openai.com/index/gpt-4-research/

What humans do is materially different than that. When someone asks me a question, I don’t come up with an answer by thinking, “What’s the first word of my response going to be? The second word?…”

I understand that the AI marketing wants us to believe there’s more magic than that quote, but the actual technical descriptions of the models are what should be considered.

Also, skepticism =/= disappointment and swapping those out greatly changes what the sentence says about my feelings on the matter. Tech from OpenAI and friends can’t really disappoint me. I have no expectation that it won’t just be a money grab ;)

> I don’t come up with an answer by thinking, “What’s the first word of my response going to be? The second word?…”

Actually, I'm not so sure that isn't exactly what we do. That's why it's called a "train of thought". You have a vague idea and you start talking and lo and behold out comes a pretty coherent encapsulation of your idea that is informed and bounded by the token relationships of your language.

Try answering a question with the order of your sentence reversed and you'll find it damn difficult. That answer of yours is not completely well formed just waiting for your mouth to get it all out. You're coming up with the answer one token at a time.

I usually try to think about what I’m going to say before I say it. My train of thought for this comment certainly did not start with “I usually”.
Like Heptapod B. The "next word" argument is pervasive and disappointingly ridiculous. If you present an LLM with a logic puzzle and it gives you the correct answer, how did it "predict the next word"? Yes, the output came in the form of additional tokens. But if those tokens required logical thought, it's a mistake to see the what as the how.
Maybe it’s pervasive because it’s literally the architecture of these models?