| HN Mirror


by romland 1199 days ago

I started a bit of an exploration around prompts and code a week or three back. I want to figure out the down/up-sides and create tools for myself around it.

So, for this project (a game), I decided "for fun" to try to not write any code myself, and avoid narrow prompts that would just feed me single functions for a very specific purpose. The LLM should be responsible for this, not me! It's pretty painful since I still have to debug and understand the potential garbage I was given and after understanding what is wrong, get rid of it, and change/add to the prompt to get new code. Very often completely new code[1]. Rinse and repeat until I have what I need.

The above is a contrived scenario, but it does give some interesting insights. A nice one is that since there are one or more prompts connected to all the code (and its commit), the intention of the code is very well documented in natural language. The commit history creates a rather nice story that I would not normally get in a repository.

Another thing is, getting an LLM (ChatGPT mostly) to fix a bug is really hit and miss and mostly miss for me. Say, a buggy piece comes from the LLM and I feel that this could almost be what I need. I feed that back in with a hint or two and it's very rare that it actually fixes something unless I am very very specific (again, needing to read/understand the intention of the solution). In many cases I, again, get completely new code back. This, more than once, forced my hand to "cheat" and do human changes or additions.

Due to the nature of the contrived scenario, the code quality is obviously suffering but I am looking forward to making the LLM refactor/clean things up eventually.

On occasion ChatGPT tells me it can't help me with my homework. Which is interesting in itself. They are actually trying (but failing) to prevent that. I am really curious how gimped their models will be going forward.

I've been programming for quite long. I've come to realize that I don't need to be programming in the traditional sense. What I like is creating. If that means I can massage an LLM to do a bit of grunt work, I'm good with that.

That said, it still often feels very much like programming, though.

[1] The completely new code issue can likely be alleviated by tweaking transformers settings

Edit: For the curious, the repo is here: https://github.com/romland/llemmings and an example of a commit from the other day: https://github.com/romland/llemmings/commit/466babf420f617dd... - I will push through and make it a playable game, after that, I'll see.

3 comments

celeritascelery 1199 days ago

That is a really interesting experiment! I have so many questions.

- do you feel like this could be a viable work model for real projects? I recognize it will most likely be more effective to balance LLM code with hand written code in the real world.

- some of your prompts are really long. Do you feel like the code you get out of the LLM is worth the effort you put in?

- given that the code returned is often wrong, do you feel like this could be feasible for someone who knows little to no code?

- it seems like you already know well all the technology behind what you are building (i.e. you know how to write a game in js). Do you think you could do this without already having that background knowledge?

- how many times do you have to refine a prompt before you get something that is worth committing?

romland 1199 days ago

I think it could be viable, even right now, with a big caveat: you will want to do some "human" fixes in the code (not just the glue between prompts). The downside of that is you might miss out on parts of the nice natural language story in the commit history. But the upside is you will save a lot of time.

Down the line you will be able to (cheaply) have LLMs know about your entire code-base and at that point, it will definitely become a pretty good option.

On prompt-length, yeah, some of those prompts took a long time to craft. The longer I spend on a prompt, the more variations of the same code I have seen -- I probably get impatient and biased and home in on the exact solution I want to see instead of explaining myself better. When it's gone that far, it's probably not worth it. Very often I should probably also start over on the prompt as it probably can be described differently. That said, if it was in the real world and I was fine with going in and massaging the code fully, quite some time could be saved.

If you don't know how to code, I think it will be very hard. You would at the very least need a lot more patience. But on the flip side, you can ask for explanations of the code that is returned and I must actually say that that is often pretty good -- albeit very verbose in ChatGPT's case. I find it hard to throw a real conclusion out there, but I can say that domain knowledge will always help you. A lot.

I think if you know javascript, you could easily make a game even though you had never ever thought about making a game before. The nice thing about that is that you will probably not do any premature optimization at least :-)

All in all, some prompts were nailed on the first try; the simple particle system was one such example. Some other prompts -- for instance the map generation with Perlin noise -- might take 50 attempts.

A lot of small decisions are helpful, such as deciding against any external dependencies. It's pretty dodgy to ask for code around something (e.g. some noise library) that you need to fit into your project. I decided pretty early that there should be no external dependencies at all and all graphics would be procedurally generated. It has helped me as I don't need to understand any libraries I have never used before.
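To illustrate what "no external noise library" can look like, here is a minimal sketch of dependency-free 1D gradient noise (Perlin-style) used to produce terrain heights. It is not the code from the repo: the game itself is JavaScript, and the names `make_noise_1d` and `terrain_heights`, along with the `scale`, `base` and `amplitude` defaults, are made up for this example.

```python
import math
import random


def make_noise_1d(seed, size=256):
    """Return a 1D gradient-noise function backed by `size` random gradients."""
    rng = random.Random(seed)
    gradients = [rng.uniform(-1.0, 1.0) for _ in range(size)]

    def fade(t):
        # Perlin's fade curve 6t^5 - 15t^4 + 10t^3 (zero slope at t=0 and t=1).
        return t * t * t * (t * (t * 6 - 15) + 10)

    def noise(x):
        x0 = math.floor(x)   # lattice point to the left of x
        t = x - x0           # position between the two lattice points, 0 <= t < 1
        g0 = gradients[x0 % size]
        g1 = gradients[(x0 + 1) % size]
        # Each lattice point contributes its gradient times the signed distance to x.
        n0 = g0 * t
        n1 = g1 * (t - 1.0)
        # Blend the two contributions with the fade curve.
        return n0 + (n1 - n0) * fade(t)

    return noise


def terrain_heights(width, seed, scale=0.05, base=100.0, amplitude=60.0):
    """Hypothetical helper: one ground height per pixel column."""
    noise = make_noise_1d(seed)
    return [base + amplitude * noise(x * scale) for x in range(width)]
```

Calling `terrain_heights(800, seed=42)` gives one height per column; the same seed always reproduces the same map, which helps when debugging generated levels.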

Another note related to the above: there are upsides and downsides to a high-ish temperature, since you get varying results. I think I should probably change my behaviour around that and possibly tweak it depending on how exact I feel my prompt is.
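For readers unfamiliar with the temperature knob, here is a toy sketch of the usual definition: logits are divided by the temperature before the softmax, so a low temperature concentrates probability on the top candidate and a high one spreads it out. The logits are invented and the function names are illustrative; this is not how any particular hosted model is implemented.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one candidate index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up scores for three candidate tokens.
toy_logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(toy_logits, 0.2)   # nearly always picks candidate 0
high = softmax_with_temperature(toy_logits, 1.5)  # candidates 1 and 2 get real weight
```

With a precise prompt, a lower temperature should give more repeatable code; with a vague one, a higher temperature at least shows more of the solution space per attempt.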

I often find myself wondering where the cap of today's LLMs is, even if we go in the direction of multi-models and have a base which does the reasoning -- and I have to say I keep finding myself getting surprised. I think there is a good possibility that this will be the way some kinds of development will be. But, well, we'd need good local models for that if we work on projects that might be of a sensitive nature.

Related to amount of prompt attempts: I think the game has cost me around $6 in OpenAI fees so far.

One particularly irritating (time consuming) prompt was getting animated legs and feet: https://github.com/romland/llemmings/commit/e9852a353f89c217...

sk0g 1199 days ago

That's a beautiful readme, starred!

Out of curiosity, right now would you say you have saved time by (almost) exclusively prompting instead of typing the code up yourself? Do you see that trending in another direction as the project progresses?

romland 1199 days ago

It was far easier to get big chunks of work done in the beginning, but that is pretty much how it works for a human too (at least for me). The thing that limits you is the context-length limit of the LLM, so you have to be rather picky about what existing code you feed back in. With this then comes the issue of all the glue between the prompts, so I can see that the more polished things need to become, the more human intervention is needed -- this is a trend I already very much see.
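Being "picky" about what code goes back into the prompt can be sketched as a budgeting problem. The following is a hypothetical illustration, not the workflow used in the repo: `estimate_tokens` uses a crude guess of about four characters per token (a real tokenizer will give different numbers), and `pick_snippets` greedily keeps snippets in the priority order the caller provides.

```python
def estimate_tokens(text):
    # Crude assumption: roughly 4 characters per token. Real tokenizers differ.
    return max(1, len(text) // 4)


def pick_snippets(snippets, budget_tokens):
    """Greedily keep snippets that fit the budget.

    snippets: list of (name, source) tuples, most important first.
    Returns (names_kept, tokens_used).
    """
    chosen = []
    used = 0
    for name, source in snippets:
        cost = estimate_tokens(source)
        if used + cost > budget_tokens:
            continue  # too big for what is left; a later, smaller snippet may still fit
        chosen.append(name)
        used += cost
    return chosen, used
```

For example, with a budget of 200 estimated tokens, a 400-character function and a 200-character helper both fit, while a 4000-character module in between is skipped. Whatever is skipped is exactly the "glue" the human ends up handling.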

If there is time saved, it is mostly because I don't fear some upcoming grunt work. Say, for instance, creating the "Builder" lemming. You know pretty much exactly how to do it but you know there will be a lot of off-by-one errors and subtle issues. It's easier to go at it by throwing together some prompt a bit half-heartedly and see where it goes.

On some prompts, several hours were spent, mostly reading and debugging outputs from the LLM. This is where it eventually gets a bit dubious -- I now know pretty much exactly how I want the code to look since I have seen so many variants. I might find myself massaging the prompt to narrow in on my exact solution instead of making the LLM "understand the problem".

Much of this is due to the contrived situation (human should write little code) -- in the real world you would just fix the code instead of the prompt and save a lot of time.

Thank you, by the way! I always find it scary to share links to projects! :-)

sk0g 1199 days ago

No worries, going to check out some of the commits when I get a bit more free time as well. The concept is intriguing!

The usefulness of LLMs for engineering things is very hard to gauge, and your project is going to be quite interesting as you progress. No doubt they help with writing new things, but I spend maybe ~15% of my time working on something new, vs maintenance and extensions. The more common activities are very infrequently demonstrated: either the usefulness diminishes as the context required grows, or they simply make for less exciting examples. Though someone in my org has brought up an LLM tool that tries to remedy bugs on the fly (at runtime), which sounds absolutely horrific to me...

It sounds similar to my experience with Copilot then. In small, self-contained bits of code -- much more common in new projects or microservices for example -- it can save a lot of cookie cutter work. Sometimes it will get me 80% of the way there, and I have to manually tweak it. Quite often it produces complete garbage that I ignore. All that to say, if I weren't an SE, Copilot would bring me no closer to tackling anything beyond hello world.

One big benefit though is with the simpler test cases. If I start them with a "GIVEN ... WHEN ... THEN ..." comment, the autocompletes for those can be terrific, requiring maybe some alterations to suit my taste. I get positive feedback in PRs and from people debugging the test cases too, because the intention behind them is clear without needing to guess the rationale for the test. Win win!
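As an illustration of that comment style, here is a small sketch using Python's `unittest`. The function under test, `apply_gravity`, is a made-up helper invented for this example; the point is only that the GIVEN/WHEN/THEN comment states the intent before any code is written, which is what an autocomplete (or a reviewer) then works from.

```python
import unittest


def apply_gravity(y, velocity, ground_y, gravity=1.0):
    """Hypothetical helper: advance one tick of falling, stopping at the ground."""
    velocity += gravity
    y += velocity
    if y >= ground_y:
        return ground_y, 0.0
    return y, velocity


class ApplyGravityTest(unittest.TestCase):
    def test_falling_object_stops_at_ground(self):
        # GIVEN an object one unit above the ground, falling at speed 3
        # WHEN one tick of gravity is applied
        # THEN it rests exactly on the ground with zero velocity
        y, velocity = apply_gravity(y=9.0, velocity=3.0, ground_y=10.0)
        self.assertEqual(y, 10.0)
        self.assertEqual(velocity, 0.0)

    def test_object_in_mid_air_keeps_falling(self):
        # GIVEN an object at rest far above the ground
        # WHEN one tick of gravity is applied
        # THEN it moves down by its new velocity and keeps that velocity
        y, velocity = apply_gravity(y=0.0, velocity=0.0, ground_y=100.0)
        self.assertEqual((y, velocity), (1.0, 1.0))
```

Saved to a file, this can be run with `python -m unittest <file>`.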

ChatGTP 1199 days ago

Just curious, which version are you using?

romland 1199 days ago

I have experimented quite a bit with various flavours of LLaMa, but have had little success in actually getting not-narrow outputs out of them.

Most of the code in there now is generated by gpt-3.5-turbo. Some commits are by GPT-4, and that is mostly due to context length limitations. I have tried to put which LLM was used in every non-human commit, but I might have missed it in some.
