Hacker News
by jpollock 302 days ago
I've just spent the better part of two weeks trying to convince an LLM to automate some programming for me.

We use feature flags. However, cleaning them up is rarely done. It typically takes me ~3 minutes to clean one up.

To clean up the flag:

1) delete the test where the flag is off

2) delete all the code setting the flag to on

3) replace anything getting the value of the flag with true

4) resolve all "true" expressions, cleaning up ifs and now-constant parameters

5) prep a pull request and send it for review
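
To make steps 3 and 4 concrete, here is a hypothetical before/after in Python. The post doesn't say which language or flag library is involved, so the flag name `new_checkout`, the `is_enabled` helper, and both functions are invented purely for illustration.

```python
# Hypothetical flag store; stands in for whatever flag library the codebase uses.
FLAGS = {"new_checkout": True}


def is_enabled(name: str) -> bool:
    """Hypothetical flag read; unknown flags default to off."""
    return FLAGS.get(name, False)


# Before cleanup: the flag is read at the call site and guards two code paths.
def checkout_before(cart: list[float]) -> float:
    if is_enabled("new_checkout"):
        return round(sum(cart) * 0.9, 2)  # new path (flag on)
    else:
        return round(sum(cart), 2)  # old path (flag off)


# After steps 3 and 4: the flag read becomes True, `if True:` collapses to its
# body, and the dead else branch is deleted.
def checkout_after(cart: list[float]) -> float:
    return round(sum(cart) * 0.9, 2)
```

With the flag on, both versions behave the same, which is what makes the cleanup a pure refactor.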

This is all fully supported by the indexing and refactoring tooling in my IDE.

However, when I prompted the LLM with those steps (and examples), it failed. Over and over again. It would delete tests where the value was true, forget to resolve the expressions, and try to run grep/find across a ginormous codebase.

If this was an intern, I would only have to correct them once. I would correct the LLM, and then it would make a different mistake. It wouldn't follow the instructions, and it would use tools I told it to not use.

The LLM took 5-10 minutes to make the change, and then I'd have to spend a couple of minutes fixing things. At that point it wasn't saving me any time.

I've got a TONNE of low-hanging fruit that I can't give to an intern, but could easily sic a tool as capable as an intern on. This was not that.

3 comments

Might make sense getting it to instead create a CST traversal that deletes feature flags by their id. Then you have a re-usable trustworthy tool that you can incrementally improve/verify.
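
A rough sketch of what such a traversal might look like, under some loud assumptions: the code is Python, flags are read through a hypothetical `is_enabled("flag_id")` call, and the standard-library `ast` module stands in for a real CST. `ast` is an abstract syntax tree, so `ast.unparse` (Python 3.9+) throws away comments and formatting; a CST library would be the better fit for real use, but the shape of the transform is the same.

```python
import ast

FLAG_READER = "is_enabled"  # hypothetical name of the flag-read function


class FlagRemover(ast.NodeTransformer):
    """Replace reads of one flag with True, then fold the constant `if`s."""

    def __init__(self, flag_id: str) -> None:
        self.flag_id = flag_id

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)
        is_read = (
            isinstance(node.func, ast.Name)
            and node.func.id == FLAG_READER
            and len(node.args) == 1
            and isinstance(node.args[0], ast.Constant)
            and node.args[0].value == self.flag_id
        )
        # Step 3: anything getting the value of this flag becomes True.
        return ast.Constant(value=True) if is_read else node

    def visit_UnaryOp(self, node: ast.UnaryOp) -> ast.AST:
        self.generic_visit(node)
        operand = node.operand
        if (
            isinstance(node.op, ast.Not)
            and isinstance(operand, ast.Constant)
            and isinstance(operand.value, bool)
        ):
            # Fold `not True` / `not False` into a plain constant.
            return ast.Constant(value=not operand.value)
        return node

    def visit_If(self, node: ast.If):
        self.generic_visit(node)
        test = node.test
        if isinstance(test, ast.Constant) and isinstance(test.value, bool):
            # Step 4: keep only the branch that can still run.
            branch = node.body if test.value else node.orelse
            # An empty branch becomes `pass` so the enclosing block stays valid.
            return branch if branch else ast.Pass()
        return node


def remove_flag(source: str, flag_id: str) -> str:
    """Return `source` with every read of `flag_id` resolved to True."""
    tree = FlagRemover(flag_id).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


before = '''
def checkout(cart):
    if is_enabled("new_checkout"):
        return sum(cart) * 0.9
    else:
        return sum(cart)
'''
after = remove_flag(before, "new_checkout")
print(after)
```

This only covers direct calls, `not`, and `if`/`elif`/`else`. Flags stored in variables, passed as parameters, or combined with `and`/`or` would need more visitors, and a stray `pass` can be left behind. Those are the kind of bugs you'd fix once in the tool rather than re-explaining them to the LLM each run.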
That was the lesson I was learning. I should use the LLM to generate the tools that I use for consistently repeatable tasks.

Then I can rinse and repeat using the tool, fixing the bugs in the tool myself instead of repeating the expensive (in time) cost of using the LLM.

That was my last attempt, but I ran out of time.

Which LLM? How are you prompting it?

I've been using Cursor for the last few months, and I've noticed that for tasks like this it helps to give examples of the code you're looking for, tell it more or less how the feature flags are implemented, and have it spit out a list of files it would modify first.

I gave it explicit ordering, instructions on what tools to _not_ use, and before/after examples from the codebase. A full page of instructions.

After iterating on that for a while, I did a bunch manually (90 of them) and then gave the LLM a list of pull requests as examples, and asked _it_ to write the prompt. It still failed.

Finally, I broke the problem up and started to ask it to generate tools to perform each step. It started to make progress: each execution gave me a new checkpoint, so it wouldn't make new mistakes.

Yep, I think you did everything that's reasonable. I'm surprised myself only because I've been able to have Cursor do similar things for my codebase with no issues. Granted, it's a React codebase following fairly standard practices.
If you have examples, can you provide git commit hashes that it can diff and use as a reference?

For repeating patterns I'll identify 1-3 commit hashes or PRs, reference them in a slash command, and keep the command up to date if/when edge cases occur.
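
One way to turn a few reference commits into text you can paste into a prompt or command file, sketched in Python. This is not tied to any particular editor's slash-command format; `build_reference` and its `show` parameter are names made up for this sketch, and the only real tool it relies on is `git show`.

```python
import subprocess
from typing import Callable, Iterable


def git_show(commit: str) -> str:
    """Return the subject line and patch for one commit via `git show`."""
    result = subprocess.run(
        ["git", "show", "--no-color", "--format=%s", commit],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def build_reference(
    commits: Iterable[str],
    show: Callable[[str], str] = git_show,
) -> str:
    """Concatenate the diffs of a few example commits into one reference block.

    `show` is injectable so the function can be exercised without a git repo.
    """
    sections = [f"### Example commit {c}\n{show(c).strip()}" for c in commits]
    return "\n\n".join(sections)


# Usage inside a repository (the hashes are placeholders, not real commits):
#   print(build_reference(["<hash-1>", "<hash-2>"]))
```

Keeping it to 1-3 commits keeps the reference short enough to sit alongside the instructions, and adding a new hash when an edge case turns up is a one-line change.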