by yegemberdin 3 days ago
How do you guys ensure that the refactoring improves the existing code?
2 comments

The answer to "how do you ensure refactoring improves code?" is embedded in the binary as a system prompt. It's his own blog post about the Embedded Design Principle. The binary contains 9 system prompts, all instruction templates for the LLM. None contain any code for measuring code quality (unfortunately).

The pipeline is three steps:

suggest-data-unifications - prompts the LLM with the blog post. The prompt starts literally with "For each data structure in the specified code, do the following."

suggest-code-unifications - same agent, different prompt. Starts with "Now look at the file and apply the above guidelines."

execute-refactoring - runs the LLM's suggestions through a coding agent.

No verification between steps. No quality gate. No baseline comparison. The refactoring agent's entire context is the blog post, literally. Read it. Find duplication. Merge it.

The closest thing to a "guardrail" is a function which calls eval() on arbitrary user-defined JavaScript. And AutoAcceptDecorator, which intercepts LLM messages matching /proceed|go ahead|make|implement|apply/ and auto-replies "Yes, please proceed with the changes."

So when you ask "how do you ensure it improves code?" the answer is: we ask an LLM to read a blog post about code quality and then we trust it. And we built a regex that auto-accepts its own changes.

The binary also has a separate class for fiber-based refactoring execution, and a full walkthrough generation pipeline that auto-generates code walkthroughs from git diffs.

There's a separate workflow for file organization that reads Jimmy Koppel's rule ("Make the design apparent in the code") and applies section headers to changed files. It's completely independent from the deduplication agent but uses the same pipeline: read prompt, LLM, apply changes.

And the DoItAll workflow chains everything together. DeDuplicate runs in parallel, then embedded-design and organize-file run on every changed file with concurrency:2.

It's a full refactoring pipeline... but every single step is just: read a blog post, LLM, apply. The entire product is two blog posts, a concurrency manager, and a regex.
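
To make the auto-accept behavior concrete, here is a rough sketch of how a matcher built on the quoted pattern would behave. Only the regex and the reply string come from the binary as described above; the function name `autoAccept`, the constant names, and the sample messages are made up for illustration, and the real AutoAcceptDecorator may well apply extra conditions I haven't seen.

```javascript
// The pattern and reply text are the ones quoted above.
// Everything else is a hypothetical reconstruction, not the product's code.
const PROCEED_PATTERN = /proceed|go ahead|make|implement|apply/;
const AUTO_REPLY = "Yes, please proceed with the changes.";

// Returns the canned reply when the message matches, otherwise null.
function autoAccept(message) {
  return PROCEED_PATTERN.test(message) ? AUTO_REPLY : null;
}

// Hypothetical LLM messages, invented for this example.
const samples = [
  "Shall I proceed with merging these two structs?",
  "I could not make sense of this file, so I changed nothing.",
  "This would break the public API; I do not recommend it.",
];

for (const message of samples) {
  console.log(JSON.stringify(message), "->", autoAccept(message));
}
```

The pattern is unanchored, so any message that merely contains "make" or "apply" matches, including the second sample, where the model is reporting that it did nothing. As quoted, it is also case-sensitive.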
Ooh. The answer is probably more interesting and philosophical than you expected.

I can tell you that we do extensive testing. We figured out how to objectively measure code quality on certain benchmark problems, and empirically it's extremely helpful nearly all the time.

But in the general case: it is not actually possible to guarantee this.

That's because whether a change improves the code often depends on information which is literally not present in the codebase.

Some of these are more trite. E.g.: whether a comment is helpful or redundant slop depends on the audience.

Some are deeper. E.g.: whether a piece of duplication is good or bad depends on the intent, and that is often impossible to recover from the source. https://www.pathsensitive.com/2018/01/the-design-of-software...

A simpler example: There's a function that's never called. Should it be deleted?

There are a number of factors outside the codebase that determine the answer, including the obvious one: "Not if your next prompt is going to start using it."

You found a way to objectively measure code quality?? Sell that! Why even sell this course when you have the ability to literally beat every software company?
In honesty, that's not a bad idea, and we hadn't thought of that.

It's pretty expensive to measure even for small programs. It's also more of a relative than an absolute measure, i.e.: it scores two variants of the same codebase, but the raw scores aren't very meaningful on their own. So our goal had been to use this in the benchmark set we're working on when we release a standalone refactoring product.
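
To illustrate what "relative rather than absolute" means here, a minimal sketch: the comparison takes two variants of the same codebase and reports only which one comes out ahead. The scoring function below is a deliberately crude placeholder (a count of non-blank lines) so that the example runs; it is not the measure we use, and the names `placeholderCost` and `compareVariants` are invented for this example.

```javascript
// Hypothetical stand-in metric, NOT the actual quality measure.
// It counts non-blank lines purely so the sketch is runnable.
function placeholderCost(source) {
  return source.split("\n").filter((line) => line.trim() !== "").length;
}

// Compares two variants of the same codebase and returns only the ordering:
// "before", "after", or "tie". The raw costs are never exposed, because
// they are not meaningful on their own.
function compareVariants(before, after, cost = placeholderCost) {
  const delta = cost(after) - cost(before);
  if (delta < 0) return "after";
  if (delta > 0) return "before";
  return "tie";
}

// Two variants of the same tiny program, invented for this example.
const original = "const a = 1;\nconst b = 1;\nconsole.log(a + b);";
const refactored = "const a = 1;\nconsole.log(a + a);";
console.log(compareVariants(original, refactored));
```

Any real metric would be passed in as the `cost` argument; the point is only that the output is an ordering between two variants, not a score for one.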

But the more I think about this suggestion, the more I think: "Hmmm, why not?"