Hacker News
by cheald 588 days ago
The niche I've found for LLMs is for implementing individual functions and unit tests. I'll define an interface and a return (or a test name and expectation) and say "this is what I want this to do", and let the LLM take the first crack at it. Limiting the bounds of the problem to be solved does a pretty good job of at least scaffolding something out that I can then take to completion. I almost never end up taking the LLM's autocompletion at face value, but having it written out to review and tweak does save substantial amounts of time.

The other use case is targeted code review/improvement. "Suggest how I could improve this" fills a niche which is currently filled by linters, but can be more flexible and robust. It has its place.

The fundamental problem with LLMs is that they follow patterns, rather than doing any actual reasoning. This is essentially the observation made by the article; AI coding tools do a great job of following examples, but their usefulness is limited to the degree to which the problem to be solved maps to a followable example.

3 comments

Can't tell you how much I love it for testing, it's basically the only thing I use it for. I now have a test suite that can rebuild my entire app from the ground up locally, and works in the cloud as well. It's a huge motivator actually to write a piece of code with the reward being the ability to send it to the LLM to create some tests and then seeing a nice stream of green checkmarks.
> I now have a test suite that can rebuild my entire app from the ground up

What does this mean?

Sorry, I should have been clearer. Firebase is (or was) a PITA when I started the app I'm working on a few years ago. I have a lot of records in my db that I need to validate after normalizing the data. I used to have an admin page that spit out a bunch of JSON data with some basic filtering and self-rolled testing that I could verify at a glance.

After a few years off from this project, I refactored it all, and part of that refactoring was building a test suite that I can run. When run, it will rebuild, normalize, and verify all the data in my app (scraped data).

When I deploy, it will also run these tests and then email if something breaks, but skip the seeding portion.

I had plans to do this before, but the Firebase emulator still had a lot of issues a few years ago, and refactoring this project gave me the freedom to finally build a proper testing environment and make my entire app make full use of my local Firebase emulator without issue.

I like giving it my test cases in plain English. It still gets them wrong sometimes, but 90% of the time they are good to go.

I struggle to get GitHub Copilot to create any unit tests that provide any value. How do you get it to create really useful tests?
I'd recommend trying out Anthropic's Sonnet 3.5 for this one - it usually generates decent unit tests for reasonably sized functions
I use claude-3-5-sonnet-20241022 with a very explicit .cursorrules file with the cursor editor.
Can you share your .cursorrules? For me cursor is not much better than autocomplete, but I'm writing mostly e2e tests.
You can find a bunch on https://cursor.directory/.
Can you give some examples? What LLM? What code? What tests?

As a test I just asked "ChatGPT 4o with canvas" to "Can you write a set of tests to test glBufferData and all of its edge cases?"

glBufferData is a 32-year-old API, so there are clearly plenty of examples for it to have looked at. There are even multiple public tests for it, including the official tests, which are open source and so easily scannable. It failed.

It wrote 8 tests, and 7 of those tests were wrong in that they did something wrong intentionally and then asserted that they got no error. It wasn't close to comprehensive. It didn't test that the function actually put data in the buffer, for example, nor did it check the set of valid enums to see that they work. Nor did it check that the target parameter actually works and affects the correct buffer bound to that target.

This is my experience with LLMs for code so far. I do get answers quicker from LLMs sometimes for tech questions vs searching via Google and reading Stack Overflow. But that's only sometimes. As a recent example, I was trying to add TypeScript types to some JavaScript and it failed. I went round and round telling it that it had failed, but it got stuck in a loop and just kept saying "Oh, sorry. How about this -- repeat of previous code"

If you asked me to write tests with such a vague definition I’d also have issues writing them though. It’ll work a lot better if you tell it what you want it to validate I think.
Wait, wait. You ought to write tests for javascript react html form validation boilerplate. Not that.

/s aside, it’s what we all experience too. There’s a great divide between programming pre-around-2015 and thereafter. LLMs can only do recent programming, which is a product of tons of money getting loaded into the industry and creating jobs that made no sense ten years ago. Basically, the more repetitive boilerplate patterns configuration options import blocks row-obj-dto-obj conversion typecheck bullshit you write per day, the more LLMs help. I mean, one could abstract all that away using regular programming, but how would they sell their work for $^6 an AI for $^9 then?

Just yesterday, after reading yet another “oh you must try again” comment, I asked 4o how to stop puppeteer from dumping errors into the console and exit gracefully when I close the headful browser (all logs and code provided). Right away it slid into nonsense. I always finish my chats with what I think about it, uncut, just in case someone uses these for further learning.

Yes this is the same for me. I’ve shifted my programming style so now I just write function signatures and let the AI do the rest for me. It has been a dream and works consistently well.

I’ll also often add hints at the top of the file in the form of comments or sample data to help keep it on the right track.

Here's one I wrote the other day which took a long time to get right. I'm curious how well your AI can do, since I can't imagine it does a good job at it.

  # Given a data set of size `size` >= 0, and a `text` string describing
  # the subset size, return a 2-element tuple containing a text string
  # describing the complement size and the actual size as an integer. The
  # text string can be in one of four forms (after stripping leading and
  # trailing whitespace):
  #
  #  1) the empty string, in which case return ("", 0)
  #  2) a stringified integer, like "123", where 0 <= n <= size, in
  #   which case return (str(size-int(n)), size-int(n))
  #  3) a stringified decimal value like "0.25" where 0 <= x <= 1.0, in
  #   which case compute the complement string as str(1 - x) and
  #   the complement size as size - (int(x * size)). Exponential
  #   notation is not supported, only numbers like "3.0", ".4", and "3.14"
  #  4) a stringified fraction value like "1/3", where 0 <= x <= 1,
  #   in which case compute the complement string and value as #3
  #   but using a fraction instead of a decimal. Note that "1/2" of
  #   51 must return ("1/2", 26), not ("1/2", 25).
  #
  # Otherwise, return ("error", -1)

  def get_complement(text: str, size: int) -> tuple[str, int]:
    ...

For example:

  get_complement("1/2", 100) == ("1/2", 50)
  get_complement("0.6", 100) == ("0.4", 40)
  get_complement("100", 100) == ("0", 0)
  get_complement("0/1", 100) == ("1/1", 100)
Some of the harder test cases I came up with were:

get_complement("0.8158557553804697", 448_525_430): this tests the underlying system uses decimal.Decimal rather than a float, because float64 ends up on a 0.5 boundary and applies round-half-even resulting in a different value than the true decimal calculation, which does not end up with a 0.5. (The value is "365932053.4999999857944710")

get_complement("nan", 100): this is a valid decimal.Decimal but not allowed by the spec.

get_complement("1/0", 100): handle division-by-zero in fractions.Fraction

get_complement("0.", 100): this tests that the string complement is "1." or "1.0" and not "1"

get_complement("0.999999999999999", 100): this tests the complement is "0.000000000000001" and not "1E-15".

get_complement("0.5E0", 100): test that decimal parsing isn't simply done by decimal.Decimal(text) wrapped in an exception handler.
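The float-versus-Decimal case above can be checked directly. This is only a small sketch using the standard library; the variable names are mine. The exact decimal product is the value quoted above, and the script prints the float64 product next to it so the two can be compared on whatever platform runs it.

```python
from decimal import Decimal

text, size = "0.8158557553804697", 448_525_430

# Exact decimal arithmetic: the default 28-digit context is enough here.
exact = Decimal(text) * size
# IEEE 754 float64 arithmetic on the same inputs.
approx = float(text) * size

print(exact)          # 365932053.4999999857944710
print(repr(approx))   # compare against the exact value above
# round() uses round-half-even for both types, so a float that lands
# on a .5 boundary can round differently from the exact decimal value.
print(round(exact), round(approx))
```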

Also, this isn't the full spec. The real code reports parse errors (like recognizing the "1/" is an incomplete fraction) and if the value is out of range it uses the range boundary (so "-0.4" for input is treated as "0.0" and the complement is "1.0"), along with an error flag so the GUI can display the error message appropriately.
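For reference, here is one possible sketch of the simplified spec as written above, not the real code. Several choices are assumptions the spec leaves open: the fraction complement keeps the original denominator (so "0/1" gives "1/1"), a decimal complement with no fractional digits is shown with ".0", signs are rejected, and the working precision is widened so long inputs are not rounded.

```python
import re
from decimal import Decimal, localcontext
from fractions import Fraction

ERROR = ("error", -1)

# ASCII digits only; no signs, no exponents.
_INT_RE = re.compile(r"[0-9]+")
_DEC_RE = re.compile(r"[0-9]+\.[0-9]*|\.[0-9]+")
_FRAC_RE = re.compile(r"([0-9]+)/([0-9]+)")


def get_complement(text: str, size: int) -> tuple[str, int]:
    text = text.strip()
    if text == "":
        return ("", 0)

    # Form 2: a plain integer count, 0 <= n <= size.
    if _INT_RE.fullmatch(text):
        n = int(text)
        if n > size:
            return ERROR
        return (str(size - n), size - n)

    # Form 3: a decimal fraction of the data set, 0 <= x <= 1.
    if _DEC_RE.fullmatch(text):
        with localcontext() as ctx:
            # Assumption: enough digits that 1 - x and x * size stay exact.
            ctx.prec = max(28, len(text) + len(str(size)) + 2)
            x = Decimal(text)
            if x > 1:
                return ERROR
            complement = 1 - x
            taken = int(x * size)  # truncates toward zero
        shown = format(complement, "f")  # avoids "1E-15" style output
        if "." not in shown:
            shown += ".0"  # assumption: "0." maps to "1.0"
        return (shown, size - taken)

    # Form 4: a fraction "a/b" with 0 <= a/b <= 1.
    match = _FRAC_RE.fullmatch(text)
    if match:
        num, den = int(match.group(1)), int(match.group(2))
        if den == 0 or num > den:
            return ERROR
        taken = int(Fraction(num, den) * size)  # truncates toward zero
        # Assumption: keep the original denominator rather than reducing.
        return (f"{den - num}/{den}", size - taken)

    return ERROR
```

Anchoring the regular expressions with `fullmatch` is what rejects "nan" and "0.5E0", since neither `Decimal` nor `float` ever sees them.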

I suspect certain domains have higher performance than others. My normal use cases involve API calls, database calls, data transformation and AI fairly consistently does what I want. But in that space there are very repeatable patterns.

Also, with your example above I would probably break the function down into smaller parts, for two reasons: 1) you can more easily unit test the components; 2) generally I find AI performs better with more focused problems.

So I would probably first write a signature like this:

  # input examples = "1/2" "100" "0.6" "0.99999" "0.5E0" "nan"
  def string_ratio_to_decimal(text: str) -> number
Pasting that into Claude, without any other context, produces this result: https://claude.site/artifacts/58f1af0e-fe5b-4e72-89ba-aeebad...
> I probably would break the function down into smaller parts

Sure. Internally I have multiple functions. Though I don't like unit testing below the public API as it inhibits refactoring and gives false coverage feedback, so all my tests go through the main API.

> Pasting that into Claude, without any other context

The context is the important part. Like the context which says "0.5E0" and "nan" are specifically not supported, and how the calculations need to use decimal arithmetic, not IEEE 754 float64.

Also, the hard part is generating the complement with correct formatting, not parsing float-or-fraction, which is a first-year CS assignment.

> # Handle special values

Python and C accept "Infinity" as an alternative to "Inf". The correct way is to defer to the underlying system then check if the returned value is infinite or a NaN. Which is what will happen here because when those string checks fail, and the check for "/" fails, it will correctly process through float().

Yes, this section isn't needed.
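A minimal sketch of the "defer to the underlying system, then check" approach described above, for the case where only finite values are wanted. The helper name `parse_finite` is hypothetical; `float()` and `math.isfinite()` are standard Python.

```python
import math
from typing import Optional


def parse_finite(text: str) -> Optional[float]:
    """Return float(text) if it parses to a finite number, else None."""
    try:
        value = float(text)  # accepts "inf", "Infinity", "nan", "1e3", ...
    except ValueError:
        return None
    # One check covers every spelling of infinity and NaN that float() accepts.
    return value if math.isfinite(value) else None
```

This avoids keeping a hand-written list of special strings in sync with what the parser actually accepts.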

> # Handle empty string

My spec says the empty string is not an error.

> numerator, denominator = text.split("/"); num = float(numerator); den = float(denominator)

This allows "1.2/3.4" and "inf/nan", which were not in the input examples and therefore support for them should be interpreted as accidental scope creep.

They were also not part of the test suite, which means the tests cannot distinguish between these two clearly different implementations:

  num = float(numerator)
  den = float(denominator)
and:

  num = int(numerator)
  den = int(denominator)
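To illustrate the gap, here is a small sketch with the two variants side by side. The function names and the `accepts` helper are made up for this example. Inputs like "1.2/3.4" or "inf/nan" are exactly the cases a test suite would need in order to tell them apart.

```python
def ratio_with_float(text: str) -> float:
    numerator, denominator = text.split("/")
    return float(numerator) / float(denominator)


def ratio_with_int(text: str) -> float:
    numerator, denominator = text.split("/")
    return int(numerator) / int(denominator)


def accepts(parser, text: str) -> bool:
    """True if `parser` handles `text` without raising."""
    try:
        parser(text)
    except (ValueError, ZeroDivisionError):
        return False
    return True


# Both agree on "1/2"; only the float variant accepts the other two.
for sample in ("1/2", "1.2/3.4", "inf/nan"):
    print(sample, accepts(ratio_with_float, sample), accepts(ratio_with_int, sample))
```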
Here's a version which follows the same style as the linked-to code, but is easier to understand:

  def string_ratio_to_decimal(text):
    if not isinstance(text, str):
        return None
    
    # Remove whitespace
    text = text.strip()
    
    # Handle empty string
    if not text:
        return None

    # Handle ratio format (e.g., "1/2")
    if "/" in text:
        try:
            numerator, denominator = text.split("/")
            num = int(numerator)
            den = int(denominator)
            if den == 0:
                return float("inf") if num > 0 else float("-inf") if num < 0 else float("nan")
            return num / den
        except ValueError:
            return None

    # Handle regular numbers (inf, nan, scientific notation, etc.)
    try:
        return float(text)
    except ValueError:
        return None
It still doesn't come anywhere near handling the actual problem spec I gave.