by hellodanylo 1043 days ago
This should give a second life to Test-Driven Development.

One of the under-appreciated wisdoms of TDD is that there is a complexity asymmetry in many problems between finding a solution and (fully or partially) verifying it. Examples of asymmetric problems: inverting matrices, sorting an array, computing a function's gradient, compressing a byte stream, etc.

Human writes the easier part -- the test suite, the language model writes the harder part -- the solution. This can be a net gain in productivity.
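To make the asymmetry concrete for the sorting example, here is a minimal sketch of the "easier part": a check that some output is a sorted permutation of the input, regardless of which algorithm produced it. The name `isSortedPermutation` is made up for illustration, and the check assumes plain arrays of numbers.

```javascript
// Hypothetical verifier: is `output` a sorted permutation of `input`?
// Assumes both are arrays of numbers (no NaN).
function isSortedPermutation(input, output) {
  if (input.length !== output.length) return false;

  // 1. The output must be in non-decreasing order (one pass).
  for (let i = 1; i < output.length; i++) {
    if (output[i - 1] > output[i]) return false;
  }

  // 2. The output must contain exactly the same elements as the input.
  const counts = new Map();
  for (const x of input) counts.set(x, (counts.get(x) || 0) + 1);
  for (const x of output) {
    const remaining = counts.get(x);
    if (!remaining) return false; // element missing or used too often
    counts.set(x, remaining - 1);
  }
  return true;
}

const input = [3, 1, 2, 1];
console.log(isSortedPermutation(input, [1, 1, 2, 3])); // true
console.log(isSortedPermutation(input, [1, 2, 3, 3])); // false
```

The checker never needs to know how the sort works, which is the point: a human can write and review it quickly, then let the model propose the sort itself.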

6 comments

The problem is that without a formal definition of the program semantics, you run the risk of overfitting, or of leaving uncovered behaviors that would be implicit for a human developer who understands the intent of the program.

And given how hard formal verification is, I don't know that you'll ever get away with not having to manually check these programs, at which point I question just how much productivity you've gained.

It's kinda like self-driving cars: when they work, they work great. But when they fail, they fail in ways a human never would, and therefore a human struggles to anticipate or trust their behaviour.

That said, I'm waiting to see the rise of programming languages designed with LLMs in mind, where a human could use contract oriented programming or similar (think: Ada) combined with TDD methods to more formally specify the problem that an LLM is being asked to solve.
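As a rough illustration of the contract idea, here is a sketch in plain JavaScript. This is only a runtime approximation, not how Ada or any contract-oriented language actually does it, and `withContract`, `requires`, and `ensures` are hypothetical names invented for the example.

```javascript
// Hypothetical helper: wrap `fn` with a precondition and a postcondition.
function withContract(fn, { requires, ensures }) {
  return function contracted(...args) {
    if (!requires(...args)) {
      throw new Error('Precondition violated');
    }
    const result = fn(...args);
    if (!ensures(result, ...args)) {
      throw new Error('Postcondition violated');
    }
    return result;
  };
}

// The human writes the contract for a square-root function...
const sqrtContract = {
  requires: (x) => typeof x === 'number' && x >= 0,
  ensures: (result, x) =>
    result >= 0 && Math.abs(result * result - x) <= 1e-9 * Math.max(1, x),
};

// ...and this body stands in for whatever the LLM generates.
const safeSqrt = withContract((x) => Math.sqrt(x), sqrtContract);

console.log(safeSqrt(9)); // 3
```

A generated body that violates the contract (say, `(x) => x / 2`) fails on the first call that exposes it, such as an input of 9, instead of slipping through because no example test happened to cover that input.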

> I question just how much productivity you've gained.

Me too. It's an empirical question to be answered by those who will dare to try.

> It's kinda like self-driving cars

Strong disagree. Yes, neural nets are blackboxes, but the generated code can be idiomatic, modular, easy to inspect with a debugger, etc.

> more formally specify the problem that an LLM is being asked to solve.

That would be a great direction to explore.

> Strong disagree. Yes, neural nets are blackboxes, but the generated code can be idiomatic, modular, easy to inspect with a debugger, etc.

I think you missed my point.

If I'm inspecting code from another human, I'm going to make assumptions about the kinds of errors they're gonna make. There's probably obvious dumb stuff I won't look for because a human would never typically make certain classes of mistake. They're the self-driving car equivalent of driving into the back of a stopped semi truck because it was mistaken for a billboard, an error no human of sound mind and body would make.

So if I'm inspecting code written by a computer, I'll either 1) make those same assumptions and run the risk of missing unexpected problems in the code, or 2) be overly cautious (because I don't trust the machine) and examine the code with a fine-tooth comb, which will take a great deal more time.

Based on my experience with Autopilot and Copilot, I think this is way less of a problem in code.

You can put code mistakes on a gradient, from subtle to obvious. Obvious bugs are like when the LLM finds a pattern and repeats it for 100 lines. Subtle mistakes are like misnaming a variable so you use one left over from earlier, not the correct one.
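For a picture of the subtle end of that gradient, here is a made-up example of the "leftover variable" kind of bug. The function names and numbers are invented for illustration.

```javascript
// Subtle bug: `total` is left over from the first loop and reused by mistake.
function averageNetBuggy(orders, refunds) {
  let total = 0;
  for (const amount of orders) total += amount;

  let refunded = 0;
  for (const amount of refunds) refunded += amount;

  const net = total - refunded;
  return total / orders.length; // should be `net`, not `total`
}

// Intended version: uses the net amount.
function averageNet(orders, refunds) {
  let total = 0;
  for (const amount of orders) total += amount;

  let refunded = 0;
  for (const amount of refunds) refunded += amount;

  return (total - refunded) / orders.length;
}

console.log(averageNetBuggy([10, 20], []));   // 15, looks right with no refunds
console.log(averageNetBuggy([10, 20], [10])); // 15, but should be 10
console.log(averageNet([10, 20], [10]));      // 10
```

Both versions read as plausible at a glance, and the bug only shows up with inputs that include refunds.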

Obvious mistakes are easy to catch because they’re obvious. The LLM makes more of those. I think because of the way LLMs work, I have never seen Copilot make a subtle mistake that I wouldn’t expect of a person. People are so good at making surprising bugs, it’s really hard for Copilot to beat them.

Finally something will kill the last somewhat satisfying part of the job, for the greater profit of the corporate overlords!

>Human writes the easier part -- the test suite, the language model writes the harder part -- the solution. This can be a net gain in productivity.

I've been doing just this with Llama 2 local lately, and I'm convinced that anyone who isn't will be a dinosaur in a couple years.

Fact is, these things are damn good (at JS at least) when dealing with pure unit-testable functions. To the point that I can accept that practically any solution it comes up with will be better than mine. Everything is still verified of course, but the days of writing algorithms from scratch are basically over for me.

How do you avoid overfitting and how do you test performance?

It would be quite hard to overfit with property-based tests.
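A hand-rolled sketch of why that is, using the `emailTest` example from elsewhere in this thread (condensed here). A real project would more likely use a property-based testing library; the generator, `overfitEmailTest`, and `findCounterexample` below are all made up for illustration.

```javascript
// Condensed from the thread's generated implementation.
function emailTest(input) {
  const emailRegex = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;
  return emailRegex.test(input) ? input.split('@')[1] : 'Invalid email address';
}

// An overfit "solution": passes the single example test and nothing else.
function overfitEmailTest(input) {
  return 'google.com';
}

// Tiny seeded generator (a linear congruential generator) so runs are reproducible.
function makeRandom(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Random lowercase word of length 1..maxLength.
function randomWord(random, maxLength) {
  const letters = 'abcdefghijklmnopqrstuvwxyz';
  const length = 1 + Math.floor(random() * maxLength);
  let word = '';
  for (let i = 0; i < length; i++) {
    word += letters[Math.floor(random() * letters.length)];
  }
  return word;
}

// Property: for any well-formed address, the result is the part after '@'.
// Returns the first counterexample found, or null if every run passes.
function findCounterexample(implementation, runs = 200, seed = 1) {
  const random = makeRandom(seed);
  const tlds = ['com', 'net', 'org'];
  for (let i = 0; i < runs; i++) {
    const host = `${randomWord(random, 5)}.${tlds[Math.floor(random() * tlds.length)]}`;
    const address = `${randomWord(random, 8)}@${host}`;
    if (implementation(address) !== host) return address;
  }
  return null;
}

console.log(overfitEmailTest('test@google.com'));           // google.com (example test passes)
console.log(findCounterexample(emailTest));                 // null
console.log(findCounterexample(overfitEmailTest) !== null); // true
```

The single example test cannot tell the two implementations apart, while the property rejects the overfit one on its first generated address.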
What plugins or other setup is needed to get Llama 2 integrated with your IDE?

There's a good one called "Continue" for VS Code. There are also a million Copilot chat clones if you search VS Code extensions. Probably some of those do it. The VS Code extension "Cody" uses Anthropic.

Any tips on how to use it locally this way?

Yeah I'd be interested in watching a demo video, or something like that.

How readable is the code? Does it follow SOLID design and the likes?

>How readable is the code? Does it follow SOLID design and the likes?

More readable and better commented than my own. The thing is, this isn't generating any kind of "compiler output" type code that you're used to. It writes exactly as a person would. Here's an example I just ran on llama2-7b-chat on my MacBook:

##PROMPT

Llama: My name is Llama. I will write a Javascript function that implements the "emailTest" function, given the following test case that it is expected to pass. I am sure to always fully explain all of my code, and include comments to provide context.

  describe('it takes a string as input, validates that it is an email address, and returns the host name of the email address.', () => {
    const testVal = emailTest('test@google.com');
    expect(testVal).toBe('google.com')
  });
##END PROMPT

##RESPONSE

Llama: Here are the test cases:

  test('it takes a string as input, validates that it is an email address, and returns the host name of the email address.', () => {
      const testVal = emailTest('test@google.com');
      expect(testVal).toBe('google.com')
  })

  test('it should validate email addresses with different domains', () => {
      const testVal1 = emailTest('hello@example.com');
      const testVal2 = emailTest('hello@example.net');
      expect(testVal1).toBe('example.com')
      expect(testVal2).toBe('example.net')
  })

  test('it should validate email addresses with special characters', () => {
      const testVal3 = emailTest('hello@$%^&*()_!');
      expect(testVal3).toBe('example.com')
  })

  test('it should validate email addresses with typos', () => {
      const testVal4 = emailTest('hellow@example.com');
      expect(testVal4).toBe('example.com')
  });
Here is the function that I have written to implement the emailTest function:

  function emailTest(input) {
    const emailRegex = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;
  
    if (emailRegex.test(input)) {
      // If the input is a valid email address, return the host name of the email address
      const hostName = input.split('@')[1];
      return hostName;
    } else {
      // Otherwise, return an error message indicating that the input is not a valid email address
      return 'Invalid email address';
    }
  }
Can you please provide feedback on my implementation? Specifically, I would like to know if there are any improvements I could make to the function's performance or accuracy.

##END RESPONSE

And from there you can have a conversation with it, and ask it to fix anything that's mistaken. You can also have it extend your test suite with fuzzing and negative cases. This is obviously a trivial example, but you get the idea. I haven't been able to stump it yet.
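For a sense of what the negative cases might look like, here is a minimal table-driven sketch run against the `emailTest` from the response above. The list of inputs is hypothetical, just a few obviously malformed strings, and plain checks are used instead of a test framework so it runs on its own.

```javascript
// emailTest as generated in the response above.
function emailTest(input) {
  const emailRegex = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;

  if (emailRegex.test(input)) {
    return input.split('@')[1];
  } else {
    return 'Invalid email address';
  }
}

// Hypothetical negative cases: every one of these should be rejected.
const invalidInputs = [
  '',                 // empty string
  'no-at-sign',       // missing '@'
  'user@localhost',   // no dot in the host
  'user@example.c',   // top-level domain too short
  'a@@example.com',   // two '@' signs
  'hello@$%^&*()_!',  // special characters in the host
];

const wronglyAccepted = invalidInputs.filter(
  (input) => emailTest(input) !== 'Invalid email address'
);
console.log(wronglyAccepted); // []

// A case worth raising with the model: the regex accepts consecutive dots in the host.
console.log(emailTest('user@..com')); // ..com
```

Cases like the last one are where the conversation with the model becomes useful, since whether it counts as a bug depends on how strict the validation is meant to be.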

    const testVal3 = emailTest('hello@$%^&*()_!');
    expect(testVal3).toBe('example.com')
At the risk of terribly embarrassing myself, shouldn't the expected result be "invalid email address" in this case?
> At the risk of terribly embarrassing myself, shouldn't the expected result be "invalid email address" in this case?

Indeed. This particular test case got jumbled up, but you should always be verifying the output anyways. You can ask it to correct itself when this happens, i.e.:

  User: That's wrong. The test would return 'Invalid email address' here, not 'example.com'

  Llama: Ah, I see! Thank you for correcting me! In that case, here's the modified test case:

  test('it should validate email addresses with special characters', () => {
    const testVal3 = emailTest('hello@$%^&*()_!');
    expect(testVal3).toBe('Invalid email address');
  })

  Llama: Now, when the `emailTest` function is called with the input 'hello@$%^&*()_!', it will return 'Invalid email address' instead of 'example.com', because the input does not match the regular expression pattern defined in the function.
There are other problems in there in addition to that one, but it isn’t clear to me that the GP is intended to be an example of completed good output.
Is writing System prompts in first person better than writing them in third person?
> Is writing System prompts in first person better than writing them in third person?

That's what I've found. Llama doesn't really seem to have a concept of "you" or "they". But if you make it assume that it is something specific, and that it is doing something, it will continue doing that thing.
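If you wanted to script that first-person framing, a sketch might look like the following. `buildFirstPersonPrompt` is a hypothetical helper; the wording is copied from the prompt earlier in the thread, and how the string gets sent to the model is left out because it depends on your setup.

```javascript
// Hypothetical helper: wrap a test case in the first-person framing used above.
function buildFirstPersonPrompt(functionName, testSource) {
  return [
    `Llama: My name is Llama. I will write a Javascript function that implements the "${functionName}" function, given the following test case that it is expected to pass. I am sure to always fully explain all of my code, and include comments to provide context.`,
    '',
    testSource,
  ].join('\n');
}

const testSource = [
  "describe('it takes a string as input, validates that it is an email address, and returns the host name of the email address.', () => {",
  "  const testVal = emailTest('test@google.com');",
  "  expect(testVal).toBe('google.com')",
  '});',
].join('\n');

console.log(buildFirstPersonPrompt('emailTest', testSource));
```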

What is the most difficult thing you’ve gotten it to do?
>What is the most difficult thing you’ve gotten it to do?

It's particularly good at recursive tasks I don't want to expend the mental effort on. E.g.:

  Llama: I am Llama, a programming assistant. Please provide me with a task.

  User: Thanks. Here is a data structure. Write a Javascript algorithm which will recursively parse the structure to collect all 'id' properties into a single array.
  Be sure to check the input and not cause any errors which could result in an infinite loop.

  ```
  const data = {
      'id': 'test1',
      'nodes': [{
          'data': {},
          'list': [{
              'id': 'test2',
              'values': [{
                  'id': 'test3',
                  'name': 'val'
              }]
          }]
      }]
  }
  ```

  Llama: Sure thing! Here is the algorithm in Javascript:

  ```
  function parseId(data) {
    const idArray = [];

    // Recursively traverse the data structure
    function traverse(data) {
      Object.keys(data).forEach((key) => {
        if (key === 'id') {
          idArray.push(data[key]);
        } else if (typeof data[key] === 'object') {
          traverse(data[key]);
        }
      });
    }

    // Start the recursion
    traverse(data);

    return idArray;
  }

  parseId(data) // Output: ["test1","test2","test3"]
  ```
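One thing worth verifying in output like this: in JavaScript `typeof null === 'object'`, so `traverse` above would call `Object.keys(null)` and throw if any value were `null`, and a structure that references itself would recurse until the call stack overflows. Here is a minimal sketch of a more defensive variant, under the hypothetical name `collectIds`; it keeps the original's behavior otherwise.

```javascript
// Hypothetical hardened variant: skips null and primitives, and tracks
// visited objects so a cyclic structure cannot recurse forever.
function collectIds(root) {
  const ids = [];
  const seen = new WeakSet();

  function traverse(node) {
    if (node === null || typeof node !== 'object') return; // nothing to walk
    if (seen.has(node)) return;                            // already visited
    seen.add(node);

    for (const [key, value] of Object.entries(node)) {
      if (key === 'id') {
        ids.push(value);
      } else {
        traverse(value);
      }
    }
  }

  traverse(root);
  return ids;
}

const data = {
  id: 'test1',
  nodes: [{ data: {}, list: [{ id: 'test2', values: [{ id: 'test3', name: 'val' }] }] }],
};
data.self = data;    // cycle
data.missing = null; // null value

console.log(collectIds(data)); // [ 'test1', 'test2', 'test3' ]
```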
I think what people aren't getting yet is that this is generalizable beyond software. We can do this with an LLM and anything that can be designed.

https://michaelfeathers.silvrback.com/prompt-hoisting-for-gp...

And with code that already exists, we can have the AI help us write the tests for TDD. Many large companies have huge amounts of code that is essential to their operation, but changing it is difficult, and maintaining it or replacing it with a new system is difficult or even impossible, since it’s very difficult to even know what it does.

Using AI to automate TDD is a great idea, but unfortunately AIs can get stuck, and we are far from solving this problem. There are many pieces of information to manipulate exactly, while LLMs are all soft and fuzzy.

The LLM should write the test suite too, from the description of the app.