

	by timthelion 686 days ago
	Karpathy writes that there is no cheaply computed objective check for "Or re-writing some Java code to Python?", among other things. But it seems to me that reinforcement learning should be possible for code translation using automated integration testing. Run it, see if it does the same thing!

5 comments

IanCal 686 days ago

Even there how you score it is hard.

"Is it the same for this set of inputs?" may be fine for a subset of things, but then that's a binary thing. If it's slightly wrong, do you score by the number of outputs that match? A purely binary signal gives little useful help for nudging a model in the right direction. And how do you compare two candidates that both work? Which is more "idiomatic"?
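To make the binary-versus-graded distinction concrete, here is a minimal sketch of both scores computed for one candidate. The names `reference_add`, `constant_four`, and `score` are hypothetical, and the input list is made up for illustration; it is not a proposal for how a real training reward should be defined.

```python
from typing import Callable, Iterable, Tuple


def reference_add(a: int, b: int) -> int:
    """Stand-in for the original program's behaviour (assumed for this sketch)."""
    return a + b


def constant_four(a: int, b: int) -> int:
    """A deliberately wrong candidate that ignores its inputs."""
    return 4


def score(candidate: Callable, reference: Callable,
          inputs: Iterable[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (binary_score, fraction_matching) for a candidate."""
    cases = list(inputs)
    matches = 0
    for args in cases:
        try:
            if candidate(*args) == reference(*args):
                matches += 1
        except Exception:
            # A crash counts as a mismatch for that input.
            pass
    fraction = matches / len(cases)
    binary = 1.0 if matches == len(cases) else 0.0
    return binary, fraction


INPUTS = [(5, 3), (-1, 4), (0, 0), (2, 2)]
print(score(constant_four, reference_add, INPUTS))
print(score(reference_add, reference_add, INPUTS))
```

The fractional score separates "wrong everywhere" from "wrong on one input", but it still says nothing about which of two fully passing candidates is more idiomatic.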


theqwxas 686 days ago

I agree that it's a very difficult problem. I'd like to mention AlphaDev [0], an RL algorithm that builds other algorithms. There they combined a measure of correctness and a measure of algorithm speed (latency) to get the reward. But the algorithms they built were super small (e.g., sorting just three numbers), so they could measure correctness using all input combinations. It is still unclear how to scale this to larger problems.

[0] https://deepmind.google/discover/blog/alphadev-discovers-fas...
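A rough sketch of what "correctness plus latency" could look like for a three-element sort. This is not AlphaDev's actual reward: the function names, the exhaustive check over values 0..2, and the `latency_weight` constant are all assumptions chosen for illustration.

```python
import itertools
import time


def sort3(a, b, c):
    """Candidate: a three-element sort using three compare-and-swap steps."""
    if a > b:
        a, b = b, a
    if b > c:
        b, c = c, b
    if a > b:
        a, b = b, a
    return (a, b, c)


def correctness(candidate) -> float:
    """Fraction of all 27 triples over {0, 1, 2} (ties included) sorted correctly."""
    cases = list(itertools.product(range(3), repeat=3))
    passed = sum(1 for case in cases if candidate(*case) == tuple(sorted(case)))
    return passed / len(cases)


def measure_latency(candidate, repeats: int = 1000) -> float:
    """Average wall-clock seconds per call; noisy, so treat it as a rough signal."""
    start = time.perf_counter()
    for _ in range(repeats):
        candidate(3, 1, 2)
    return (time.perf_counter() - start) / repeats


def combined_reward(correct: float, latency_s: float,
                    latency_weight: float = 1000.0) -> float:
    """Assumed combination: reward correctness, subtract a scaled latency penalty."""
    return correct - latency_weight * latency_s


print(combined_reward(correctness(sort3), measure_latency(sort3)))
```

Exhaustive checking works here only because the input space is tiny, which is exactly the scaling problem mentioned above.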


exe34 686 days ago

for "does it run" cases, you can ask the model to try again, give it higher temperature, show it the traceback errors, (and maybe intermediate variables?) or even ask it to break up the problem into smaller pieces and then try to translate that.

for testing, if you use something like quickcheck, you might find bugs that you wouldn't otherwise find.
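As one illustration of the kind of bug random inputs can surface: Java's `int` addition wraps around at 32 bits, while Python integers do not, so a literal translation of `a + b` agrees on small examples and disagrees on large ones. The sketch below hand-rolls the random search with the standard library instead of using a real property-testing library such as QuickCheck or Hypothesis; `java_add`, `naive_python_add`, and `find_counterexample` are hypothetical names.

```python
import random

INT_MIN, INT_MAX = -2**31, 2**31 - 1


def java_add(a: int, b: int) -> int:
    """Model of Java's 32-bit int addition, which wraps on overflow."""
    return (a + b + 2**31) % 2**32 - 2**31


def naive_python_add(a: int, b: int) -> int:
    """Literal translation: correct for small values, never wraps."""
    return a + b


def find_counterexample(candidate, reference, trials: int = 1000, seed: int = 0):
    """Try random 32-bit inputs; return the first pair where the two disagree."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.randint(INT_MIN, INT_MAX)
        b = rng.randint(INT_MIN, INT_MAX)
        if candidate(a, b) != reference(a, b):
            return (a, b)
    return None


# The three hand-picked examples agree...
assert all(naive_python_add(a, b) == java_add(a, b)
           for a, b in [(5, 3), (-1, 4), (0, 0)])
# ...but random inputs over the full int range find a disagreement.
print(find_counterexample(naive_python_add, java_add))
```

Whether wraparound is behaviour the translation should preserve is a separate question; the point is only that a handful of fixed examples would never raise it.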

when it comes to idiomatic, I'm not sure - but if we're at the point that gpt is writing code that works, do we really care? as long as this code is split into many small pieces, we can just replace the piece instead of trying to understand/fix it if we can't read it. in fact, maybe there's a better language that is human readable but better for transformers to write and maintain.


IanCal 686 days ago

For "does it run" I'm not talking about how do we test that it does, but how do we either score or compare two+ options?

> when it comes to idiomatic, I'm not sure - but if we're at the point that gpt is writing code that works, do we really care?

Yes - it's certainly preferable. You may prefer working over neat, but working and neat over working but insane spaghetti code.

Remember this is about training the models, not about using them later. How do we tell, while training, which option was better to push it towards good results?


fulafel 686 days ago

“Programs must be written for people to read, and only incidentally for machines to execute.” — Harold Abelson


exe34 686 days ago

"programs written by LLMs must run correctly and only incidentally be human readable." - Me.


alex_suzuki 686 days ago

“WTF?!” - engineer who has to troubleshoot said programs.


exe34 686 days ago

"given the updated input and output pairs below, generate code that would solve the problem."


jeroenvlek 686 days ago

My takeaway is that it's difficult to make a "generic enough" evaluation that encompasses all things we use an LLM for, e.g. code, summaries, jokes. Something with free lunches.


msoad 686 days ago

A program like

    function add(a,b) {
      return 4
    }

Passes the test


falcor84 686 days ago

I suppose you're alluding to xkcd's joke about this [0], which is indeed a good one, but what test does this actually pass?

The approach I was thinking of is that assuming we start with the Java program:

    public class Addition {
        public static int add(int a, int b) {
            return a + b;
        }
    }

We can semi-automatically generate a basic test runner with something like this, generating some example inputs automatically:

    public class Addition {
        public static int add(int a, int b) {
            return a + b;
        }

        public static class AdditionAssert {
            private int a;
            private int b;

            public AdditionAssert a(int a) {
                this.a = a;
                return this;
            }

            public AdditionAssert b(int b) {
                this.b = b;
                return this;
            }

            public void assertExpected(int expected) {
                int result = add(a, b);
                // Note: Java assertions are disabled by default; run with `java -ea` to enable them.
                assert result == expected : "Expected " + expected + " but got " + result;
                System.out.println("Assertion passed for " + a + " + " + b + " = " + result);
            }
        }

        public static void main(String[] args) {
            new AdditionAssert().a(5).b(3).assertExpected(8);
            new AdditionAssert().a(-1).b(4).assertExpected(3);
            new AdditionAssert().a(0).b(0).assertExpected(0);

            System.out.println("All test cases passed.");
        }
    }

Another bit of automated preparation would then automatically translate the test cases to Python, and then the actual LLM would need to generate a python function until it passes all the translated test cases:

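    # Candidate under test: the constant function from the parent comment.
    # It fails the first assertion below (expected 8 but got 4).
    def add(a, b):
        return 4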

    def addition_assert(a, b, expected):
        result = add(a, b)
        assert result == expected, f"Expected {expected} but got {result}"

    addition_assert(a=5, b=3, expected=8)
    addition_assert(a=-1, b=4, expected=3)
    addition_assert(a=0, b=0, expected=0)

It might not be perfect, but I think it's very feasible and can get us close to there.

[0] https://xkcd.com/221/
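A minimal sketch of the "generate until it passes" check, assuming candidates arrive as Python source strings. The `evaluate` helper and the `TEST_CASES` structure are hypothetical; the cases themselves are the three from the translated tests above. Real use would need sandboxing, since `exec` runs arbitrary code.

```python
import traceback

# The three translated test cases: ((a, b), expected).
TEST_CASES = [((5, 3), 8), ((-1, 4), 3), ((0, 0), 0)]


def evaluate(source: str, func_name: str = "add"):
    """Run candidate source against TEST_CASES; return (passed_count, failures)."""
    namespace = {}
    try:
        exec(source, namespace)  # unsafe outside a sandbox
        func = namespace[func_name]
    except Exception:
        # Syntax errors or a missing function: nothing could be tested.
        return 0, [traceback.format_exc()]

    passed = 0
    failures = []
    for args, expected in TEST_CASES:
        try:
            result = func(*args)
        except Exception:
            failures.append(traceback.format_exc())
            continue
        if result == expected:
            passed += 1
        else:
            failures.append(f"{func_name}{args}: expected {expected} but got {result}")
    return passed, failures


print(evaluate("def add(a, b):\n    return 4"))
print(evaluate("def add(a, b):\n    return a + b"))
```

The failure messages and tracebacks are the sort of feedback that could be shown to the model on a retry, as suggested earlier in the thread.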


WithinReason 686 days ago

Yes, but that's not cheaply computed. You need good test coverage, etc.
