Benchmarking the accuracy of GPT3.5's and GPT-4's code generation abilities (github.com)
74 points by seanmor5 1168 days ago
7 comments

The conclusions start from https://github.com/E-xyza/Exonerate/blob/master/bench/report...

This is particularly impressive for Elixir, which is not a language that is a particular focus of GPT-4. I imagine the accuracy for Python is extremely good. Maybe near perfect for this kind of benchmark if allowed to see error messages.

It's also possible GPT-4 is better at writing Elixir since there are fewer beginners/students writing Elixir code and polluting the training data with bad practices or faulty code.
This might not be an issue for the same (somewhat inscrutable) reason that GPT-4 has quasi-perfect grammar.
I don't see any iterations. I can sometimes get it to improve the code just by telling it that there's a problem, or to identify any problems.

> it's not clear if GPT will have sufficient attention to handle the more complex cases.

I think this could be helped with some tooling, by giving it smaller pieces of code to digest, each in a new and somewhat smaller prompt. These LLMs can't seem to dive into cracks if the context isn't constrained to that crack.
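As a rough illustration of the "smaller pieces" idea, here is a minimal sketch that splits a source file into one chunk per top-level definition, so each chunk can go into its own prompt. It assumes the code under review is Python and uses the standard `ast` module; the function name `top_level_chunks` is made up for this example, and a real tool for Elixir would need that language's own parser.

```python
import ast
from typing import List


def top_level_chunks(source: str) -> List[str]:
    """Split Python source into one chunk per top-level function or class.

    Imports and other top-level statements are skipped, and decorators are
    not included in the returned text.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks
```

Each returned chunk could then be sent with a narrow question ("do you see any problems in this function?") instead of pasting the whole file.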

Do you have access to the code interpreter alpha? You can upload larger chunks of code and have it perform different tasks. One of the interesting features is that ChatGPT will sometimes have issues with its own code (Python) and try to automatically correct itself.
I wish I had access to the plugins, either as a user or as a developer. I hate the feeling of being on the outside of the latest developments; nothing feels open about OpenAI.
Aside from the fact you can access it at all you mean?
> just by telling it that there's a problem

how will you identify that there's a problem?

Well in a lot of cases the model is writing code that can't even be run (the red dots). Feeding it the compilation error can be done automatically and it will usually be able to at least get it running.
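The automatic feedback loop described above might look roughly like the sketch below. This is an illustration only, not the benchmark's tooling: `generate` is a hypothetical callable standing in for whatever model API you use, and Python's built-in `compile` stands in for the real compiler (the benchmark itself targets Elixir).

```python
from typing import Callable, Optional


def syntax_error(source: str) -> Optional[str]:
    """Return a compiler-style error message, or None if the source compiles."""
    try:
        compile(source, "<generated>", "exec")
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def repair_loop(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical: prompt in, source code out
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for code, feeding any compile error back until it compiles."""
    current_prompt = prompt
    for _ in range(max_attempts):
        source = generate(current_prompt)
        error = syntax_error(source)
        if error is None:
            return source  # it compiles; the logic is still unverified
        # Re-prompt with the failed attempt and the error message attached.
        current_prompt = (
            f"{prompt}\n\nYour previous attempt:\n{source}\n\n"
            f"failed to compile with:\n{error}\nPlease fix it."
        )
    return None  # give up after max_attempts
```

Note that this only gets the code to compile. Whether the result is actually correct still has to be checked separately.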
The last time I asked ChatGPT to write a function that I needed, it was syntactically correct right off the bat. I thought that even if it had logical errors, it would be easier to fix them one by one. It seemed like a plausible approach to take when dealing with writer's block.

However, after spending about 20 minutes fixing the details, I realized that the core logic was missing. I had only wasted precious time that I could have spent figuring it out myself.

[This comment brought to you by ChatGPT - I asked it to write a second draft of my original comment]

To answer your question from a practical perspective, you can try to run it, and feed back errors. See: https://news.ycombinator.com/item?id=35446171

But that's actually not what I meant. You can often just tell it "there's a problem, please fix it", or "do you see any problems" and it will be able to identify it without additional input. There's no requirement that you've identified a problem; it's more of a "double check that" type of prompt.

especially as a new developer that now thinks they will never have to write any code
I have found somewhat interesting results by translating my prompts into other languages (using DeepL) -- I haven't run the statistics in depth but German and French results tend to have more comments. Japanese uses variable names i, j, etc. I suspect languages which use e.g. Cyrillic will produce significantly different results - but the tokenizer also "punishes" them in the sense that they're significantly more expensive.
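One crude way to see why non-Latin scripts can cost more: tokenizers that work on UTF-8 bytes start from more bytes per character for Cyrillic or Japanese text than for ASCII. The sketch below only measures UTF-8 bytes per character, which is a rough proxy and an assumption on my part; an actual token count requires the model's own tokenizer. The helper name is made up for this example.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character (a rough cost proxy only)."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# ASCII letters take 1 byte each, Cyrillic letters 2, hiragana 3.
for sample in ["hello", "привет", "こんにちは"]:
    print(sample, utf8_bytes_per_char(sample))
```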

One area of low hanging fruit here is to automatically evaluate the quality/accuracy/correctness/etc of a given generation and select (or merge) between multiple possibilities generated in parallel. Sometimes it will forget to fill in a function def, so use the one from iter #3, etc. You could go so far as to run candidates in a sandbox with an input and evaluate which one gives output -- ideally which output is closest to what is desired, if you can define that.
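A minimal sketch of the "generate several, keep the best" idea, under some loud assumptions: candidates are represented here as plain Python callables run in-process, which is not a sandbox (a real setup would run each one in an isolated subprocess or container), and "closest to what is desired" is approximated with `difflib.SequenceMatcher` on the text output. The names `score` and `pick_best` are made up for this example.

```python
import difflib
from typing import Callable, Optional, Sequence


def score(candidate: Callable[[str], str], sample_input: str, desired: str) -> float:
    """Similarity in [0, 1] between the candidate's output and the desired output."""
    try:
        output = str(candidate(sample_input))
    except Exception:
        return 0.0  # a crashing candidate scores lowest
    return difflib.SequenceMatcher(None, output, desired).ratio()


def pick_best(
    candidates: Sequence[Callable[[str], str]], sample_input: str, desired: str
) -> Optional[Callable[[str], str]]:
    """Return the candidate whose output is closest to the desired output."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: score(c, sample_input, desired))
```

Text similarity is a blunt metric. If you can define a domain-specific notion of "close" (e.g. number of passing test cases), that would be a better scoring function.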

Also a sort of "whitelist" for valid functions and routines - sometimes it's close but still wrong; if you can map the hallucinations and mistakes to what it's supposed to be, that can probably go a long way too.
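The whitelist idea could be sketched like this: given a list of function names known to exist, map a near-miss from the model onto the closest real name, or reject it. This uses `difflib.get_close_matches` from the Python standard library; the function name `resolve_name`, the cutoff of 0.8, and the example whitelist are assumptions made for illustration.

```python
import difflib
from typing import Iterable, Optional


def resolve_name(name: str, allowed: Iterable[str], cutoff: float = 0.8) -> Optional[str]:
    """Map a possibly hallucinated function name onto an allowed one.

    Returns the name itself if it is allowed, the closest allowed name if one
    is similar enough, and None otherwise.
    """
    allowed_list = list(allowed)
    if name in allowed_list:
        return name
    matches = difflib.get_close_matches(name, allowed_list, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical whitelist for illustration.
ALLOWED = ["String.length", "String.upcase", "Enum.map"]
print(resolve_name("String.len", ALLOWED))
print(resolve_name("Foo.bar", ALLOWED))
```

String similarity will not catch a hallucinated function whose name is plausible but whose semantics differ, so this only helps with the "close but misspelled" class of mistakes.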

Not to split hairs, but John Henry competed against a steam drill. I'm not sure what an iron track layer is...
GPT-4 is pretty good at generating working Haskell and fixing type errors.
If I say to ChatGPT, "Write some working Haskell." it replies

"Certainly! Here's a simple example of a Haskell program that calculates the factorial of a given number:"

And, if I try the code (not knowing Haskell), it appears to work. So I guess it's "good at generating working Haskell".

But that's not what working programmers do.

Here's a realistic example:

    ChatGPT> Parse the album names and genres out of this json using Haskell's Aeson library and print them out in csv format:

    {
      "items": [
        {
          "track": {
            "album": {
              "total_tracks": 9,
              "name": "playlist 1",
              "genres": [
                "Egg punk",
                "Noise rock"
              ]
            }
          }
        },
        {
          "track": {
            "album": {
              "total_tracks": 5,
              "name": "playlist 2",
              "genres": [
                "NotEgg punk",
                "NotNoise rock"
              ]
            }
          }
        }
      ]
    }
    <shell-maker-end-of-prompt>
    Here's the code to parse the JSON and print out the album names and genres in CSV format:

    ```haskell
    {-# LANGUAGE DeriveGeneric #-}

    import Data.Aeson
    import Data.List (intercalate)
    import GHC.Generics

    data Track = Track { album :: Album }
      deriving (Show, Generic)

    data Album = Album { total_tracks :: Int, name :: String, genres :: [String] }
      deriving (Show, Generic)

    data Response = Response { items :: [Item] }
      deriving (Show, Generic)

    data Item = Item { track :: Track }
      deriving (Show, Generic)

    instance FromJSON Album
    instance FromJSON Track
    instance FromJSON Item
    instance FromJSON Response

    main :: IO ()
    main = do
      json <- readFile "example.json"
      let parsed = decode json :: Maybe Response
      case parsed of
        Nothing -> putStrLn "Failed to parse JSON"
        Just response -> mapM_ printAlbum $ items response

    printAlbum :: Item -> IO ()
    printAlbum item = do
      let albumName = name . album . track $ item
          albumGenres = intercalate ", " $ genres . album . track $ item
      putStrLn $ albumName ++ "," ++ albumGenres
    ```

    Assuming the JSON is stored in the file "example.json", running this program will output:

    ```
    playlist 1,Egg punk, Noise rock
    playlist 2,NotEgg punk, NotNoise rock
    ```

    This is the album name and genres from each item in the "items" list of the JSON, separated by a comma.

    ChatGPT> gpt1.hs:27:23: error:
        • Couldn't match type ‘[Char]’
                         with ‘Data.ByteString.Lazy.Internal.ByteString’
          Expected: Data.ByteString.Lazy.Internal.ByteString
            Actual: String
        • In the first argument of ‘decode’, namely ‘json’
          In the expression: decode json :: Maybe Response
          In an equation for ‘parsed’: parsed = decode json :: Maybe Response
       |
    27 |   let parsed = decode json :: Maybe Response
       |
    <shell-maker-end-of-prompt>
    The error message is indicating that there's a type mismatch between the `json` string and the expected `ByteString` type needed by the `decode` function.

    To fix this, you can use the `Data.ByteString.Lazy.Char8` module to create a `ByteString` from the `json` string:

    ```haskell
    import qualified Data.ByteString.Lazy.Char8 as BS

    -- ...
    main :: IO ()
    main = do
      json <- BS.readFile "example.json"  -- use ByteString.readFile instead of readFile
      let parsed = decode json :: Maybe Response
      -- ...
    ```

    With this change, the `json` string is read as a `ByteString` instead of a `String`, and the `decode` function should be able to parse it properly.
And it works fine:

    $ nix-shell -p "haskellPackages.ghcWithPackages (ps: with ps; [aeson])" --run "runhaskell gpt1.hs"
    playlist 1,Egg punk, Noise rock
    playlist 2,NotEgg punk, NotNoise rock
I don't understand what use this is.

I mean, let's say you really do have to parse this format for something at work.

So you run it on a GB of similar data (meaning not literally specifically a GB, but well more than you can reasonably verify by hand).

Then what? How do you know it works fine, or if it obviously crashes, what to do next?

> I don't understand what use this is.

It takes 30 seconds versus 5 minutes to write myself?

> So you run it on a GB of similar data (meaning not literally specifically a GB, but well more than you can reasonably verify by hand).

The same way you would test code you wrote yourself.

Firstly, I check over the code. Secondly, I run it on a small data set and make sure things look correct. Thirdly, there's always the option of writing tests into the prompt after or even as the initial prompt.
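One way to do the "small data set" check is to write a tiny independent implementation and compare its output with what the generated program prints. The sketch below does that in Python for the JSON sample and expected output quoted earlier in the thread. The function name `album_csv_lines` is made up for this example, and it mirrors the output shape of the Haskell program (name, then genres joined with ", "), not proper CSV quoting.

```python
import json
from typing import List


def album_csv_lines(document: str) -> List[str]:
    """Produce 'name,genre, genre' lines in the same shape as the Haskell program."""
    data = json.loads(document)
    lines = []
    for item in data["items"]:
        album = item["track"]["album"]
        lines.append(album["name"] + "," + ", ".join(album["genres"]))
    return lines


SAMPLE = """
{"items": [
  {"track": {"album": {"total_tracks": 9, "name": "playlist 1",
                       "genres": ["Egg punk", "Noise rock"]}}},
  {"track": {"album": {"total_tracks": 5, "name": "playlist 2",
                       "genres": ["NotEgg punk", "NotNoise rock"]}}}
]}
"""

# The two lines the Haskell program printed in the thread.
EXPECTED = [
    "playlist 1,Egg punk, Noise rock",
    "playlist 2,NotEgg punk, NotNoise rock",
]

assert album_csv_lines(SAMPLE) == EXPECTED
```

On a larger input you would diff the two programs' outputs line by line. Any disagreement points at a record worth inspecting by hand.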

> Then what? How do you know it works fine, or if it obviously crashes, what to do next?

For this specific use in this language I've done this 100s of times. It's just letting me save my energy remembering things and avoiding the stress of forgetting things.

30s vs 5m is a very large difference as well. How many 30s tasks would you be confident in finishing a day vs 5m ones? For me there's a huge difference in energy used.

> How do you know it works fine, or if it obviously crashes, what to do next?

Well, you either:

- give chatgpt the errors and see if it can fix them
- fix errors yourself and reprompt
- fix errors yourself

Using it as a tool like this is a skill like anything else, and only experience will tell you which direction is most effective. For me though, I typically only give it a few chances to get things right. If it doesn't, I just finish the task myself.

Typically I can start with the scaffolding it gave me though.

There's also usefulness for exploring things you don't know about, but I wouldn't have high confidence or use that for an important work task without adequate testing to build my confidence.

>For this specific use in this language I've done this 100s of times.

But that's not what programmers do. A programmer that does the same thing hundreds of times is not programming.

> I don't understand what use this is.

Also consider how useful this would be to someone who knows how to parse that json using Python that is learning Haskell.

Real world examples are invaluable for mapping over knowledge like that.

today for me it coded in scss and yaml to write bitwise ops lol
TL;DR?

(Yes, GPT-4 is prob better... but by how much and on what?) A table would've been easier

You just couldn't scroll to the bottom of the page, eh?

The conclusion is that neither 3.5 nor 4 is good enough, because for anything non-trivial they generate code that is often subtly wrong. It might still speed up somebody new to the language/project/learning, or I would say: with additional tooling/plugins/"prompt engineering"/tinkering the author might get useful results.

> So, should you use GPT to generate your OpenAPI validations? Probably not... yet... I'm looking forward to repeating this experiment with GPT-6, and maybe GPT-7 will be able to generate an JSONSchema compiler and replace this library altogether.

from https://github.com/E-xyza/Exonerate/blob/master/bench/report...

(I believe the author is significantly underestimating the pace of progress)

Specific numbers are at https://github.com/E-xyza/Exonerate/blob/master/bench/report.... GPT-4 does significantly better.

Not the creator, but as it's GitHub, feel free to repeat the experiment and submit a pull request with a better method; peer review.
Not everyone has the time or expertise to create a pull request. But there are issues allowed on this repo! Create an Issue if you'd like the author to address something, in my opinion.
I’ve seen a good number of things get addressed simply because an Issue crossed a threshold of votes/comments over time. Really can make voices heard on the anon internet. Big respect to anyone working for free who takes them seriously, that takes integrity.