This is particularly impressive for Elixir, which is not a language that is a particular focus of GPT-4. I imagine the accuracy for Python is extremely good. Maybe near perfect for this kind of benchmark if allowed to see error messages.
It's also possible GPT-4 is better at writing Elixir since there are fewer beginners/students writing Elixir code and polluting the training data with bad practice or faulty code.
I don't see any iterations. I can sometimes get it to improve the code just by telling it that there's a problem, or to identify any problems.
> it's not clear if GPT will have sufficient attention to handle the more complex cases.
I think this could be helped with some tooling that gives it smaller pieces of code to digest, each in a new and somewhat smaller prompt. These LLMs can't seem to dive into cracks if the context isn't constrained to that crack.
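A minimal sketch of that kind of tooling, assuming the code under review is Python (chosen only because its standard library ships a parser; the thread itself is about Elixir). The `build_prompt` wording and the `SAMPLE` source are made up for illustration, and no model is actually called:

```python
import ast


def split_into_chunks(source: str) -> list:
    """Return (name, source_text) pairs for each top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the exact text of this definition.
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


def build_prompt(name: str, chunk: str) -> str:
    """Hypothetical prompt template; the wording is an assumption."""
    return f"Review only the definition `{name}` below and list any bugs:\n\n{chunk}"


# Made-up input: two small functions that would each get their own prompt.
SAMPLE = '''
def add(a, b):
    return a + b

def mean(xs):
    return sum(xs) / len(xs)
'''

prompts = [build_prompt(name, chunk) for name, chunk in split_into_chunks(SAMPLE)]
```

Each entry in `prompts` contains exactly one definition, so the model's context is constrained to that one "crack".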
Do you have access to the code interpreter alpha? You can upload larger chunks of code and have it perform different tasks. One of the interesting features is that ChatGPT will sometimes have some issues with its own code (Python) and try to automatically correct itself.
I wish I had access to the plugins, either as a user or as a developer. I hate the feeling of being on the outside of the latest developments; nothing feels open about OpenAI.
Well in a lot of cases the model is writing code that can't even be run (the red dots). Feeding it the compilation error can be done automatically and it will usually be able to at least get it running.
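A rough sketch of what "done automatically" could look like, assuming the generated code is Python so that the built-in `compile()` can stand in for the compiler. `ask_model` is a hypothetical callable (no real API is implied), and `fake_model` is a stand-in that returns broken code first and fixed code once it sees an error:

```python
from typing import Callable, Optional


def check_syntax(code: str) -> Optional[str]:
    """Return None if the code compiles, else a short error message."""
    try:
        compile(code, "<generated>", "exec")
        return None
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"


def repair_loop(ask_model: Callable[[str], str], task: str,
                max_attempts: int = 3) -> Optional[str]:
    """Ask for code, feeding any compile error back, up to max_attempts times."""
    prompt = task
    for _ in range(max_attempts):
        code = ask_model(prompt)
        error = check_syntax(code)
        if error is None:
            return code
        # Same thing one would do by hand: paste the error back into the chat.
        prompt = (f"{task}\n\nYour previous answer:\n{code}\n\n"
                  f"failed with:\n{error}\nPlease fix it.")
    return None  # give up and let a human take over


def fake_model(prompt: str) -> str:
    """Stand-in for a real model (assumption for illustration only)."""
    if "failed with" in prompt:
        return "def double(x):\n    return x * 2\n"
    return "def double(x)\n    return x * 2\n"  # missing colon


result = repair_loop(fake_model, "Write double(x)")
```

This only gets the code to compile; whether it is logically right is a separate question.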
The last time I asked ChatGPT to write a function that I needed, it was syntactically correct right off the bat. I thought that even if it had logical errors, it would be easier to fix them one by one. It seemed like a plausible approach to take when dealing with writer's block.
However, after spending about 20 minutes fixing the details, I realized that the core logic was missing. I had only wasted precious time that I could have spent figuring it out myself.
[This comment brought to you by ChatGPT - I asked it to write a second draft of my original comment]
But that's actually not what I meant. You can often just tell it "there's a problem, please fix it", or "do you see any problems" and it will be able to identify it without additional input. There's no requirement that you've identified a problem, it's more of a "double check that" type of prompt.
I have found somewhat interesting results by translating my prompts into other languages (using DeepL) -- I haven't run the statistics in depth, but German and French results tend to have more comments. Japanese uses variable names i, j, etc. I suspect languages which use e.g. Cyrillic will produce significantly different results - but the tokenizer also "punishes" them in the sense that they're significantly more expensive.
One area of low hanging fruit here is to automatically evaluate the quality/accuracy/correctness/etc of a given generation and select (or merge) between multiple possibilities generated in parallel. Sometimes it will forget to fill in a function def, so use the one from iter #3, etc. You could go so far as to run candidates in a sandbox with an input and evaluate which one gives output -- ideally which output is closest to what is desired, if you can define that.
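Something like the following could do the selection step, as a sketch under assumptions: the candidates are Python programs reading stdin, "closest to what is desired" is approximated with `difflib` string similarity, and a plain subprocess with a timeout stands in for a sandbox (it is not real isolation). The three `CANDIDATES` are made up:

```python
import difflib
import subprocess
import sys
from typing import Optional


def run_candidate(code: str, stdin_text: str, timeout: float = 5.0) -> Optional[str]:
    """Run code in a separate Python process; return stdout, or None on failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout if proc.returncode == 0 else None


def score(output: Optional[str], expected: str) -> float:
    """0.0 for a crash, otherwise similarity of the output to the expected text."""
    if output is None:
        return 0.0
    return difflib.SequenceMatcher(None, output.strip(), expected.strip()).ratio()


def pick_best(candidates, stdin_text, expected):
    """Return (index, score) of the best candidate; ties go to the earlier one."""
    scored = [(score(run_candidate(c, stdin_text), expected), i)
              for i, c in enumerate(candidates)]
    best_score, best_index = max(scored, key=lambda p: (p[0], -p[1]))
    return best_index, best_score


# Made-up generations for "sum the integers on stdin".
CANDIDATES = [
    "import sys\nprint(sum(int(x) for x in sys.stdin.read().split())",   # does not parse
    "import sys\nprint(max(int(x) for x in sys.stdin.read().split()))",  # runs, wrong
    "import sys\nprint(sum(int(x) for x in sys.stdin.read().split()))",  # correct
]

best_index, best_score = pick_best(CANDIDATES, "1 2 3", "6")
```

Merging pieces from different candidates (the "use the one from iter #3" case) would need more than this; the sketch only ranks whole programs.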
Also a sort of "whitelist" for valid functions and routines: sometimes it's close but still wrong, and if you can map the hallucinations and mistakes to what they're supposed to be, that can probably go a long way too.
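One way the whitelist idea might look, as a sketch: fuzzy-match each name the model used against a list of names known to exist. The `ALLOWED` list here is a hypothetical whitelist picked for illustration, and the 0.7 cutoff is an arbitrary assumption:

```python
import difflib
from typing import Optional

# Hypothetical whitelist: names known to exist in the target libraries.
ALLOWED = ["decode", "eitherDecode", "encode", "readFile", "putStrLn"]


def resolve_name(name: str, allowed=ALLOWED, cutoff: float = 0.7) -> Optional[str]:
    """Return the name if allowed, the closest allowed name if near, else None."""
    if name in allowed:
        return name
    matches = difflib.get_close_matches(name, allowed, n=1, cutoff=cutoff)
    return matches[0] if matches else None


fixed = resolve_name("putStrLine")        # near miss, maps to a real name
unknown = resolve_name("launchMissiles")  # nothing close enough, flag it
```

A `None` result is the useful signal here: it marks a name that should be flagged or sent back to the model rather than silently rewritten.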
ChatGPT> Parse the album names and genres out of this json using Haskell's Aeson library and print them out in csv format:
{
  "items": [
    {
      "track": {
        "album": {
          "total_tracks": 9,
          "name": "playlist 1",
          "genres": [
            "Egg punk",
            "Noise rock"
          ]
        }
      }
    },
    {
      "track": {
        "album": {
          "total_tracks": 5,
          "name": "playlist 2",
          "genres": [
            "NotEgg punk",
            "NotNoise rock"
          ]
        }
      }
    }
  ]
}
<shell-maker-end-of-prompt>
Here's the code to parse the JSON and print out the album names and genres in CSV format:
```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Aeson
import Data.List (intercalate)
import GHC.Generics

data Track = Track { album :: Album }
  deriving (Show, Generic)

data Album = Album { total_tracks :: Int, name :: String, genres :: [String] }
  deriving (Show, Generic)

data Response = Response { items :: [Item] }
  deriving (Show, Generic)

data Item = Item { track :: Track }
  deriving (Show, Generic)

instance FromJSON Album
instance FromJSON Track
instance FromJSON Item
instance FromJSON Response

main :: IO ()
main = do
  json <- readFile "example.json"
  let parsed = decode json :: Maybe Response
  case parsed of
    Nothing -> putStrLn "Failed to parse JSON"
    Just response -> mapM_ printAlbum $ items response

printAlbum :: Item -> IO ()
printAlbum item = do
  let albumName = name . album . track $ item
      albumGenres = intercalate ", " $ genres . album . track $ item
  putStrLn $ albumName ++ "," ++ albumGenres
```
Assuming the JSON is stored in the file "example.json", running this program will output:
```
playlist 1,Egg punk, Noise rock
playlist 2,NotEgg punk, NotNoise rock
```
This is the album name and genres from each item in the "items" list of the JSON, separated by a comma.
ChatGPT> gpt1.hs:27:23: error:
    • Couldn't match type ‘[Char]’
                     with ‘Data.ByteString.Lazy.Internal.ByteString’
      Expected: Data.ByteString.Lazy.Internal.ByteString
        Actual: String
    • In the first argument of ‘decode’, namely ‘json’
      In the expression: decode json :: Maybe Response
      In an equation for ‘parsed’: parsed = decode json :: Maybe Response
   |
27 |   let parsed = decode json :: Maybe Response
   |
<shell-maker-end-of-prompt>
The error message is indicating that there's a type mismatch between the `json` string and the expected `ByteString` type needed by the `decode` function.
To fix this, you can use the `Data.ByteString.Lazy.Char8` module to create a `ByteString` from the `json` string:
```haskell
import qualified Data.ByteString.Lazy.Char8 as BS

-- ...

main :: IO ()
main = do
  json <- BS.readFile "example.json" -- use ByteString.readFile instead of readFile
  let parsed = decode json :: Maybe Response
  -- ...
```
With this change, the `json` string is read as a `ByteString` instead of a `String`, and the `decode` function should be able to parse it properly.
And it works fine:
$ nix-shell -p "haskellPackages.ghcWithPackages (ps: with ps; [aeson])" --run "runhaskell gpt1.hs"
playlist 1,Egg punk, Noise rock
playlist 2,NotEgg punk, NotNoise rock
It takes 30 seconds, versus 5 minutes to write it myself?
> So you run it on a GB of similar data (meaning not literally specifically a GB, but well more than you can reasonably verify by hand).
The same way you would test code you wrote yourself.
Firstly, I check over the code. Secondly, I run it on a small data set and make sure things look correct. Thirdly, there's always the option of writing tests into the prompt after or even as the initial prompt.
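As an illustration of the "small data set" step, here is a sketch that checks the album/genre task from earlier in the thread against its two-item sample. It is a Python port written for this example, not the Haskell that ChatGPT produced, and `albums_to_csv` is a name made up here:

```python
import json

# The two-item sample from the prompt earlier in the thread, compacted.
SAMPLE = """
{"items": [
  {"track": {"album": {"total_tracks": 9, "name": "playlist 1",
                       "genres": ["Egg punk", "Noise rock"]}}},
  {"track": {"album": {"total_tracks": 5, "name": "playlist 2",
                       "genres": ["NotEgg punk", "NotNoise rock"]}}}
]}
"""


def albums_to_csv(text: str) -> list:
    """Return one 'name,genre, genre' line per item, as in the thread's output."""
    lines = []
    for item in json.loads(text)["items"]:
        album = item["track"]["album"]
        lines.append(album["name"] + "," + ", ".join(album["genres"]))
    return lines


# The check itself: small enough to verify by eye, cheap enough to rerun.
assert albums_to_csv(SAMPLE) == [
    "playlist 1,Egg punk, Noise rock",
    "playlist 2,NotEgg punk, NotNoise rock",
]
```

The same expected lines could be pasted into the prompt as a test case, which is the third option mentioned above.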
> Then what? How do you know it works fine, or if it obviously crashes, what to do next?
For this specific use in this language I've done this 100s of times. It's just letting me save my energy remembering things and avoiding the stress of forgetting things.
30s vs 5m is a very large difference as well. How many 30s tasks would you be confident in finishing a day vs 5m ones? For me there's a huge difference in energy used.
> How do you know it works fine, or if it obviously crashes, what to do next?
Well, you either:
- give ChatGPT the errors and see if it can fix them
- fix errors yourself and reprompt
- fix errors yourself
Using it as a tool like this is a skill like anything else, and only experience will tell you which direction is most effective. For me though, I typically only give it a few chances to get things right. If it doesn't, I just finish the task myself.
Typically I can start with the scaffolding it gave me though.
There's also usefulness for exploring things you don't know about, but I wouldn't have high confidence or use that for an important work task without adequate testing to build my confidence.
You just couldn't scroll to the bottom of the page, eh?
The conclusion is that neither 3.5 nor 4 is good enough, because for anything nontrivial they generate code that is often subtly wrong. It might still speed up somebody new to the language or project, or someone who is learning. Or, I would say, with additional tooling/plugins/"prompt engineering"/tinkering the author might get useful results.
> So, should you use GPT to generate your OpenAPI validations? Probably not... yet... I'm looking forward to repeating this experiment with GPT-6, and maybe GPT-7 will be able to generate an JSONSchema compiler and replace this library altogether.
Not everyone has the time or expertise to create a pull request. But there are issues allowed on this repo! Create an Issue if you'd like the author to address something, in my opinion.
I’ve seen a good number of things get addressed simply because an Issue crossed a threshold of votes/comments over time. Really can make voices heard on the anon internet. Big respect to anyone working for free who takes them seriously, that takes integrity.