| HN Mirror


by dontupvoteme 1169 days ago

I have found somewhat interesting results by translating my prompts into other languages (using DeepL). I haven't run the statistics in depth, but German and French results tend to have more comments, and Japanese tends to use variable names like i, j, etc. I suspect languages which use e.g. Cyrillic will produce significantly different results, but the tokenizer also "punishes" them in the sense that they're significantly more expensive (more tokens for the same text).

One area of low-hanging fruit here is to automatically evaluate the quality/accuracy/correctness/etc. of a given generation and select (or merge) between multiple possibilities generated in parallel. Sometimes it will forget to fill in a function def, so use the one from iteration #3, etc. You could go so far as to run candidates in a sandbox with an input and check which ones produce output at all, and ideally which output is closest to what is desired, if you can define that.
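A rough sketch of that select-between-candidates idea, in Python. Everything here is an assumption for illustration: the names `run_candidate` and `pick_best` are made up, "closest to what is desired" is approximated with a plain string similarity from `difflib`, and a subprocess with a timeout stands in for a real sandbox (it is NOT real isolation, so don't run untrusted code this way).

```python
import difflib
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple


def run_candidate(source: str, stdin_text: str, timeout: float = 5.0) -> Optional[str]:
    """Run one generated script in a separate interpreter.

    Returns its stdout, or None if it crashed or timed out.
    A subprocess is only a stand-in for a sandbox here.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return None
    return proc.stdout if proc.returncode == 0 else None


def pick_best(
    candidates: List[str], stdin_text: str, desired: str
) -> Optional[Tuple[int, float]]:
    """Return (index, similarity) of the candidate whose output is closest
    to `desired`, or None if no candidate ran successfully."""
    best: Optional[Tuple[int, float]] = None
    for i, source in enumerate(candidates):
        output = run_candidate(source, stdin_text)
        if output is None:
            continue  # crashed, timed out, or failed to parse
        score = difflib.SequenceMatcher(None, output.strip(), desired.strip()).ratio()
        if best is None or score > best[1]:
            best = (i, score)
    return best


if __name__ == "__main__":
    generations = [
        "print(int(input()) * 2)",   # correct
        "def f(:\n    pass",         # syntax error, gets skipped
        "print(int(input()) + 2)",   # runs, but wrong answer
    ]
    print(pick_best(generations, "5\n", "10"))
```

String similarity is a crude scorer; if you can define "desired" more precisely (unit tests, numeric tolerance), swap that in for the `SequenceMatcher` line.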

Also a sort of "whitelist" for valid functions and routines: sometimes it's close but still wrong, and if you can map the hallucinations and mistakes to what they're supposed to be, that can probably go a long way too.
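One way the whitelist idea could look for generated Python, as a sketch only: parse the code with `ast`, collect every called name, and for anything not on the list suggest the nearest allowed name with `difflib.get_close_matches`. The helper names (`dotted_name`, `find_unknown_calls`), the example allowlist, and the 0.6 cutoff are all assumptions, not anything from a specific tool.

```python
import ast
import difflib
from typing import Dict, Optional, Set


def dotted_name(node: ast.AST) -> Optional[str]:
    """Turn a call target like `os.path.join` into the string 'os.path.join'.
    Returns None for targets that aren't plain names/attributes."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_unknown_calls(
    source: str, allowed: Set[str], cutoff: float = 0.6
) -> Dict[str, Optional[str]]:
    """Map each called name that is not in `allowed` to the closest
    allowed name, or to None when nothing is similar enough."""
    suggestions: Dict[str, Optional[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is not None and name not in allowed:
                close = difflib.get_close_matches(
                    name, sorted(allowed), n=1, cutoff=cutoff
                )
                suggestions[name] = close[0] if close else None
    return suggestions


if __name__ == "__main__":
    allowed = {"os.path.join", "json.loads", "print"}  # hypothetical allowlist
    generated = (
        "import os, json\n"
        "print(os.path.joinpath('a', 'b'))\n"
        "data = json.load_string('{}')\n"
        "frobnicate()\n"
    )
    for bad, fix in find_unknown_calls(generated, allowed).items():
        print(bad, "->", fix)
```

With that example input, `os.path.joinpath` maps to `os.path.join`, `json.load_string` maps to `json.loads`, and `frobnicate` maps to nothing. This only catches wrong names, not wrong arguments, and it doesn't resolve import aliases.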