| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by efferifick 1944 days ago

I wonder how likely are code invariants found in the training set preserved in GPT-2/3. In other words, if I train GPT-2 with C source produce by Csmith (a program generator which I believe produces C programs without undefined behaviour) would programs produced by GPT-2/3 also do not exhibit undefined behaviour?

I understand that GPT-2/3 is just a very smart parrot that has no semantic knowledge of what it is outputting. Like I guess let's take a very dumb markov chain that was "trained" on the following input:

a.c ``` int array[6]; array[5] = 1; ```

b.c

``` int array[4]; array[3] = 2; ```

I guess a markov chain could theoretically produce the following code:

out.c

``` int array[4]; array[5] = 1; ```

which is undefined behaviour. But it is produced from two programs where no undefined behaviour is present. A better question would be, how can we guarantee that certain program invariants (like lack of undefined behaviour) can be preserved in the produced code? Or if there are no guarantees, can we calculate a probability? (Sorry, not an expert on machine learning, just excited about a potentially new way to fuzz programs. Technically, one could just instrument C programs with sanitizers and produce backwards static slices from the C sanitizer instrumentation to the beginning of the program and you get a sample of a program without undefined behaviour... so there is already the potential for some training set to be expanded beyond what Csmith can provide.)

EDIT: I don't know how to format here...