Hacker News new | ask | show | jobs
by aconz2 888 days ago
summary: generate 500M synthetic examples where you start with random constructions, deduce (not sure until saturation or fixed depth) and use each conclusion along with its dependency graph and original premises as a training example. The 500M reduces to 100M after dedupe and they used 7.5M CPU hours. 9M/100M used auxiliary/exogenous constructions (which I think of as like if you're doing this on program search and the program requires some constant pi that isn't derivable from the inputs). They encode the examples for a transformer and finetune on the auxiliary construction subset to eek out 2 more problems. Then beam search using the model to generate next term and logic engine to expand the closure until the conclusion is in the set.

They make a big deal about these auxiliary constructions but I'm not sure the results back that up. I also don't understand what the "without pretraining" means; they say without pretraining it gets 21 problems, without fine-tuning 23 problems, and altogether 25. So are they getting 21/25 without their synthetic data at all??

I want to look at how the dependency graph is encoded in the training examples and whether you might expect different results from different encodings. Ideally I'd think you'd want that part of the model/embedding to be invariant across all possible (big number!) of encodings.

The human comparisons are really weak in Fig 2 and unfortunately is probably what the media mainly focuses on. I do find the results interesting and wonder how well this translates into program synthesis for example. The paper mentioned in these comments which I have yet to read more "Language Models Can Teach Themselves to Program Better" sounds related. I'd be very interested to see more methods that are very different from LLM heuristic beam search, even though it does seem compelling and they get to publish lots of papers with it.

1 comments

>> I do find the results interesting and wonder how well this translates into program synthesis for example.

In broad terms it's the same generate-and-test approach used for their AlphaCode system but with a theorem prover as a test step. In AlphaCode they use clustering and various heuristics to select programs, here they use the prover.

Eventually, I expect, they'll figure out you can use an automated theorem prover to check the correctness of (logic) programs.