by artninja1988 888 days ago
>To train AlphaGeometry's language model, the researchers had to create their own training data to compensate for the scarcity of existing geometric data. They generated nearly half a billion random geometric diagrams and fed them to the symbolic engine. This engine analyzed each diagram and produced statements about their properties. These statements were organized into 100 million synthetic proofs to train the language model.

With all the bickering about copyright, could something similar be used for coding LLMs? It would kill the IP issues, at least for coding.

3 comments

It's the topic of this fascinating paper:

https://arxiv.org/abs/2207.14502

Hey, thanks! I'll add it to my reading list
For language, the symbolic engine itself would likely be trained on copyrighted input, unlike the geometry engine, since math & math facts are not covered by copyright.

If you couple random text generation with such an engine for language, you'd be laundering your training data through the extra step (and the quality would likely be worse, since the errors of each stage multiply).

What statements about properties of randomly generated code snippets would be useful for coding LLMs? You would need to generate text explaining what each snippet does, but that would require an existing coding LLM, so any IP concerns would persist.
Yeah, but couldn't some system be built that understands what the different code snippets do by compiling them or whatever?
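
To make the "compile them or whatever" idea concrete, here is a rough sketch of what that could look like. It assumes a toy setup: snippets are random arithmetic functions, and the Python interpreter plays the role of the symbolic engine. All names (`random_expr`, `make_snippet`, `verified_facts`, `build_dataset`) are made up for illustration, and this is not how AlphaGeometry or the linked paper does it.

```python
import random


def random_expr(rng, depth=2):
    """Build a random arithmetic expression over the variable x."""
    if depth == 0 or rng.random() < 0.3:
        # Leaf: either the variable or a small integer constant.
        return rng.choice(["x", str(rng.randint(1, 9))])
    op = rng.choice(["+", "-", "*"])
    left = random_expr(rng, depth - 1)
    right = random_expr(rng, depth - 1)
    return f"({left} {op} {right})"


def make_snippet(rng):
    """Wrap a random expression in a function definition (hypothetical format)."""
    return f"def f(x):\n    return {random_expr(rng)}\n"


def verified_facts(source, inputs):
    """Run the snippet and record input/output facts checked by execution.

    No language model is involved: every statement is true because the
    interpreter produced it.
    """
    namespace = {}
    exec(compile(source, "<snippet>", "exec"), namespace)
    f = namespace["f"]
    return [f"f({x}) == {f(x)}" for x in inputs]


def build_dataset(n, seed=0):
    """Generate n (code, facts) pairs from a seeded random generator."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        source = make_snippet(rng)
        dataset.append({"code": source, "facts": verified_facts(source, [0, 1, 2])})
    return dataset


if __name__ == "__main__":
    for item in build_dataset(3):
        print(item["code"])
        print(item["facts"])
```

This only yields input/output facts, not natural-language explanations, so it doesn't answer the objection above about needing text that describes what each snippet does. It just shows that some statements about random code can be produced and checked without any copyrighted input.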