HN Mirror



	by artninja1988 888 days ago
	> To train AlphaGeometry's language model, the researchers had to create their own training data to compensate for the scarcity of existing geometric data. They generated nearly half a billion random geometric diagrams and fed them to the symbolic engine. This engine analyzed each diagram and produced statements about their properties. These statements were organized into 100 million synthetic proofs to train the language model.

	With all the bickering about copyright, could something similar be used for coding LLMs? It would kill the IP issues, at least for coding.
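The pipeline in the quote (random diagrams in, machine-checked statements out) can be illustrated with a toy sketch. This is not AlphaGeometry's symbolic engine: the function names, the integer grid, and the two fact types (`collinear`, `eqlen`) are all assumptions made up for illustration, and real synthetic proofs would need chains of deductions rather than isolated facts.

```python
import itertools
import random


def random_diagram(n_points, grid=6, seed=None):
    """Pick n_points distinct integer grid points and label them A, B, C, ..."""
    rng = random.Random(seed)
    cells = [(x, y) for x in range(grid) for y in range(grid)]
    points = rng.sample(cells, n_points)
    return {chr(ord("A") + i): p for i, p in enumerate(points)}


def sqdist(p, q):
    """Squared Euclidean distance, kept as an exact integer."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def derive_statements(diagram):
    """Toy stand-in for a symbolic engine: emit facts that hold exactly."""
    facts = []
    names = sorted(diagram)
    # Three points are collinear when the cross product of AB and AC is zero.
    for a, b, c in itertools.combinations(names, 3):
        (ax, ay), (bx, by), (cx, cy) = diagram[a], diagram[b], diagram[c]
        if (bx - ax) * (cy - ay) - (by - ay) * (cx - ax) == 0:
            facts.append(f"collinear({a},{b},{c})")
    # Two segments have equal length when their squared lengths match.
    segments = list(itertools.combinations(names, 2))
    for (a, b), (c, d) in itertools.combinations(segments, 2):
        if sqdist(diagram[a], diagram[b]) == sqdist(diagram[c], diagram[d]):
            facts.append(f"eqlen({a}{b},{c}{d})")
    return facts


if __name__ == "__main__":
    diagram = random_diagram(5, seed=1)
    print(diagram)
    print(derive_statements(diagram))
```

Because every fact is checked with integer arithmetic, nothing in the output depends on human-written text, which is the property the question is after.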

3 comments

cubefox 888 days ago

It's the topic of this fascinating paper:

https://arxiv.org/abs/2207.14502


artninja1988 888 days ago

Hey, thanks! I'll add it to my reading list


sangnoir 888 days ago

For language, the symbolic engine itself would likely be trained on copyrighted input, unlike the geometry engine, since math & math facts are not covered by copyright.

If you couple random text generation with such an engine for language, you'd be laundering your training data through the extra step (and the quality would likely be worse due to multiplicative errors).


nodogoto 888 days ago

What statements about properties of randomly generated code snippets would be useful for coding LLMs? You would need to generate text explaining what each snippet does, but that would require an existing coding LLM, so any IP concerns would persist.


artninja1988 888 days ago

Yeah, but couldn't some system be built that understands what the different code snippets do by compiling them or whatever?
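A rough sketch of what "compiling them or whatever" could look like: generate random snippets, run them, and record what they actually do. Everything here is hypothetical (the `random_expr`, `observe`, and `make_examples` names, and the restriction to tiny arithmetic expressions in one variable). It only yields input/output facts, not prose explanations, so it does not by itself answer the objection above about needing text that explains each snippet.

```python
import random

OPS = ["+", "-", "*"]


def random_expr(rng, depth=2):
    """Build a random arithmetic expression over x and small integer constants."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(["x", str(rng.randint(0, 9))])
    left = random_expr(rng, depth - 1)
    right = random_expr(rng, depth - 1)
    return f"({left} {rng.choice(OPS)} {right})"


def observe(expr, inputs=(0, 1, 2, 3)):
    """Compile the expression once, then evaluate it on each sample input."""
    code = compile(expr, "<snippet>", "eval")
    # Empty builtins: the generated expressions only need arithmetic on x.
    return [(x, eval(code, {"__builtins__": {}}, {"x": x})) for x in inputs]


def make_examples(n, seed=0):
    """Pair each generated snippet with the behavior observed by running it."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        expr = random_expr(rng)
        examples.append({"code": f"lambda x: {expr}", "io": observe(expr)})
    return examples


if __name__ == "__main__":
    for example in make_examples(3):
        print(example["code"], example["io"])
```

The interpreter plays the role the symbolic engine plays for geometry: each recorded pair is true by construction, with no existing model or human-written corpus involved.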
