| HN Mirror



	by msdz 98 days ago
	They’d probably get the farthest, but they won’t pursue that because they don’t want to end up leaking the original training data. For ordinary language/text data, it is possible to reconstruct massive consecutive parts of a model's training data [1], so it ought to be possible for their internal code, too. [1] https://arxiv.org/abs/2601.02671

1 comment

foota 98 days ago

Copyright for me not for thee? :) That's a good point though. Maybe they could round trip things? E.g., use the model trained only on internal content to generate training data (which you could probably screen in some way to remove anything you don't want leaking) and then train a new model off just that?
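
A minimal sketch of what the screening step might look like, assuming "leaking" means a generated sample shares a long verbatim word run with the internal corpus. The function names, the whitespace tokenization, and the 8-word window are all hypothetical choices for illustration. The generation and retraining steps are not shown.

```python
def ngrams(tokens, n):
    """Return the set of all n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(private_docs, n=8):
    """Collect every n-gram that appears in the internal corpus."""
    index = set()
    for doc in private_docs:
        index |= ngrams(doc.split(), n)  # assumption: whitespace tokenization
    return index


def leaks(sample, index, n=8):
    """True if the sample shares at least one n-gram with the internal corpus."""
    return not ngrams(sample.split(), n).isdisjoint(index)


def screen(samples, private_docs, n=8):
    """Keep only generated samples with no verbatim n-gram overlap.

    Note: samples shorter than n tokens have no n-grams and always pass.
    """
    index = build_index(private_docs, n)
    return [s for s in samples if not leaks(s, index, n)]
```

Something like `screen(generated, internal_docs)` would then give the subset to train the second model on. Exact-match filtering like this only catches verbatim copying, so paraphrased or lightly edited leaks would still get through.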
