| HN Mirror



	by water-data-dude 161 days ago
	It'd be difficult to prove that you hadn't leaked information to the model. The big gotcha of LLMs is that you train them on BIG corpora of data, which means it's hard to say "X isn't in this corpus" or "this corpus only contains Y". You could TRY to assemble a set of training data that only contains text from before a certain date, but it'd be tricky as heck to be SURE about it. Ways data might leak to the model that come to mind: misfiled/mislabeled documents, footnotes, annotations, document metadata.
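To make the leak concern concrete, here is a rough sketch of the kind of heuristic scan one might run over a date-limited corpus. Everything in it is an assumption for illustration: the `find_leaks` helper, the 1875 cutoff, the banned-term list, and the sample documents are hypothetical and not taken from any real pipeline. It flags four-digit years after the cutoff and known anachronistic terms, which is roughly what a footnote or a misfiled reprint would trip.

```python
import re

# Matches four-digit years from 1000 to 2099 (a rough heuristic, not a date parser).
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def find_leaks(text, cutoff_year, banned_terms=()):
    """Return (kind, match) pairs that hint at post-cutoff content."""
    hits = []
    # Any explicit year later than the cutoff is suspicious.
    for m in YEAR_RE.finditer(text):
        if int(m.group()) > cutoff_year:
            hits.append(("year", m.group()))
    # Case-insensitive substring check for terms that should not exist yet.
    lowered = text.lower()
    for term in banned_terms:
        if term.lower() in lowered:
            hits.append(("term", term))
    return hits

# Hypothetical sample documents, invented for this sketch.
docs = {
    "letter_1874.txt": "London, 3 March 1874. The debate in Parliament continues.",
    "misfiled.txt": "Reprinted 1923 with a footnote on general relativity.",
}

for name, body in docs.items():
    print(name, find_leaks(body, cutoff_year=1875, banned_terms=["relativity"]))
```

A scan like this can only surface leaks that announce themselves. An empty result says nothing about later ideas paraphrased without a date or a telltale term, so it cannot show that a corpus is clean.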

2 comments

gwern 161 days ago

There's also severe selection effects: what documents have been preserved, printed, and scanned because they turned out to be on the right track towards relativity?


mxfh 160 days ago

This.

Especially for London there is a huge chunk of recorded parliament debates.

For dialogue, training on recorded correspondence in the form of letters seems more interesting anyway.

And that corpus script looks odd, to say the least: just oversample by X?


water-data-dude 160 days ago

Oh! I honestly didn't think about that, but that's a very good point!


reassess_blind 160 days ago

Just Ctrl+F the data. /s
