

	by thomasahle 1038 days ago
	Exactly. This is always a pitfall when benchmarking LLM-based techniques. The enwik8 dataset they use, for example, is almost certainly in the training data. To know how the method performs on novel data, the authors have to come up with entirely new datasets, since anything that already exists must be assumed to be contaminated.


makapuf 1038 days ago

But doesn't the size reported in benchmarks include the size of the decoder binary? That way the embedded training data is accounted for, which prevents a plain copy of Wikipedia from being included in the decoder.


loeg 1037 days ago

I don't think the compressed size statistics on this webpage include the size of the LLM needed to decode. Some of these inputs are only a few hundred kB; LLMs absolutely dwarf that.
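
To make the accounting point concrete, here is a rough sketch of how the ratio changes once the decoder is counted. All numbers and the function name `effective_ratio` are made up for illustration; they are not taken from the webpage or from any benchmark's rules.

```python
def effective_ratio(original_bytes: int, compressed_bytes: int, decoder_bytes: int = 0) -> float:
    """Compressed size (plus the decoder, if counted) divided by the original size."""
    return (compressed_bytes + decoder_bytes) / original_bytes


# Hypothetical figures, chosen only to show the scale mismatch:
original = 300_000          # an input of a few hundred kB
compressed = 30_000         # what an LLM-based compressor might emit
llm_decoder = 1_000_000_000 # assumed model size on disk, about 1 GB

ratio_without_decoder = effective_ratio(original, compressed)
ratio_with_decoder = effective_ratio(original, compressed, llm_decoder)

print(f"decoder ignored: {ratio_without_decoder:.3f}")
print(f"decoder counted: {ratio_with_decoder:.1f}")
```

With the decoder ignored the result looks like a large win; with it counted, the "compressed" artifact is thousands of times bigger than the input, which is why it matters which convention the page's statistics follow.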
