| HN Mirror


	by behnamoh 389 days ago
	> No sign of what source material it was trained on though, right? Out of curiosity, does anyone do anything "useful" with that knowledge? It's not like people can just randomly train models.

6 comments

marci 389 days ago

When you're truly open source, you can make things like this:

Today we introduce OLMoTrace, a one-of-a-kind feature in the Ai2 Playground that lets you trace the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace is a manifestation of Ai2’s commitment to an open ecosystem – open models, open data, and beyond.

https://allenai.org/blog/olmotrace

kreijstal 389 days ago

You can do the same, except you would need to be a pirate website. It would even be better. Except illegal. But it would be better.

marci 389 days ago

That is why the others can't provide stuff like this. RAG/hallucination check. I just wish Allen.AI models had a bigger context; 4k is too small nowadays.

ToValueFunfetti 389 days ago

Would be useful for answering "is this novel or was it in the training data", but that's not typically what the point of open source is

anonymoushn 389 days ago

If labs provided the corpus and source code for training their tokenizers, it would be a lot easier to produce results about tokenizers. As it is, they provide neither, so it is impossible to compare different algorithms running on the same data if you also want to include the vocabs that are commonly used.

m00x 389 days ago

Many are speculating it was trained by o1/o3 for some of the initial reasoning.

fulafel 389 days ago

Are there any widely used models that publish this? If not, then no, I guess.

DANmode 389 days ago

Depending on how you use "randomly", they absolutely can..?
