The labs are spending hundreds of millions of dollars hiring people to do many fairly random (but economically valuable) tasks in order to collect this tacit knowledge for RL.
> It ceases to become tacit as soon as it is collected.
I'm not sure.
If it is collected via preferences, then it isn't necessarily something that can be communicated (except in the LLM's latent space).
That still feels tacit to me.
To simplify that argument: the relationship between King and Queen in the Word2Vec latent space can easily be given an explicit label.
But the relationship between Napoleon and Tsar Alexander I also exists, and it encodes much of the tacit knowledge about the two of them, yet it isn't as easily labelled (e.g., Google AI Mode says: "Napoleon I and Tsar Alexander I had a volatile "bromance" that shifted from mutual admiration to deep animosity, acting as a defining conflict of the Napoleonic Wars").
Word2Vec is a very simple model. In a more complex LLM, that deeper knowledge can be queried by asking questions, but you can never capture it all. Isn't that what "tacit knowledge" is?
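For anyone who hasn't seen the King/Queen trick, here is a minimal sketch of what I mean by an "easily labelled" relationship. The vectors are hypothetical 3-d toys I made up by hand, not real Word2Vec embeddings, and the helper names (`sub`, `add`, `cosine`, `analogy`) are mine.

```python
import math

# Hypothetical hand-made embeddings, for illustration only.
# Axes are roughly (royalty, male, female); real Word2Vec axes carry no such labels.
VECTORS = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],  # unrelated distractor word
}

def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def add(u, v):
    """Element-wise sum u + v."""
    return [a + b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to (b - a + c), excluding the three input words."""
    target = add(sub(vectors[b], vectors[a]), vectors[c])
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# "man is to king as woman is to ...?"
print(analogy("man", "king", "woman"))  # queen
```

In this toy, `king - man` and `queen - woman` are the same offset, so you can point at it and name it "royalty". My claim is that a Napoleon/Alexander offset would exist in the same sense, but there's no one-word name to hang on it.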
It's a good question, yeah, and a lot of these boundaries get fuzzy when they're looked at closely enough.
It's certainly the case that LLMs already are able to represent and make use of some kinds of apparently still tacit knowledge, and that the scope of that is apparently expanding. I don't question that. I question two things: whether it is always desirable for that scope to expand, and whether it is possible for that scope to ever fully cover what it seeks to cover.
Maybe this rephrase will help: the proposed solution is to render all knowledge explicit.