| I've found that LLMs serve best as fuzzy searchers. It can be hard to ask Google the right question, and this is where LLMs shine. Googling any form of "I remember hearing about a study that Google did a while back about new hires and they found that if a GPA was above 3.0 that there was no difference. Can you link me that study? Was there any followup?" is quite difficult, and you'll likely end up with tons of links about the minimum GPA for getting a job at Google. But Bard will give you information about "Laszlo Bock" and his book, which enables more refined Googling. A simple "Laszlo Bock Google GPA" now provides a useful search. This is where I find LLMs shine: when I'm struggling to recite the correct incantation to Google to filter out all the junk that has been SEO-optimized. (foreshadowing LLM search optimization...) What's also interesting is that I tried this exact sentence in multiple LLMs. - ChatGPT gives me the standard knowledge limit response, despite all the results for our refined search being from June 2013. - Bard didn't need any coaxing (a bit surprising). - Hugging Face Chat also gave me Bock, plus Project Oxygen and Project Aristotle (Bard didn't have either). Hugging Face is providing by far the best result. - Claude did not find the study but at least suggested some others. - LLaMa doesn't seem to be able to find it either, but suggests that Google has done studies and gives some names. sheepscreek is exactly right about the fine tuning for correctness degrading results. There is an interesting thing going on right now, as alignment is strangely not being recognized as also disalignment. You cannot have one without the other. There is always a trade-off, since you are shifting the probability distribution.
But unfortunately, I think it is not only unpopular to research this area; the methods needed would also involve quite unpopular networks and require a deep discussion of probability and distributions, which currently appears to result in rejection from top conferences, if my Twitter feed and personal experience are any indication. The conferencing system is so noisy at this point that I personally feel it is worse than if it did not exist. Much like my ChatGPT result for the question. It is also worth mentioning that the tuning process being performed may have additional consequences which aren't being openly discussed or addressed, despite it being in the name. Tuning for human preference is not exactly tuning for factual knowledge, but for the results that humans prefer. While tuning may include pressure to increase factual output, one also needs to be highly aware that the bias we're introducing to these models is one that specifically hacks the evaluation metric (i.e. us humans). This has the ability to make LLMs worse off than before, as they become more likely to be convincing when they return incorrect information, even if the average factual accuracy is higher. We need to be highly aware of both Simpson's and Berkson's paradoxes, as they deal with poor evaluation due to the way in which data (results) are aggregated. We are literally tuning through Goodhart's Law. |
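
To make the aggregation point concrete, here is a minimal sketch of Simpson's paradox in an evaluation setting. Everything in it is an assumption for illustration: the model names, the "easy"/"hard" split, and the counts are hypothetical and not taken from any real LLM evaluation. It only shows that a model can be worse within every question type and still come out ahead on the pooled average, simply because the two models were scored on a different mix of questions.

```python
from fractions import Fraction

# Hypothetical (correct, total) counts per question type.
# These are illustrative numbers, not real evaluation data.
results = {
    "base":  {"easy": (81, 87),   "hard": (192, 263)},
    "tuned": {"easy": (234, 270), "hard": (55, 80)},
}


def accuracy(correct, total):
    """Exact accuracy as a fraction, to avoid rounding surprises."""
    return Fraction(correct, total)


def per_group(model):
    """Accuracy of `model` within each question type."""
    return {group: accuracy(*counts) for group, counts in results[model].items()}


def aggregate(model):
    """Accuracy of `model` after pooling all question types together."""
    correct = sum(c for c, _ in results[model].values())
    total = sum(t for _, t in results[model].values())
    return accuracy(correct, total)


for model in results:
    groups = ", ".join(
        f"{group}: {float(acc):.1%}" for group, acc in per_group(model).items()
    )
    print(f"{model:5s} {groups}, overall: {float(aggregate(model)):.1%}")
```

With these counts the hypothetical base model wins on both easy and hard questions, yet the tuned model has the higher overall accuracy, because most of its questions were easy ones. An averaged score alone would hide that.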