Hacker News
by aleem 1105 days ago
The one thing about the leaked memo that didn't sit well with a lot of people is that it ignored the quality gap between GPT-4 and GPT-3 and claimed that all LLMs were poised to be on par, which still isn't true.

What it also ignored (along with some of the comments here) is data ranking. Google didn't just build a search engine by crawling more of the web -- many search engines before it had already done that. Google managed to rank what's relevant and what isn't. Relevancy is hard. Similarly, not all scientific publications are ranked equally. For that matter, even publications with a lot of peer reviews or citations can become obsolete through new discoveries.

Reddit's data has value in that it can fill in a lot of the gaps left by more qualitative sources, and furthermore the data is user-ranked by a trusted community. This also has implications for specialised querying. For example, training on just r/fitness could be fairly useful for that community.

As a side note, valuable data stores are not just text but voice and video as well. YouTube and podcast transcripts are readily available, for example to Google. Data and ranking are valuable all over again.