

by code51 921 days ago

Their special sauce is most probably the quality of data and the amount of data cleaning effort they put in.

I’m speculating here, but I think Google has always refrained from getting into the manual side of things. With LLMs, it quickly became obvious that data is what matters. Seeing Microsoft’s phi-2 play, I’m even more convinced of this.

DeepMind understood the scaling properties and came up with Chinchilla, but it couldn’t integrate well with Google when it came to understanding what kind of data Google should supply to increase model quality.

OpenAI put in annotation/cleaning work almost right from the start. I’m not too familiar with this, but human labor was heavily used to improve training data quality after ChatGPT launched.


staunton 920 days ago

Indeed, making poor people in 3rd world countries rate the worst sludge of the internet for 8+h a day might backfire on your marketing... OpenAI could risk it, Google maybe doesn't want to...


blowski 920 days ago

Given that many western companies hire poor people to do all sorts of horrible work, I doubt it’s that. More likely it’s to avoid suggestions of bias across their product range.


Palmik 920 days ago

This is a naive take. How do you think Google collects or collected data for their safe-search classifiers? Now that's a sludge.

Or how do you think Google evaluates search-ranking changes (or gathers data for training various ad-ranking & search-ranking models)?


staunton 920 days ago

I don't know. How do they?


NavinF 920 days ago

Their instructions for human raters are public info.

Overview: https://blog.google/products/search/overview-our-rater-guide...

Full PDF: https://static.googleusercontent.com/media/guidelines.raterh...


pixl97 920 days ago

I was going to make a joke about all those CAPTCHAs we've solved, but I don't have an answer here.
