by markisus 1260 days ago
The letter frames recent advances like ChatGPT as positive indicators for Scale's future business. However, task-specific, bespoke human data annotation may give way to datasets generated by massive self-supervised multi-modal models. The other day, I saw that one of these image generation models can be prompted to create image/semantic-mask pairs. And I wouldn't be surprised if ChatGPT could generate text alongside named entity labels for specialized domains. The holy grail for AI has always been obviating the need for manual annotation and increasing efficiency on the data we do have. Scale's data annotation business seems at odds with the core goals of AI researchers, and may dry up as those goals are increasingly realized.

Not only that, it's not clear to me that these models required the sort of annotation that Scale sells. He says:

>Each of the big AI breakthroughs (ChatGPT, GPT-3, DALL-E) have been made possible through extensive data collection and annotation infrastructure.

Maybe I'd count the RLHF stuff from ChatGPT as requiring a form of annotation. But DALL-E is trained on existing captions and alt text. No one is sitting in a room annotating images for DALL-E. Where would a company like Scale fit into this?

Yea… despite the massive amounts of data being fed into these models, the model quality is still data-bound. There is no way to produce models like GPT-3 with manually annotated data.
This is suddenly a highly debatable claim. Traditional ML was "data hungry" and required massive datasets for features, plus difficult-to-acquire "clean" labels.

You can ask ChatGPT to generate these datasets now. However, it's unclear whether future models will rely on such datasets if large models already have the answer. Here is an example of a ChatGPT-generated dataset for a misinformation classifier. ChatGPT can generate 10 examples every time I hit enter. I asked it to create this dataset using the format:

sentence, misinformation (1,0), notes

---

The earth is flat, 1, This claim is widely debunked by scientific evidence and research.

Vaccines are harmful and cause autism, 1, This claim has been disproven by multiple scientific studies and is not supported by medical evidence.

The Holocaust never happened, 1, This claim is a historical denial and disregards the overwhelming evidence of the genocide of Jewish people during World War II.

Humans and dinosaurs coexisted, 1, This claim is not supported by scientific evidence and is a popular conspiracy theory.

Water boils at 212 degrees Fahrenheit, 0, This is the standard boiling point of water at sea level.

The moon landing was faked, 1, This claim is a conspiracy theory and is not supported by any evidence.

Climate change is not caused by human activity, 1, This claim is not supported by the overwhelming majority of scientific evidence and research.

The sun revolves around the earth, 1, This claim was disproven by scientific evidence in the 16th century and is now considered a flat-earth theory.

HIV does not cause AIDS, 1, This claim is not supported by scientific evidence and has been disproven by multiple studies.

Vaccines are safe and effective, 0, This claim is supported by the majority of scientific evidence and research.
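
If you wanted to feed rows like these into a classifier, you'd first have to turn them into structured records. Here's a rough sketch of how that might look in Python, assuming every row follows the "sentence, label, notes" layout above and the label is always a bare 0 or 1. The names `LabeledSentence`, `parse_row`, and `parse_dataset` are made up for illustration and aren't from any library.

```python
import re
from typing import NamedTuple, Optional


class LabeledSentence(NamedTuple):
    """One generated row: the claim, its 0/1 label, and the free-text note."""
    sentence: str
    misinformation: int
    notes: str


# Assumption: the label is the first standalone 0 or 1 sitting between two
# commas. Splitting on that keeps any commas inside the notes intact.
_ROW = re.compile(r"^(?P<sentence>.+?),\s*(?P<label>[01]),\s*(?P<notes>.*)$")


def parse_row(line: str) -> Optional[LabeledSentence]:
    """Parse one line, or return None for blanks, separators, and the header."""
    match = _ROW.match(line.strip())
    if match is None:
        return None
    return LabeledSentence(
        sentence=match["sentence"].strip(),
        misinformation=int(match["label"]),
        notes=match["notes"].strip(),
    )


def parse_dataset(text: str) -> list[LabeledSentence]:
    """Parse a pasted block of generated text, skipping lines that don't fit."""
    rows = (parse_row(line) for line in text.splitlines())
    return [row for row in rows if row is not None]


if __name__ == "__main__":
    sample = (
        "sentence, misinformation (1,0), notes\n"
        "---\n"
        "The earth is flat, 1, This claim is widely debunked by scientific evidence and research.\n"
        "\n"
        "Water boils at 212 degrees Fahrenheit, 0, This is the standard boiling point of water at sea level.\n"
    )
    for row in parse_dataset(sample):
        print(row.misinformation, row.sentence)
```

Note that this only checks the shape of each row. It says nothing about whether the generated labels are actually right, which is the part a human annotator would otherwise be paid for.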

I was talking about ChatGPT itself. It could be made better with more data.
However, that data may not come from human labels.