Hacker News
by llm_trw 477 days ago
Data is king. Even when a new, better model comes along, a high-quality dataset is still just as valuable.

Paying top performers above market rates to do nothing but data labelling is a moat that just keeps getting deeper.


Good data and good evals are two legs of the 3-legged stool that a lot of AI teams are missing.
It also can't really be overstated how helpful it is as an ML engineer to simply spend the time going through thousands of examples yourself. If you abstract yourself away from the data and just "make metric go up" you'll be missing out on valuable insights about how and why your model might be failing.
It's almost as if (bear with me ...) these "artificial intelligences" actually need "human intelligences" to guide them. Maybe we can think up a "system" where "experts" can codify rules for the "artificial intelligence" to follow.

OK, the sarcasm got too thick, but my point is this: if the engineer has to spend the time combing through thousands of examples, then you don't have AI. You have a man in a box pretending to be a machine that plays chess.

We have human teachers for much the same reasons.

Are humans just other humans hiding in boxes pretending to play chess?

What would a product look like in this space?
It's not a product. It's a core business competency in the ML space.
There are several data labeling products on the market such as Label Studio.

I’ve resorted to building my own annotation apps.

For my one foray into ML, in 2020, I also built my own labeling system. It was stupidly simple; IIRC, it was a Jupyter Notebook that presented you with text to label, and you’d do so by hitting 1-5, which were mapped to sentiments / emotions. If you got bored, or just wanted to see how it performed with X% training, you could save progress and quit. It worked well enough, and I think I labeled a couple of thousand entries using it.
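For anyone curious what a tool like that might look like, here is a minimal sketch of the loop described above, not the original code. The label names, the `label_texts` function, the `q` quit key, and the JSON progress file are all assumptions made for illustration; the comment only says keys 1-5 mapped to sentiments and that progress could be saved.

```python
import json
from pathlib import Path

# Hypothetical key-to-label mapping; the actual labels are not given in the thread.
LABELS = {"1": "joy", "2": "anger", "3": "sadness", "4": "fear", "5": "neutral"}


def label_texts(texts, progress_path, read_key=input):
    """Show each unlabeled text, read a key 1-5, and save after every label.

    Entering 'q' stops early. Labels already stored in the progress file are
    skipped, so a later call resumes where the previous one left off.
    Returns a dict mapping the text's index (as a string) to its label.
    """
    path = Path(progress_path)
    done = json.loads(path.read_text()) if path.exists() else {}

    for i, text in enumerate(texts):
        key = str(i)  # JSON object keys are strings
        if key in done:
            continue

        # Re-prompt until we get a valid label key or the quit key.
        while True:
            choice = read_key(f"{text}\n[1-5 to label, q to quit] ").strip()
            if choice == "q" or choice in LABELS:
                break

        if choice == "q":
            break

        done[key] = LABELS[choice]
        path.write_text(json.dumps(done))  # persist progress immediately

    return done
```

In a notebook cell you would call something like `label_texts(my_texts, "progress.json")` and type at the `input` prompt. The `read_key` parameter exists only so the prompt can be swapped out, for example with a widget callback or a scripted sequence of keys.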
I ALSO have resorted to building my own labeling even though there are great generic labeling tools out there. I think this is a missing piece of the landscape but I don't know enough about the space yet to say what the solution should be.