by killjoywashere 2311 days ago
"The data" means more than pure computer science people want to admit. In any "advanced" application, that means annotators. Radiologists drawing circles around cancer, attorneys labeling contract clauses as unacceptable, drivers labeling stop signs, etc.

ML is a mining problem. Digitizers are the miners. Annotators are the refiners.


Basically, the system is massively ad hoc and driven by this large-scale annotation, training, and testing.

The big question here is: what happens when the world changes next year? You rebuild the application. I know there are companies that advertise continuous updating of deep learning models, but calculating the total costs and total benefits of that looks hard.

Sometimes the mine makes money, sometimes it doesn't make sense to run the mine.

To extend the mining metaphor, and relate back to the original articles:

People and organizations are chasing what they believe, or are told to believe, is pay dirt.

Many investors unfamiliar with the field have rushed in, possibly for fear of missing out, and funded many of the prospectors. Yet many of those prospectors and investors aren't really aware of the costs of running a mine, or of the practices required to run one efficiently.

It turns out that there are more aspects to the value creation process than dig/refine/polish (data/train/predict), especially when usefulness in application matters and there are finite resources available for digging.

Companies selling shovels (i.e. renting out compute) are some of the primary beneficiaries of this, since their sales are funded by the malinvestment.

Additional beneficiaries are the refiners (training experts), who are able to charge steep labor premiums. However, organizations are starting to figure out that their refiners are expensive to keep idle and often operate the mines poorly in terms of throughput, cost-effectiveness, repeatability, and application (see the various threads on "Data Engineers").

This is correct. However, the distinction between labeling and training is artificial, and probably arises from the fact that ML came from academia, where it was not part of a business process.

I.e. a modern ML system should just plug into the business process from day 0, where the ML task should be performed by a human and recorded by the machine.

After a while, the machine would train on this recorded data, and start replacing the humans.

Rinse and repeat.
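
A minimal sketch of that loop, under loud assumptions: the class name, the thresholds, and the "model" are all hypothetical. The model here is just a lookup table over recorded human labels, standing in for whatever would really be trained on the recorded data.

```python
from collections import Counter


class HumanInTheLoop:
    """Route each task to a human until recorded labels support automation.

    Hypothetical illustration only: the "model" is a majority-vote lookup
    over previously recorded human decisions, not a real learner.
    """

    def __init__(self, min_examples=3, min_agreement=0.9):
        self.min_examples = min_examples    # recorded labels needed per item
        self.min_agreement = min_agreement  # share the top label must reach
        self.history = {}                   # item -> Counter of human labels

    def _model_predict(self, item):
        counts = self.history.get(item)
        if not counts:
            return None
        total = sum(counts.values())
        if total < self.min_examples:
            return None
        label, n = counts.most_common(1)[0]
        # Only automate when the humans agreed often enough.
        return label if n / total >= self.min_agreement else None

    def handle(self, item, ask_human):
        """Return (label, source), where source is 'machine' or 'human'."""
        prediction = self._model_predict(item)
        if prediction is not None:
            return prediction, "machine"
        # Day-0 behaviour: the human does the task, the machine records it.
        label = ask_human(item)
        self.history.setdefault(item, Counter())[label] += 1
        return label, "human"
```

The point of the sketch is only the routing: every task goes through `handle`, humans do the work at first, and the machine takes over an item once enough consistent decisions have been recorded for it.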

> a modern ML system should just plug into the business process from day 0, where the ML task should be performed by a human and recorded by the machine.

Ah, this is a typical thing I hear people in the Valley say: just push it all ... somewhere. No.

If we digitized all microscopy slides, it would require YouTube-scale storage several times over. People think genomics is big. People think reconnaissance imaging is big. They're big, but there's only so much of them.

IF it were digitized, far more pathology whole slide imaging would be generated every day than either of those. I did some estimates at one point and had to add a couple of orders of magnitude to the genomics data just to make it competitive at enterprise scale.

And keep in mind, we're talking clinical medicine. We want the data now. We're looking at the slides while the glue is still wet. You don't have the bandwidth, no one has the bandwidth, to do some of this stuff the way you propose and maintain the current "business process" of clinical medicine.

Building models and iterating, the old-fashioned way, is the only approach that makes sense.

Funny, we all thought computers were fast. Turns out it's nowhere near what we need.

They're fast, sure. But not very efficient in certain problem domains, specifically where humans are efficient (for reasons that are IMHO historical, not innate).