Hacker News new | ask | show | jobs
by izendejas 2306 days ago
That's because, ML and operations-research problems can be simplified to set of optimization problems and the underlying math and statistics are all very similar if not identical in some cases.

And the input matters, a lot. So the differentiating factor isn't the models, it's the data and companies like Google figured it out a long time ago.

In short, find interesting problems, then the solutions -- not the other way around.

3 comments

"The data" means more than pure computer science people want to admit. In any "advanced" application, that means annotators. Radiologists drawing circles around cancer, attorneys labeling contract clauses as unacceptable, drivers labeling stop signs, etc.

ML is a mining problem. Digitizers are the miners. Annotators are the refiners.

Basically, the system is massively ad-hoc and driven by this large scale annotation, training and testing.

The big question here is, what happens when the world changes next year? You rebuild the application. I know there are companies that advertise doing continuous updating of deep learning models but it seems like calculating total costs and total benefits is going to be hard here.

Sometimes the mine makes money, sometimes it doesn't make sense to run the mine.
To extend the mining metaphor, and relate back to the original articles:

People and organizations are chasing what they believe, or are told to believe, is pay dirt.

Many unfamiliar investors have rushed in, possibly fearing missing out, and fund many of the prospectors, yet many of the prospectors and investors aren't really aware of the costs of running a mine, nor the practices required to run them efficiently.

It turns out that there's more aspects to the value creation process than dig/refine/polish (data/train/predict), especially when usefulness in application matters and there are finite resources available for digging.

Companies selling shovels are some of the primary beneficiaries of this, by selling shovels (i.e. renting compute) funded by the malinvestment.

Additional beneficiaries are the refiners (training experts) that are able to charge steep labor premiums, however organizations are starting to figure out that their refiners are expensive to keep idle and often operate the mines poorly in terms of throughput/cost-effectiveness/repeatability/application (see the various threads on "Data Engineers")

This is correct, however, the distinction between labeling and training is artificial, and probably arises from the fact that ML came from academia, where it was not part of the business process.

I.e. a modern ML system should just plug into the business process from day 0, where the ML task should be performed by human and recorded by the machine.

After a while, the machine would train on this recorded data, and start replacing the humans.

Rinse and repeat.

> a modern ML system should just plug into the business process from day 0, where the ML task should be performed by human and recorded by the machine.

Ah, this is a typical thing I hear people in the Valley say: just push it all ... somewhere. No.

If we digitized all microscopy slides, it would require YouTube-scale storage several times over. People think genomics is big. People think reconnaissance imaging is big. They're big, but there's only so much of them.

IF it were digitized, there would be far more pathology whole slide imaging being generated every day. I did some estimates at one point and had to throw a couple orders of magnitude into the genomics data to even make it competitive at enterprise scale.

And keep in mind, we're talking clinical medicine. We want the data now. We're looking at the slides while the glue is still wet. You don't have the bandwidth, no one has the bandwidth, to do some of this stuff they way you propose and maintain the current "business process" of clinical medicine.

Building models and iterating, the old fashioned way, is the only way it makes sense.

Funny, we all thought computers were fast. Turns out its nowhere what we need.
They're fast, sure. But not very efficient in certain problem domains, specificially where humans are efficient (for reasons that are IMHO historical, not innate).
This is spot on. Hence the open sourcing of ML code while keeping an iron grip on data.
> And the input matters, a lot. So the differentiating factor isn't the models, it's the data and companies like Google figured it out a long time ago.

The models are likely also a differentiating factor in a sense that there are models that perform much better than others, to a point of completely new functionality. But also all of these models are basically open source currently... So they can't by definition be differentiating between different companies, because all of the companies generally have access to all of the algorithms. At leat to all of the types of algorithms.