Ask HN: Why does Machine Learning use these assumptions?
7 points by dreamlessfate 1348 days ago
I'm trying to learn the nuts-and-bolts of Machine Learning, but the more I dig in, the stupider the assumptions seem to be.

The thought keeps popping into my head over and over again: Just because it works doesn't mean it works well, or that it works in a smart, optimal, or even ontologically truthful/useful/realistic way.

There is no shortage of videos, papers, tutorials, and blogs that explain the math & models in detail. But there are exceptionally few sources that explain the underlying assumptions...and why these are useful (or not useful) assumptions.

Why does Machine Learning use these assumptions?

--

1) Sigmoid Functions & Binary Classification - I understand the math and the probabilities.

But rather: WHY would you want to classify using a binary system of classification? WHY would you want to reduce everything to yes/no? Or more accurately, a probability of yes/no? Or even chained probabilities of yes/no?

Is it just due to being stuck in the paradigm of programming on machines built on yes/no logic gates? Trying to perform these very complex tasks (identification, generation, whatever) on CPUs and software that are, in and of themselves, built on binary distinction?

If all you have is a binary logic gate (hammer), then everything looks like a cumulative distribution function (nail)?

Isn't this a totally moronic approach? Or is it just the best we got? I feel like it's stuck back in the signal processing days of trying to "fit" and force a signal to achieve a certain pattern without realizing the what or why. Turning knobs on an oscilloscope.

--

2) Layers - Why are artificial neural networks set up as "layers"?

Isn't this more like an assembly line? Doesn't that seem dumb? Why would someone believe, in their heart of hearts, that intelligence or pattern recognition, or any kind of thinking, happens procedurally?

Doesn't this (again) seem like a very moronic approach? One that is based on the procedural nature of the machine itself? And the programmer themself? And not the nature of thinking, intelligence, or even complex analysis / complex systems?

Complex systems with lots of variables and lots of dimensions don't actually interact like this. They don't have "layers", this is a totally made-up assumption that has major implications on the entire field.

Was this just chosen out of necessity, because software and programs need a beginning and an end? An input and an output? Or is there some really convincing argument, that speaks to the philosophy and ontology of these decisions?

4 comments

1. It's built into the task, not into the solution. How do you classify without binary outputs/probabilities? If you want to know if a picture contains someone's face or it doesn't, you need a binary result purely based on the task itself regardless of your approach for solving it. In multiclass classification, you extend your sigmoid to a softmax but it still boils down to a distribution of probabilities. Or in multilabel classification, you essentially perform binary classification for all classes at the same time. Like... what else could you do? In a hypothetical alternative, if you scan your face to unlock your phone, how does the underlying vision model give or deny access to the phone without producing a binary result at some point along the way?
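To make the sigmoid / softmax / multilabel relationship above concrete, here is a minimal sketch in plain Python. The logit values are made up for illustration, and the function names are mine rather than from any particular library.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Binary case: a two-class softmax over [z, 0] gives the same
# "yes" probability as sigmoid(z).
z = 1.3  # illustrative logit
p_yes_sigmoid = sigmoid(z)
p_yes_softmax = softmax([z, 0.0])[0]

# Multiclass: one distribution over mutually exclusive classes (sums to 1).
multiclass = softmax([2.0, 0.5, -1.0])

# Multilabel: an independent yes/no probability per label (need not sum to 1).
multilabel = [sigmoid(v) for v in [2.0, 0.5, -1.0]]
```

In this sketch the multiclass probabilities compete with each other, while the multilabel ones are scored independently, which is the "binary classification for all classes at the same time" described above.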

2. To have nonlinearities between the layers, and to have layers with varying complexity and structure. In practice it works much better than all the alternatives that we've tried.
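One way to see why the nonlinearity between layers matters: without it, stacked layers collapse into a single linear map. A small sketch, assuming made-up 2x2 weights and ReLU as the example nonlinearity:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]

def relu(v):
    """Elementwise max(0, x)."""
    return [max(0.0, x) for x in v]

W1 = [[1.0, -2.0], [0.5, 1.0]]   # first-layer weights (illustrative values)
W2 = [[1.0, 1.0], [-1.0, 2.0]]   # second-layer weights (illustrative values)
x = [1.0, 1.0]

# Two linear layers with nothing between them...
stacked_linear = matvec(W2, matvec(W1, x))
# ...are the same as one linear layer with weights W2 @ W1.
collapsed = matvec(matmul(W2, W1), x)

# With a ReLU between the layers the network computes something
# that no single linear layer reproduces in general.
with_relu = matvec(W2, relu(matvec(W1, x)))
```

Here `stacked_linear` and `collapsed` agree, while `with_relu` differs because the negative intermediate value is clipped before the second layer sees it.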

These things are explained very well even at a beginner level, and you aren't really questioning them deeply or proposing any alternatives, instead you seem to be getting into philosophy.

> "It's built into the task, not into the solution. "

Who says? Who's defining the task?

I promise I'm not trying to get too philosophical here, but this is a long-standing issue in all areas of Science - the tendency towards reductionism.

Keep turning the dials on the oscilloscope to try and eliminate the signal noise...but what if the noise itself is an essential part of the phenomenon you're trying to study and understand? You see where I'm going with this?

> "In practice it works much better than all the alternatives that we've tried."

I was waiting for someone to just come out and say "It's the best we got". I'll grant that it might be true, but I don't like it and I don't accept it.

For the first point - at least read what I wrote after the sentence you picked to quote, and try to address the example with unlocking the phone.

For the second point, read what I wrote before the sentence you picked to quote. Also, your not liking or accepting some result doesn't really change it.

This isn't how to argue in good faith, and it's overall not very productive for anyone involved.

Your example wasn't a good example. For starters, it's literally a binary decision to make (yes/no: is this jstx1 trying to unlock the phone?).

Most of the tough, interesting, challenging problems in this world don't boil down to binary decisions.

Second, facial recognition doesn't depend (inherently) on artificial intelligence. It's not a great example. It's not a truly interesting, tough problem. It's not in the realm of fuzzy logic, concurrency, periodic or aperiodic behavior, or nonlinear relationships.

Could a Neural Net do it faster? Yeah, sure, maybe. But so what? You have a quicker algorithm, a faster heuristic.

PaulHoule, on the other hand, gave a great example:

>For instance, where should a modern book on digital photography be filed in the library? Should it go in the 000's with computing? In the 700's under art? Or in the 600's with technology (an application of optics, electronics, etc.)

I don't know how you expect a non-binary example when you've titled your own question "Sigmoid Functions & Binary Classification".
> Sigmoid Functions & Binary Classification

Sometimes you want to do binary classification.

This isn’t all there is.

> Layers - Why are artificial neural networks setup as "layers"?

They aren’t all set up like that. It’s a simplification of the effects of the speed limits of synapses firing versus the distances between them.

Spiking neural networks model the propagation more precisely, and have some promise. Their biggest issue is that it’s hard to get training data into an appropriate format for them… and that once you do... they don’t really seem to do better.

Just finish learning everything first and come back here. Others answered your concerns pretty well.

It is common to be like an angry freshman or sophomore who demands to know why he should take all these difficult classes, and then 5-10 years later appreciates what he learned.

This doesn't even feel worth learning.

Feels like Economics in undergrad, listening to professors repeat broken, oversimplified models that are so hilariously wrong in their assumptions that they have to invent entirely new definitions to deal with their own failings.

Who put the statistics nerds in charge of AI? Is this really the best we got? Chained probabilities? Gradient descent?

Like did the evolution of AI & ML research go like this?

> We're stuck. After decades of research, we've hit a dead end. All we're left with is a byzantine maze of IF/THEN statements. We cannot simulate intelligence using pure logic. We have failed.

> Ok but what if we throw in PROBABILITIES into a byzantine maze of IF/THEN statements??????

>GENIUS!

It doesn't look good when you're dismissing things you don't understand at all.
So far sadly yes? http://www.incompleteideas.net/IncIdeas/BitterLesson.html

tl;dr So far, things that enable faster search and faster learning win over the long run.

Recursive things like backprop in NN and optimizing reward over long trees of states, seem to win despite huge compute requirements.

Personally I think we are still on the right track of trying to do the right thing, then do the thing right, then do the thing faster.

You cannot refute the things you do not understand.

(1) From the viewpoint of ontology, binary classes are the most elemental and keep you away from the open pits of modelling that people are always walking into.

For instance, where should a modern book on digital photography be filed in the library? Should it go in the 000's with computing? In the 700's under art? Or in the 600's with technology (an application of optics, electronics, etc.)

All these answers are right but they are also wrong. (Like why isn't computing filed with electronics in the 600's or math in the 500's?)

If you're physically filing the book in a place in the library you have to assign it one category out of all of those because it can only be in one place.

If you're trying to do anything else and get correct answers it is simultaneously true that a book is about how to use computer software (say Lightroom) and about how to make art, about the optical performance of lenses, but not about asian languages, nuclear energy, or how to play casino games.

There are certain cases where classes are mutually exclusive and in those cases it is usually right to model those as a constraint rather than start with multi-class classification which usually winds up like

https://en.wikipedia.org/wiki/Celestial_Emporium_of_Benevole...

unless there is something structurally special about the problem.

If you approach the classification of books as asking the question "Is this book about this topic?" the problem becomes tractable... Because the reason why a particular book that could be filed in multiple places is filed in one particular place is "because some librarian decided to file it there". You could never train an algorithm to reproduce the same arbitrary decisions that different librarians make arbitrarily, you'd always have a high error rate. If the question is "Is this book about how to use computer software" then you can get close to 100% in accuracy. To attempt the first is to decide to fail at the very beginning.
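A toy sketch of the two framings, with entirely hypothetical labels and librarians (none of this is real catalog data):

```python
# Hypothetical topic labels for one digital photography book.
book_topics = {
    "computer software": True,
    "art": True,
    "optics": True,
    "nuclear energy": False,
}

# Two hypothetical librarians, each forced to pick exactly one shelf.
shelf_by_librarian = {
    "librarian_1": "computer software",
    "librarian_2": "art",
}

def is_about(topics, topic):
    """Binary question: is this book about the given topic?"""
    return topics.get(topic, False)

# Each shelf choice is defensible under the binary view...
both_defensible = all(
    is_about(book_topics, shelf) for shelf in shelf_by_librarian.values()
)
# ...yet the librarians disagree, so a single-shelf target has no
# one right answer for a model to learn.
shelves_disagree = len(set(shelf_by_librarian.values())) > 1
```

The per-topic questions have stable answers; the single-shelf label depends on who did the filing.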

Also often the math works for binary classification and doesn't work for other kinds. See

https://plato.stanford.edu/entries/arrows-theorem/

for one kind of problem which is trivial for two choices and intractable for more than two.
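A small illustration of the flavor of the problem, using the Condorcet paradox. This is a classic related example and not a proof of Arrow's theorem itself; the ballots are constructed for the purpose.

```python
from itertools import combinations

def pairwise_winner(ballots, a, b):
    """Return whichever of a, b is ranked higher on a majority of ballots
    (None on a tie). Each ballot is a list ordered from most to least preferred."""
    a_votes = sum(1 for ballot in ballots if ballot.index(a) < ballot.index(b))
    b_votes = len(ballots) - a_votes
    if a_votes == b_votes:
        return None
    return a if a_votes > b_votes else b

# Two options: majority rule gives a clear answer.
two_option_ballots = [["A", "B"], ["A", "B"], ["B", "A"]]
two_option_result = pairwise_winner(two_option_ballots, "A", "B")

# Three options: every pairwise contest has a majority winner,
# but the winners form a cycle (A beats B, B beats C, C beats A).
three_option_ballots = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
pairwise = {
    (a, b): pairwise_winner(three_option_ballots, a, b)
    for a, b in combinations("ABC", 2)
}
```

With two options the majority simply picks one; with three, these ballots leave no option that beats both of the others.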

(Funny there are two kinds of people... the ones who know what the knobs of the oscilloscope do and the ones that don't!)

(2) The visual cortex of your brain has layers much like the layers of a convolutional network.

An anti-aircraft missile system has layers of processing from raw signals, from which are discovered momentary blips, which are assembled into tracks, etc.

Matter is made of quarks and electrons, the quarks form protons and neutrons, which form nuclei, which are the core of atoms, which form molecules, etc.

Insofar as we are not dying at 30, freezing in the dark, frightened of the howling of wolves, and believing everything happens because some god wants it to happen, it's because we see a hierarchical structure in the universe.

If you had a million neurons all wired to each other it would be an intractable problem to solve for the coefficients because there are so many of them not to mention so many symmetries that would let you trade these ones over here for those ones over there which would make it hard to get started. The wiring diagram for your brain is not like the wiring diagram for a TV set, but your genetic code does wire certain populations of neurons in certain areas to other populations in other areas and then the neurons fine-tune their coefficients based on your experience.
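A back-of-the-envelope count of the coefficients involved. The layered layout below (1,000 layers of 1,000 neurons, each layer feeding only the next) is an arbitrary assumption chosen to use the same million neurons, not a claim about any real network or brain.

```python
def all_to_all_weights(n_neurons):
    """Directed connections when every neuron is wired to every other neuron."""
    return n_neurons * (n_neurons - 1)

def layered_weights(layer_sizes):
    """Connections when each layer feeds only the next one (biases ignored)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

n = 1_000_000
dense = all_to_all_weights(n)                # 999,999,000,000 coefficients
layered = layered_weights([1_000] * 1_000)   # 999,000,000 coefficients
ratio = dense // layered                     # all-to-all needs ~1000x more
```

Under these assumptions the all-to-all wiring has roughly a thousand times more coefficients to solve for, before even considering the symmetries mentioned above.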

And don't dismiss "programs need a beginning and end" and "input and output" as incidental, they're absolutely essential to writing a program.

Thank you for your thoughtful and detailed response. I'm digesting your points and have some research papers and thoughts I'm going to share tomorrow, but here's an immediate response on #2.

(2) Regarding: "beginning and end", "input and output"

It's my understanding that neural networks suck at learning from periodic functions, one of the most basic functions of importance to human society and natural science.

I'd argue that this isn't JUST because of the math, it's also the assumptions being made.

Regarding: "visual cortex of your brain has layers".

You're talking about an instrument of data collection & data filtering. Not an instrument of inference.

Keep digging deeper...And deeper...you will never find a neuron that can recognize your Grandma. Or your cat. Or that guy you hate at the grocery store.

https://en.wikipedia.org/wiki/Grandmother_cell

This is part of the problem with reductionist thinking and the reductionist approaches I see in ML.

I'll try to expand and explain more tomorrow on where I'm coming from (nonlinear dynamics, complex analysis and chaos theory)

I'll warn you that I am an arch-reductionist and have a PhD in chaos theory.
Then you are exactly who I want to talk to and learn from!!
The kind of wicked problems they talk about involve not everybody being on board with solving the problem (e.g. the drug addict who doesn't want to stop using, the billionaire who would be 100,000 times poorer if wealth was evenly distributed) or not seeing the problem the same way (the white person who would be at most 10% poorer if wealth was evenly distributed but sure gets scared when somebody kicks down the door at the gas station and steals all the green Newports.)
I posted it before I had realized you had a background in chaos theory. It's being discussed on HN right now, so it seemed timely and relevant.

That said, the paper still touches on the same problems with reductionism and simplification of complex systems.

To quote from the paper:

We draw on the ‘reductive tendency’, a process through which individuals simplify complex systems into cognitively manageable representations. While simplified representations offer benefits, such as quicker decision-making, such representations are often inaccurate as they overlook the complexities of the problem at hand.

--

Compounding these factors is the nonlinear nature of wicked problems, where “cause and effect relationships are either unknown or highly uncertain.”

Second, wicked problems present potential entrepreneurs with “radical” uncertainty. Because of their specifically nonlinear and interrelated complexity, wicked problems “have no closed form definition”

--

Multiple reasons have been offered as to why reduction is so common. For example, the ability to reason about complexity requires a range of components to be prioritized to understand how they relate within a system. As this is difficult, individuals adopt understandings that are simpler in nature, thereby reducing the perceived complexity of a problem (Feltovich, Spiro, and Coulson, 1993). Others suggest that the tendency is a habitual carry-over from the rudimentary and routinized way that beginners are introduced to a concept (Gibson and Spelke, 1983). For many individuals, simpler conceptual forms are often employed to introduce a topic (Feltovich et al., 1989). This may, however, set up path-dependent learning that relies on reduction as a crutch (Feltovich et al., 1986). Another argument arises from motivational psychology and the finding that people prefer a middle level of complexity in their lives; concepts that are too simple are deemed boring, while concepts that are too complex are off-putting and do not attract engagement (Berlyne, 1971).

Research has identified 11 dimensions or manifestations of the reductive tendency (Feltovich et al., 2004; Hmelo-Silver and Pfeffer, 2004). We organize these into three categories.

The first pertains to simplifying processes and entails four dimensions: continuous processes are simplified into ones with discrete steps; interactive processes that depend on each other are simplified to be independent and separated; concurrent processes are simplified to be sequential; and nonlinear explanatory relationships are simplified into linear ones.

The second category pertains to perspective restrictions. This category describes situations in which individuals minimize the importance of, or ignore altogether, facets or manifestations of phenomena. This category includes three dimensions whereby individuals simplify: concepts necessitating multiple representations to single ones; phenomena with numerous and ambiguous causal mechanisms to ones with simple and clear causal agents, and; concepts with covert or abstract elements to surface-level, apparent ones.

The third category contains four dimensions that pertain to forming standardized representations of phenomena. It captures situations in which individuals simplify: concepts necessitating dynamic understanding of inputs into static ones; heterogeneous schemes or facets of a phenomena into uniform or highly similar; context-sensitive phenomena into universal ones; and regularity to replace situations that are characterized by asymmetric, inconsistent, or complex patterns

--

^ This closely matches the major points I'm whining about.

My take is that Ashby's Law rules the roost

https://www.businessballs.com/strategy-innovation/ashbys-law...

Namely you have to simplify any problem in order to talk about it, solve it, teach it (making some of those reductions) but there is a certain amount of complexity that is fundamental to the problem.

For instance you can sometimes get away with treating a concurrent process as sequential, sometimes you can't.

The reductionist prays for the wisdom to know which simplifications they can get away with and which ones they can't. If your model captures the essential features you are OK, otherwise you are lost in the woods.

My journey with Machine Learning so far:

:D Oh, nonlinear equations! This is something I know a lot about.

:) I think I see...so they use nonlinear equations in the activation function. This helps to create divergence, or sensitive dependence on initial conditions.

:| Wait it's a sigmoid function?? Wtf that's boring.

:( They're just trying to min/max a data set, and figure out probability as it relates to that min/max. But that sucks, because most of the interesting phenomena in nature exist in BETWEEN zero and one! All the fun, cool stuff happens in the middle! You can't reduce it down to a probability, there's no way that's going to do a good job describing anything!

Thanks for the question and discussions. Any books about this knowledge? Very insightful.