by fooker 2854 days ago
So, curve fitting?

Not in the context of a randomized controlled experiment. :-D Random assignment of the treatment is a really fantastic platform for causal inference.
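To make the point about random assignment concrete, here is a toy simulation. It is only a sketch: the effect size `TRUE_EFFECT`, the confounder strength of 3.0, and the function name `simulate` are all made up for illustration. A hidden variable influences both who gets treated and the outcome, so the naive difference in means is biased in the observational setting, while a coin-flip assignment breaks that link.

```python
import random
from statistics import mean

TRUE_EFFECT = 2.0  # assumed causal effect of the treatment (illustrative value)

def simulate(n, randomized, seed=0):
    """Return the naive estimate: mean(treated outcomes) - mean(control outcomes)."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # hidden confounder, never seen by the analyst
        if randomized:
            t = rng.random() < 0.5          # coin flip: independent of z
        else:
            t = z + rng.gauss(0, 1) > 0     # z also drives who gets treated
        # Outcome depends on the treatment AND on the confounder.
        y = TRUE_EFFECT * t + 3.0 * z + rng.gauss(0, 1)
        (treated if t else control).append(y)
    return mean(treated) - mean(control)

observational = simulate(20000, randomized=False)
experimental = simulate(20000, randomized=True)
print(f"observational estimate: {observational:.2f}")
print(f"randomized estimate:    {experimental:.2f}")
```

Under these assumptions the randomized estimate should land close to `TRUE_EFFECT`, while the observational one overshoots because the confounder's contribution is mixed into it.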
Exactly. Judea Pearl's The Book of Why opened my eyes to the fact that most of what happens in machine learning is really just curve fitting.

It connected with what I've heard Chomsky say about trying to develop laws of physics by filming what's happening outside the window. We need to do experiments and interventions to learn the dynamics of a system.

"What do you think the role is, if any, of other uses of so-called big data? [...]

NOAM CHOMSKY: It’s more complicated than that. Let’s go back to the early days of modern physics: Galileo, Newton, and so on. They did not organize data. If they had, they could never have reached the laws of nature. You couldn’t establish the law of falling bodies, what we all learn in high school, by simply accumulating data from videotapes of what’s happening outside the window. What they did was study highly idealized situations, such as balls rolling down frictionless planes. Much of what they did were actually thought experiments.

Now let’s go to linguistics. Among the interesting questions that we ask are, for example, what’s the nature of ECP violations? You can look at 10 billion articles from the Wall Street Journal, and you won’t find any examples of ECP violations. It’s an interesting theory-determined question that tells you something about the nature of language, just as rolling a ball down an inclined plane is something that tells you about the laws of nature. Scientists use data, of course. But theory-driven experimental investigation has been the nature of the sciences for the last 500 years.

In linguistics we all know that the kind of phenomena that we inquire about are often exotic. They are phenomena that almost never occur. In fact, those are the most interesting phenomena, because they lead you directly to fundamental principles. You could look at data forever, and you’d never figure out the laws, the rules, that are structure dependent. Let alone figure out why. And somehow that’s missed by the Silicon Valley approach of just studying masses of data and hoping something will come out. It doesn’t work in the sciences, and it doesn’t work here."

- https://www.rochester.edu/newscenter/conversations-on-lingui...

It is actually a really interesting subject. Marketing people running A/B tests for ads and features seem at least a little closer to the experimental ideal than people who are just fitting curves to data.

For further reading, I'd recommend the epilogue of Causality (Pearl 2000). It's from a 1996 lecture at UCLA:

- http://bayes.cs.ucla.edu/BOOK-2K/causality2-epilogue.pdf

There are a lot of subtle points to make here. There is a tendency to throw data at models that don't capture parts of a distribution, and it is definitely true that many of the tail events in a challenging domain will not occur again no matter how long we observe the domain. Successful machine learning systems are able to predict these outcomes without having seen the data previously because they have captured the theory that creates them. Unfortunately it is very difficult to determine when a model is capturing the domain theory and when it is just modelling a distribution - often the only way is to "know" that it's a bit fishy. In many domains this difference doesn't matter. Vision in animals seems to work in this way - it's all approximations and sameasis, and we and machines get tricked by optical illusions and so on. Other domains (many in physics) are modelled by observing data and inferring a higher-level theory. Early physics didn't work this way - Chomsky is right, but the method of Galileo is not the only method. Modern scientists do organise data and do look for exceptions and regularities, which then drive the search for explanatory systems with predictive power.
I really violently oppose this characterization of ML as "just" curve fitting, as if curve fitting were some simple solved problem. It seems like there is an ignorance about issues relating to model selection, which is an essential part of curve fitting. What complexity of model does the data support? Can you keep a distribution over structures that allows uncertain parts of the model to be interrogated? These are the parts of the fitting equation that allow something like "experiments" to be automatically generated as part of the curve fitting.
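As a rough illustration of the "what complexity of model does the data support?" question, here is a minimal sketch that picks a polynomial degree by held-out validation error. Everything in it is an assumption made for the example: the quadratic ground truth, the noise level, the sample sizes, and the helper names `fit_poly`, `mse`, and `sample`. Validation error is only one of several model selection criteria; it stands in here for the general idea.

```python
import random

def fit_poly(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first) via normal equations."""
    m = degree + 1
    # Augmented matrix [A^T A | A^T y] for the monomial basis 1, x, x^2, ...
    aug = [[sum(x ** (i + j) for x in xs) for j in range(m)]
           + [sum(y * x ** i for x, y in zip(xs, ys))]
           for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial on the given points."""
    def predict(x):
        return sum(c * x ** k for k, c in enumerate(coeffs))
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)

def sample(n):
    """Noisy draws from an assumed quadratic ground truth."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [1.0 - 2.0 * x + 3.0 * x * x + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

train_x, train_y = sample(30)
val_x, val_y = sample(200)

# Fit each candidate complexity on the training set, score it on held-out data.
val_error = {d: mse(fit_poly(train_x, train_y, d), val_x, val_y) for d in range(7)}
best_degree = min(val_error, key=val_error.get)
for d, err in val_error.items():
    print(f"degree {d}: validation MSE {err:.4f}")
print("selected degree:", best_degree)
```

Degrees 0 and 1 cannot represent the curvature and score badly. Degrees above 2 fit the training points at least as well but gain little or nothing on held-out data, which is the sense in which the data "supports" only so much complexity.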
Not the same kind of experiment. An experiment in the scientific sense tweaks the process that generates the data, not the interpretation of the data. There is an inspiration / hypothesis creation step between old data and new experiment.

Main differences: A hypothesis is sorta kinda like your model's coefficients, but more generally applicable. And you have no feedback loop between model coefficients and input data.

So yeah, you are doing very sophisticated curve fitting. It is useful alright, it's just not very much like science.

No, it's the same. It is just about having access to control variables.
What Chomsky is saying is that the control variables don't exist until you create them because the most telling things don't happen until you have a specific hypothesis and make them happen to test the hypothesis.
I disagree. What he is saying is that there is a special rule for languages that he doesn't think you would get at without an enormous amount of data. So a passive learning algorithm wouldn't uncover this structure in a reasonable amount of time or data (I guess it is poor sample efficiency he is worried about). A learning algorithm that has a distribution over its own internal model of language would be able to ask questions that minimize the uncertainty of the model.
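A toy sketch of the passive versus active distinction, far simpler than anything in linguistics: the learner has to locate a hidden threshold on [0, 1]. The constant `THRESHOLD` and the functions `label`, `passive`, and `active` are hypothetical names invented for this example. The passive learner takes whatever points come along; the active learner always asks about the point that halves its remaining uncertainty.

```python
import random

THRESHOLD = 0.3141  # hidden parameter the learner tries to locate (illustrative)

def label(x):
    """Oracle: answers a single query about the hidden rule."""
    return x >= THRESHOLD

def passive(n, seed=0):
    """Observe n random points; return the width of the interval still consistent."""
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0
    for _ in range(n):
        x = rng.random()        # no control over what gets observed
        if label(x):
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return hi - lo

def active(n):
    """Choose each query to halve the uncertainty interval (bisection)."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        x = (lo + hi) / 2       # the most informative question available
        if label(x):
            hi = x
        else:
            lo = x
    return hi - lo

print("remaining uncertainty after 20 passive samples:", passive(20))
print("remaining uncertainty after 20 active queries: ", active(20))
```

With n queries the active learner's uncertainty shrinks like 2^-n, while the passive learner's shrinks roughly like 1/n. That gap is the sample-efficiency argument in miniature.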
But what you describe is still curve fitting. I say this in spite of some expertise in ML myself. There are some parts of ML that do not fall in the curve fitting family, but they are still a small part: for example, Markov logic networks and some parts of reinforcement learning.

What you are saying is that curve fitting with good predictive ability is not trivial, and that is indeed true.

Markov Logic Networks are still about finding coefficients for a probability distribution over some process. My opinion is that there is only curve fitting. There is data and a minimum complexity model that can reproduce the data with minimum error. So do you really believe that there are physical processes where this approach will fail?
There is more to Markov Logic than estimating parameters. That's the "unification" part of the Markov logic, the analogue of https://en.wikipedia.org/wiki/Unification_(computer_science)
> Galileo

Nobody is interested in having a machine discover the theory behind parabolic trajectories. That was solved science 400 years ago.

What is interesting is having a machine that can estimate a parabolic trajectory, not deductively, but inductively, based only on visual observation, for objects of a variety of shapes and sizes. The way a human does.

Galileo was a great scientist, and discovered many natural laws relating to motion, but that wouldn’t have made him a great dodgeball player.
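For what "inductive" estimation might look like in the simplest possible case, here is a hedged sketch that recovers the gravitational acceleration purely from a sequence of noisy observed heights, without being handed the law of motion as a formula to invert. The frame interval `DT`, the launch speed, the noise level, and the function names are all assumptions for illustration; real visual tracking is far messier.

```python
import random

G = 9.81    # value used only to generate the toy observations
DT = 0.05   # assumed time between observed frames, in seconds

def observe(n, v0=12.0, noise=0.001, seed=0):
    """Noisy heights of a thrown object, standing in for visual tracking."""
    rng = random.Random(seed)
    return [v0 * (i * DT) - 0.5 * G * (i * DT) ** 2 + rng.gauss(0, noise)
            for i in range(n)]

def estimate_g(heights):
    """Average second difference of the heights; for a parabola it equals -g * DT**2."""
    second_diffs = [heights[i + 1] - 2 * heights[i] + heights[i - 1]
                    for i in range(1, len(heights) - 1)]
    return -sum(second_diffs) / len(second_diffs) / DT ** 2

g_hat = estimate_g(observe(40))
print(f"estimated g from observations: {g_hat:.2f} m/s^2")
```

The estimator only looks at how successive positions change, which is closer in spirit to tracking a ball than to deriving its equation of motion.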