Hacker News new | ask | show | jobs
by chimeracoder 5163 days ago
> In a PhD, essentially everything you learn is to master a particular topic, or solve some kind of problem. This can often involve programming (it did for me) and almost certainly involves statistics (again, it did for me).

Yes, but the original point was that more or less any quantitative PhD would be expected to have these skills.

> For instance, I was the only person in my department to learn R

Case in point - and I can tell you from knowing the PhD students that I if I found someone who knew how to program in any language not used primarily for statistical computation (R, Stata, Matlab, SAS etc.), I would consider them the exception, not the norm.

The exact opposite is true about a data scientist.

> but if you are not capable of performing this kind of analysis that you probably shouldn't be doing a PhD anyway.

Or you just don't care about those types of jobs - and apparently there are plenty of those, because many, if not the majority, of PhD students I can think of aren't looking for data science jobs.

> I really don't see how programming upends all that grad students learn (though I would be delighted to hear your thoughts), as to me it just seemed like the application of logic with the aid of computers.

It's not programming per se, but the computational power that it brings makes certain techniques feasible, and other concepts and methods aren't obsolete - just no longer optimal. This is really a comment about statistics specifically. Most job postings for data science positions mention some form of the phrase 'machine learning' - and if they don't, they often have that in mind. Unfortunately, while demand for machine learning dominates the job market, in the grand scheme of things, it's just one branch in the field of statistics, and its 'parent' branch was relatively obscure until very recently. To this day, if a PhD student finished their program having next to none of the required academic background for machine learning, I doubt most academics would bat an eyelid. It's just not considered important from an academic standpoint. It's unfortunate that we have such a disconnect between academic interest and industry demand, but it's very much the case.

A basic example that I often cite about how computational power has fundamentally changed statistics from how it was for the previous few decades is in our selection of estimators. (I often cite this because anybody who's ever taken a statistics class probably had this experience). In every introductory statistics class (and for many non-intro classes as well), when studying inference, you spend 90% of your time talking about estimators for which the first moment has an expectation of zero, and the 10% is a 'last resort' when no 'better' estimator exists. Who decided that the first moment was the most important? What about the second? Third?

Well, it turns out that the first moment is easier to calculate, and, by coincidence, it happens to be the most relevant when your dataset is small (say, between 30 and 100). But once you're talking about datasets with observations which number in the thousands (which is still 'small' by some standards today!), you'd be insane to throw out any estimator that converges at a linear rate (rather than at the rate of \sqrt{n}) just because it introduces a small bias.

But we do - and that's reflected in the the sheer amount of academic research and literature that discusses the former, and the sheer lack of that reflects the latter. In many cases, the theory exists, but it was developed in an era in which it could never feasibly be applied.

Vestiges of this era are visible even in many statistical software packages - for another basic example, regressions by default assume homoskedasticity in errors, even though this is almost never valid in real life. Why? Because in a previous era, everyone imposed this assumption, because while the theory behind the alternative had been developed, it was expensive to carry out in practice (it involves several extra matrix multiplications).

I'm painting with a broad brush, but the general picture still very much holds.

1 comments

I completely see your point on estimators and the nature of many if not most statistics courses. I suppose that I was lucky enough to study non-parametric statistics in my first year of undergrad, as there's a subset within psychology that's very suspicious of all the assumptions required for the traditional estimators.

That being said, I think you're missing my major point which is that a PhD should be a journey of independent intellectual activity, so the courses one takes should be of little relevance, and so can therefore be downweighted in considering what PhD students actually learn. I accept that this is an idealistic viewpoint (FWIW, the best thing that ever happened to my PhD in this context was my supervisor taking maternity leave twice during my studies, which forced me to go to the wider world for more information about statistics).

I accept your point about machine learning not being a major focus of academia (well except for machine learning researchers), and I think its awful. Its very sad that the better methods for dealing with large scale data and complex relationships are only used by private companies. Its not that surprising, but it is sad.

That being said, I firmly believe that any halfway skilled quantitative PhD can understand machine learning, most of which is based on older statistical methods. It may not be taught (yet), but its not that much of a mindblowing experience. I do remember that when I first heard about cross-validation I got shivers down my spine, but that may just be a sad reflection on my interests rather than a more general point.