> If academic Data Science programs aren't emphasizing data engineering as part of their curriculum, what differentiates a Data Science program from statistics or business intelligence?

In my experience, they're emphasizing software-based data work like machine learning, but not the (vital) peripherals like cleaning/studying/loading data or monitoring and sanity-checking outputs. A data science student might get a process-first task like making predictions from data using KNN, regressions, t-tests, or neural nets, choosing a method and optimizing based on performance. A statistics student might focus on theory, choosing an appropriate analysis method in advance based on the dataset, and reasoning about the effects of error instead of just trying to reduce it. But the data scientist could still be training on a clean, wholly theoretical dataset or a highly predictable online-training environment. The result is a lot of entry-level data scientists who are mechanically talented but stymied by real-world hurdles. Issues handling dirty or inconstant data, for one. But there are a lot of others: a tendency to do analysis in a vacuum, without taking advantage of knowledge about the domain and data source; or judging output effectiveness based on training accuracy, without asking whether the dataset is (and will stay) well-matched to the actual task. I don't mean that to sound dismissive; there are lots of people who do all of that well, even newly trained. But it does seem to be a common gap in a lot of data science education.
I'm currently working on an assignment for CV in which we extract Histogram of Oriented Gradients (HOG) features from the CIFAR-10 dataset using Python, then use them to train one of three classifiers (SVM, Gaussian Naive Bayes, Logistic Regression). I had asked about preprocessing, but was told it was outside the scope of this assignment, so we're just using the dataset as-is. :(
The nice bit is, I have a research internship coming up in a lab that will have me working on actual datasets, rather than toy examples. And there's a data science club on campus with an explicit focus on cleaning data, which I plan on attending regularly. So... hopefully I'm on the right track!
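
For anyone curious what that kind of HOG-then-classifier pipeline looks like, here is a rough sketch. It is not the actual assignment code: the use of scikit-image and scikit-learn is an assumption (the comment only says Python), the HOG parameters are illustrative defaults, and the images are random arrays with CIFAR-10's shape (32x32 RGB, 10 classes) standing in for the real dataset. Logistic regression is picked arbitrarily from the three classifiers mentioned.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def hog_features(images):
    """Return one HOG descriptor per RGB image (computed on a grayscale copy)."""
    feats = []
    for img in images:
        gray = rgb2gray(img)  # collapse RGB to a single channel before HOG
        feats.append(
            hog(
                gray,
                orientations=9,          # illustrative parameter choices,
                pixels_per_cell=(8, 8),  # not taken from the assignment
                cells_per_block=(2, 2),
                block_norm="L2-Hys",
            )
        )
    return np.asarray(feats)


# Stand-in data with CIFAR-10's shape; NOT the real dataset.
rng = np.random.default_rng(0)
images = rng.random((200, 32, 32, 3))
labels = rng.integers(0, 10, size=200)

X = hog_features(images)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Score on held-out data rather than on the training split.
test_accuracy = clf.score(X_test, y_test)
print("feature length:", X.shape[1])
print("held-out accuracy:", test_accuracy)
```

On random stand-in data the held-out accuracy should hover around chance, which is the point of scoring on a held-out split: training accuracy alone would not reveal that. Swapping in `sklearn.svm.SVC` or `sklearn.naive_bayes.GaussianNB` for the classifier leaves the rest of the sketch unchanged.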