by tgflynn 5166 days ago
Another thing that increases difficulty is that, depending on people's background and experience, there are very different views on what is most important.

For example, in my view exploratory data analysis and visualization are less important than using strong models and figuring out how to apply them to problems. I say this because I haven't seen any visualization methods that really tell you much about how hard or easy it will be to develop a predictive model. Sure, you can do a 2-D LDA projection, and if there's a huge amount of overlap you know you're not looking at a trivial problem. But if the problem is linearly separable, someone's probably already got a good solution in Excel.
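To make the LDA overlap check concrete, here is a rough sketch in plain Python. It is a simplification: it uses the two-class Fisher discriminant on two features, which projects onto a single axis rather than the 2-D projection mentioned above. The function names and the overlap measure (the fraction of points whose projection falls in the range shared by both classes) are my own assumptions for illustration, not a standard diagnostic.

```python
def mean(rows):
    """Column-wise mean of a list of 2-feature points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, mu):
    """2x2 scatter matrix of the points around their class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Fisher discriminant direction w = Sw^-1 (mu_b - mu_a)."""
    mu_a, mu_b = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu_b[0] - mu_a[0], mu_b[1] - mu_a[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def overlap_fraction(class_a, class_b):
    """Fraction of all points whose projection lies in the range
    covered by both classes (0.0 means the projections do not overlap)."""
    w = fisher_direction(class_a, class_b)
    pa = [r[0] * w[0] + r[1] * w[1] for r in class_a]
    pb = [r[0] * w[0] + r[1] * w[1] for r in class_b]
    lo = max(min(pa), min(pb))
    hi = min(max(pa), max(pb))
    if lo > hi:
        return 0.0
    inside = sum(1 for v in pa + pb if lo <= v <= hi)
    return inside / (len(pa) + len(pb))
```

A value near zero only says the classes separate along that one linear direction on the data you have. A large value says the problem isn't trivially linear, which is about all this kind of check can tell you.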

As for the "Big Data" buzzword, it applies well to some problems, like NLP or web analytics, where massive datasets are available. In these cases it's clear that the more densely your data samples the problem space, the better your performance will be, and even very simple models will perform well.

However, there are many applications where the amount of available training data is not so large and you need to use models which are powerful enough to discover non-obvious patterns. Applying such models and adequately evaluating them, which is critical to avoiding over-fitting with relatively small data sets, requires developing quite complex processes.
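As a small illustration of why the evaluation step matters, here is a sketch of k-fold cross-validation in plain Python. The 1-nearest-neighbor classifier is just a stand-in for "a model flexible enough to memorize its training set", and the toy data and function names are made up for the example. A real process would add things like stratified folds, repeated runs, and keeping any model tuning inside the folds.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nearest_neighbor_predict(train_x, train_y, x):
    """Label of the training point closest to x (squared Euclidean distance)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[best]


def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test points the 1-NN model labels correctly."""
    hits = sum(nearest_neighbor_predict(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)


def cross_validate(xs, ys, k=5, seed=0):
    """Mean held-out accuracy over k folds."""
    if k < 2 or k > len(xs):
        raise ValueError("k must be between 2 and the number of samples")
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        scores.append(accuracy([xs[i] for i in train], [ys[i] for i in train],
                               [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return sum(scores) / len(scores)


# Hypothetical toy data: 20 distinct points whose labels are pure noise.
rng = random.Random(1)
xs = [(float(i), float((i * 7) % 11)) for i in range(20)]
ys = [rng.randint(0, 1) for _ in xs]

# Scored on the data it memorized: every point is its own nearest
# neighbor, so this is always 1.0 no matter what the labels are.
train_acc = accuracy(xs, ys, xs, ys)

# Scored on folds the model never saw during fitting.
cv_acc = cross_validate(xs, ys, k=5)
```

The training score is perfect even though there is nothing to learn, so only the held-out score says anything about how the model would do on new data.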