by minimaxir 3233 days ago
"Rosetta Stone" implies that there is a universal stratagem for processing any dataset in any language.

One of my pet peeves with everyone using the Titanic dataset as a Hello World for data science is that real-world datasets are not as clean or intuitive. ETL and variable selection are half the battle, if not more.

3 comments

No, but the Titanic set lets you practice essential skills and dip your toe into Kaggle.

You still need to clean up the missing data and do some sort of imputation. (Edit: you cannot use randomForest until you deal with those missing values, in R at least, and from a theory perspective I don't recall CART handling missing data.)
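As a rough illustration of the imputation step, here is a minimal sketch in plain Python. The rows, the `age` column, and the `impute_median` helper are all hypothetical, and median fill is just one simple strategy among many, not a recommendation.

```python
from statistics import median

# Hypothetical toy rows shaped loosely like the Titanic data; None marks a missing age.
passengers = [
    {"name": "A", "age": 22.0},
    {"name": "B", "age": None},
    {"name": "C", "age": 38.0},
    {"name": "D", "age": 26.0},
]

def impute_median(rows, column):
    """Return copies of rows with missing values in `column` replaced by the median of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    # Build new dicts so the original rows are left untouched.
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

cleaned = impute_median(passengers, "age")
```

After this, every row in `cleaned` has a numeric age, which is the kind of precondition a model that rejects missing values would need.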

If you want to eke out those last few percentage points of accuracy you have to do feature engineering, which gave me the chance for the first time to actually understand and practice it.
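One commonly used example of feature engineering on this dataset is pulling the honorific out of the passenger name. A minimal sketch, assuming names follow a "Surname, Title. Given names" pattern; the `extract_title` helper and the sample names are made up for illustration.

```python
import re

def extract_title(name):
    """Pull the honorific out of a 'Surname, Title. Given names' string."""
    # Capture whatever sits between the first comma and the next period.
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else "Unknown"

# Hypothetical names in the assumed format.
names = ["Smith, Mr. John", "Doe, Mrs. Jane (Mary Roe)", "No title here"]
titles = [extract_title(n) for n in names]
```

The resulting title can then be treated as a categorical feature alongside the original columns.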

And of course you gotta do EDA on it.

The only thing I've seen that is bad about the Titanic set is the data leakage. People are getting 100% accuracy because you can look up who actually died, or mix the test data in with your training data to inflate your model's accuracy. But it also introduces you to the concept of data leakage.
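A smaller, subtler version of the same leakage problem is letting test rows influence preprocessing. The sketch below uses made-up age values and hypothetical helpers `fit_median` and `apply_fill`; it learns the fill value from the training split only and then reuses it on the test split.

```python
from statistics import median

# Hypothetical age columns; None marks a missing value.
train_ages = [22.0, None, 38.0, 26.0]
test_ages = [None, 80.0]

def fit_median(values):
    """Learn the fill value from the observed (non-missing) entries."""
    return median(v for v in values if v is not None)

def apply_fill(values, fill):
    """Replace missing entries with a previously learned fill value."""
    return [fill if v is None else v for v in values]

fill = fit_median(train_ages)               # learned from the training split only
train_filled = apply_fill(train_ages, fill)
test_filled = apply_fill(test_ages, fill)   # test rows never influence `fill`

# The leaky alternative: the statistic is computed with test rows included.
leaky_fill = fit_median(train_ages + test_ages)
```

With these toy numbers `fill` and `leaky_fill` differ, which is the point: anything computed with the test rows in view has already seen part of the answer.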

I think the Titanic dataset is nice and compact, so it lets you practice a variety of skills within the data science domain. Much better than when I had to deal with medical genetic data.

> variable selection

You mean multivariate data?

I think there's a reason why in applied statistics you take statistics and regression first before you jump into multivariate analysis.

Unless you just want to blindly do PCA and factor analysis on everything under the sun without understanding the theory, sure.

Neither "Rosetta Stone" nor "hello world" implies anything about "universal". Actually, quite the opposite. The real Rosetta Stone is a very small subset of the 3 languages, but paved the way for greater understanding. The real Rosetta Stone text is actually pretty mundane. And "hello world" apps are anything but "universal". This is just how to do a simple (non-universal) data science task in 5 languages.

Of course, but the objective of this writeup appears to be illustrating the same basic problem in a few different languages.

It does strike me as a good idea to do something similar for data manipulation/cleansing. If I ever find some free time, I'll write it and post it somewhere.