by minimaxir
3233 days ago
"Rosetta Stone" implies that there is a universal stratagem for processing any dataset in any language. One of my pet peeves with everyone using the Titanic dataset as a Hello World for data science is that real-world datasets are not as clean or intuitive. ETL and variable selection are half the battle, if not more.
You still need to clean the missing data and do some sort of imputation. (Edit: you cannot use randomForest, in R at least, until you deal with those missing values, and from a theory perspective I don't recall CART handling missing data.)
If you want to eke out those last few percentage points of accuracy you have to do feature engineering, which gave me the chance to actually understand and practice feature engineering for the first time.
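To make the imputation step concrete, here is a minimal sketch in plain Python. The `ages` list and the `impute_median` helper are made up for illustration, and median fill is only one of many possible strategies, not necessarily the best one for this dataset.

```python
from statistics import median

# Hypothetical age column; None marks a missing value.
ages = [22.0, 38.0, None, 35.0, None, 54.0]

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

imputed = impute_median(ages)
print(imputed)
```

Once no `None` values remain, the column can be handed to a model that refuses missing inputs.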
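As a rough illustration of what feature engineering can look like here: a sketch that derives a title and a family size from raw fields. It assumes names follow a "Surname, Title. Given names" pattern and that sibling/spouse and parent/child counts are available. The helper names are hypothetical.

```python
import re

def extract_title(name):
    """Pull the honorific out of a 'Surname, Title. Given names' string."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else "Unknown"

def family_size(siblings_spouses, parents_children):
    """Count the passenger plus any relatives travelling with them."""
    return siblings_spouses + parents_children + 1

# Illustrative inputs in the assumed name format.
print(extract_title("Braund, Mr. Owen Harris"))  # Mr
print(extract_title("no comma here"))            # Unknown
print(family_size(1, 0))                         # 2
```

Whether features like these actually improve accuracy is something to check with cross-validation rather than assume.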
And of course you gotta do EDA on it.
The only thing I've seen that is bad about the Titanic dataset is the data leakage. People are getting 100% accuracy because you can look up who actually died, or mix the test data in with your training data to inflate your model's accuracy. But it also introduces you to the concept of data leakage.
I think the Titanic dataset is nice and compact, and it lets you practice a variety of skill sets within the data science domain. Much better than when I had to deal with medical genetic data.
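A small sketch of the "test data mixed into train data" flavor of leakage, using made-up fare values and hypothetical helper names. The point is only that any statistic used for preprocessing should be computed from the training rows alone.

```python
from statistics import mean

# Made-up fares; None marks a missing value.
train_fares = [10.0, 20.0, 30.0, None]
test_fares = [100.0, None]

def fit_fill_value(values):
    """Compute the mean of the observed (non-missing) values."""
    return mean([v for v in values if v is not None])

def apply_fill(values, fill):
    """Replace None entries with a previously fitted fill value."""
    return [fill if v is None else v for v in values]

# Correct: the fill value comes from the training rows only.
fill = fit_fill_value(train_fares)
test_filled = apply_fill(test_fares, fill)

# Leaky: the test rows influence the statistic applied to them.
leaky_fill = fit_fill_value(train_fares + test_fares)

print(fill, leaky_fill, test_filled)
```

The two fill values differ, which shows the test set has influenced the preprocessing in the leaky version.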
> variable selection
You mean multivariate data?
I think there's a reason why in applied statistics you take statistics and regression first, before you jump into multivariate analysis.
Unless you just want to blindly do PCA and factor analysis on everything under the sun without understanding the theory, sure.