by joshwd 4773 days ago
What really differentiates data science from econometrics, other than (I guess) the type of data being analysed?

I would like to know this as well.
1) Size of data: Most econometricians work on small data sets (mostly in MBs), which they can keep in RAM and analyze with R and Excel. Modern-day data scientists have to deal with GBs (sometimes TBs or even PBs) of data. For data that large you need multiple machines, or even hundreds of machines, so you need to be good at distributed computing and frameworks like Hadoop, Hive, etc.

2) Visualization: Such large data sets cannot always be expressed in bar charts or pie charts, so standard charting tools like Excel and R don't work. You need good knowledge of charting libraries like D3 or OpenGL (for 3D visualization) to analyze the data and express your findings.

3) Type of data: Econometricians are never comfortable with unstructured data sets consisting of Twitter feeds and Apache logs. Good knowledge of machine learning and graph algorithms is becoming essential. Apache Mahout, a machine learning framework built on top of Hadoop, is looking extremely promising.

I would also add that econometricians are highly focused, almost exclusively focused in fact, on finding causal relationships.

This means that descriptive work, such as clustering or dimension reduction, is often either ignored or considered a kind of pre-processing before the real work starts.

I think this is a big one, and one of the reasons I would be uncomfortable calling myself a "data scientist" despite meeting some of the more tool-oriented definitions - my work has a much larger focus on attempting to infer causality.
Things I didn't learn in my econometrics classes that are used all over in data science:

1) Machine learning techniques for analysing data sets as opposed to parametric models

2) Clustering (k-means, etc.)

3) TF / IDF

4) Using a variety of data sources / tools - my econometrics education was heavily Stata dependent. Learn a little bit of SQL, R, and Matlab so that getting up to speed doesn't take you longer than a month.
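
For anyone who hasn't seen item 2 before, here is a rough sketch of k-means in plain Python. The language, the function name kmeans, its parameters, and the toy points are all my own assumptions for illustration, not anything from a particular course or library. It just alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Toy k-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        for i, cluster in enumerate(clusters):
            if cluster:  # leave the centroid alone if nothing was assigned to it
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters

# Made-up data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

Note there is no outcome variable and no model of a causal relationship anywhere in this; it only describes structure in the data. In practice you would reach for an existing implementation rather than rolling your own.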

Thanks. What do you mean by TF / IDF?
It's a method of determining which words in an arbitrary collection of documents (tweets, for instance) are most important when classifying those documents.

Term-Frequency-Inverse-Document-Frequency. Assigns each word a score based on how often it appears in a document relative to how often it appears in all documents.

https://en.wikipedia.org/wiki/Tf%E2%80%93idf
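
To make that concrete, here is a minimal sketch in plain Python. It assumes one common variant (tf = count / document length, idf = log(N / document frequency)); there are several others, and the function name tf_idf and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs is a list of token lists. Uses tf = count / len(doc) and
    idf = log(N / df), which is only one of several common weightings.
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Made-up documents, already split into words.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)
```

With this weighting, "the" scores zero in every document because it appears in all three, while "mat" outranks "cat" in the first document because "mat" appears nowhere else.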