| HN Mirror



	by joshwd 4773 days ago
	What really differentiates data science from econometrics, other than (I guess) the type of data being analysed?

1 comments

jjsz 4773 days ago

I would like to know this as well.


neel8986 4773 days ago

1) Size of data: Most econometricians work on small data sets (mostly in MBs) that fit in RAM, and they use R and Excel to analyze them. Modern data scientists, though, have to deal with GBs (sometimes TBs or even PBs) of data. For data that large you need multiple machines, sometimes hundreds, so you need to be good at distributed computing and frameworks like Hadoop and Hive.

2) Visualization: Such large data sets cannot always be expressed in bar charts or pie charts, so standard charting tools like Excel and R don't work. You need good knowledge of charting libraries like D3 or OpenGL (for 3D visualization) to analyze the data and express your findings.

3) Type of data: Econometricians are never comfortable with unstructured data sets consisting of Twitter feeds and Apache logs. Good knowledge of machine learning and graph algorithms is becoming essential. Apache Mahout, a machine learning framework built on Hadoop, is looking extremely promising.


mc-lovin 4773 days ago

I would also add that econometricians are highly focused, almost exclusively focused in fact, on finding causal relationships.

This means that descriptive work, such as clustering or dimension reduction, is often either ignored or considered a kind of pre-processing before the real work starts.


Fomite 4773 days ago

I think this is a big one, and one of the reasons I would be uncomfortable calling myself a "data scientist" despite meeting some of the more tool-oriented definitions - my work has a much larger focus on attempting to infer causality.


dangerlibrary 4773 days ago

Things I didn't learn in my econometrics classes that are used all over in data science:

1) Machine learning techniques for analysing data sets as opposed to parametric models

2) Clustering (k-means, etc.)

3) TF / IDF

4) Using a variety of data sources / tools - my econometrics education was heavily Stata dependent. Learn a little bit of SQL, R, and Matlab so that getting up to speed doesn't take you longer than a month.
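
For anyone who hasn't seen item 2 before, here is a rough sketch of k-means (Lloyd's algorithm) in plain Python. It is only an illustration: the function name `k_means`, the fixed iteration count, the random seed, and the toy points are all my own assumptions, and in practice you would probably reach for a library implementation instead.

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on a list of (x, y) tuples.

    Returns (centroids, labels). Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Start from k distinct points picked at random from the data.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# Made-up toy data: two well-separated groups of three points each.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = k_means(points, k=2)
```

Note there is no dependent variable and no parametric model here: the algorithm just alternates between assigning points to the nearest centre and recomputing the centres, which is why it feels so different from what gets taught in an econometrics sequence.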


jjsz 4773 days ago

Thanks. What do you mean by TF / IDF?


dangerlibrary 4773 days ago

It's a method of determining which words in an arbitrary collection of documents (tweets, for instance) are most important when classifying those documents.

Term frequency-inverse document frequency. It assigns each word a score based on how often it appears in a document relative to how often it appears across all documents.

https://en.wikipedia.org/wiki/Tf%E2%80%93idf
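
To make the scoring concrete, here is a minimal sketch in plain Python. It assumes the simple textbook weighting, tf * log(N / df), where tf is a word's share of the document, N is the number of documents, and df is the number of documents containing the word. The function name `tf_idf` and the toy "tweets" are made up for illustration, and real libraries usually add smoothing or normalization on top of this.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document.

    docs is a list of token lists. Uses the plain weighting
    tf * log(N / df), with no smoothing (an assumption of this sketch).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Term frequency (share of the document) times inverse
            # document frequency (rarity across the collection).
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical toy "tweets", already lowercased and split into words.
tweets = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(tweets)
```

In this toy collection "the" appears in every document, so its inverse document frequency is log(3 / 3) = 0 and it scores zero everywhere, while "mat" appears in only one document and ends up with the highest score in the first tweet.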
