Hacker News (HN Mirror)


	by nmkridler 3788 days ago
	I completely disagree. Data scientists who cannot create the data they need are at a significant disadvantage to those who can. Our job is more than being able to analyze and interpret data. If you have someone in your organization who spends no time thinking about how they get the data, you need to fire them or reduce their salary.


v64 3788 days ago

The data scientists I work with are statistics PhDs. The extent of their programming knowledge is R and SQL. What are they supposed to do if the data they need to analyze is only available through a SOAP API you log into with OAuth, and they need to log in once a day to retrieve the latest day of data? Unless you're a software engineer, you probably don't have the skillset necessary to easily get that data.

The data we use comes from relational databases and document stores operated by different departments, external APIs and third-party services, Salesforce, server log files, etc. A stats PhD does not have the training to gather this data themselves.

In terms of a hybrid scientist/engineer role, I don't know many software engineers who are also good at stochastic calculus or ensemble learning. Likewise, I don't know many data scientists who are also comfortable writing cronjobs to retrieve external API data or have the ability to diagnose server problems.
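To make the daily retrieval scenario above concrete, here is a minimal Python sketch of the request-building step only. It is an illustration under stated assumptions, not a description of any real service: the operation name `GetDailyData`, the `urn:example:GetDailyData` action URI, and the token value are all hypothetical, and obtaining the OAuth token and actually sending the request are left out.

```python
from datetime import date, timedelta
from typing import Dict, Tuple
from xml.sax.saxutils import escape

# SOAP 1.1 envelope namespace (standard); everything service-specific below is hypothetical.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def previous_day(today: date) -> date:
    """Return the day before `today`, i.e. the latest complete day of data."""
    return today - timedelta(days=1)


def build_soap_request(token: str, day: date) -> Tuple[Dict[str, str], str]:
    """Build HTTP headers and a SOAP body asking for one day of data.

    `token` is assumed to be an OAuth 2.0 bearer token obtained elsewhere.
    """
    headers = {
        "Content-Type": "text/xml; charset=utf-8",
        "Authorization": f"Bearer {token}",
        # Hypothetical action URI; a real service defines its own in its WSDL.
        "SOAPAction": '"urn:example:GetDailyData"',
    }
    body = (
        '<?xml version="1.0" encoding="utf-8"?>'
        f'<soap:Envelope xmlns:soap="{SOAP_NS}">'
        "<soap:Body>"
        # Hypothetical operation and parameter names.
        f"<GetDailyData><Date>{escape(day.isoformat())}</Date></GetDailyData>"
        "</soap:Body>"
        "</soap:Envelope>"
    )
    return headers, body


if __name__ == "__main__":
    headers, body = build_soap_request("example-token", previous_day(date.today()))
    print(headers)
    print(body)
```

A scheduled job (cron, for example) could run a script built around this once a day and POST the result to the service's endpoint with whatever HTTP client is available.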


nmkridler 3788 days ago

What you are describing is a statistician and that's perfectly fine, but lumping them in with data scientists devalues the role for those of us doing more.


v64 3788 days ago

How would you differentiate the roles of statistician, data scientist, and data engineer? I've heard the titles "statistician" and "data scientist" used interchangeably, and have used them that way myself. The Wikipedia entry for data science [1] gives evidence to support that usage since the late 90s:

"In November 1997, C.F. Jeff Wu gave the inaugural lecture entitled "Statistics = Data Science?" for his appointment to the H. C. Carver Professorship at the University of Michigan. In this lecture, he characterized statistical work as a trilogy of data collection, data modeling and analysis, and decision making. In his conclusion, he initiated the modern, non-computer science, usage of the term "data science" and advocated that statistics be renamed data science and statisticians data scientists."

From the same article, a quote from Nate Silver:

"I think data-scientist is a sexed up term for a statistician....Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn’t berate the term statistician."

If your skillset differs from a statistician's, then calling yourself a data scientist is not going to be a differentiating title in common parlance.

[1] https://en.wikipedia.org/wiki/Data_science#History


nmkridler 3788 days ago

I think the quote and definition from the blog is a good one: “better engineers than statisticians and better statisticians than engineers”. Perhaps that 1997 quote was influential in the decision to use the term Data Science, but I think the current usage encompasses much more than statistics. When I started, it required the ability to push production code, build statistical models, and communicate results effectively. Maybe I'm wrong and maybe the tools got better, but for a while, you couldn't provide value if you couldn't get to the data or create the data you needed.


hadley 3788 days ago

SOAP + OAuth is a weird combination, but you could definitely work with it in R.


v64 3788 days ago

I just randomly picked two of the most painful protocols I could think of :) It doesn't surprise me, though. I feel like I can't go a workday without hearing the phrase "Oh, actually, I can do that in R."


TheLogothete 3788 days ago

I disagree with you.
