by wenc 2042 days ago
You're both right. Sometimes we have to work with the data we have. Other times we have to create or buy the data we need.

Some companies aren't experienced at building infra to collect data and don't know how to do it. Or their environment is too complex or expensive to sample data from. The data scientist's job in such cases is to do their best with what exists, show success and make a business case for investing resources into data collection infrastructure.

In other cases, when the required sensors don't exist and the information is critical to decision making, you can either buy the data or work with an engineering group or external vendor to integrate and build out the sensors needed. Need foot traffic data? You can buy from a data marketplace like https://datarade.ai, where there are various vendors (like SafeGraph -- which was recently used in a COVID-19 study published in Nature) aggregating foot traffic data from cell phones. There are also datasets that can be used as inferential proxies (so-called "alternative data") for the actual data one needs.

Need to collect in-store data? I was at the NRF conference (the world's largest retail tech conference) in NYC back in January and there were a boatload of vendors hawking different types of retail analytics sensors.

In certain small-scale operations, you can even engage field operations and get the in-store retail staff to help collect data and upload it manually. (You'll need a good relationship with the field supervisor, of course.)

Sometimes the data does exist but is inaccessible, say in the ERP or in some proprietary format -- then you have to negotiate with certain business groups or with OEM vendors in order to get the data out.

It all boils down to whether the data has value that exceeds (by a margin) the cost of collecting it. If the answer is yes, there's often a way to do it (albeit sometimes imperfectly).

Is it part of the data scientist's job description to create or participate in creating data collection infrastructure? I guess this depends on the company but for many companies the answer is yes.

1 comments

I agree with you too. I think part of it is that execs and management don't fully grasp the job and the implications of taking too many shortcuts. At the same time, too many data scientists are in it for the keyword/sexiness of the job and are not of the personality type to take hardline stands. Inexperience leads to a lack of trust from higher-ups. A lack of backbone from the experienced results in more incompetent work, which results in even less trust. Experienced personnel leave, more inexperienced people come in and do things the cheap, shortcut, buy-bad-data way, with no backbone to push back against this when they see it... and you can see how this spirals into the kind of shitshow I've been noticing in some consulting projects I've been on.

But yes, data collection should be part of their job. I'm having a hard time understanding why the person who analyzes the data shouldn't at least have a good say in what data is collected.