As a Data Scientist, what challenges do you face?

	As a Data Scientist, what challenges do you face?
	19 points by karishmakunder 2173 days ago
	Some of the challenges I find prevalent are finding value in data, integrating open source software, and a lack of platforms for feedback and improvement.

6 comments

eeegnu 2173 days ago

The most frustrating challenges I've faced boil down to just cleaning the data. It's not too bad when everything is stable and you're just cleaning up a database, though this can still be pretty hard depending on the scale of the operations required. The worst is when I have a live data feed that is liable to occasionally mess up. In one instance I was reading in stock data from an API, and on their end they messed up and sent the same timestamp for two different instances. That caused my local data aggregation to merge them together into a series, and later, when that value was actually queried by code expecting a numpy float, it just crashed. So what I've done to face this is write data processing code that anticipates potential noise and has mechanisms to resolve it, or that raises errors early, by asserting my assumptions, instead of letting me find them a week later.
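To make the "assert your assumptions early" idea concrete, here is a minimal sketch in plain Python. It is not the code from the incident above; `FeedError` and `add_tick` are hypothetical names, and the two assumptions checked (a price is one finite positive number, a timestamp never carries two different prices) are illustrative choices.

```python
import math
from datetime import datetime


class FeedError(ValueError):
    """Raised when an incoming tick violates an assumption about the feed."""


def add_tick(series, timestamp, price):
    """Store one (timestamp, price) tick in `series`, failing fast on bad input.

    `series` is a plain dict mapping timestamp -> float price.
    Returns True if the tick was stored, False if it was an exact resend.
    """
    # Assumption 1: a price is a single finite, positive number, never a container.
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        raise FeedError(f"{timestamp}: expected a number, got {type(price).__name__}")
    if not math.isfinite(price) or price <= 0:
        raise FeedError(f"{timestamp}: implausible price {price}")

    # Assumption 2: a timestamp never carries two different prices.
    if timestamp in series:
        if series[timestamp] == float(price):
            return False  # exact resend of a known tick, safe to ignore
        raise FeedError(
            f"{timestamp}: conflicting prices {series[timestamp]} and {price}"
        )

    series[timestamp] = float(price)
    return True


series = {}
t = datetime(2020, 1, 2, 9, 30)
add_tick(series, t, 101.5)
add_tick(series, t, 101.5)      # same timestamp, same value: ignored
try:
    add_tick(series, t, 102.0)  # same timestamp, different value: rejected
except FeedError as err:
    print(f"rejected at ingest: {err}")
```

Raising at ingest means the bad tick is reported the moment it arrives, rather than surfacing later as a type error when the value is queried. Whether a conflict should raise, keep the first value, or keep the last is a policy decision that depends on the feed.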

I do agree with the general lack of feedback/improvement platforms, at least on the non-analysis side (I've seen good feedback on Kaggle forums before when it comes to questions on problem solving methodology). I don't really follow the "not finding value in data" part though. In my experience it's pretty much a binary question like 'can I use this data I've found to solve my problem, or improve my solution', and if so it's valuable relative to that application.

karishmakunder 2172 days ago

That’s one of the major challenges, where you need the data processing code to reassess what comes in.

Is there any way you share this piece of code across teams? One of the challenges I have seen is how to avoid re-inventing the wheel. Like, it's all there, somewhere; however, across team members, it's quite difficult to pass on that knowledge of “hey, we already have this data processing script” for another similar use case.

eeegnu 2171 days ago

Private git repos, with Jupyter notebooks documenting the scripts, are the primary means of sharing. I have duplicated quite a few things inadvertently though, just due to them being fairly simple and me not asking about it. That's more of a communication issue than anything else though.

karishmakunder 2172 days ago

I agree. Kaggle’s great. Though, personally, I’d prefer a collaborative interface, one that helps improve my model accuracy for example, over a competition-type interface.

stevesycombacct 2172 days ago

A significant chunk of my work involves no use of algorithms, statistics, or math. For the most part, my days are filled with one thing: cleaning up data.

If more firms took an educated, standardized approach to their data, I would do my job significantly faster, and the company would have access to their data-related products in far, far less time. Firms I have worked for, overall, refuse to do that.

Instead, such notions are treated as toxic, and I am too, by the associative property. They think I'm making extra work ("You want us to stop copying Powerpoints into Excels and put data into a whole new Excel that stops me from pasting in whatever I want? But I already have it in my own Excel!"), or they think I'm automating someone's job away to be replaced by a robot. This makes getting support from both employees and leaders near impossible.

It could save firms millions a year in reduced labor costs alone to not have entire teams of people sitting in basements fixing problems with data. However, data scientists are sold on what math they can do and what algorithms they can write, not on what processes they can improve.

It's, frankly, hell.

karishmakunder 2172 days ago

Absolutely agree. It reminds me of an article I read, which cited the statistic that data scientists spend 80% of their time cleaning data rather than creating insights.

Curious to know, did you manage in any way to standardise how data was collected/stored?

stevesycombacct 2172 days ago

It's more like 90%.

On existing processes, standardizing collection has been nearly impossible unless I promise that it will be easier, that it will take less time, and that I'll do all the setup. It always takes leadership backing, and if I don't get it from one leader, I go over their head to the next.

On new processes, if I jump forward and volunteer, I find I'm given leeway to do it my way, that is, a standardized way.

If this sounds ruthless, it is.

karishmakunder 2172 days ago

It is what it is :)

DataDaoDe 2170 days ago

If you think about the scientific lifecycle: Gather Information => Form Hypothesis => Test Hypothesis => Analyze Data => Interpret => Repeat.

Then I would say the hardest parts are the "Gather Information" and "Test Hypothesis" phases. But it's like this in every scientific endeavour, and this is nothing unique to data science.

One interesting point is, perhaps, that we as data scientists are aware that our sources for gathering information and our means for testing hypotheses are often tied to man-made software or hardware systems, as opposed to dynamical real-world structures. This means that, theoretically and practically, there is only ingenuity and willpower keeping us from building better and less time-consuming ways of automating away the tedious (data cleansing/prep/etc.) parts of those processes.

cyberdrunk 2172 days ago

In my company, the biggest issue is finding the right data sets in the company's vast data landscape, figuring out the exact definition/meaning of each column etc. Then, it's dealing with the data quality issues. Then, it's getting access to it and setting up an ingestion job for the data to be copied to some common storage (e.g. Hadoop). At the very end, it's the actual data science. I suspect a lot of the PhDs we hire start drinking before they reach the data science stage :)

karishmakunder 2172 days ago

Yeah, that! Do you use any tool to centralise the data that you use across your Data Science teams? Like a central repository of sorts, so that each one can be given access to it and from there you can start off building your models for training etc?

cyberdrunk 2169 days ago

Yes. We ingest all that data onto central Hadoop, where the data science team can access all of it in a uniform way. This solves the physical access problem.

Unfortunately, the data quality (DQ) and the meaning of the data are harder to solve. They essentially require caretaking of the datasets by the data owners (it cannot be done by a centralized unit). My organization is currently undergoing a transition, after which it will be the responsibility of the data owner to maintain the metadata of his/her dataset and also to measure the data quality, but implementing it across the whole org is a journey that will take a long time.
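As a rough illustration of what "maintain the metadata and measure the data quality" could look like for a single dataset, here is a small plain-Python sketch. It is not our actual tooling; `COLUMN_DOCS`, `quality_report`, `undocumented_columns`, and the sample columns are all hypothetical, and completeness is only one of many possible quality measures.

```python
# Owner-maintained metadata: one human-readable definition per column.
# Column names and descriptions are made up for illustration.
COLUMN_DOCS = {
    "customer_id": "Internal customer key, never reused",
    "signup_date": "Date the account was created, ISO 8601 (YYYY-MM-DD)",
}


def quality_report(rows, documented_columns):
    """Return missing-value counts and completeness per documented column.

    `rows` is a list of dict records; None and "" count as missing.
    """
    total = len(rows)
    report = {}
    for col in documented_columns:
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        report[col] = {
            "missing": missing,
            "completeness": (total - missing) / total if total else 0.0,
        }
    return report


def undocumented_columns(rows, documented_columns):
    """Return the columns that appear in the data but have no definition."""
    seen = set()
    for row in rows:
        seen.update(row)
    return sorted(seen - set(documented_columns))


rows = [
    {"customer_id": 1, "signup_date": "2020-01-02", "seg": "A"},
    {"customer_id": 2, "signup_date": "", "seg": "B"},
    {"customer_id": None, "signup_date": "2020-01-05", "seg": "A"},
    {"customer_id": 4, "signup_date": "2020-01-07", "seg": "C"},
]

report = quality_report(rows, COLUMN_DOCS)
unknown = undocumented_columns(rows, COLUMN_DOCS)
```

The point of keeping the definitions next to the checks is that a column like `seg`, which nobody has documented, gets flagged by the owner's own report instead of being discovered by whoever consumes the data.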

danielscrubs 2173 days ago

Salespeople trying to coax us into the next AutoML or drag-and-drop ML tool. We have an army of PhDs at my job and you think your jack-of-all-trades, master-of-none software is going to impress us? The whole reason they started seems to be “AI is so hot right now and it will look good on our CV”. A single domain-expert greybeard in our field would have impressed us more.

karishmakunder 2172 days ago

Ah, true almost all the time! :) There are only a handful of folks who have an impressive balance of understanding what's hot and how to make an impressive product/solution.

nxpnsv 2172 days ago

Managers not understanding what you are doing or why it is not instant.

natalyarostova 2172 days ago

It took me four years before I built up the experience and confidence to advocate for myself and for the length of time things take. Now I'm trying to protect newer data scientists.

"Hey, new_hire, can you build this model by Friday?" "Sure, let me just get the data" Me: No you can't. The data will take a month to get. No one is getting anything from you on Friday, because I'm not going to let our team develop a culture where data scientists work all night on thankless failing data jobs.

karishmakunder 2172 days ago

Kudos to you for that!

It starts with “We need accuracy for this problem statement” and ends with “wait, let's find and prep the data first”. Timelines, anyone?