Some of the challenges I feel are prevalent are finding value in data, integrating open source software, and not having enough platforms for feedback and improvement.
The most frustrating challenges I've faced boil down to just cleaning the data. It's not too bad when everything is stable and you're just cleaning up a database, though this can still be pretty hard depending on the scale of the operations required. The worst is when I have a live data feed that is liable to occasionally mess up. In one instance I was reading in stock data from an API, and on their end they messed up and sent the same timestamp for two different instances. That caused my local data aggregation to merge them together into a series, and later, when that value was actually queried by code expecting a numpy float, it just crashed. So what I've done to deal with this is write data processing code that anticipates potential noise and has mechanisms to resolve it, or that raises errors early by asserting my assumptions instead of letting me find the problem a week later.
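A minimal sketch of that "fail early" idea, in plain Python. The names `ingest` and `FeedError` and the (timestamp, price) tick shape are assumptions made for illustration, not the code from the incident described above. It treats an exact resend as harmless and rejects a repeated timestamp that carries a different price, rather than silently merging the two.

```python
from datetime import datetime
from math import isfinite


class FeedError(ValueError):
    """Raised as soon as an incoming tick violates an assumption."""


def ingest(ticks, store=None):
    """Add (timestamp, price) ticks to store, failing early on bad input.

    store maps timestamp -> float price (a hypothetical layout).
    """
    store = {} if store is None else store
    for ts, price in ticks:
        if not isinstance(ts, datetime):
            raise FeedError(f"timestamp is not a datetime: {ts!r}")
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(price, bool) or not isinstance(price, (int, float)):
            raise FeedError(f"price is not a number at {ts}: {price!r}")
        price = float(price)
        if not isfinite(price):
            raise FeedError(f"price is not finite at {ts}: {price!r}")
        # Same timestamp with a different price: refuse instead of merging
        if ts in store and store[ts] != price:
            raise FeedError(f"conflicting prices for {ts}: {store[ts]} vs {price}")
        store[ts] = price
    return store


# Usage: an exact resend is tolerated, a conflicting one is rejected at once
t = datetime(2020, 1, 2, 9, 30)
store = ingest([(t, 101.5), (t, 101.5)])
try:
    ingest([(t, 102.0)], store)
except FeedError as err:
    print("rejected:", err)
```

Whether a conflicting tick should raise, be dropped, or be quarantined for review depends on the feed; raising is just the simplest way to surface the problem on the day it happens.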
I do agree with the general lack of feedback/improvement platforms, at least on the non-analysis side (I've seen good feedback on Kaggle forums before when it comes to questions on problem solving methodology). I don't really follow the "not finding value in data" part though. In my experience it's pretty much a binary question like "can I use this data I've found to solve my problem, or improve my solution?", and if so it's valuable relative to that application.
That’s one of the major challenges, where you need the data processing code to reassess what comes in.
Is there any way you share this piece of code across teams? One of the challenges I have seen is how to avoid re-inventing the wheel. Like, it's all there, somewhere, but across team members it's quite difficult to pass on that knowledge of “hey, we already have this data processing script” for another similar use case.
Private git repos, with Jupyter notebooks documenting the scripts, are the primary means of sharing. I have duplicated quite a few things inadvertently though, just due to them being fairly simple and me not asking about it. That's more of a communication issue than anything else though.
I agree. Kaggle’s great. Though, personally, I’d prefer a collaborative interface, for example one that helps me improve my model accuracy, over a competition-type interface.
A significant chunk of my work involves no use of algorithms, statistics, or math. For the most part, my days are filled with one thing: cleaning up data.
If more firms took an educated, standardized approach to their data, I would do my job significantly faster, and the company would have access to their data-related products in far, far less time. Firms I have worked for, overall, refuse to do that.
Instead, such notions are treated as toxic, and I am too, by the associative property. They think I'm making extra work ("You want us to stop copying Powerpoints into Excels and put data into a whole new Excel that stops me from pasting in whatever I want? But I already have it in my own Excel!"), or they think I'm automating someone's job away to be replaced by a robot. This makes getting support from both employees and leaders nearly impossible.
It could save firms millions a year in reduced labor costs alone to not have entire teams of people sitting in basements fixing problems with data. However, data scientists are sold on what math they can do and what algorithms they can write, not on what processes they can improve.
Absolutely agree. It reminds me of an article I read, which had the statistic that data scientists spend 80% of their time cleaning data rather than creating insights.
Curious to know, did you manage in any way to standardise the way data was collected/stored?
On existing processes, standardizing collection has been nearly impossible unless I promise that it will be easier, that it will take less time, and that I'll do all the setup. It always takes leadership backing, and if I don't get it from one leader, I go over their head to the next.
On new processes, if I jump forward and volunteer, I find I'm given leeway to do it my way, that is, a standardized way.
If you think about the scientific lifecycle: Gather Information => Form Hypothesis => Test Hypothesis => Analyze Data => Interpret => Repeat.
Then I would say the hardest parts are the "Gather Information" and "Test Hypothesis" phases. But it's like this in every scientific endeavour, and this is nothing unique to data science.
One interesting point is, perhaps, that we as data scientists are aware that our sources for gathering information and our means for testing hypotheses are often tied to man-made software or hardware systems, as opposed to dynamical real-world structures. This means that, theoretically and practically, only ingenuity and willpower are keeping us from building better and less time-consuming ways of automating away the tedious (data cleansing/prep/etc.) parts of those processes.
In my company, the biggest issue is finding the right data sets in the company's vast data landscape, figuring out the exact definition/meaning of each column, etc. Then, it's dealing with the data quality issues. Then, it's getting access to it and setting up an ingestion job for the data to be copied to some common storage (e.g. Hadoop). At the very end, it's the actual data science. I suspect a lot of the PhDs we hire start drinking before they reach the data science stage :)
Yeah, that! Do you use any tool to centralise the data that you use across your Data Science teams? Like a central repository of sorts, so that each one can be given access to it and from there you can start off building your models for training etc?
Yes. We ingest all that data onto central Hadoop, where the data science team can access all of it in a uniform way. This solves the physical access problem.
Unfortunately, the DQ (data quality) and meaning of the data are harder to solve. They essentially require caretaking of the datasets by the data owners (it cannot be done by a centralized unit). My organization is currently undergoing a transition in which it will be the responsibility of the data owner to maintain the metadata of his/her dataset and also to measure the data quality, but implementing it across the whole org is a journey that will take a long time.
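As a rough illustration of what "maintain the metadata and measure the data quality" could look like for one dataset, here is a small Python sketch. Everything in it is hypothetical: the dataset name, the owner, the column descriptions, and the choice of completeness (share of non-missing values) as the only quality metric. It does not describe the tooling actually used in the organization above.

```python
def column_completeness(rows, columns):
    """Return the fraction of non-missing values per column.

    A value counts as missing if the key is absent, None, or an empty string.
    """
    total = len(rows)
    report = {}
    for col in columns:
        present = sum(1 for row in rows if row.get(col) not in (None, ""))
        report[col] = present / total if total else 0.0
    return report


# Owner-maintained metadata: what the dataset is and what each column means
metadata = {
    "dataset": "customer_orders",   # hypothetical dataset name
    "owner": "orders-team",         # hypothetical owning team
    "columns": {
        "order_id": "Unique identifier of the order",
        "amount": "Order total in EUR",
    },
}

rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},   # missing value
    {"order_id": 3},                   # missing key
    {"order_id": 4, "amount": 7.5},
]

report = column_completeness(rows, metadata["columns"])
for col, share in report.items():
    print(f"{col}: {share:.0%} complete ({metadata['columns'][col]})")
```

Keeping the column descriptions and the quality report next to each other is one way to let the owner answer both "what does this column mean?" and "how much can I trust it?" in one place.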
Salespeople trying to coax us into the next AutoML or drag-and-drop ML tool.
We have an army of PhDs at my job and you think your jack-of-all-trades, master-of-none software is going to impress us? The whole reason they started seems to be “AI is so hot right now and it will look good on our CV”. A single domain expert grey beard in our field would have impressed us more.
Ah, true almost all the time! :) There are only a handful of folks who have an impressive balance of understanding what's hot and how to make an impressive product/solution.
It took me four years before I built up the experience and confidence to advocate for myself and for how long things take. Now I'm trying to protect newer data scientists:
"Hey, new_hire, can you build this model by Friday?"
"Sure, let me just get the data"
Me: No you can't. The data will take a month to get. No one is getting anything from you on Friday, because I'm not going to let our team develop a culture where data scientists work all night on thankless failing data jobs.