by eeegnu
2136 days ago
The most frustrating challenges I've faced boil down to just cleaning the data. It's not too bad when everything is stable and you're just cleaning up a database, though even that can be pretty hard depending on the scale of the operations required. The worst is a live data feed that is liable to occasionally mess up. In one instance I was reading in stock data from an API, and on their end they messed up and sent the same timestamp for two different data points. My local data aggregation merged them together into a series, and later, when that value was queried by code expecting a numpy float, it just crashed. What I've done since is write data processing code that anticipates potential noise and has mechanisms to resolve it, or that asserts its assumptions so errors surface early instead of being found a week later.

I do agree about the general lack of feedback/improvement platforms, at least on the non-analysis side (I've seen good feedback on Kaggle forums when it comes to questions on problem-solving methodology).

I don't really follow the part about not finding value in data, though. In my experience it's pretty much a binary question: 'can I use this data I've found to solve my problem, or improve my solution?' If so, it's valuable relative to that application.

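To make the "assert your assumptions early" idea concrete, here is a minimal sketch of the kind of check that would have caught the duplicate-timestamp case at ingestion time. It is not the code from that project: `Tick`, `ingest`, `DuplicateTimestampError`, and the `on_conflict` policy names are all hypothetical, and it uses only the standard library rather than numpy.

```python
import math
from dataclasses import dataclass
from datetime import datetime


class DuplicateTimestampError(ValueError):
    """Raised when the feed sends two different prices for one timestamp."""


@dataclass(frozen=True)
class Tick:
    # Hypothetical record shape; a real feed would carry more fields.
    timestamp: datetime
    price: float


def ingest(ticks, on_conflict="raise"):
    """Build a timestamp -> price mapping, validating each tick on arrival.

    on_conflict="raise" fails fast on conflicting duplicates;
    on_conflict="keep_last" resolves them by keeping the newest value.
    """
    if on_conflict not in ("raise", "keep_last"):
        raise ValueError(f"unknown on_conflict policy: {on_conflict!r}")

    prices = {}
    for tick in ticks:
        # Assumption downstream code relies on: each price is one finite number.
        if isinstance(tick.price, bool) or not isinstance(tick.price, (int, float)):
            raise TypeError(f"non-numeric price at {tick.timestamp}: {tick.price!r}")
        price = float(tick.price)
        if not math.isfinite(price):
            raise ValueError(f"non-finite price at {tick.timestamp}: {price!r}")

        previous = prices.get(tick.timestamp)
        if previous is not None and previous != price and on_conflict == "raise":
            # Surface the problem now instead of silently merging the values.
            raise DuplicateTimestampError(
                f"{tick.timestamp} already has price {previous}, got {price}"
            )
        # Exact repeats are harmless; conflicting ones reach here only
        # under the "keep_last" policy.
        prices[tick.timestamp] = price
    return prices
```

The point of the two policies is that the choice between crashing early and resolving the noise is made explicitly by the caller, rather than being whatever the aggregation happens to do by default.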
Is there any way you share this piece of code across teams? One of the challenges I have seen is how to avoid reinventing the wheel. It's all there, somewhere, but across team members it's quite difficult to pass on the knowledge of “hey, we already have this data processing script” for another similar use case.