by eeegnu 2136 days ago
The most frustrating challenges I've faced boil down to just cleaning the data. It's not too bad when everything is stable and you're just cleaning up a database, though that can still be pretty hard depending on the scale of the operations required. The worst is a live data feed that is liable to occasionally mess up. In one instance I was reading in stock data from an API, and on their end they messed up and sent the same timestamp for two different records. My local data aggregation merged them together into a series, and later, when that value was queried by code expecting a numpy float, it just crashed. What I've done to deal with this is write data processing code that anticipates potential noise and has mechanisms to resolve it, or that raises errors early by asserting my assumptions, instead of finding the problem a week later.
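A minimal sketch of the fail-early idea, assuming each incoming record is a (timestamp, price) pair. The names `Tick`, `FeedError`, and `validate_ticks` are hypothetical and only for illustration; this is not the code from the incident above, just one way to check the "one value per timestamp" assumption as data arrives.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Tick:
    # Hypothetical record shape: epoch seconds plus a price.
    timestamp: int
    price: float


class FeedError(ValueError):
    """Raised as soon as an incoming tick violates an assumption."""


def validate_ticks(ticks: Iterable[Tick]) -> Dict[int, float]:
    """Build a timestamp -> price map, stopping at the first bad tick."""
    seen: Dict[int, float] = {}
    for tick in ticks:
        # Assumption 1: every price is a single finite, positive float.
        if (
            not isinstance(tick.price, float)
            or not math.isfinite(tick.price)
            or tick.price <= 0
        ):
            raise FeedError(f"bad price {tick.price!r} at {tick.timestamp}")
        # Assumption 2: a timestamp appears at most once in the feed.
        if tick.timestamp in seen:
            raise FeedError(
                f"duplicate timestamp {tick.timestamp}: "
                f"{seen[tick.timestamp]} vs {tick.price}"
            )
        seen[tick.timestamp] = tick.price
    return seen
```

Raising on the duplicate is only one option. Depending on the feed, keeping the latest value or logging and skipping might be the better resolution, but either way the problem surfaces at ingestion rather than when the value is read later.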

I do agree with the general lack of feedback/improvement platforms, at least on the non-analysis side (I've seen good feedback on Kaggle forums before when it comes to questions on problem-solving methodology). I don't really follow the "not finding value in data" part, though. In my experience it's pretty much a binary question, like "can I use this data I've found to solve my problem or improve my solution?", and if so, it's valuable relative to that application.

2 comments

That’s one of the major challenges, where you need the data processing code to reassess what comes in.

Is there any way you share this piece of code across teams? One of the challenges I have seen is how to avoid reinventing the wheel. It's all there, somewhere, but across team members it's quite difficult to pass on the knowledge of "hey, we already have this data processing script" for another similar use case.

Private git repos, with Jupyter notebooks documenting the scripts, are the primary means of sharing. I have duplicated quite a few things inadvertently though, just because they were fairly simple and I didn't ask about them. That's more of a communication issue than anything else.
I agree. Kaggle's great. Personally, though, I'd prefer a collaborative interface, for example one that helps improve my model accuracy, over a competition-type interface.