I felt like this article was a bit light on data scientist specific advice, and while I am not one, I do herd them for a living, so thought I'd put some random thoughts together:
1) Quite often you are not training a machine to be the best at something. You're training a machine to help a human to be the best at something. Be sure to optimise for this when necessary.
2) Push predictions, don't ask others to pull them. Focus on decoupling your data science team and their customers early on. The worst thing that can happen is duplicating logic, first to craft features during training, and later to submit those features to an API from clients. Even if you hide the feature engineering behind the API, this can either slow down predictions, or still require bulky requests from the client in the case of sequence data. Instead, stream data into your feature store, and stream predictions out onto your event bus. Then your data science team can truly be a black box.
3) Unit test invariants in your model's outputs. While you can't write tests for exact outputs, you can say "such and such a case should output a higher value than some other case, all things being equal". When your model disagrees, do at least consider that the model may be correct though.
4) Do ablation tests in reverse, and unit test each addition to your model's architecture to prove it helps.
5) Often you will train a model on historical data, and content yourself that all future predictions will be outside this training set. However, don't forget that sometimes updates to historical data will trigger a prediction to be recalculated, and this might be overfit. Sometimes you can serve cached results, but small feature changes make this harder.
6) Your data scientists are probably the people who are most intimate with your data. They will be the first to stumble on bugs and biases, so give them very good channels to report QA issues. If you are a lone data scientist in a larger organisation, seek out and forge these channels early.
7) Don't treat labelling tools as grubby little hacked together apps. Resource them properly, make sure you watch and listen to the humans building and using them.
8) Have objective ways of comparing models that are thematically similar but may differ in their exact goal variables. If you can't directly compare log loss or whatever like-for-like, find some more external criteria.
9) Much of your job is building trust in your models with stakeholders. Don't be afraid to build simple stuff that captures established intuitions before going deep - show people the machine gets the basics first.
10) If you're struggling to replicate a result from a paper, either with or without the original code, welcome to academia.
Probably not earth shattering stuff, I grant you.
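To make point 3 a bit more concrete, here is a rough sketch of what an invariant test could look like. Everything in it is made up for illustration: `predict_price` is a hypothetical stand-in for a trained model, and the feature names and coefficients are assumptions, not anything from a real system. The idea is only that you perturb one feature, hold the rest fixed, and assert the direction of the change rather than an exact value.

```python
# Hypothetical stand-in for a trained model: any callable that maps a
# feature dict to a score would do. Names and coefficients are invented.
def predict_price(features: dict) -> float:
    return (
        50_000.0
        + 1_200.0 * features["floor_area_m2"]
        - 800.0 * features["age_years"]
    )


def check_invariant(predict, base: dict, feature: str, delta: float, direction: int) -> bool:
    """Return True if changing one feature by `delta`, all else being equal,
    moves the prediction in the expected direction (+1 = up, -1 = down)."""
    changed = {**base, feature: base[feature] + delta}
    diff = predict(changed) - predict(base)
    return diff * direction > 0


def test_larger_house_costs_more():
    base = {"floor_area_m2": 80.0, "age_years": 20.0}
    assert check_invariant(predict_price, base, "floor_area_m2", 10.0, +1)


def test_older_house_costs_less():
    base = {"floor_area_m2": 80.0, "age_years": 20.0}
    assert check_invariant(predict_price, base, "age_years", 5.0, -1)
```

Tests named like this would be picked up by a runner such as pytest. If one fails, that is a prompt to look closer, not proof the model is wrong, as noted in point 3.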
Good practices from software engineering are just as applicable to Data Science. In particular:
Notebooks are great for performing EDA and testing out new concepts. They're not great for running production code. Put any code that isn't a one-off in regular source code files and keep it under source control.
Break your code into separately testable and composable functions. Write unit tests to verify behavior where you can. Speaking from experience, you almost certainly will find bugs.
Implement a peer review process for the methodology used and the code. Approaches should be explainable and justifiable. Bugs and poor assumptions can lead to incorrect results.
Focus on making your model training process end-to-end reproducible. Document the training data used. Document the configuration used. Link back to the commit hash of the exact code used. Make sure your environment is reproducible.
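As a rough illustration of the reproducibility point, one option is to write a small manifest next to every trained model. This is only a sketch under some assumptions: the training data is a single file, the config is a JSON-serialisable dict, and the code lives in a git repository. The function names (`build_manifest`, `write_manifest`, and so on) are hypothetical, not from any particular library.

```python
import hashlib
import json
import platform
import subprocess
import sys
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash the training data file so the exact version can be identified later."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_commit() -> str:
    """Return the current git commit hash, or "unknown" if git is unavailable
    or the working directory is not a repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def build_manifest(data_path: Path, config: dict) -> dict:
    """Collect the data, config, code and environment details for one training run."""
    return {
        "data_file": str(data_path),
        "data_sha256": file_sha256(data_path),
        "config": config,
        "git_commit": current_commit(),
        "python_version": sys.version,
        "platform": platform.platform(),
    }


def write_manifest(manifest: dict, out_path: Path) -> None:
    """Store the manifest as JSON alongside the model artefact."""
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

This only records the Python version and platform. Pinning package versions (a lock file or container image, say) would still be needed for a fully reproducible environment.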