by caffeine
1570 days ago
Let’s say you are the Lead Data Engineer of a small data-driven company. You need to define a strategy, pick a stack of tools, and decide how data is going to be stored and normalised and what the workflow will be, from ad-hoc exploratory studies to productionized inference. Are there any good resources out there that are useful to this person? I am this person right now and I need to find some good guidance.
There are so many resources that you can easily get lost. Martin Kleppmann's Designing Data-Intensive Applications can be your starting point. It helps you build a basic to advanced understanding of quite a few of the concepts that will come up, along with some key principles to drive your strategy from a technical perspective (you'll need a few facets of your strategy for different audiences).
Then move to a more corporate-focused presentation with Piethein Strengholt's Data Management at Scale (the business-facing aspects of your strategy, incl. governance forums etc., unless you are short-term lucky enough not to have them due to size - long-term unlucky, as you'll have to establish them or drive others to do so).
At this point, after a few discussions, you should be getting a feeling for what the direction will be in terms of where your data will be stored, how you do data quality, how you process, how you expose, infrastructure etc. There are dozens of books on the individual elements of your stack. Try to link them back to Kleppmann or other more specialized but still conceptual books (e.g. if you do streaming you could look into Flow Architectures by Urquhart, Streaming Systems by Akidau et al. etc.). Then you can move to inference, etc. I am not at that stage yet, so no specific advice. In my case, I see inference etc. as something I can address after data are on the platform, but I am not sure what state you are facing. I guess you can start looking into trendy stuff like MLOps etc.
Good luck! It's really exciting working in this domain!