by claytonjy 1628 days ago
I took a similar route, albeit much less intentionally and spanning almost a decade. Software QA -> Business Intelligence Analyst -> Data Scientist -> Data Engineer.

Here's what I'd recommend today:

1. Get very comfortable with Python. Scripting isn't enough; you'll need good OO principles, an understanding of how to manage projects, libraries, and dependencies, etc. This will take the longest, so start it first.

2. Read and re-read Designing Data-Intensive Applications by Kleppmann. This is the bible of data engineering and far outclasses anything else currently available.

3. Get your hands dirty with modern tools and the whole data lifecycle. dbt, Airflow, Snowflake, and Postgres should be obvious (feel free to substitute Prefect, ClickHouse, etc. if desired). You'll also want familiarity with a cloud stack and how to manage it (Terraform, Pulumi, or CDK). A public portfolio project would be great, but being able to talk confidently about the how and why of these things is probably enough.

The hard part is getting that next job. Look for junior roles at big companies, and mid-level roles at startups that don't understand the data ecosystem yet (almost any startup whose product is not ML or ELT). The former will give more mentorship; the latter will be easier to get if you can talk the talk in an interview.

2 comments

>you'll need good OO principles

Could you please give more details on why this is important? I have good experience dealing with data and data science, and a little bit of data engineering too, but I never saw the necessity for OO. I'm also very interested in data engineering and was wondering why you mentioned OO and why it is important for data engineering.

Thank you.

I've been there too, especially since most of my pre-eng work was in R. In Python at least, if you want to write code that others can use and extend, you need to embrace objects. Nothing fancy necessarily, but you should have a sense of how to organize a class hierarchy for a given program, and of when and how much inheritance to use. To do that you'll touch on a lot of small things like mixins, MRO, ABCs, etc.
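As a rough illustration of those pieces working together, here is a minimal sketch using only the standard library. The names (`Source`, `RowCountMixin`, `InMemorySource`, `CountedInMemorySource`) are hypothetical and made up for this example; they don't come from any real library or pipeline.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    """Abstract base class (ABC): every source must implement read()."""

    @abstractmethod
    def read(self) -> list:
        ...


class RowCountMixin:
    """Mixin that wraps read() and records how many rows came back."""

    def read(self) -> list:
        # super() follows the MRO, so this calls the next read() in line.
        rows = super().read()
        self.rows_read = len(rows)
        return rows


class InMemorySource(Source):
    """Concrete source backed by a plain list (a stand-in for a DB or API)."""

    def __init__(self, rows):
        self._rows = list(rows)

    def read(self) -> list:
        return list(self._rows)


class CountedInMemorySource(RowCountMixin, InMemorySource):
    """The mixin is listed first so its read() runs before InMemorySource's."""


source = CountedInMemorySource([{"id": 1}, {"id": 2}])
rows = source.read()

# The method resolution order (MRO) shows which read() is tried first.
mro_names = [cls.__name__ for cls in CountedInMemorySource.__mro__]
print(mro_names)
print(source.rows_read)
```

Swapping `InMemorySource` for another `Source` subclass keeps the counting behavior without touching the mixin, which is roughly the "use and extend" property I mean. Trying to instantiate `Source()` directly raises a `TypeError` because `read()` is abstract.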

One way you may be forced into this is custom Airflow operators; the community now recommends writing ~0 logic in Airflow and sticking it all in Docker instead, but any team using Airflow for more than a few years has a tangled web of custom bullshit you'll be expected to maintain and extend.

You can certainly write a lot of python in a more procedural and/or functional way, but if you ask a python engineer to use or modify that code, don't be surprised by their anger.

Thank you so much for your answer!

I have a general understanding of OO in Python. I just do not see where exactly to use it in data engineering. Could you please recommend any book/article/video that shows, with examples, when to use OO in data engineering tasks?

This is perfect, thanks.

On #3, do you know of any public GitHub repos or codebases that are good to reverse engineer and learn an end-to-end pipeline from?

Always like to supplement book learning with real world production-level examples, especially in this case where I don't have access to that where I work.