Hacker News new | ask | show | jobs
by prions 1989 days ago
SQL proficiency is important but I wouldn't say it supersedes programming experience. To me, Data Engineering is a specialization of software engineering, and not something like an analyst who writes SQL all day.

As DE has evolved, the role has transitioned away from traditional low code ETL tools towards code heavy tools. Airflow, Dagster, DBT, to name a few.

I work on a small DE team. We don't have the human power to grind out SQL queries for analysts and other teams. Our solutions are platforms and tools we build on top of more fundamental tools that allows other people to get the data themselves. Think tables-as-a-service.

3 comments

SQL proficiency is something I've seen developers of all sorts neglect, which I think is a huge mistake. And relegating SQL to something that just an "analyst" does is an even bigger mistake.

Several times over my career I've been brought in on a project where the team was considering replacing their RDBMS entirely with a no-SQL data store (a huge undertaking!) because they were having "performance problems". In many cases the solution is as simple as adding an index or modifying a query to use an index, but the devs regard it as some kind of wizardry to read a query plan.

I spent way too large a portion of my last position teaching developers about indexes, query plans and underlying join types and their impact on performance and memory consumption.
Just curious do you have such knowledge for columnar database such as Vertica?
Not Vertica though that looks very interesting. I do have a lot of experience with Redshift though. The difficulty is most implementations of data warehouses are fairly bespoke, even down to query plan and execution so knowledge on Redshift may not completely transfer to Vertica for instance.
Thanks. But how does one approach to learn the internals for these things? It's not like MySQL or SQL Server or PostgreSQL that we have tons of books and very detailed documentation. For Vertica we only have a doc, no books, just provided as is.

It seems to be the norm for everything that takes flight around 2010. Of course many are open sourced so those are OK I guess.

The thing is finding the terminology, in the case of Redshift that is Sort Key, Distribution Key and primary key (though these aren't true primary keys they do influence the query planner).

It took me a few minutes but I found indexes are called projections in Vertica and are more like materialized views than true indexes, here are the docs with a breakdown, https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/Ad...

And here is a general walk through of the architecture including key concepts such as projections. https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/C...

In a few cases I have had to go to published white papers on the technologies as well.

But honestly it's all searching for the right words and then crawling through docs and papers.

I wholeheartedly agree with the "specialization" comment.

Unless you are in a position where you can entirely rely on managed tools that do the work for you and all effort is centered around managing the data, rather than the holistic view of your data pipelines (Talend ETL, Informatica - the "pre-Hadoop" world, if you will, and maybe some modern tools like Snowflake), then a good Data Engineer needs a deep understanding of programming languages, networking, some sysadmin stuff, distributed systems, containerization, statics, and of course a good "architect" view on the ever-growing zoo of tools and languages with different pros and cons.

Given that at the end of the day, most "Data Pipelines" run on distributed Linux machines, I've seen and solved endless issues with Kernel and OS configurations (noexec flags, ulimits, permissions, keyring limits ...), network bottlenecks, hotspotting (both in networks and databases), overflowing partitions, odd issues on odd file systems, bad partition schemes, a myriad of network issues, JVM flags, needs for auditing and other compliance topics, heavily multi-threaded custom implementations that don't use "standard" tools and rely on language features (goroutines, multiprocessing in Python, Threadpools in Java ...), encoding problems, various TLS and other security challenges, and of course, endless use of GNU tools and other CLI-fun and I would not necessarily expect for a pure SQL use case (not discounting the fact that SQL is, in fact, very important).

Not to mention that a lot of jobs / workflows Data Engineers design and write tend to be very, very expensive, especially on managed Clouds - generally a good idea to make sure everything works and your engineers understand what they are doing.

TL;DR: SQL is still a DE's best and most ubiquitous tool/language in the space--hands down.

I've led DE teams for the last decade. I have lived through shifts in toolsets, languages, etc. Regardless of platform, languages, model types, etc, etc, etc, etc, the one constant has been SQL with some sort of of scripting around it.

Right now, it seems Python is the big wrapper language, whether it's via dag or some other means but that's just the preferred method TODAY. Considering SQL has been around for decades and has outlasted just about every other language and system, many of which have opted for a SQL-like interface on top of their system, I would highly recommend DEs be very strong there.