azurecoder

What makes a data engineer?

Note: this post is mainly about Azure but it can apply to any cloud.

You reach a stage where you've heard enough competing definitions of what a job is to bring you to the point of nervous exhaustion. This is one job title that wears me out when thinking and talking with customers.

All of a sudden the market seems awash with data engineers. In this post I'd like to review some of the backgrounds people tend to convey when they call themselves a data engineer. At Elastacloud we have some clear definitions which might be helpful if you're looking to build a practice. Let's start with the backgrounds of people who think they're data engineers.

  1. The DBA. This is one of the most common origins of a data engineer. I've seen people completely reinvent themselves from a SQL background into data engineering. Pad out a CV with a little Databricks, Spark and PySpark and you're a data engineer.

  2. The Data Analyst. Data Analysts have a strong Excel background and sometimes upgrade themselves to data engineer by adding a little PySpark and Spark with some custom ETL.

  3. The Software Engineer. A background in distributed computing, maybe having gone through the days of DCOM/Remoting/RMI.

Okay, so it's pretty clear that I think (3) is the most coherent definition. People in this bracket understand jobs, HPC and scale-out compute naturally, as well as algorithms. They're the sort of people who could write parts of Spark or Storm, and read the code of the things they're writing "notebooks" for. That's why you need them. Period.

I've seen lots of type (1), mainly in the wake of the chaos that ensues when type (1) doesn't really understand the shift from Data Warehouse to Data Lake as the single source of truth, or how a data lake lifecycle works. A lot of our customers have this type of data engineer. We generally tend to find an overengineered morass of SQL, messy, lengthy "ingestion" pipelines which take nearly as long to execute as the huge batch window they need to fit in, and blame aimed at toolsets such as Spark for what is really a fundamental misunderstanding of data locality and partitioning.
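To make the partitioning point concrete, here's a minimal plain-Python sketch of the idea. In Spark you'd express the same thing with `df.write.partitionBy("event_date")`, so that a query filtering on `event_date` reads only the matching folders in the lake rather than scanning everything; the data and key names below are made up for illustration.

```python
from collections import defaultdict

# Toy events: (event_date, payload). In a lake, partitioning by event_date
# means each date lands in its own directory.
events = [
    ("2021-03-01", "a"), ("2021-03-02", "b"),
    ("2021-03-01", "c"), ("2021-03-03", "d"),
]

# Partition once, on the key that queries filter by.
partitions = defaultdict(list)
for event_date, payload in events:
    partitions[event_date].append(payload)

def query(event_date):
    # Partition pruning: only the matching "folder" is read,
    # not the whole dataset.
    return partitions.get(event_date, [])

print(query("2021-03-01"))  # ['a', 'c']
```

Choose the wrong partition key (or none at all) and every query degenerates into a full scan, which is exactly the failure mode being blamed on Spark above.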

Type (2) is fairly harmless. I've seen many of our customers pretend that type (2) are data engineers. This type tends to amble along doing small, irrelevant tasks without achieving much. They tend not to follow the SDLC and are quite nervous of pushing code into production, so in my experience projects awash with disguised analysts never make it into production. Type (1), for what it's worth, doesn't follow the SDLC either.

Type (3) is generally a software engineer. This type would be able to move between Spark and Azure Functions on a whim because they realise that they are two sides of the same coin: distributed compute. With big compute it's all about scale-out and fast execution, and understanding the limits of each part of a system so that you can distribute work and make it execute as quickly as possible. With big data the same paradigm is about how to separate data into chunks and transform it so that it can be put together in a way that satisfies queries quickly. I've seen some terrible use of Spark by type (1) that really exemplifies the lack of study of how to use a big data platform. It's weird, because type (1) generally understands and spends loads of time on SQL optimisation, but doesn't learn the basics of distributed compute.
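The "two sides of the same coin" shape can be sketched in a few lines of plain Python: split the work into chunks, map each chunk out to a worker, then reduce the partial results. This is the same pattern whether the workers are Spark tasks, Azure Functions invocations or Azure Batch nodes; the numbers and chunk size here are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "big compute" job: sum a range by splitting it into chunks.
numbers = list(range(1_000))
chunk_size = 100
chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]

# Scale out: each chunk is processed independently by a worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum, chunks))

# Reduce: combine the partial results.
total = sum(partials)
print(total)  # 499500
```

The engineering skill is in where you draw the chunk boundaries, which is exactly the partitioning question again, just viewed from the compute side.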

In summary, data engineers for me are programmers who can move between distributed frameworks like Azure Functions / Azure Batch / Service Fabric and big data platforms such as Databricks or HDInsight. They also have a strong understanding of software integration, with many data engineers having come from a messaging and integration background (e.g. in Azure: Event Hubs, Service Bus, Event Grid etc.). They have good programming experience, originally in OO languages and later in functional languages.

Most of these things come naturally to type (3), but over time they will also learn BI skills such as modelling data in Kimball models, and how to sequence and merge data into Databricks Delta so that the database / warehouse does very little work.
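For the merge step, here's a minimal plain-Python sketch of the semantics a Delta `MERGE` gives you: rows that match on the business key are updated, rows that don't are inserted. In Databricks the real thing would use `DeltaTable.merge(...)` with `whenMatchedUpdateAll()` / `whenNotMatchedInsertAll()`; the table, key and rows below are invented for illustration.

```python
# Current state of the "Delta table", keyed by business key.
target = {
    1: {"id": 1, "name": "alice", "score": 10},
    2: {"id": 2, "name": "bob", "score": 20},
}

# Incoming batch: one changed row (id 2) and one new row (id 3).
updates = [
    {"id": 2, "name": "bob", "score": 25},
    {"id": 3, "name": "carol", "score": 30},
]

# MERGE semantics: WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT.
for row in updates:
    target[row["id"]] = row  # update-or-insert on the join key

print(sorted(target))  # [1, 2, 3]
```

Doing the upsert once, upstream in Delta, is what lets the downstream database / warehouse serve queries without doing the reconciliation work itself.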

Happy trails and keep hiring type (3)s.

