Processing Excel Data using Spark with Azure Synapse Analytics
Updated: Jul 25, 2021
Having recently released the Excel data source for Spark 3, I wanted to follow up with a "let's use it to process some Excel data" post. This took more work than I expected. Normally when I go looking for data sources for posts or examples I skip past anything Excel based, but this time I wanted to find them. The problem is that they often come as highly summarised overviews of the data rather than the actual data itself, so this took some hunting around.
Eventually I settled on the Feed in Tariff commissioned installations statistics from data.gov.uk. There's a couple of reasons I went with this data set.
The data is split out over multiple sheets
It's structured pretty terribly (which is common when dealing with Excel data sources)
The splitting of the data is useful as it means I could demonstrate reading in multiple sheets from the same workbook, as well as including the sheet name in the output. The structure? Well it's been made for humans who are pretty good at inferring data based on position, but requires more effort for anyone cleaning the data. It also has the data pivoted, so for each new month of data a new column is added to the end of the sheet. Great for people, pretty rubbish for machines.
The following is a screenshot of one of the sheets.
As you can see from this we have
Date headers are split across 2 rows
The heading showing the unit for the total is inside the data
There are grand and sub-total rows
We've got a random column on the far right, split from the table by a blank column
The notes are straight after the data (Excel needs 2 empty rows and/or 2 empty columns to indicate the end of a table)
The data is split across sheets called
Cumulative PV FIT Deploy
Cumulative Hydro FIT Deploy
Cumulative Wind FIT Deploy
Cumulative AD FIT Deploy
Cumulative Micro CHP FIT Deploy
We'll want the sheet name as it will be easier to get the technology type from that rather than the sub-heading row in the data. And we'll need to read in the data, across multiple sheets, add the value unit of measurement in, clear out totals and sub-totals, clear out the non-data rows, and then un-pivot the data.
First up: which platform am I going to run this on? Well, I've been doing a lot in Databricks and so thought I might as well create a simple pipeline using Azure Synapse and Spark pools. I need to create a Spark 3 pool, which is still in public preview, but this is a simple drop-down when creating the pool.
I already have a Synapse Analytics workspace which is connected up to an Azure Data Lake Gen 2 instance, and so I could jump straight into working out the process I wanted to follow.
Use a copy activity to download the Excel workbook to the landing area of the data lake
Execute a Spark notebook to clean and stage the data, and to also start the curation process
Load the data into a SQL pool and create a Kimball model
Load the data into Power BI
So, first step, download the data.
Downloading the data
If you've ever used Azure Data Factory this is going to look incredibly familiar.
I have a copy activity which is downloading data from my FIT_Deployments_HTTP data set. The data set itself is using a Linked Service which points at the data.gov.uk storage location and then takes a relative path to the file. Because this is an open data set I don't have to provide any authentication information.
In this case I'm using a binary copy. There is an Excel data set option available, but this attempts to read the data from the Excel file and is very particular about its structure. For this data set it would fail miserably and I'd need to create a data set for each worksheet in the workbook. So lots of data movement activities. Instead I'm taking a binary copy to get the workbook as is.
Once this has executed it's on to the Spark pool notebook.
Processing the data with Synapse Spark pools
Before I can start processing the data I need to add the Excel data source library (jar file) to the workspace as a package. Once it's added I can then add it to my spark pool. This takes a while to do and, if you're used to using JAR files in Databricks, seems very awkward. But it gets there eventually.
I'm using the 0.1.6-SNAPSHOT release of the Excel data source as there were some bug fixes I implemented and wanted to try out.
Once it's uploaded, we can get started.
Writing the notebook
So, the very first step is to read in the data using the Excel data source. Well, I say that's the first step; the actual first step is to open up the workbook in Excel to work out where the data starts so we can provide the right options. I'm writing this in PySpark just to make it more accessible.
```python
source_df = (
    spark.read
    .format("excel")
    .option("cellAddress", "B3")
    .option("headerRowCount", "2")
    .option("sheetNamePattern", "Cumulative [\\w\\s]+ FIT Deploy")
    .option("includeSheetName", "true")
    .load("abfss://email@example.com/landing/October_2019_FIT_Deployment_Statistics.xlsx")
)
```
What's going on here? Well, we're first saying that we want to use the Excel data source. Because it's loaded as a package in the pool we don't need to import anything to do this. The next steps are to say that the data starts at cell B3, and that the header is made up of 2 rows (remember that the year is in row 1 and the month is in row 2). Next up is providing a regular expression to match the names of the sheets we want to take data from; this is a pretty easy regex as it's just saying "Cumulative <any word character or space, one or more times> FIT Deploy", which captures just the sheets listed above. Then we say that we want to include the sheet name in the returned data frame, because that sheet name contains some useful context. Finally we point at the workbook we want to read in.
So what we end up from this first step is this.
And it's not the cleanest looking data. But we can see that we have ingested data from all of the 5 sheets.
So we have the first 2 columns with default names provided by the data source. There's total rows, descriptions, blank rows, and why do the column dates have a 0 (zero) in them? Well, that last point is an easy one: the year is stored in the sheet as a number value and not a string, so when the data source asks for its string value it gets back "2010.0". We don't want a period character in the column name as it makes dealing with the values more difficult, so the reader has replaced it with an underscore and then appended the month from the row below.
Looking down the data we can see rows such as "Cumulative Number of Installations". This is pretty crucial as it gives us the units the values are in. But... it's listed in col_1, which otherwise holds the banded ranges, so we need to take each of these values and forward fill it down the rows until we reach the next one. We also want to get rid of the rubbish data, and that means we can filter out anything where there's no value in the Pre_2010 column (much easier). Once we've got rid of those we need to clear out the totals (because they do have values in the Pre_2010 column).
There's a lot happening here, so step-by-step this goes:
Create a new column called "type". Where the "col_1" column starts with "Cumulative" then use that value, otherwise leave it blank.
Create a new "id" column which is the row number over the data, but partitioned by each sheet
Create a new "value_type" column which uses the last non-null value in the "type" column in the current partition (partitioning by the sheet name, and ordering by the "id" column)
Drop the "id" column
Keep any records where the Pre_2010 column is not null and the "col_1" column does not contain the words "All" or "Total"
I'm also renaming a couple of columns, but those are pretty straightforward transformations.
So now the data is looking cleaner (I've sub-selected the columns to make it easier to show in the picture), we've got our value type, and we've filtered out the non-data records. Now for a bit of renaming. We need to rename col_0 and col_1, extracting the technology from the sheet name, and rename all of the date columns to strip out the zero in the middle.
I'm using the Python reduce function so that I can build up the list of columns to change and then call withColumnRenamed for each one. This could be done in a loop, but I think this looks neater. I'm also using the regexp_extract function to extract the technology information from the sheet name.
We now want to unpivot the data, so rather than having a column for each period we have a row for each instead. I've covered how to do this in a previous post and I'm using the same method here. If it's difficult to follow then I strongly suggest reading that previous post.
This is looking a lot better. We now have a row for each data point, and the data from the sheet names. There's just a couple of things left: if we were to look at the schema we would see that the value is a string, because the data source couldn't infer the columns as floats thanks to the non-data records. So, we need to cast this to a numeric value and also change the key from Year_Month to a proper date.
And now we have the final staged data set. Taken from fairly rubbish Excel and to a clean set of data, un-pivoted, and in a structure that allows us to start building additional features and services on top of it.
Curating the data
There's one last thing I want to do in the Synapse Spark pool, and that's to slightly reshape the data before I load it into a SQL pool, just because there are things I find easier to do with Spark.
In this case I want to extend the tech column, having a full description where it's been abbreviated in the sheet name, and turning the newly created period date into a key value. This is actually pretty easy to do in Spark.
What you might spot though is that at the bottom I've taken the data frame and created a temp view from it. This is for the next step, where the data is loaded into a SQL pool. Synapse has its own writer method for doing this, but currently it only works in Scala, and so I need to write the data out and load it back in a Scala cell to write it to the SQL pool. This is a bit strange given that Synapse defaults users to Python workloads, but there we go.
The last line here is where the data is written to the SQL pool, specifically I'm using the INTERNAL constant so that the data is written into the pool, rather than it becoming an external table.
N.B. This write only works once; if it's run again then the target table has to be manually dropped. Hopefully this will be fixed in the future so that it obeys the "overwrite" option.
Modelling and Power BI
After this it's a case of modelling the data into facts and dimensions in SQL. I'm not going to dive into the detail here, but the process is pretty easy. First I create the dimensions by extracting the distinct values from the value_type, scheme, capacity, and tech columns, creating a dimension table for each. In each table I have an auto-incrementing integer which becomes the id column. Then I create a fact table by taking all of the values from the curated table, joining on each of the other columns to the dimension tables, and ending up with a new table that has the id codes for the dimensions and just the facts. Because this is Synapse I do this using a CTAS (Create Table as Select) operation.
You might ask why I didn't mention the date dimension. Rather than creating a dimension for only the dates in the fact table, it's better to have a dimension for all dates where you can add date logic specific to your business needs. This might mean including financial periods, quarters, number of weeks prior to now etc... So I created another notebook which creates a simple dimension and loads it into the SQL pool under the dbo schema, so that it's available to other data sets later.
The numbers might look a little weird, but that's because I'm using seconds from the epoch to create the dates. Starting at 0 (1st Jan 1970) and going up to 31st December 2040, incrementing by 86400 (the number of seconds in 1 day). There's a good reason for doing this. If I had started with a number like 20100101 (1st Jan 2010 in yyyyMMdd) and incremented by one, then I've lost all notion of days and Spark will dutifully tick up to 20100199 before moving on to the next month. Epoch seconds work well as it's a continuous measure and I don't have to filter out invalid dates later.
This gives me a nice simple date dimension which I can join to on the date_key column later on.
Then finally it's a case of loading the data into Power BI where the references can be added in, allowing the facts to be sliced and diced however the user wants by the data in the dimensions.
The visualisations aren't great as I'm not the world's best (or even a mediocre) Power BI user, but hopefully you can see what I'm trying to achieve (if anyone wants to try this out and show me some better visuals I'd love to see them).
And at that stage we're done.
Hopefully this shows how we can take data from Excel, ingest and transform that data using Spark in Azure Synapse, and take it all the way through to Power BI. This entire process could be just as easily replicated using Databricks, with the exception of a different driver being required to write the data out to the Database (SQL pool in this example).
As I said earlier in the post though, if you can use a different data format such as CSV or Parquet then do so, it will save a lot of effort in terms of implementation, and probably save a lot of pain later on when someone decides to rename a sheet, or change the layout to make it "easier to read". But if you can't avoid Excel, then hopefully this helps work out how to tackle it.