Using Spark to read from Excel

dazfuller
Jul 3, 2021

There are many great data formats for transferring and processing data. Formats such as Parquet, Avro, JSON, and even CSV allow us to move around large volumes of data, with varying degrees of structure. But... There are a lot of people out there who, for various reasons, have their data in Excel.

Excel is great as a spreadsheet tool, and it makes things nice and easy for people analysing data and summaries within it. But it's not so great as a format for transferring data around, or as a source format for data engineering workloads. It misinterprets data types, people add in formulas and lookups, and then there's how people format data (I'm sure we've all seen some weird and wonderful formatting). But none of this changes the fact that there is often a lot of useful data in Excel which we might want to bring into our data lake and process in our big data pipelines.

There have been a few ways to do this to date, but a while ago I wanted to start learning how to write my own Spark data source, and Excel seemed like a good place to start as, somehow, I always seem to end up with the projects that need data from it.

At the same time Spark was moving towards Spark 3 and with it, the newer DataSourceV2 APIs for creating data sources. So why not learn how to work with both?

I will write up another, more in-depth post on how I created the data source. The culmination of scratching that itch, though, is that the data source is now available on GitHub under the Apache License 2.0. It supports Spark 3.0 and above, but because the API was unstable before that, it does not support 2.4 and below.

Right now you need to upload and install the jar using the well-documented methods for your platform, such as Databricks and Azure Synapse Analytics.

So what can it do?

Well, first up it's available for use in Scala, Python, and Spark-SQL. In Scala and Python you can use the long format name "com.elastacloud.spark.excel", or the short format name which is just "excel".

// Scala
// Long format name
val df = spark.read.format("com.elastacloud.spark.excel").load("file.xlsx")
// Or the short format name
val dfShort = spark.read.format("excel").load("file.xlsx")

# Python
df = spark.read.format("com.elastacloud.spark.excel").load("file.xlsx")
df = spark.read.format("excel").load("file.xlsx")

Alternatively, in Scala there is a convenience method as well.

// Scala
import com.elastacloud.spark.excel._
val df = spark.read.excel("file.xlsx")

In Spark-SQL you can read in a single file using the default options as follows (note the back-ticks).

SELECT * FROM excel.`file.xlsx`

As well as using just a single file path you can also specify an array of files to load, or provide a glob pattern to load multiple files at once (assuming that they all have the same schema).
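As a rough sketch of what that can look like from Python: PySpark's `load` accepts either a single path string or a list of paths, so one small helper covers all three cases. The helper name `read_excel` and the file names in the usage comments are placeholders of mine, not part of the library, and the sketch assumes the jar is already installed on the cluster.

```python
from typing import List, Union


def read_excel(spark, paths: Union[str, List[str]]):
    """Read one or more Excel files into a single DataFrame.

    `spark` is an existing SparkSession with the Excel data source installed.
    `paths` may be a single path, a glob pattern, or a list of paths.
    All files are assumed to share the same schema.
    """
    return spark.read.format("excel").load(paths)


# Hypothetical usage (file names are placeholders):
# df_many = read_excel(spark, ["sales_2020.xlsx", "sales_2021.xlsx"])
# df_glob = read_excel(spark, "landing/*.xlsx")
```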

And it doesn't have to be just OOXML files. The library is built on top of Apache POI, which allows it to open Excel 97-2003, Excel 2010, and OOXML formats.

What else?

Well, it's worth taking a look at the README for all of the various options, but here's a summary.

Multi-line Headers
Reading from multiple worksheets
Including the worksheet name in the output
Schema inference
Cleaning of column names to avoid issues when working in Spark
Handling of merged cells
Formula evaluation (for those supported by Apache POI)

And in practice?

Good question. I created an example notebook using data from the UK Office for National Statistics (ONS) on unemployment rates between 1971 and 2021. In this notebook we read in the Excel file, transform the data, and then display a chart showing the percentage of unemployment month-by-month for the entire duration.

The source data from the ONS looks like the following.

Source data taken from the UK Office for National Statistics: Excel source data

The data itself starts at cell A9 and has no headers (well it does on row 1, but in this case it's easier to ignore it). The data is also actually 3 sets in 1. After the year-by-year data in the screenshot there is the quarter-by-quarter data, and then immediately after it the month-by-month data. It's the third set we're interested in, so we have to do some post-processing once we've read the data.

Here we have the solution in Azure Databricks.

Unemployment rates using Azure Databricks

And again using Azure Synapse Analytics.

Unemployment rates using Azure Synapse Analytics

As you can see (if you zoom in) the code is identical. We read in the data by specifying that the data starts at cell A9, and that there are no header rows. The data source therefore creates default column names of "col_0" and "col_1", so we replace these with our own. Following that we create a "month" column by parsing the first column with a given format string. Any rows which don't match the format get a null value, but the monthly data gets a timestamp value, so we can filter out the null values. Finally we can display the data using the built-in charting capabilities in both solutions.
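To make the parse-then-filter step a little more concrete, here is a minimal plain-Python sketch of the same logic, without Spark. It is not the notebook code. The label layout ("1971" for years, "1971 Q1" for quarters, "1971 FEB" for months), the format string, the function names, and the sample rates are all assumptions made for illustration.

```python
from datetime import datetime
from typing import Iterable, List, Optional, Tuple

# Assumed layout of a monthly label, e.g. "1971 FEB".
# Note: %b matches month abbreviations for the current locale (English assumed).
MONTH_FORMAT = "%Y %b"


def parse_month(label: str) -> Optional[datetime]:
    """Return the first day of the month for a monthly label, else None."""
    try:
        return datetime.strptime(label.strip().title(), MONTH_FORMAT)
    except ValueError:
        # Yearly and quarterly labels don't match the format, so they become None
        return None


def monthly_rows(rows: Iterable[Tuple[str, float]]) -> List[Tuple[datetime, float]]:
    """Keep only the rows whose label parses as a month."""
    parsed = ((parse_month(label), rate) for label, rate in rows)
    return [(month, rate) for month, rate in parsed if month is not None]


# Made-up placeholder rows standing in for the "col_0" and "col_1" values
sample = [("1971", 1.0), ("1971 Q1", 2.0), ("1971 FEB", 3.0)]
print(monthly_rows(sample))
```

Only the last sample row survives the filter, which is the same effect as dropping the null values in the "month" column in the notebook.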

What next?

This is an initial release and it won't be perfect. Next steps are to get GitHub Actions configured so that the jar files and coverage information are generated by the CI process, and to upload to a Maven repository to make it easier to include in Spark projects.

But right now, it's available for anyone who wants to take it for a spin. If you find problems, or there are features or enhancements you'd like to see, then raise an issue and let me know. Or dive in and have a go at contributing to it.
