There are many great data formats for transferring and processing data. Formats such as Parquet, Avro, JSON, and even CSV allow us to move around large volumes of data, with varying degrees of structure. But... There are a lot of people out there who, for various reasons, have their data in Excel.
Excel is great as a spreadsheet tool, and it makes life easy for people analysing data and building summaries within it. But it's not so great as a format for transferring data around, or as a source format for data engineering workloads. It misinterprets data types, people add in formulas and lookups, and then there's how people format data (I'm sure we've all seen some weird and wonderful formatting). But none of this changes the fact that there is often a lot of useful data in Excel which we might want to bring into our data lake and process in our big data pipelines.
There have been a few ways to do this to date, but a while ago I wanted to learn how to write my own Spark data source, and Excel seemed like a good place to start because, somehow, I always seem to end up with the projects that need data from it.
At the same time, Spark was moving towards Spark 3 and, with it, the newer DataSourceV2 APIs for creating data sources. So why not learn how to work with both?
I will write up another, more in-depth post on how I created the data source. The culmination of scratching that itch, though, is that the data source is now available on GitHub under the Apache License 2.0. It supports Spark 3.0 and above; it does not support Spark 2.4 and below because the DataSourceV2 API was unstable before 3.0.
Right now you need to upload and install the jar using the well-documented methods for your platform, such as Databricks or Azure Synapse Analytics.
So what can it do?
Well, first up it's available for use in Scala, Python, and Spark-SQL. In Scala and Python you can use the long format name "com.elastacloud.spark.excel", or the short format name which is just "excel".
// Scala
val df = spark.read.format("com.elastacloud.spark.excel").load("file.xlsx")
val df = spark.read.format("excel").load("file.xlsx")
# Python
df = spark.read.format("com.elastacloud.spark.excel").load("file.xlsx")
df = spark.read.format("excel").load("file.xlsx")
In Scala there is also a convenience method.
// Scala
import com.elastacloud.spark.excel._
val df = spark.read.excel("file.xlsx")
In Spark-SQL you can read in a single file using the default options as follows (note the back-ticks).
SELECT * FROM excel.`file.xlsx`
As well as using just a single file path you can also specify an array of files to load, or provide a glob pattern to load multiple files at once (assuming that they all have the same schema).
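For example, something like the following should work for reading more than one workbook at once (a minimal sketch; the paths are placeholders and all files are assumed to share the same schema).
// Scala
// Load every workbook matching a glob pattern
val monthly = spark.read.format("excel").load("/data/reports/*.xlsx")
// Or pass an explicit list of paths to DataFrameReader.load
val selected = spark.read.format("excel").load("/data/jan.xlsx", "/data/feb.xlsx")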
And it doesn't have to be just OOXML files. The library is built on top of Apache POI, which allows it to open Excel 97-2003, Excel 2010, and OOXML formats.
What else?
Well, it's worth taking a look at the README for all of the various options, but to summarize them here (there's a quick sketch of a couple of them in use after the list):
Multi-line headers
Reading from multiple worksheets
Including the worksheet name in the output
Schema inference
Cleaning of column names to avoid issues when working in Spark
Handling of merged cells
Formula evaluation (for those supported by Apache POI)
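The option keys in this sketch are my own shorthand rather than the definitive names; the README documents the exact keys and their defaults.
// Scala
// Illustrative only: the option keys below are assumptions, check the README for the real names
val df = spark.read
  .format("excel")
  .option("cellAddress", "A9")        // where the data block starts
  .option("headerRowCount", "0")      // no header rows; columns default to col_0, col_1, ...
  .option("includeSheetName", "true") // add the source worksheet name to the output
  .load("file.xlsx")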
And in practice?
Good question. I created an example notebook using data from the UK Office for National Statistics on unemployment rates between 1971 and 2021. In the notebook we read in the Excel file, transform the data, and then display a chart showing the month-by-month unemployment percentage across the entire period.
The source data from the ONS looks like the following.
The data itself starts at cell A9 and has no headers (well, it does on row 1, but in this case it's easier to ignore it). The data is also actually 3 sets in 1. After the year-by-year data in the screenshot there is the quarter-by-quarter data, and then immediately after it the month-by-month data. It's the 3rd set we're interested in, so we have to do some post-processing once we've read the data.
Here we have the solution in Azure Databricks.
And again using Azure Synapse Analytics.
As you can see (if you zoom in) the code is identical. We read in the data by specifying that the data starts at cell A9, and that there are no header rows. The data source therefore creates default column names of "col_0" and "col_1", so we replace these with our own. Following that we create a "month" column by parsing the first column with a given format string. Any rows which don't match the format get a null value, but the monthly data gets a timestamp value, so we can filter out the nulls. Finally we can display the data using the built-in charting capabilities of both platforms.
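Roughly, the notebook code boils down to something like this. It's a sketch rather than the real thing: the option keys, file path, and date format string are illustrative, and the actual notebook in the repository has the exact code.
// Scala
// Sketch of the notebook steps: read from A9 with no headers, rename the default
// columns, parse the monthly rows into a timestamp, and drop everything else.
import org.apache.spark.sql.functions._

val unemployment = spark.read
  .format("excel")
  .option("cellAddress", "A9")    // data starts at cell A9 (option key is an assumption)
  .option("headerRowCount", "0")  // no header rows -> default col_0, col_1 names
  .load("unemployment.xlsx")
  .withColumnRenamed("col_0", "period")
  .withColumnRenamed("col_1", "rate")
  .withColumn("month", to_timestamp(col("period"), "yyyy MMM")) // only monthly rows parse; format string is a guess
  .filter(col("month").isNotNull)                               // yearly and quarterly rows become null

unemployment.show() // or display(unemployment) in Databricks/Synapse to use the built-in charts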
What next?
This is an initial release, so it won't be perfect. Next steps are to get GitHub Actions configured so that the jar files and coverage information are generated by the CI process, and to upload the library to a Maven repository to make it easier to include in Spark projects.
But right now, it's available for anyone who wants to take it for a spin. If you find problems then raise an issue and let me know, or do the same if there are feature enhancements you'd like to see. Or dive in and have a go at contributing to it.