dazfuller

Documentation the easy way

So a slight departure from Spark (sort of) for this post, but I wanted to look at one of the most commonly overlooked aspects of building out a custom library, documentation!


Wait, wait... Okay, so I know it's not the most exciting subject in the world and, let's be honest, most of us will come up with an array of excuses to avoid doing it. So I want to cover off how we can do this in an automated way, making the docs look good in the process, and playing with another toy from Microsoft, Azure Static Web Apps. Hopefully that's enough shiny to keep you interested :)


I'm going to look back at a demo Spark library I created, documenting the public functions and automating the deployment of the documentation using Azure DevOps. I tend to find documentation gets a lot more fun if you don't have to open up a word processor.


The first step is to document our code, which in this case is Python. This is easily done, and you can find tools to help with the task. The best bet is always to follow a standard which conforms to the PEP 257 docstring convention. There are a lot of variations on this; I'm using the Google format in this example, but if you're creating a PySpark library I'd suggest adopting numpydoc, as this is the format PySpark itself is standardising on. If you're using Visual Studio Code then there is the excellent Python Docstring Generator extension, which helps you write the docstrings, reads type information from your type hints, and can be configured for different formats.


In this example library I have a function for converting an Excel timestamp serial value (it's always Excel isn't it!) into a Python datetime value. Included is some basic information about the function, plus references to other documentation which the user might find useful.


from datetime import datetime, timedelta, timezone
from typing import Optional


EXCEL_EPOCH = datetime(1899, 12, 30, tzinfo=timezone.utc)


def from_excel_date_func(serial: float) -> Optional[datetime]:
    """Parses an Excel serial to a datetime value with precision to
    the second.

    Excel stores dates as a decimal value such as 43913.66528, where
    the integer component is the number of days since the Excel epoch
    (effectively 30th December 1899, because Excel wrongly treats 1900
    as a leap year), and the decimal component is a fractional part
    of the day.

    For a further explanation see:
    https://www.excelcse.com/how-does-excel-store-dates-and-times/

    Args:
        serial (float): The Excel serial date

    Returns:
        Optional[datetime]: Parsed datetime as UTC, or None if the
        serial cannot be parsed
    """
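    try:
        days = int(serial)
        # Fractional part of the day, converted to milliseconds
        millis = (serial - days) * 86_400_000
        return (
            EXCEL_EPOCH + timedelta(days=days, milliseconds=millis)
        ).replace(microsecond=0)
    except (TypeError, ValueError, OverflowError):
        # None, NaN, infinite or out-of-range serials cannot be parsed
        return None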
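
To make the arithmetic concrete, here is a small worked sketch using the example value from the docstring. It repeats the same steps outside of the function (whole days, then the fraction of a day), and assumes the same 30th December 1899 epoch used above.

```python
from datetime import datetime, timedelta, timezone

# Same epoch as the library function above
EXCEL_EPOCH = datetime(1899, 12, 30, tzinfo=timezone.utc)

serial = 43913.66528

# Integer component: whole days since the epoch
days = int(serial)

# Decimal component: fraction of a day, expressed here in seconds
seconds = (serial - days) * 86_400

# Add both parts to the epoch and drop anything below a second
parsed = (EXCEL_EPOCH + timedelta(days=days, seconds=seconds)).replace(microsecond=0)

print(parsed.isoformat())  # 2020-03-23T15:58:00+00:00
```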

The docstring above is pretty basic, but we can improve it by incorporating aspects of Markdown (such as the link out for further information). With this in place our chosen IDE can provide better in-line information for us, we can use the help() function to get at the information, and we can auto-generate documentation.
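
As a quick illustration of how the docstring is available at runtime, the sketch below reads it back using the standard library's inspect module. The fahrenheit_to_celsius function is a made-up example, not part of the demo library; it just shows a Google format docstring being picked up.

```python
import inspect


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Converts a temperature from Fahrenheit to Celsius.

    Args:
        temp_f (float): Temperature in degrees Fahrenheit

    Returns:
        float: Temperature in degrees Celsius
    """
    return (temp_f - 32) * 5 / 9


# getdoc returns the docstring with the indentation cleaned up
doc = inspect.getdoc(fahrenheit_to_celsius)

# By PEP 257 convention the first line is the one-line summary
summary = doc.splitlines()[0]
print(summary)

# In an interactive session, help(fahrenheit_to_celsius) displays the same text
```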


So how do we auto-generate the documentation? Well, there are a few ways, but my favourite is to use the amazing pdoc3 library. With the defaults it produces beautiful looking documentation, it generates appropriate HTML from Markdown in the docstrings, and it can run locally using its own web server so you can view the documentation as you're writing it. There's also the ability, with some other extensions, to have it produce PDF documentation if you really want to.


Once we've installed the library using pip we can generate our documentation by simply running the following.

python -m pdoc -o docs --html <module name>

This will produce HTML output in the "docs" directory. We could also write out Markdown, which is the default, but that won't work with the static web app later.
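
If you'd rather drive this from a Python build script than type the command, a minimal sketch might look like the following. It only wraps the same command line shown above using subprocess; the build_pdoc_command and generate_docs helpers are hypothetical names of my own, and it assumes pdoc3 is installed in the current interpreter and that the module is a package (so the output lands in a folder named after it).

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_pdoc_command(module: str, output_dir: str = "docs") -> List[str]:
    """Builds the same pdoc command line as shown above."""
    return [sys.executable, "-m", "pdoc", "--html", "--output-dir", output_dir, module]


def generate_docs(module: str, output_dir: str = "docs") -> Path:
    """Runs pdoc and returns the folder the HTML is assumed to be written to."""
    # check=True raises CalledProcessError if pdoc exits with an error
    subprocess.run(build_pdoc_command(module, output_dir), check=True)
    # Assumption: for a package the HTML ends up in <output_dir>/<module>
    return Path(output_dir) / module


# Example usage (requires pdoc3 and an importable module):
# generate_docs("dazspark")
```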

[Screenshot: the auto-generated documentation as HTML]

And that's pretty much all there is to it. Of course we can use a custom template as well if we wanted to make things more on-brand, but I'm going to keep the defaults because I think it looks nice enough.


So in a few easy steps we're creating our docstrings and auto-generating documentation. Not a word processor in sight, and hopefully a method which encourages us to keep the docstrings up-to-date. Now, this is all well and good, but if we can't share it then what good is it doing? When I originally gave a talk using this library I used an Azure Storage Account to host the static website, but since then Azure Static Web Apps have come along and I think they're a bit nicer for this kind of job.


Whilst a lot of the demonstrations of Static Web Apps are around things like Angular apps, they also support custom apps where we're not using a framework. They also offer a free tier which is useful for a personal site, or just for trying things out. They're really easy to deploy as well.


For this use case I have a couple of requirements.

  • Needs to be deployed from an Azure DevOps build

  • Must be secured so that only authenticated users can access the documentation

All of the tutorials and examples tend to show deploying straight from GitHub, but there is DevOps documentation as well. Whilst it shows getting the web app code from a git repo, there are really only a few things we need to know.

  • Where is the app code?

  • Where is the output folder?

  • What's the deployment key?

Because these are simple requirements it means we can generate the documentation as a web app as part of the build, and then publish it to the static web app.


The trickier item (or, less simple) is the authentication requirement. The information needed for this is in the documentation, but it needs teasing out. What it comes down to though is routing.


As part of our application we can specify routes. These routes allow us to redirect to different pages, require users to be authenticated, or require users to be in specific roles to access different parts of the web app. Ours is a pretty simple routing requirement, anyone accessing any page must be authenticated, and so our staticwebapp.config.json file looks like this.


{
  "routes": [
    {
      "route": "/login",
      "redirect": "/.auth/login/aad"
    },
    {
      "route": "/*",
      "allowedRoles": [
        "authenticated"
      ]
    }
  ],
  "responseOverrides": {
    "401": {
      "redirect": "/login",
      "statusCode": 302
    }
  }
}

In this we are saying that accessing any page requires the user to be authenticated, that we have a /login path which redirects to the Azure Active Directory login provider, and that we are overriding the 401 error to redirect users to the login if they are not authenticated.
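
To see how those rules play out for a request, here is a toy model of the routing written in Python. It is only a rough sketch of the behaviour described above and not how the service is actually implemented: the resolve function is hypothetical, it uses fnmatch for the wildcard, and it assumes rules are checked in order with the first match winning and that a plain redirect is a 302.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Optional, Tuple

# The same rules as the staticwebapp.config.json above
CONFIG = {
    "routes": [
        {"route": "/login", "redirect": "/.auth/login/aad"},
        {"route": "/*", "allowedRoles": ["authenticated"]},
    ],
    "responseOverrides": {"401": {"redirect": "/login", "statusCode": 302}},
}


def resolve(path: str, roles: Iterable[str]) -> Tuple[int, Optional[str]]:
    """Returns (status code, redirect target) for a request in this toy model."""
    user_roles = set(roles)
    for rule in CONFIG["routes"]:
        # Rules are checked in order and the first matching route wins
        if not fnmatchcase(path, rule["route"]):
            continue
        if "redirect" in rule:
            return 302, rule["redirect"]
        allowed = rule.get("allowedRoles")
        if allowed and user_roles.isdisjoint(allowed):
            # Not permitted: apply the 401 override if one is configured
            override = CONFIG["responseOverrides"].get("401")
            if override:
                return override["statusCode"], override["redirect"]
            return 401, None
        break
    return 200, None


print(resolve("/index.html", ["anonymous"]))                   # (302, '/login')
print(resolve("/login", ["anonymous"]))                        # (302, '/.auth/login/aad')
print(resolve("/index.html", ["anonymous", "authenticated"]))  # (200, None)
```

So an anonymous visitor bounces from any page to /login, then on to the login provider, whilst a signed-in user gets straight through.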


The issue with this config file is that pdoc3 doesn't generate it for us. To resolve this we can either generate it as part of the build, or add it to our source code and copy it into the auto-generated documentation folder as part of the build process. Either way it needs to go into the root folder of the app. I'm generating it as part of the build in a PowerShell step because I seem to like doing things the difficult way!

python3 -m pdoc --html --output-dir docs dazspark

$webAppRouting = @{
    routes = @(
        @{
            route = "/login"
            redirect = "/.auth/login/aad"
        },
        @{
            route = "/*"
            allowedRoles = @("authenticated")
        }
    )
    responseOverrides = @{
        "401" = @{
            redirect = "/login"
            statusCode = 302
        }
    }
}

$webAppRouting | ConvertTo-Json -Depth 10 | Out-File ./docs/dazspark/staticwebapp.config.json
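
If PowerShell isn't your thing, the same file could be written from Python instead. This is just a sketch of an alternative, not what my build uses; write_routing_config is a hypothetical helper and the docs/dazspark path is the output folder from the pdoc command above.

```python
import json
from pathlib import Path

# The same routing rules as the PowerShell hashtable above
WEB_APP_ROUTING = {
    "routes": [
        {"route": "/login", "redirect": "/.auth/login/aad"},
        {"route": "/*", "allowedRoles": ["authenticated"]},
    ],
    "responseOverrides": {
        "401": {"redirect": "/login", "statusCode": 302},
    },
}


def write_routing_config(docs_root: Path) -> Path:
    """Writes staticwebapp.config.json into the root of the generated docs."""
    docs_root.mkdir(parents=True, exist_ok=True)
    target = docs_root / "staticwebapp.config.json"
    target.write_text(json.dumps(WEB_APP_ROUTING, indent=2), encoding="utf-8")
    return target


# Example usage, after pdoc has written the HTML:
# write_routing_config(Path("docs/dazspark"))
```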

This single PowerShell step generates our documentation using pdoc and adds in the config file. This then lets us have the later step which deploys the static web app. The token is a variable which comes from a variable group so that I can secure it from prying eyes.

- task: AzureStaticWebApp@0
  inputs:
    app_location: './docs/dazspark'
    azure_static_web_apps_api_token: '$(webapp_token)'
  displayName: 'Deploy docs to static web app'

The step takes a while as it has to generate a Docker image to deploy, but once it's done... We have our web app published and operational.

[Screenshot: the auto-generated documentation deployed to an Azure Static Web App]

And that's it. We have the steps in our build process and we can now point everyone at our documentation to help them build the cool apps that they need to, and it's updated every time we do a build (or at certain stages if we want). The domain is auto-generated for us by Azure but we can specify a custom domain if we want to.


Azure Static Web Apps are capable of a lot more and so I would recommend checking them out as they're a useful service and very easy to configure. But hopefully from this a few more people might consider documenting their code, especially if they get to play with new toys.
