• dazfuller

Dropping a SQL table in your Synapse Spark notebooks

One of the nice things with Spark Pools in Azure Synapse Analytics is how easy it is to write a data frame into a dedicated SQL Pool.

curated_df.write.mode("overwrite").synapsesql("curation.dbo.feed_in_tarrifs", Constants.INTERNAL)

There's couple of gotchas though. The first is that this is only available through Scala (though you can work around this to a degree with Python). But the one which normally catches people pretty quickly is that you can only do it once.


The once thing is a pretty big gotcha, in that it normally comes when you try to run the cell again, but you get an error. So you check the documentation and then spot the following written in there.

The table must not exist in the dedicated SQL pool or an error will be returned stating that "There is already an object named..."

So what can you do?


Well, one way could be to add an activity in your Synapse pipeline which deletes the table first? Or you could just drop in and manually drop the table? Both would work but you don't really want to be running a pipeline or switching views when working on a problem, so what else can we do?


Whilst Spark 3 in Synapse currently doesn't support SQL with JDBC connectivity, it does come the Microsoft JDBC Driver for SQL Server, and this gives us options.


In the solution here, I'm going to show how to drop the table using the JDBC driver, and using the Synapse Managed Identity. This means that the code will work when run as part of a pipeline, and it means that you don't have to start assigning permissions to others, as long as the managed identity has sufficient permissions in the first place. So what's the first step?


N.B. I'm doing all of this in Scala as it's much easier to work with for this purpose


I'm going to be using the TokenLibrary here, but you could do something similar with mssparkutils. The reason I'm not? Well the Token Library gives us a map back which makes this a lot easier to work with. To use it we're just need to do the following (we're also going to bring in some Java SQL libraries for use later)

import com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

Next we want to get the server name that we want to connect to. When you create a Synapse Analytics workspace it automatically creates a default Linked Service for us, we're going to use it to get the server endpoint. We're going to use mssparkutils to get the workspace name so that we can re-use the same code in multiple workspaces.

val connectionProperties = TokenLibrary.getPropertiesAsMap(s"${mssparkutils.env.getWorkspaceName}-WorkspaceDefaultSqlServer")

val server = connectionProperties.get("Endpoint").get.asInstanceOf[String].split("[:,]")(1)

The split operation is because the endpoint takes the form of "tcp:<server name>,<port>", and we just want the name part. We're also having to convert to string because the value in the map is an Any type.


Importantly we also need to authenticate with the endpoint. We already have managed identity access but to use this we'd need to know the client id. We can get that from the workspace properties but I really don't want to hard-code anything here. So instead we're going to use Token Based Authentication. So how do we get a token?


Well, the Linked Service makes this really easy for us as it's already done the hard work. Using the connectionProperties we collected previously we can simply access the AuthKey property and get our token.

val authKey = connectionProperties.get("AuthKey").get.asInstanceOf[String]

Once we have that, and we've set our database name (SQL Pool name), we can connect and drop our table.

val ds = new SQLServerDataSource()
ds.setServerName(server)
ds.setDatabaseName(dbname)
ds.setAccessToken(authKey)

var connection: Connection = null

try {
    connection = ds.getConnection
    val stmt = connection.createStatement

    stmt.executeUpdate("DROP TABLE dbo.my_example_table")
} catch {
    case e: Throwable => println(e.getMessage)
} finally {
    if (connection != null && !connection.isClosed) {
        connection.close()
    }
}

And that's it :)


There's just a word of warning here.


This connection will come through as being from the managed identity. So you don't want people using the environment to follow this same approach generally, otherwise you won't know who is actually logging in. It's the same as using a single SQL account for all your devs. So this best left to processes which will run as part of a pipeline.


Otherwise, have fun making the most out of the synapsesql writer.

45 views0 comments

Recent Posts

See All