• azurecoder

The negative externality of badly written SDKs

I've taken the last week off. For the first time in about a decade I haven't responded to emails (at least I've picked and chosen), Teams messages (same) and I've signed out of all of my customer accounts. It has been bliss.

During this time I've been reflecting a lot on Elastacloud and our journey. I've also written a learning platform for our Data Academy.

In this post I'd like to rant again and sum up something that we put up with but don't factor into our cost calculation. We have contracts with our customers and we have service credits which mean that if we fail at something it costs us because it costs our customers. I'd like to see more of this with our cloud providers.

So, here's my story. Years ago there was no C# SDK for Azure. That's when I built Fluent Management and thousands of people started using it because there was nothing else. At the time all Azure's APIs were undocumented and I spent many an hour reverse engineering Azure with Andy Cross. Creating a VM at the time was a nightmare, you had a block of XML as the API payload and it took an afternoon to work out that the XML was order dependent (e.g. if you've used SAX over DOM in the past ...).

Now of course everything is written for us. We have a CLI, POSH, SDKs, nice portal etc. and if we really want we have documented endpoints for APIs. A lot has changed in a decade.

Recently I've been immersed in Sustainability, so I've been going back to my days studying economics and understanding that, with negative externalities, "polluter pays" is not a thing. With a rage building inside, I'll start to build up a case for what badly written APIs cost consumers, who now number in the hundreds of thousands or millions according to download numbers on package managers like nuget.

I needed to spend some time the other day taking a service principal and using it to add an AAD user in the contributor role to a resource group. Not an uncommon activity: most people will do this interactively through the Azure Portal, or via some kind of PowerShell automation if they are building out provisioning workflows.

I'm going to cite Darren Fuller from our previous podcast episode 9 things I love and hate about Azure:

Darren mentioned that documentation in Azure to do anything useful was terrible and talked about building out the docs to use Self Hosted Runtimes in ADF, something that's fundamental to all of our customers. The takeaway here is that if you want to do something non-trivial you won't find it documented anywhere; docs are either basic or autogenerated.

Anyway, it took me the better part of 2 hours searching the internet to not find what I needed, bar a few examples that were years old. I wanted to find a C# SDK call to add the AAD user as described above. I started with the new ArmClient, thinking that this might be a function of the resource group in the new SDK model. It wasn't. I finally worked out that this was a function of either a RoleDefinition, which defines roles and permissions, or a RoleAssignment, which assigns a principal in a role to a resource. Okay, so clocked up 3 hours now and not written a line of code.

Now I was confused because it made more sense to me to add the RoleAssignment as a function of a resource group or an extension to a resource in general no? Surely that's common sense? Anyway, despite the way the SDK was built (now I had to use two libraries, one to create the resource group and another to assign the role) I wanted to crack on.

Okay, so no docs. Checked on GitHub, thanks Microsoft for making this open source. Helps immensely. Finally found the AuthorizationManagementClient and started looking through docs. All autogenerated. Started looking through tests. Also autogenerated. No clear codedoc, so didn't know what the first parameter of the CreateById method was supposed to be. Spent an hour or so looking through tests and got a vague idea that I needed to construct a role/permission endpoint to the resource, which I've seen before. We're now about 4.5 hours in.

Okay, started writing code now. First hurdle, roleAssignmentId parameter is wrong, no docs, looked through tests. Got it finally. Another half hour burned. 5 hours in.

When I added the nuget package to the project I was a little suspicious because the last 13 packages were preview. Dismissed it though. Finally got another error. In the version I had, the api-version is appended to the end of the request URL, and that api-version doesn't exist anymore.

{"error":{"code":"InvalidApiVersionParameter","message":"The api-version '2020-04-01-preview' is invalid. The supported versions are '2022-05-01,2022-03-01-preview,2022-01-01,2021-04-01,2021-01-01,2020-10-01,2020-09-01,2020-08-01,2020-07-01,2020-06-01,2020-05-01,2020-01-01,2019-11-01,2019-10-01,2019-09-01,2019-08-01,2019-07-01,2019-06-01,2019-05-10,2019-05-01,2019-03-01,2018-11-01,2018-09-01,2018-08-01,2018-07-01,2018-06-01,2018-05-01,2018-02-01,2018-01-01,2017-12-01,2017-08-01,2017-06-01,2017-05-10,2017-05-01,2017-03-01,2016-09-01,2016-07-01,2016-06-01,2016-02-01,2015-11-01,2015-01-01,2014-04-01-preview,2014-04-01,2014-01-01,2013-03-01,2014-02-26,2014-04'."}}

This was strange but thought that this might have something to do with the preview tag in nuget. Went back a version from 2.13 to 2.12 and lo and behold a different api-version preview. Went through every single one of these to understand whether each preview nuget package was reliant on a preview api-version querystring. 6 hours in.

Okay, so then I went back to the last stable version in nuget. This was 2.0.1 and dated back to 2017. The interface had changed a lot since then so had to make code changes. 6.5 hours in.

So ran this and boom. Missing assemblies, since it's 2017 it's reliant on .NET 4 assemblies and I'm on 6 and couldn't mess around with retargeting and causing myself multiple conflicts. Working this backwards took another 30 minutes. 7 hours in.

Okay, moved everything back to the preview version now; back to the old api-version error. Thought to myself, there must be a way to override this. So decided to look through the code. Worked out the structure and finally understood how everything branched off from extension methods and got to this file.

azure-sdk-for-net/RoleAssignmentsOperations.cs at main · Azure/azure-sdk-for-net (github.com)

Looked through the code file and here it was, clear as day.

string apiVersion="2020-08-01-preview";

So now we're 7.5 hours in and I don't have a codebase. I decided to spend some time looking further and it turns out that you can't override the api-version. Insanely, in the codebase, right after setting that local variable to a literal, you have this gem.

if (apiVersion!=null)            
	_queryParameters.Add(string.Format("api-version={0}", System.Uri.EscapeDataString(apiVersion)));            

In Elastacloud this would probably be a firing offence. I guess one of the issues I have is that it's impossible to work out whether this was an issue created by someone or the autogenerated output which passes for code that's then customised. Either way it's terrible. I also looked at the commit history and saw that every now and again someone would add to the codebase by updating the api-version to a new preview version not thinking or understanding that it would expire and become a GA version. Doh!
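To make the complaint concrete: the null check above can never fail, because the variable was assigned a string literal a few lines earlier, and the caller has no way in. A minimal sketch of the alternative I'd have expected looks something like the following. The class name is hypothetical and not part of azure-sdk-for-net, and the default version string is an assumption, so check the REST docs for the current GA value.

```csharp
#nullable enable
using System;

// Hypothetical type for illustration only; it is NOT part of azure-sdk-for-net.
public sealed class RoleAssignmentQueryOptions
{
    // Assumption: a GA api-version. Check the REST docs for the current one.
    public const string DefaultApiVersion = "2022-04-01";

    public string ApiVersion { get; }

    // Callers may pass their own api-version; null or blank falls back to the default.
    public RoleAssignmentQueryOptions(string? apiVersion = null)
    {
        ApiVersion = string.IsNullOrWhiteSpace(apiVersion) ? DefaultApiVersion : apiVersion;
    }

    // No null check needed here: the constructor guarantees a value.
    public string ToQueryString() => $"api-version={Uri.EscapeDataString(ApiVersion)}";
}
```

With something like this, `new RoleAssignmentQueryOptions("2020-10-01").ToQueryString()` gives `api-version=2020-10-01`, and a consumer stuck on an expired preview version can move on without waiting for a new package.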

It was clear that nobody has or had tested this component on anything real for 5 years. It's not okay to mark something as preview and not test it.

Having spent all this time I then decided to write a bug report and added it to the > 1.6K issues for the .NET Azure SDK. I got a lovely response from a bot that worked out that I was ranting about Authorization so it was routed to the correct person. 8 hours in.

Anyway, I still needed to write my piece so I looked up the REST API.

Assign Azure roles using the REST API - Azure RBAC | Microsoft Docs

Half an hour later it's done and working.
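For anyone in the same hole, here is a rough sketch of what that REST call can look like in C#. It is not my production code. The helper name is made up, every identifier passed in is a placeholder, and the api-version is an assumption taken from the REST article, so check it before relying on it. The GUID is the built-in Contributor role definition ID, and principalId is the object ID of the AAD user.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

// Hypothetical helper: builds the PUT request that creates a role assignment
// at resource group scope. It only constructs the request; nothing is sent.
public static class RoleAssignmentRequest
{
    // Assumption: a GA api-version for Microsoft.Authorization/roleAssignments.
    public const string ApiVersion = "2022-04-01";

    // Built-in Contributor role definition GUID.
    public const string ContributorRoleId = "b24988ac-6180-42a0-ab88-20f7382dd24c";

    public static HttpRequestMessage Build(
        string subscriptionId, string resourceGroup, string principalId,
        Guid assignmentName, string bearerToken)
    {
        // The scope is the resource the role applies to: here, a resource group.
        string scope = $"/subscriptions/{subscriptionId}/resourceGroups/{resourceGroup}";

        // The role assignment ID is the scope plus a new GUID under the
        // Microsoft.Authorization provider.
        string url = "https://management.azure.com" + scope +
            $"/providers/Microsoft.Authorization/roleAssignments/{assignmentName}" +
            $"?api-version={ApiVersion}";

        string body = JsonSerializer.Serialize(new
        {
            properties = new
            {
                roleDefinitionId = $"/subscriptions/{subscriptionId}" +
                    $"/providers/Microsoft.Authorization/roleDefinitions/{ContributorRoleId}",
                principalId
            }
        });

        var request = new HttpRequestMessage(HttpMethod.Put, url)
        {
            Content = new StringContent(body, Encoding.UTF8, "application/json")
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", bearerToken);
        return request;
    }
}
```

Sending it is then a single `await httpClient.SendAsync(request)` with a token for the service principal. Getting that token is left out here on purpose.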

I guess the point of this post is that it cost Elastacloud 8 hours of my time, which is hugely precious, all leading to a bug report that helped Microsoft and not Elastacloud. I see Microsoft (and Azure) as being like an oil company. You don't go to the pumps at a BP or Shell forecourt and have someone say, "Sorry, the pumps don't work so you can't have any petrol in your car, but you still have to pay for the oil drilling, oil refining, pipelining and logistics." Either don't release SDKs at all and say that everyone needs to use REST APIs (which is fine, I can do that) or test them properly, because in this case I/Elastacloud had to pay the social cost of shoddy software and it ran into the thousands to test this. I could add that I see the same for people at Elastacloud who are using SDKs for Digital Twin, Batch and Purview currently. It's either issues with the docs or issues with the autogenerated code.

So, to end here: untested SDKs are a negative externality to all of the consumers of your software.

I just want to end by quoting Darren, who had to wait for his word in edgewise after my ranty monologue.

I know it's meant to save time when the APIs change, but when the SDK requires more effort than making the changes manually then what's the point of it

Note: I'm discussing Azure per se but I asked around people I know in my customers and Google and AWS do the same, in one case on an epic scale.
