Richard vs the Microsoft Speech SDK round 2

azurecoder
Jul 23, 2024

Another week of murderous rage as we get closer to our impending Summer of Code ending on the 31st July.

It's been a while since I complained about the state of Microsoft SDKs. I won't try to fill the two-year gap with this post, but it's fair to say that I fell into the SDK trap yet again. For those who haven't read my serialised complaint stream over the years, you'll understand from here on how much of a scratched record I am when it comes to Microsoft SDKs. To cut a long story short, many of them are fundamentally broken. I decided a few years ago that I would always start with the REST API and build my own tooling. I broke my own rule to keep up the development cadence on my team's "videocracker" entry and have set fire to another 6 hours since my last post.

It's only my sheer determination to write something complete with my team and my muscle memory that will see me through here.

Let's start at the beginning. If you missed my previous post about the first wasted 14 hours of building dependencies on a Batch pool, read that first to get yourself in the mood.

This is part 2. Predictably but tragically, I moved my working code from MacOSX to Ubuntu 18 and Ubuntu 22 on a Batch node.

To run a transcription from a wav file, the simplest way to do it asynchronously is with the following code.

import time
import azure.cognitiveservices.speech as speechsdk

def transcribe_audio(self, wav_file):
    print("wav file: {}".format(wav_file))
    speech_config = speechsdk.SpeechConfig(subscription=self.key, region=self.region)
    speech_config.speech_recognition_language = "en-GB"
    audio_config = speechsdk.audio.AudioConfig(filename=wav_file)
    conversation_transcriber = speechsdk.transcription.ConversationTranscriber(
        speech_config=speech_config, audio_config=audio_config)
    transcribing_stop = False

    def stop_cb(evt: speechsdk.SessionEventArgs):
        # Tell the wait loop below to exit once the session stops or is canceled
        nonlocal transcribing_stop
        print('CLOSING on {}'.format(evt))
        transcribing_stop = True

    def transcribed_cb(evt: speechsdk.SpeechRecognitionEventArgs):
        # Called for each transcribed utterance, with the speaker id attached
        line = '{}: {}'.format(evt.result.speaker_id, evt.result.text)
        print('TRANSCRIBED: {}'.format(line))
        self.transcribed_lines.append(line)

    conversation_transcriber.transcribed.connect(transcribed_cb)
    conversation_transcriber.session_started.connect(lambda evt: print("SESSION STARTED: {}".format(evt)))
    conversation_transcriber.session_stopped.connect(lambda evt: print("SESSION STOPPED: {}".format(evt)))
    conversation_transcriber.canceled.connect(lambda evt: print("CANCELED: {}".format(evt)))
    conversation_transcriber.session_stopped.connect(stop_cb)
    conversation_transcriber.canceled.connect(stop_cb)
    conversation_transcriber.start_transcribing_async()
    # Block until stop_cb has fired
    while not transcribing_stop:
        time.sleep(.5)
    conversation_transcriber.stop_transcribing_async()

Turns out this code doesn't work on Ubuntu and fails silently. Of course it does.

SESSION STARTED: SessionEventArgs(session_id=dc36012432ec4805a331aed11c8f72e7)
CANCELED: ConversationTranscriptionCanceledEventArgs(session_id=dc36012432ec4805a331aed11c8f72e7, result=ConversationTranscriptionResult(result_id=7a57e296583a40d098512c580c156d10, speaker_id=, text=, reason=ResultReason.Canceled))
CLOSING on ConversationTranscriptionCanceledEventArgs(session_id=dc36012432ec4805a331aed11c8f72e7, result=ConversationTranscriptionResult(result_id=7a57e296583a40d098512c580c156d10, speaker_id=, text=, reason=ResultReason.Canceled))
SESSION STOPPED: SessionEventArgs(session_id=dc36012432ec4805a331aed11c8f72e7)
CLOSING on SessionEventArgs(session_id=dc36012432ec4805a331aed11c8f72e7)

This sad little output is all that's present when you try and transcribe. No reason for cancellation and certainly no transcription. On my Mac I get every single line of transcribed audio passed to the transcription event. On Ubuntu it just breaks.
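One thing worth trying at this point is printing more than the event itself in the canceled handler. A minimal sketch, assuming the canceled event exposes a cancellation_details object with reason and error_details attributes, as the SDK's other canceled events do. The helper name describe_cancellation is mine, and there's no guarantee it would have surfaced anything useful for this particular failure.

```python
def describe_cancellation(evt):
    """Return a one-line summary of why a transcription was canceled."""
    # Assumption: the canceled event carries `cancellation_details`;
    # getattr keeps this safe if the attribute is missing.
    details = getattr(evt, "cancellation_details", None)
    if details is None:
        return "CANCELED: no cancellation details on event {}".format(evt)
    return "CANCELED: reason={} details={}".format(
        getattr(details, "reason", None),
        getattr(details, "error_details", None),
    )
```

It would be hooked up with something like conversation_transcriber.canceled.connect(lambda evt: print(describe_cancellation(evt))) in place of the plain CANCELED lambda.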

My first thought was to look up the SDK online.

Given I'm now incapable of using anything other than an AI I asked ChatGPT and this is the link it gave me.

Azure Cognitive Services Speech SDK for Python

Yes, you get a 404. It's entirely possible, and also highly probable, that ChatGPT made this up, BUT I didn't think so. I looked wider to see whether I could find any other SDKs for Speech Services. Sure enough, a Go SDK and a JavaScript one, just what I need. Read through and gave up, too much abstraction. Was chatting to Darsh and it dawned on me I should just write my own SDK and wrap up the API. So I did it.

Kind of looks something like this:

import json
import requests

def create_transcription(self):
    url = f"https://{self.region}.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions"
    headers = {
        "Ocp-Apim-Subscription-Key": self.subscription_key,
        "Content-Type": "application/json"
    }
    data = {
        "contentUrls": [self.content_url],
        "locale": "en-GB",
        "displayName": self.display_name,
        "properties": {
            "wordLevelTimestampsEnabled": False,
            "languageIdentification": {
                "candidateLocales": ["en-US", "en-GB"]
            },
            "diarizationEnabled": True,
            "punctuationMode": "DictatedAndAutomatic",
            "profanityFilterMode": "Masked"
        }
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))
    # Fail loudly on a 4xx/5xx instead of carrying on with a bad response
    response.raise_for_status()
    return response.json()

Once you've created a transcription you can check whether it's available. The result gets written to a file. Everything is synchronous though: you just have to keep polling the transcription id until you get a success message, and then you can use the content link to download the details of the transcription and the JSON metadata in all its glory. A little bit shit compared to the SDK but I can live with that.
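The polling side can be sketched roughly like this. It's a minimal sketch, not my actual wrapper: wait_for_transcription and fetch_json are made-up names, and I'm assuming the transcription resource reports a "status" field that ends up as either "Succeeded" or "Failed". The HTTP call is passed in as a function so the loop can be exercised without touching the service.

```python
import time

# Assumption: these are the terminal values of the "status" field.
TERMINAL_STATUSES = {"Succeeded", "Failed"}

def wait_for_transcription(fetch_json, transcription_url,
                           interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll a transcription until it finishes and return its final JSON body.

    fetch_json is any callable that takes a URL and returns the decoded
    JSON body as a dict (hypothetical helper, supplied by the caller).
    """
    deadline = time.monotonic() + timeout
    while True:
        body = fetch_json(transcription_url)
        status = body.get("status")
        if status in TERMINAL_STATUSES:
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "transcription still {!r} after {}s".format(status, timeout))
        # Not finished yet: wait before asking again
        sleep(interval)
```

With requests, fetch_json could be as simple as lambda url: requests.get(url, headers=headers).json(), reusing the same subscription key header as the create call. The caller still has to check whether the final status is a success or a failure before going after the content link.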

I checked the JS SDK, which wasn't taken offline like the Python one, thinking that it had to do something interesting to get the transcription line by line and use JavaScript promises. Turns out it uses websockets. Checked the docs and there we go, wtt protocol. Okay, so now I'm thinking I can create my own async SDK.

Checked to see whether the Batch node could use a websocket using the following code. Damn, it worked.

import asyncio
import websockets

async def test_websocket():
    uri = "wss://echo.websocket.org"
    # Open a connection, send one message and print whatever comes back
    async with websockets.connect(uri) as websocket:
        await websocket.send("Hello WebSocket!")
        response = await websocket.recv()
        print(f"Received: {response}")

asyncio.run(test_websocket())

Not in my happy place but feel like I'm closing in on something.

Okay, so now I'm thinking I must be able to get a more verbose view. Spent some more time looking through the JavaScript SDK and then the samples and, lo and behold, turns out you can get verbose logging through a property set.

speech_config.set_property(speechsdk.PropertyId.Speech_LogFilename, "speech_sdk.log")

As Mark Russinovich says, you can never have too much logging.

I checked the logs after this and boom! something stands out.

[720977]: 3011ms SPX_TRACE_ERROR:  exception.cpp:130 About to throw Runtime error: Failed to initialize platform (azure-c-shared). Error: 2176
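The log is noisy, so it helps to pull out just the error lines. A minimal sketch, assuming error-level entries always carry the SPX_TRACE_ERROR marker seen in the line above. The function names are mine, and speech_sdk.log is the filename set in the property earlier.

```python
def find_sdk_errors(lines):
    """Yield (line_number, text) for every error-level trace line."""
    for number, line in enumerate(lines, start=1):
        # Assumption: the marker is taken from the log line quoted above
        if "SPX_TRACE_ERROR" in line:
            yield number, line.strip()

def print_sdk_errors(path="speech_sdk.log"):
    """Print every error-level line found in the SDK log file."""
    with open(path, encoding="utf-8", errors="replace") as log:
        for number, text in find_sdk_errors(log):
            print("{}: {}".format(number, text))
```

Calling print_sdk_errors() after a failed run gives you only the lines worth reading.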

Okay so I checked the library chain with ldd and it looks like it's dependent on an older version of openssl. A much older version. Of course it is. Right. Tracked down the openssl 1.1 dependency and installed directly from an older package like so.

sudo wget http://security.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2.22_amd64.deb \
    && sudo dpkg -i libssl1.1_1.1.1f-1ubuntu2.22_amd64.deb \
    && sudo wget http://security.ubuntu.com/ubuntu/pool/main/o/openssl/libssl-dev_1.1.1f-1ubuntu2.22_amd64.deb \
    && sudo dpkg -i libssl-dev_1.1.1f-1ubuntu2.22_amd64.deb

And voila. Hours of fun and frolics cursing the Speech Services SDK team and instant gratification. Works straight away.
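To stop this failing silently on the next fresh node, a preflight check before starting the transcriber would do. A minimal sketch, assuming the SDK's native library needs the sonames libssl.so.1.1 and libcrypto.so.1.1 (my reading of the ldd output, not something from the SDK docs). The function names are made up.

```python
import ctypes

def can_load(library_name):
    """Return True if the dynamic loader can find and open the library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True

# Assumption: these sonames are what the Speech SDK native code links against.
def missing_libraries(names=("libssl.so.1.1", "libcrypto.so.1.1")):
    """Return the subset of names that cannot be loaded on this machine."""
    return [name for name in names if not can_load(name)]
```

If missing_libraries() comes back non-empty, bail out with a proper error message rather than waiting for a canceled event with no reason attached.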

I'll probably carry on and write my own SDK which isn't dependent on libraries from the neolithic era, so watch this space and mine or Elastacloud's GitHub if you want to use something a bit lighter weight and Python native. I have to say I was pretty hardcore with C++, then C# and Java, then Scala. My love for Python hasn't really surfaced. I'm struggling not to loathe it currently but this is the new world so I'm going with it.

Happy trails!
