azurecoder

Data Smoosh

I was mulling over whether to derive a new jocular term for a Data Mesh. I pondered Data Mess, but that seemed too obvious, so I've opted for Data Smoosh, which isn't as poetic but kind of reminds me of Smash, you know, the horrible potato powder that your gran poured kettle water over and force-fed you when you were young (if you were born in the 70s/80s at least).


So let's start, as always, with a few definitions. What is a Data Mesh? It's basically this: Data Mesh | Thoughtworks. That's the paper written by Thoughtworks describing it, and a bunch of cloud vendors have thought it through and provided context-sensitive approaches to implementation. It's of course the next biggest enterprise brainwave. Just as the cloud seeks to reduce the complexity of just about everything, here we go again putting the Enterprise back into Enterprise IT.


In the implementations I've been involved in, it's about providing a single point of governance, control, and ingestion and transformation operators for all segments of a business and its data domains, so that they have low-level control over their data: how it's stored, who has access to it, a view of its lineage and classification, and a whole host of other centralised features. Most of what I've seen in Azure makes this a data hub-and-spoke solution, where the business segments consume a bunch of centralised services related to data.


On the plus side, it means you can bridge the skills gap by using a model-based approach and a centralised team with stronger skills. You then just provide a guide for others to consume data services. Sounds perfect, yep? Uh-uh.


The reality is that it takes ages and is fraught with issues. The patterns, practices and automation around this are still very immature, and some of the Microsoft stack providing governance isn't evolved enough to support customer asks, so you'll find a lot of bespoke development and confusion, with developers having to write custom APIs, custom lineage capability, custom access controls, custom anonymisation and custom metadata ingestion capabilities. That's a lot of custom!


I've had quite a bit of experience with Data Meshes now (including with my tiny Elastacloud) and I don't particularly like them. Don't get me wrong, the theory is sound, and the practice covers the flange that we all discuss daily when we talk about how the single version of the truth has moved to the data lake. However, the practicalities of having multiple storage accounts, governance across them all, and data owner control of everything to do with their datasets all become so much harder at scale. Here's a thought: let's not do it ... at least until we're ready for a change like this organisationally and the tool chain catches up.


Going back to our podcast, episode 3 of Tech from the Top https://music.amazon.com/podcasts/ad267717-fff3-49b8-a470-048b0a4ff1b6/episodes/92e19c52-efcf-4913-ae97-8381e43d4fe8/tech-from-the-top-9-things-i-love-and-hate-about-azure where Andy and I discussed abstractions that make the cloud simpler, I feel the Data Mess goes the other way and ends up with a big smoosh. My general feeling about this, after pondering for more than a year, is that if I had a company of greater than 100K people who were looking to share their data, I would find a different way.


So that I'm not just complaining yet again, here is what I think is a better approach for now: provide a standardised framework for each data domain to manage its own data. In Elastacloud we use Data Lake in a Box, a framework that allows us to modularise the deployment of a Data Platform with all the permutations you would need, centralised around the single version of the truth being the Data Lake. For the most part our implementation uses the CSTAR Alliance as guidance, so that we know we have a secure, compliant and standardised design. Providing these templates as Azure Blueprints, or just as managed deployments for departments in an organisation, makes it easy to standardise everything you need to; you can then use patterns like watermarking and ADF's ingestion features, trusting that you have teams that can do it. In our deployments we provide samples for each type of ingestion capability you'd need, along with common sample implementations of Modern Cloud Data Warehouse design. Trust your developers by providing them with the right pre-configured infrastructure and the right kind of templates that you can make idiot-proof.
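As a rough illustration of the watermarking pattern mentioned above (the names and shapes here are my own, not taken from Data Lake in a Box), each incremental load reads only rows newer than the stored high-water mark and then advances it. In ADF this would typically be a Lookup activity against a SQL watermark table; the sketch below keeps the watermark in memory:

```python
from datetime import datetime, timezone

# Hypothetical stand-in for the watermark table an ADF pipeline
# would normally keep in SQL: one high-water mark per source table.
watermarks = {"sales.orders": datetime(2023, 1, 1, tzinfo=timezone.utc)}

def incremental_load(table, source_rows):
    """Return only rows modified since the last watermark, then advance it."""
    low = watermarks[table]
    new_rows = [r for r in source_rows if r["modified"] > low]
    if new_rows:
        # Advance the watermark to the newest row we just ingested.
        watermarks[table] = max(r["modified"] for r in new_rows)
    return new_rows
```

Re-running the same load is then harmless, because everything below the watermark is filtered out, which is what makes the pattern safe to hand to domain teams as a template.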


This is actually one of the patterns we wanted to use for a platform-centric approach instead of the Data Mesh, and it would have been much better for our customers. There has been a lot of excitement around Purview, but I've seen some scalability problems for patterns like the Data Mesh. I did propose about 18 months ago that a much better pattern would be to relay Purview scans and other objects through a Purview hierarchy, so that you could have an enterprise layer and local layers: the local layers would be relevant for workflows, and the enterprise layer for data discovery. There are patterns, such as using an Event Hub to emit a message on each scan result, that could serve to collect all the Purview data. This could be used to build a larger instance of a catalogue, with asynchronous addition of scan sources, results and other objects, so that you have a queued and managed approach to populating a central catalogue within the constraints of Purview compute. I've done a few tests to prove this out, though not at huge scale, and I think it would work.
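To make the relay idea concrete, here is a minimal local sketch under stated assumptions: a `queue.Queue` stands in for the Event Hub, and the function names are mine, not part of any Purview API. A real implementation would publish with `EventHubProducerClient` from the azure-eventhub SDK and consume on the enterprise side, but the queued upsert logic is the same:

```python
import json
import queue

# Local stand-in for an Azure Event Hub: each local Purview layer
# publishes its scan results here; the enterprise layer drains it.
bus = queue.Queue()
central_catalogue = {}  # asset qualified name -> latest scan metadata

def publish_scan_result(source, asset, classification):
    """A local Purview layer emits one scan result onto the bus."""
    bus.put(json.dumps({"source": source, "asset": asset,
                        "classification": classification}))

def drain_into_central():
    """The enterprise layer consumes queued results and upserts them,
    so the central catalogue fills up asynchronously, within whatever
    compute budget the central instance has."""
    while not bus.empty():
        event = json.loads(bus.get())
        central_catalogue[event["asset"]] = event
```

The point of the queue is back-pressure: the central catalogue ingests at its own pace rather than being hammered by every domain's scans at once.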


Anyway, think twice before you start building your Data Mesh. It's a complete beast and requires proper planning and timelines to manage expectations internally. If you prefer not to use this pattern, consider the decentralised approach I've highlighted, which could prove better and simpler to implement.


It's worth reading the Microsoft documentation, which advocates a high-level architecture and tool chain but is fairly nondescript. At least you'll get an idea of intent as this pattern gains ground.


What is a data mesh? - Cloud Adoption Framework | Microsoft Docs


A more detailed view of actors and data sharing is here:


Data contracts - Cloud Adoption Framework | Microsoft Docs


Just to reiterate, Data Mesh is a big commitment, so challenge yourself before you start, and if you can't, ping me a note and I'll happily challenge you.


Happy trails!
