Metadata Driven Fabric Pipelines (2 of 2) - Dynamic Pipelines & Deployment

So, you've created a wonderful pipeline in Microsoft Fabric, but everything is point-and-click or copy-and-paste when changes need to be made, either within a single environment or after deployment to a new workspace. For anyone who wants to re-create what I'm working on up until this point - that is, a sample lakehouse, warehouse, basic ingestion pipeline and dataflow, and a deployment pipeline - please see part 1 in my last post. There's some more information there around introduction and context, but the aim of this blog is to focus on driving a Fabric pipeline through metadata, parameters, and variables, so let's get stuck in.

Creating the metadata

Our first step is to create the warehouse table we will use to drive the dynamic content used later. Before moving on to the actual table creation, it's worth opening VS Code, Notepad, OneNote, or wherever you keep text for later use, and grabbing your warehouse ID and SQL connection string. For the warehouse ID, open the warehouse from your workspace and copy the text after '/warehouses/' and up to, but excluding, '?experience' in the URL (the red box in the image below); for the connection string, open the warehouse settings, navigate to SQL endpoint, and copy the connection string. Do this for both the dev and test environments. We also want to grab a couple of other bits of information while we're here: the text after '/groups/' but before '/warehouses/' (the blue box in the image below) is your workspace ID, and the lakehouse ID follows the same process as the warehouse ID, except you'll see '/lakehouses/' in the URL.

[images]

Use these values to replace the 'mywarehouse...', 'workspace_id...', 'warehouse_id...', and 'lakehouse_id...' placeholders in the SQL query below, then run it in the sample warehouse in the dev environment. I've taken the approach of only creating this table once and pointing all environments back to the table in dev, but it could just as easily be run in each environment.

	DROP TABLE IF EXISTS [Sample Warehouse].[dbo].[environmentvariables]
	CREATE TABLE [Sample Warehouse].[dbo].[environmentvariables]
	(
	    [Environment] [varchar](8000) NULL,
	    [Variable] [varchar](8000) NULL,
	    [Value] [varchar](8000) NULL,
	    [workspaceid] [varchar](8000) NULL
	)
	INSERT INTO dbo.environmentvariables ([Environment],
	    [Variable],
	    [Value],
	    [workspaceid])
	VALUES 
	    ('Dev', 'sqlstring', 'mywarehousex.datawarehouse.fabric.microsoft.com', 'workspace_id_a'),
	    ('Dev', 'warehouseid', 'warehouse_id_x', 'workspace_id_a'),
	    ('Dev', 'lakehouseid', 'lakehouse_id_x', 'workspace_id_a'),
	    ('Test', 'sqlstring', 'mywarehousey.datawarehouse.fabric.microsoft.com', 'workspace_id_b'),
	    ('Test', 'warehouseid', 'warehouse_id_y', 'workspace_id_b'),
	    ('Test', 'lakehouseid', 'lakehouse_id_y', 'workspace_id_b'),
	    ('Prod', 'sqlstring', 'mywarehousez.datawarehouse.fabric.microsoft.com', 'workspace_id_c'),
	    ('Prod', 'warehouseid', 'warehouse_id_z', 'workspace_id_c'),
	    ('Prod', 'lakehouseid', 'lakehouse_id_z', 'workspace_id_c');
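
Before wiring this into the pipeline, it's worth a quick sanity check that each environment resolves cleanly. A query along these lines (the workspace ID below is a placeholder - use one of the values you inserted above) should return exactly one row per variable for that environment:

	SELECT [Environment], [Variable], [Value]
	FROM [dbo].[environmentvariables]
	WHERE [workspaceid] = 'workspace_id_a' -- replace with the workspace ID you copied earlier
	ORDER BY [Variable]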

Adjust the Pipeline

Now that things are set up, there are some additions to make to the transformation pipeline, so navigate to your pipeline (starting point below).

[image]

Start by adding the relevant variables that we will populate from the environment variables. Here I've labelled them as 'sqlstring', 'warehouseid' and 'lakehouseid':

[image]

Then add a lookup activity and configure it to connect to the sample warehouse. Tick 'Enter manually' for the table details and enter dbo.environmentvariables (or whatever you named the table in the SQL query above). Make sure to UNTICK the 'First row only' setting.

[image]

At this point, the activity would simply read the environment variables table. The next step is to create two activity streams, each filtering to a single entry in the table to collect and set one variable (the SQL string, then the warehouse ID). First, add a filter activity, connect the lookup activity (on success) to the filter activity, and configure it as below. Please note that the two quoted text entries need to match the name of the lookup step ('EnvironmentVariableLookup' for me) and the variable name you've given the SQL string in the original query ('sqlstring' for me):

  • Items - @activity('EnvironmentVariableLookup').output.value
  • Condition - @and(equals(item().workspaceid,pipeline().DataFactory), equals(item().Variable,'sqlstring'))

Next, add a 'set variable' activity, connect (on success) the filter activity to the set variable activity, and configure the settings as below:

  • Name - sqlstring
  • Value - @activity('Filter SQL String').output.value[0].Value

This takes the first (and only) item of the array returned by the filtered environment variables table, reads its 'Value' column, and assigns it to the sqlstring variable.
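
For anyone more comfortable thinking in SQL, each filter-and-set pair is conceptually doing the equivalent of the query below, just expressed with pipeline activities instead (this is only an illustration - the pipeline never actually issues this statement):

	SELECT [Value]
	FROM [dbo].[environmentvariables]
	WHERE [workspaceid] = '<current workspace id>' -- supplied by @pipeline().DataFactory at runtime
	  AND [Variable] = 'sqlstring'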

[image]

This needs to be repeated for the warehouse ID and lakehouse ID. You'll notice below that the only things that change are the filter condition used to select the right variable and the activity referenced in each set variable value.

Warehouse ID filter:

  • Items - @activity('EnvironmentVariableLookup').output.value
  • Condition - @and(equals(item().workspaceid,pipeline().DataFactory), equals(item().Variable,'warehouseid'))

Warehouse ID set variable:

  • Name - warehouseid
  • Value - @activity('Filter Warehouse ID').output.value[0].Value

Lakehouse ID filter:

  • Items - @activity('EnvironmentVariableLookup').output.value
  • Condition - @and(equals(item().workspaceid,pipeline().DataFactory), equals(item().Variable,'lakehouseid'))

Lakehouse ID set variable:

  • Name - lakehouseid
  • Value - @activity('Filter Lakehouse ID').output.value[0].Value

[image]

After connecting these to the stored procedure activity, we need to adjust the downstream steps to run off our defined variables. Under the stored procedure settings, select the connection dropdown and choose 'Use dynamic content'. Navigate to the variables section (you may need to select the three dots on the right-hand side), click warehouseid, then 'OK', and do the same for the SQL string. We then need to adjust two other inputs: first, click in the workspace ID field, select 'Use dynamic content', and under system variables select Workspace ID (or just copy and paste '@pipeline().DataFactory'). Finally, type in the stored procedure name.

[image]

Now we need to configure the For each inner activity source and destination as below:

[image]

At this point you can run the pipeline through, and if everything is configured properly it should succeed. The errors usually give a good pointer to where something has failed (for example, during testing I hit an error where a table didn't exist because I'd cleared down my workspace without re-running ingestion first), but you can also check the monitoring section, and the input/output of each task in the pipeline, to see where things are at.
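
One failure mode worth calling out: if a 'set variable' step errors because its filter returned an empty array, it usually means no row in the metadata table matches the workspace the pipeline is running in. A quick check like the one below (run against the sample warehouse) shows which workspace IDs the table actually knows about, so you can compare them against the run's @pipeline().DataFactory value:

	SELECT DISTINCT [Environment], [workspaceid]
	FROM [dbo].[environmentvariables]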

Deploy to Another Workspace

Finally, to realise the benefit of this effort, it's worth deploying through a deployment pipeline. You'll see the connection on the lookup activity is still pointing to the dev workspace, which is how I set this up, but otherwise you can just click run and immediately populate your test workspace, assuming the data is in place. If you followed part 1 of this blog, you will need to run the 'sourcepipeline' data pipeline first.

[images]

Considerations & Lessons Learned

  • Originally, I tried using the connection (warehouse / lakehouse) names as that's how they appear in the dropdown, but you need IDs here
  • For the lakehouse, be careful not to use the SQL analytics endpoint; the lakehouse and SQL endpoint IDs aren't the same
  • I would only follow this method where the input tables are static, or you're comfortable with either recreating the 'for each' copy assistant task when you need to update it or adding another one (which would be messy). Though the JSON can be edited, you need to map every source / sink column
  • Not truncating or dropping tables before writing to a warehouse will populate them with duplicate records (see the sketch after this list)
  • This was done for testing, but in practice I would only be utilising both the lakehouse and warehouse if the data was transformed between them, so this pattern definitely needs updating before any production use - it's just a proof of concept
  • I extended this by also adding a static parameter for the connection to the environment variables table. I would recommend doing so, but this blog felt a little on the long side already!
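
On the truncation point above, a minimal sketch of what I mean is below. The procedure name and the three tables are just the sample ones from part 1 of this blog - adjust them to your own destination tables - and the idea is simply to connect a stored procedure activity running something like this ahead of the copy task so re-runs don't append duplicates:

	CREATE PROC [dbo].[sp_truncate_tables]
	AS
	BEGIN
	    -- Empty the destination tables so a re-run doesn't double up the records
	    TRUNCATE TABLE [dbo].[fact_sale]
	    TRUNCATE TABLE [dbo].[nyctlc]
	    TRUNCATE TABLE [dbo].[diabetes]
	END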

Metadata Driven Fabric Pipelines (1 of 2) - Setup

Intro & Context

Over the last couple of months I've been working on a few projects where the question of choosing the right components of Fabric to combine has come up. Typically, my personal preference is pro-code first, but as well as wanting to realise some of the benefits of the pre-built connectors in Fabric for extract and load processes, many users would often prefer not to be in Python or PySpark notebooks all day! The addition of CI/CD support for Gen 2 Dataflows was a step in the right direction for making them an easy recommendation, but I've noticed issues with preview versions either not being able to be invoked via pipelines or not connecting to staging lakehouses post CI/CD processing. Unfortunately, only preview versions are viable through CI/CD, so I was exploring some other options and wanted to share how I've used metadata to drive pipelines and copy activities, as I've seen done for notebooks in the past.

It's worth noting that there are many end-to-end scenarios and examples online, and this isn't to say copy tasks are the way to go, but rather to provide some guidance on parameterising the pipeline itself.

Lastly, in order to enable people to re-create the content here, I'm splitting this into two parts. This blog works through the initial setup of assets and structure, with the following blog, which I will link to retrospectively, covering the dynamic content and parameters.

Pre-Reqs & Notes

The only real pre-requisite here is to have a Fabric capacity (trial or otherwise), and that's it! The asset structure includes: a Sample Warehouse, a Sample Lakehouse, a few stored procedures, and sample data from the diabetes, retail (fact_sale), and nyctaxi databases available via the copy assistant. For context, the downstream processes combine a Lakehouse and Warehouse purely because I was conducting some testing for a customer where the staging option when writing direct to a warehouse from on-premises mandated external storage (ADLS Gen2), hence we wrote to a staging lakehouse first, and we also required SQL functionality not currently supported for lakehouse SQL endpoints. In production, I would consider doing this differently as you are running two jobs rather than one, and storing two copies of the data. But for this purpose, given the jobs were only a few minutes and we didn't have capacity challenges, the constraints weren't of concern.

Creating the data assets

I'm briefly going to talk through the assets I created and how they could be replicated, but it's worth noting that this is a simplified version, as replicating my exact end-state job isn't required. First, create the blank workspaces - a Dev, Test, and Prod workspace with any relevant prefixes. Within Dev, add a new lakehouse and warehouse (I've titled these "Sample Warehouse" and "SampleLakehouse"). The ETL processes here are simple copy tasks with no transformation, as this was more to prove the concept of metadata driven pipelines.

Once provisioned, I added a source data pipeline (mine is on GitHub here). I created this by adding copy tasks for 3 sample data sets, because I wanted something that looked more like a real scenario, moving multiple tables at once.

Create a new data pipeline (new item, data pipeline), then add sample data. I first did this through a copy activity for sample data from the NYC Taxi data set, loaded to the sample lakehouse:

[images]

To be able to manage this across a number of tables at once, I created copy activities to ingest the diabetes, NYC Taxi, and retail (fact_sale) data sets. Ultimately, this looks something like the below:

[image]

The next step was to create a pipeline to move this data from the lakehouse to the warehouse. This started by creating another pipeline (named Transformed Pipeline), then working through a copy assistant task and adding tables as new, but I noticed things getting a little funny when running this multiple times, and record counts didn't add up. The initial solution for testing was to combine a copy task creating the data as new with a truncate stored procedure, then create as existing. Instead, I have created a stored procedure to create the empty warehouse tables. Be sure to run the drop table and create table code once via the warehouse (i.e. the code below, excluding the CREATE PROC ... AS wrapper; alternatively, create the procedure then run it via SQL using EXEC dbo.sp_create_table_schemas), then create the stored procedure before creating a new pipeline:

	CREATE PROC [dbo].[sp_create_table_schemas]
	AS
	BEGIN
	DROP TABLE IF EXISTS [dbo].[fact_sale]
	DROP TABLE IF EXISTS [dbo].[nyctlc]
	DROP TABLE IF EXISTS [dbo].[diabetes]
	CREATE TABLE [dbo].[fact_sale]
	(
	    [SaleKey] [bigint] NULL,
	    [CityKey] [int] NULL,
	    [CustomerKey] [int] NULL,
	    [BillToCustomerKey] [int] NULL,
	    [StockItemKey] [int] NULL,
	    [InvoiceDateKey] [datetime2](6) NULL,
	    [DeliveryDateKey] [datetime2](6) NULL,
	    [SalespersonKey] [int] NULL,
	    [WWIInvoiceID] [int] NULL,
	    [Description] [varchar](8000) NULL,
	    [Package] [varchar](8000) NULL,
	    [Quantity] [int] NULL,
	    [UnitPrice] [decimal](18,2) NULL,
	    [TaxRate] [decimal](18,3) NULL,
	    [TotalExcludingTax] [decimal](18,2) NULL,
	    [TaxAmount] [decimal](18,2) NULL,
	    [Profit] [decimal](18,2) NULL,
	    [TotalIncludingTax] [decimal](18,2) NULL,
	    [TotalDryItems] [int] NULL,
	    [TotalChillerItems] [int] NULL,
	    [LineageKey] [int] NULL
	)
	CREATE TABLE [Sample Warehouse].[dbo].[nyctlc]
	(
	    [vendorID] [int] NULL,
	    [lpepPickupDatetime] [datetime2](6) NULL,
	    [lpepDropoffDatetime] [datetime2](6) NULL,
	    [passengerCount] [int] NULL,
	    [tripDistance] [float] NULL,
	    [puLocationId] [varchar](8000) NULL,
	    [doLocationId] [varchar](8000) NULL,
	    [pickupLongitude] [float] NULL,
	    [pickupLatitude] [float] NULL,
	    [dropoffLongitude] [float] NULL,
	    [dropoffLatitude] [float] NULL,
	    [rateCodeID] [int] NULL,
	    [storeAndFwdFlag] [varchar](8000) NULL,
	    [paymentType] [int] NULL,
	    [fareAmount] [float] NULL,
	    [extra] [float] NULL,
	    [mtaTax] [float] NULL,
	    [improvementSurcharge] [varchar](8000) NULL,
	    [tipAmount] [float] NULL,
	    [tollsAmount] [float] NULL,
	    [ehailFee] [float] NULL,
	    [totalAmount] [float] NULL,
	    [tripType] [int] NULL
	)
	CREATE TABLE [Sample Warehouse].[dbo].[diabetes]
	(
	    [AGE] [bigint] NULL,
	    [SEX] [bigint] NULL,
	    [BMI] [float] NULL,
	    [BP] [float] NULL,
	    [S1] [bigint] NULL,
	    [S2] [float] NULL,
	    [S3] [float] NULL,
	    [S4] [float] NULL,
	    [S5] [float] NULL,
	    [S6] [bigint] NULL,
	    [Y] [bigint] NULL
	)
END
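
For reference, once the procedure above exists, resetting the empty destination tables ahead of a run (whether from a stored procedure activity in the pipeline or manually in the SQL editor) is just a one-liner:

	EXEC [dbo].[sp_create_table_schemas];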

Then, I added a copy task through the copy assistant for populating tables in the warehouse. Select the sample lakehouse and the 3 relevant tables as the data source, and the sample warehouse as the destination. For each table, select "Load to existing table" (if this doesn't show, make sure you've run the above stored procedure) and click next. Then enable staging with workspace as the data store type. I unticked the 'Start data transfer immediately' box, then clicked okay - this is just personal preference, but I like giving things a once-over before saving and running myself. Make sure to add a stored procedure activity and connect it (on success) to your copy task.

[images]

Deployment Pipeline Setup

The last pre-requisite before getting into the metadata driving this is setting up the deployment pipeline. Click the workspaces ribbon (left-hand side), then new deployment pipeline. I've left the deployment stages as-is with development, test, and production, but you can tailor them if needed. Once created, attach the relevant workspaces to each stage and make sure to press the green tick to assign them.

[image]

You'll see some additional items in mine; these haven't been created just yet but are part of the next bit of development. For now, just make sure to select the sample lakehouse and warehouse, and both the source and transformed pipelines, then deploy to the test workspace. This might take a few minutes on the first deployment, but it usually gets quicker for subsequent deployments when the resources you're changing (e.g. the warehouse) already exist in the target workspace.

[image]

I'm not going to run anything in the test workspace; I just wanted to work through this to show what it actually looks like without any metadata driving the deployment. If you open the transformed pipeline in the test workspace, you'll see that the destination is still set to the warehouse in the development workspace.

[image]

That brings part 1 to a close. In part 2 of this blog I will walk through the changes necessary to parameterise inputs so that tasks are automatically associated with the current workspace's assets.

Fabric In Review: Thoughts 18 Months Later

Context & Intro

In the ever-evolving landscape of data analytics, Microsoft Fabric has emerged as a significant player, aiming to unify Azure Data Services into a single SaaS offering. This blog post reviews Fabric's journey, its unique selling points, areas for growth, and reflections on my experience. I’ve tried to keep thoughts relatively high level, purely because delving into all the good and bad things at depth would create much larger lists.

[image, source: https://www.microsoft.com/en-us/education/blog/2025/01/microsoft-fabric-the-data-platform-for-the-ai-era/]

Fabric’s Journey

Fabric's journey began in 2023 when, after a brief private preview, it was enabled in public preview for all tenants (May 2023), before its release to general availability in November 2023.

Microsoft also introduced the DP-600 (Fabric Analytics Engineer) and DP-700 (Fabric Data Engineer) certifications, with DP-600 becoming Microsoft’s fastest-growing certification.

As of October 2024, Fabric touted over 14,000 customers (paying / non-trial) and over 17,000 individuals DP-600 certified. There has also been significant investment in product development, including Copilot integration, Real Time Intelligence, SQL Databases, industry-specific solutions (e.g. for manufacturing, healthcare, and ESG), and much-needed improvements to functionality such as the eventual addition of Gen2 Dataflows CI/CD integration.

Fabric’s Unique Selling Points

[image]
  • OneLake: provides a single data foundation with one copy of data accessible across all services. This also enables things like the OneLake Data Catalog, Direct Lake Semantic Models, and the OneLake explorer for easily accessing Fabric assets you have access to
  • AI Integration: Fabric’s ability to easily access AI experiences (model deployment, AI Skills) and Copilot to accelerate insights and development. For example, this can simplify and streamline the process of deploying an ML model and surfacing the output via Power BI (as captured by the predicted churn rate in the example report above). This really applies to the broad range of services (or “experiences”) but I think the combination of data engineering, analytics, and science through one interface is excellent
  • Minimising friction / barrier to entry: Fabric offers a familiar and intuitive experience for users, especially those coming from Power BI. As a Microsoft SaaS solution, being built around M365 makes user access straightforward, the range of pre-built connectors and other functionality, and being able to combine pro and low code solutions as well as run a variety of workloads from one place can all contribute to a reduced time to value. I think it’s noticeable when you work through your first few development experiences just how quickly you can get something prototyped - for those who are new, I highly recommend both Microsoft’s end to end scenarios and Alex Power’s Data Factory In A Day repo.
  • Governance and Security: It felt like a cop-out titling something “not just ETL and analytics” so I’ve tried to be more concise - many platforms require integration with other toolsets for item level lineage and orchestration, or cataloging data. Fabric enables governance functionality out of the box including monitoring through the capacity metrics app, workspace and item lineage, and row and object level security

Areas for Growth

[image, source: https://ideas.fabric.microsoft.com/ideas/search-ideas/?q=Workload%20management]
  • Feature Lifecycle: While there’s been an incredible amount of progress made in terms of features released, and the Fabric release plan can be a help, there is still a remarkable number of features that are “preview” with no real idea what that means in terms of maturity (varies feature to feature). There are even some examples where there are two versions of the same feature such as “invoke pipeline” activity having a legacy and preview option, or Dataflow Gen2 having a GA version and preview (supporting CI/CD) version. For those working with new Fabric items, you will run into hiccups occasionally, but this is also tricky for new users to navigate. It would be great to see this improve, especially around core workloads
  • Capacity Management: When you purchase a capacity and attach workspaces to it, all workloads have been treated equally over the last 18 months. This causes challenges with a lack of functionality for queueing, workload management, and capacity consumption that need addressing. Currently, the main way to do this is with separate capacities
  • Cost Estimation: Estimating costs for capacity planning remains difficult. There is a calculator available in preview that I’ve had good experiences with, but for new or prospective customers, this is still a challenge
  • Solution or Deployment Best Practices: First, I appreciate that it’s not a vendor’s responsibility to define best practices, but I’ve found that the documentation can vary in terms of value. Some great articles are out there and easy to find, like learning modules, demos, and specific development guidance (e.g. Python), as well as phenomenal community resources, but high-level guidance supporting customer deployment, like capacity management best practices, migration from SSAS multi-dimensional cubes, and moving from Power BI (e.g. Gen2 dataflow performance), isn’t easy to navigate, and even at a lower level you can see examples such as MS documents like this one leading users to assume that having V-Order enabled all the time would be the right thing to do. Sometimes this can lead to workarounds for specific challenges, custom solutions, or confusion when making core decisions like how to manage CI/CD

My Experience

The above is a great snapshot of some individual experiences and the current online discourse around Fabric being truly “production ready”, but over the last number of months I’ve struggled with some sweeping statements, both good and bad, around things Fabric can or cannot be used for. Of course, with data platforms, there are tradeoffs no matter what tooling you choose, and scenarios where certain tools are more closely aligned to requirements, but I wanted to add some generalised customer examples and patterns I’ve seen and why Fabric was viewed to be a good fit. There are some more examples I could add, but at that point it would be a blog on its own, so here are 4 examples, all of which I’ve seen first-hand:

  • Existing Microsoft (M365 & Azure) customers: I’ve separated this from “existing Power BI customers” for two reasons; first, I’ve seen organisations utilising Azure but not Power BI for their core BI workloads (e.g. where Qlik or Tableau have been used historically), and, second, the drivers for implementation in this example differ compared to extending Power BI functionality (e.g. strategic modernisation, migrating from a multi-vendor environment, or adding Fabric as the single pane of glass across multiple data or cloud providers). For larger customers already in the Microsoft ecosystem, the tight integration with M365, ability to reuse existing infrastructure config (networking, landing zone), and being able to work with an existing vendor have all been seen as positive, as are the ability to utilise Copilot and shortcuts supporting landing data cross-cloud. A couple of lessons learned are that, especially for the managed self-service pattern, it’s worthwhile considering implementing both Fabric-capacity-backed and Pro workspaces, and utilising pro-code experiences where possible (appreciating availability of skills and personal preferences) will offer benefits around capacity consumption
  • Small(er) Businesses - I know organisation size is all relative, but here I specifically mean where data volumes aren’t in the Terabytes and the data team is usually made up of people wearing multiple hats (engineering, analytics, architecture). SQLGene has posted their thoughts here, but my short summary is that Fabric does offer a very easy way to just jump in, and can be a one stop shop for data workloads, offering flexibility in generally having multiple ways to get the job done while minimising the overall solution components to manage
  • Power BI customers looking to “do more,” without well-established ETL capabilities, OR those migrating on-premises SQL (and SSIS) workloads - I’ve worked with a number of Power BI customers where there is already a good understanding of workspace admin and governance, the web UI, and Microsoft SQL. This can make Fabric feel like an extension of something that’s already known and improve adoption speed. It’s common for many of these customers to also be looking to move some on-premises SQL workloads to the cloud, and being able to re-use data gateway connections is a plus. Some customers I’ve worked with have been able to migrate historic ETL processes by utilising copy jobs or dataflows for extracting and loading to Fabric before migrating SSIS stored procedures and creating semantic models rapidly (hours / days)
  • Greenfield scenarios - I think Fabric has a high value proposition in a number of scenarios, but I think it’s especially easy to understand this when you’re starting net-new. Alternate tech stacks may do specific elements really well, but you would need to stitch together more individual components, billing models, authentication methods etc. to meet engineering, analytics, science, architecture, and governance requirements. A good example was a project whereby there was a known requirement for analytics / reporting, as well as a data warehouse as well as minimal resources to support from central IT so being able to utilise a SaaS product that could meet functionality as well minimise infrastructure and operational support from IT was a big plus

Conclusion

Microsoft Fabric has been a fascinating addition to both Microsoft services and the data & AI technical landscape, but addressing some areas outlined above and core challenges in line with customer feedback will be crucial for its continued success. I would also highlight that the amount of development over the last 18 months and pace of updates has been impressive, so all of the above has to be taken in context of a technology offering that is still early in its product lifecycle and has massive further potential - in my opinion, Power BI was less impressive at a similar point.

An Introduction to CI/CD on Microsoft Fabric

Context & Intro

In December 2024, I gave a lightning talk at the Dublin Fabric User Group providing a brief introduction to Continuous Integration and Continuous Deployment (CI/CD) on Microsoft Fabric. In researching the topic as well as various customer conversations, I’ve noticed that while the Microsoft documentation is quite detailed, especially for implementation steps, there wasn’t a single overview of the CI/CD options available.

CI/CD are fundamental practices in modern software development, designed to streamline the process of integrating code changes and deploying applications. By incorporating CI/CD, development teams can enhance efficiency, reduce errors, and improve the overall quality of software products. In this blog post, we explore working with branches and CI/CD capabilities in Microsoft Fabric.

Before getting into it, I’ll just mention that this is intended to be a very brief introduction to those unfamiliar with CI/CD on Fabric. Other blog posts do a great job of describing practical scenarios and technical details - here my intention is to set a foundation in about 1,000 words.

What and Why

[image: Azure DevOps Pipeline Architecture]

You will often see CI/CD referenced as a single concept, ultimately resulting in automated deployment of software products or applications, but each element is distinct. Continuous Integration (CI) and Continuous Deployment (CD) are a set of practices in software development aimed at ensuring that code changes are automatically tested and deployed to production. CI involves developers frequently integrating their code changes into a central repository, where automated builds and tests are run. CD extends this by automatically releasing validated changes to production, allowing for frequent and reliable updates. There are plenty of resources describing CI/CD (e.g. this from Atlassian) and the benefits. Without going into detail, some of these include: early bug detection, improved collaboration, faster development, easier rollback, and reduced manual effort.

Working with Branches

[image]

In Fabric, workspaces are your governance and development boundary, and your central location for storing source code (Azure DevOps, GitHub) is connected through Git integration. When you start looking to use Microsoft Fabric Git integration at scale, you are going to be working with both branches and pull requests. Branches are an interesting concept in Git - a branch looks and feels like an entirely new copy of your work. They allow you to work with your code in a new area within the same Git repository, e.g. through the use of a feature branch, which is a branch you work with short-term for a specific feature. Once finished, you apply your changes back to the branch you originally branched from.

Typically, you make changes in one branch and then merge those changes into another. Branches should not be confused with deployment environments, where you deploy to various environments to perform various levels of testing before deploying to a production environment. In reality, Git does not make a full copy of your work in a new branch; instead, branches and changes are managed by Git internally. Anyway, there are a number of different ways you can manage branches, including using client tools or workspaces (described here).

[image]

CI/CD Options in Fabric

For CI functionality, Fabric natively integrates with either Azure DevOps or GitHub.

[image]

As for CD, Microsoft has some guidance, and there are intricacies in how these are applied, but there are a few options here too: deployment pipelines, Git integration, or REST APIs (CRUD).

[image: Sample Microsoft deployment patterns from https://learn.microsoft.com/en-us/fabric/cicd/manage-deployment]

From what I’ve seen, the direction of travel is a preference for the bottom option, utilising deployment pipelines, but due to challenges with functionality until recently, most people using CI/CD on Fabric in anger are using some customized version of git integration - some examples I’ve seen show the most complex environments using native Git API integration, but this seems to be less common.

In reality, this all depends on your branching model (Trunk based, Gitflow) as well as requirements such as whether you want to trigger deployment from Fabric or Git.

State of Play

EDIT: In February 2025, Mathias Thierback shared an updated version of his CI/CD support matrix that supersedes the below. The remaining challenges and lessons learned are mostly still relevant, but please do visit here for the update.

First of all, shout out to Mathias Thierback who not only created a version of this including Terraform, but has been presenting on all things Fabric CI/CD over the last month. Additionally, I haven’t considered Airflow Jobs. As for those marked proceed with caution, I’ve explained why in the final section of the blog.

[image]

Lessons Learned & What’s Next

I’ve seen a number of challenges with CI/CD on Fabric. Most issues can be manoeuvred around, utilising metadata driven pipelines for example, but they’re worth mentioning. For example:

  • Dataflows do not support deployment pipeline rules 
  • Notebooks don’t support parameters 
  • Where default lakehouses change across workspaces, the attachment doesn’t propagate

As for other lessons learned:

  • The REST API offers most flexibility, but is the most complex to implement, whereas deployment pipelines offer the least flexibility but are the most straightforward to implement
  • While there is support across a range of items, in my experience, notebooks really are the way to go for version control and CI/CD native integration to minimise potential issues
  • REST API implementation is only possible for all items through users rather than Service Principals. This is being improved upon over the last couple of months but isolated to Lakehouses and Notebooks at the time of writing
  • Unsurprisingly, good general CI/CD practices apply (e.g. early & often releases)
  • Some limitations exist (e.g. CI/CD or workspace variables, pipeline parameters being passed to DFG2 or pipelines invoking dataflows not supported)
  • Some specific guidance around ADO and GH (e.g. regions and commit size) is available on MS Learn. ADO is recommended
  • When you have a dataflow that contains semantic models that are configured with incremental refresh, the refresh policy isn't copied or overwritten during deployment. After deploying a dataflow that includes a semantic model with incremental refresh to a stage that doesn't include this dataflow, if you have a refresh policy you'll need to reconfigure it in the target stage
  • This blog hasn’t touched upon two important areas for implementation including naming conventions (this post from Kevin Chant is great) and workspace structure (many posts online, but I’ve also written this blog)

My Top 5 Power BI Updates of 2024

Intro & Context

The January announcement of Power BI introducing Tabular Model Definition Language (TMDL) view reminded me that I had a written but unpublished blog about 2024 Power BI updates. Just a couple of weeks into 2025, it's a good time to look back at the remarkable enhancements introduced in Power BI over the past twelve months. Power BI has consistently evolved over the last 8 years or so, offering users more powerful and intuitive tools to transform data into insightful visual stories. This year has been no exception, with a host of features varying from simple quality of life updates to fundamental improvements to the development workflow. The impressive growth in both functionality and user base underscores Power BI's position as a leading tool in the realm of business intelligence. In this blog, I will highlight my top five favourite Power BI updates from 2024 that have significantly enhanced the Power BI developer experience.

Whether mentioned in seriousness or in jest, there was a lot of noise around the announcement of dark mode being available in Power BI but (spoiler) it hasn’t made my list.

[image]

My Top 5 Updates

[image]
  1. Core visuals vision board - I’ve included the core visuals board on this list for 2 reasons; first, it’s great to see Microsoft engaging the community and customers for input around upcoming Power BI developments and enabling not just an easy way to digest development ideas but vote on them and, second, I often think it’s difficult to see really good examples of well laid out and aesthetically pleasing Power BI dashboards which this definitely is.
  2. Copilot - I’m cheating a little here in combining some features, but Copilot functionalities including enhanced report creation, summaries in subscriptions, Copilot for mobile apps, and Copilot measure description all feel like they’ve massively improved the experience of using Copilot in Power BI. If I had to only call out one here, it’s a tough call between enhanced report creation and summaries in subscriptions, but I think I would go for enhanced report creation as I think it’s improved the developer experience.
  3. Live edit of Direct Lake models in PBI desktop - I appreciate this one blurs the line between a Power BI update and Fabric in that you need to have published a semantic model to a Fabric capacity, but this felt like a really nice quality of life improvement for anyone who prefers to work with Power BI Desktop and is already publishing or managing Direct Lake semantic models.
  4. Tabular Model Definition Language (TMDL) made generally available - I mentioned the 2025 update including TMDL view, so it’s likely some readers might have seen this one coming, but the GA release of TMDL was a welcomed addition. As well as being a pre-requisite for the TMDL view, the improvement here in source control for semantic models felt important.
  5. Paginated Report Authoring via the web GUI - until the preview announcement of the new paginated reports authoring experience, the method most would use for creating paginated reports was through Power BI Report Builder. Though I personally prefer Power BI Desktop to the web development experience, I think anything to simplify the tooling and development experience is, in general, a positive step.
[image]

Conclusion

I easily could have covered a top 10 here with some special mentions (e.g. DAX Query view for web), but I wanted to keep this list reasonably concise. One thing I will call out as I reflect on 12 months of updates is that there were so many positive Power BI developments, which was great to see, even if it’s hard to always stay on top of what’s new.

Microsoft Ignite Debrief for Data & AI

Intro & Context

Microsoft Ignite 2024, Microsoft’s flagship annual event, took place a few weeks ago, at the end of November. Having had time to digest the various sessions and announcements, I wanted to share my highlights and key takeaways related to data and AI. I won’t be covering all announcements as I’m keen to share some personal views on 5 or so areas and a list of what I consider to be the more interesting things announced but, for those interested, the Microsoft book of news is the one-stop-shop for all announcements. Fabric-specific announcements can be seen here.

It’s worth mentioning that most of the announcements below, other than those around general availability, aren’t usually available immediately. Some minor updates are, but otherwise you can expect most features to arrive in 2025.

My Key Takeaways

[image: Fabric Databases enabling PBI writeback]
  • SQL Databases in Microsoft Fabric - SQL Server workloads made available through Fabric, extending analytical capability to include a Fabric data store for transactional workloads. While this is interesting on its own merit in the sense that it offers a way for organisations to consolidate transactional and analytical databases, the real reason I’ve included this as one of my favourites was down to the additional announcement that this will enable functionality in 2025 for Power BI writeback where users can input data through Power BI and write to a backend database - awesome. 
  • Purview Investment - there is a whole section in the book of news on Purview that’s worth digesting - a couple of specific call outs include renaming Purview Data Catalog to Purview Unified Catalog, extending data quality support for Fabric, Databricks, Snowflake and more, extending DLP capability, and Microsoft Purview Analytics in OneLake. In my opinion, though, the key takeaway here was that Microsoft are clearly investing heavily in Purview. I would argue that’s somewhat overdue, but it will be an interesting space to keep a close eye on over the next 6-18 months
  • Azure AI - Azure AI Foundry is a rebranding and extension of functionality for the existing Azure AI Studio; AI Foundry is also going to provide prebuilt templates and an AI Agent service. Other AI announcements include AI reports covering impact assessments with project details such as model cards, model versions, content safety filter configurations and evaluation metrics, an AI scenario in the Cloud Adoption Framework, and an Azure Well-Architected AI scenario.
  • Copilot - I’m reasonably confident that Copilot will be part of most, if not all, Microsoft events for the foreseeable future. Broader announcements included things like Agents in SharePoint and Project Manager Agent (that I’m interested to try), but I think my takeaway from Ignite was that there was a focus on customising, integrating, and extending purpose built Copilots. I felt this was slightly more targeted than the usual Copilot off the shelf type of conversation. From the Data and AI angle, the integration between Copilot Studio and AI foundry is interesting as it unlocks the capability to utilise AI Search as well as bringing your own models (or using industry specific models). Since Ignite wrapped up, Microsoft also announced GitHub Copilot free for VSCode.
  • OneLake Catalog - “a complete solution to explore, manage, and govern your entire Fabric data estate.“ I see this as bridging the gap between existing management functionality (e.g. lineage) and what sits in Purview, and I’m sure that many users who use OneLake Catalog extensively will be attracted to further data quality and security functionality of Purview. For me, two key improvements OneLake Catalog brings are the ability to see metadata for all OneLake assets - not just lakehouses but pipelines, notebooks, etc. - such as tags, lineage, and the ability to view previous runs, and the new “govern” tab that gives a view of your whole data estate including sensitivity label coverage
[image]
  • General Availability of sustainability data solutions in Microsoft Fabric - it’s always refreshing to see some Fabric functionality get the GA label, but alongside that, I think the current ESG solutions through Microsoft are either the Sustainability Manager ($4k per month), CSRD reporting templates through Purview (limited scope), or building your own. This could be a very interesting way to meet ESG data and reporting requirements through Microsoft Fabric. There was a spokesperson from an existing customer who mentioned using this capability to implement new use cases in short sprints of two to six weeks - I can’t wait to get my hands on it
  • Fabric AutoML (or AI Functions in Fabric) - I got a preview of this during a couple of sessions at FabCon Europe in September, but Ignite marks the preview availability of auto ML, enabling implementations of various algorithms and AI functions in just one or two lines of code. This also ties in to a study Microsoft funded that, unsurprisingly, indicated the biggest challenge with AI adoption is skills. It’s always a fine line in balancing low-code capability and maintaining the quality of deeply technical products, but I personally welcome the AutoML functionality to support some rapid prototyping work

Other Announcements

  • Tenant switcher - for many people, this will seem inconsequential, but as someone that works across different tenancies in the nature of my role, this is such a quality of life update to be able to switch tenant without having to log out and log in
  • The general availability of external data sharing allows you to directly share OneLake tables and folders with other Fabric tenants in an easy, quick, and secure manner
  • Fabric Billing Update - organizations with multiple capacities can now direct Copilot in Fabric consumption and billing to a specific capacity, no matter where the Copilot in Fabric usage actually takes place. It’s worth noting this means you would still need an F64+ capacity, but Copilot could be triggered from an F4 for example. In an ideal world, this would be further improved so having 2xF32 capacities (i.e. anything totalling F64+) would also be able to utilise copilot, but it’s good progress
  • Fabric Surge Protection - helps protect capacities from unexpected surges in background workload consumption. Admins can use surge protection to set a limit on background activity consumption, which will prevent background jobs from starting when reached
  • SQL Server 2025 - now available in preview
  • Fabric General availability announcements - the most notable is probably Fabric Real Time Intelligence, but this also includes sustainability data solutions mentioned above, the API for GraphQL, Azure SQL DB Mirroring, the Workload Development Kit, and external data sharing. All great to see
  • Fabric Workspace monitoring - detailed diagnostic logs for workspaces to troubleshoot performance issues, capacity performance, and data downtime
  • Fabric integration to Esri ArcGIS - preview of integration with Esri ArcGIS for advanced spatial analytics
  • Fabric Open Mirroring - a feature that allows any application or data provider to write change data directly into a mirrored database within Fabric, released to preview
  • Power BI core visuals - okay, this isn’t quite available yet, but it’s interesting to see that Microsoft are sharing a bit more about upcoming visualisations for Power BI, and also that it looks so aesthetically pleasing. The core visuals vision board is available here
[image]

Minimising Spark Startup Duration in Microsoft Fabric

Context

Often with cloud services, consumption equals cost. Microsoft Fabric isn’t much different, though there is some nuance with the billing model: in some cases, increasing consumption by 20-30% could double the cost due to the need to move to a bigger capacity, and in other cases you might have the headroom and see no cost increase. I’m keen not to get into the depths of SKUs and CU consumption here but, at the most basic level for Spark notebooks, time / duration has a direct correlation with cost, and it generally makes sense to look for opportunities to minimise CU consumption.

In terms of where this becomes relevant in relation to spark startup times in Fabric, it’s worth noting that this duration counts as CU(s) consumption for scheduled notebooks, and also increases the duration of each pipeline. 

I’ll start by sharing a couple of screenshots with session start times where high concurrency sessions and starter pools (details below) aren’t used. After running each half a dozen times, the start times were almost always over 2 minutes and up to 7 minutes with an average of around 3 minutes. 

Environments

[image: Custom pool with small node size]

Before jumping in to a couple of recommendations and examples, I also wanted to comment briefly on Fabric environments. Environments can be used to consolidate and standardise hardware and software settings, and the Microsoft documentation has more information on this. Up until running a series of tests for this blog I had mainly used environments for deploying custom Python packages, but you’ll see a custom environment in some screenshots below (for the small node size) where I adapted environment settings to quickly change Spark compute resources and apply them consistently across sessions, without changing the workspace default, for testing high concurrency sessions with specific compute resources.

[image: Custom pool with medium node size]

Basic Testing

Having run sessions without utilising high concurrency or starter pools for a range of environments, the results are outlined below:

  • Small node size, memory optimised, 1-10 nodes - 2 minutes 42 seconds
  • Medium node size - this one was interesting. If you create a custom pool with similar settings to the default starter pool, startup can be around 10 seconds, but minor adjustments to the pool, namely adjusting number of drivers or executors, or memory from 56GB to 28GB, saw this jump to 7 minutes 7 seconds
  • Large node size, memory optimised, 1-6 nodes - 2 minutes 17 seconds
[image: Small node size (demo environment details in first image above)]
[image: Medium node size, custom environment settings can be seen in the environment section]
[image: Large node size]

High Concurrency 

[image: Connecting to high concurrency sessions]

High concurrency mode in Fabric enables users to share Spark sessions, for up to 5 concurrent sessions. Though there are some considerations, namely around the requirement for utilising the same Spark compute properties, the Microsoft documentation suggests a 36 times faster session start for custom pools. In my experience, the actual start time was even quicker than suggested, almost instantaneous, compared to around 3 minutes, and across 3 tests this ranged from 55 times faster to almost 90. That said, it’s also worth noting that the first high concurrency session start was often slightly longer than starting a standard session, more like 3 minutes than 2.5.

[image: Startup of the first high concurrency session]

In all node size variations, the startup times for further high concurrency sessions were either 2 or 3 seconds. The images below were taken for the demo environment outlined above (small node size).

[image: Startup for attaching the second high concurrency session]

Starter Pools

[image]

Fabric Starter Pools are always-on spark clusters that are ready to use with almost no startup time. You can still configure starter pools for autoscaling and dynamic allocation, but node family and size are locked to medium and memory optimised. In my experience, startup time was anywhere from 3 to 8 seconds.

[image: Startup time utilising starter pools as the workspace default]

Closing Thoughts

In short, where you’re comfortable with existing configurations and consumption, or no custom pools are required, look to utilise starter pools. Where custom pools are required due to tailoring requirements around node size or family, and multiple developers are working in parallel, aim to use high concurrency sessions.

Power BI Pricing Update

Context

Yesterday, November 12th, Microsoft announced changes to Power BI licensing that represent a 20-40% (license dependent) increase per user license. It's worth mentioning that this is the first increase since July 2015, more than 9 years ago. From April 1st 2025, Pro licensing will increase from $10 to $14 per user per month and Premium Per User (PPU) licensing from $20 to $24 per user per month.

More details are covered in the Microsoft blog post.

What’s not affected

Though this is naturally going to affect a large number of users, I think it's most likely to impact small and medium sized corporates and those not currently using, or planning to use, Fabric. This is because the changes are specific to per user licensing rather than licensing under enterprise agreements, so some elements remain unaffected. Not changing:

  • Fabric F SKU pricing
  • Embedded pricing (under EM and F SKUs)
  • E5 licensing - this still includes a Power BI Pro license with no increase in cost
  • PPU add on licensing for E5
  • Non-profit licensing (currently priced lower than enterprise or personal licenses)

Scenarios & suggestions 

  • Excluded / unaffected: if you're currently E5 licensed, licensed through a non-profit, or using the viewer licensing included with an F64 Fabric capacity, there's nothing to worry about
  • Utilising Fabric, but below F64: the new licensing shifts the tipping point at which the jump to F64 makes sense for viewer licensing to be included in the capacity cost. With an F32 capacity on reserved pricing and Pro licensed viewers, the crossover was previously around 250 users; now it's more like 180. The crossover point is the difference in cost from your current SKU to F64 divided by the per user per month license cost (e.g. 268 for F16) - see the sketch after this list
  • Utilising Fabric but not embedding: more often than not, Power BI users are licensed for accessing reports via the Power BI service. However, with Fabric F SKUs, you can make the most of embedding for your organisation and organisational apps to facilitate consumption of reports without needing to access the Power BI service, reducing the potential licensing requirements for viewers
  • Utilising Power BI but not yet Fabric: both of the above points are still worth considering. In fact, I think the lower SKUs (F2 and F4) could pay for themselves if you're able to use embedding for your organisation instead of Power BI licenses for report viewers, for as few as 11 users (F2, reserved pricing). This could be a great reason to consider Fabric if you're not already
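
To make that crossover calculation concrete, here's a minimal Python sketch. The capacity prices are illustrative reserved monthly figures; substitute your own regional or agreement pricing:

    # Illustrative reserved monthly capacity costs in USD - swap in your own pricing.
    f64_monthly = 5003
    f32_monthly = 2502
    f16_monthly = 1251

    pro_old = 10   # Pro per user per month before the change
    pro_new = 14   # Pro per user per month from April 1st 2025

    def crossover_users(current_sku_cost, per_user_cost, f64_cost=f64_monthly):
        """Viewer count at which stepping up to F64 costs the same as per user licensing."""
        return (f64_cost - current_sku_cost) / per_user_cost

    print(f"F32 to F64, old pricing: ~{crossover_users(f32_monthly, pro_old):.0f} users")   # ~250
    print(f"F32 to F64, new pricing: ~{crossover_users(f32_monthly, pro_new):.0f} users")   # ~180
    print(f"F16 to F64, new pricing: ~{crossover_users(f16_monthly, pro_new):.0f} users")   # ~268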

As for everyone else, unfortunately there isn't much of an option beyond preparing for the increased cost sooner rather than later and communicating it to decision makers. That said, I would hazard a guess that the cost of transitioning organisational reporting to another platform would likely outweigh any benefit, and given the history I would hope another increase is not likely for some time.

]]>
<![CDATA[Microsoft Fabric Data Engineer Associate (DP-700) Beta - My Experience & Tips]]>TLDR

DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric is a challenging exam that covers most aspects of data engineering on Microsoft Fabric. Personally, I consider it a tougher exam than DP-600 and I believe it could do with some rebalancing around examining more spark and pipeline / orchestration topics, but

]]>
https://blog.alistoops.com/microsoft-fabric-data-engineer-associate-dp-700-beta-my-experience-tips/67281b0b073d2a00011cd4a7Mon, 04 Nov 2024 01:07:21 GMTTLDRMicrosoft Fabric Data Engineer Associate (DP-700) Beta - My Experience & Tips

DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric is a challenging exam that covers most aspects of data engineering on Microsoft Fabric. Personally, I consider it a tougher exam than DP-600 and I believe it could do with some rebalancing around examining more spark and pipeline / orchestration topics, but all topics felt relevant and there wasn’t too much variation in question complexity with one or two exceptions. 

That said, it’s likely existing Fabric data engineers are more familiar with some experiences than others, so there is probably some learning that’s needed for most people - especially those with limited real time intelligence (or KQL) and error resolution experience.

I expect some of the balancing to be addressed for the more left-field questions when the exam goes live, based on feedback from those undertaking the beta. Even though DP-700 has a large overlap in high level topics with DP-600, I think the exam considers them from different angles or contexts and should align well with those looking to prove their data engineering skills with Fabric.

What is a beta exam?

Exams in beta are assessed for performance and accuracy, with feedback on questions being gathered. The beta process is slightly different with AWS exams (described here) than Azure - for this beta exam there isn't a guaranteed discount (usually 50% with AWS), the beta period duration is not clearly defined ahead of time (an end date isn't currently published), and the results aren't released 5 days after taking the exam but around 10 days after the exam exits its beta period, so I don't yet have my results. Microsoft publish some information on beta exams here.

What is the Fabric DE certification and who is it aimed at? 

Aside from the obvious role alignment in the certification title, the Microsoft documentation describes the expected candidate for this exam as someone with “subject matter expertise with data loading patterns, data architectures, and orchestration processes” as well as the below measured skills:

  • Implement and manage an analytics solution (30–35%)
  • Ingest and transform data (30–35%)
  • Monitor and optimize an analytics solution (30–35%)

One thing I would call out around ingesting and transforming data is that the exam covers all mechanisms of doing so - notebooks, stored procedures, dataflows, KQL querysets - utilising Spark, SQL, and KQL.

Exam prep 

At the time of the beta exam being available, there wasn’t a formal Microsoft Learn training path. Beyond a few blogs from some Microsoft Data Platform MVPs, the only real collateral that exists is a collection of Learn modules. For those who undertook the DP-600 exam before November 15th 2024 (see some blogs about the updates here and here), this collection is mostly similar to the DP-600 learn modules. Additions include “Secure a Microsoft Fabric data warehouse”,  three Real Time Intelligence modules (getting started with RTI, using Real Time Eventstreams, and querying data in a KQL database) as well as “Getting started with Data Activator in Microsoft Fabric.” Beyond this, the only real preparation I could suggest is a reasonable amount of hands on development experience across data engineering workloads and development experiences. Though it’s also worth saying that the suggestions from my DP-600 blog still apply.

My experience and recommendations

Most of the reason I wanted to publish this was to cover exam topics, but before doing so there are three key things worth calling out:

  • Get used to navigating MSLearn - you can open an MSLearn window during the exam. It’s not quite “open book” and it’s definitely trickier to navigate MSLearn using only the search bar rather than search engine optimised results, but effectively navigating MSLearn means not always needing to remember the finest intricate details. It is time consuming, so I aimed to use it sparingly and only when I knew where I could find the answer quickly during the exam. I also forgot that this was possible so missed using it for the first 25 questions
  • Though it's somewhat related to the time spent navigating MSLearn as above, I did run quite tight on time and only had about 7 minutes remaining when I finished the exam, so use your time wisely
  • Case studies were front-loaded and not timed separately. It's a delicate balance to strike: you can't go back to the case studies, so you want to spend enough effort on them while being careful not to eat into time for the remaining questions. For reference, I spent about 20 minutes on the 2 case studies

As for the exam topics:

  • I observed a number of questions focused on error resolution across a range of examples such as access controls (linked to the item below), SQL queries, scheduled refreshes, and more
  • Be confident around permissions and access control. Though they're accessible via MSLearn, I'd recommend memorising the workspace roles (admin, member, contributor, viewer), but also consider more complex requirements such as row level security, data warehouse object security and dynamic data masking (including evaluating predicate pushdown logic)
  • Though it could be covered above, I would also suggest having some experience in testing scenarios related to workspace governance and permissions such as configuring various read and write requirements across multiple workspaces via both roles and granular access requirements. I think some questions extend this beyond a simple security question and more into a question of architecture or design
  • A broad understanding of engineering optimisation techniques is helpful, but I would recommend having a deeper understanding and hands on experience of read optimisation and table maintenance, including V-Order, OPTIMIZE, and VACUUM (see the short sketch after this list)
  • Deployment processes - pipelines, triggering orchestration based on schedules and processes, and understanding cross workspace deployment, but also Azure DevOps integration
  • Experience selection - at face value, the notebook vs data flow vs KQL and Lakehouse vs warehouse vs event stream seem straightforward but as always the detail is crucial. Be aware of scenarios outlining more specific requirements like a requirement to read data with SQL, KQL, and spark but only write with spark or choosing between methods of orchestration such as job definitions vs. pipelines 
  • Intermediate SQL, PySpark, and KQL expertise is required - I noted intermediate SQL and beginner PySpark being important for DP-600. Here, both are still true and perhaps more intermediate PySpark is required, but KQL experience is needed too, and I had quite a few code completion exercises across SQL, Spark, and KQL with a mix of simple queries, intermediate functions like grouping, sorting, and aggregating, and more advanced queries including complex joins, windowing functions and creation of primary keys. I also had one question around evaluating the functionality of a few dozen lines of PySpark code with multiple variables and try/except blocks - I felt the complexity of questions was much higher than in DP-600, but it's hard to know whether this will remain post-beta
  • I’ve already mentioned a few times above, but Real Time Intelligence was scattered across a number of questions. Alongside understanding various real time assets and KQL logic, a number of scenarios followed a similar pattern around sourcing data from event hub, outputting to lakehouses and implementing filters or aggregations, sometimes with a focus on optimisation techniques for KQL 
  • Understand mechanisms for ingesting external data sources. Though seemingly obvious for a data engineering exam, a couple of things that I would suggest being confident around are PySpark library (or environment) management and shortcuts, including shortcut caching
  • Capacity monitoring and administration including cross-workspace monitoring, checking the status of running SQL queries, and interacting with the Fabric Capacity Metrics app
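
On the table maintenance point above, here's a minimal PySpark sketch of the kind of commands worth having run hands-on before the exam (the table name and retention period are illustrative):

    from pyspark.sql import SparkSession

    # In a Fabric notebook a `spark` session already exists; getOrCreate keeps this runnable elsewhere.
    spark = SparkSession.builder.getOrCreate()

    table = "SampleLakehouse.sales"   # hypothetical lakehouse table

    # Compact small files and apply V-Order to optimise the table for downstream reads.
    spark.sql(f"OPTIMIZE {table} VORDER")

    # Remove files no longer referenced by the Delta log, keeping 7 days (168 hours) of history.
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")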
]]>
<![CDATA[AWS Certified AI Practitioner Beta - My Experience & Tips]]>TLDR

Though an interesting learning experience, and something I wouldn’t actively discourage those interested from, the beta AWS certified AI practitioner felt a little bit caught in the middle and isn’t something I’d recommend for most audiences. I think it examined things like model

]]>
https://blog.alistoops.com/aws-certified-ai-practitioner-beta-my-experience-tips/672425bb073d2a00011cd490Fri, 01 Nov 2024 01:03:46 GMTTLDRAWS Certified AI Practitioner Beta - My Experience & Tips

Though an interesting learning experience, and not something I'd actively discourage those who are interested from taking, the beta AWS Certified AI Practitioner felt a little caught in the middle and isn't something I'd recommend for most audiences. I think it examined things like model selection and broader AWS services (including IAM and networking) in more depth than necessary for most business or sales people, and in not enough depth to be practically useful for practitioners.

That said, I feel there is some value for those potentially interested in layering their knowledge and certifications and plan to undertake the next level of AI/ML AWS certification but haven’t taken a cloud or AWS exam before. For most other people, there is likely a better starting point.

What is a beta exam?

I covered this in my associate data engineer exam blog, and nothing has really changed since posting that. Two things worth mentioning are that AWS are currently offering a free retake before February 2025 if you fail the exam, and that they are offering an additional "early adopter" badge for those who achieve the certification during the beta phase or the first 6 months after. An additional point that isn't strictly beta related, but something I noted, is that the certification expires after 3 years as with all other AWS certifications; where Certified Cloud Practitioner is renewed by passing any other exam, I'm not clear whether (and doubt) the same applies to the AI Practitioner cert.


What is the AI practitioner certification and who is it aimed at? 

From the AWS website

“The ideal candidate for this exam is familiar with AI/ML technologies on AWS and uses, but does not necessarily build AI/ML solutions on AWS. Professionals in roles such as sales, marketing, and product management will be better positioned to succeed in their careers by building their skills through training and validating knowledge through certifications like AWS Certified AI Practitioner.”

And this is where I will call out my main gripe with the exam - I'm not sure the content is completely aligned to the recommended candidate, in that some of the topics I call out below are areas I wouldn't expect someone in sales or marketing, and perhaps not product management, to really get value from - model selection, S3 policies, VPCs, and Bedrock internet restrictions. I would at least acknowledge that the guidance on this being a certification not aimed at those building AI/ML solutions seems right, and that there is a good amount of content that is more aligned to the proposed audience, such as understanding the differences between fine tuning and prompt engineering, so perhaps some of what I saw in the beta won't actually be included in the generally available exam.

Exam prep 

At the time I undertook the exam, there wasn’t much collateral beyond the exam guide and the SkillBuilder plan had cost involved. Given this was a practitioner level certification and I was already familiar with both AWS technologies and AI concepts and technologies, I decided to take the exam with no preparation.

Since then, Stephane Maarek has released his Udemy course. This looks like it has very good coverage of the topics examined and, though I can’t speak to its quality, I have had positive experiences with other courses Stephane has released and I would expect this to be a good resource.

My experience and recommendations

One thing I would call out is that the exam felt almost entirely generative AI focused and there wasn’t much coverage of things like natural language processing, classification, predictive techniques, or AI concepts outside large language model applications. 

  • Understand the AI deployment process (build, train, deploy) and AWS tools you can use aligned to these. I did also notice some specific questions around EC2 types for training so I would be aware of HPC and Accelerated computing types (especially Inf)
  • Though it's an AI certification, understanding some core AWS concepts is absolutely required. I would recommend, based on my exam experience, having more than just a conceptual understanding of IAM and S3 policies, VPCs, and security groups
  • Understand the core AWS AI tools and technologies - I think the most common areas are related to having a reasonable understanding of Bedrock and Sagemaker (as well as when to use one over the other), but other things also included integration with broader services such as Connect, and knowing the use cases for Macie vs Comprehend vs Textract vs Lex
  • Commit the SageMaker features to memory - it's fewer than 20 features across 6 categories, but I found that this came up in quite a few questions
  • Prompt engineering is likely to come up more than once, so if you’re unfamiliar with AI prior to looking at this certification I would suggest reading up on prompt engineering concepts and approaches. I think this AWS site is a good place to start, but the questions are often phrased around determining the most effective approach so this is an area where comparing and contrasting methods is important
  • Understand both fine tuning and pre-training of LLMs and the difference between the two i.e. when it’s preferable to conduct each
  • I mentioned above that I observed some specific, and more than surface level, AI questions around model selection and adjustment processes. I would suggest being familiar with:
    • BERT - and relevant applications such as filling blanks in sentences
    • Epochs - and adjusting them in line with observed overfitting and underfitting
    • Temperature - common word choices, consistency, etc. (context here, and AWS guidance here; a short illustrative sketch follows this list)
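
For the inference parameter points above, here's a minimal, illustrative boto3 sketch of adjusting temperature when calling a Bedrock model. The model ID, region, and values are placeholders, and it assumes Bedrock model access is already enabled in your account:

    import boto3

    # Hypothetical region and model ID - substitute a model enabled in your own account.
    client = boto3.client("bedrock-runtime", region_name="us-east-1")

    response = client.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": "In one sentence, what does temperature control?"}]}],
        inferenceConfig={
            "temperature": 0.2,   # lower values favour common word choices and consistency
            "topP": 0.9,
            "maxTokens": 200,
        },
    )
    print(response["output"]["message"]["content"][0]["text"])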
]]>
<![CDATA[Microsoft Fabric Sample Architecture Diagram]]>I was recently preparing a presentation for an introduction to Microsoft Fabric during which I wanted to briefly talk about where Fabric fits in a typical hub and spoke Azure Landing zone as well as showing the end-to-end processing of data in Fabric for downstream consumption. Admittedly, some high level

]]>
https://blog.alistoops.com/microsoft-fabric-sample-architecture-diagram/66f5254f34c9070001acb7bfSat, 12 Oct 2024 14:38:07 GMT

I was recently preparing a presentation for an introduction to Microsoft Fabric during which I wanted to briefly talk about where Fabric fits in a typical hub and spoke Azure Landing zone as well as showing the end-to-end processing of data in Fabric for downstream consumption. Admittedly, some high level diagrams exist for the latter, but I wanted to present this slightly differently as well as showing processing of multiple data types - I always find it helpful to consider these visually.

Fabric in Azure Landing Zones

First, a few core concepts (Microsoft information). Microsoft Fabric is enabled at the tenant level, and a capacity must be attached to a subscription, resource group, and region. In the below example, I've aligned the Fabric capacity to a "data and ai" subscription in a landing zone spoke. It's worth highlighting that it's entirely possible for the Fabric box (green dotted line) to exist without any additional complexity, but I've included some examples like private endpoints to highlight that Fabric can be configured to meet more complex security or networking needs - see the network security documentation for more information. Lastly, alongside using the Microsoft backbone network for integrating with any Azure resources, Fabric enables native access to any data the user has access to at a tenant level, so you could imagine the dotted green box being extended to other subscriptions.

[Image: Fabric capacity within a hub and spoke Azure landing zone]

Fabric-specific architecture

A simple overview of the medallion architecture is available via Microsoft (image below). I also stumbled across an automotive test fleets example Microsoft shared, but could only find the base image, which I wanted to extend and have in an editable form going forward.

[Image: Microsoft's medallion architecture overview]

So, alongside my visual updates, the diagram below combines these two examples to show the implementation of multiple types of ETL process and consumption.

[Image: Combined Fabric end-to-end architecture example]

Finally, the real value I wanted to share here is the Visio file (from this GitHub repo) for the above Fabric diagram, so you can use the icons or adapt it to your needs. This was the most time consuming part of my preparation, as the concepts are already covered through various openly available content but only with specific images that aren't always the best for visual representation. I've stored this on GitHub, but please share any suggestions or feedback and I'm happy to produce other examples. Note, the base landing zone diagram was authored by a colleague before I added the data & ai subscription, so I'm not comfortable sharing that openly just yet.

]]>
<![CDATA[My Takeaways from FabCon Europe 2024 - 1 Week Later]]>Intro

From September 24th - 27th I attended the 2024 European Fabric Community Conference (or FabCon) in Stockholm. The first day of this was spent attending hands on tutorials run by the Microsoft product team, with the remaining three days kicking off with a keynote followed by at least 4

]]>
https://blog.alistoops.com/my-takeaways-from-the/6705b7ed34c9070001acb7cdTue, 08 Oct 2024 22:58:24 GMTIntroMy Takeaways from FabCon Europe 2024 - 1 Week Later

From September 24th - 27th I attended the 2024 European Fabric Community Conference (or FabCon) in Stockholm. The first day of this was spent attending hands on tutorials run by the Microsoft product team, with the remaining three days kicking off with a keynote followed by at least 4 slots to attend one of a number of breakout sessions. Alongside standard breaks for networking and food, there was also a community booth, Fabric developer challenge, Ask the Experts stand, and a number of partner booths. I’d planned to share some of my experiences and takeaways, but I wanted to take a beat and reflect once things settled. There are a couple of points below with some overlap to my daily LinkedIn reflections, but I’ve tried to minimise this or add extra detail where relevant.

Community takeaways

First of all, a note on the community aspect of the conference. Prior to attending, I wasn't sure what exactly the branding of a "community conference" would mean, and I must admit it did feel a little different than the traditional tech or data conference. From the "Fast with Fabric" developer challenge being focused on understanding how people are actually using the tool, to constantly looking for product feedback that would feed future developments, there was a dedicated effort to engage the community and it did feel like Microsoft wanted that engagement. The community booth was constantly busy throughout the entire conference, too.

  • The Fabric community is massive and growing rapidly. From the 3,300 attendees (across 61 countries) and massive Microsoft representation, to 14,000+ Fabric customers, the active forum users and user groups, and more, there's so much going on in the Fabric space. One particular note here was that Fabric is on the same trajectory that PowerBI was at the same point in its lifecycle - given the almost 400,000 PowerBI customers today, the targeted growth is obviously large
  • In terms of user growth, there were two interesting things I noted; Fabric has the fastest growing certification Microsoft offer (Analytics Engineer certification, DP-600) with 17,000+ certified people, and, from interacting with a number of attendees, especially during the tutorials, many users were from non-technical or non-data backgrounds until they needed to fix a specific problem with data (often using PowerBI) - a refreshing change from my background, and from seeing so many consider data-first rather than business problems
  • It can be incredibly valuable spending time engaging with the community, especially in person. It’s hard to imagine anyone left the event without being just a little more energised than beforehand. From resolving specific technical queries and validating design decisions and best practice to just understanding what other people are working on, and how, there was a lot to gain from talking with the Microsoft staff and MVPs at the “Ask the Experts” booths, those leading sessions, and other attendees
  • I’d extend the above to suggest that while you might not have access to so many product experts in person on a daily basis, the community forum and sub reddit are great places to engage online. I had the pleasure of meeting some of the active Reddit members (picture below, sourced from this post)
[Image: Meeting some of the active Reddit community members]

Product and broader theme takeaways

  • Effective data governance is crucial - alongside lots of discussion on how Fabric can meet the governance needs in the era of AI, there was a lot of detailed coverage around both Fabric’s built in governance features and utilising Fabric alongside Purview for extended data governance and security. I also noticed quite a number of partners in the governance and MDM space in the exhibition space including Profisee, Informatica, Semarchy, Cluedin, and more. On my daily LinkedIn updates, I called out one important quote; “Without proper governance and security, analytics initiatives are at risk of producing unreliable or compromised results" 
  • Fabric is intended to be an AI-embedded experience for developers and business users - it’s easy to say this means going all-in on AI, especially in today’s market, but I thought it was interesting that all this was discussed from all angles. From a generic focus on copilot driven development and consumption, and generative AI solutions, to integrating OneLake data to custom built AI applications and the ability to call AI functions (e.g. classify, translate) directly via notebooks with functions. This included covering key aspects like generative AI not always being the right solution and getting ready for and appropriately governing data to support AI, all backed by great demos
  • Power hour was a thoroughly enjoyable experience, and one I highly recommend checking out if you’re not familiar with them. The energy on stage, and lighthearted, enjoyable nature of the various demos was a master class in storytelling during your presentation and how to have fun with data. It reiterated something that I think most passionate data practitioners are conscious of; the value of your data is often determined by the narrative you can drive by effectively using it, and how you can explain it in simple or understandable terms to the business
  • There was a real focus on ease of use and how Microsoft are trying to minimise the barrier to entry. This included extension of Copilot features (in PowerBI, and building custom Copilots on Fabric data), the inclusion of PowerBI semantic model authoring via desktop, changes to the UI, ability to embed Real Time Dashboards, and features around ease of implementation / migration including integrating Azure Data Factory pipelines supported directly in Fabric, sessions around migrating from Azure Synapse, and upcoming (in 2024) support for migrating existing SSIS jobs
  • There were lots of great sessions and technical deep dives where architecture examples were presented e.g. connecting to Dataverse, implementing CI/CD, production lakehouse and warehouse solutions, as well as conversations with other attendees about other data technologies (Azure Data Factory, Databricks, and others). This was all a firm reminder of Werner Vogels' Frugal Architect (Law 3) that architecting is a series of tradeoffs. Don't waste time chasing perfection, but invest in resources aligned to business needs
  • While ultimately the ”proof is in the pudding” as far as listening to customer feedback goes, it felt clear that Microsoft want to factor in user feedback to how they develop Fabric - there was a Fast at Fabric challenge that was entirely aimed at gathering user feedback, Microsoft product leads were engaging with attendees to understand key sticking points, and I even had a conversation with Arun Ulag, Corporate VP of Azure Data, where the first thing he wanted to know was how I am using Fabric and how it could be improved. It was also good to see the deep dive into data warehouse performance explain the trajectory of warehouse developments and acknowledge why there were some shortcomings at launch tied to the significant effort to move to delta (format) under the hood
[Image: Me with A Guy in a Cube's Adam Saxton and Patrick Leblanc]

My favourite feature announcements

Frankly, there were too many announcements to capture or list individually, though Arun's blog and the September update blog cover most things I can recall. I still wanted to call out the announcements I saw as the most impressive, or that have previously come up in conversation as potential blockers to adoption (looking at you, Git integration!)

  • Incremental refresh for Gen2 Dataflows - those working in data engineering will be more than familiar with implementing incremental refresh, but this brings the ability via low-code through Gen2 Dataflows, which is great for those who had it as a core requirement, but also would reduce the consumption and cost of existing pipelines that are conducting full refreshes
  • Copy jobs - think of copy jobs as a prebuilt packaged copy pipeline including incremental refresh capability. Put simply, copy jobs are the quickest and easiest way to rapidly automate data ingestion
  • Tabular Model Definition Language (TMDL) for semantic models - coming soon is the ability to create new, or script out existing, semantic models using code, enabling versioning as well as consistent best practice (e.g. reusable measures). Alongside this, an additional TMDL view will be added to PowerBI
  • Git integration - though Git integration has existed for some time, it's always needed workarounds to be properly functional. During FabCon, it was announced that all core items will be covered by Git integration by the end of the year - the standout here is the inclusion of Dataflow Gen2 items
  • The new Fabric Runtime 1.3 was released. This was quoted as achieving up to four times faster performance compared to traditional Spark based on the TPC-DS 1TB benchmark, and is available at no additional cost

Best practice takeaways

  • Focused effort on optimising your semantic model is important - though Direct Lake can add value and performance, a bad (or not optimised) model outside Fabric will be a bad model in Fabric. Also, don't use the default semantic model; create a dedicated semantic model
  • V-Order is primarily intended for read operations - think carefully about how you use it. Best practice advised was to utilise V-Order for "gold" layer data feeding semantic models. Use the Delta Analyzer to examine specific details (a short sketch follows this list)
  • Using DirectLake is as important in Fabric as query folding is to PowerBI - DirectLake can massively improve read performance - an example referenced reading 2 billion rows taking 13 seconds in a P1 capacity via direct query and 234ms for 1 billion rows via DirectLake. While the record count isn’t identical, it was designed like this because a 2 billion record read would force direct query fallback
  • A metadata driven approach to building pipelines is best practice, but it’s not easy to tackle all at once, so start small and gradually expand across the organisation 
  • There's a lot to tweak around Spark optimisation (more via another blog), but one key area of discussion was Spark session startup times. Two callouts on this were to utilise starter pools and high concurrency mode. High concurrency mode enables multiple notebooks to share a Spark session and can reduce startup to as little as 5 seconds
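
On the V-Order point, a minimal PySpark sketch of the kind of checks involved before reaching for the full Delta Analyzer (the table name is hypothetical, and the config key reflects the Fabric documentation at the time of writing):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()   # in a Fabric notebook, `spark` is already provided

    # Is V-Order applied to writes in this session? Workspace or environment settings may override this.
    print(spark.conf.get("spark.sql.parquet.vorder.enabled", "not set"))

    # Inspect file count and size for a gold-layer table feeding a semantic model.
    spark.sql("DESCRIBE DETAIL gold_sales").select("numFiles", "sizeInBytes").show()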

Lastly, a quick shoutout for a couple of food recommendations. If you’re visiting Stockholm, do try and check out Berns (Asian fusion and Sushi) and Mr. Churros!

]]>
<![CDATA[My Hopes for Feature Announcements at FabCon ‘24]]>What is FabCon 24?

Europe’s first Microsoft Fabric community conference, labelled FabCon, is kicking off on September 24th, following the first global Fabric conference and preceding the 2025 event planned in Las Vegas next March/April (2025). As detailed on the conference website, a number of Microsoft’

]]>
https://blog.alistoops.com/my-hopes-for-the-microsoft-fabric/66f1b37634c9070001acb77eMon, 23 Sep 2024 19:04:12 GMTWhat is FabCon 24?My Hopes for Feature Announcements at FabCon ‘24

Europe's first Microsoft Fabric community conference, labelled FabCon, is kicking off on September 24th, following the first global Fabric conference and preceding the next event planned in Las Vegas in March/April 2025. As detailed on the conference website, a number of Microsoft's Data & AI experts, including some of the Fabric product leadership, will be in attendance.

Alongside a number of interesting speaker sessions, there is a planned session focusing on the Microsoft Fabric roadmap. On some online community forums, there has been talk around some announcements being “held back” over the last 1-2 months with the intention of providing some announcements during the conference this week.

My hopes for announcements

I thought it would be helpful to shortlist the top N announcements I'd like to see but, given the number of ongoing development activities and items covered both in the roadmap and Fabric ideas, I've broken it down into the top 3 hopes related to known roadmap developments, the top 3 unrelated to known roadmap developments, and the top 3 other, more out-there, hoped-for announcements.

  • Version Control Updates - improvement in item support, especially for Dataflow Gen2, and ideally the ability to see the code that sits behind Dataflow Gen2 items to validate and version control
  • Data Exfiltration Protection - though it’s unlikely to be a killer feature for all customers, data exfiltration protection for notebooks would be important for the more security conscious users and not covered by OneSecurity (planned) or Purview
  • T-SQL improvements (e.g. MERGE, CTEs) - I think the current T-SQL functionality meets most core needs, but it does present a challenge where specific needs aren't met (e.g. MERGE). I also see some ongoing questions around the pattern of moving on-premises data to Azure SQL before feeding it into Fabric vs. ingesting into Fabric immediately, and given where Azure SQL functionality is, there's a compelling argument to move there first
  • While I tried to stick with a top 3, I have to add that incremental load support in dataflows is another roadmap item I would love to hear more about this week

Unrelated to roadmap hopes

  • Workload Management (queues, prioritisation, adjustable burst behaviours) - I think there are lots of options as to how this is implemented, but ultimately I believe there would be a lot of value to being able to prioritise certain jobs or workspaces, and it would encourage use of larger capacities rather than splitting all non-production and production workloads at the capacity level only, which is currently the only way to facilitate workload segregation
  • Low-code UPSERT for data pipelines and dataflows - similar to the utility around implementing incremental load support above, being able to UPSERT via data pipelines would be a great quality of life improvement, especially considering it’s already possible in dataverse dataflows
  • Parameters / environment variable support (for notebooks and between experiences) - I have seen a number of examples around metadata driven pipelines already, and made use of functionality to utilise parameters for dataflows, but I think that being able to do this within notebooks and across experiences (between data pipelines and dataflows) as well as for deployment pipelines would be great

Others

  • Preview features going GA - Fabric is maturing rapidly, and also adding lots of new features regularly, but it would be great to see a few bits of information around things becoming generally available. First of all, for longer-standing features, it would be good to see some move from preview to GA. I'd also like to understand what the cadence or path from preview to GA looks like for upcoming features, as well as get more visibility of when things are actually happening and what the impact is. We have the roadmap for new features, but it's more difficult to get a clear view on how this impacts existing features or environments
  • Native mirroring for on-prem SQL Server - currently, mirroring on-prem data sources somewhat relies on setup via an Azure SQL instance. In theory, this isn't much of a blocker, but in practice, native integration with on-prem SQL Server for mirroring would provide a much better user experience
  • Asset Tagging - a few months ago, Microsoft added Folder support to help organise Workspace assets, which do their job, but the addition of tags could be more user-friendly by extending the current “certified” and “favourite” options in OneLake and hopefully provide additional options for security and governance (e.g. attaching RBAC to tags)
]]>
<![CDATA[Common Microsoft Fabric Cost Misconceptions]]>Fabric has came under its share of scrutiny since going generally available in November 2023, and much of it was or is still worth consideration. Specifically, concerns around the maturity of version control or CI/CD (in preview), some observed delays in SQL Analytics Endpoint synchronisation are often reference, and

]]>
https://blog.alistoops.com/microsoft-fabric/66df682c34c9070001acb751Mon, 09 Sep 2024 21:47:07 GMT

Fabric has come under its share of scrutiny since going generally available in November 2023, and much of it was, or still is, worth consideration. Specifically, concerns around the maturity of version control or CI/CD (in preview), some observed delays in SQL analytics endpoint synchronisation, and the pricing model being subscription or capacity-based rather than purely consumption-based are probably the points I see most commonly referenced.

Though these are all fair, and hopefully being addressed, it's also worth addressing some common misconceptions, in this case associated with capacity features, cost, and performance:

  • The minimum cost for utilising PowerBI embedded is "high": at launch, embedding was limited to F64 capacities, but that's no longer the case and embedding is possible with any F SKU. I've labelled this misconception as "high" cost as it tends to be one of two figures quoted - either the F64 capacity cost, or the embedded pricing. Ultimately, this means the actual minimum cost could be as low as a couple of hundred USD per month compared to anywhere from 800 USD (EM1) to 8,000 USD (F64)
  • F64 is required for fully-featured Fabric: this is partially true in that all features are available on F64 and above capacities. Previously, smaller capacities lacked features like trusted capacities or managed private endpoints, but all F SKUs are now mostly at feature parity - the only feature addition at F64 and above is Copilot
  • All Fabric experiences cost the same for like-to-like operations: William Crayger shared the most vivid example of this that I can remember, which describes seeing a more than 95% reduction in consumption units running the same process for ingesting data with a spark notebook compared to a low-code pipeline. I haven’t seen quite so dramatic results in my experience, but I have observed, for specific activities, up to a 75% reduced cost. That is to say, not all experiences will result in the same consumption
  • Capacity performance is better for larger capacities: some small differences can be observed due to smoothing & bursting but, as outlined here by Reitse, even comparing the smallest F2 capacity to F64, performance is largely the same
  • Fabric is "more expensive" than expected, or than PowerBI at scale: as with everything, this depends on many factors but, in terms of perception, it's worth bearing in mind that the F64 capacity cost is equivalent (when reserved) to a PowerBI Premium capacity (at $5k p/m)
  • The F64 SKU cost, including free PowerBI access for report consumers, is best fit for 800 or more users: this comes from the fact that otherwise you would be paying for around 800 (x 10 USD) PowerBI licenses for report viewers. However, it doesn't factor in capacity reservation (~40%) savings nor any licensing covered through enterprise E5 licenses. In real terms, the driving factor for capacity selection needs to be predominantly data requirements but, considering purely licensing costs for report viewers, the crossover point will vary. It could be higher (if E5 licenses are considered) or lower (if the capacity is reserved, more like 500) - see the short sketch below
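
A quick illustration of that last point, using the rough figures above (illustrative list prices; actual pricing varies by region and agreement):

    # Illustrative monthly F64 costs in USD and the pre-2025 Pro license price.
    f64_payg = 8000        # roughly the pay-as-you-go figure quoted above
    f64_reserved = 5000    # roughly the reserved / Premium-equivalent figure
    pro_per_user = 10

    print(f"Break-even viewers (pay-as-you-go): ~{f64_payg / pro_per_user:.0f}")      # ~800
    print(f"Break-even viewers (reserved):      ~{f64_reserved / pro_per_user:.0f}")  # ~500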

Now, this isn’t to say that additional considerations such as portability, version control, or workload management and prioritisation aren’t worthwhile considerations, but I think it’s good to see barriers to entry being removed for new users and a mostly consistent experience for all capacities.

]]>
<![CDATA[Microsoft Fabric Workspace Structure and Medallion Architecture Recommendations]]>Intro & Context

I’ll start by saying that I won’t be discussing the merits of the medallion architecture nor the discussion around building a literal medallion vs. Structuring against the requirements for your domain, and using medallion as a conceptual talking point, in any

]]>
https://blog.alistoops.com/microsoft-fabric-workspace-structure-and-medallion-architecture/6696aa2934c9070001acb746Wed, 17 Jul 2024 13:37:28 GMTIntro & ContextMicrosoft Fabric Workspace Structure and Medallion Architecture Recommendations

I'll start by saying that I won't be discussing the merits of the medallion architecture in any great detail, nor the debate around building a literal medallion vs. structuring against the requirements for your domain and using medallion as a conceptual talking point (but I think this is a good video talking through the latter). We will just assume you're following the guidance Microsoft publish describing the usual bronze, silver, and gold zones or layers.

I've seen a number of conversations online, namely on Fabric community blogs, Reddit, and Stack Overflow, that primarily focus on the question of whether implementing the medallion architecture means we should have one workspace per domain (e.g. sales, HR data) that covers all layers, or a workspace per layer (bronze, silver, gold) per domain. While these threads often end up with a single answer, I don't think there is a single right answer, and I also think some nuance is missed in this conversation. So what I'm going to cover here is some guidance on the implications of Fabric workspace structure, as well as recommendations for a starting point. It's also worth noting that this focuses primarily on lakehouse implementation.

Key Design Implications

Before sharing any recommendations, I want to get the “it depends” out of the way early. The “right” answer will always depend on the context of the environment and domain(s) in which you’re operating.

This is just a starting point and a way to break down some key decision areas. It's worth noting there are lots of good resources and examples of people sharing their experiences online, but the reason this blog exists is because I have seen these more often than not represent the most straightforward examples (single source systems or 1-2 user profiles), where following Microsoft's demos and Learn materials is enough. As far as considering the implications of your workspace structure in Microsoft Fabric, I would suggest key areas include:

  • Administration & Governance (who’s responsible for what, user personas)
  • Security and Access Control (data sensitivity requirements, principle of least privilege)
  • Data Quality checks, consistency, and lineage
  • Capacity (SKU) features and requirements (F64-only features, isolating workloads)
  • Users and skillsets
  • Version control & elevation processes (naming conventions, keeping layers in sync)

Potential High Level Options

[Image: Option 1 (B)]
  1. 1 workspace per layer per domain. This is recommended by Microsoft (see “Deployment Model”) as it provides more control and governance. However, it does mean the number of workspaces can grow very quickly, and operational management (i.e. who owns the pipelines, data contracts, lineage from the source systems) needs to be carefully considered
  2. 1 landing zone or bronze workspace then 1 workspace per domain for silver and gold layers. This could slightly reduce the number of workspaces by simply landing raw data in one place centrally but still maintaining separation in governance / management
  3. One workspace per domain covering all layers. Though this is against the suggestion in Microsoft documentation, it is the most straightforward option, and for simple use cases where there are no specific constraints around governance and access, or where users will be operating across all layers, this could still be suitable
[Image: Option 2 (C) - not recommended]

There’s also an additional decision point in terms of the approach to implementing the above for managing your bronze or raw data layer:

  • A) Duplicate data - where multiple domains use the same data, simply duplicate the pipelines and data in each domain. I think this is mostly a case of ownership or stewardship and your preferred operating model given the cost of storage is relatively low
  • B) Land data in a single bronze layer only, then use shortcuts - though the aim would be to utilise shortcuts to pull in data where possible, here I am specifically talking about agreeing that where raw data is used in multiple domains, you could land it in one bronze layer (say domain 1) then shortcut access to that in subsequent domains (domain 2, 3, etc.) to avoid duplication (see the sketch after this list)
  • C) Use Cloud object storage for Bronze rather than Fabric Workspaces - this is an interesting one, and I think it only really applies if you plan to go with the core option 2 where you’re looking to have a centralised bronze layer. In this case you could do it in a Fabric workspace, or you could have a bronze layer cloud object store (ADLS gen2, s3, Google Cloud Storage, etc.). I think the only potential reason to consider this is to manage granular permissions for the data store outside of Fabric. In real terms, I would rule this out completely and instead consider (B)
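
As an illustration of option B, here's a minimal Python sketch of creating a OneLake shortcut from a consuming domain's lakehouse back to a table in a central bronze lakehouse via the Fabric REST API. The IDs, names, and token handling are placeholders, and the request schema should be checked against the current shortcuts API documentation:

    import requests

    # Hypothetical values - replace with your own IDs and a valid Entra ID bearer token.
    token = "<bearer-token>"
    consumer_workspace = "<domain-2-workspace-id>"
    consumer_lakehouse = "<domain-2-lakehouse-id>"
    bronze_workspace = "<central-bronze-workspace-id>"
    bronze_lakehouse = "<central-bronze-lakehouse-id>"

    url = (f"https://api.fabric.microsoft.com/v1/workspaces/{consumer_workspace}"
           f"/items/{consumer_lakehouse}/shortcuts")
    body = {
        "path": "Tables",                # where the shortcut appears in the consuming lakehouse
        "name": "sales_raw",
        "target": {"oneLake": {
            "workspaceId": bronze_workspace,
            "itemId": bronze_lakehouse,
            "path": "Tables/sales_raw",  # the table in the central bronze lakehouse
        }},
    }

    response = requests.post(url, json=body, headers={"Authorization": f"Bearer {token}"})
    print(response.status_code, response.text)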
[Image: Option 3]

With the above in mind, you can see how there are a number of approaches or options where the potential number of workspaces could be as large as the number of domains (d) multiplied by both the number of layers (3) and the number of environments (e), or as small as the number of domains multiplied by the number of environments (d x 9 vs d x 3 for a dev, test, and prod environment structure). In the number of workspaces shown for each image, note that each would need to be multiplied by the number of environments (usually at least 3).
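
To make that arithmetic concrete, a tiny sketch using the sample scenario that follows (3 domains and 3 environments):

    domains, layers, environments = 3, 3, 3   # sales/marketing/HR; bronze/silver/gold; dev/test/prod

    option_1 = domains * layers * environments   # workspace per layer per domain: 27
    option_3 = domains * environments            # one workspace per domain: 9

    print(f"Option 1 (workspace per layer per domain): {option_1} workspaces")
    print(f"Option 3 (one workspace per domain):       {option_3} workspaces")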

It’s also worth noting, the above list isn’t exhaustive. There are other options such as considering a monolithic (one workspace for all layers) approach for some federated teams, but segregated workspaces for centrally managed data, or using monolithic for medallion layers as in option 3 (think, platform or data workspace) and a separate workspace for reporting. This is all about targeting a starting point.

[Image: Options Overview (1A, 1B top, 2A, 2B middle, 3A bottom)]

Sample Scenario

You might start to see why straightforward examples, where an individual is setting up their Fabric workspace(s) to cover data required for a single domain or team with limited risk of duplication and clear requirements around access controls, result in an obvious structure. However, what does this mean when we begin to scale this across multiple domains in an organisation and a larger number of Fabric users?

For the purpose of considering any options, I’m going to make some assumptions for a sample scenario. We’re going to consider a fictional organisation that is rolling out Fabric across 3 domains initially; Sales, Marketing, and HR. Sales and Marketing will utilise at least one of the same source datasets, and there are no clear security or access controls between layers, but administration and governance must be segregated by domain. The organisation must keep separation between prod and non-prod workloads, and there will be dev, test, and prod environments. 

In the sample scenario, there are a number of options, but we could recommend, for example:

  • For each domain, utilise a workspace per environment (dev/test/prod)
  • For each domain, utilise a single workspace consolidating medallion layers (bronze / raw to gold / ready for analytical use)

While this doesn't seem unreasonable, I will admit that it's not particularly realistic. In most cases, I would expect there to be a preference for better governance and control around access to raw (bronze) data. In that case, while much of the above is accurate, I would expect the last point to be fundamentally different. Rather than moving to the other end of the spectrum and creating individual workspaces for all layers, domains, and environments, a real-world example I worked through previously, which was reasonably similar to the description above, went with a single bronze layer landing zone (option 2B).

Recommendations

  • There isn't a single "right" answer here, so discussing the trade offs will only get you so far. I would suggest picking your best or clearest use case(s), reviewing your high level requirements, and building to a proposed or agreed approach, then look to figure out whether certain issues need to be addressed
  • In general, I would recommend using shortcuts to minimise data duplication both across source storage and Fabric and across Fabric workspaces (described in option B). I really think it’s the best way to operate
  • Start by testing the assumption that it makes sense to use one workspace covering all layers of the medallion (option 3). While I think this will only make sense in practice with some adjustment (e.g. splitting data and reporting), and Microsoft recommend a workspace per layer, this is the biggest influencing factor in terms of affecting administration at scale
  • If you need to segregate medallion layers into individual workspaces, I would propose starting with option 1 (B)
  • Where possible, utilise different capacities for each environment, or at least for production environments to make sure production workloads aren’t affected by dev or test jobs. Though not specifically a recommendation related to workspaces, this has come up in any scoping I’ve been part of with Fabric capacity planning to date - in the sample scenario, that would mean two separate capacities, one for dev and test environments and one for prod to separate workloads, compromising between flexibility and management overhead. This would result in the number of workspaces being 3 times the number of domains rather than 9 times. This may change if there is more flexible workload management in future Fabric updates, and smoothing will help those running a single capacity, but it’s not currently possible to isolate workloads without multiple capacities
  • There are also a couple of item level recommendations I would consider:
    • Semantic models should primarily be implemented in the Gold layer - these are really to facilitate end use of modelled data. Adding semantic models in all layers could just add to the complexity of your Fabric estate
    • It’s likely that the design pattern of utilising Lakehouses only or Warehouses only (only meaning from bronze to gold) will be common. However, it’s worth considering the different permutations against your needs. In my experience, a good starting point is using lakehouses for bronze and silver, and warehouses in gold (see here for some key differences)
  • If you use a warehouse in the gold layer, look to utilise tables where possible and avoid views. Views will disable Directlake (or, rather, cause fallback behaviour) resulting in poorer performance

What Next?

I appreciate this has been focused on creating the starting point, and I just wanted to add some personal opinions on how I've seen this work effectively. First of all, I think the decision around environment setup has a bigger effect than the trade off between a higher or lower number of workspaces. What I mean by that is that without having environments separated by a workspace boundary, utilising deployment pipelines and version control is either difficult or impossible, so it's crucial to have different workspaces per environment.

Next, I believe that the key driver for creating workspaces per medallion layer is data governance and access controls. For me, the most logical way to balance that with the administration overhead is to use option 3, the monolithic approach, and add an additional "reporting" workspace for each domain to allow governance and access control management between accessing source data and consuming reports, without having a massive number of workspaces to manage.

[Image: Option…4?]
]]>