blog.AliStoops.com
https://blog.alistoops.com/
https://blog.alistoops.com/favicon.png
Ghost 5.82
Tue, 19 Nov 2024 12:20:59 GMT

<![CDATA[Minimising Spark Startup Duration in Microsoft Fabric]]>

Context

Often with cloud services, consumption equals cost. Microsoft Fabric isn’t much different, though there is some nuance with the billing model in that, in some cases, increasing consumption by 20-30% could double the cost due to the need to move to a bigger SKU and in other cases

]]>
https://blog.alistoops.com/minimising-spark-startup-duration-in-microsoft-fabric/673bd96f073d2a00011cd52b
Tue, 19 Nov 2024 00:52:29 GMT

Minimising Spark Startup Duration in Microsoft Fabric

Context

Often with cloud services, consumption equals cost. Microsoft Fabric isn’t much different, though there is some nuance with the billing model: in some cases, increasing consumption by 20-30% could double the cost due to the need to move to a bigger SKU, while in other cases you might have the headroom and see no cost increase. I’m keen not to get into the depths of SKUs and CU consumption here but, at the most basic level for Spark notebooks, time / duration has a direct correlation with cost, and it generally makes sense to look for opportunities to minimise CU consumption.

In terms of where this becomes relevant to Spark startup times in Fabric, it’s worth noting that the startup duration counts towards CU(s) consumption for scheduled notebooks, and also increases the duration of each pipeline run.

I’ll start by sharing a couple of screenshots with session start times where high concurrency sessions and starter pools (details below) aren’t used. After running each half a dozen times, the start times were almost always over 2 minutes and up to 7 minutes with an average of around 3 minutes. 
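To put those startup durations into cost terms, here is a rough sketch. It assumes the documented ratio of one capacity unit (CU) to two Spark vCores; the node size and count below are illustrative assumptions, not figures from my tests:

```python
def startup_cu_seconds(startup_seconds: float, total_vcores: int) -> float:
    """CU-seconds consumed while a session spins up."""
    cu = total_vcores / 2  # 1 CU provides 2 Spark vCores
    return cu * startup_seconds

# Example: 3 nodes of an assumed 8 vCore size, with a ~3 minute cold start
overhead = startup_cu_seconds(startup_seconds=180, total_vcores=3 * 8)
print(f"{overhead:.0f} CU-seconds per cold start")  # 2160 CU-seconds per cold start
```

Multiply that by the number of scheduled runs per day and the startup overhead alone becomes a meaningful slice of capacity.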

Environments

[Image: Custom pool with small node size]

Before jumping in to a couple of recommendations and examples, I also wanted to comment briefly on Fabric environments. Environments can be used to consolidate and standardise hardware and software settings; the Microsoft documentation has more information on this. Up until running the series of tests for this blog, I had mainly used environments for deploying custom Python packages. You’ll see a custom environment in some screenshots below (for the small node size), where I adapted environment settings to quickly change Spark compute resources and apply them consistently across sessions, without changing the workspace default, while testing high concurrency sessions with specific compute resources.

[Image: Custom pool with medium node size]

Basic Testing

Having run sessions without utilising high concurrency or starter pools across a range of environments, the results are outlined below:

  • Small node size, memory optimised, 1-10 nodes - 2 minutes 42 seconds
  • Medium node size - this one was interesting. If you create a custom pool with similar settings to the default starter pool, startup can be around 10 seconds, but minor adjustments to the pool, namely changing the number of drivers or executors or reducing memory from 56GB to 28GB, saw this jump to 7 minutes 7 seconds
  • Large node size, memory optimised, 1-6 nodes - 2 minutes 17 seconds
[Image: Small node size (demo environment details in first image above)]
[Image: Medium node size, custom environment settings can be seen in the environment section]
[Image: Large node size]

High Concurrency 

[Image: Connecting to high concurrency sessions]

High concurrency mode in Fabric enables users to share a Spark session across up to 5 concurrent sessions. Though there are some considerations, namely the requirement to use the same Spark compute properties, the Microsoft documentation suggests a 36 times faster session start for custom pools. In my experience, the actual start time was even quicker than suggested - almost instantaneous compared to around 3 minutes - and across 3 tests this ranged from 55 times faster to almost 90. That said, it’s also worth noting that the first high concurrency session start was often slightly longer than starting a standard session: more like 3 minutes than 2.5.

[Image: Startup of the first high concurrency session]

In all node size variations, the startup times for subsequent high concurrency sessions were either 2 or 3 seconds. The images below were taken for the demo environment outlined above (small node size).

[Image: Startup for attaching the second high concurrency session]

Starter Pools


Fabric starter pools are always-on Spark clusters that are ready to use with almost no startup time. You can still configure starter pools for autoscaling and dynamic allocation, but node family and size are locked to memory optimised and medium respectively. In my experience, startup time was anywhere from 3 to 8 seconds.

[Image: Startup time utilising starter pools as the workspace default]

Closing Thoughts

In short, where you’re comfortable with existing configurations and consumption, or no custom pools are required, look to utilise starter pools. Where custom pools are required due to tailoring requirements around node size or family, and multiple developers are working in parallel, aim to use high concurrency sessions.

]]>
<![CDATA[Power BI Pricing Update]]>

Context

Yesterday, November 12th, Microsoft announced changes to Power BI licensing that represent a 20-40% (license dependent) increase per user license. It’s worth mentioning that this is the first increase since July 2015, more than 9 years ago. From April 1st 2025, Pro licensing will increase from

]]>
https://blog.alistoops.com/power-bi-pricing-update/6733ec20073d2a00011cd516
Wed, 13 Nov 2024 00:03:10 GMT

Power BI Pricing Update

Context

Yesterday, November 12th, Microsoft announced changes to Power BI licensing that represent a 20-40% (license dependent) increase per user license. It’s worth mentioning that this is the first increase since July 2015, more than 9 years ago. From April 1st 2025, Pro licensing will increase from $10 to $14 per user per month, and Premium Per User (PPU) licensing from $20 to $24 per user per month.

More details are covered in the Microsoft blog post.

What’s not affected

Though this is naturally going to affect a large number of users, I think it’s most likely to impact small and medium sized corporates and those not currently using, or planning to use, Fabric. This is because some elements remain unaffected: the changes apply specifically to per user licensing, not licensing under enterprise agreements. Not changing is:

  • Fabric F SKU pricing
  • Embedded pricing (under EM and F SKUs)
  • E5 licensing - this still includes a Power BI Pro license with no increase in cost
  • PPU add on licensing for E5
  • Non-profit licensing (currently priced lower than enterprise or personal licenses)

Scenarios & suggestions 

  • Excluded / unaffected: if you’re currently E5 licensed, licensed through a non-profit, or utilising the viewer licensing included with an F64 Fabric capacity, there’s nothing to worry about
  • Utilising Fabric, but below F64: the new licensing shifts the tipping point at which the jump to F64 makes sense, given viewer licensing is included in the capacity cost. For an F32 capacity with reserved pricing and Pro licensed users, this was previously around 250 users; now it’s more like 180. The crossover point is the difference in cost between your current SKU and F64, divided by the per user per month license cost (e.g. around 268 users for F16)
  • Utilising Fabric but not embedding: more often than not, Power BI users are licensed to access reports via the Power BI service. However, with Fabric F SKUs, you can make the most of embedding for your organisation and organisational apps, facilitating consumption of reports without users needing to access the Power BI service, and therefore reducing the potential licensing requirements for viewers
  • Utilising Power BI but not yet Fabric: both of the points above are still worth considering. In fact, I think the lower SKUs (F2 and F4) could pay for themselves if you’re able to utilise embedding for your org instead of Power BI licenses for report viewers, for as few as 11 users (F2, reserved pricing). This could be a great reason to consider Fabric if you’re not already
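The crossover arithmetic above can be sanity-checked in a few lines of Python. The prices are my assumed approximate US reserved monthly rates (F64 ≈ $5,003, F32 ≈ $2,501, F16 ≈ $1,251, F2 ≈ $156) - check current pricing for your region before relying on them:

```python
# Assumed approximate US reserved monthly capacity prices (USD)
F64, F32, F16, F2 = 5003, 2501, 1251, 156
PRO = 14  # Pro licence per user per month from April 2025

def f64_crossover(current_sku_cost: float) -> int:
    """Viewer count at which stepping up to F64 (viewers included) beats Pro licences."""
    return round((F64 - current_sku_cost) / PRO)

print(f64_crossover(F32))  # 179 viewers, i.e. "more like 180"
print(f64_crossover(F16))  # 268 viewers

# The same logic for a low SKU paying for itself through embedding
print(round(F2 / PRO))     # 11 viewers
```

If the prices differ in your region or under your agreement, the same division still gives the tipping point.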

As for everyone else, unfortunately there isn’t much of an option beyond preparing for the increased cost sooner rather than later and communicating it to decision makers. That said, I would hazard a guess that the cost of transitioning organisational reporting to another platform would likely outweigh any benefit, and given the previous history I would hope an increase like this is not likely to happen again for some time.

]]>
<![CDATA[Microsoft Fabric Data Engineer Associate (DP-700) Beta - My Experience & Tips]]>

TLDR

DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric is a challenging exam that covers most aspects of data engineering on Microsoft Fabric. Personally, I consider it a tougher exam than DP-600 and I believe it could do with some rebalancing around examining more spark and pipeline / orchestration topics, but

]]>
https://blog.alistoops.com/microsoft-fabric-data-engineer-associate-dp-700-beta-my-experience-tips/67281b0b073d2a00011cd4a7
Mon, 04 Nov 2024 01:07:21 GMT

Microsoft Fabric Data Engineer Associate (DP-700) Beta - My Experience & Tips

TLDR

DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric is a challenging exam that covers most aspects of data engineering on Microsoft Fabric. Personally, I consider it a tougher exam than DP-600, and I believe it could do with some rebalancing towards examining more Spark and pipeline / orchestration topics, but all topics felt relevant and there wasn’t too much variation in question complexity, with one or two exceptions.

That said, it’s likely existing Fabric data engineers are more familiar with some experiences than others, so there is probably some learning needed for most people - especially those with limited Real-Time Intelligence (or KQL) and error resolution experience.

I expect some of the balancing to be addressed for the more left-field questions when the exam goes live, based on feedback from those undertaking the beta. Even though DP-700 has a large overlap in high level topics with DP-600, I think the exam considers them from different angles or contexts, and it should align well with those looking to prove their data engineering skills with Fabric.

What is a beta exam?

Exams in beta are assessed for performance and accuracy, with feedback on questions being gathered. The beta process is slightly different with AWS exams (described here) than Azure - for this beta exam there isn’t a guaranteed discount (usually 50% with AWS), the beta period duration is not clearly defined ahead of time (an end date isn’t currently published), and the results aren’t released 5 days after taking the exam but around 10 days after the exam exits its beta period, so I don’t yet have my results. Microsoft publish some information on beta exams here.

What is the Fabric DE certification and who is it aimed at? 

Aside from the obvious role alignment in the certification title, the Microsoft documentation describes the expected candidate for this exam as someone with “subject matter expertise with data loading patterns, data architectures, and orchestration processes” as well as the below measured skills:

  • Implement and manage an analytics solution (30–35%)
  • Ingest and transform data (30–35%)
  • Monitor and optimize an analytics solution (30–35%)

One thing I would call out around ingesting and transforming data is that the exam covers all mechanisms of doing so - notebooks, stored procedures, dataflows, KQL querysets - utilising Spark, SQL, and KQL.

Exam prep 

At the time the beta exam became available, there wasn’t a formal Microsoft Learn training path. Beyond a few blogs from some Microsoft Data Platform MVPs, the only real collateral that exists is a collection of Learn modules. For those who undertook the DP-600 exam before November 15th 2024 (see some blogs about the updates here and here), this collection is mostly similar to the DP-600 Learn modules. Additions include “Secure a Microsoft Fabric data warehouse”, three Real-Time Intelligence modules (getting started with RTI, using Real Time Eventstreams, and querying data in a KQL database), and “Getting started with Data Activator in Microsoft Fabric.” Beyond this, the only real preparation I can suggest is a reasonable amount of hands-on development experience across data engineering workloads and development experiences, though the suggestions from my DP-600 blog still apply.

My experience and recommendations

Most of the reason I wanted to publish this was to cover exam topics, but before doing so there are three key things worth calling out:

  • Get used to navigating MSLearn - you can open an MSLearn window during the exam. It’s not quite “open book” and it’s definitely trickier to navigate MSLearn using only the search bar rather than search engine optimised results, but effectively navigating MSLearn means not always needing to remember the finest intricate details. It is time consuming, so I aimed to use it sparingly and only when I knew where I could find the answer quickly during the exam. I also forgot that this was possible so missed using it for the first 25 questions
  • Though it’s somewhat related to time spent navigating MSLearn as above, I did run quite tight on time and only had about 7 minutes remaining when I finished the exam, so use your time wisely
  • Case studies were front-loaded and not timed separately. There’s a delicate balance to be struck: you can’t go back to the case studies, so you want to make sure you spend enough effort on them while being careful not to waste time needed for the remaining questions. For reference, I spent about 20 minutes on the 2 case studies

As for the exam topics:

  • I observed a number of questions focused on error resolution across a range of examples such as access controls (linked to the item below), SQL queries, scheduled refreshes, and more
  • Be confident around permissions and access control. Though they’re accessible via MSLearn, I’d recommend memorising the workspace roles (admin, member, contributor, viewer), but also consider more complex requirements such as row level security, data warehouse object security, and dynamic data masking (including evaluating predicate pushdown logic)
  • Though it could be covered above, I would also suggest having some experience in testing scenarios related to workspace governance and permissions such as configuring various read and write requirements across multiple workspaces via both roles and granular access requirements. I think some questions extend this beyond a simple security question and more into a question of architecture or design
  • A broad understanding of engineering optimisation techniques is helpful, but I would recommend having a deeper understanding and hands on experience of read optimisation and table maintenance including V-Order, OPTIMIZE, and VACUUM
  • Deployment processes - pipelines, triggering orchestration based on schedules and processes, and understanding cross workspace deployment, but also Azure DevOps integration
  • Experience selection - at face value, the notebook vs data flow vs KQL and Lakehouse vs warehouse vs event stream seem straightforward but as always the detail is crucial. Be aware of scenarios outlining more specific requirements like a requirement to read data with SQL, KQL, and spark but only write with spark or choosing between methods of orchestration such as job definitions vs. pipelines 
  • Intermediate SQL, PySpark, and KQL expertise is required - I noted intermediate SQL and beginner PySpark being important for DP-600. Here, both are still true, though perhaps intermediate PySpark is now required, and KQL experience is needed too. I had quite a few code completion exercises across SQL, Spark, and KQL, with a mix of simple queries, intermediate functions like grouping, sorting, and aggregating, and more advanced queries including complex joins, windowing functions, and creation of primary keys. I also had one question around evaluating the functionality of a few dozen lines of PySpark code with multiple variables and try/except blocks - I felt the complexity of questions was much higher than in DP-600, but it’s hard to know whether this will remain post-beta
  • I’ve already mentioned it a few times above, but Real-Time Intelligence was scattered across a number of questions. Alongside understanding the various real time assets and KQL logic, a number of scenarios followed a similar pattern: sourcing data from an event hub, outputting to lakehouses, and implementing filters or aggregations, sometimes with a focus on optimisation techniques for KQL
  • Understand mechanisms for ingesting external data sources. Though seemingly obvious for a data engineering exam, a couple of things that I would suggest being confident around are PySpark library (or environment) management and shortcuts, including shortcut caching
  • Capacity monitoring and administration, including cross-workspace monitoring, checking running SQL query status, and interacting with the Fabric capacity monitoring app
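To illustrate the table maintenance commands mentioned above, here’s how they might be issued from a Fabric notebook. This is a sketch: the table name is hypothetical, and in a real session each statement would be passed to spark.sql(...):

```python
# Hypothetical Delta table; OPTIMIZE/VORDER/VACUUM syntax per the Fabric and Delta docs
maintenance_sql = [
    # Compact small files and rewrite them with V-Order for faster reads
    "OPTIMIZE sales.orders VORDER",
    # Remove unreferenced files older than the 7-day (168 hour) default retention
    "VACUUM sales.orders RETAIN 168 HOURS",
]

# In a notebook you would run: for stmt in maintenance_sql: spark.sql(stmt)
for stmt in maintenance_sql:
    print(stmt)
```

Knowing what each command does to the underlying Parquet files, and when retention settings make VACUUM safe, is the level of depth worth aiming for.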
]]>
<![CDATA[AWS Certified AI Practitioner Beta - My Experience & Tips]]>

TLDR

Though an interesting learning experience, and something I wouldn’t actively discourage those interested from, the beta AWS certified AI practitioner felt a little bit caught in the middle and isn’t something I’d recommend for most audiences. I think it examined things like model

]]>
https://blog.alistoops.com/aws-certified-ai-practitioner-beta-my-experience-tips/672425bb073d2a00011cd490
Fri, 01 Nov 2024 01:03:46 GMT

AWS Certified AI Practitioner Beta - My Experience & Tips

TLDR

Though it was an interesting learning experience, and something I wouldn’t actively discourage those interested in it from taking, the beta AWS Certified AI Practitioner felt a little caught in the middle and isn’t something I’d recommend for most audiences. I think it examined things like model selection and broader AWS services (including IAM and networking) in more depth than necessary for most business or sales people, and in not enough depth to be practically useful for practitioners.

That said, I feel there is some value for those potentially interested in layering their knowledge and certifications and plan to undertake the next level of AI/ML AWS certification but haven’t taken a cloud or AWS exam before. For most other people, there is likely a better starting point.

What is a beta exam?

I covered this under my associate data engineer exam blog, and nothing has really changed since posting that. Two things worth mentioning are that AWS are currently offering a free retake before February 2025 if you fail the exam, and an additional “early adopter” badge for those who achieve the certification during the beta phase or the first 6 months after. An additional point that isn’t strictly beta related, but something I noted, is that the certification expires after 3 years as with all other AWS certifications; however, where Certified Cloud Practitioner is renewed by passing any other exam, I’m not clear (and doubt) the same applies to the AI Practitioner cert.


What is the AI practitioner certification and who is it aimed at? 

From the AWS website

“The ideal candidate for this exam is familiar with AI/ML technologies on AWS and uses, but does not necessarily build AI/ML solutions on AWS. Professionals in roles such as sales, marketing, and product management will be better positioned to succeed in their careers by building their skills through training and validating knowledge through certifications like AWS Certified AI Practitioner.”

And I think this is where I will call out my main gripe with the exam: I’m not sure the content is completely aligned to the recommended candidate. Some of the topics I call out below - model selection, S3 policies, VPCs, and Bedrock internet restrictions - are areas I wouldn’t expect someone in sales or marketing, and perhaps not product management, to really get value from. I would at least acknowledge that the guidance on this being a certification not aimed at those building AI/ML solutions seems right, and that a good amount of the content is more aligned to the proposed audience, such as understanding the differences between fine tuning and prompt engineering, so perhaps some of what I saw in the beta won’t actually be included in the generally available exam.

Exam prep 

At the time I undertook the exam, there wasn’t much collateral beyond the exam guide, and the SkillBuilder plan involved additional cost. Given this is a practitioner level certification and I was already familiar with both AWS technologies and AI concepts, I decided to take the exam with no preparation.

Since then, Stephane Maarek has released his Udemy course. This looks like it has very good coverage of the topics examined and, though I can’t speak to its quality, I have had positive experiences with other courses Stephane has released and I would expect this to be a good resource.

My experience and recommendations

One thing I would call out is that the exam felt almost entirely generative AI focused and there wasn’t much coverage of things like natural language processing, classification, predictive techniques, or AI concepts outside large language model applications. 

  • Understand the AI deployment process (build, train, deploy) and the AWS tools you can use aligned to these. I also noticed some specific questions around EC2 instance types for training, so I would be aware of HPC and accelerated computing types (especially Inf)
  • Though it’s an AI certification, understanding some core AWS concepts is absolutely required. Based on my exam experience, I would recommend having more than just a conceptual understanding of IAM and S3 policies, VPCs, and security groups
  • Understand the core AWS AI tools and technologies - I think the most common areas are related to having a reasonable understanding of Bedrock and SageMaker (as well as when to use one over the other), but other things also included integration with broader services such as Connect, and knowing the use cases for Macie vs Comprehend vs Textract vs Lex
  • Commit the SageMaker features to memory - it’s fewer than 20 features across 6 categories, but I found that this came up in quite a few questions
  • Prompt engineering is likely to come up more than once, so if you’re unfamiliar with AI prior to looking at this certification I would suggest reading up on prompt engineering concepts and approaches. I think this AWS site is a good place to start, but the questions are often phrased around determining the most effective approach so this is an area where comparing and contrasting methods is important
  • Understand both fine tuning and pre-training of LLMs and the difference between the two i.e. when it’s preferable to conduct each
  • I mentioned above that I observed some specific, and more than surface level, AI questions around model selection and adjustment processes. I would suggest being familiar with:
    • BERT - and relevant applications such as filling blanks in sentences
    • Epochs - adjusting them in line with observed over- and under-fitting
    • Temperature - common word choices, consistency, etc. (context here, and AWS guidance here)
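On that last point, a small sketch of how temperature reshapes a model’s next-token distribution may help: lower temperature concentrates probability on the most likely words (consistency), while higher temperature flattens the distribution (more varied word choices). The logits below are made up for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate words
cold = softmax_with_temperature(logits, 0.5)  # low temperature: top word dominates
hot = softmax_with_temperature(logits, 2.0)   # high temperature: flatter spread

print([round(p, 2) for p in cold])
print([round(p, 2) for p in hot])
```

The exam questions tend to be phrased as "which setting makes output more consistent / more creative", which maps directly onto this lower / higher temperature distinction.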
]]>
<![CDATA[Microsoft Fabric Sample Architecture Diagram]]>

I was recently preparing a presentation for an introduction to Microsoft Fabric during which I wanted to briefly talk about where Fabric fits in a typical hub and spoke Azure Landing zone as well as showing the end-to-end processing of data in Fabric for downstream consumption. Admittedly, some high level

]]>
https://blog.alistoops.com/microsoft-fabric-sample-architecture-diagram/66f5254f34c9070001acb7bf
Sat, 12 Oct 2024 14:38:07 GMT

I was recently preparing a presentation for an introduction to Microsoft Fabric during which I wanted to briefly talk about where Fabric fits in a typical hub and spoke Azure Landing zone as well as showing the end-to-end processing of data in Fabric for downstream consumption. Admittedly, some high level diagrams exist for the latter, but I wanted to present this slightly differently as well as showing processing of multiple data types - I always find it helpful to consider these visually.

Fabric in Azure Landing Zones

First, a few core concepts (Microsoft information). Microsoft Fabric is enabled at the tenant level, and a capacity must be attached to a subscription, resource group, and region. In the below example, I’ve aligned the Fabric capacity to a “data and ai” subscription in a landing zone spoke. It’s worth highlighting that it’s entirely possible for the Fabric box (green dotted line) to exist without any additional complexity, but I’ve included some examples like private endpoints to show that Fabric can be configured to meet more complex security or networking needs - see the network security documentation for more information. Lastly, alongside using the Microsoft backbone network for integrating with Azure resources, Fabric enables native access to any data the user has access to at the tenant level, so you could imagine the dotted green box extending to other subscriptions.

[Image]

Fabric-specific architecture

A simple overview of the medallion architecture is available via Microsoft (image below). I also stumbled across an automotive test fleets example Microsoft shared, but could only find the base image, and I wanted something editable that I could extend going forward.

[Image]

So, alongside my visual updates, the diagram below combines these two examples to show the implementation of multiple types of ETL processes and consumption.

[Image]

Finally, the real value I wanted to share here is the Visio file (from this GitHub repo) for the above Fabric diagram, so you can use the icons or adapt it to your needs - producing it was the most time consuming part of my preparation, as the concepts are already covered in openly available content but only with static images that aren’t always the best for visual representation. Please share any suggestions or feedback, and I’m happy to produce other examples. Note: the base landing zone diagram was authored by a colleague before I added the data & AI subscription, so I’m not comfortable sharing that openly just yet.

]]>
<![CDATA[My Takeaways from FabCon Europe 2024 - 1 Week Later]]>

Intro

From September 24th - 27th I attended the 2024 European Fabric Community Conference (or FabCon) in Stockholm. The first day of this was spent attending hands on tutorials run by the Microsoft product team, with the remaining three days kicking off with a keynote followed by at least 4

]]>
https://blog.alistoops.com/my-takeaways-from-the/6705b7ed34c9070001acb7cd
Tue, 08 Oct 2024 22:58:24 GMT

My Takeaways from FabCon Europe 2024 - 1 Week Later

Intro

From September 24th - 27th I attended the 2024 European Fabric Community Conference (FabCon) in Stockholm. The first day was spent attending hands-on tutorials run by the Microsoft product team, with each of the remaining three days kicking off with a keynote followed by at least 4 slots to attend one of a number of breakout sessions. Alongside standard breaks for networking and food, there was also a community booth, a Fabric developer challenge, an Ask the Experts stand, and a number of partner booths. I’d planned to share some of my experiences and takeaways, but wanted to take a beat and reflect once things settled. There are a couple of points below that overlap with my daily LinkedIn reflections, but I’ve tried to minimise this or add extra detail where relevant.

Community takeaways

First of all, a note on the community aspect of the conference. Prior to attending, I wasn’t sure what exactly the branding of a “community conference” would mean, and I must admit it felt a little different from the traditional tech or data conference. There was a dedicated effort to engage the community: from the “Fast with Fabric” developer challenge, focused on understanding how people are actually using the tool, to the constant search for product feedback to feed future developments, it genuinely felt like Microsoft wanted community engagement. The community booth was constantly busy throughout the entire conference, too.

  • The Fabric community is massive and growing rapidly. From the 3,300 attendees (across 61 countries) and massive Microsoft representation, to 14,000+ Fabric customers and the active forum users and user groups, there’s so much going on in the Fabric space. One particular note here was that Fabric is on the same trajectory that Power BI was at the same point in its lifecycle - given the almost 400,000 Power BI customers today, there is obviously a large targeted growth
  • In terms of user growth, there were two interesting things I noted: Fabric has the fastest growing certification Microsoft offers (the Analytics Engineer certification, DP-600) with 17,000+ certified people, and, from interacting with a number of attendees, especially during the tutorials, many users came from non-technical or non-data backgrounds until they needed to fix a specific problem with data (often using Power BI) - a refreshing change from my background, and from seeing so many consider data-first rather than business problems
  • It can be incredibly valuable spending time engaging with the community, especially in person. It’s hard to imagine anyone left the event without being just a little more energised than beforehand. From resolving specific technical queries and validating design decisions and best practice to just understanding what other people are working on, and how, there was a lot to gain from talking with the Microsoft staff and MVPs at the “Ask the Experts” booths, those leading sessions, and other attendees
  • I’d extend the above to suggest that while you might not have access to so many product experts in person on a daily basis, the community forum and subreddit are great places to engage online. I had the pleasure of meeting some of the active Reddit members (picture below, sourced from this post)
[Image]

Product and broader theme takeaways

  • Effective data governance is crucial - alongside lots of discussion on how Fabric can meet the governance needs in the era of AI, there was a lot of detailed coverage of both Fabric’s built in governance features and utilising Fabric alongside Purview for extended data governance and security. I also noticed quite a number of partners in the governance and MDM space in the exhibition hall, including Profisee, Informatica, Semarchy, CluedIn, and more. In my daily LinkedIn updates, I called out one important quote: “Without proper governance and security, analytics initiatives are at risk of producing unreliable or compromised results"
  • Fabric is intended to be an AI-embedded experience for developers and business users - it’s easy to say this means going all-in on AI, especially in today’s market, but I thought it was interesting that this was discussed from all angles: from copilot-driven development and consumption and generative AI solutions, to integrating OneLake data into custom-built AI applications and the ability to call AI functions (e.g. classify, translate) directly from notebooks. This included covering key aspects like generative AI not always being the right solution, and getting data ready for (and appropriately governed to support) AI, all backed by great demos
  • Power hour was a thoroughly enjoyable experience, and one I highly recommend checking out if you’re not familiar with them. The energy on stage, and lighthearted, enjoyable nature of the various demos was a master class in storytelling during your presentation and how to have fun with data. It reiterated something that I think most passionate data practitioners are conscious of; the value of your data is often determined by the narrative you can drive by effectively using it, and how you can explain it in simple or understandable terms to the business
  • There was a real focus on ease of use and how Microsoft are trying to minimise the barrier to entry. This included extension of Copilot features (in PowerBI, and building custom Copilots on Fabric data), the inclusion of PowerBI semantic model authoring via desktop, changes to the UI, ability to embed Real Time Dashboards, and features around ease of implementation / migration including integrating Azure Data Factory pipelines supported directly in Fabric, sessions around migrating from Azure Synapse, and upcoming (in 2024) support for migrating existing SSIS jobs
  • There were lots of great sessions and technical deep dives where architecture examples were presented e.g. connecting to Dataverse, implementing CI/CD, production lakehouse and warehouse solutions as well as conversation with other attendees about other data technologies (Azure Data Factory, Databricks, and others). This was all a firm reminder of Werner Vogel’s Frugal Architect (Law 3) that architecting is a series of tradeoffs. Don’t waste time chasing perfection, but invest in resources aligned to business needs
  • While ultimately the “proof is in the pudding” as far as listening to customer feedback goes, it felt clear that Microsoft want to factor user feedback into how they develop Fabric - there was a Fast at Fabric challenge aimed entirely at gathering user feedback, Microsoft product leads were engaging with attendees to understand key sticking points, and I even had a conversation with Arun Ulag, Corporate VP of Azure Data, where the first thing he wanted to know was how I am using Fabric and how it could be improved. It was also good to see the deep dive into data warehouse performance explain the trajectory of warehouse developments and acknowledge why there were some shortcomings at launch, tied to the significant effort of moving to the delta format under the hood
Me with A Guy in a Cube’s Adam Saxton and Patrick Leblanc

My favourite feature announcements

Frankly, there were too many announcements to capture individually, though Arun’s blog and the September update blog cover most things I can recall. I still wanted to call out the announcements I saw as either the most impressive or that have previously come up in conversation as potential blockers to adoption (looking at you, Git integration!).

  • Incremental refresh for Gen2 Dataflows - those working in data engineering will be more than familiar with implementing incremental refresh, but this brings the ability via low-code through Gen2 Dataflows, which is great for those who had it as a core requirement, but also would reduce the consumption and cost of existing pipelines that are conducting full refreshes
  • Copy jobs - think of copy jobs as a prebuilt packaged copy pipeline including incremental refresh capability. Put simply, copy jobs are the quickest and easiest way to rapidly automate data ingestion
  • Tabular Model Definition Language (TMDL) for semantic models - coming soon is the ability to create semantic models, or script out existing ones, using code, enabling versioning as well as consistent best practice (e.g. reusable measures). Alongside this, an additional TMDL view will be added to PowerBI
  • Git integration - though Git integration has existed for some time, it’s always needed some kind of workaround to be properly functional. During FabCon, it was announced that all core items will be covered by Git integration by the end of the year - the standout here is the inclusion of Dataflow Gen2 items
  • The new Fabric Runtime 1.3 was released. This was quoted as achieving up to four times faster performance compared to traditional Spark based on the TPC-DS 1TB benchmark, and is available at no additional cost
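
To make the incremental refresh announcement concrete, here is a minimal, hypothetical Python sketch (plain Python, not Fabric or Dataflow APIs) of the watermark pattern that incremental refresh automates: only rows modified since the last successful load are processed, rather than a full refresh every run.

```python
from datetime import datetime

def incremental_load(rows, last_watermark):
    """Return rows changed since last_watermark, plus the new watermark.

    This is the pattern incremental refresh handles for you: filter on a
    modified-date column, process only the delta, then persist the new
    high-water mark for the next run.
    """
    changed = [r for r in rows if r["modified"] > last_watermark]
    new_watermark = max((r["modified"] for r in changed), default=last_watermark)
    return changed, new_watermark

# Hypothetical source rows with a modified-date column
rows = [
    {"id": 1, "modified": datetime(2024, 9, 1)},
    {"id": 2, "modified": datetime(2024, 9, 20)},
    {"id": 3, "modified": datetime(2024, 9, 25)},
]

# A run with a watermark of 2024-09-10 only picks up ids 2 and 3
changed, watermark = incremental_load(rows, datetime(2024, 9, 10))
```

The consumption saving comes from `changed` typically being a small fraction of the full table on each scheduled run.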

Best practice takeaways

  • Focused effort on optimising your semantic model is important - though Direct Lake can add value and performance, a bad (or unoptimised) model outside Fabric will still be a bad model in Fabric. Also, don’t use the default semantic model; create a dedicated semantic model instead
  • V-Order is primarily intended for read operations - think carefully about how you use it. The best practice advised was to utilise V-Order for “gold” layer data feeding semantic models. Use the Delta Analyser to examine specific details
  • Using Direct Lake is as important in Fabric as query folding is to PowerBI - Direct Lake can massively improve read performance. An example referenced a read of 2 billion rows taking 13 seconds on a P1 capacity via DirectQuery versus 234ms for 1 billion rows via Direct Lake. While the record counts aren’t identical, the test was designed that way because a 2 billion record read would force DirectQuery fallback
  • A metadata driven approach to building pipelines is best practice, but it’s not easy to tackle all at once, so start small and gradually expand across the organisation 
  • There’s a lot to tweak around spark optimisation (more via another blog), but one key area of discussion is around spark session startup times. Two callouts on this were to utilise starter pools and high concurrency mode. High concurrency mode enables multiple sessions to share spark sessions and can reduce startup to as little as 5 seconds
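
The "start small" metadata-driven approach above can be sketched in plain Python (an illustrative pattern, not Fabric APIs - the table names and fields are hypothetical): ingestion logic stays generic and is driven from a config table, so onboarding a new source becomes a metadata change rather than a new pipeline.

```python
# Hypothetical config table driving a generic ingestion loop:
# adding a row onboards a new source without new pipeline code.
TABLE_CONFIG = [
    {"source": "sales.orders",    "target": "bronze_orders",    "mode": "incremental"},
    {"source": "sales.customers", "target": "bronze_customers", "mode": "full"},
]

def build_copy_plan(config):
    """Turn config rows into copy instructions a pipeline loop could execute."""
    plan = []
    for entry in config:
        plan.append(f"{entry['mode'].upper()} copy {entry['source']} -> {entry['target']}")
    return plan

plan = build_copy_plan(TABLE_CONFIG)
```

In a real pipeline the loop body would invoke a copy activity or notebook per entry; the point is that the loop never changes, only the metadata.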

Lastly, a quick shoutout for a couple of food recommendations. If you’re visiting Stockholm, do try and check out Berns (Asian fusion and Sushi) and Mr. Churros!

]]>
<![CDATA[My Hopes for Feature Announcements at FabCon ‘24]]>What is FabCon 24?

Europe’s first Microsoft Fabric community conference, labelled FabCon, is kicking off on September 24th, following the first global Fabric conference and preceding the next event planned in Las Vegas in March/April 2025. As detailed on the conference website, a number of Microsoft’

]]>
https://blog.alistoops.com/my-hopes-for-the-microsoft-fabric/66f1b37634c9070001acb77eMon, 23 Sep 2024 19:04:12 GMTWhat is FabCon 24?My Hopes for Feature Announcements at FabCon ‘24

Europe’s first Microsoft Fabric community conference, labelled FabCon, is kicking off on September 24th, following the first global Fabric conference and preceding the next event planned in Las Vegas in March/April 2025. As detailed on the conference website, a number of Microsoft’s Data & AI experts, including some of the Fabric product leadership, will be in attendance.

Alongside a number of interesting speaker sessions, there is a planned session focusing on the Microsoft Fabric roadmap. On some online community forums, there has been talk around some announcements being “held back” over the last 1-2 months with the intention of providing some announcements during the conference this week.

My hopes for announcements

I thought it would be helpful to shortlist the announcements I’d most like to see but, given the number of ongoing development activities and items covered in both the roadmap and Fabric Ideas, I’ve broken it down into my top 3 hopes related to known roadmap developments, top 3 unrelated to the roadmap, and top 3 other, more out-there, hoped-for announcements.

Roadmap hopes

  • Version Control Updates - improvement in item support, especially for Dataflow Gen2, and ideally the ability to see the code that sits behind Dataflow Gen2 items to validate and version control
  • Data Exfiltration Protection - though it’s unlikely to be a killer feature for all customers, data exfiltration protection for notebooks would be important for the more security conscious users and not covered by OneSecurity (planned) or Purview
  • T-SQL improvements (e.g. MERGE, CTEs) - I think the current T-SQL functionality meets most core needs, but some specific needs aren’t met (e.g. MERGE). I also see ongoing questions around the pattern of moving on-premises data to Azure SQL before feeding it into Fabric vs. ingesting into Fabric directly and, given where Azure SQL functionality is, there’s a compelling argument to move there first
  • While I tried to stick with top 3, I have to add that the incremental load support in dataflows is another roadmap item that I would love to see more about this week
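
For readers less familiar with why MERGE keeps coming up, here is a small, hypothetical plain-Python illustration of its semantics: rows matched on the key are updated and unmatched source rows are inserted - the classic UPSERT pattern that currently needs workarounds without MERGE support.

```python
def merge(target, source, key="id"):
    """UPSERT source into target, matching on `key`.

    Mirrors T-SQL MERGE semantics: WHEN MATCHED THEN UPDATE,
    WHEN NOT MATCHED THEN INSERT.
    """
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        merged[row[key]] = dict(row)  # update if matched, insert if not
    return sorted(merged.values(), key=lambda r: r[key])

target = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
source = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]

# id 1 kept, id 2 updated to 25, id 3 inserted
result = merge(target, source)
```

Without MERGE, the same result needs a separate UPDATE plus INSERT (or a full delete-and-reload), which is exactly the friction being described.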

Unrelated to roadmap hopes

  • Workload Management (queues, prioritisation, adjustable burst behaviours) - there are lots of options for how this could be implemented, but ultimately I believe there would be a lot of value in being able to prioritise certain jobs or workspaces. It would also encourage use of larger capacities rather than splitting non-production and production workloads at the capacity level, which is currently the only way to facilitate workload segregation
  • Low-code UPSERT for data pipelines and dataflows - similar to the utility of implementing incremental load support above, being able to UPSERT via data pipelines would be a great quality of life improvement, especially considering it’s already possible in Dataverse dataflows
  • Parameters / environment variable support (for notebooks and between experiences) - I have seen a number of examples of metadata-driven pipelines already, and made use of parameters for dataflows, but being able to do this within notebooks and across experiences (between data pipelines and dataflows), as well as for deployment pipelines, would be great

Others

  • Preview features going GA - Fabric is maturing rapidly and adding lots of new features regularly, but it would be great to see some things become generally available. First of all, for longer-standing features, it would be good to see some move from preview to GA. I’d also like to understand what the cadence or path from preview to GA looks like for upcoming features, and get more visibility of when things are actually happening and what the impact is. We have the roadmap for new features, but it’s more difficult to get a clear view of how this impacts existing features or environments
  • Native mirroring for on-premises SQL Server - currently, mirroring on-premises data sources somewhat relies on setup via an Azure SQL instance. In theory this isn’t much of a blocker, but in practice, native mirroring for on-premises SQL Server would provide a much better user experience
  • Asset tagging - a few months ago, Microsoft added folder support to help organise workspace assets, and folders do their job, but the addition of tags would be more user-friendly, extending the current “certified” and “favourite” options in OneLake and hopefully providing additional options for security and governance (e.g. attaching RBAC to tags)
]]>
<![CDATA[Common Microsoft Fabric Cost Misconceptions]]>Fabric has come under its share of scrutiny since going generally available in November 2023, and much of it was or is still worth consideration. Specifically, concerns around the maturity of version control or CI/CD (in preview), some observed delays in SQL Analytics Endpoint synchronisation are often referenced, and

]]>
https://blog.alistoops.com/microsoft-fabric/66df682c34c9070001acb751Mon, 09 Sep 2024 21:47:07 GMT

Fabric has come under its share of scrutiny since going generally available in November 2023, and much of it was, or still is, worth consideration. Specifically, concerns around the maturity of version control and CI/CD (in preview), some observed delays in SQL Analytics Endpoint synchronisation, and the pricing model being subscription or capacity-based rather than purely consumption-based are probably the points I see most commonly referenced.

Though these are all fair, and hopefully being addressed, it's also worth addressing some common misconceptions, in this case associated with capacity features, cost, and performance:

  • The minimum cost for utilising PowerBI Embedded is “high”: at launch, embedding was limited to F64 capacities, but that’s no longer the case and embedding is possible with any F SKU. I’ve labelled this misconception as “high” cost as it tends to be one of two figures quoted - either the F64 capacity cost or the embedded pricing. Ultimately, this means the actual minimum cost could be as low as a couple of hundred USD per month, compared to the roughly 800 USD (EM1) or 8,000 USD (F64) often quoted
  • F64 is required for fully-featured Fabric: this is partially true in that all features are available on F64 and above capacities. Previously, smaller capacities lacked features like trusted workspace access or managed private endpoints, but all F SKUs are now mostly at feature parity. The only feature addition at F64 and above is Copilot
  • All Fabric experiences cost the same for like-to-like operations: William Crayger shared the most vivid counter-example of this that I can remember, describing a more than 95% reduction in consumption units running the same ingestion process with a Spark notebook compared to a low-code pipeline. I haven’t seen quite such dramatic results in my experience, but I have observed, for specific activities, up to a 75% cost reduction. That is to say, not all experiences will result in the same consumption
  • Capacity performance is better for larger capacities: some small differences can be observed due to smoothing & bursting but, as outlined here by Reitse, even comparing the smallest F2 capacity to F64, performance is largely the same
  • Fabric is "more expensive" than expected, or than PowerBI at scale: as with everything, this depends on many factors but, in terms of perception, it's worth bearing in mind that the F64 capacity cost (when reserved) is equivalent to a PowerBI Premium capacity at around $5k per month
  • The F64 SKU cost, including free PowerBI access for report consumers, is best fit for 800 or more users: this comes from the fact that you would otherwise be paying for around 800 (x 10 USD) PowerBI licenses for report viewers. However, it doesn’t factor in capacity reservation savings (~40%) nor any licensing covered through enterprise E5 licenses. In real terms, the driving factor for capacity selection needs to be predominantly data requirements but, considering purely licensing costs for report viewers, the crossover point will vary: it could be higher (if E5 licenses are considered) or lower (more like 500 if the capacity is reserved)
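
The crossover arithmetic in that last point is easy to sketch, using the rough figures from this post (~8,000 USD/month pay-as-you-go F64, ~5,000 USD/month reserved, ~10 USD/user/month for a per-user license) - real prices vary by region and agreement:

```python
import math

def breakeven_users(capacity_cost_per_month, license_cost_per_user=10):
    """Users at which a capacity with free report viewing beats per-user licenses."""
    return math.ceil(capacity_cost_per_month / license_cost_per_user)

payg = breakeven_users(8000)      # pay-as-you-go F64: ~800 users
reserved = breakeven_users(5000)  # reserved F64: ~500 users
```

This is purely the report-viewer licensing angle; as noted above, data requirements should dominate the actual capacity decision.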

Now, this isn’t to say that additional considerations such as portability, version control, or workload management and prioritisation aren’t worthwhile, but I think it’s good to see barriers to entry being removed for new users and a mostly consistent experience across all capacities.

]]>
<![CDATA[Microsoft Fabric Workspace Structure and Medallion Architecture Recommendations]]>Intro & Context

I’ll start by saying that I won’t be discussing the merits of the medallion architecture nor the discussion around building a literal medallion vs. structuring against the requirements for your domain, and using medallion as a conceptual talking point, in any

]]>
https://blog.alistoops.com/microsoft-fabric-workspace-structure-and-medallion-architecture/6696aa2934c9070001acb746Wed, 17 Jul 2024 13:37:28 GMTIntro & ContextMicrosoft Fabric Workspace Structure and Medallion Architecture Recommendations

I’ll start by saying that I won’t be discussing in any great detail the merits of the medallion architecture, nor the debate around building a literal medallion vs. structuring against the requirements for your domain and using medallion as a conceptual talking point (I think this is a good video talking through the latter). We will just assume you’re following the guidance Microsoft publish describing the usual bronze, silver, and gold zones or layers.

I’ve seen a number of conversations online, namely on Fabric community blogs, Reddit, and Stack Overflow, that primarily focus on whether implementing the medallion architecture means we should have one workspace per domain (e.g. sales, HR) covering all layers, or a workspace per layer (bronze, silver, gold) per domain. While these threads often end with a single answer, I don’t think there is one right answer, and I also think some nuance is missed in the conversation. So what I’m going to cover here is guidance around the implications of Fabric workspace structure, as well as recommendations for a starting point. It’s also worth noting that this focuses primarily on lakehouse implementations.

Key Design Implications

Before sharing any recommendations, I want to get the “it depends” out of the way early. The “right” answer will always depend on the context of the environment and domain(s) in which you’re operating.

This is just a starting point and a way to break down some key decision areas. There are lots of good resources and examples of people sharing their experiences online, but the reason this blog exists is that, more often than not, these represent the most straightforward examples (single source systems or 1-2 user profiles) where following Microsoft’s demos and Learn materials is enough. As far as considering the implications of your workspace structure in Microsoft Fabric, I would suggest key areas include:

  • Administration & Governance (who’s responsible for what, user personas)
  • Security and Access Control (data sensitivity requirements, principle of least privilege)
  • Data Quality checks, consistency, and lineage
  • Capacity (SKU) features and requirements (F64-only features, isolating workloads)
  • Users and skillsets
  • Version control & elevation processes (naming conventions, keeping layers in sync)

Potential High Level Options

Microsoft Fabric Workspace Structure and Medallion Architecture Recommendations
Option 1 (B)
  1. 1 workspace per layer per domain. This is recommended by Microsoft (see “Deployment Model”) as it provides more control and governance. However, it does mean the number of workspaces can grow very quickly, and operational management (i.e. who owns the pipelines, data contracts, lineage from the source systems) needs to be carefully considered
  2. 1 landing zone or bronze workspace then 1 workspace per domain for silver and gold layers. This could slightly reduce the number of workspaces by simply landing raw data in one place centrally but still maintaining separation in governance / management
  3. One workspace per domain covering all layers. Though this is against the suggestion in Microsoft documentation, it is the most straightforward option, and for simple use cases where there are no specific constraints around governance and access, or where users will be operating across all layers, this could still be suitable
Option 2 (C) - not recommended

There’s also an additional decision point in terms of the approach to implementing the above for managing your bronze or raw data layer:

  • A) Duplicate data - where multiple domains use the same data, simply duplicate the pipelines and data in each domain. I think this is mostly a case of ownership or stewardship and your preferred operating model given the cost of storage is relatively low
  • B) Land data in a single bronze layer only, then use shortcuts - though the aim would be to utilise shortcuts to pull in data where possible, here I am specifically talking about agreeing that where raw data is used in multiple domains, you could land it in one bronze layer (say domain 1) then shortcut access to that in subsequent domains (domain 2, 3, etc.) to avoid duplication
  • C) Use Cloud object storage for Bronze rather than Fabric Workspaces - this is an interesting one, and I think it only really applies if you plan to go with the core option 2 where you’re looking to have a centralised bronze layer. In this case you could do it in a Fabric workspace, or you could have a bronze layer cloud object store (ADLS gen2, s3, Google Cloud Storage, etc.). I think the only potential reason to consider this is to manage granular permissions for the data store outside of Fabric. In real terms, I would rule this out completely and instead consider (B)
Option 3

With the above in mind, you can see how there are a number of options, where the potential number of workspaces could be as large as the number of domains (d) multiplied by both the number of layers (3) and the number of environments (e), and as small as the number of domains multiplied by the number of environments - for a dev, test, and prod setup, that’s d x 9 at one end and d x 3 at the other. For the workspace counts in each image, note that each would need to be multiplied by the number of environments (usually at least 3).
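
That arithmetic can be written out directly (a trivial sketch, with option names matching this post, not any Fabric terminology):

```python
def workspace_count(domains, environments, layers=3, per_layer=True):
    """Workspaces needed: per-layer-per-domain (option 1) vs one-per-domain (option 3)."""
    return domains * environments * (layers if per_layer else 1)

# 3 domains across dev/test/prod (3 environments):
option_1 = workspace_count(3, 3, per_layer=True)   # workspace per layer per domain
option_3 = workspace_count(3, 3, per_layer=False)  # one workspace per domain
```

With 3 domains and 3 environments, option 1 yields 27 workspaces against option 3's 9 - which is exactly why administration at scale drives this decision.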

It’s also worth noting, the above list isn’t exhaustive. There are other options such as considering a monolithic (one workspace for all layers) approach for some federated teams, but segregated workspaces for centrally managed data, or using monolithic for medallion layers as in option 3 (think, platform or data workspace) and a separate workspace for reporting. This is all about targeting a starting point.

Options Overview (1A, 1B top, 2A, 2B middle, 3A bottom)

Sample Scenario

You might start to see why straightforward examples - where an individual is setting up their Fabric workspace(s) for a single domain or team, with limited risk of duplication and clear requirements around access controls - result in an obvious structure. However, what does this mean when we begin to scale across multiple domains in an organisation and a larger number of Fabric users?

For the purpose of considering any options, I’m going to make some assumptions for a sample scenario. We’re going to consider a fictional organisation that is rolling out Fabric across 3 domains initially; Sales, Marketing, and HR. Sales and Marketing will utilise at least one of the same source datasets, and there are no clear security or access controls between layers, but administration and governance must be segregated by domain. The organisation must keep separation between prod and non-prod workloads, and there will be dev, test, and prod environments. 

In the sample scenario, there are a number of options, but we could recommend, for example:

  • For each domain, utilise a workspace per environment (dev/test/prod)
  • For each domain, utilise a single workspace consolidating medallion layers (bronze / raw to gold / ready for analytical use)

While this doesn’t seem unreasonable, I will admit that it’s not particularly realistic. In most cases, I would expect there to be a preference for better governance and control around access to raw (bronze) data. In that case, while much of the above holds, I would expect the last point to be fundamentally different. Rather than moving to the other end of the spectrum and creating individual workspaces for all layers, domains, and environments, a real-world example I worked through previously, reasonably similar to the description above, went with a single bronze layer landing zone (option 2B).

Recommendations

  • There isn’t a single “right” answer here, so discussing the trade-offs will only get you so far. I would suggest picking your best or clearest use case(s), reviewing your high level requirements, building to a proposed or agreed approach, and then figuring out whether specific issues need to be addressed
  • In general, I would recommend using shortcuts to minimise data duplication both across source storage and Fabric and across Fabric workspaces (described in option B). I really think it’s the best way to operate
  • Start by testing the assumption that it makes sense to use one workspace covering all layers of the medallion (option 3). While I think this will only make sense in practice with some adjustment (e.g. splitting data and reporting), and Microsoft recommend a workspace per layer, this is the biggest influencing factor in terms of affecting administration at scale
  • If you need to segregate medallion layers into individual workspaces, I would propose starting with option 1 (B)
  • Where possible, utilise different capacities for each environment, or at least for production, to make sure production workloads aren’t affected by dev or test jobs. Though not specifically a workspace recommendation, this has come up in every Fabric capacity planning scoping I’ve been part of to date. In the sample scenario, that would mean two capacities - one for dev and test environments and one for prod - compromising between flexibility and management overhead. This would make the number of workspaces 3 times the number of domains rather than 9 times. This may change if there is more flexible workload management in future Fabric updates, and smoothing will help those running a single capacity, but it’s not currently possible to isolate workloads without multiple capacities
  • There are also a couple of item level recommendations I would consider:
    • Semantic models should primarily be implemented in the Gold layer - these are really to facilitate end use of modelled data. Adding semantic models in all layers could just add to the complexity of your Fabric estate
    • It’s likely that the design pattern of utilising Lakehouses only or Warehouses only (only meaning from bronze to gold) will be common. However, it’s worth considering the different permutations against your needs. In my experience, a good starting point is using lakehouses for bronze and silver, and warehouses in gold (see here for some key differences)
  • If you use a warehouse in the gold layer, look to utilise tables where possible and avoid views. Views will disable Direct Lake (or, rather, cause fallback behaviour), resulting in poorer performance

What Next?

I appreciate this has been focused on creating the starting point, and I just wanted to add some personal opinions on how I’ve seen this work effectively. First of all, I think the decision around environment setup has a bigger effect than the trade-off between a higher or lower number of workspaces. What I mean by that is that without environments separated by a workspace boundary, utilising deployment pipelines and version control is either difficult or impossible, so it’s crucial to have a different workspace per environment.

Next, I believe that the key driver for creating workspaces per medallion layer is data governance and access controls. For me, the most logical way to balance that with the administration overhead is to use option 3 (the monolithic approach) and add an additional “reporting” workspace for each domain, allowing governance and access control between source data and consumed reports without having a massive number of workspaces to manage.

Option…4?
]]>
<![CDATA[Microsoft Fabric Roundup - Build 2024]]>Summary

The intent from Microsoft during Build 2024 was quite clear where Fabric was concerned; show the significant and ongoing progress since the General Availability in late 2023, land the message around Fabric’s ease of access, and focus on the SaaS nature of the product by demonstrating a

]]>
https://blog.alistoops.com/microsoft-fabric-roundup-build-2024/665724103f0d860001248b1aWed, 29 May 2024 13:06:08 GMTSummaryMicrosoft Fabric Roundup - Build 2024

The intent from Microsoft during Build 2024 was quite clear where Fabric was concerned: show the significant and ongoing progress since general availability in late 2023, land the message around Fabric’s ease of access, and focus on the SaaS nature of the product by demonstrating a “single pane of glass” view of all your data and analytical workloads. To that effect, I think there was a lot covered, and so much potential or opportunity, but there is no doubt that the devil is in the details, which can only be properly understood through hands-on development.

Intro

Microsoft Build 2024 ran from May 21-23 in Seattle and online, and there was a significant focus on all things data and AI, as I mentioned in last week’s blog about the keynote session. Since all the sessions I was tracking have been made available on-demand, and there were so many Fabric updates and announcements, I wanted to share the relevant talks and highlight a few key updates, as well as relevant opinions.

Here are the various Fabric-specific sessions from MS Build that inform the rest of this blog post:

Market & Roadmap Update

Microsoft shared that they now have 11,000 customers using Fabric and over 350,000 customers using PowerBI, including 97% of Fortune 500 organisations. Though seeing an update on the growth of the customer base is interesting, the thing I would call out is that Microsoft would presumably target at least the same audience for Fabric as currently uses PowerBI. In fact, they made a direct comparison of customer growth since launch between the two products.

An update to the Fabric roadmap was announced. There’s far too much to dissect here, but it’s worth having a browse as there were a few interesting items such as data exfiltration protection.

Key Announcements

I’ll note that this is heavily subjective, but I’ve highlighted what I see as the “bigger” announcements, either based on potential impact or Microsoft’s messaging.

  1. Snowflake interoperability and Apache Iceberg compatibility. Though much of the focus to date has been on delta tables in Fabric, Microsoft announced Iceberg compatibility during Build, which is great in terms of extending to an additional open table format and increasing flexibility, but the main thing to call out here is interoperability with Snowflake. Microsoft talked about and demonstrated seamless two-way integration with Snowflake. This seems awesome and, alongside other announcements such as mirroring, will massively reduce the overhead of managing multiple data sources within Fabric. However, a word of warning: consider the cost implications depending on storage and compute, as these would likely be split across two products / services
  2. Fabric real-time intelligence. Aside from the nuts and bolts of this announcement, including integration with sources like Kinesis and Fabric Workspace Events in real time, it’s worth calling out that real-time event data will be a Fabric use case of interest to a wide range of users, and something whose absence would potentially have ruled Fabric out as a primary analytics platform before now. Microsoft had a representative from McLaren Racing on stage to discuss how they were using real-time intelligence to ingest sensor data, which, as an F1 fan, I found really interesting. I’m sure I will have a lot of questions as I investigate this in more detail, but an obvious one is how cost compares to batch workloads, and how performance compares to other options in this space
  3. AI Skills (see around halfway down this page). Released in preview, this is essentially a custom Copilot type capability for business users that utilises Generative AI to convert natural language to SQL, enabling rapid Q&A of Fabric data sources and straightforward prompt engineering. I think an important consideration here is where this sits in terms of your overall user experience or use case, in the sense that more regularly asked questions would be surfaced through PowerBI reports. Perhaps this would have a similar place to Athena in AWS for ad-hoc queries, where more regular questions are answered via QuickSight. Nonetheless, this seems interesting in terms of time to value and Fabric GenAI integration.
  4. Fabric Workload Development for ISVs. This was one I didn’t see coming at all. Microsoft announced the availability (in preview) of the workload development SDK and shared a demo walking through how developers can create offerings via the Azure Marketplace, allowing Fabric customers to utilise their solutions without leaving their Fabric environment. Some workloads mentioned include Informatica, neo4j, SAS, Quantexa, Profisee MDM, and Teradata. One really interesting thing here is that Microsoft have essentially opened up their React front-end so that workloads can have a “Fabric look and feel” for customers. I’m looking forward to getting hands-on with the SDK, and to seeing how customers can accurately estimate pricing with ISV workloads.
Microsoft Fabric Roundup - Build 2024
Fabric Workload Development Architecture. Source: Microsoft

Other Announcements 

It’s difficult to have complete coverage across all the news shared, but there was so much covered through these sessions that I’ve listed below a number of additional things that piqued my interest:

  • External data sharing (cross tenant)
  • Shortcuts to on-premises sources for OneLake (search here for “lake-centric”) available in public preview (relies on data in open format, so mirroring of external databases was also brought in) - mirroring is free for storage of replicas and minimises duplication, but we need to be careful around managing cost as it would, presumably, mean egress and compute costs from the source provider
  • New real-time intelligence Fabric MSLearn module
  • Azure Databricks and Fabric integration - soon you will be able to access Azure Databricks Unity Catalog tables directly in Fabric
  • Anomaly detection preview 
  • Fabric User Data Functions & functions hub - think about Azure functions but tightly integrated with Fabric. One example discussed was adding forecast projections to warehouse data, which is a common use case in PowerBI that would otherwise require some ETL-specific coding
  • VSCode Microsoft Fabric Extension enabling workspace integration, local debugging, and user data functions
  • Copilot for Data Warehouse - noting this produces the query but doesn’t run it so allows human in the loop adjustment and reduces unnecessary CU consumption. This also includes IDE-style (like GitHub copilot) autocomplete
  • Fabric API for GraphQL
Microsoft Fabric Roundup - Build 2024
Source: Microsoft

Of course these will only be relevant for those reading shortly after publishing, but I wanted to share links and MS Forms in case people are looking for sign up links to public previews or a Microsoft Fabric trial after the event. 

]]>
<![CDATA[Microsoft Build 2024 - Keynote Key Takeaways]]>Microsoft Build 2024 is running from May 21-23 in Seattle and Online. Over the weekend, I will look to distil something more detailed around the Fabric and/or AI sessions from Build, but following a Keynote session that lasted over 2 hours, I thought I’d share my top

]]>
https://blog.alistoops.com/microsoft-build-2024-keynote-key-takeaways/664f93723f0d860001248b01Thu, 23 May 2024 19:25:51 GMT

Microsoft Build 2024 is running from May 21-23 in Seattle and Online. Over the weekend, I will look to distil something more detailed around the Fabric and/or AI sessions from Build, but following a Keynote session that lasted over 2 hours, I thought I’d share my top 5 takeaways below. For those interested, the full Day 1 is available here, and Microsoft also posted a 90 second recap.

Key Takeaways

  • Copilots everywhere! Unsurprisingly, Copilot was front and centre for a significant portion of the keynote, and there’s a lot to unpack. Outside the announcements, Satya also equated the Copilot runtime to a fundamental shift similar to what Win32 did for GUIs, and shared that nearly 60% of Fortune 500 companies are using Copilot, as well as calling out around half a dozen organisations with over 10k seats. I’ve included links to the announcements below, but I’m particularly interested to use Team Copilot
  • GPT-4o is Generally Available in Azure OpenAI (demo). LLMs and SLMs (namely GPT-4o and Phi-3) were peppered throughout the keynote, including a number of product launches and case studies (link). Recurring themes seemed to be optimisation and efficiency in either cost or performance, and multi-modality across image, text, and speech. It’s also worth calling out a moment where OpenAI CEO Sam Altman made a point of referencing AI as an enabler. I’ll take any opportunity to reiterate the importance of focusing on providing value through solving business problems (customer / user first), not using specific technology “just because”
  • Khan Academy announced a partnership with Microsoft focused on utilising AI to support educators, and Sal Khan highlighted that teaching will be an area that will see real change through the use of technology, something that I was interested to see during the keynote having recently been involved in a STEM Learning roundtable with many industry and education leaders on exactly this topic. A big part of the Khan Academy presentation was around the intention to make Khanmigo available to all teachers in the US for free 
  • Fabric Real Time Intelligence is in preview - I’m excited to see more detail on this, but integrating with sources like Kinesis, Blob Storage Events, Kafka, CDC events from Cosmos DB, and Fabric Workspace Events in real time will be critical to a number of prospective and existing Fabric customers, and opens up a number of new use cases
  • Continued investment in AI infrastructure through AMD MI300X Instinct accelerators and Cobalt VMs

I would also add that it was fantastic to see Kevin Scott (Microsoft CTO) on stage, he was a wonderful addition, and his personal anecdotes around the use of technology and AI having the power to enable real change in fields of medicine and education were poignant reminders of why I love working in this space.

As someone who’s played video games my whole life, there was one added bonus - Copilot might finally help me understand the world of Minecraft!

Microsoft Build 2024 - Keynote Key Takeaways
]]>
<![CDATA[MS Fabric Copilot - Recommendations and Pricing Considerations]]>TLDR

Ultimately, Fabric Copilot looks to be a really simple way to integrate the use of OpenAI services into a developers workflow, with one single method of billing and usage monitoring through your Fabric SKU. There are some assumptions that will need to be made and tested against when it

]]>
https://blog.alistoops.com/ms-fabric-copilot-recommendations-and-pricing-considerations/663666c43f0d860001248af1Sat, 04 May 2024 16:50:18 GMTTLDRMS Fabric Copilot - Recommendations and Pricing Considerations

Ultimately, Fabric Copilot looks to be a really simple way to integrate the use of OpenAI services into a developers workflow, with one single method of billing and usage monitoring through your Fabric SKU. There are some assumptions that will need to be made and tested against when it comes to accurately baselining cost, and certainly some considerations around highest value use cases, but the cost model is appealing and I consider Fabric Copilot worth using or at least trialling / assessing for your use cases, with appropriate considerations.

As with Github Copilot, Amazon CodeWhisperer, and other tools in this space, I think the focus should be on accelerating development and shifting focus of skilled developers to more complex or higher value tasks.

Context

In March, Ruixin Xu shared a community blog detailing a Fabric Copilot pricing example, building on the February announcement of Fabric Copilot pricing. It’s exceptionally useful, detailing how Fabric Copilot works and the maths behind calculating consumption through an end-to-end example. I’m aiming to minimise any regurgitation here, but I’m keen to add my view on key considerations for rolling out the use of Fabric Copilot. For the purpose of this blog, I will focus on cost and value considerations, with any views on technical application and accuracy to be compiled in a follow up blog.

MS Fabric Copilot - Recommendations and Pricing Considerations
Source: https://blog.fabric.microsoft.com/en-us/blog/fabric-copilot-pricing-an-end-to-end-example-2/

Enabling Fabric Copilot

First, it’s worth sharing some “mechanical” considerations:

  • Copilot in Fabric is limited to F64 or higher capacities. These start at $11.52/hour or $8,409.60/month, excluding storage cost in OneLake and pay-as-you-go
  • Copilot needs to be enabled at the tenant level
  • AI services in Fabric are in preview currently
  • If your tenant or capacity is outside the US or France, Copilot is disabled by default. From MS docs - “Azure OpenAI Service is powered by large language models that are currently only deployed to US datacenters (East US, East US2, South Central US, and West US) and EU datacenter (France Central). If your data is outside the US or EU, the feature is disabled by default unless your tenant admin enables Data sent to Azure OpenAI can be processed outside your capacity's geographic region, compliance boundary, or national cloud instance tenant setting”
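As a quick sanity check on the first bullet above, the monthly figure follows directly from the hourly rate. A minimal sketch (the 730-hour month is Azure's standard pricing convention, and the rate is as quoted at the time of writing):

```python
# Illustrative arithmetic only: F64 pay-as-you-go pricing as quoted above.
# Assumes Azure's standard 730-hour pricing month.
hourly_rate_usd = 11.52
hours_per_month = 730

monthly_cost = hourly_rate_usd * hours_per_month
print(f"F64 pay-as-you-go: ${monthly_cost:,.2f}/month")
```

This matches the $8,409.60/month figure quoted above, before OneLake storage costs.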
MS Fabric Copilot - Recommendations and Pricing Considerations
Source: microsoft.com (https://learn.microsoft.com/en-us/fabric/get-started/copilot-fabric-consumption)

Recommendations

  1. My first recommendation is to “give it a go” yourself. Based on how Fabric Copilot pricing works at the time of writing, I can see clear value in developers utilising Fabric Copilot - even for a simple example that optimistically might only take 30 minutes of development, a $0.63 cost feels pretty hard to beat. It’s hard to imagine Fabric Copilot costing more than a developer’s time.
  2. Consider who actually needs or benefits from access to Copilot. From what I’ve seen so far, the primary use case is around accelerating development for analysts and engineers, so those consuming reports might see much less value in comparison. Personally, I would also recommend any outputs are appropriately tested and reviewed by someone with the capability of building or developing without Copilot (for now, at least).
  3. Unsurprisingly, it feels as though Copilot could really accelerate analytics and engineering development, but I think it’s crucial that organisations considering adoption roll out the use of Copilot in stages, starting with small user groups. This is for two reasons: it supports the next recommendation on my list, and it helps in managing and monitoring resources at a smaller scale before considering the impact on your capacity.
  4. I think it will be important to build out your internal best practice guidance / knowledge base. For example, in Ruixin’s blog, the CU consumption for creating a dataflow to keep European customers was about 50% more than what was required for creating a PowerBI page around customer distribution by geography. In my opinion, the benefit in terms of time saved is larger in the PowerBI use case than a simple dataflow column filter. In this example, I would also suggest that it’s best practice to use Copilot to generate a starting point for PowerBI reports, rather than something ready for publication / consumption. As is the case in many applications of Generative AI, there’s likely additional value in standardising Fabric Copilot inputs in terms of more consistent costs, so developing a prompt library could be useful.
  5. You need to plan for expected CU consumption based on users and capacity SKU. Admittedly this seems obvious and is likely only a potential issue at a scale where multiple people are utilising Fabric Copilot all at once against the same capacity. For context, although organisations with more than 20 developers may be on a larger capacity than F64, a team of 20 developers submitting Fabric Copilot queries of a similar cost to those detailed in the blog (copied below) would consume more than 100% (20 * 12,666.8 = 253,336 CU seconds) of available F64 CU capacity (64 * 3600 = 230,400 CU seconds) in a given hour. Admittedly, this needs to be considered over a 24-hour period, and it’s unlikely that parallel requests will be submitted outside standard working hours, but it should be evaluated alongside other processes such as pipelines and data refreshes.
  6. Though I believe much of the Microsoft guidance around data modelling for using copilot is generally considered good practice for PowerBI modelling, I would recommend assessing your data model and adapting in line with the Microsoft guidance in order to maximise effectiveness of utilising Fabric (or PowerBI) Copilot.
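The capacity arithmetic in recommendation 5 can be sketched as follows. This is illustrative only: the per-request cost of 12,666.8 CU seconds is the figure from Ruixin's worked example, and real workloads will vary:

```python
# Hypothetical capacity check: could 20 parallel Copilot requests exceed
# an F64 capacity's throughput within a single hour?
cu_seconds_per_request = 12_666.8   # assumed, from the worked example cited above
developers = 20                     # parallel requests in the same hour

demand = developers * cu_seconds_per_request   # 253,336 CU seconds demanded
f64_capacity = 64 * 3600                       # 230,400 CU seconds available per hour

print(f"Demand: {demand:,.0f} CU(s) vs capacity: {f64_capacity:,} CU(s)")
print(f"Utilisation: {demand / f64_capacity:.0%}")
```

As the recommendation notes, smoothing spreads this over a 24-hour window in practice, but the sketch shows why parallel Copilot usage should be planned against your SKU rather than assumed to fit.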
MS Fabric Copilot - Recommendations and Pricing Considerations
Source: https://learn.microsoft.com/en-us/power-bi/create-reports/copilot-create-report-service
]]>
<![CDATA[Microsoft Fabric Analytics Engineer Associate (DP-600) Certification - My Experience & Tips]]>TLDR

The Fabric Analytics Engineer Associate certification is intended to cover almost all generally available functionality of Fabric at the time of writing. That said, it’s not surprising that it covers lots of modules / collateral that exists in other data engineer and data analyst learning, and I expect

]]>
https://blog.alistoops.com/microsoft-fabric-analytics-engineer-associate-dp-600-certification-my-experience-tips/6632087f3f0d860001248ad0Wed, 01 May 2024 09:18:29 GMTTLDRMicrosoft Fabric Analytics Engineer Associate (DP-600) Certification - My Experience & Tips

The Fabric Analytics Engineer Associate certification is intended to cover almost all generally available functionality of Fabric at the time of writing. That said, it’s not surprising that it covers lots of modules / collateral that exists in other data engineer and data analyst learning, and I expect the material in the exam will change as the product matures and additional functionality is added - this will also be interesting to review over time. It’s currently quite PowerBI focused, and I don’t expect that to change a great deal.

While it’s easy to naturally question where an “Analytics Engineer” sits in terms of role alignment, given that data analyst and data engineer roles are more common in the industry than analytics engineers, I do feel that it covers enough topics across both areas to live up to the name. Any potential candidates will need in-depth knowledge of SQL, Spark, and DAX, as well as of developing Fabric analytics workloads and managing or administrating Fabric capacities. Some topics are covered with a purely Fabric lens (e.g. version control, deployment best practices, and orchestration), which is totally reasonable, but it’s worth potential candidates considering broader foundational analytics engineering training on topics that aren’t covered in the core DP-600 learning.

I think that any data practitioners interested in, or needing to prepare for, working with Fabric would benefit from undertaking the DP-600 learning and exam.

What is the Analytics Engineer Certification and Who is it Aimed at

Microsoft Fabric Analytics Engineer Associate (DP-600) Certification - My Experience & Tips
Fabric Capabilities and Components

Before getting into the details, it’s worth mentioning that the Fabric Analytics Engineer Associate Certification follows a similar structure to other Azure certifications in that the name of the certification and the name of the exam are different, but ultimately undertaking the DP-600 exam grants the certification. As such, I will use the two terms synonymously.

The exam details do a good job of describing the experience needed, and skills measured.

Experience Needed:

  • Data modeling
  • Data transformation
  • Git-based source control
  • Exploratory analytics
  • Languages, including Structured Query Language (SQL), Data Analysis Expressions (DAX), and PySpark

Skills measured:

  • Plan, implement, and manage a solution for data analytics (10–15%)
  • Prepare and serve data (40–45%)
  • Implement and manage semantic models (20–25%)
  • Explore and analyze data (20–25%)

To anyone already operating in the domain, many of the skills and experience areas assessed won’t be surprising. I think much of the material will feel familiar to those working as data practitioners in Azure, especially PowerBI experts, but it’s relevant to consider that most candidates will feel stronger in, or more closely aligned to, either the analytics or the engineering topics. The interesting thing here is that I don’t think it’s immediately obvious whether a more analytics-focused or engineering-focused background would be beneficial, but at first glance I would suggest this reads as slightly more focused on analytics. I will add my views and explain specifics in later recommendations but, ultimately, I feel the exam is aimed at people who want a rounded view of managing Fabric capacities and building and deploying Fabric workloads, whatever their title may be.

Exam Preparation

For context, I took the DP-600 in early April 2024, and given it was only in beta in early 2024 there aren’t many high quality courses from reputable trainers. I used Microsoft Learn’s self-paced learning (available here) and practice assessment. 

I’ve since seen a number of Reddit posts (r/MicrosoftFabric) and Youtube videos (from users such as Guy in a Cube) that look promising, and I’m sure popular trainers in the Azure space such as John Cavill and James Lee will have DP-600 training content in future. 

There are a couple of additions worth pointing out here. Though I was confident in my SQL, PySpark, and PowerBI skills, if you’re new to the domain or feel there’s room for improvement in one of those areas, it would be worthwhile considering resources such as Udemy or Coursera SQL courses, Cloudera or Databricks PySpark training, MSLearn PowerBI training, or your favourite training provider to close any gaps. It’s worth noting that the exam expects a level of SQL and PySpark knowledge beyond what is covered in the MSLearn labs.

My Experience & Recommendations

Before sharing focus areas for learning, and key topics from the exam, there are a few mechanical pieces to call out:

  • Do the practice assessment Microsoft offer (on the DP-600 page linked above) - none of the questions here came up on the exam, but I felt they were exactly the right level of difficulty, with a good spread of topic areas to assess readiness prior to booking the exam. I did experience the webpage crashing and not remembering my progress (I’m assuming it’s a randomised question set / structure), so I would suggest doing this in one sitting
  • Do all the labs / hands-on demos - it’s very easy to want to move past these, as much of the text repeats the theoretical material covered during the learning modules, but there really is no substitute for hands-on experience. I would also note that the only topics I struggled with in the practice assessment were ones I had not done the lab for
  • Get used to navigating MSLearn - you can open an MSLearn window during the exam. It’s not quite “open book” and it’s definitely trickier to navigate MSLearn using only the search bar rather than search engine optimised results, but effectively navigating MSLearn means not always needing to remember the finest intricate details. That said, it is time consuming, so I aimed to use it sparingly and only when I knew where I could find the answer quickly during the exam. Note that this won’t help for most PySpark or SQL syntax / code questions
  • Though it’s not a pre-requisite, I would firmly recommend undertaking the PL-300 (PowerBI Data Analyst Associate) exam before DP-600. So much of the DP-600 learning is based on understanding DAX, PowerBI semantic modelling and best practices, and PowerBI external tools (Tabular Editor, DAX Studio, ALM). Those who have undertaken PL-300 will have a much easier time with the DP-600 exam.

As for topic areas covered during the exam:

  • Intermediate to advanced SQL knowledge - most SQL questions rely on knowledge that isn’t covered via MSLearn beyond a few examples. These include understanding SQL joins, where, group by and having clauses, CTEs, partitioning (row number, rank, dense rank), LEAD and LAG, and date/time functions
  • PowerBI - in addition to the above pieces noted in relation to PL-300, I would also call out creating measures, using VAR and SWITCH, Loading data to PowerBI, using time intelligence functions, using data profiling tools in PowerBI, and using non-active relationships (functions such as USERELATIONSHIP). Managing datasets using XMLA endpoints and general model security are also key topic areas
  • Beginner to intermediate PySpark (as well as Jupyter Notebooks and Delta Lake tables) - in all honesty, I think the more hands-on PySpark experience you have, the better. But I was only specifically asked about things like the correct syntax for reading / writing data to tables, profiling dataframes, filtering and summarising dataframes, and utilising visualisation libraries (matplotlib, plotly)
  • Fabric Shortcuts - I think this was relatively clear in the learning for basic examples, but it’s worth understanding how this would work in more complex scenarios such as multi-workspace or cross-account examples (same applies to Warehouses), and how shortcuts relate to the underlying data (e.g. what happens when a shortcut is deleted)
  • Core data engineering concepts such as Change Data Capture (CDC), batch vs. real-time / streaming data, query optimisation, data flows and orchestration, indexes, and data security [Row Level Security (RLS) and Role-Based Access Control (RBAC)]
  • Storage optimisation techniques such as Optimize and Vacuum 
  • Fabric licenses and capabilities (see here)
  • Source control best practices for Fabric and PowerBI
  • Knowledge of carrying out Fabric admin tasks - for example, understanding where capacity increases occur (Azure portal), where XMLA endpoints are disabled (Tenant settings), and enabling high concurrency (Workspace settings)
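To make the SQL bullet above concrete, here is a small, hypothetical example of the kind of constructs the exam expects (CTEs, GROUP BY / HAVING, and window functions such as LAG). SQLite is used purely so the snippet runs anywhere; the exam itself targets T-SQL in Fabric warehouses, and the table and data here are invented:

```python
import sqlite3

# Invented sample data; window functions need SQLite 3.25+,
# which is bundled with recent Python builds.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, month INTEGER, amount REAL);
    INSERT INTO sales VALUES
        ('EU', 1, 100), ('EU', 2, 150), ('EU', 3, 120),
        ('US', 1, 200), ('US', 2, 180);
""")

# CTE + GROUP BY / HAVING: regions whose total sales exceed 300
totals_query = """
    WITH totals AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
        HAVING SUM(amount) > 300
    )
    SELECT region, total FROM totals ORDER BY region;
"""
print(conn.execute(totals_query).fetchall())

# LAG window function: month-over-month change per region
lag_query = """
    SELECT region, month, amount,
           amount - LAG(amount) OVER (
               PARTITION BY region ORDER BY month
           ) AS delta
    FROM sales
    ORDER BY region, month;
"""
for row in conn.execute(lag_query):
    print(row)
```

The same patterns extend to ROW_NUMBER, RANK, DENSE_RANK, and LEAD, which are equally worth practising before the exam.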

What next?

I have no certifications planned for now, but I plan to build something around MS Purview, similar to what I did with AWS DataZone. I will plan over the next few months what the next certification step is, but given the recent additions to AI-102 (AI Engineer Associate) related to OpenAI, I will likely start there.


]]>
<![CDATA[AWS Data Engineer Associate and Data Analytics Specialty Certifications Compared]]>Intro and Context

Let me start by saying that I usually wouldn’t compare two AWS certifications beyond considering common services (e.g. how much of my exam prep for X will have helped for certification Y) as I think AWS do a good in logically separating the certifications

]]>
https://blog.alistoops.com/aws-data-engineer-associate-and-data-analytics-specialty-certifications-compared/6595edaa3c1634000146e0f0Mon, 29 Jan 2024 22:39:36 GMTIntro and ContextAWS Data Engineer Associate and Data Analytics Specialty Certifications Compared

Let me start by saying that I usually wouldn’t compare two AWS certifications beyond considering common services (e.g. how much of my exam prep for X will have helped for certification Y), as I think AWS do a good job of logically separating the certifications as long as you understand the differences between practitioner, associate, professional, and specialty levels. I also typically take each learning experience independently and try to cover all material even if it’s part of a previous certification.

However, alongside the announcement of the new Certified Data Engineer Associate (DEA-C01) certification in November 2023, AWS announced they will be retiring AWS Certified Data Analytics - Specialty (DAS-C01) in April 2024, stating “This retirement is part of a broader effort to focus on certifications that align more closely to key cloud job roles, such as data engineer.”

Having undertaken the Certified Data Engineer Associate beta exam in December, I thought it would be worth sharing my views on comparing the two certifications.

Exam Guide Comparison

Reading through the exam guide for DEA-C01, it felt familiar, and I don’t think this is just due to domain experience. The reason it felt familiar to the Data Analytics Specialty (DAS-C01) content was evident when looking at the list of services, but even at a higher level of certification domain you can see there would be a large degree of overlap.

Domain Number | Data Analytics Specialty     | Data Engineer Associate
------------- | ---------------------------- | ---------------------------------
1             | Collection                   | Data Ingestion and Transformation
2             | Storage and Data Management  | Data Store Management
3             | Processing                   | Data Operations and Support
4             | Analytics and Visualisation  | Data Security and Governance
5             | Security                     | -

In terms of services in the exam guides, the notable differences are the additions of AppFlow & Managed Workflow for Apache Airflow (MWAA), Cloud Financial Management (Budgets, Cost Explorer), AWS Batch & Serverless Application Model, Amazon Keyspaces (for Apache Cassandra), and AWS Developer tools (Cloud9, CDK, Code*) for the associate data engineer certification.

The overlap is unsurprising given the link between data analytics and engineering, as well as the fact that the data analytics specialty focuses on a lot more than just analytics. That said, the real reason I wanted to call this out is that the data analytics specialty certification recommends 5 years of domain experience (compared to 2-3 years in data engineering and 1-2 years with AWS for the data engineer associate).

It’s worth mentioning that there is a distinct difference between associate level and specialty level certifications. As mentioned above, that starts with suggested experience, which I think this Adrian Cantrill page covers quite well, but it also extends to the exam format. The data engineer exam guide indicates 65 questions (50 scored) and, although the exam duration isn’t specified, it’s safe to assume it will be 130 minutes as is the case for the other associate exams, compared to the same question structure but a 180-minute duration for the specialty. As you might expect, that typically means lower complexity questions for the associate level exam.

Exam Experience

I appreciate this is only useful to anyone who has previously undertaken DAS-C01, or already prepared for it, and is considering DEA-C01 in future. If that’s not you, just skip this section.

Though the maximum durations above indicate less time needed for the data engineer associate certification, I don’t think it’s completely representative of the jump in difficulty. I found time not to be tight at associate level, finishing each exam with a good amount of time to spare, but only finished the data analytics specialty exam with a few minutes to spare. Difficulty aside, I found the specialty exam experience a lot more taxing, requiring a lot of mental stamina.

In general, the certified data engineer associate questions are centred on less complex solutions (i.e. fewer interacting components or requirements to assess)

In line with the FAQs around why AWS are retiring the data analytics specialty certification, it felt like questions around overarching solution design were much less common in the data engineer associate exam and replaced with more developer type questions (e.g. what would you do as a DE, how would you configure something or code you would write)

Data engineer associate questions follow a more typical format (e.g. least operational overhead, lowest cost). Though these terms also appear in the specialty certification, the ties from question to correct answer aren’t always as clear, in the sense that things are less obvious (e.g. there could be multiple correct answers, but a specific requirement for queue ordering, throughput, or access controls is slightly different)

There were specific topics with different levels of focus. Compared to the data analytics specialty, my beta exam for certified data engineer associate had:

  • A similar level of focus on data types, formats, and storage
  • Less focus on DB scaling and deployment mechanisms
  • Slightly less detail on Streaming solutions - still a key topic, but less depth required. Specifically, nothing on MSK or KCL, and very little on error handling
  • Similar areas covered in security and governance such as row level security, IAM, cross account access, but less around encryption
  • Much less (very little) on EMR and open-source spark applications
  • More focus on Glue and Lambda
  • Additional questions covering AppFlow, MWAA, AWS Batch, Cloud9, and Cost Explorer
  • Similar coverage on database services
  • Much more focus on handling PII
  • Some specific questions around SQL, Regex, and hashing

This list is not exhaustive, just the areas I found more notable.

My Thought on The Change

In all honesty, I don’t see this as particularly positive or negative. I personally found the experience of achieving the data analytics specialty certification rewarding, but I often comment on the fact that it covers a large range of topics within the data domain, not just analytics, and although it requires a lot of solution architecture knowledge that not all data practitioners may have, or want to have, it will add value in some way to most people working in the data domain. Naturally, the question that jumps to mind is: what certifications should data practitioners take if they aren’t engineers or scientists?

However, role-specific / aligned certifications are certainly clearer in terms of deciding the appropriate learning and certification pathway to follow and recognising the relevance to your day job, and I think there are good options, both AWS and non-AWS, for those working in data that may not be well aligned to the AWS developer, data engineer, or ML specialty certifications. Also, the collateral and training for the data analytics specialty certification isn’t going anywhere, and will still be valuable for those who need that level of depth, which isn’t covered as part of the new data engineer associate certification.

With all that said, AWS’ intention of aligning a certification to the role of a data engineer is clear and easy to understand. I’m very interested to see what this means for potential changes or additions to other associate and specialty level AWS certifications as well as how the data engineer associate exam matures over time. If my beta exam is anything to go by, AWS certified data engineer associate will be a worthwhile certification for existing or aspiring AWS Data Engineers.

]]>
<![CDATA[AWS Certified Data Engineer Associate Beta - My Experience & Tips]]>TLDR;

The AWS Certified Data Engineer Associate certification is intended to cover much of the same material as the data analytics specialty certification. In that vein, it felt more domain-aligned than the existing associate level certifications, and covered a large range of topics to a reasonable level of depth. Having

]]>
https://blog.alistoops.com/aws-certified-data-engineer-associate-beta-my-experience-tips/658dfde53c1634000146e023Thu, 28 Dec 2023 23:39:41 GMTTLDR;AWS Certified Data Engineer Associate Beta - My Experience & Tips

The AWS Certified Data Engineer Associate certification is intended to cover much of the same material as the data analytics specialty certification. In that vein, it felt more domain-aligned than the existing associate level certifications, and covered a large range of topics to a reasonable level of depth. Having worked in a data engineering capacity in previous roles, it’s clear the exam covers relevant subject areas and I think much of the material will feel familiar to those working as data practitioners in AWS or existing data engineering roles

There were some very specific questions that I'm not sure would fairly reflect the ability to operate as a data engineer on AWS, but these were few and far between. It's worth noting that undertaking the beta exam meant a larger spread of difficulty in questions, a longer exam experience, and some difficulty in exam prep (not knowing where to focus), all of which is likely to change once the exam is generally available.

Ultimately, I feel that when the exam is available in April 2024, it will be of value for those in data engineering roles, aspiring data engineers, or those in other data roles such as scientists and analysts looking to broaden their understanding.

What is a beta exam?

So I think it’s worth calling out that the beta Associate Data Engineer exam (DEA-C01) was available between November 27, 2023 and January 12, 2024 - I undertook it in December 2023 - which is relevant context for what follows in this blog, as much of it will be subject to change when the exam is generally available in April 2024. AWS use beta exams to test exam item performance before use in a live exam, but there are a few key differences worth calling out; though the specifics below relate to the associate DE exam, similar differences apply in some way to all beta exams;

  • The exam cost is reduced by 50% (to 75 USD plus VAT)
  • The duration and number of questions are different. In this case, all other associate exams* are 130 minutes in duration with 65 questions (50 scored, 15 not scored), whereas the associate data engineer exam is 170 minutes in duration with 85 questions. The DEA-C01 exam guide indicates the same format as the other associate level exams (once it moves from beta to generally available) in terms of question number, but doesn't call out duration - for now, I would expect it to also be 130 minutes
  • You don’t get results within the typical 5 day window following exam completion. Instead, your results are available 90 days after the beta exam closes. In this case, that would mean early April, so I do not know how I scored yet
  • Though not specific to the exam itself, the nature of the available preparation material being limited to an exam guide and a practice question set containing 20 questions, as well as 3-4 skill builder links, means that it’s difficult to be confident around exam readiness

*The SysOps Associate exam is in this format from March 2023 when the labs were removed until further notice.

What is the DE associate cert and who is it aimed at?

I’m not going to regurgitate the exam guide beyond the bullet points below, but unsurprisingly the exam is aimed at data engineers or aspiring data engineers. In terms of the other 3 associate level certifications, I would say it’s a little closer to the developer associate than to solutions architect or SysOps administrator, on the basis that the latter two can add a lot of value to people working in AWS who aren’t fulfilling administrative or solutions architect roles. In my opinion, the certified data engineer certification is really only of value to those working in data, or those hoping to.

As in the exam guide, the exam also validates a candidate’s ability to complete the following tasks:

  • Ingest and transform data, and orchestrate data pipelines while applying programming concepts.
  • Choose an optimal data store, design data models, catalog data schemas, and manage data lifecycles.
  • Operationalize, maintain, and monitor data pipelines. Analyze data and ensure data quality.
  • Implement appropriate authentication, authorization, data encryption, privacy, and governance. Enable logging.

I plan to write a brief follow-up post sharing a comparison between the data engineer associate and data analytics specialty certifications, but it is worth mentioning that there is a significant overlap in domains and services on the exam guides, so those who have previously passed the data analytics specialty will be well positioned to pass the associate data engineer certification.

Exam Preparation

This will be short and sweet. With the exam in beta, I didn’t have much to go on for preparation. I used 3 resources alongside the exam guide;

  • AWS Skill Builder - for the 20 question exam set
  • Big data analytics whitepaper - I only used this to brush up on a couple of services on the basis that I found it exceptionally useful when studying for the data analytics specialty exam. In retrospect, I think the same applies here, and I would have relied on it more
  • Specific parts of the data analytics udemy course by Stephane Maarek and Frank Kane - similar to the whitepaper, for overlapping services

Admittedly, my preparation could have been better. You'll note I didn't reference anything under "Step 2" on the exam page. It's worth noting that since undertaking the exam, I stumbled across this training course. I've not used it, but thought it worth drawing attention to. Adrian Cantrill is also planning to release a course that is due at the end of January 2024.

My Experience & Recommendations

You might ask yourself why all of the earlier preamble on this being a beta exam matters. The short answer is that the topic areas and observations below should be read with a few key considerations: the questions themselves will be subject to change, as is the case with all certification exams over time, but those changes are likely to be more frequent in the earliest days the exam is offered; the increased number of questions means any of the below could be examined, but that’s not to say all will be; and I don’t yet know whether I passed the exam, so take this with a pinch of salt.

In addition to the above, I felt as though the spread in question difficulty was much larger than other associate level exams. I’ve undertaken all 3 from 2021 to 2023 and the nature of those being well established means the question difficulty tends to be on an even keel - there aren’t many gimmes or higher difficulty questions. I found that not to be the case during the beta exam, and that’s likely to be the case during the first few months the exam is available.

Finally, my experience with Pearson Vue online was quite negative. I had to queue after check-in for 50 minutes, and when I was next in the queue it reset me to position 70. I did use the chat function to contact support but got nothing productive in return. Given there are no breaks allowed, this definitely had some effect on my concentration. This was the first time something like this has happened in 7 (I think) exam experiences, so I'm hoping it's a one-off, but next time I would likely use the support / chat to reschedule.

Now on to the focus areas;

  • Data formats - namely JSON, CSV, Avro, and Parquet. Existing native service integrations (e.g. can QuickSight import CSV), specific data type limitations such as compression types or the ability to handle nulls / missing values, and performance implications such as correct use of columnar data formats are all important
  • PII redaction - understand implementation of redaction through SageMaker, Glue, DataBrew, or Comprehend during transformation, redaction at the consumption layer through row-level security (RLS) in QuickSight, and foundational concepts in hashing and salting
  • Orchestration - I’d recommend understanding the differences between Glue workflows, step functions, and Managed Workflows for Apache Airflow (MWAA)
  • DMS, DataSync, AppFlow, and Data Exchange
  • Data catalogues and metastores. Not just Glue, but Hive and external metastores too
  • Analytics - surfacing in QuickSight, QuickSight connections to data in other services such as Redshift and S3, and appropriate user or service access controls
  • Lake Formation - I wouldn't say there's a need to be a Lake Formation expert, but certainly be familiar with it and understand the different elements of access control it grants over other methods
  • Hands-on SQL queries (e.g. CTAS, GROUP BY, and WHERE / HAVING clauses) are important to understand, so be sure you're comfortable with basic SQL
  • Most DB services are likely to be tested - Aurora, MSSQL, PostgreSQL, DynamoDB, DocumentDB, and Redshift. I found the Data Analytics Specialty preparation I had done previously to be helpful - it covered things like Redshift key types, federated queries, WLM, VACUUM types, indexes, and hashing
  • Networking - security groups vs NACL and cross region as well as cross account redshift access
  • Serverless - what serverless solutions exist, serverless stacks being aligned to lowest cost, and how / when serverless should be a preference
  • As with all associate exams, some critical areas include preparing for question types assessing “least operational overhead” or “lowest cost” options, understanding IAM, and implementing cross-VPC and cross-account solutions
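On the hashing and salting point above, it’s worth being comfortable with the basic mechanics. As a small illustration (my own sketch, not exam material - the function name and sample email are made up), pseudonymising a PII field with a salted hash might look like:

```python
import hashlib
import os

def pseudonymise(value: str, salt: bytes) -> str:
    """Hash a PII value with a salt: equal inputs map to equal tokens
    (so joins still work), but the raw value can't be recovered or
    matched against a precomputed rainbow table."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

# Per-dataset random salt, stored separately from the data itself
salt = os.urandom(16)

token_a = pseudonymise("jane.doe@example.com", salt)
token_b = pseudonymise("jane.doe@example.com", salt)
print(token_a == token_b)  # same input + same salt -> same token
```

The key idea the exam leans on is the distinction between plain hashing (vulnerable to lookup attacks on low-entropy data like emails) and salted hashing, where the salt is kept apart from the dataset.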
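The hands-on SQL patterns mentioned above are simple enough to rehearse locally without any AWS service. A minimal sketch using SQLite (not an AWS engine, and the table and values are invented) of a CTAS combined with GROUP BY and HAVING:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('a', 10), ('a', 25), ('b', 5);

    -- CTAS: materialise an aggregate query into a new table
    CREATE TABLE big_spenders AS
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
        HAVING SUM(amount) > 20;  -- HAVING filters after aggregation
""")
rows = conn.execute("SELECT customer, total FROM big_spenders").fetchall()
print(rows)  # [('a', 35.0)]
```

The WHERE vs HAVING distinction (row-level filtering before aggregation vs group-level filtering after it) is exactly the sort of detail those questions test.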

A few topic areas that I wasn’t totally expecting;

  • Regex. I wasn’t surprised to see something around pattern or text matching, but I would recommend reviewing key operators such as starting with, ending in, and upper / lower case
  • Data mesh - it’s important to have a fundamental understanding of the concept and its terminology (data products, federated data, distributed teams)
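The regex operators I mean by “starting with, ending in, and upper / lower case” map to a handful of constructs. A quick refresher in Python (the log strings are made-up examples):

```python
import re

logs = ["ERROR: disk full", "warning: low memory", "error: timeout"]

# ^ anchors the start of the string; re.IGNORECASE handles upper / lower case
starts_with_error = [s for s in logs if re.search(r"^error", s, re.IGNORECASE)]

# $ anchors the end of the string
ends_with_full = [s for s in logs if re.search(r"full$", s)]

print(starts_with_error)  # ['ERROR: disk full', 'error: timeout']
print(ends_with_full)     # ['ERROR: disk full']
```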

Finally, I didn’t see much appearing on;

  • CodeCommit, CodeDeploy, CodeBuild, or CodePipeline
  • Machine Learning. Consider data supply for ML use cases and SageMaker functionality only
  • Containers

What’s next?

I’m planning to go for the Solutions Architect Professional exam in 2024 but, otherwise, I will update this in April once I get the results back.

]]>