<![CDATA[blog.AliStoops.com]]>
https://blog.alistoops.com/

<![CDATA[My Hopes for Feature Announcements at FabCon ‘24]]>
https://blog.alistoops.com/my-hopes-for-the-microsoft-fabric/66f1b37634c9070001acb77eMon, 23 Sep 2024 19:04:12 GMT

What is FabCon 24?

Europe’s first Microsoft Fabric community conference, branded FabCon, kicks off on September 24th, following the first global Fabric conference and preceding the 2025 event planned for Las Vegas next March/April. As detailed on the conference website, a number of Microsoft’s Data & AI experts, including some of the Fabric product leadership, will be in attendance.

Alongside a number of interesting speaker sessions, there is a planned session focusing on the Microsoft Fabric roadmap. On some online community forums, there has been talk of announcements being “held back” over the last one to two months so that they can be made during the conference this week.

My hopes for announcements

I thought it would be helpful to shortlist the announcements I’d most like to see but, given the number of ongoing development activities and items covered in both the roadmap and Fabric Ideas, I’ve broken it down into my top three hopes tied to known roadmap developments, my top three unrelated to the roadmap, and three other, more “out there” hopes - starting with the roadmap-related ones.

  • Version Control Updates - improved item support, especially for Dataflow Gen2, and ideally the ability to see the code that sits behind Dataflow Gen2 items so it can be validated and version controlled
  • Data Exfiltration Protection - though it’s unlikely to be a killer feature for every customer, data exfiltration protection for notebooks would be important to more security-conscious users and isn’t covered by OneSecurity (planned) or Purview
  • T-SQL improvements (e.g. MERGE, CTEs) - I think the current T-SQL functionality meets most core needs, but specific gaps (e.g. MERGE) still cause friction. I also see ongoing questions around whether to move on-premises data into Azure SQL before feeding it into Fabric or to ingest into Fabric directly, and given where Azure SQL functionality is today, there’s a compelling argument to land there first (see the sketch after this list for the kind of statement that’s currently missing)
  • While I tried to stick to a top three, I have to add that incremental load support in dataflows is another roadmap item I would love to hear more about this week
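To make the MERGE gap above concrete, here is a minimal, hedged sketch of the kind of upsert I mean. T-SQL MERGE isn’t available in the Fabric Warehouse today, but the equivalent logic can already be expressed against Delta tables from a Fabric notebook via Spark SQL; all table and column names below are hypothetical.

```python
# Hedged workaround sketch: a MERGE-style upsert run as Spark SQL against Delta tables
# from a Fabric notebook, standing in for the T-SQL MERGE the Warehouse doesn't support yet.
# Table and column names (silver.customers, staging.customer_updates) are hypothetical.
merge_sql = """
MERGE INTO silver.customers AS tgt
USING staging.customer_updates AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
    UPDATE SET email = src.email, updated_at = src.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, email, updated_at)
    VALUES (src.customer_id, src.email, src.updated_at)
"""

spark.sql(merge_sql)  # `spark` is the SparkSession pre-created in Fabric notebooks
```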

Unrelated to roadmap hopes

  • Workload Management (queues, prioritisation, adjustable burst behaviours) - there are lots of options for how this could be implemented, but ultimately I believe there would be a lot of value in being able to prioritise certain jobs or workspaces. It would also encourage the use of larger capacities, rather than splitting production and non-production workloads at the capacity level, which is currently the only way to facilitate workload segregation
  • Low-code UPSERT for data pipelines and dataflows - similar to the utility of the incremental load support above, being able to UPSERT via data pipelines would be a great quality-of-life improvement, especially considering it’s already possible in Dataverse dataflows
  • Parameters / environment variable support (for notebooks and between experiences) - I have seen a number of examples of metadata-driven pipelines already, and have made use of parameters for dataflows, but being able to do this within notebooks and across experiences (between data pipelines and dataflows), as well as for deployment pipelines, would be great (a sketch of the notebook parameter pattern that exists today follows this list)
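For context on the notebook side, a parameter-cell pattern does already exist when a notebook is called from a data pipeline; what I’m hoping for is something broader and shared across experiences. The sketch below is a minimal, assumed example of that existing pattern - the parameter names are hypothetical, and the first block represents a cell toggled as a parameter cell so a pipeline notebook activity can override its defaults.

```python
# --- Cell 1: marked as a "parameter cell" in the Fabric notebook ---
# Defaults used for interactive runs; a pipeline notebook activity can override these
# via its base parameters. Names (source_path, load_date) are hypothetical.
source_path = "Files/landing/sales"
load_date = "2024-09-01"

# --- Cell 2: use the parameters in the actual work ---
df = spark.read.parquet(source_path)                     # read raw files landed in the lakehouse
df_filtered = df.where(df["ingest_date"] == load_date)   # keep only the requested load date
df_filtered.write.mode("append").format("delta").saveAsTable("bronze_sales")
```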

Others

  • Preview Features going GA - Fabric is maturing rapidly and adding lots of new features regularly, but it would be great to see more information around things becoming generally available. First of all, it would be good to see some longer-standing features move from preview to GA. I’d also like to understand what the cadence or path from preview to GA looks like for upcoming features, and to get more visibility of when things actually happen and what the impact is. We have the roadmap for new features, but it’s more difficult to get a clear view of how this affects existing features or environments
  • Native mirroring for on-premises SQL Server - currently, mirroring on-premises data sources somewhat relies on setup via an Azure SQL instance. In theory this isn’t much of a blocker, but in practice, native mirroring for on-premises SQL Server would provide a much better user experience
  • Asset Tagging - a few months ago, Microsoft added folder support to help organise workspace assets, which does the job, but tags would be a more flexible complement, extending the current “certified” and “favourite” options in OneLake and hopefully opening up additional security and governance options (e.g. attaching RBAC to tags)
]]>
<![CDATA[Common Microsoft Fabric Cost Misconceptions]]>
https://blog.alistoops.com/microsoft-fabric/66df682c34c9070001acb751Mon, 09 Sep 2024 21:47:07 GMT

Fabric has come under its share of scrutiny since going generally available in November 2023, and much of it was, or still is, worth consideration. Specifically, concerns around the maturity of version control and CI/CD (still in preview), observed delays in SQL Analytics Endpoint synchronisation, and the pricing model being subscription or capacity-based rather than purely consumption-based are probably the points I see referenced most often.

Though these are all fair, and hopefully being addressed, it's also worth addressing some common misconceptions, in this case related to capacity features, cost, and performance:

  • The minimum cost for utilising PowerBI embedded is “high”: at launch, embedding was limited to F64 capacities, but that’s no longer the case and embedding is possible with any F SKU. I’ve labelled this misconception “high” cost because one of two figures tends to be quoted - either the F64 capacity cost or the embedded pricing. Ultimately, the actual minimum cost could be as low as a couple of hundred USD per month for a small F SKU, compared to the roughly 800 USD (EM1) or 8,000 USD (F64) per month often quoted
  • F64 is required for fully-featured Fabric: this is partially true in that all features are available on F64 and above capacities. Smaller capacities previously lacked features like trusted workspace access and managed private endpoints, but all F SKUs are now mostly at feature parity - the only feature addition at F64 and above is Copilot
  • All Fabric experiences cost the same for like-for-like operations: William Crayger shared the most vivid example of this that I can remember, describing a more than 95% reduction in consumption units when running the same ingestion process with a Spark notebook compared to a low-code pipeline. I haven’t seen quite such dramatic results in my experience, but I have observed, for specific activities, up to a 75% cost reduction. That is to say, not all experiences will result in the same consumption
  • Capacity performance is better for larger capacities: some small differences can be observed due to smoothing & bursting but, as outlined here by Reitse, even comparing the smallest F2 capacity to F64, performance is largely the same
  • Fabric is "more expensive" than expected, or than PowerBI at scale: as with everything, this depends on many factors but, in terms of perception, it's worth bearing in mind that the F64 capacity cost (when reserved) is equivalent to a PowerBI Premium capacity (around $5k p/m)
  • The F64 SKU cost, including free PowerBI access for report consumers, is best fit for 800 or more users: this comes from the fact that otherwise you would be paying for around 800 (x 10 USD) PowerBI licenses for report viewers. However, it doesn’t factor in capacity reservation savings (~40%) nor any licensing already covered through enterprise E5 licenses. In real terms, the driving factor for capacity selection needs to be predominantly data requirements but, considering purely the licensing costs for report viewers, the crossover point will vary - it could be higher (if E5 licenses are in play) or lower (more like 500 if the capacity is reserved); see the worked sketch after this list
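To make that crossover point concrete, here is a minimal back-of-the-envelope sketch using the approximate list prices referenced in this post; the exact figures are assumptions and will vary by region and agreement.

```python
# Rough breakeven between per-viewer PowerBI Pro licences and an F64 capacity
# (which includes free report viewing). Prices are approximate list prices and
# are assumptions for illustration only.
pro_licence_per_user = 10.0           # USD per user per month
f64_payg_per_month = 8409.60          # USD per month, pay-as-you-go
f64_reserved_per_month = f64_payg_per_month * 0.60   # assuming roughly a 40% reservation saving

breakeven_payg = f64_payg_per_month / pro_licence_per_user          # ~841 viewers
breakeven_reserved = f64_reserved_per_month / pro_licence_per_user  # ~505 viewers

print(f"Pay-as-you-go breakeven: ~{breakeven_payg:.0f} report viewers")
print(f"Reserved breakeven:      ~{breakeven_reserved:.0f} report viewers")
```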

Now, this isn’t to say that factors such as portability, version control, or workload management and prioritisation aren’t worthwhile considerations, but I think it’s good to see barriers to entry being removed for new users and a mostly consistent experience across all capacities.

]]>
<![CDATA[Microsoft Fabric Workspace Structure and Medallion Architecture Recommendations]]>
https://blog.alistoops.com/microsoft-fabric-workspace-structure-and-medallion-architecture/6696aa2934c9070001acb746Wed, 17 Jul 2024 13:37:28 GMT

Intro & Context

I’ll start by saying that I won’t be discussing in any great detail the merits of the medallion architecture, nor the debate between building a literal medallion and structuring against the requirements of your domain while using medallion as a conceptual talking point (though I think this is a good video talking through the latter). We will just assume you’re following the guidance Microsoft publishes describing the usual bronze, silver, and gold zones or layers.

I’ve seen a number of conversations online, namely on Fabric community blogs, Reddit, and Stack Overflow, that primarily focus on whether implementing the medallion architecture means we should have one workspace per domain (e.g. sales, HR data) covering all layers, or a workspace per layer (bronze, silver, gold) per domain. While these threads often end up with a single answer, I don’t think there is a single right answer, and I also think some nuance is missed in this conversation. So what I’m going to cover here is some guidance around the implications of your Fabric workspace structure, as well as recommendations for a starting point. It’s also worth noting that this focuses primarily on lakehouse implementations.

Key Design Implications

Before sharing any recommendations, I want to get the “it depends” out of the way early. The “right” answer will always depend on the context of the environment and domain(s) in which you’re operating.

This is just a starting point and a way to break down some key decision areas. It’s worth noting there are lots of good resources and examples of people sharing their experiences online, but the reason this blog exists is that these more often than not represent the most straightforward scenarios (single source systems or one or two user profiles), where following Microsoft’s demos and Learn materials is enough. As far as the implications of your workspace structure in Microsoft Fabric go, I would suggest the key areas include:

  • Administration & Governance (who’s responsible for what, user personas)
  • Security and Access Control (data sensitivity requirements, principle of least privilege)
  • Data Quality checks, consistency, and lineage
  • Capacity (SKU) features and requirements (F64-only features, isolating workloads)
  • Users and skillsets
  • Version control & elevation processes (naming conventions, keeping layers in sync)

Potential High Level Options

[Image: Option 1 (B)]
  1. 1 workspace per layer per domain. This is recommended by Microsoft (see “Deployment Model”) as it provides more control and governance. However, it does mean the number of workspaces can grow very quickly, and operational management (i.e. who owns the pipelines, data contracts, lineage from the source systems) needs to be carefully considered
  2. 1 landing zone or bronze workspace then 1 workspace per domain for silver and gold layers. This could slightly reduce the number of workspaces by simply landing raw data in one place centrally but still maintaining separation in governance / management
  3. One workspace per domain covering all layers. Though this is against the suggestion in Microsoft documentation, it is the most straightforward option, and for simple use cases where there are no specific constraints around governance and access, or where users will be operating across all layers, this could still be suitable
[Image: Option 2 (C) - not recommended]

There’s also an additional decision point in terms of the approach to implementing the above for managing your bronze or raw data layer:

  • A) Duplicate data - where multiple domains use the same data, simply duplicate the pipelines and data in each domain. I think this is mostly a case of ownership or stewardship and your preferred operating model given the cost of storage is relatively low
  • B) Land data in a single bronze layer only, then use shortcuts - though the aim would be to utilise shortcuts to pull in data where possible, here I am specifically talking about agreeing that where raw data is used in multiple domains, you could land it in one bronze layer (say domain 1) then shortcut access to that in subsequent domains (domain 2, 3, etc.) to avoid duplication
  • C) Use Cloud object storage for Bronze rather than Fabric Workspaces - this is an interesting one, and I think it only really applies if you plan to go with the core option 2 where you’re looking to have a centralised bronze layer. In this case you could do it in a Fabric workspace, or you could have a bronze layer cloud object store (ADLS gen2, s3, Google Cloud Storage, etc.). I think the only potential reason to consider this is to manage granular permissions for the data store outside of Fabric. In real terms, I would rule this out completely and instead consider (B)
[Image: Option 3]

With the above in mind, you can see how there are a number of approaches where the potential number of workspaces could be as large as the number of domains (d) multiplied by both the number of layers (3) and the number of environments (e), or as small as the number of domains multiplied by the number of environments - with a dev, test, and prod structure (e = 3), that’s anywhere between d x 3 and d x 9 workspaces. Note that the workspace counts shown in each image would each need to be multiplied by the number of environments (usually at least 3).

It’s also worth noting that the above list isn’t exhaustive. There are other options, such as a monolithic approach (one workspace for all layers) for some federated teams but segregated workspaces for centrally managed data, or keeping the medallion layers monolithic as in option 3 (think a platform or data workspace) with a separate workspace for reporting. This is all about targeting a starting point.

[Image: Options Overview (1A, 1B top; 2A, 2B middle; 3A bottom)]

Sample Scenario

You might start to see why straightforward examples - where an individual is setting up their Fabric workspace(s) for data used by a single domain or team, with limited risk of duplication and clear requirements around access controls - result in an obvious structure. However, what does this mean when we begin to scale across multiple domains in an organisation and a larger number of Fabric users?

For the purpose of considering any options, I’m going to make some assumptions for a sample scenario. We’re going to consider a fictional organisation that is rolling out Fabric across three domains initially: Sales, Marketing, and HR. Sales and Marketing will utilise at least one of the same source datasets, and there are no clear security or access control requirements between layers, but administration and governance must be segregated by domain. The organisation must keep separation between prod and non-prod workloads, and there will be dev, test, and prod environments.

In the sample scenario, there are a number of options, but we could recommend, for example:

  • For each domain, utilise a workspace per environment (dev/test/prod)
  • For each domain, utilise a single workspace consolidating medallion layers (bronze / raw to gold / ready for analytical use)

While this doesn’t seem unreasonable, I will admit that it’s not particularly realistic. In most cases, I would expect there to be a preference for better governance and control around access to raw (bronze) data. In that case, while much of the above holds, I would expect the last point to be fundamentally different. Rather than moving to the other end of the spectrum and creating individual workspaces for all layers, domains, and environments, a real-world example I worked through previously, reasonably similar to the description above, went with a single bronze-layer landing zone (option 2B).

Recommendations

  • There isn’t a single “right” answer here, so discussing the trade-offs will only get you so far. I would suggest picking your best or clearest use case(s), reviewing your high-level requirements, building towards a proposed or agreed approach, and then figuring out whether specific issues need to be addressed
  • In general, I would recommend using shortcuts to minimise data duplication both across source storage and Fabric and across Fabric workspaces (described in option B). I really think it’s the best way to operate
  • Start by testing the assumption that it makes sense to use one workspace covering all layers of the medallion (option 3). While I think this will only make sense in practice with some adjustment (e.g. splitting data and reporting), and Microsoft recommend a workspace per layer, this is the decision with the biggest effect on administration at scale
  • If you need to segregate medallion layers into individual workspaces, I would propose starting with option 1 (B)
  • Where possible, utilise different capacities for each environment, or at least for production, to make sure production workloads aren’t affected by dev or test jobs. Though not specifically a workspace recommendation, this has come up in every Fabric capacity-planning exercise I’ve been part of to date. In the sample scenario, that would mean two capacities - one for dev and test and one for prod - compromising between flexibility and management overhead, and in this scenario the number of workspaces would be three times the number of domains rather than nine times. This may change if more flexible workload management arrives in future Fabric updates, and smoothing will help those running a single capacity, but it’s not currently possible to isolate workloads without multiple capacities
  • There are also a couple of item level recommendations I would consider:
    • Semantic models should primarily be implemented in the Gold layer - these are really to facilitate end use of modelled data. Adding semantic models in all layers could just add to the complexity of your Fabric estate
    • It’s likely that the design pattern of utilising Lakehouses only or Warehouses only (only meaning from bronze to gold) will be common. However, it’s worth considering the different permutations against your needs. In my experience, a good starting point is using lakehouses for bronze and silver, and warehouses in gold (see here for some key differences)
  • If you use a warehouse in the gold layer, look to utilise tables where possible and avoid views. Views will disable Direct Lake (or, rather, cause fallback behaviour), resulting in poorer performance - a short notebook sketch of the shortcut-plus-gold-table pattern follows these recommendations
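As a minimal, hedged sketch of the pattern above - reading raw data through a shortcut and materialising gold as a Delta table rather than a view - here is roughly how it might look in a Fabric notebook attached to the target lakehouse. All table and column names are hypothetical, and “bronze_sales_orders” is assumed to be a shortcut pointing at a table in a central bronze workspace.

```python
from pyspark.sql import functions as F

# A shortcut surfaces in the attached lakehouse like any other table, so it can be read directly.
bronze_df = spark.read.table("bronze_sales_orders")

# Conform and aggregate on the way to the gold layer.
gold_df = (
    bronze_df
    .filter(F.col("order_status") == "COMPLETE")
    .groupBy("order_date", "region")
    .agg(F.sum("order_value").alias("total_order_value"))
)

# Materialise gold as a Delta table (not a view) so Direct Lake doesn't fall back to DirectQuery.
gold_df.write.mode("overwrite").format("delta").saveAsTable("gold_daily_sales")
```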

What Next?

I appreciate this has been focused on creating a starting point, so I just wanted to add some personal opinions on how I’ve seen this work effectively. First of all, I think the decision around environment setup has a bigger effect than the trade-off between a higher or lower number of workspaces. What I mean by that is that without environments separated by a workspace boundary, utilising deployment pipelines and version control is either difficult or impossible, so it’s crucial to have different workspaces per environment.

Next, I believe the key driver for creating workspaces per medallion layer is data governance and access control. For me, the most logical way to balance that with the administration overhead is to use option 3, the monolithic approach, and add an additional “reporting” workspace for each domain. That allows governance and access control between source data and report consumption without creating a massive number of workspaces to manage.

[Image: Option…4?]
]]>
<![CDATA[Microsoft Fabric Roundup - Build 2024]]>
https://blog.alistoops.com/microsoft-fabric-roundup-build-2024/665724103f0d860001248b1aWed, 29 May 2024 13:06:08 GMT

Summary

The intent from Microsoft during Build 2024 was quite clear where Fabric was concerned: show the significant and ongoing progress since General Availability in late 2023, land the message around Fabric’s ease of access, and focus on the SaaS nature of the product by demonstrating a “single pane of glass” view of all your data and analytical workloads. To that end, I think a lot was covered, and there is plenty of potential and opportunity, but there is no doubt that the devil is in the details, which can only be properly understood through hands-on development.

Intro

Microsoft Build 2024 ran from May 21-23 in Seattle and online, and there was a significant focus on all things data and AI, as I mentioned in last week’s blog about the keynote session. Since all the sessions I was tracking have been made available on demand, and there were so many Fabric updates and announcements, I wanted to share the relevant talks and highlight a few key updates, along with some opinions.

Here are the various Fabric-specific sessions from MS Build that inform the rest of this blog post:

Market & Roadmap Update

Microsoft shared that they now have 11,000 customers using Fabric and over 350,000 customers using PowerBI, including 97% of Fortune 500 organisations. Though seeing an update on the growth of the customer base is interesting, the thing I would call out here is that Microsoft is presumably targeting at least the same audience for Fabric as currently uses PowerBI. In fact, they made a direct comparison between the two products’ customer growth over the equivalent period since launch.

An update to the Fabric roadmap was announced. There’s far too much to dissect here, but it’s worth having a browse as there were a few interesting items such as data exfiltration protection.

Key Announcements

I’ll note that this is heavily subjective, but I’ve highlighted what I see as the “bigger” announcements, based either on potential impact or on Microsoft’s messaging:

  1. Snowflake Interoperability and Apache Iceberg compatibility. Though much of the focus to date has been on delta tables in Fabric, Microsoft announced Iceberg compatibility during Build which is great in terms of extending to an additional open table format, and increasing flexibility, but the main thing to call out here is interoperability with Snowflake. Microsoft talked about and demonstrated the seamless two-way integration with Snowflake. This seems awesome, and alongside other announcements such as mirroring, will massively reduce the overhead in managing multiple data sources within Fabric. However, a word of warning is to consider the implications in terms of cost depending on storage and compute as this would likely be split across two products / services
  2. Fabric real-time intelligence. Aside from the nuts and bolts of this announcement, including integration with sources like Kinesis and Fabric Workspace Events in real time, it’s worth calling out that real-time event data will be a Fabric use case that is of interest to a wide range of users and something that would potentially have ruled Fabric out as a primary analytics platform before now. Microsoft had a representative from McLaren racing on stage to discuss how they were using real-time intelligence to ingest sensor data that, as an F1 fan, was really interesting. I’m sure I will have a lot of questions as I investigate this in more detail, but an obvious one is how cost compares to batch workloads and performance comparison to other options in this space
  3. AI Skills (see around halfway down this page). Released in preview, this is essentially a custom Copilot-type capability for business users that utilises generative AI to convert natural language to SQL, enabling rapid Q&A over Fabric data sources and straightforward prompt engineering. I think an important consideration here is where this sits in your overall user experience or use case, in the sense that more regularly asked questions would be surfaced through PowerBI reports. Perhaps this has a similar place to Athena in AWS for ad-hoc queries, where more regular questions are answered via Quicksight. Nonetheless, this seems interesting in terms of time to value and Fabric GenAI integration
  4. Fabric Workload Development for ISVs. This was one I didn’t see coming at all. Microsoft announced the availability (in preview) of the workload development SDK and shared a demo that walked through how developers can create offerings via the Azure Marketplace and allow Fabric customers to utilise their solutions without leaving their Fabric environment. Some workloads mentioned include Informatica, Neo4j, SAS, Quantexa, Profisee MDM, and Teradata. One really interesting thing here is that Microsoft have essentially opened up their React front-end so that workloads can have a “Fabric look and feel” for customers. I’m looking forward to getting hands-on with the SDK, and seeing how customers can accurately estimate pricing with ISV workloads.
[Image: Fabric Workload Development Architecture. Source: Microsoft]

Other Announcements 

It’s difficult to have complete coverage of all the news shared, but there was so much covered through these sessions that I’ve listed below a number of additional things that piqued my interest:

  • External data sharing (cross tenant)
  • Shortcuts to on-premises sources for OneLake (search here for “lake-centric”) available in public preview (this relies on data being in an open format, so mirroring of external databases was also brought in) - mirroring is free for storage of replicas and minimises duplication, but we need to be careful around managing cost as it would, presumably, mean egress and compute costs from the source provider
  • New Real-Time Intelligence module on MS Learn
  • Azure Databricks and Fabric integration - soon you will be able to access Azure Databricks Unity Catalog tables directly in Fabric
  • Anomaly detection preview 
  • Fabric User Data Functions & functions hub - think Azure Functions, but tightly integrated with Fabric. One example discussed was adding forecast projections to warehouse data, which is a common use case in PowerBI that would otherwise require some ETL-specific coding
  • VSCode Microsoft Fabric Extension enabling workspace integration, local debugging, and user data functions
  • Copilot for Data Warehouse - noting this produces the query but doesn’t run it so allows human in the loop adjustment and reduces unnecessary CU consumption. This also includes IDE-style (like GitHub copilot) autocomplete
  • Fabric API for GraphQL
[Image: Source: Microsoft]

Of course these will only be relevant for those reading shortly after publishing, but I wanted to share links and MS Forms in case people are looking for sign up links to public previews or a Microsoft Fabric trial after the event. 

]]>
<![CDATA[Microsoft Build 2024 - Keynote Key Takeaways]]>
https://blog.alistoops.com/microsoft-build-2024-keynote-key-takeaways/664f93723f0d860001248b01Thu, 23 May 2024 19:25:51 GMT

Microsoft Build 2024 is running from May 21-23 in Seattle and Online. Over the weekend, I will look to distil something more detailed around the Fabric and/or AI sessions from Build, but following a Keynote session that lasted over 2 hours, I thought I’d share my top 5 takeaways below. For those interested, the full Day 1 is available here, and Microsoft also posted a 90 second recap.

Key Takeaways

  • Copilots everywhere! Unsurprisingly, Copilot was front and centre for a significant portion of the keynote and there’s a lot to unpack. Outside the announcements, Satya also equated the Copilot runtime to a fundamental shift similar to what Win32 did for GUIs, shared that nearly 60% of Fortune 500 companies are using Copilot, and called out around half a dozen organisations with over 10k seats. I’ve included links to the announcements below, but I’m particularly interested to use Team Copilot
  • GPT-4o is Generally Available in Azure OpenAI (demo). LLMs and SLMs (namely GPT-4o and Phi-3) were peppered throughout the keynote, including a number of product launches and case studies (link). Recurring themes seemed to be optimisation and efficiency in either cost or performance, and multi-modality across image, text, and speech. It’s also worth calling out a moment where OpenAI CEO Sam Altman made a point of referencing AI as an enabler. I’ll take any opportunity to reiterate the importance of focusing on providing value by solving business problems (customer / user first), not using specific technology “just because”
  • Khan Academy announced a partnership with Microsoft focused on utilising AI to support educators, and Sal Khan highlighted that teaching will be an area that will see real change through the use of technology, something that I was interested to see during the keynote having recently been involved in a STEM Learning roundtable with many industry and education leaders on exactly this topic. A big part of the Khan Academy presentation was around the intention to make Khanmigo available to all teachers in the US for free 
  • Fabric Real-Time Intelligence is in preview - I’m excited to see more detail on this, but integrating with sources like Kinesis, Blob Storage Events, Kafka, CDC Events from Cosmos DB, and Fabric Workspace Events in real time will be critical to a number of prospective and existing Fabric customers, and opens up a number of new use cases
  • Continued investment in AI infrastructure through AMD MI300X Instinct accelerators and Cobalt VMs

I would also add that it was fantastic to see Kevin Scott (Microsoft CTO) on stage - he was a wonderful addition, and his personal anecdotes around technology and AI having the power to enable real change in medicine and education were poignant reminders of why I love working in this space.

As someone who’s played video games my whole life, there was one added bonus - Copilot might finally help me understand the world of Minecraft!

]]>
<![CDATA[MS Fabric Copilot - Recommendations and Pricing Considerations]]>
https://blog.alistoops.com/ms-fabric-copilot-recommendations-and-pricing-considerations/663666c43f0d860001248af1Sat, 04 May 2024 16:50:18 GMT

TLDR

Ultimately, Fabric Copilot looks to be a really simple way to integrate OpenAI services into a developer’s workflow, with a single method of billing and usage monitoring through your Fabric SKU. There are some assumptions that will need to be made and tested when it comes to accurately baselining cost, and certainly some thought needed around the highest-value use cases, but the cost model is appealing and I consider Fabric Copilot worth using, or at least trialling and assessing against your use cases, with appropriate considerations.

As with GitHub Copilot, Amazon CodeWhisperer, and other tools in this space, I think the focus should be on accelerating development and shifting the focus of skilled developers to more complex or higher-value tasks.

Context

In March, Ruixin Xu shared a community blog detailing a Fabric Copilot pricing example, building on the February announcement of Fabric Copilot pricing. It’s exceptionally useful, detailing how Fabric Copilot works and the maths behind calculating consumption through an end-to-end example. I’m aiming to minimise any regurgitation here, but I’m keen to add my view on key considerations for rolling out the use of Fabric Copilot. For the purpose of this blog, I will focus on cost and value considerations, with any views on technical application and accuracy to be compiled in a follow-up blog.

[Image: Source: https://blog.fabric.microsoft.com/en-us/blog/fabric-copilot-pricing-an-end-to-end-example-2/]

Enabling Fabric Copilot

First, it’s worth sharing some “mechanical” considerations:

  • Copilot in Fabric is limited to F64 or higher capacities. These start at $11.52/hour pay-as-you-go, or $8,409.60/month, excluding OneLake storage costs
  • Copilot needs to be enabled at the tenant level
  • AI services in Fabric are in preview currently
  • If your tenant or capacity is outside the US or France, Copilot is disabled by default. From MS docs - “Azure OpenAI Service is powered by large language models that are currently only deployed to US datacenters (East US, East US2, South Central US, and West US) and EU datacenter (France Central). If your data is outside the US or EU, the feature is disabled by default unless your tenant admin enables Data sent to Azure OpenAI can be processed outside your capacity's geographic region, compliance boundary, or national cloud instance tenant setting”
[Image: Source: https://learn.microsoft.com/en-us/fabric/get-started/copilot-fabric-consumption]

Recommendations

  1. My first recommendation is to “give it a go” yourself. Based on how Fabric Copilot pricing works at the time of writing, I can see clear value in developers utilising Fabric Copilot - even for a simple example that optimistically might only take 30 minutes of development, a $0.63 cost feels pretty hard to beat. It’s hard to imagine Fabric Copilot costing more than a developer’s time.
  2. Consider who actually needs or benefits from access to Copilot. From what I’ve seen so far, the primary use case is around accelerating development for analysts and engineers, so those consuming reports might see much less value in comparison. Personally, I would also recommend any outputs are appropriately tested and reviewed by someone with the capability of building or developing without Copilot (for now, at least).
  3. Unsurprisingly, it feels as though Copilot could really accelerate analytics and engineering development, but I think it’s crucial that organisations considering adoption roll out the use of Copilot in stages, starting with small user groups. This is for two reasons: it supports the next recommendation on my list, and it helps with managing and monitoring resources at a smaller scale before considering the impact on your capacity.
  4. I think it will be important to build out your internal best practice guidance / knowledge base. For example, in Ruixin’s blog, the CU consumption for creating a dataflow to keep European customers was about 50% more than what was required for creating a PowerBI page around customer distribution by geography. In my opinion, the benefit in terms of time saved is larger in the PowerBI use case than a simple dataflow column filter. In this example, I would also suggest that it’s best practice to use Copilot to generate a starting point for PowerBI reports, rather than something ready for publication / consumption. As is the case in many applications of Generative AI, there’s likely additional value in standardising Fabric Copilot inputs in terms of more consistent costs, so developing a prompt library could be useful.
  5. You need to plan for expected CU consumption based on users and capacity SKU. Admittedly this seems obvious, and is likely only a potential issue at a scale where multiple people are utilising Fabric Copilot at once against the same capacity. For context, although organisations with more than 20 developers may be on a larger capacity than F64, a team of 20 developers submitting Fabric Copilot queries of a similar cost to those detailed in the blog (copied below) would consume more than 100% (20 x 12,666.8 = 253,336 CU seconds) of the available F64 capacity (64 x 3,600 = 230,400 CU seconds) in a given hour - see the worked sketch after this list. Admittedly, this needs to be considered over a 24-hour period, and it’s unlikely that parallel requests will be submitted outside standard working hours, but it should be evaluated alongside other processes such as pipelines and data refreshes.
  6. Though I believe much of the Microsoft guidance around data modelling for using copilot is generally considered good practice for PowerBI modelling, I would recommend assessing your data model and adapting in line with the Microsoft guidance in order to maximise effectiveness of utilising Fabric (or PowerBI) Copilot.
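For anyone who wants to sanity-check recommendation 5, here is a minimal sketch of that capacity maths. The per-developer figure is the one quoted from the pricing example above and is used purely for illustration.

```python
# Back-of-the-envelope check on the Copilot capacity maths in recommendation 5.
developers = 20
copilot_cu_seconds_per_dev = 12_666.8   # CU(s) per developer's Copilot usage in an hour (from the pricing example)
f64_cu_seconds_per_hour = 64 * 3600     # F64 = 64 CUs, i.e. 230,400 CU seconds available per hour

total_demand = developers * copilot_cu_seconds_per_dev   # 253,336 CU seconds
utilisation = total_demand / f64_cu_seconds_per_hour     # ~1.10, i.e. ~110% of an F64 hour

print(f"Copilot demand: {total_demand:,.0f} CU(s) vs {f64_cu_seconds_per_hour:,} CU(s) available "
      f"({utilisation:.0%} of the hour, before 24-hour smoothing)")
```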
[Image: Source: https://learn.microsoft.com/en-us/power-bi/create-reports/copilot-create-report-service]
]]>
<![CDATA[Microsoft Fabric Analytics Engineer Associate (DP-600) Certification - My Experience & Tips]]>
https://blog.alistoops.com/microsoft-fabric-analytics-engineer-associate-dp-600-certification-my-experience-tips/6632087f3f0d860001248ad0Wed, 01 May 2024 09:18:29 GMT

TLDR

The Fabric Analytics Engineer Associate certification is intended to cover almost all generally available functionality of Fabric at the time of writing. That said, it’s not surprising that it covers lots of modules and collateral that exist in other data engineering and data analyst learning, and I expect the exam material will change as the product matures and additional functionality is added - this will be interesting to review over time. It’s currently quite PowerBI focused, and I don’t expect that to change a great deal.

While it’s easy to question where an “Analytics Engineer” sits in terms of role alignment, given that data analyst and data engineer roles are more common in the industry than analytics engineers, I do feel that it covers enough topics across both areas to live up to the name. Any potential candidates will need in-depth knowledge of SQL, Spark, and DAX, as well as of developing Fabric analytics workloads and managing or administering Fabric capacities. Some topics are covered with a purely Fabric lens (e.g. version control, deployment best practices, and orchestration), which is totally reasonable, but it’s worth potential candidates considering broader foundational analytics engineering training on topics that aren’t covered in the core DP-600 learning.

I think that any data practitioners interested in, or needing to prepare for, working with Fabric would benefit from undertaking the DP-600 learning and exam.

What is the Analytics Engineer Certification and Who is it Aimed at

[Image: Fabric Capabilities and Components]

Before getting into details, it’s worth mentioning that the Fabric Analytics Engineer Associate certification follows a similar structure to other Azure certifications in that the name of the certification and the name of the exam are different, but ultimately passing the DP-600 exam grants the certification. As such, I will use the two terms synonymously.

The exam details do a good job of describing the experience needed, and skills measured.

Experience Needed:

  • Data modeling
  • Data transformation
  • Git-based source control
  • Exploratory analytics
  • Languages, including Structured Query Language (SQL), Data Analysis Expressions (DAX), and PySpark

Skills measured:

  • Plan, implement, and manage a solution for data analytics (10–15%)
  • Prepare and serve data (40–45%)
  • Implement and manage semantic models (20–25%)
  • Explore and analyze data (20–25%)

To anyone already operating in the domain, most of the skills and experience assessed won’t be surprising. I think much of the material will feel familiar to those working as data practitioners in Azure, especially PowerBI experts, but it’s relevant to consider that most candidates will feel stronger in, or more closely aligned to, either the analytics or the engineering topics. The interesting thing here is that I don’t think it’s immediately obvious whether a more analytics-focused or engineering-focused background would be more beneficial, but at first glance, I would suggest this reads as slightly more focused on analytics. I will add my views and explain specifics in later recommendations but, ultimately, I feel the exam is aimed at people who want a rounded view of managing Fabric capacities and building and deploying Fabric workloads, whatever their title may be.

Exam Preparation

For context, I took the DP-600 in early April 2024, and given it was only in beta in early 2024, there aren’t yet many high-quality courses from reputable trainers. I used Microsoft Learn’s self-paced learning (available here) and the practice assessment.

I’ve since seen a number of Reddit posts (r/MicrosoftFabric) and YouTube videos (from users such as Guy in a Cube) that look promising, and I’m sure popular trainers in the Azure space such as John Savill and James Lee will have DP-600 training content in future.

There are a couple of additions worth pointing out here. Though I was confident in my SQL, PySpark, and PowerBI skills, I would recommend that if you’re new to the domain, or feel there’s room for improvement in one of those areas, it would be worthwhile considering resources such as Udemy or Coursera SQL courses, Cloudera or Databricks PySpark training, MS Learn PowerBI training, or your favourite training provider to close any gaps. It’s worth noting that the exam expects a level of SQL and PySpark knowledge beyond what is covered in the MS Learn labs.

My Experience & Recommendations

Before sharing focus areas for learning, and key topics from the exam, there are a few mechanical pieces to call out:

  • Do the practice assessment Microsoft offer (on the DP-600 page linked above) - none of the questions here came up in the exam, but I felt they were exactly the right level of difficulty, with a good spread of topic areas to assess readiness prior to booking the exam. I did experience the webpage crashing and not remembering my progress (I’m assuming it’s a randomised question set / structure), so I would suggest doing this in one sitting
  • Do all the labs / hands-on demos - it’s very easy to want to move past these as much of the text repeats the theoretical material covered in the learning modules, but there really is no substitute for hands-on experience. I would also note that the only topics I struggled with in the practice assessment were ones I had not done the lab for
  • Get used to navigating MSLearn - you can open an MSLearn window during the exam. It’s not quite “open book” and it’s definitely trickier to navigate MSLearn using only the search bar rather than search engine optimised results, but effectively navigating MSLearn means not always needing to remember the finest intricate details. That said, it is time consuming, so I aimed to use it sparingly and only when I knew where I could find the answer quickly during the exam. Note that this won’t help for most PySpark or SQL syntax / code questions
  • Though it’s not a prerequisite, I would firmly recommend undertaking the PL-300 (PowerBI Data Analyst Associate) exam before DP-600. So much of the DP-600 learning is based on understanding DAX, PowerBI semantic modelling and best practices, and PowerBI external tools (Tabular Editor, DAX Studio, ALM Toolkit). Those who have undertaken PL-300 will have a much easier time with the DP-600 exam.

As for topic areas covered during the exam:

  • Intermediate to advanced SQL knowledge - most SQL questions rely on knowledge that isn’t covered via MSLearn beyond a few examples. These include understanding SQL joins, where, group by and having clauses, CTEs, partitioning (row number, rank, dense rank), LEAD and LAG, and date/time functions
  • PowerBI - in addition to the above pieces noted in relation to PL-300, I would also call out creating measures, using VAR and SWITCH, Loading data to PowerBI, using time intelligence functions, using data profiling tools in PowerBI, and using non-active relationships (functions such as USERELATIONSHIP). Managing datasets using XMLA endpoints and general model security are also key topic areas
  • Beginner to intermediate PySpark (as well as Jupyter Notebooks and Delta Lake tables) - in all honesty, I think the more hands-on PySpark experience you have, the better. But I was only specifically asked about things like the correct syntax for reading / writing data to tables, profiling dataframes, filtering and summarising dataframes, and utilising visualisation libraries (matplotlib, plotly) - see the short sketch after this list for the level of syntax involved
  • Fabric Shortcuts - I think this was relatively clear in the learning for basic examples, but it’s worth understanding how this would work in more complex scenarios such as multi-workspace or cross-account examples (same applies to Warehouses), and how shortcuts relate to the underlying data (e.g. what happens when a shortcut is deleted)
  • Core data engineering concepts such as Change Data Capture (CDC), batch vs. real-time / streaming data, query optimisation, dataflows and orchestration, indexes, and data security [Row-Level Security (RLS) and Role-Based Access Control (RBAC)]
  • Storage optimisation techniques such as Optimize and Vacuum 
  • Fabric licenses and capabilities (see here)
  • Source control best practices for Fabric and PowerBI
  • Knowledge of carrying out Fabric admin tasks - for example, understanding where capacity increases occur (Azure portal), where XMLA endpoints are disabled (tenant settings), and where high concurrency is enabled (workspace settings)
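To give a feel for the level of PySpark syntax referenced above, here is a minimal, hedged sketch of the kind of operations involved - reading a Delta table, filtering and summarising a dataframe, and writing the result back. All table and column names are hypothetical, not taken from the exam.

```python
from pyspark.sql import functions as F

# Read a lakehouse Delta table (the `spark` session is pre-created in Fabric notebooks).
df = spark.read.table("sales_orders")

# Filter and summarise the dataframe.
summary = (
    df.filter(F.col("order_date") >= "2024-01-01")
      .groupBy("region")
      .agg(
          F.countDistinct("order_id").alias("orders"),
          F.sum("order_value").alias("revenue"),
      )
)

summary.show(10)  # quick inspection / profiling

# Write the result back as a Delta table.
summary.write.mode("overwrite").format("delta").saveAsTable("sales_summary_by_region")
```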

What next?

I have no certifications planned for now, but I plan to build something around MS Purview, similar to what I did with AWS DataZone. I will plan over the next few months what the next certification step is but, given the recent additions to AI-102 (AI Engineer Associate) related to OpenAI, I will likely start there.


]]>
<![CDATA[AWS Data Engineer Associate and Data Analytics Specialty Certifications Compared]]>
https://blog.alistoops.com/aws-data-engineer-associate-and-data-analytics-specialty-certifications-compared/6595edaa3c1634000146e0f0Mon, 29 Jan 2024 22:39:36 GMT

Intro and Context

Let me start by saying that I usually wouldn’t compare two AWS certifications beyond considering common services (e.g. how much of my exam prep for X will have helped for certification Y), as I think AWS do a good job of logically separating the certifications as long as you understand the differences between the practitioner, associate, professional, and specialty levels. I also typically take each learning experience independently and try to cover all material even if it was part of a previous certification.

However, alongside the announcement of the new Certified Data Engineer Associate (DEA-C01) certification in November 2023, AWS announced they will be retiring AWS Certified Data Analytics - Specialty (DAS-C01) in April 2024, stating “This retirement is part of a broader effort to focus on certifications that align more closely to key cloud job roles, such as data engineer.”

Having undertaken the Certified Data Engineer Associate beta exam in December, I thought it would be worth sharing my views on comparing the two certifications.

Exam Guide Comparison

Reading through the exam guide for DEA-C01, it felt familiar, and I don’t think this is just due to domain experience. The reason it felt familiar to the Data Analytics Specialty (DAS-C01) content was evident when looking at the list of services, but even at the higher level of certification domains (see the table below) you can see there would be a large degree of overlap.

| Domain | Data Analytics Specialty     | Data Engineer Associate            |
|--------|------------------------------|------------------------------------|
| 1      | Collection                   | Data Ingestion and Transformation  |
| 2      | Storage and Data Management  | Data Store Management              |
| 3      | Processing                   | Data Operations and Support        |
| 4      | Analytics and Visualisation  | Data Security and Governance       |
| 5      | Security                     | N/A                                |

In terms of services in the exam guides, the notable differences are the additions of AppFlow & Managed Workflows for Apache Airflow (MWAA), Cloud Financial Management (Budgets, Cost Explorer), AWS Batch & Serverless Application Model, Amazon Keyspaces (for Apache Cassandra), and AWS developer tools (Cloud9, CDK, Code*) for the associate data engineer certification.

The overlap is unsurprising given the link between data analytics and engineering, as well as the fact that the data analytics specialty focuses on a lot more than just analytics. That said, the real reason I wanted to call this out is that the data analytics specialty certification recommends five years of domain experience (compared to 2-3 years in data engineering and 1-2 years with AWS for the data engineer associate).

It’s worth mentioning that there is a distinct difference between associate-level and specialty-level certifications. As mentioned above, that starts with suggested experience, which I think this Adrian Cantrill page covers quite well, but it also shows in the exam format. The data engineer exam guide indicates 65 questions (50 scored) and, although the exam duration isn’t specified, it’s safe to assume it will be 130 minutes as is the case for the other associate exams; the specialty exam has the same question structure but a 180-minute duration. As you might expect, that typically means lower-complexity questions for the associate-level exam.

Exam Experience

I appreciate this is only useful to anyone who has previously undertaken DAS-C01, or already prepared for it, and is considering DEA-C01 in future. If that’s not you, just skip this section.

Though the maximum durations above indicate less time is needed for the data engineer associate certification, I don’t think that fully reflects the jump in difficulty. I found time not to be tight at associate level, finishing each exam with a good amount of time to spare, but I only finished the data analytics specialty exam with a few minutes to spare. Difficulty aside, I found the specialty exam experience a lot more taxing, requiring a lot of mental stamina.

In general, the certified data engineer associate questions are centred on less complex solutions (i.e. fewer interacting components or requirements to assess).

In line with the FAQs around why AWS are retiring the data analytics specialty certification, it felt like questions around overarching solution design were much less common in the data engineer associate exam, replaced with more developer-type questions (e.g. what you would do as a data engineer, how you would configure something, or code you would write).

Data engineer associate questions follow a more typical format (e.g. least operational overhead, lowest cost). Though these terms also appear in the specialty certification, the tie from question to correct answer isn’t always as clear, in the sense that things are less obvious (e.g. there could be multiple seemingly correct answers, but a specific requirement for queue ordering, throughput, or access controls makes one slightly different).

There were specific topics with different levels of focus. Compared to the data analytics specialty, my beta exam for certified data engineer associate had:

  • A similar level of focus on data types, formats, and storage
  • Less focus on DB scaling and deployment mechanisms
  • Slightly less detail on Streaming solutions - still a key topic, but less depth required. Specifically, nothing on MSK or KCL, and very little on error handling
  • Similar areas covered in security and governance such as row level security, IAM, cross account access, but less around encryption
  • Much less (very little) on EMR and open-source spark applications
  • More focus on Glue and Lambda
  • Additional questions covering AppFlow, MWAA, AWS Batch, Cloud9, and Cost Explorer
  • Similar coverage on database services
  • Much more focus on handling PII
  • Some specific questions around SQL, Regex, and hashing

This list is not exhaustive, just the areas I found more notable.

My Thoughts on the Change

In all honesty, I don’t see this as particularly positive or negative. I personally found the experience of achieving the data analytics specialty certification to be rewarding, but I often comment on the fact that it covers a large range of topics within the data domain, not just analytics, and although it requires a lot of solution architecture knowledge that not all data practitioners may have, or want to have, it will add value in some way to most people working in the data domain. Naturally, the question that jumps to mind is: what certifications should data practitioners take if they aren’t engineers or scientists?

However, role-aligned certifications are certainly clearer in terms of deciding the appropriate learning and certification pathway to follow and recognising the relevance to your day job, and I think there are good options, both AWS and non-AWS, for those working in data that may not be well aligned to the AWS developer, data engineer, or ML specialty certifications. Also, the collateral and training for the data analytics specialty certification isn’t going anywhere, and will still be valuable for those who need the level of depth that isn’t covered by the new data engineer associate certification.

With all that said, AWS’ intention of aligning a certification to the role of a data engineer is clear and easy to understand. I’m very interested to see what this means for potential changes or additions to other associate and specialty level AWS certifications as well as how the data engineer associate exam matures over time. If my beta exam is anything to go by, AWS certified data engineer associate will be a worthwhile certification for existing or aspiring AWS Data Engineers.

]]>
<![CDATA[AWS Certified Data Engineer Associate Beta - My Experience & Tips]]>TLDR;

The AWS Certified Data Engineer Associate certification is intended to cover much of the same material as the data analytics specialty certification. In that vein, it felt more domain-aligned than the existing associate level certifications, and covered a large range of topics to a reasonable level of depth. Having

]]>
https://blog.alistoops.com/aws-certified-data-engineer-associate-beta-my-experience-tips/658dfde53c1634000146e023Thu, 28 Dec 2023 23:39:41 GMTTLDR;AWS Certified Data Engineer Associate Beta - My Experience & Tips

The AWS Certified Data Engineer Associate certification is intended to cover much of the same material as the data analytics specialty certification. In that vein, it felt more domain-aligned than the existing associate level certifications, and covered a large range of topics to a reasonable level of depth. Having worked in a data engineering capacity in previous roles, it’s clear the exam covers relevant subject areas and I think much of the material will feel familiar to those working as data practitioners in AWS or existing data engineering roles

There were some very specific questions that I'm not sure would fairly reflect the ability to operate as a data engineer on AWS, but these were few and far between. It's worth noting that undertaking the beta exam means there was a larger spread of difficulty in questions, a longer exam experience, and some difficulty in exam prep (not knowing where to focus), all of which mean the future exam experience will likely be different.

Ultimately, I feel that when the exam is available in April 2024, it will be of value for those in data engineering roles, aspiring data engineers, or those in other data roles such as scientists and analysts looking to broaden their understanding.

What is a beta exam?

So I think it’s worth calling out that the beta Associate Data Engineer exam (DEA-C01) was available between November 27, 2023 and January 12, 2024 - I undertook it in December 2023 - which is relevant context for what follows in this blog, as much of it will be subject to change when the exam is generally available in April 2024. AWS use beta exams to test exam item performance before use in a live exam, but there are a few key differences worth calling out. Though these are specific to the associate DE exam, they are applicable in some way to all beta exams;

  • The exam cost is reduced by 50% (to 75 USD plus VAT)
  • The duration and number of questions are different. In this case, all other associate exams* are 130 minutes in duration and 65 questions (50 scored, 15 not scored), whereas the associate data engineer exam is 170 minutes in duration with 85 questions. The DEA-C01 exam guide indicates the same format as the other associate level exams (once it moves from beta to generally available) in terms of question numbers but doesn't call out duration - for now, I would expect it to also be 130 minutes
  • You don’t get results within the typical 5 day window following exam completion. Instead, your results are available 90 days after the beta exam closes. In this case, that would mean early April, so I do not know how I scored yet
  • Though not specific to the exam itself, the nature of the available preparation material being limited to an exam guide and a practice question set containing 20 questions, as well as 3-4 skill builder links, means that it’s difficult to be confident around exam readiness

*The SysOps Associate exam has been in this format since March 2023, when the labs were removed, until further notice.

What is the DE associate cert and who is it aimed at?

I’m not going to regurgitate the exam guide beyond the bullet points below, but unsurprisingly the exam is aimed at data engineers or aspiring data engineers. Compared with the other 3 associate level certifications, I would say it’s a little closer to the developer associate than solutions architect or SysOps administrator, on the basis that the latter two can add a lot of value to people working in AWS who aren’t fulfilling administrative or solutions architect roles. In my opinion, the certified data engineer certification, while not aimed solely at data engineers, is only of real value to those working in data, or those hoping to.

As in the exam guide, the exam also validates a candidate’s ability to complete the following tasks:

  • Ingest and transform data, and orchestrate data pipelines while applying programming concepts.
  • Choose an optimal data store, design data models, catalog data schemas, and manage data lifecycles.
  • Operationalize, maintain, and monitor data pipelines. Analyze data and ensure data quality.
  • Implement appropriate authentication, authorization, data encryption, privacy, and governance. Enable logging.

I plan to write a brief follow-up post sharing a comparison between the data engineer associate and data analytics specialty certifications, but it is worth mentioning that there is a significant overlap in domains and services on the exam guides, so those who have previously passed the data analytics specialty will be well positioned to pass the associate data engineer certification.

Exam Preparation

This will be short and sweet. With the exam in beta, I didn’t have much to go on for preparation. I used 3 resources alongside the exam guide;

  • AWS Skillbuilder - for the 20 question exam set
  • Big data analytics whitepaper - I only used this to brush up on a couple of services on the basis that I found it exceptionally useful when studying for the data analytics specialty exam. In retrospect, I think the same applies here, and I would have relied on it more
  • Specific parts of the data analytics udemy course by Stephane Maarek and Frank Kane - similar to the whitepaper, for overlapping services

Admittedly, my preparation could have been better. You'll note I didn't reference anything under "Step 2" on the exam page. It's worth noting that since undertaking the exam, I stumbled across this training course. I've not used it, but thought it worth drawing attention to. Adrian Cantrill is also planning to release a course that is due at the end of January 2024.

My Experience & Recommendations

You might ask yourself why all of the earlier preamble on this being a beta exam matters. The short answer is that the topic areas and observations below should be read with a few key considerations in mind: the questions themselves are subject to change, as with all certification exams over time, but those changes are likely to be more frequent in the earliest days the exam is offered; the increased number of questions means any of the below could be examined, but not all of it will be; and I don’t yet know whether I passed, so take this with a pinch of salt.

In addition to the above, I felt as though the spread in question difficulty was much larger than in other associate level exams. I’ve undertaken all 3 from 2021 to 2023, and the nature of those being well established means the question difficulty tends to be on an even keel - there aren’t many gimmes or much higher difficulty questions. I found that not to be the case during the beta exam, and the same will likely be true during the first few months the exam is available.

Finally, my experience with Pearson Vue online was quite negative. I had to queue after check-in for 50 minutes, and when I was next in the queue it reset me to position 70. I did use the chat function to contact support but got nothing productive in return. Given there are no breaks allowed, this definitely had some effect on my concentration. This was the first time something like this had happened in 7 (I think) exam experiences, so I'm hoping it's a one-off, but next time I would likely use the support / chat to reschedule.

Now on to the focus areas;

  • Data formats - namely JSON, csv, avro, parquet. Existing native service integrations (e.g. can Quicksight import csv), specific data type limitations such as compression types or ability to handle nulls / missing values, and performance implications such as correct use of columnar data types are all important
  • PII redaction - understand implementation of redaction through Sagemaker, Glue, Databrew, Comprehend during transformation, redaction at the consumption layer through RLS in Quicksight, or understanding foundational concepts in hashing or salting (there's a short sketch of salting and hashing after this list)
  • Orchestration - I’d recommend understanding the differences between Glue workflows, step functions, and Managed Workflows for Apache Airflow (MWAA)
  • DMS, data sync, app flow, and data exchange
  • Data catalogues and metastores. Not just Glue, but Hive and external metastores too
  • Analytics - surfacing in Quicksight, Quicksight connections to data in other services such as Redshift and S3, and appropriate user or service access controls
  • Lakeformation - I wouldn't say there's a need to be a Lakeformation expert, but certainly be familiar with it and understand the different elements of access control it grants over other methods
  • Hands on SQL queries (e.g. CTAS and group by, where / having clauses) are important to understand so be sure you're comfortable with basic SQL (see the Athena CTAS example after this list)
  • Most DB services are likely to be tested - Aurora, MSSQL, PostgreSQL, DynamoDB, DocumentDB, Redshift. I found the Data Analytics Specialty preparation I had done previously to be helpful - it covered things like Redshift key types, federated queries, WLM, Vacuum types, indexes, and hashing
  • Networking - security groups vs NACL and cross region as well as cross account redshift access
  • Serverless - what serverless solutions exist, serverless stacks being aligned to lowest cost, and how / when serverless should be a preference
  • As with all associate exams, some critical areas include preparing for questions types of assessing “least operational overhead” or “lowest cost” options, understanding IAM, and implementing cross-VPC and cross-account solutions
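To illustrate the hashing and salting concepts mentioned above, here's a minimal Python sketch of salted hashing - it isn't tied to any particular AWS service, and the email address is just a made-up example:

import hashlib
import secrets

def hash_pii(value: str, salt: bytes = None) -> tuple:
    """Hash a PII value with a per-value random salt so identical inputs don't share a digest."""
    salt = salt or secrets.token_bytes(16)  # 16 random bytes per value
    digest = hashlib.sha256(salt + value.encode("utf-8")).hexdigest()
    return salt.hex(), digest

salt, digest = hash_pii("jane.doe@example.com")
print(salt, digest)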
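And for the hands-on SQL point, below is a rough example of the kind of CTAS and aggregation syntax worth being comfortable with, wrapped in a boto3 Athena call. The table, database, and bucket names are all hypothetical placeholders:

import boto3

athena = boto3.client("athena")

# CTAS: write an aggregated, Parquet-formatted copy of a source table
ctas = """
CREATE TABLE sales_summary
WITH (format = 'PARQUET', external_location = 's3://example-bucket/sales_summary/') AS
SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
FROM raw_sales
WHERE order_date >= DATE '2023-01-01'
GROUP BY region
HAVING SUM(amount) > 0
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "example_db"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},  # hypothetical bucket
)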

A few topic areas that I wasn’t totally expecting;

  • Regex. I wasn’t surprised to see something around pattern or text matching, but I would recommend reviewing key operators such as starting with, ending in, and upper / lower case (see the short example after this list)
  • Data mesh - it’s important to have a fundamental understanding of the concept and terminology (data products, federated data, distributed teams)
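For the regex operators called out above, a quick Python refresher along the lines of what I mean (the sample values are arbitrary):

import re

values = ["Alpha-123", "beta-456", "GAMMA-789"]

starts_with_a = [v for v in values if re.match(r"^[Aa]", v)]         # starting with
ends_with_digits = [v for v in values if re.search(r"\d{3}$", v)]    # ending in three digits
upper_case_prefix = [v for v in values if re.match(r"^[A-Z]+-", v)]  # upper case only before the hyphen

print(starts_with_a, ends_with_digits, upper_case_prefix)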

Finally, I didn’t see much appearing on;

  • Code commit, deploy, build, or pipeline
  • Machine Learning. Consider data supply for ML use cases and Sagemaker functionality only
  • Containers

What’s next?

I’m planning to go for the solution architect professional exam in 2024 but, otherwise, I will update this in April once I get the results back.

]]>
<![CDATA[AWS AI & Data Conference - September 2023]]>AWS held their AI & Data conference in Kilkenny this September with topical sessions including “Using AI to tackle our planet’s most urgent problems” and “Game Changers” run by Dr. Werner Vogels (Amazon CTO) and Miriam McLemore (AWS Enterprise Strategy). A similar event ran

]]>
https://blog.alistoops.com/aws-ai-data-conference-september-2023/651dd8229b6ad20001888c64Wed, 04 Oct 2023 23:46:48 GMT

AWS held their AI & Data conference in Kilkenny this September with topical sessions including “Using AI to tackle our planet’s most urgent problems” and “Game Changers” run by Dr. Werner Vogels (Amazon CTO) and Miriam McLemore (AWS Enterprise Strategy). A similar event ran in Cork in 2022, and I’m hopeful this will continue to run annually given my experience.

AWS customers and partners were there talking and learning about all the latest developments in AI & Data, including how GenAI is being used in production and driving value. Showcases were presented on using GenAI to action event feedback resulting in a direct revenue increase, and utilising GenAI to automate generation of bid (or RFP) responses based on previous bids using Bedrock and Retrieval Augmented Generation (RAG, or adding context to a foundation model through external data), which was claimed to result in 150 customer conversations in 3 months.


Keynote


Werner covered a lot of information during the session, including talking about a popular airline’s panini predictor and GenAI meaning no longer needing to engage with customers on blockchain (half-joking), but most of the discussion was centred on using technology and data to solve problems related to climate change, sustainability, and helping the vulnerable. This ranged from healthcare in remote areas and transport, to food sources (it’s worth checking out Now Go Build). A few things were emphasised or repeated throughout;

  • People solve problems, not AI alone. Disruption will be a result of people tackling our biggest / hardest problems, with GenAI opening new doors
  • Good AI needs good data. This likely isn’t surprising to anyone in the industry, but it was interesting that Werner positioned the opposite as also true: good data needs good AI as a crucial part of extracting value from your data
  • Having the right data and AI skills and capability is crucial. Or, as Werner put it… “Make sure you’re trained - don’t believe what’s in the papers and media”
  • AI is all around us, and it’s not going anywhere. That’s not to say we shouldn’t be considerate of risks and implementation challenges, especially with new technology, but GenAI is another major development like those we’ve seen in the last six (Sagemaker), ten (Alexa), and twenty-five (Amazon.com) years

I will add that it was amazing to see how captivated the audience was during Werner’s presentation. There was, of course, a lot of interesting and valuable information shared, but the way it was delivered and landed was fascinating. Even when sharing messaging that would already be well known to the audience (data being a foundation for all AI, for example), it was often wrapped in such a story that it felt new.

AWS GenAI Takeaways


It’s natural to gravitate towards all-things GenAI right now, given the potential in that space. Below are a few focus areas from the conference:

  • Architectural patterns and production recommendations to improve value & reduce cost and risk
    • Addressing GenAI Foundation Model (FM) challenges (traceability and recency) through customisation. Though various methods were discussed, including fine-tuning and prompt engineering, RAG was identified as the most significant growth area in this space. AWS patterns include using Kendra (deep learning) and embedding reference data (pgvector or OpenSearch)
    • Considerations for pre-trained vs. training your own FM, and patterns for fully managed vs. self-managed pre-trained model services (Bedrock, Sagemaker Jumpstart). AWS discussed an upcoming tool to support customers in decision making, but also suggested that a (customised) pre-trained FM would meet the needs of the majority of use cases
    • Security monitoring (centrally through AWS Security Hub), content filtering, and other current AWS GenAI-in-production recommendations
  • Amazon Quicksight Generative BI - alongside NLP report interaction, fact-checking, and quick calculations features, AWS showed an awesome “story building” live demo that produced an adaptable PPT-slide summary and blog post, from a Quicksight report, in seconds
  • Codewhisperer - AI coding companion for expediting development activities

Broader AWS Data Takeaways


I thought it was important that this AWS event covered so much across the whole AI & Data landscape. Admittedly, there would be plenty more to add if I had been able to attend all the analytics and database sessions, but a few broader data highlights I captured include:

  • Data governance, strategy and ethics are not just “as important” as ever, but more so;
    • Defining governance tenets and principles before policies & procedures, empowering development teams, and data product approaches were just a selection of covered topics. Modern data architecture and solutions require changes to how we govern and manage data
    • There was good conversation on ethics and Responsible AI (RAI) aligned to the AWS RAI core dimensions (see here, also similar to Microsoft’s). Personally, I believe it's not uncommon to focus more on some dimensions than others, such as privacy & security or explainability, and often there is room to improve ethics considerations across the data lifecycle (e.g. collecting, analysing, and disseminating). I’m encouraged to see increasingly active conversation around data ethics
  • Zero ETL - I’m still not sold on zero ETL but I understand where it’s pitched, and that message was clear: reduce high effort, low value work. It was interesting that most of the time on this topic during the conference was dedicated to service integrations between AWS services, such as those between Aurora and Redshift, Redshift and Glue Data Catalog, and Glue Data Catalog and downstream tools for consumption (EMR, Quicksight, Sagemaker). The examples presented were ones where reducing overhead to focus engineering efforts in other spaces (e.g. 3rd party / non-AWS sources, complex Spark pipelines) would be beneficial
  • Culture & working backwards were referenced many times. Key cultural talking points were around engaging executives, data guiding decision making, and data proficiency as a core skill
    • There was a simple phrase that cropped up a few times - “start business backwards, not data forwards” which I think is a helpful reminder
    • AWS shared two interesting figures. Though I’d be keen to understand how they were gathered, I still thought they warranted a mention;
      • 79% of challenges delivery teams see are cultural
      • CDOs tend to spend the majority of their time on challenges around culture (69%), not data

Wrapping Up

I’m appreciative of the opportunity to meet peers, customers, other AWS partners, and people in the community, as well as to see representation from senior AWS team members across all sessions. I thought it was clear that AWS were invested in their partners and customers throughout the island of Ireland.
It was really exciting to see what AWS are working on in this space and what’s coming up. There’s a lot to look forward to.

]]>
<![CDATA[AWS Data Analytics Specialty Certification - My Experience & Tips]]>Key Takeaways

The AWS Data Analytics Specialty certification was a challenging learning experience that requires broad and deep understanding of all designing, building, securing and maintaining of data analytics solutions on AWS. There is a good split between testing of domain expertise (databases, visualisation, storage, ETL, etc.) alongside solution architecture

]]>
https://blog.alistoops.com/my-experience/64c40c70f022a1000115ffecFri, 28 Jul 2023 19:34:21 GMTKey TakeawaysAWS Data Analytics Specialty Certification - My Experience & Tips

The AWS Data Analytics Specialty certification was a challenging learning experience that requires broad and deep understanding of designing, building, securing and maintaining data analytics solutions on AWS. There is a good split between testing of domain expertise (databases, visualisation, storage, ETL, etc.) and solution architecture and design. Though I found much of the information tested on the exam beneficial to learn and understand, it’s worth noting that some specific elements will not be valuable on a day to day basis (looking at you, DynamoDB capacity unit calculations). I do, however, feel as though undertaking this certification has had an overall positive impact on my day job.

I personally found the most challenging element of the certification to be the mental stamina required to sit the 3 hour long exam, so do try some practice exams beforehand.

Background

AWS have a number of specialty exams (6 at the time of writing). I decided to target the Data Analytics Specialty as my third AWS certification after the cloud practitioner and solution architect associate certificates. I highly recommend Adrian Cantrill’s guidance around certification pathways as I found it really useful, and would echo the sentiment around the level of difficulty for the Data Analytics Specialty exam - there was definitely some above-associate-level solution architecture knowledge required that I felt equipped for given my existing analytics domain knowledge, so I would see either Solution Architect Professional (with little to no analytics experience) or Solution Architect Associate (with a lot of analytics experience) as pre-requisite paths.

I have some information on my background in my “About” page, but I would say the most relevant information is that I had been working with AWS for about a year prior to undertaking this certification, I had at least 5 years of experience across analytics-related technologies (databases, visualisation, distributed computing, and ETL), and already had foundational knowledge around storage, compute, and networking. All of these were essential, but some would be covered through the SAA certification.

Exam Topics & Structure

The exam content is fully described in the exam guide. At a glance, I think the structure of the data analytics specialty content was very clear and likely to make the learning easy to align. I found this to be the case, mostly. The exception is that I felt as though it was more helpful to focus specifically on services and solutions rather than the domains (Collection, Storage & Management, Processing, Analytics, & Visualisation, Security) when shifting from theory and hands on labs to exam preparation. This is somewhat arbitrary, but I thought it worth mentioning for anyone who has a little difficulty when switching to exam prep.

Typical guidance for AWS exams, such as paying attention to key terms like “most cost effective” or “least operational overhead” still apply here, as does ensuring you have a good understanding of single-AZ, multi-AZ and multi-region implications, and I found the next most useful thing in selecting a question response to be specific awareness of integrations (or inability to integrate as may be the case) - often with multiple services involved in a single question it can be tricky to remember all available combinations.

Given that AWS scale scoring so that more difficult / complex questions contribute more to your overall score, of course learning has to consider all aspects. However, I also think this means it’s important to be really confident on the less complex questions. In my exam, these included:

  • Quicksight - choosing the appropriate Quicksight visualisation type given a certain set of data (identifying when a scatter plot might be more useful than a bar chart, for example), and implementing AI-driven forecasting
  • How to flatten JSON data in Glue
  • Calculating DynamoDB WCU and RCU (see the worked example after this list)
  • Performance and log analytics (i.e. knowing when to use ELK / OpenSearch)
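For reference, the WCU/RCU arithmetic is straightforward once you remember the unit sizes: one RCU covers one strongly consistent read per second of up to 4 KB (eventually consistent reads cost half), and one WCU covers one write per second of up to 1 KB. A small Python sketch of the calculation:

import math

def rcu(item_size_kb, reads_per_second, strongly_consistent=True):
    """One RCU = one strongly consistent read/sec of up to 4 KB; eventually consistent costs half."""
    units = math.ceil(item_size_kb / 4) * reads_per_second
    return units if strongly_consistent else math.ceil(units / 2)

def wcu(item_size_kb, writes_per_second):
    """One WCU = one write/sec of up to 1 KB."""
    return math.ceil(item_size_kb) * writes_per_second

# e.g. 6 KB items at 10 strongly consistent reads/sec and 10 writes/sec
print(rcu(6, 10), wcu(6, 10))  # 20 RCU, 60 WCU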

Though all aspects of a relatively large number of services are examined, I found some topic areas to come up slightly more regularly during the exam, including:

  • File type limitations (orc, avro, parquet, csv) including, for example, which options are best suited based on partitioning or formatting requirements, or reading parquet into Quicksight (via Athena)
  • Scaling DBs for HA or DR, auto scaling vs manual scaling, scaling out and up (or horizontally and vertically)
  • Redshift Workload Management, concurrency, node types
  • DynamoDB came up a lot, but I had multiple questions specifically around partition keys
  • Streaming data probably came up most often. I didn’t get many Kafka / MSK questions (only 2-3 I can remember), but did encounter a number of Kinesis questions, probably as it relates to multiple exam domains when you consider integration with Kinesis Firehose, Analytics, OpenSearch, etc. Understanding all integrations and limitations is critical, but you’ll also need to know about consumers and producers, and error handling / retry config (there's a producer-side sketch after this list)
  • Message queues (SQS, SNS), message ordering, visibility timeout, etc.
  • Security - encryption, row and column level security, policies and roles, cross account access
  • EMR provisioning and open source spark applications (use of sqoop, hive, pig, NiFi, flink) and what they do

If I had to pick a handful of services in which a lack of expertise would have had the greatest impact on my exam experience, I would say these are Kinesis (plus integrations), Redshift, EMR, DynamoDB, Glue, and relevant security services (KMS, HSM, and others). This is, of course, entirely subjective and will vary for everyone’s question set. This is just what I observed from my exam, but it will not be a surprising list to anybody already working in AWS data analytics.

Admittedly, Glue didn’t feature as much on my set of questions as I was expecting from my exam prep compared to the others listed above. I also didn’t get any ML questions based around Sagemaker, but did have questions that were focused on use of EMR for ML applications / solutions.

All that said, ultimately, if a service is data-specific, you’ll be expected to have expert knowledge on it. These are well described in the big data analytics whitepaper.

Exam Preparation

As with any AWS exam, working with AWS on a daily basis will almost certainly make the experience easier - though working with some of the exact services in the exam would have the most benefit, even if you’re not doing so, having a foundational understanding of networking fundamentals (VPC, CIDR notation and blocks), common integrations, endpoints, High Availability and Disaster Recovery configurations, etc., will also be of some help. All that said, I undertook the exam when I was not using AWS day-to-day. It’s totally possible, but I would suggest that the learning will likely take a little longer.

Though I know people that have spent a matter of days or a couple of weeks preparing for this exam full time, I was dedicating 2-4 hours per week for studying. It took me 2.5-3 months to prep for and pass the exam.

I have used a variety of learning collateral for previous certifications, but tend to use Adrian Cantrill’s (cantrill.io) and/or Stephane Maarek’s learning material, and Tutorialsdojo’s practice exams. For the Data Analytics specialty, Adrian did not have a course so I used Stephane’s Udemy course. I also read (a few times) the AWS Big Data Analytics Whitepaper - though it was published in 2021, I believe this is still very relevant.

Taking the exam

First of all, if you’ve not taken a professional or specialty level AWS exam, it’s a vastly different experience from any other certification I’ve undertaken in that the question structure and length are different and there are far fewer questions where only one answer feels obviously correct. The exam is not only longer (180 mins compared to 130 at associate), but I did need all of that time to complete it - I finished with about 4 minutes remaining. This makes it a real challenge to maintain focus, and I would highlight the importance of taking practice exams to adjust to this.

I think both points above are reflective of the fact that the exam requires specialist domain knowledge across all data domains and services (as described in Adrian’s video, the “T” shaped knowledge).

In all honesty, I feel as though many technical specialists may still struggle with the exam despite great analytics domain knowledge, because an understanding of architectural considerations, and how services and solutions fit together and integrate, is essential. The exam will be much easier for those working in technical or data architect roles (or with experience in these roles).

In taking the exam virtually (I used Pearson), I had no issues, but if you’ve had any challenges with this previously (being warned for looking away from the camera, or mouthing words as you read questions, etc.), again, the likely challenge will be the exam duration.

One exception to the above, depending on how you typically work, is that you don’t have the ability to use a calculator or pen and paper, either to draw out a solution or to do calculations. There will almost certainly be some mental arithmetic required for specific capacity-based questions (throughput and DynamoDB WCU and RCU calculations, for example).

What next?

Go, dominate the world with your AWS Data Analytics expertise… as for me, I decided to take a small break before going after the remaining associate level certifications (SysOps and Developer). Once I decide whether to go for the solution architect professional or database specialty, I will be sure to capture my process and experience in a future post.

]]>
<![CDATA[AWS Data Governance - DataZone Preview Thoughts]]>Amazon DataZone is an AWS data governance solution announced at 2022 (November) re invent and currently in preview. Focused on democratisation of data by domain and exploring pub-sub self-service access to data. In addition to data access, it covers data catalog, business glossary, and metadata capture functionality. It’s

]]>
https://blog.alistoops.com/aws-data-governance/64ad7f15d2dc710001d4cad7Tue, 11 Jul 2023 17:23:24 GMT

Amazon DataZone is an AWS data governance solution announced at re:Invent in November 2022 and currently in preview. It is focused on democratisation of data by domain and a pub-sub, self-service model for data access. In addition to data access, it covers data catalog, business glossary, and metadata capture functionality. It’s worth noting that the positioning of DataZone is closely linked to some key areas, including data mesh architecture.

A couple of things to point out about the preview: it's currently free, it's limited to the Ireland, US East, and US West regions, and, as per the banner in the UI, it's recommended to avoid using it for production purposes while in preview.

Why I’m Interested

Data management and governance has been vital to the success of data teams for as long as I have been working in the field (and much longer), but I believe it has had a renewed focus over the past couple of years. There are a number of reasons for this that I won’t be diving into here, such as the simple fact that many key pillars in technology and data tend to go through cycles of perceived importance. That said, I feel it worth mentioning that the move to cloud, the huge increase in production and availability of data, and the treatment of data as a product make the value proposition of a service like AWS DataZone potentially massive.


Key Enablers

I’m keen to get to my experience with the DataZone preview as soon as possible, but I think there are a couple of key things worth calling out in terms of key enablers or success factors - i.e. the things that may maximise the value of a DataZone deployment, or things that I think might result in challenges if you compare your experience to the slick tech demos seen online.

  • Data stewards & Product owners - roles and responsibilities for self-service
  • Understanding of existing data / domain structure - it’s important to consider how things should be structured for ease of management and use
  • Maturity of datasets being published - to maximise value of AWS DataZone, the data must be usable and of high quality for consumers


Cost

DataZone is free for 3 months of use during preview, with some reasonable limits described in the pricing, so I think it’s a great opportunity to go and try using the service. After that, there is both a per month, per user cost ($7.20-9 USD) and a small cost for storing metadata. I’m unsure what the metadata cost will look like in real terms for large deployments, but it seems as though most of the cost will be on a per user basis.

Experience

I have seen some demos of DataZone where a series of steps were conducted to create data in Athena and bring it in as net-new. I wanted to try pulling in already existing data to see what the process looked like, as I imagine most deployments I'm likely to see will fit this use case. As it has been a few months since I’ve seen any demos or docs, I also wanted to go through the process as a completely new user to get a feel for what the learning curve would be like. Fortunately, I had some data from Kaggle (formula one dataset) in S3 already, so the only pre-requisite step I conducted was to set up a Glue crawler to catalogue that S3 data into a new database.
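I set the crawler up through the console, but for completeness, an equivalent boto3 sketch looks roughly like the below - the crawler name, IAM role, database, and S3 path are all placeholders for my setup:

import boto3

glue = boto3.client("glue")

# Crawl the existing S3 data into a new Glue database (names, role, and path are placeholders)
glue.create_crawler(
    Name="f1-dataset-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="f1_data",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/f1-dataset/"}]},
)
glue.start_crawler(Name="f1-dataset-crawler")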

Glue tables after crawler run

The creation of a domain was fairly straightforward, and the same can be said for creating DataZone projects, but this was where I hit my first stumbling block, as I couldn’t see any option to create the project linked to an existing Glue DB. I tried going back to the AWS console and adding my existing Glue DB as a data source, and voila, I could use an existing DB within the publisher project.

After running the crawler, publishing it to DataZone and making it active was really simple. I also liked that I had the option to either automatically publish as active, or draft (e.g. to remove table prefixes or similar) then active.

I then went through a couple of steps to create business glossary terms and metadata forms. Again, the creation was self explanatory, as was adding links to existing published data. The only challenge here is that it feels as though it would be quite manual to set up for a large data estate. It’s something I could see being improved over time such as adding a glossary term for “pit stop” to any data set with that in the title or columns.

Searching the new domain after publishing initial assets and adding metadata ("F1 Data Labels")

After doing all of this setup, creating the subscription elements didn’t take long; there was just one added step where access wasn’t granted after subscription approval. The issue was related to Lakeformation permissions, and DataZone was really clear in sharing what needed to be granted, so in combination with AWS documentation this was easily resolved. I followed the steps to add myself as datalake admin, remove the allowed principals (link, point 1), and grant specific access to the S3 location, Glue database, and tables (via the Lakeformation UI), re-linked in the DataZone UI, and that was all the setup done. I even tested with an additional IAM user to see what it might look like for non-admins, and I felt this was all quite seamless.
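For anyone scripting rather than clicking through, the table-level grant step looks roughly like this with boto3 - the principal ARN, database, and table names below are placeholders, and the exact principal to grant is surfaced by DataZone itself:

import boto3

lakeformation = boto3.client("lakeformation")

# Grant the consuming role SELECT/DESCRIBE on the crawled table (principal, database, and table are placeholders)
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/datazone-environment-role"},
    Resource={"Table": {"DatabaseName": "f1_data", "Name": "pit_stops"}},
    Permissions=["SELECT", "DESCRIBE"],
)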

Consumer project view after subscription approval and access granting

It may have taken a little longer first time round, but I think from beginning the setup to an end user querying the published data in Athena took around an hour. All-in-all, it felt like a brilliant experience.

Athena query from consumer view

Pros:

  • AWS-native solution for data governance meaning seamless AWS integration
  • Very user-friendly
  • Business glossary, and the fact that search and filtering includes business glossary rather than just data assets
  • Straightforward access management
  • Can see benefit at large scale or multiple domains and data mesh environments where ownership and stewardship is democratised, but also in centralised data teams / functions with a broader group of users / consumers

Cons:

  • Data Lineage not a key feature / use case
  • Each data asset has to be subscribed to individually, where the option to multi-select would be helpful. The same applies to revoking subscriptions and deleting published assets
  • A couple of specific issues in UI navigation - the “open data portal” link didn’t seem to work on first click every time, and query data (Athena) didn't work on iPad but did in Safari on a MacBook. This can be expected in preview, and didn't affect the overall experience

Future Considerations:

  • Currently, notifications of subscription requests only appear in the DataZone portal. It would be cool to see email or text notifications (e.g. via SNS) so it's not reliant on admins or data stewards being in the data portal for visibility
  • Data assets are limited to datasets - I would love to see this extended, for example, to include QuickSight dashboards
  • It would be nice to see some business glossary term links identified automatically / dynamically in future

The big question - would I recommend it?

I think the short answer here is yes. There are a few things I’d like to dig deeper into, including cross-account data sharing, integration with partner solutions (Salesforce, ServiceNow), what the workflow looks like for updating data assets from source, and learning the common pitfalls as with any new tool (e.g. don’t delete the Glue tables before unsubscribing consumer jobs), but I think the potential for DataZone is huge, and it’s a large step in the right direction for AWS-native services supporting data management and governance. I’m really excited to see how DataZone develops going forward.

]]>
<![CDATA[Hosting a Ghost Blog on AWS using Lightsail and docker]]>Intro & Context

For a long time now, I've been self-hosting many of services at home for personal use (more on that in another post yet to be written), which has been a key part of my personal learning across a range of things including networking, docker, and

]]>
https://blog.alistoops.com/test-post/646d3de43d42190001e8f67dTue, 23 May 2023 22:27:54 GMT

Intro & Context

For a long time now, I've been self-hosting many services at home for personal use (more on that in another post yet to be written), which has been a key part of my personal learning across a range of things including networking, docker, and Linux command line. I’ve previously had installations of Wordpress and Ghost on my home network, but decided not to continue further after initial exploration as I didn’t want to keep an externally available blog on my local network.

I typically like to have control over the things I deploy and work on so, despite the simplicity of setup, I had a preference not to utilise one of the available paid hosting services for Ghost. After some personal and professional learning around AWS, I decided it was finally time to use some AWS services to create this blog and share my experience.

In terms of prerequisites, the only proper requirements are to have an AWS account, a domain, and a hosted zone for your domain (I used Route53, or R53, for both). Though I will make reference to how I set this up, I would recommend starting with purchasing and setting up your domain as it can take a little time for registration to complete. Similarly, I will highlight all relevant steps I carried out for deployment, but some basic understanding of AWS Lightsail, AWS Cloudformation, Linux command line, Docker, and yaml would be beneficial.

Design Decisions

Before progressing any further, there are a couple of important design or architecture decisions that I should describe. It’s worth noting there are rarely “right” answers for these, I just wanted to share my own thinking.

  • First, why Lightsail? The short answer is for simplicity, but rather than regurgitate AWS docs, it's better to just provide the link - https://aws.amazon.com/lightsail/features/.
  • Next, Ghost as a package (AWS documentation available here and here) or docker container? This might be a little trickier as it will really depend on your comfort working with both Linux command line and docker, but the main reasons I went with docker were portability should I decide to move hosting elsewhere (should be simpler), and configuration as I felt more confident configuring reverse proxy, SSL and updating configuration options via docker-compose.
  • There are a few options for reverse proxy, but two popular ones I considered are nginx and caddy. I went with Caddy as I felt as though it had the most straightforward setup for a single web service (caddy reverse proxy docs).
  • Finally, AWS Console vs Cloudformation deployment. On this one, the primary driver was for experience and learning. Beyond that, there are a couple of smaller benefits should you choose Cloudformation such as making it a little easier to create multiple environments such as a development deployment as well as making the process of tearing down and redeploying simpler and quicker.

Lightsail Deployment

Now into the deployment itself. I created the below Cloudformation template iteratively by first writing a basic template to instantiate a Lightsail instance, then writing a docker-compose.yaml and testing locally, then putting the two together so that the template carries out some actions when the instance is first launched. These can be seen under the instance UserData, but essentially include:

  • Installing docker
  • Creating the docker-compose file
  • Creating the Caddyfile for caddy configuration

It’s worth noting that including restart: always will make sure the containers come back up after an instance restart.

Description: AWS CloudFormation template for lightsail instance
Parameters:  
  AvailabilityZone:    
    Type: 'AWS::EC2::AvailabilityZone::Name'
    Description: Availability Zone
  InstanceName:
    Type: String
    Description: Instance Name
    Default: ubuntu-blog
  BlueprintID:
    Type: String
    AllowedValues:
      - ubuntu_22_04
      - ubuntu_20_04
      - ubuntu_18_04
    Description: Blueprint ID allowing only ubuntu blueprint ids from May 23
    Default: ubuntu_22_04
  BundleID:
    Type: String
    AllowedValues:
      - nano_2_0
      - micro_2_0
      - small_2_0
      - medium_2_0
      - large_2_0
      - xlarge_2_0
      - 2xlarge_2_0
      - nano_win_2_0
      - micro_win_2_0
      - small_win_2_0
      - medium_win_2_0
      - large_win_2_0
      - xlarge_win_2_0
      - 2xlarge_win_2_0
    Description: Bundle ID
    Default: nano_2_0
  ProjectTag:
    Type: String
    Description: Project tag attribute value
    Default: da-blog-test
  EnvironmentTag:
    Type: String
    Description: Environment tag attribute value
    Default: development

Resources:
  #Lightsail deployment
  LightsailInstance:
    Type: 'AWS::Lightsail::Instance'
    Properties:
      AvailabilityZone: !Ref AvailabilityZone
      BlueprintId: !Ref BlueprintID
      BundleId: !Ref BundleID
      InstanceName: !Ref InstanceName
      Tags:
        - Key: project
          Value: !Ref ProjectTag
        - Key: environment
          Value: !Ref EnvironmentTag
      UserData: |
        #!/bin/bash
        sudo apt-get update -y
        sudo apt-get upgrade -y
        curl -fsSL https://get.docker.com -o get-docker.sh
        sudo sh get-docker.sh
        sudo systemctl enable docker.service
        sudo systemctl enable containerd.service
        sudo usermod -aG docker ubuntu
        sudo curl -L "https://github.com/docker/compose/releases/download/v2.18.1/docker-compose-$(uname -s)-$(uname -m)"  -o /usr/local/bin/docker-compose
        sudo mv /usr/local/bin/docker-compose /usr/bin/docker-compose
        sudo chmod +x /usr/bin/docker-compose
        mkdir -p /home/ubuntu/docker/blog
        mkdir -p /home/ubuntu/docker/blog/secrets
        mkdir -p /home/ubuntu/docker/blog/caddy/data
        mkdir -p /home/ubuntu/docker/blog/caddy/config
        mkdir -p /home/ubuntu/docker/blog/ghost/content
        cat << 'EOF' > /home/ubuntu/docker/blog/caddy/Caddyfile
        {
                email youremail@domain.com
        }
        yoursub.domain.com {
                reverse_proxy ghost:2368
        }
        EOF
        cat << 'EOF' > /home/ubuntu/docker/blog/docker-compose.yml
        version: '3.8'
        services:
          ghost:
            image: ghost:5-alpine
            restart: always
            ports:
              - 2368:2368
            environment:
              # see https://ghost.org/docs/config/#configuration-options
              database__connection__filename: '/var/lib/ghost/content/data/ghost.db'
              # this url value is just an example, and is likely wrong for your environment!
              url: https://yoursub.domain.com
              # contrary to the default mentioned in the linked documentation, this image defaults to NODE_ENV=production (so development mode needs to be explicitly specified if desired)
              NODE_ENV: development
            volumes:
              - ./ghost/content:/var/lib/ghost/content
          caddy:
            image: caddy:2.6.4-alpine
            restart: always
            container_name: caddy
            ports:
              - 443:443
              - 80:80
            volumes:
              - ./caddy/Caddyfile:/etc/caddy/Caddyfile
              - ./caddy/data:/data
              - ./caddy/config:/config
        EOF
Cloudformation template

In the above, make sure to adjust any relevant config, most notably the domain for which you will configure the appropriate A record in your hosted zone.

Next, I navigated to AWS Cloudformation in the AWS console to launch the template. I had previously uploaded the template to S3, but you can also upload it directly. Then follow the GUI steps as relevant.
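If you'd rather script the launch than click through the console, the equivalent boto3 call looks roughly like the below - the S3 URL and parameter values are just examples matching the template's parameters:

import boto3

cloudformation = boto3.client("cloudformation")

# Launch the template already uploaded to S3 (URL and parameter values are examples)
cloudformation.create_stack(
    StackName="ghost-blog",
    TemplateURL="https://example-bucket.s3.amazonaws.com/lightsail-blog.yaml",
    Parameters=[
        {"ParameterKey": "AvailabilityZone", "ParameterValue": "eu-west-1a"},
        {"ParameterKey": "InstanceName", "ParameterValue": "ubuntu-blog"},
    ],
)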

Using the Cloudformation template, most options will be prefilled. Be sure to at least adjust stack name, availability zone, and tags

If you prefer, you could also conduct most of the steps to this point, aside from the user data steps, via the Lightsail GUI. Should you choose this option, the commands in the user data could be run via SSH. Essentially, we are creating the cheapest instance with the OS only.


Networking & Domain

After the template launch has completed, I configured the instance to attach a static IP and added the relevant networking configuration to open the ports for HTTP and HTTPS access as well as limit SSH access to my individual IP.

Creating a static IP - on the Lightsail instance, click the hamburger icon, Manage, then Networking
Only open ports for HTTP and HTTPS traffic, and be sure to restrict SSH to your IP

At this point, you need to open R53 and add the relevant A record with the aforementioned static IP.
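This can also be scripted; a minimal boto3 sketch for upserting the A record is below, where the hosted zone ID, record name, and IP are placeholders for your own values:

import boto3

route53 = boto3.client("route53")

# Point the blog subdomain at the Lightsail static IP (all values are examples)
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789ABCDEFGHIJ",
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "yoursub.domain.com",
                "Type": "A",
                "TTL": 300,
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            },
        }]
    },
)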

Finally, I connected to the instance to run the docker compose and create the stack. The commands below first change directory to where the docker-compose yaml was created, then list the directory contents, and finally create the relevant containers.

cd /home/ubuntu/docker/blog
ls
docker-compose up -d

After all this, you should be able to go to yoursub.domain.com to see your new blog website, or yoursub.domain.com/ghost and begin configuring your blog.

Next steps

There are some things I haven’t discussed in this post in the interest of keeping it to the point, but I just want to call out some considerations after deployment.

  • Security - we’ve mostly considered SSL certification to enable HTTPS, but there are some other next steps such as configuring a web application firewall, SSH security (key only), DNS healthchecks, DNSSEC, and others.
  • I’ve set up some basic budget monitoring and alerts, but if you haven’t done so already these should definitely be considered.
  • Setting up Ghost itself (well described at their website).
  • Updating the docker container(s) manually, scheduled, or automatically using something like Portainer.
]]>