AWS Certified Data Engineer Associate Beta - My Experience & Tips

TLDR;

The AWS Certified Data Engineer Associate certification is intended to cover much of the same material as the data analytics specialty certification. In that vein, it felt more domain-aligned than the existing associate level certifications, and covered a large range of topics to a reasonable level of depth. Having worked in a data engineering capacity in previous roles, it’s clear to me that the exam covers relevant subject areas, and I think much of the material will feel familiar to those working as data practitioners in AWS or in existing data engineering roles.

There were some very specific questions that I'm not sure fairly reflect the ability to operate as a data engineer on AWS, but these were few and far between. It's worth noting that undertaking the beta exam meant a larger spread of difficulty in questions, a longer exam experience, and some difficulty in exam prep (not knowing where to focus), all of which would likely change in the future exam experience.

Ultimately, I feel that when the exam is available in April 2024, it will be of value for those in data engineering roles, aspiring data engineers, or those in other data roles such as scientists and analysts looking to broaden their understanding.

What is a beta exam?

It’s worth calling out that the beta Associate Data Engineer exam (DEA-C01) was available between November 27, 2023 and January 12, 2024, and I undertook it in December 2023. That is relevant context for the rest of this blog, as much of it will be subject to change when the exam is generally available in April 2024. AWS uses beta exams to test exam item performance before use in a live exam, but there are a few key differences worth calling out. These are specific to the associate DE exam, though they apply in some way to all beta exams;

  • The exam cost is reduced by 50% (to 75 USD plus VAT)
  • The duration and number of questions are different. All other associate exams* are 130 minutes in duration with 65 questions (50 scored, 15 not scored), whereas the beta associate data engineer exam is 170 minutes in duration with 85 questions. The DEA-C01 exam guide indicates the same format as the other associate level exams (once it moves from beta to generally available) in terms of question number but doesn't call out duration - for now, I would expect it to also be 130 minutes
  • You don’t get results within the typical 5 day window following exam completion. Instead, your results are available 90 days after the beta exam closes. In this case, that would mean early April, so I do not know how I scored yet
  • Though not specific to the exam itself, the available preparation material is limited to an exam guide, a practice set of 20 questions, and 3-4 Skill Builder links, which makes it difficult to be confident about exam readiness

*The SysOps Associate exam has been in this format since March 2023, when the labs were removed until further notice.

What is the DE associate cert and who is it aimed at?

I’m not going to regurgitate the exam guide beyond the bullet points below, but unsurprisingly the exam is aimed at data engineers or aspiring data engineers. I would say, in terms of the other 3 associate level certifications, it’s a little closer to the developer associate than solutions architect or SysOps administrator on the basis that I think the latter two can add a lot of value to people working in AWS that aren’t fulfilling administrative or solutions architect roles. In my opinion, the certified data engineer certification is not just aimed at data engineers, but is only of real value to those working in data, or those hoping to.

As in the exam guide, the exam also validates a candidate’s ability to complete the following tasks:

  • Ingest and transform data, and orchestrate data pipelines while applying programming concepts.
  • Choose an optimal data store, design data models, catalog data schemas, and manage data lifecycles.
  • Operationalize, maintain, and monitor data pipelines. Analyze data and ensure data quality.
  • Implement appropriate authentication, authorization, data encryption, privacy, and governance. Enable logging.

I plan to write a brief follow-up post sharing a comparison between the data engineer associate and data analytics specialty certifications, but it is worth mentioning that there is a significant overlap in domains and services on the exam guides, so those who have previously passed the data analytics specialty will be well positioned to pass the associate data engineer certification.

Exam Preparation

This will be short and sweet. With the exam in beta, I didn’t have much to go on for preparation. I used 3 resources alongside the exam guide;

  • AWS Skill Builder - for the 20 question exam set
  • Big data analytics whitepaper - I only used this to brush up on a couple of services on the basis that I found it exceptionally useful when studying for the data analytics specialty exam. In retrospect, I think the same applies here, and I should have relied on it more
  • Specific parts of the data analytics Udemy course by Stephane Maarek and Frank Kane - similar to the whitepaper, for overlapping services

Admittedly, my preparation could have been better. You'll note I didn't reference anything under "Step 2" on the exam page. It's worth noting that since undertaking the exam, I stumbled across this training course. I've not used it, but thought it worth drawing attention to. Adrian Cantrill is also planning to release a course that is due at the end of January 2024.

My Experience & Recommendations

You might ask yourself why all of the earlier preamble on this being a beta exam matters. The short answer is that the topic areas and observations below should be read with a few key considerations in mind. First, the questions themselves will be subject to change, as is the case with all certification exams over time, and those changes are likely to be more frequent in the earliest days the exam is offered. Second, the increased number of questions means any of the below could be examined, but that’s not to say all will be. Third, I don’t yet know if I passed the exam, so take this with a pinch of salt.

In addition to the above, I felt as though the spread in question difficulty was much larger than other associate level exams. I’ve undertaken all 3 from 2021 to 2023 and the nature of those being well established means the question difficulty tends to be on an even keel - there aren’t many gimmes or higher difficulty questions. I found that not to be the case during the beta exam, and that’s likely to be the case during the first few months the exam is available.

Finally, my experience with Pearson Vue online was quite negative. I had to queue after check-in for 50 minutes and when I was next in queue it reset me to position 70. I did use the chat function to contact support but got nothing productive in return. Given there are no breaks allowed, this will have definitely had some effect on my concentration. This was the first time something like this has happened in 7 (I think) exam experiences, so I'm hoping it's a one-off, but next time I would likely use the support / chat to reschedule.

Now on to the focus areas;

  • Data formats - namely JSON, CSV, Avro, Parquet. Existing native service integrations (e.g. can QuickSight import CSV), specific format limitations such as compression types or ability to handle nulls / missing values, and performance implications such as correct use of columnar formats are all important
  • PII redaction - understand implementation of redaction through SageMaker, Glue, DataBrew, Comprehend during transformation, redaction at the consumption layer through RLS in QuickSight, or understanding foundational concepts in hashing or salting
  • Orchestration - I’d recommend understanding the differences between Glue workflows, step functions, and Managed Workflows for Apache Airflow (MWAA)
  • DMS, DataSync, AppFlow, and Data Exchange
  • Data catalogues and metastores. Not just Glue, but Hive and external metastores too
  • Analytics - surfacing in QuickSight, QuickSight connections to data in other services such as Redshift and S3, and appropriate user or service access controls
  • Lake Formation - I wouldn't say there's a need to be a Lake Formation expert, but certainly be familiar with it and understand the different elements of access control it grants over other methods
  • Hands-on SQL queries (e.g. CTAS and group by, where / having clauses) are important to understand, so be sure you're comfortable with basic SQL
  • Most DB services are likely to be tested - Aurora, MSSQL, PostgreSQL, DynamoDB, DocumentDB, Redshift. I found the Data Analytics Specialty preparation I had done previously to be helpful - it covered things like Redshift key types, federated queries, WLM, Vacuum types, indexes, and hashing
  • Networking - security groups vs NACL, and cross-region as well as cross-account Redshift access
  • Serverless - what serverless solutions exist, serverless stacks being aligned to lowest cost, and how / when serverless should be a preference
  • As with all associate exams, some critical areas include preparing for question types that assess “least operational overhead” or “lowest cost” options, understanding IAM, and implementing cross-VPC and cross-account solutions
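
On the SQL point in the list above, here is a minimal sketch of the kind of query worth being comfortable with: a CTAS statement that filters rows with where, aggregates with group by, and then filters the aggregated groups with having. It assumes an in-memory SQLite database purely as a stand-in so it runs anywhere; the orders table, its columns, and the values are made up for illustration, and the CTAS syntax in Athena or Redshift differs in the details (e.g. storage format options).

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a real query engine.
conn = sqlite3.connect(":memory:")

# Hypothetical sample data, invented for this sketch.
conn.executescript("""
    CREATE TABLE orders (region TEXT, status TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('eu', 'complete', 120.0),
        ('eu', 'complete', 80.0),
        ('eu', 'cancelled', 500.0),
        ('us', 'complete', 40.0),
        ('us', 'complete', 30.0);
""")

# CTAS: create a new table from the result of a SELECT.
# WHERE filters individual rows before aggregation;
# HAVING filters whole groups after aggregation.
conn.execute("""
    CREATE TABLE region_totals AS
    SELECT region,
           SUM(amount) AS total_amount,
           COUNT(*)    AS order_count
    FROM orders
    WHERE status = 'complete'
    GROUP BY region
    HAVING SUM(amount) > 100
""")

rows = conn.execute(
    "SELECT region, total_amount, order_count FROM region_totals ORDER BY region"
).fetchall()
print(rows)
```

The cancelled 'eu' order is dropped by where before any totals are calculated, while the 'us' group is dropped by having because its completed orders only sum to 70. Mixing up those two clauses is an easy mistake to make under exam conditions.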

A few topic areas that I wasn’t totally expecting;

  • Regex. I wasn’t surprised to see something around pattern or text matching, but I would recommend reviewing key operators such as starting with, ending in, and upper / lower case
  • Data mesh - it's important to have a fundamental understanding of the concept and its terminology (data products, federated data, distributed teams)
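
To make the regex point above concrete, here is a short sketch of the operators mentioned (starts with, ends in, and upper / lower case) using Python's built-in re module. The file names are invented for illustration and are not taken from the exam.

```python
import re

# Hypothetical object keys, made up for this sketch.
keys = [
    "Sales_2023.csv",
    "sales_2024.CSV",
    "archive/sales_2023.csv.bak",
    "Inventory.parquet",
]

# ^ anchors the match to the start of the string; IGNORECASE ignores letter case.
starts_with_sales = re.compile(r"^sales", re.IGNORECASE)

# $ anchors the match to the end of the string; \. matches a literal dot.
ends_with_csv = re.compile(r"\.csv$")
ends_with_csv_any_case = re.compile(r"\.csv$", re.IGNORECASE)

# A character class restricts the first character to an upper-case ASCII letter.
starts_upper = re.compile(r"^[A-Z]")

sales_keys = [k for k in keys if starts_with_sales.search(k)]
csv_keys = [k for k in keys if ends_with_csv.search(k)]
csv_keys_any_case = [k for k in keys if ends_with_csv_any_case.search(k)]
upper_keys = [k for k in keys if starts_upper.search(k)]

print(sales_keys)
print(csv_keys)
print(csv_keys_any_case)
print(upper_keys)
```

Note how "archive/sales_2023.csv.bak" matches neither anchor: it contains "sales" and ".csv", but not at the start or the end of the string.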

Finally, I didn’t see much appearing on;

  • CodeCommit, CodeDeploy, CodeBuild, or CodePipeline
  • Machine Learning. Consider data supply for ML use cases and SageMaker functionality only
  • Containers

What’s next?

I’m planning to go for the Solutions Architect Professional exam in 2024. Otherwise, I will update this post in April once I get the results back.