GCP PDE Exam Principles

Intro

This is a post in a series recording some notes as I’m studying for Google’s Professional Data Engineer Certification.

In this post, I provide some principles I am keeping in mind as I prepare for the exam.

Disclaimer

Please read this disclaimer.

GCP PDE Exam Principles

[security] Follow the Principle of Least Privilege
When torn between multiple options, the simplest answer is probably right
When torn between multiple options, the cheapest answer is probably right
An answer’s complexity is likely proportional to the question’s complexity
- For example, a simple question like “What is your favorite colour?” probably has a simple answer
- A more complex question (e.g. “What is the Air-Speed Velocity of an Unladen Swallow on a Tuesday in a head-wind of 30 km/h and 70% relative humidity?”) probably requires a more complex answer
When creating Bigtable row keys, the main goal is to avoid hot spots
- DON’T start a Bigtable row key with a timestamp: rows are stored sorted by key, so sequential writes all land on the same node
Prefer pre-trained ML models to custom-trained ones
- Pre-trained models are typically good enough for basic use cases and can be implemented more quickly
[security] Assign roles to groups and add users to groups (don’t assign roles directly to individual users)
If a table is self-joined and the query runs into problems when scaling, the answer likely involves normalizing
If a question mentions “real-time” and Pub/Sub, it probably requires a push subscription rather than a pull subscription
[BQ] Instead of multiple sharded tables queried with a wildcard, prefer a single partitioned table (partitioning over sharding)
Typically, a serverless/cloud-native solution is preferable to managed services or dedicated machines/VMs
Files that would otherwise live in HDFS should be stored in Cloud Storage (GCS)
It is not uncommon for multiple services to be used together (because each service is often optimized to do one thing very well)
If the question involves “ANSI SQL”, the answer probably involves BigQuery
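
To make the Bigtable row-key principle above a bit more concrete, here is a minimal sketch in plain Python (no Bigtable client involved). The helper names `hot_row_key` and `row_key`, the `device_id#timestamp` layout, and the `MAX_TS_MS` bound are my own assumptions for illustration, not an official recommendation. It only shows how the two key designs sort, which is what decides where writes land.

```python
# Hypothetical helpers for illustration only; not part of any Bigtable client library.
MAX_TS_MS = 10**13  # assumed upper bound on millisecond timestamps


def hot_row_key(ts_ms: int, device_id: str) -> str:
    """Anti-pattern: timestamp first, so consecutive writes get adjacent keys."""
    return f"{ts_ms:013d}#{device_id}"


def row_key(device_id: str, ts_ms: int) -> str:
    """Identifier first, then a reversed timestamp (newest sorts first per device)."""
    return f"{device_id}#{MAX_TS_MS - ts_ms:013d}"


# Six events arriving one millisecond apart from three made-up devices.
events = [(1700000000000 + i, f"device-{i % 3}") for i in range(6)]

# Bigtable stores rows sorted by key, so sorting mimics the storage order.
hot_keys = sorted(hot_row_key(ts, dev) for ts, dev in events)
good_keys = sorted(row_key(dev, ts) for ts, dev in events)

for key in hot_keys:
    print("timestamp-first:", key)
for key in good_keys:
    print("id-first:       ", key)
```

With the timestamp-first keys, every new write sorts right next to the previous one, so a single node takes all the traffic. With the identifier first, writes are spread across as many key ranges as there are devices, and the reversed timestamp keeps the most recent reading at the top of each device’s range.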