GCP PDE Recipes

Intro

This post is part of a series of notes I’m taking while studying for Google’s Professional Data Engineer (PDE) certification.

In this post, I collect some recipes that are commonly used in Google Cloud.

Disclaimer

Please read this disclaimer.

GCP PDE Recipes

The Classic

Pub/Sub -> Dataflow -> BQ (batch inserts)

Variations:

  • Pub/Sub -> Dataflow -> BQ (streaming inserts)
  • Pub/Sub -> Dataflow -> Bigtable -> BQ (querying Bigtable from BQ using a federated query)
    • For use cases that require both analytics and the low latency afforded by Bigtable
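
As a rough illustration of the classic recipe, here is a minimal Apache Beam sketch in Python (Dataflow runs Beam pipelines). It is not from the exam material: the message format (UTF-8 JSON), the two-column schema, and the names parse_message and run are assumptions made for the example. The streaming_inserts flag switches between the batch-load variant and the streaming-insert variant.

```python
import json


def parse_message(data: bytes) -> dict:
    """Decode one Pub/Sub payload (assumed to be UTF-8 JSON) into a BQ row."""
    record = json.loads(data.decode("utf-8"))
    # Keep only the columns of the hypothetical schema used in run().
    return {"user_id": str(record["user_id"]), "value": float(record["value"])}


def run(topic: str, table: str, streaming_inserts: bool = False) -> None:
    """Build and run the pipeline. Needs `pip install "apache-beam[gcp]"`."""
    # Imported lazily so parse_message can be used and tested without Beam.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    write = beam.io.WriteToBigQuery
    if streaming_inserts:
        method, extra = write.Method.STREAMING_INSERTS, {}
    else:
        # Periodic load jobs; an unbounded source needs a trigger interval (s).
        method, extra = write.Method.FILE_LOADS, {"triggering_frequency": 300}

    # Add runner, project, region, and temp_location options to run on Dataflow.
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromPubSub(topic=topic)
            | "Parse" >> beam.Map(parse_message)
            | "Write" >> write(
                table,
                schema="user_id:STRING,value:FLOAT",  # hypothetical schema
                method=method,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                **extra,
            )
        )
```

A call would look like run("projects/my-project/topics/my-topic", "my-project:my_dataset.my_table"), where both names are placeholders.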

Uploading Data to Google Cloud

  • gsutil
    • From on-prem (if practical given network bandwidth and data size)
    • Good for < 1TB
  • Storage Transfer Service
    • From another cloud or an on-prem data center with sufficient bandwidth
    • Good for > 1TB
  • Transfer Appliance
    • Physical storage device that you fill and ship back to Google
    • For large amounts of on-prem data and/or a low-bandwidth location where gsutil is impractical
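
The rules of thumb above can be sketched as a small Python helper. The 1 TB threshold comes from the list; everything else is an assumption made for illustration: the 7-day cutoff for "impractical", the 80% link efficiency, and the function names transfer_days and pick_tool.

```python
TB = 10**12  # bytes in a decimal terabyte


def transfer_days(size_bytes: int, bandwidth_mbps: float,
                  efficiency: float = 0.8) -> float:
    """Estimate the days needed to send size_bytes over a bandwidth_mbps link.

    efficiency is an assumed fraction of the nominal bandwidth actually used.
    """
    bits = size_bytes * 8
    seconds = bits / (bandwidth_mbps * 1_000_000 * efficiency)
    return seconds / 86_400


def pick_tool(size_bytes: int, bandwidth_mbps: float,
              max_days: float = 7.0) -> str:
    """Pick a transfer option using the rules of thumb from the list."""
    # Assumed cutoff: if the upload would take longer than max_days,
    # treat the network route as impractical and ship the data instead.
    if transfer_days(size_bytes, bandwidth_mbps) > max_days:
        return "Transfer Appliance"
    if size_bytes < TB:
        return "gsutil"
    return "Storage Transfer Service"
```

Under these assumptions, 100 GB over a 100 Mbps link maps to gsutil, 10 TB over 1 Gbps maps to Storage Transfer Service, and 500 TB over 100 Mbps maps to Transfer Appliance.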