Dataflow in a Nutshell

Intro

This post is part of a series of notes I’m writing while studying for Google’s Professional Data Engineer certification.

This particular post covers Dataflow in a nutshell.

Disclaimer

Please read this disclaimer.

Dataflow

  • Serverless
  • Wraps Apache Beam
    • Handles batch and streaming data
  • Typically scales more efficiently than Dataproc
    • Therefore, it’s typically cheaper than Dataproc
  • Used internally at Google
    • Improved by Google over time
  • Know about windowing
    • fixed window in Java/Python ≈ tumbling in SQL
    • sliding window in Java/Python ≈ hopping in SQL
    • session (is the same for both SQL and Java/Python)
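  • Know about watermarks
    • The watermark is the system’s estimate of how far event time has progressed; data arriving behind it is treated as late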
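
To make the three window types above more concrete, here is a minimal plain-Python sketch of how each one groups event timestamps. It is not Beam or Dataflow code. The function names (fixed_window, sliding_windows, session_windows) are made up for illustration, timestamps are plain integers in seconds, and the boundary handling for sessions is an assumption of this sketch. In the Beam Python SDK the corresponding window functions are FixedWindows, SlidingWindows and Sessions in apache_beam.transforms.window.

```python
def fixed_window(ts, size):
    """Fixed (tumbling) window: each timestamp falls in exactly one [start, end)."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, period):
    """Sliding (hopping) windows: a new window of length `size` starts every
    `period` seconds, so one timestamp can fall in several windows."""
    windows = []
    start = ts - ts % period          # latest window start at or before ts
    while start > ts - size:          # keep going back while the window still covers ts
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps, gap):
    """Session windows: events are grouped together until there is a pause
    of at least `gap` seconds (assumed boundary rule for this sketch)."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity gap exceeded: start a new session
    return sessions


if __name__ == "__main__":
    print(fixed_window(65, size=60))
    print(sliding_windows(65, size=60, period=30))
    print(session_windows([1, 3, 20, 22, 50], gap=10))
```

The takeaway for the exam: fixed windows never overlap, sliding windows overlap whenever the period is shorter than the size, and session windows have no fixed length because they depend on gaps in the data.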