Dataflow in a Nutshell
Intro
This post is part of a series of notes I’m taking while studying for Google’s Professional Data Engineer certification.
This particular post covers Dataflow in a nutshell.
Disclaimer
Please read this disclaimer.
Dataflow
- Serverless
- Wraps Apache Beam
- Handles batch and streaming data
- Typically scales more efficiently than Dataproc
- Therefore, it’s typically cheaper than Dataproc
- Used internally at Google
- Improved by Google over time
- Know about windowing
  - fixed window in Java/Python ≈ tumbling window in SQL
  - sliding window in Java/Python ≈ hopping window in SQL
  - session window (the same in both SQL and Java/Python)
- Know about watermarks
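
To make the three window types concrete, here is a minimal pure-Python sketch of how timestamps could be assigned to windows. It is not Beam's implementation. The function names, the integer-seconds timestamps, and the half-open `[start, end)` convention are my own assumptions for illustration. In the Beam Python SDK the corresponding classes are `FixedWindows`, `SlidingWindows` and `Sessions` in `apache_beam.transforms.window`. The `is_late` helper is a simplified reading of watermarks: I treat a window as late once the watermark has reached its end, and ignore allowed lateness and triggers.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end), in seconds


def fixed_windows(ts: int, size: int) -> List[Window]:
    """Fixed (tumbling) windows: each timestamp lands in exactly one window."""
    start = ts - ts % size
    return [(start, start + size)]


def sliding_windows(ts: int, size: int, period: int) -> List[Window]:
    """Sliding (hopping) windows: a new window starts every `period` seconds,
    so one timestamp can fall into several overlapping windows."""
    last_start = ts - ts % period
    # Walk backwards over every window start that still covers `ts`.
    starts = range(last_start, ts - size, -period)
    return [(s, s + size) for s in sorted(starts)]


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    """Session windows: each event opens [ts, ts + gap); overlapping
    windows are merged, so a session ends after `gap` seconds of silence."""
    windows: List[Window] = []
    for ts in sorted(timestamps):
        if windows and ts < windows[-1][1]:
            # Event falls inside the current session: extend it.
            windows[-1] = (windows[-1][0], ts + gap)
        else:
            windows.append((ts, ts + gap))
    return windows


def is_late(window: Window, watermark: int) -> bool:
    """Simplified assumption: data for a window is late once the
    watermark has reached or passed the end of that window."""
    return window[1] <= watermark


if __name__ == "__main__":
    print(fixed_windows(65, size=60))
    print(sliding_windows(65, size=60, period=30))
    print(session_windows([0, 10, 100], gap=30))
    print(is_late((60, 120), watermark=130))
```

The sliding case is the one worth remembering for the exam: with a 60-second window and a 30-second period, a single event belongs to two windows, whereas fixed windows never overlap.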