Dataflow in a Nutshell
This post is part of a series of notes I’m writing while studying for Google’s Professional Data Engineer certification.
This particular post covers Dataflow in a nutshell.
Please read this disclaimer.
- Wraps Apache Beam (Dataflow is a managed service for running Beam pipelines)
- Handles batch and streaming data
- Typically scales more efficiently than Dataproc
- Therefore, it’s typically cheaper than Dataproc
- Used internally at Google
- Improved by Google over time
- Know about windowing
- Fixed window in Java/Python ≈
- Session (the same for both SQL and Java/Python)
- Know about watermarks
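
To make the windowing and watermark bullets more concrete, here is a minimal sketch in plain Python. It is not the Beam or Dataflow API: the function names (`fixed_window`, `session_windows`, `is_late`) are hypothetical, timestamps are assumed to be integer seconds, and the lateness rule is a simplification (an element counts as late once the watermark has reached the end of its fixed window).

```python
from typing import Iterable, List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end) in seconds


def fixed_window(ts: int, size: int) -> Window:
    """Return the fixed (tumbling) window of length `size` that contains ts."""
    start = ts - (ts % size)
    return (start, start + size)


def session_windows(timestamps: Iterable[int], gap: int) -> List[Window]:
    """Group event times into sessions.

    Each event opens a window [ts, ts + gap); overlapping windows are merged,
    so a session ends once no event arrives within `gap` of the previous one.
    """
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Event falls inside the current session: extend it.
            start, end = sessions[-1]
            sessions[-1] = (start, max(end, ts + gap))
        else:
            # Gap exceeded (or first event): start a new session.
            sessions.append((ts, ts + gap))
    return sessions


def is_late(event_ts: int, size: int, watermark: int) -> bool:
    """Simplified lateness check (assumption, not Beam's exact semantics).

    The watermark is the system's estimate that all data with earlier event
    times has arrived; an element is late if the watermark has already
    reached the end of the element's fixed window.
    """
    _, window_end = fixed_window(event_ts, size)
    return window_end <= watermark
```

For example, with 60-second fixed windows an event at second 125 lands in the window `(120, 180)`, and it would be treated as late if it showed up after the watermark had advanced to 200. Real pipelines would use Beam's own windowing transforms rather than hand-rolled logic like this.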