BigQuery in a Nutshell
Intro
This post is part of a series of posts with notes as I’m studying for Google’s Professional Data Engineer Certification.
This particular post covers BigQuery in a nutshell.
Disclaimer
Please read this disclaimer.
BigQuery
- Fully managed, petabyte-scale data warehouse/analysis service
- Has two components:
- Storage (built on “Colossus”)
- Query/Processing Engine
- Supports federated queries against external data sources (e.g. Cloud Storage, Cloud SQL)
- Column-based storage
- Allows ANSI SQL compliant queries
- Includes good support for:
- Building (basic) ML models
- Logistic regression - classifying categories
Linear regression - forecasting numeric values
- GeoViz
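As a sketch of the BQML support above, a logistic regression model can be trained, evaluated, and used for prediction entirely in SQL (dataset, table, and column names here are hypothetical):

```sql
-- Train a classifier; the label column is named via input_label_cols.
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_charges, churned
FROM `mydataset.training_data`;

-- Inspect quality metrics, then predict on new rows.
SELECT * FROM ML.EVALUATE(MODEL `mydataset.churn_model`);

SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.churn_model`,
  (SELECT tenure_months, monthly_charges FROM `mydataset.new_customers`));
```

Swapping `model_type` to `'linear_reg'` (with a numeric label) gives the forecasting case.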
- Structs
- Allows horizontal storage
- Look like: event.status
- Identifiable where type = RECORD
- Arrays
- Allows vertical storage
- Identifiable where mode = REPEATED
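The struct and array notes above can be seen in a single self-contained query (data is inlined, so no real table is needed):

```sql
WITH events AS (
  SELECT
    STRUCT('ok' AS status, 200 AS code) AS event,  -- STRUCT: shows as type = RECORD
    ['a', 'b', 'c'] AS tags                        -- ARRAY: shows as mode = REPEATED
)
SELECT
  event.status,  -- dot notation reaches into the struct
  tag            -- UNNEST flattens the array: one output row per element
FROM events, UNNEST(tags) AS tag;
```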
- Analytic window functions using LAG
- Named subqueries (common table expressions) using WITH
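A minimal example combining the two features above — a WITH clause naming an inline subquery, and LAG comparing each row to the previous one:

```sql
WITH daily AS (
  SELECT DATE '2024-01-01' AS day, 10 AS sales UNION ALL
  SELECT DATE '2024-01-02', 15 UNION ALL
  SELECT DATE '2024-01-03', 12
)
SELECT
  day,
  sales,
  -- LAG looks one row back in the ORDER BY day window (NULL for the first row)
  sales - LAG(sales) OVER (ORDER BY day) AS change_vs_prev_day
FROM daily
ORDER BY day;
```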
- Caches query results, so repeated identical queries can be served without re-scanning data
- Optimization:
- Partition tables
- Group data into sections allowing BQ to scan less data
- Partition pruning occurs before the query is run, allowing you to know costs up-front
- Common partition source is dates/times
- Roughly analogous to a coarse index over the entire table
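The date-based partitioning above looks like this in DDL (table and column names are hypothetical):

```sql
-- Partition rows by the calendar date of a TIMESTAMP column.
CREATE TABLE `mydataset.events`
(
  event_time TIMESTAMP,
  user_id    STRING,
  action     STRING
)
PARTITION BY DATE(event_time);

-- Filtering on the partition column lets BigQuery prune partitions
-- before execution, so the bytes-scanned estimate reflects the pruning.
SELECT user_id, action
FROM `mydataset.events`
WHERE DATE(event_time) = '2024-01-01';
```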
- Clustering
- Can be done against multiple columns
- Clustering is applied while the query runs, so you can't know up-front exactly what the query will cost or how much clustering will save you
- More granular than partitioning
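Clustering against multiple columns is declared at table creation (a sketch with hypothetical names; BigQuery allows up to four clustering columns):

```sql
-- Clustering sorts data within each partition by the listed columns,
-- so filters on country and user_id scan fewer blocks.
CREATE TABLE `mydataset.events_clustered`
(
  event_time TIMESTAMP,
  country    STRING,
  user_id    STRING
)
PARTITION BY DATE(event_time)
CLUSTER BY country, user_id;
```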
- Typically populated in batch inserts
- Offers streaming inserts to allow smaller, more frequent inserts w/ lower latency, at a higher cost
- Max streaming capability ≈ 100,000 rows/table/sec.
- Offers access control at project, dataset, and table/view levels
Similar Systems
BigQuery vs. BigTable
Among many other differences:
BigTable is designed for long, narrow tables.
BigQuery typically has short(er), wide(r) tables.