BigQuery in a Nutshell
Intro
This post is part of a series of posts with notes as I’m studying for Google’s Professional Data Engineer Certification.
This particular post covers BigQuery in a nutshell.
Disclaimer
Please read this disclaimer.
BigQuery
- Fully managed, petabyte-scale data warehouse/analysis service
- Has two components:
- Storage (built on “Colossus”)
- Query/Processing Engine
- Supports federated queries against external data sources (e.g. Cloud Storage, Cloud SQL)
- Column-based storage
- Allows ANSI SQL compliant queries
- Includes good support for:
- Building (basic) ML models
- Logistic regression - classifying categories
Linear regression - forecasting numeric values
- GeoViz
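As a sketch of the BQML support above, a logistic regression model can be trained, evaluated, and used for prediction entirely in SQL (dataset, table, and column names here are hypothetical):

```sql
-- Train a classifier; the label column is named via input_label_cols.
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_charges, churned
FROM `mydataset.training_data`;

-- Inspect quality metrics, then predict on new rows.
SELECT * FROM ML.EVALUATE(MODEL `mydataset.churn_model`);

SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.churn_model`,
  (SELECT tenure_months, monthly_charges FROM `mydataset.new_customers`));
```

Swapping `model_type` to `'linear_reg'` (with a numeric label) gives the forecasting case.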
- Structs
- Allows horizontal storage
- Look like: event.status
- Identifiable where type = RECORD
- Arrays
- Allows vertical storage
- Identifiable where mode = REPEATED
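The struct and array notes above can be seen in a single self-contained query (data is inlined, so no real table is needed):

```sql
WITH events AS (
  SELECT
    STRUCT('ok' AS status, 200 AS code) AS event,  -- STRUCT: shows as type = RECORD
    ['a', 'b', 'c'] AS tags                        -- ARRAY: shows as mode = REPEATED
)
SELECT
  event.status,  -- dot notation reaches into the struct
  tag            -- UNNEST flattens the array: one output row per element
FROM events, UNNEST(tags) AS tag;
```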
- Analytic window functions using LAG
- Named subqueries (common table expressions) using WITH
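A minimal example combining the two features above — a WITH clause naming an inline subquery, and LAG comparing each row to the previous one:

```sql
WITH daily AS (
  SELECT DATE '2024-01-01' AS day, 10 AS sales UNION ALL
  SELECT DATE '2024-01-02', 15 UNION ALL
  SELECT DATE '2024-01-03', 12
)
SELECT
  day,
  sales,
  -- LAG looks one row back in the ORDER BY day window (NULL for the first row)
  sales - LAG(sales) OVER (ORDER BY day) AS change_vs_prev_day
FROM daily
ORDER BY day;
```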
- Caches query results, so repeated identical queries can be served without re-scanning data
- Optimization:
- Partition tables
- Group data into sections allowing BQ to scan less data
- Partition pruning occurs before the query is run, allowing you to know costs up-front
- Common partition source is dates/times
- Roughly analogous to a coarse index over the entire table
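The date-based partitioning above looks like this in DDL (table and column names are hypothetical):

```sql
-- Partition rows by the calendar date of a TIMESTAMP column.
CREATE TABLE `mydataset.events`
(
  event_time TIMESTAMP,
  user_id    STRING,
  action     STRING
)
PARTITION BY DATE(event_time);

-- Filtering on the partition column lets BigQuery prune partitions
-- before execution, so the bytes-scanned estimate reflects the pruning.
SELECT user_id, action
FROM `mydataset.events`
WHERE DATE(event_time) = '2024-01-01';
```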
- Clustering
- Can be done against multiple columns
- Clustering is applied while the query runs, so you can't know up-front exactly what the query will cost or how much clustering will save you
- More granular than partitioning
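Clustering against multiple columns is declared at table creation (a sketch with hypothetical names; BigQuery allows up to four clustering columns):

```sql
-- Clustering sorts data within each partition by the listed columns,
-- so filters on country and user_id scan fewer blocks.
CREATE TABLE `mydataset.events_clustered`
(
  event_time TIMESTAMP,
  country    STRING,
  user_id    STRING
)
PARTITION BY DATE(event_time)
CLUSTER BY country, user_id;
```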
- Typically populated in batch inserts
- Offers streaming inserts to allow smaller, more frequent inserts w/ lower latency, at a higher cost
- Max streaming capability ≈ 100,000 rows/table/sec.
- Offers access control at project, dataset, and table/view levels
Similar Systems
BigQuery vs. BigTable
Among many other differences:
BigTable is designed for long, narrow tables.
BigQuery typically has short(er), wide(r) tables.