# Step 4: Route to Storage
Route each table's data to organized cloud storage paths with compression and appropriate storage classes.
## The Goal
- Route records to table-specific paths
- Organize by date for easy recovery
- Compress with Parquet for cost efficiency
- Use Nearline storage class for backups
## Implementation

```yaml title="step-4-route.yaml"
output:
  switch:
    cases:
      # Orders table
      - check: this._table == "orders"
        output:
          gcp_cloud_storage:
            bucket: "${GCS_BACKUP_BUCKET}"
            path: "backups/orders/${!this._backup_metadata.backup_date}/orders-${!timestamp_unix()}.parquet"
            content_type: application/octet-stream
            storage_class: NEARLINE
            batching:
              count: 10000
              period: 60s
            parquet_encoding:
              compression: SNAPPY
      # Inventory table
      - check: this._table == "inventory"
        output:
          gcp_cloud_storage:
            bucket: "${GCS_BACKUP_BUCKET}"
            path: "backups/inventory/${!this._backup_metadata.backup_date}/inventory-full.parquet"
            content_type: application/octet-stream
            storage_class: NEARLINE
            batching:
              count: 50000
              period: 120s
            parquet_encoding:
              compression: SNAPPY
      # Order items table
      - check: this._table == "order_items"
        output:
          gcp_cloud_storage:
            bucket: "${GCS_BACKUP_BUCKET}"
            path: "backups/order_items/${!this._backup_metadata.backup_date}/items-${!timestamp_unix()}.parquet"
            content_type: application/octet-stream
            storage_class: NEARLINE
            batching:
              count: 10000
              period: 60s
            parquet_encoding:
              compression: SNAPPY
      # Fallback for unknown tables
      - output:
          gcp_cloud_storage:
            bucket: "${GCS_BACKUP_BUCKET}"
            path: "backups/unknown/${!this._table}/${!this._backup_metadata.backup_date}/data-${!timestamp_unix()}.json"
            content_type: application/json
            storage_class: NEARLINE
            batching:
              count: 1000
              period: 30s
```
## Understanding the Code

| Component | Purpose |
|---|---|
| `switch.cases` | Route based on `_table` field |
| `${!this._backup_metadata.backup_date}` | Dynamic path from record |
| `${!timestamp_unix()}` | Unique file suffix |
| `storage_class: NEARLINE` | Cost-optimized cold storage |
| `parquet_encoding.compression: SNAPPY` | Fast compression |
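To make the routing decision explicit, the `switch` logic can be sketched in plain Python. This is an illustration only, not pipeline code: the table names, batch counts, and path prefixes mirror the YAML above, while the `route` helper itself is hypothetical.

```python
# Illustrative sketch of the switch routing above (not Benthos code).
ROUTES = {
    "orders":      {"prefix": "backups/orders",      "batch_count": 10000},
    "inventory":   {"prefix": "backups/inventory",   "batch_count": 50000},
    "order_items": {"prefix": "backups/order_items", "batch_count": 10000},
}

def route(record: dict) -> dict:
    """Pick a storage prefix and batch size for one record."""
    table = record.get("_table")
    cfg = ROUTES.get(table)
    if cfg is None:
        # Fallback case: unknown tables land under backups/unknown/<table>/
        return {"prefix": f"backups/unknown/{table}", "batch_count": 1000}
    return cfg

print(route({"_table": "orders"})["prefix"])     # backups/orders
print(route({"_table": "customers"})["prefix"])  # backups/unknown/customers
```

Unknown tables deliberately fall through to a catch-all path rather than being dropped, which is the same safety property the fallback case in the config provides.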
## Storage Class Comparison
| Class | Cost/GB/month | Access Fee | Use For |
|---|---|---|---|
| Standard | $0.020 | None | Active data |
| Nearline | $0.010 | $0.01/GB | Monthly access |
| Coldline | $0.004 | $0.02/GB | Quarterly access |
| Archive | $0.0012 | $0.05/GB | Yearly access |
For backups, Nearline is ideal: the data is accessed only during recovery.
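A quick back-of-envelope check using the table's illustrative per-GB prices (the 500 GB backup size is an assumption, not a figure from this tutorial):

```python
def monthly_cost(gb: float, price_per_gb: float) -> float:
    """Storage cost per month at a flat per-GB rate."""
    return gb * price_per_gb

backup_gb = 500  # hypothetical total backup size

standard = monthly_cost(backup_gb, 0.020)  # Standard class
nearline = monthly_cost(backup_gb, 0.010)  # Nearline class

# Nearline adds a $0.01/GB access fee, but only when you actually restore:
recovery_fee = backup_gb * 0.01

print(f"Standard: ${standard:.2f}/mo, Nearline: ${nearline:.2f}/mo, "
      f"one full recovery: ${recovery_fee:.2f}")
```

Because recoveries are rare, the access fee is paid occasionally at most, while the halved storage rate is saved every month.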
## Why Parquet?
| Format | 1M Rows Size | Query Speed |
|---|---|---|
| JSON | 500 MB | Slow (full scan) |
| CSV | 300 MB | Slow (full scan) |
| Parquet | 50-100 MB | Fast (columnar) |
70-90% compression + columnar queries = major cost savings.
## Path Organization

```text
gs://backup-bucket/
└── backups/
    ├── orders/
    │   ├── 2024-01-14/
    │   │   ├── orders-1705190400.parquet
    │   │   └── orders-1705194000.parquet
    │   └── 2024-01-15/
    │       └── orders-1705276800.parquet
    ├── inventory/
    │   └── 2024-01-15/
    │       └── inventory-full.parquet
    └── order_items/
        └── 2024-01-15/
            └── items-1705276800.parquet
```
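This layout is produced by the config's path interpolation. A hypothetical Python mirror of that template (not part of the pipeline) shows how one object path is formed:

```python
import time

def backup_path(table, backup_date, ts=None):
    """Mirror of the config's interpolated path:
    backups/<table>/<backup_date>/<table>-<unix_ts>.parquet

    Illustrative helper only; the real pipeline builds this via
    ${!this._backup_metadata.backup_date} and ${!timestamp_unix()}.
    """
    if ts is None:
        ts = int(time.time())
    return f"backups/{table}/{backup_date}/{table}-{ts}.parquet"

print(backup_path("orders", "2024-01-15", ts=1705276800))
# backups/orders/2024-01-15/orders-1705276800.parquet
```

Date-first partitioning under each table means a point-in-time restore is a single prefix listing, and the Unix-timestamp suffix keeps concurrent batch flushes from overwriting each other.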
## Production Considerations

### AWS S3 Alternative
```yaml
output:
  aws_s3:
    bucket: "${S3_BACKUP_BUCKET}"
    path: "backups/orders/${!this._backup_metadata.backup_date}/orders-${!timestamp_unix()}.parquet"
    storage_class: GLACIER_IR  # Similar to Nearline
    batching:
      count: 10000
```
### Encryption at Rest

Enable server-side encryption:

```yaml
gcp_cloud_storage:
  # GCS encrypts by default, but for customer-managed keys:
  kms_key_name: "projects/.../cryptoKeys/backup-key"
```
### Retention with Object Lifecycle

```bash
# Set via gsutil (delete after 1 year)
gsutil lifecycle set lifecycle.json gs://${GCS_BACKUP_BUCKET}
```
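A minimal `lifecycle.json` for that command might look like the following; the 365-day `age` and the `Delete` action are assumptions chosen to match the "delete after 1 year" comment, so adjust them to your retention policy:

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    }
  ]
}
```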
## Complete Pipeline
You've built all 4 backup steps! See the complete configuration: