
Step 4: Route to Storage

Route each table's data to organized cloud storage paths with compression and appropriate storage classes.

The Goal

  • Route records to table-specific paths
  • Organize by date for easy recovery
  • Compress with Parquet for cost efficiency
  • Use Nearline storage class for backups

Implementation

step-4-route.yaml

```yaml
output:
  switch:
    cases:
      # Orders table
      - check: this._table == "orders"
        output:
          gcp_cloud_storage:
            bucket: "${GCS_BACKUP_BUCKET}"
            path: "backups/orders/${!this._backup_metadata.backup_date}/orders-${!timestamp_unix()}.parquet"
            content_type: application/octet-stream
            storage_class: NEARLINE
            batching:
              count: 10000
              period: 60s
            parquet_encoding:
              compression: SNAPPY

      # Inventory table
      - check: this._table == "inventory"
        output:
          gcp_cloud_storage:
            bucket: "${GCS_BACKUP_BUCKET}"
            path: "backups/inventory/${!this._backup_metadata.backup_date}/inventory-full.parquet"
            content_type: application/octet-stream
            storage_class: NEARLINE
            batching:
              count: 50000
              period: 120s
            parquet_encoding:
              compression: SNAPPY

      # Order items table
      - check: this._table == "order_items"
        output:
          gcp_cloud_storage:
            bucket: "${GCS_BACKUP_BUCKET}"
            path: "backups/order_items/${!this._backup_metadata.backup_date}/items-${!timestamp_unix()}.parquet"
            content_type: application/octet-stream
            storage_class: NEARLINE
            batching:
              count: 10000
              period: 60s
            parquet_encoding:
              compression: SNAPPY

      # Fallback for unknown tables
      - output:
          gcp_cloud_storage:
            bucket: "${GCS_BACKUP_BUCKET}"
            path: "backups/unknown/${!this._table}/${!this._backup_metadata.backup_date}/data-${!timestamp_unix()}.json"
            content_type: application/json
            storage_class: NEARLINE
            batching:
              count: 1000
              period: 30s
```
Understanding the Code

| Component | Purpose |
| --- | --- |
| `switch.cases` | Route based on the `_table` field |
| `${!this._backup_metadata.backup_date}` | Dynamic path from the record |
| `${!timestamp_unix()}` | Unique file suffix |
| `storage_class: NEARLINE` | Cost-optimized cold storage |
| `parquet_encoding.compression: SNAPPY` | Fast compression |
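The first-match semantics of `switch.cases` can be sketched in plain Python. The `route` helper and the set of known tables here are illustrative only, not part of the pipeline configuration:

```python
# Sketch of the switch's first-match routing: known tables get their own
# path prefix, anything else falls through to the final (check-less) case.
def route(record):
    """Return the storage path prefix chosen for a record."""
    table = record.get("_table")
    known = {"orders", "inventory", "order_items"}
    if table in known:
        return f"backups/{table}/"
    # Fallback case: unknown tables land under backups/unknown/<table>/
    return f"backups/unknown/{table}/"

print(route({"_table": "orders"}))     # backups/orders/
print(route({"_table": "customers"}))  # backups/unknown/customers/
```

The important property, which the YAML shares, is that cases are tried in order and the last case has no `check`, so no record is ever dropped for having an unexpected table name.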

Storage Class Comparison

| Class | Cost/GB/month | Access Fee | Use For |
| --- | --- | --- | --- |
| Standard | $0.020 | None | Active data |
| Nearline | $0.010 | $0.01/GB | Monthly access |
| Coldline | $0.004 | $0.02/GB | Quarterly access |
| Archive | $0.0012 | $0.05/GB | Yearly access |

For backups, Nearline is ideal: the data is accessed only during recovery.
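A quick back-of-the-envelope calculation using the per-GB prices from the table above makes the trade-off concrete. The 500 GB backup set is an assumed workload, not a figure from this tutorial:

```python
# (storage $/GB/month, access $/GB) per class, from the comparison table.
classes = {
    "Standard": (0.020, 0.00),
    "Nearline": (0.010, 0.01),
    "Coldline": (0.004, 0.02),
    "Archive":  (0.0012, 0.05),
}

size_gb = 500  # assumed backup set size

for name, (store, access) in classes.items():
    monthly = size_gb * store          # recurring storage cost
    restore = size_gb * access         # one-off fee to read everything back
    print(f"{name}: ${monthly:.2f}/month, ${restore:.2f} per full restore")
```

Nearline halves the recurring cost relative to Standard, and the retrieval fee only applies in the rare recovery scenario; the colder classes trade ever-cheaper storage for steeper restore fees.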

Why Parquet?

| Format | 1M Rows Size | Query Speed |
| --- | --- | --- |
| JSON | 500 MB | Slow (full scan) |
| CSV | 300 MB | Slow (full scan) |
| Parquet | 50-100 MB | Fast (columnar) |

70-90% compression + columnar queries = major cost savings.
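Writing real Parquet requires a library such as pyarrow, but the compression side of the win can be approximated with the standard library alone. This sketch compares raw JSON rows against the same bytes gzip-compressed; the row shape is made up for illustration, and gzip stands in as a rough proxy for Parquet's Snappy encoding:

```python
import gzip
import json

# Repetitive, column-like row data (typical of table backups) compresses well.
rows = [
    {"order_id": i, "status": "shipped", "region": "us-east1"}
    for i in range(10_000)
]
raw = json.dumps(rows).encode()
packed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")
print(f"ratio: {len(packed) / len(raw):.2%}")
```

Parquet does better still in practice because its columnar layout groups similar values together before compressing, and it lets query engines read only the columns they need.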

Path Organization

```
gs://backup-bucket/
└── backups/
    ├── orders/
    │   ├── 2024-01-14/
    │   │   ├── orders-1705190400.parquet
    │   │   └── orders-1705194000.parquet
    │   └── 2024-01-15/
    │       └── orders-1705276800.parquet
    ├── inventory/
    │   └── 2024-01-15/
    │       └── inventory-full.parquet
    └── order_items/
        └── 2024-01-15/
            └── items-1705276800.parquet
```
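The interpolated paths that produce this layout can be mirrored in Python. `backup_path` is a hypothetical helper for illustration, not part of the pipeline:

```python
import time

def backup_path(table, backup_date, ts=None):
    """Mirror the backups/<table>/<date>/<table>-<unix_ts>.parquet pattern."""
    ts = int(time.time()) if ts is None else ts
    return f"backups/{table}/{backup_date}/{table}-{ts}.parquet"

print(backup_path("orders", "2024-01-15", ts=1705276800))
# backups/orders/2024-01-15/orders-1705276800.parquet
```

Because the date segment comes from the record's `_backup_metadata.backup_date` rather than from wall-clock time at write, replayed or late-arriving records still land under the day they belong to, which is what makes date-scoped recovery straightforward.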

Production Considerations

AWS S3 Alternative

```yaml
output:
  aws_s3:
    bucket: "${S3_BACKUP_BUCKET}"
    path: "backups/orders/${!this._backup_metadata.backup_date}/orders-${!timestamp_unix()}.parquet"
    storage_class: GLACIER_IR  # Similar to Nearline
    batching:
      count: 10000
```

Encryption at Rest

Enable server-side encryption:

```yaml
gcp_cloud_storage:
  # GCS encrypts by default, but for customer-managed keys:
  kms_key_name: "projects/.../cryptoKeys/backup-key"
```

Retention with Object Lifecycle

```bash
# Set via gsutil (delete after 1 year)
gsutil lifecycle set lifecycle.json gs://${GCS_BACKUP_BUCKET}
```
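The `lifecycle.json` file itself is not shown in this step. A minimal sketch of generating one, assuming GCS's standard lifecycle rule shape (delete objects older than 365 days):

```python
import json

# Build a GCS lifecycle policy: delete objects more than a year old.
# The {"rule": [...]} shape is what `gsutil lifecycle set` expects.
policy = {
    "rule": [
        {"action": {"type": "Delete"}, "condition": {"age": 365}}
    ]
}

rendered = json.dumps(policy, indent=2)
with open("lifecycle.json", "w") as f:
    f.write(rendered)
print(rendered)
```

Lifecycle rules run server-side, so expired backups are cleaned up even when the pipeline is down.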

Complete Pipeline

You've built all 4 backup steps! See the complete configuration: