
Step 4: Route to Storage

Route each table's data to organized cloud storage paths with compression and appropriate storage classes.

The Goal

  • Route records to table-specific paths
  • Organize by date for easy recovery
  • Compress with Parquet for cost efficiency
  • Use Nearline storage class for backups

Implementation

step-4-route.yaml

```yaml
output:
  switch:
    cases:
      # Orders table
      - check: this._table == "orders"
        output:
          gcp_cloud_storage:
            bucket: "${GCS_BACKUP_BUCKET}"
            path: "backups/orders/${!this._backup_metadata.backup_date}/orders-${!timestamp_unix()}.parquet"
            content_type: application/octet-stream
            storage_class: NEARLINE
            batching:
              count: 10000
              period: 60s
            parquet_encoding:
              compression: SNAPPY

      # Inventory table
      - check: this._table == "inventory"
        output:
          gcp_cloud_storage:
            bucket: "${GCS_BACKUP_BUCKET}"
            path: "backups/inventory/${!this._backup_metadata.backup_date}/inventory-full.parquet"
            content_type: application/octet-stream
            storage_class: NEARLINE
            batching:
              count: 50000
              period: 120s
            parquet_encoding:
              compression: SNAPPY

      # Order items table
      - check: this._table == "order_items"
        output:
          gcp_cloud_storage:
            bucket: "${GCS_BACKUP_BUCKET}"
            path: "backups/order_items/${!this._backup_metadata.backup_date}/items-${!timestamp_unix()}.parquet"
            content_type: application/octet-stream
            storage_class: NEARLINE
            batching:
              count: 10000
              period: 60s
            parquet_encoding:
              compression: SNAPPY

      # Fallback for unknown tables
      - output:
          gcp_cloud_storage:
            bucket: "${GCS_BACKUP_BUCKET}"
            path: "backups/unknown/${!this._table}/${!this._backup_metadata.backup_date}/data-${!timestamp_unix()}.json"
            content_type: application/json
            storage_class: NEARLINE
            batching:
              count: 1000
              period: 30s
```
Understanding the Code

| Component | Purpose |
| --- | --- |
| `switch.cases` | Route based on the `_table` field |
| `${!this._backup_metadata.backup_date}` | Dynamic path from the record |
| `${!timestamp_unix()}` | Unique file suffix |
| `storage_class: NEARLINE` | Cost-optimized cold storage |
| `parquet_encoding.compression: SNAPPY` | Fast compression |
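The first-match semantics of `switch.cases` can be sketched in plain Python. The `route` helper and the set of known tables here are illustrative only, not part of the pipeline configuration:

```python
# Sketch of the switch's first-match routing: known tables get their own
# path prefix, anything else falls through to the final (check-less) case.
def route(record):
    """Return the storage path prefix chosen for a record."""
    table = record.get("_table")
    known = {"orders", "inventory", "order_items"}
    if table in known:
        return f"backups/{table}/"
    # Fallback case: unknown tables land under backups/unknown/<table>/
    return f"backups/unknown/{table}/"

print(route({"_table": "orders"}))     # backups/orders/
print(route({"_table": "customers"}))  # backups/unknown/customers/
```

The important property, which the YAML shares, is that cases are tried in order and the last case has no `check`, so no record is ever dropped for having an unexpected table name.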

Storage Class Comparison

| Class | Cost/GB/month | Access Fee | Use For |
| --- | --- | --- | --- |
| Standard | $0.020 | None | Active data |
| Nearline | $0.010 | $0.01/GB | Monthly access |
| Coldline | $0.004 | $0.02/GB | Quarterly access |
| Archive | $0.0012 | $0.05/GB | Yearly access |

For backups, Nearline is ideal: the data is accessed only during recovery.
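A quick back-of-the-envelope calculation using the per-GB prices from the table above makes the trade-off concrete. The 500 GB backup set is an assumed workload, not a figure from this tutorial:

```python
# (storage $/GB/month, access $/GB) per class, from the comparison table.
classes = {
    "Standard": (0.020, 0.00),
    "Nearline": (0.010, 0.01),
    "Coldline": (0.004, 0.02),
    "Archive":  (0.0012, 0.05),
}

size_gb = 500  # assumed backup set size

for name, (store, access) in classes.items():
    monthly = size_gb * store          # recurring storage cost
    restore = size_gb * access         # one-off fee to read everything back
    print(f"{name}: ${monthly:.2f}/month, ${restore:.2f} per full restore")
```

Nearline halves the recurring cost relative to Standard, and the retrieval fee only applies in the rare recovery scenario; the colder classes trade ever-cheaper storage for steeper restore fees.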

Why Parquet?

| Format | 1M Rows Size | Query Speed |
| --- | --- | --- |
| JSON | 500 MB | Slow (full scan) |
| CSV | 300 MB | Slow (full scan) |
| Parquet | 50-100 MB | Fast (columnar) |

70-90% compression + columnar queries = major cost savings.
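Writing real Parquet requires a library such as pyarrow, but the compression side of the win can be approximated with the standard library alone. This sketch compares raw JSON rows against the same bytes gzip-compressed; the row shape is made up for illustration, and gzip stands in as a rough proxy for Parquet's Snappy encoding:

```python
import gzip
import json

# Repetitive, column-like row data (typical of table backups) compresses well.
rows = [
    {"order_id": i, "status": "shipped", "region": "us-east1"}
    for i in range(10_000)
]
raw = json.dumps(rows).encode()
packed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")
print(f"ratio: {len(packed) / len(raw):.2%}")
```

Parquet does better still in practice because its columnar layout groups similar values together before compressing, and it lets query engines read only the columns they need.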

Path Organization

```
gs://backup-bucket/
└── backups/
    ├── orders/
    │   ├── 2024-01-14/
    │   │   ├── orders-1705190400.parquet
    │   │   └── orders-1705194000.parquet
    │   └── 2024-01-15/
    │       └── orders-1705276800.parquet
    ├── inventory/
    │   └── 2024-01-15/
    │       └── inventory-full.parquet
    └── order_items/
        └── 2024-01-15/
            └── items-1705276800.parquet
```
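The interpolated paths that produce this layout can be mirrored in Python. `backup_path` is a hypothetical helper for illustration, not part of the pipeline:

```python
import time

def backup_path(table, backup_date, ts=None):
    """Mirror the backups/<table>/<date>/<table>-<unix_ts>.parquet pattern."""
    ts = int(time.time()) if ts is None else ts
    return f"backups/{table}/{backup_date}/{table}-{ts}.parquet"

print(backup_path("orders", "2024-01-15", ts=1705276800))
# backups/orders/2024-01-15/orders-1705276800.parquet
```

Because the date segment comes from the record's `_backup_metadata.backup_date` rather than from wall-clock time at write, replayed or late-arriving records still land under the day they belong to, which is what makes date-scoped recovery straightforward.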

Production Considerations

AWS S3 Alternative

```yaml
output:
  aws_s3:
    bucket: "${S3_BACKUP_BUCKET}"
    path: "backups/orders/${!this._backup_metadata.backup_date}/orders-${!timestamp_unix()}.parquet"
    storage_class: GLACIER_IR  # Similar to Nearline
    batching:
      count: 10000
```

Encryption at Rest

Enable server-side encryption:

```yaml
gcp_cloud_storage:
  # GCS encrypts by default, but for customer-managed keys:
  kms_key_name: "projects/.../cryptoKeys/backup-key"
```

Retention with Object Lifecycle

```bash
# Set via gsutil (delete after 1 year)
gsutil lifecycle set lifecycle.json gs://${GCS_BACKUP_BUCKET}
```
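The `lifecycle.json` file itself is not shown in this step. A minimal sketch of generating one, assuming GCS's standard lifecycle rule shape (delete objects older than 365 days):

```python
import json

# Build a GCS lifecycle policy: delete objects more than a year old.
# The {"rule": [...]} shape is what `gsutil lifecycle set` expects.
policy = {
    "rule": [
        {"action": {"type": "Delete"}, "condition": {"age": 365}}
    ]
}

rendered = json.dumps(policy, indent=2)
with open("lifecycle.json", "w") as f:
    f.write(rendered)
print(rendered)
```

Lifecycle rules run server-side, so expired backups are cleaned up even when the pipeline is down.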

Complete Pipeline

You've built all 4 backup steps! See the complete configuration: