Step 5: Standardize Schema

Convert DB2's UPPERCASE column names to BigQuery's lowercase snake_case convention and add partition fields. This replaces DataStage Transformer field mapping.

The Goal

Convert TRANSACTION_ID → transaction_id
Remove original uppercase fields
Add BigQuery partition field _partition_date
Carry forward all transformed fields from previous steps

Why This Matters

BigQuery Best Practices: Lowercase, snake_case column names Query Simplicity: SELECT transaction_id vs SELECT "TRANSACTION_ID" Partitioning: Date partitions dramatically improve query performance and reduce costs

DataStage Equivalent

In DataStage, schema mapping requires:

Transformer Stage with manual field-by-field assignment
Metadata propagation to ensure downstream stages see new names
Separate logic for partition field generation

Expanso simplifies this with explicit field mapping in a single processor.

Implementation

Add the schema standardization processor after categorization:

step-5-schema.yaml
pipeline:
  processors:
    # Steps 1-4 from previous...
    
    # Step 5: Standardize field names for BigQuery
    - mapping: |
        # Map DB2 columns to BigQuery format
        root.transaction_id = this.TRANSACTION_ID
        root.customer_id = this.CUSTOMER_ID
        root.transaction_date = this.TRANSACTION_DATE
        root.transaction_type = this.TRANSACTION_TYPE
        root.merchant_name = this.MERCHANT_NAME
        root.merchant_category_code = this.MERCHANT_CATEGORY_CODE
        root.source_system = this.SOURCE_SYSTEM
        root.created_at = this.CREATED_AT
        
        # Carry forward transformed fields from previous steps
        root.amount_usd = this.amount_usd
        root.original_amount = this.original_amount
        root.original_currency = this.original_currency
        root.account_number_masked = this.account_number_masked
        root.account_number_hash = this.account_number_hash
        root.transaction_category = this.transaction_category
        root._lineage = this._lineage
        
        # Add BigQuery partition field
        root._partition_date = this.TRANSACTION_DATE.ts_parse("2006-01-02").ts_format("2006-01-02")

Understanding the Code

Expression	What It Does
`root.transaction_id = this.TRANSACTION_ID`	Rename field to lowercase
`root = ...` (starting fresh)	Only include explicitly mapped fields
`.ts_parse("2006-01-02")`	Parse date string to timestamp
`.ts_format("2006-01-02")`	Format timestamp back to date string

Go Time Format

Expanso uses Go's time format convention where 2006-01-02 is the reference date. This is different from YYYY-MM-DD in other languages.

Expected Output

Input (after steps 1-4):

{
  "TRANSACTION_ID": "TXN-2024-00123456",
  "CUSTOMER_ID": "CUST-789012",
  "TRANSACTION_DATE": "2024-01-15",
  "TRANSACTION_TYPE": "PURCHASE",
  "amount_usd": 135.54,
  "account_number_masked": "****-****-9012",
  "transaction_category": "GROCERY",
  "_lineage": {...}
}

Output:

{
  "transaction_id": "TXN-2024-00123456",
  "customer_id": "CUST-789012",
  "transaction_date": "2024-01-15",
  "transaction_type": "PURCHASE",
  "merchant_name": "ACME Electronics GmbH",
  "merchant_category_code": "5411",
  "source_system": "CORE_BANKING_EU",
  "created_at": "2024-01-15T14:32:17Z",
  "amount_usd": 135.54,
  "original_amount": 125.50,
  "original_currency": "EUR",
  "account_number_masked": "****-****-9012",
  "account_number_hash": "a1b2c3d4e5f67890",
  "transaction_category": "GROCERY",
  "_partition_date": "2024-01-15",
  "_lineage": {...}
}

Production Considerations

Automatic Case Conversion

For tables with many columns, automate the conversion:

# Convert all top-level fields to lowercase
root = this.map_each_key(k -> k.lowercase())

Preserving Unknown Fields

If DB2 schema might change:

# Start with all fields lowercased
root = this.map_each_key(k -> k.lowercase())

# Then override specific fields
root._partition_date = this.transaction_date
root._lineage = this._lineage  # Preserve as-is

Multiple Partition Fields

For time and region partitioning:

root._partition_date = this.TRANSACTION_DATE
root._partition_region = match this.SOURCE_SYSTEM {
  this.contains("EU") => "eu",
  this.contains("US") => "us",
  this.contains("APAC") => "apac",
  _ => "global"
}

BigQuery Clustering

Add clustering keys for query performance:

# These fields should be declared as clustering keys in BigQuery
# Order matters: most filtered first
root._cluster_1 = this.transaction_category
root._cluster_2 = this.customer_id

Next Step

📖 Step 6: Validate Required Fields →

The Goal​

Why This Matters​

DataStage Equivalent​

Implementation​

Understanding the Code​

Expected Output​

Production Considerations​

Automatic Case Conversion​

Preserving Unknown Fields​

Multiple Partition Fields​

BigQuery Clustering​

Next Step​