Skip to main content

Step 5: Standardize Schema

Convert DB2's UPPERCASE column names to BigQuery's lowercase snake_case convention and add partition fields. This replaces DataStage Transformer field mapping.

The Goal

  • Convert TRANSACTION_IDtransaction_id
  • Remove original uppercase fields
  • Add BigQuery partition field _partition_date
  • Carry forward all transformed fields from previous steps

Why This Matters

BigQuery Best Practices: Lowercase, snake_case column names Query Simplicity: SELECT transaction_id vs SELECT "TRANSACTION_ID" Partitioning: Date partitions dramatically improve query performance and reduce costs

DataStage Equivalent

In DataStage, schema mapping requires:

  1. Transformer Stage with manual field-by-field assignment
  2. Metadata propagation to ensure downstream stages see new names
  3. Separate logic for partition field generation

Expanso simplifies this with explicit field mapping in a single processor.

Implementation

Add the schema standardization processor after categorization:

step-5-schema.yaml
pipeline:
processors:
# Steps 1-4 from previous...

# Step 5: Standardize field names for BigQuery
- mapping: |
# Map DB2 columns to BigQuery format
root.transaction_id = this.TRANSACTION_ID
root.customer_id = this.CUSTOMER_ID
root.transaction_date = this.TRANSACTION_DATE
root.transaction_type = this.TRANSACTION_TYPE
root.merchant_name = this.MERCHANT_NAME
root.merchant_category_code = this.MERCHANT_CATEGORY_CODE
root.source_system = this.SOURCE_SYSTEM
root.created_at = this.CREATED_AT

# Carry forward transformed fields from previous steps
root.amount_usd = this.amount_usd
root.original_amount = this.original_amount
root.original_currency = this.original_currency
root.account_number_masked = this.account_number_masked
root.account_number_hash = this.account_number_hash
root.transaction_category = this.transaction_category
root._lineage = this._lineage

# Add BigQuery partition field
root._partition_date = this.TRANSACTION_DATE.ts_parse("2006-01-02").ts_format("2006-01-02")

Understanding the Code

ExpressionWhat It Does
root.transaction_id = this.TRANSACTION_IDRename field to lowercase
root = ... (starting fresh)Only include explicitly mapped fields
.ts_parse("2006-01-02")Parse date string to timestamp
.ts_format("2006-01-02")Format timestamp back to date string
Go Time Format

Expanso uses Go's time format convention where 2006-01-02 is the reference date. This is different from YYYY-MM-DD in other languages.

Expected Output

Input (after steps 1-4):

{
"TRANSACTION_ID": "TXN-2024-00123456",
"CUSTOMER_ID": "CUST-789012",
"TRANSACTION_DATE": "2024-01-15",
"TRANSACTION_TYPE": "PURCHASE",
"amount_usd": 135.54,
"account_number_masked": "****-****-9012",
"transaction_category": "GROCERY",
"_lineage": {...}
}

Output:

{
"transaction_id": "TXN-2024-00123456",
"customer_id": "CUST-789012",
"transaction_date": "2024-01-15",
"transaction_type": "PURCHASE",
"merchant_name": "ACME Electronics GmbH",
"merchant_category_code": "5411",
"source_system": "CORE_BANKING_EU",
"created_at": "2024-01-15T14:32:17Z",
"amount_usd": 135.54,
"original_amount": 125.50,
"original_currency": "EUR",
"account_number_masked": "****-****-9012",
"account_number_hash": "a1b2c3d4e5f67890",
"transaction_category": "GROCERY",
"_partition_date": "2024-01-15",
"_lineage": {...}
}

Production Considerations

Automatic Case Conversion

For tables with many columns, automate the conversion:

# Convert all top-level fields to lowercase
root = this.map_each_key(k -> k.lowercase())

Preserving Unknown Fields

If DB2 schema might change:

# Start with all fields lowercased
root = this.map_each_key(k -> k.lowercase())

# Then override specific fields
root._partition_date = this.transaction_date
root._lineage = this._lineage # Preserve as-is

Multiple Partition Fields

For time and region partitioning:

root._partition_date = this.TRANSACTION_DATE
root._partition_region = match this.SOURCE_SYSTEM {
this.contains("EU") => "eu",
this.contains("US") => "us",
this.contains("APAC") => "apac",
_ => "global"
}

BigQuery Clustering

Add clustering keys for query performance:

# These fields should be declared as clustering keys in BigQuery
# Order matters: most filtered first
root._cluster_1 = this.transaction_category
root._cluster_2 = this.customer_id

Next Step