Step 5: Standardize Schema
Convert DB2's UPPERCASE column names to BigQuery's lowercase snake_case convention and add partition fields. This replaces DataStage Transformer field mapping.
The Goal
- Convert TRANSACTION_ID → transaction_id
- Remove the original uppercase fields
- Add the BigQuery partition field _partition_date
- Carry forward all transformed fields from previous steps
Why This Matters
- BigQuery Best Practices: lowercase, snake_case column names
- Query Simplicity: SELECT transaction_id instead of SELECT "TRANSACTION_ID"
- Partitioning: date partitions dramatically improve query performance and reduce costs
DataStage Equivalent
In DataStage, schema mapping requires:
- Transformer Stage with manual field-by-field assignment
- Metadata propagation to ensure downstream stages see new names
- Separate logic for partition field generation
Expanso simplifies this with explicit field mapping in a single processor.
Implementation
Add the schema standardization processor after categorization:
```yaml
pipeline:
  processors:
    # Steps 1-4 from previous...

    # Step 5: Standardize field names for BigQuery
    - mapping: |
        # Map DB2 columns to BigQuery format
        root.transaction_id = this.TRANSACTION_ID
        root.customer_id = this.CUSTOMER_ID
        root.transaction_date = this.TRANSACTION_DATE
        root.transaction_type = this.TRANSACTION_TYPE
        root.merchant_name = this.MERCHANT_NAME
        root.merchant_category_code = this.MERCHANT_CATEGORY_CODE
        root.source_system = this.SOURCE_SYSTEM
        root.created_at = this.CREATED_AT

        # Carry forward transformed fields from previous steps
        root.amount_usd = this.amount_usd
        root.original_amount = this.original_amount
        root.original_currency = this.original_currency
        root.account_number_masked = this.account_number_masked
        root.account_number_hash = this.account_number_hash
        root.transaction_category = this.transaction_category
        root._lineage = this._lineage

        # Add BigQuery partition field
        root._partition_date = this.TRANSACTION_DATE.ts_parse("2006-01-02").ts_format("2006-01-02")
```
Understanding the Code
| Expression | What It Does |
|---|---|
| root.transaction_id = this.TRANSACTION_ID | Rename the field to lowercase |
| Assigning only root.* fields (root starts fresh) | Only explicitly mapped fields appear in the output |
| .ts_parse("2006-01-02") | Parse the date string into a timestamp |
| .ts_format("2006-01-02") | Format the timestamp back to a date string |
Expanso uses Go's time-format convention, in which the reference date 2006-01-02 describes the layout. This differs from the YYYY-MM-DD placeholder style used in most other languages.
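As a quick illustration (a sketch; the iso_date field name is hypothetical), a layout is written by rendering Go's reference time, Mon Jan 2 15:04:05 MST 2006, in the desired shape:

```
# Reformat a European-style date by describing both the input
# and output layouts with Go's reference date
root.iso_date = "15/01/2024".ts_parse("02/01/2006").ts_format("2006-01-02")
# yields "2024-01-15"
```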
Expected Output
Input (after steps 1-4):
```json
{
  "TRANSACTION_ID": "TXN-2024-00123456",
  "CUSTOMER_ID": "CUST-789012",
  "TRANSACTION_DATE": "2024-01-15",
  "TRANSACTION_TYPE": "PURCHASE",
  "amount_usd": 135.54,
  "account_number_masked": "****-****-9012",
  "transaction_category": "GROCERY",
  "_lineage": {...}
}
```
Output:
```json
{
  "transaction_id": "TXN-2024-00123456",
  "customer_id": "CUST-789012",
  "transaction_date": "2024-01-15",
  "transaction_type": "PURCHASE",
  "merchant_name": "ACME Electronics GmbH",
  "merchant_category_code": "5411",
  "source_system": "CORE_BANKING_EU",
  "created_at": "2024-01-15T14:32:17Z",
  "amount_usd": 135.54,
  "original_amount": 125.50,
  "original_currency": "EUR",
  "account_number_masked": "****-****-9012",
  "account_number_hash": "a1b2c3d4e5f67890",
  "transaction_category": "GROCERY",
  "_partition_date": "2024-01-15",
  "_lineage": {...}
}
```
Production Considerations
Automatic Case Conversion
For tables with many columns, automate the conversion:
```
# Convert all top-level field names to lowercase
root = this.map_each_key(k -> k.lowercase())
```
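Applied to a small sample record, this one-liner behaves as follows. Note that lowercase() preserves underscores, so DB2's SNAKE_CASE names convert cleanly, but camelCase names would not become snake_case this way.

```
# Given input: {"TRANSACTION_ID": "TXN-1", "amount_usd": 135.54}
root = this.map_each_key(key -> key.lowercase())
# Produces:    {"transaction_id": "TXN-1", "amount_usd": 135.54}
```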
Preserving Unknown Fields
If DB2 schema might change:
```
# Start with all fields lowercased
root = this.map_each_key(k -> k.lowercase())
# Then override specific fields. Note: `this` still refers to the
# original uppercase input, so read from TRANSACTION_DATE
root._partition_date = this.TRANSACTION_DATE
root._lineage = this._lineage # Preserve as-is
```
Multiple Partition Fields
For time and region partitioning:
```
root._partition_date = this.TRANSACTION_DATE
root._partition_region = match this.SOURCE_SYSTEM {
  this.contains("EU") => "eu",
  this.contains("US") => "us",
  this.contains("APAC") => "apac",
  _ => "global"
}
```
BigQuery Clustering
Add clustering keys for query performance:
```
# These fields should be declared as clustering keys in BigQuery
# Order matters: put the most frequently filtered field first
root._cluster_1 = this.transaction_category
root._cluster_2 = this.customer_id
```
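Keep in mind that clustering itself is declared on the BigQuery table (for example in its CREATE TABLE DDL), not in the pipeline; the pipeline only needs to deliver the columns. A hypothetical sink configuration, assuming a Benthos-style gcp_bigquery output is available and with project, dataset, and table names made up for illustration, might look like:

```yaml
output:
  gcp_bigquery:
    project: my-project    # hypothetical project ID
    dataset: analytics     # hypothetical dataset
    table: transactions    # table created with partitioning/clustering DDL
```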