
Step 4: Batch and Encode to Parquet

This is where the magic happens. Individual JSON transactions become compressed, columnar Parquet: 10-50x smaller and dramatically faster to query.

Batching Strategy

Parquet files work best at 50-200MB. Too small and you get "small file problem" overhead in DuckLake. Too large and you lose parallelism. We use micro-batching:

    # Batch: collect 1000 transactions or flush every 10 seconds
    - batch:
        count: 1000
        period: 10s

At 10 TPS per store, the 10-second timer fires first, flushing batches of roughly 100 rows; the 1000-row count cap only triggers during bursts above 100 TPS. Either way, DuckLake's catalog inlining (see below) absorbs the smaller batches without small-file overhead.
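To make the interplay of the two triggers concrete, here is a minimal sketch (a hypothetical helper, not part of the pipeline config) assuming the standard micro-batch semantics of flushing on whichever trigger fires first:

```python
# Hypothetical helper illustrating the count-vs-period batch triggers.
# Assumes standard micro-batch semantics: flush on whichever fires first.

def flush_trigger(tps: float, count: int = 1000, period_s: float = 10.0):
    """Return (trigger, rows_per_flush) for a given transactions-per-second rate."""
    seconds_to_fill = count / tps
    if seconds_to_fill <= period_s:
        return "count", count                # busy: full count-sized batches
    return "period", int(tps * period_s)     # quiet: the timer flushes a partial batch

print(flush_trigger(120))  # ('count', 1000) -- burst traffic fills the batch first
print(flush_trigger(10))   # ('period', 100) -- steady 10 TPS: the timer wins
```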

Flatten for Parquet

Parquet handles flat columns best. We serialize nested items as JSON strings — DuckDB can parse these natively with json_extract:

    # Prepare for Parquet encoding
    - mapping: |
        root = this
        # Serialize nested items array as JSON string for Parquet
        root.items_json = this.items.format_json()
        # Drop the original nested array
        root = root.without("items")
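The same flatten step can be sketched in plain Python (stdlib only; the transaction shape here is an illustrative stand-in for the real feed):

```python
import json

def flatten_for_parquet(txn: dict) -> dict:
    """Mirror the mapping above: nested items array -> JSON string column."""
    flat = dict(txn)                               # root = this
    flat["items_json"] = json.dumps(txn["items"])  # items -> JSON string
    del flat["items"]                              # root.without("items")
    return flat

txn = {"txn_id": "t-001", "items": [{"sku": "A1", "qty": 2}]}
flat = flatten_for_parquet(txn)
print(flat["items_json"])  # '[{"sku": "A1", "qty": 2}]'
```

DuckDB can later pull fields back out of the string column with its JSON functions (e.g. `json_extract`), so nothing is lost by flattening.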

Encode to Parquet

    # Encode batch to Parquet with Zstandard compression
    - parquet_encode:
        schema:
          - name: txn_id
            type: UTF8
          - name: store_id
            type: INT32
          - name: terminal_id
            type: INT32
          - name: timestamp
            type: UTF8
          - name: type
            type: UTF8
          - name: payment_method
            type: UTF8
          - name: employee_id
            type: UTF8
          - name: items_json
            type: UTF8
          - name: item_count
            type: INT32
          - name: subtotal
            type: DOUBLE
          - name: tax_rate
            type: DOUBLE
          - name: tax_amount
            type: DOUBLE
          - name: total_amount
            type: DOUBLE
          - name: store_region
            type: UTF8
          - name: store_format
            type: UTF8
          - name: store_city
            type: UTF8
          - name: store_state
            type: UTF8
          - name: store_sqft
            type: INT32
          - name: hour_of_day
            type: INT32
          - name: day_of_week
            type: UTF8
          - name: is_weekend
            type: BOOLEAN
          - name: basket_size
            type: INT32
          - name: avg_item_price
            type: DOUBLE
          - name: anomaly_flags
            type: UTF8
          - name: is_anomaly
            type: BOOLEAN
          - name: quality_score
            type: UTF8
        default_compression: zstd

Compression Results

Format                   Size per 1000 txns   Compression
Raw JSON                 ~2.1 MB              1x (baseline)
Parquet (uncompressed)   ~280 KB              7.5x
Parquet (Zstandard)      ~85 KB               25x

That's 25x compression before the data even leaves the store. A retail chain doing 43M transactions/day goes from ~90GB of raw JSON to ~3.6GB of Parquet.

Why Parquet + DuckLake?

  • Columnar: SELECT sum(total_amount) FROM ... reads only the total_amount column, never the full rows
  • Typed: No runtime type parsing — DuckDB knows every column type at read time
  • Compressed: Zstandard on columnar data compresses extremely well (repeated store IDs, payment methods, etc.)
  • DuckLake inlining: Small batches get inlined into the catalog, periodically flushed to Parquet — no small-file overhead
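The columnar-compression point is easy to demonstrate with nothing but the standard library. The sketch below compares compressing synthetic transactions row-by-row versus column-by-column (zlib stands in for Zstandard here, and the data and numbers are illustrative, not the pipeline's actual ratios):

```python
import json
import zlib

# 1000 synthetic transactions with repetitive categorical fields,
# like the store IDs and payment methods in the real feed
rows = [
    {"store_id": i % 50,
     "payment_method": ["card", "cash", "mobile"][i % 3],
     "total_amount": round(5 + (i % 97) * 0.37, 2)}
    for i in range(1000)
]

# Row-oriented: newline-delimited JSON, the shape the raw feed arrives in
row_bytes = "\n".join(json.dumps(r) for r in rows).encode()

# Column-oriented: each field's values stored contiguously, as Parquet does
col_bytes = json.dumps({k: [r[k] for r in rows] for k in rows[0]}).encode()

print(f"rows:    {len(row_bytes):>6} B -> {len(zlib.compress(row_bytes, 9))} B")
print(f"columns: {len(col_bytes):>6} B -> {len(zlib.compress(col_bytes, 9))} B")
```

The columnar layout starts smaller (field names appear once, not per row) and the grouped, repetitive values give the compressor longer runs to work with.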

Next Step

Parquet-encoded batches are ready. Next, we write them to S3 with Hive-style partitioning for automatic DuckLake discovery.

Step 5: Ship to S3 with Partitioning