Step 5: Ship to S3 with Partitioning

The final edge step: write Parquet files to S3 with Hive-style partitioning so DuckLake can auto-discover and index them.

Hive-Style Partitioning

DuckLake expects partitioned paths like:

```
s3://my-retail-data/transactions/
  region=NW/date=2026-02-11/batch_1707616800_001.parquet
  region=NW/date=2026-02-11/batch_1707616810_002.parquet
  region=SW/date=2026-02-11/batch_1707616800_001.parquet
  ...
```

This lets DuckLake skip entire partitions when filtering by region or date — massive query speedups on large datasets.
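The path scheme is easy to reason about in plain code. A minimal Python sketch of the same layout (the `partition_key` helper and its arguments are illustrative, not part of the pipeline config):

```python
from datetime import datetime, timezone

def partition_key(region: str, ts: datetime, seq: int) -> str:
    # Hive-style key: transactions/region=<R>/date=<YYYY-MM-DD>/batch_<unix>_<seq>.parquet
    date = ts.strftime("%Y-%m-%d")
    return (
        f"transactions/region={region}/date={date}/"
        f"batch_{int(ts.timestamp())}_{seq:03d}.parquet"
    )

print(partition_key("NW", datetime(2026, 2, 11, tzinfo=timezone.utc), 1))
```

Because `region` and `date` are encoded as `key=value` path segments, any Hive-aware reader can prune partitions from the path alone, without opening a single file.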

S3 Output

```yaml
output:
  aws_s3:
    bucket: "${S3_BUCKET}"
    path: >-
      transactions/region=${! json("store_region") }/date=${! now().ts_format("2006-01-02") }/batch_${! timestamp_unix() }_${! count("s3_files") }.parquet
    content_type: application/octet-stream
    max_in_flight: 4
    batching:
      count: 1
      processors: []
```

Local Fallback for Development

For local development or testing without S3, write to the filesystem:

```yaml
output:
  file:
    path: >-
      ./retail_data/region=${! json("store_region") }/date=${! now().ts_format("2006-01-02") }/batch_${! timestamp_unix() }_${! count("files") }.parquet
    codec: all-bytes
```
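Once a few batches have landed locally, you can sanity-check the partition layout from Python. A quick sketch (the `partition_counts` helper is illustrative, not part of the pipeline):

```python
from collections import Counter
from pathlib import Path

def partition_counts(root: str = "./retail_data") -> Counter:
    # Tally Parquet files per region=/date= partition under the fallback dir.
    counts = Counter()
    base = Path(root)
    for f in base.glob("region=*/date=*/*.parquet"):
        counts[str(f.parent.relative_to(base))] += 1
    return counts
```

If a partition is missing or a file landed outside the `region=*/date=*` pattern, it shows up immediately in the counts.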

Dual Output: S3 + Local Buffer

In production, write to both S3 and a local buffer. If S3 connectivity drops, the local buffer retains data for retry:

```yaml
output:
  broker:
    pattern: fan_out
    outputs:
      # Primary: S3
      - aws_s3:
          bucket: "${S3_BUCKET}"
          path: >-
            transactions/region=${! json("store_region") }/date=${! now().ts_format("2006-01-02") }/batch_${! timestamp_unix() }_${! count("s3_files") }.parquet
          content_type: application/octet-stream
          max_in_flight: 4

      # Backup: local filesystem (7-day retention)
      - file:
          path: >-
            /var/expanso/retail-buffer/batch_${! timestamp_unix() }.parquet
          codec: all-bytes
```
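Note that the 7-day retention mentioned in the comment is not enforced by the `file` output itself; it never deletes anything. Something external, such as a daily cron job, has to prune the buffer. A minimal sketch, assuming the buffer path from the config above (the `prune` helper and constants are assumptions, not part of the pipeline):

```python
import time
from pathlib import Path

BUFFER_DIR = Path("/var/expanso/retail-buffer")  # buffer path from the config above
MAX_AGE = 7 * 24 * 3600                          # 7-day retention, in seconds

def prune(buffer_dir: Path = BUFFER_DIR, max_age: int = MAX_AGE) -> int:
    # Delete buffered Parquet batches older than max_age; return how many were removed.
    cutoff = time.time() - max_age
    removed = 0
    for f in buffer_dir.glob("batch_*.parquet"):
        if f.stat().st_mtime < cutoff:
            f.unlink()
            removed += 1
    return removed
```

Run it from cron once a day and the buffer stays bounded even during long S3 outages.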

What Lands in S3

After running for a few minutes with 50 stores:

```
s3://my-retail-data/transactions/
├── region=MW/
│   └── date=2026-02-11/
│       ├── batch_1707616800_001.parquet   (85 KB, ~1000 txns)
│       ├── batch_1707616810_002.parquet   (82 KB, ~1000 txns)
│       └── ...
├── region=NE/
│   └── date=2026-02-11/
│       └── ...
├── region=NW/
│   └── date=2026-02-11/
│       └── ...
├── region=SE/
│   └── date=2026-02-11/
│       └── ...
└── region=SW/
    └── date=2026-02-11/
        └── ...
```

Each file: ~85 KB, ~1000 transactions, Zstandard-compressed Parquet. DuckLake picks them up automatically.

Next Step

Data is landing in S3. Now we connect MotherDuck's DuckLake to auto-index it and run SQL queries across all stores.

Step 6: Connect DuckLake and Query