Step 6: Connect DuckLake and Query in MotherDuck
Data is landing in S3 as partitioned Parquet. Now we bring it under a DuckLake catalog in MotherDuck and start querying.
Set Up DuckLake
1. Connect to MotherDuck
-- In DuckDB CLI or MotherDuck UI
INSTALL ducklake;
LOAD ducklake;
-- Connect to MotherDuck
ATTACH 'md:';
2. Create a DuckLake Catalog
-- Create a DuckLake catalog pointing to your S3 bucket
-- Table metadata lives in the DuckLake catalog database itself,
-- so only the data path needs to be specified
ATTACH 'ducklake:retail_analytics' AS retail_analytics
    (DATA_PATH 's3://my-retail-data/ducklake');
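One prerequisite the snippet above glosses over: the session needs S3 credentials to read the bucket. A sketch using DuckDB's secrets manager (the secret name and credential values are placeholders):

```sql
-- Grant this session read access to the bucket (placeholder credentials)
CREATE SECRET retail_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'us-east-1'
);
```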
3. Create the Transactions Table
USE retail_analytics;
CREATE TABLE transactions AS
SELECT *
FROM read_parquet('s3://my-retail-data/transactions/**/*.parquet',
hive_partitioning=true);
DuckLake now manages this table. Note that CREATE TABLE AS copies a snapshot of the existing Parquet files into DuckLake-managed storage; micro-batches that land in S3 afterward are brought in with an INSERT INTO ... SELECT, and each append becomes a new catalog snapshot.
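If your setup does not ingest new files on its own, later micro-batches can be folded in with a plain append. A sketch, assuming the same bucket layout (the date filter in the path is illustrative):

```sql
-- Append the latest micro-batches to the DuckLake-managed table
-- (path filter is illustrative; adjust to your partition scheme)
INSERT INTO transactions
SELECT *
FROM read_parquet('s3://my-retail-data/transactions/date=2025-06-01/**/*.parquet',
                  hive_partitioning = true);
```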
Query Examples
Revenue by Region (Today)
SELECT
store_region,
COUNT(*) as txn_count,
SUM(total_amount) as revenue,
AVG(total_amount) as avg_ticket,
AVG(basket_size) as avg_basket
FROM transactions
WHERE date = CURRENT_DATE
AND type = 'sale'
GROUP BY store_region
ORDER BY revenue DESC;
┌──────────────┬───────────┬───────────┬────────────┬────────────┐
│ store_region │ txn_count │ revenue   │ avg_ticket │ avg_basket │
├──────────────┼───────────┼───────────┼────────────┼────────────┤
│ SW           │    18,432 │ 487,291.4 │      26.43 │        3.2 │
│ NW           │    16,891 │ 445,120.8 │      26.35 │        3.1 │
│ NE           │    15,203 │ 401,887.2 │      26.43 │        3.3 │
│ SE           │    12,876 │ 340,092.1 │      26.41 │        3.1 │
│ MW           │    11,598 │ 306,389.5 │      26.42 │        3.2 │
└──────────────┴───────────┴───────────┴────────────┴────────────┘
Hourly Sales Heatmap
SELECT
hour_of_day,
day_of_week,
COUNT(*) as txn_count,
SUM(total_amount) as revenue
FROM transactions
WHERE type = 'sale'
AND date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY hour_of_day, day_of_week
ORDER BY hour_of_day, day_of_week;
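The result comes back as one row per (hour, day) pair. DuckDB's PIVOT statement can reshape it into an actual grid, one row per hour and one revenue column per weekday. A sketch:

```sql
-- One row per hour, one revenue column per day_of_week value
PIVOT (
    SELECT hour_of_day, day_of_week, total_amount
    FROM transactions
    WHERE type = 'sale'
      AND date >= CURRENT_DATE - INTERVAL '7 days'
) ON day_of_week USING SUM(total_amount)
GROUP BY hour_of_day
ORDER BY hour_of_day;
```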
Top Categories by Store Format
WITH exploded AS (
    SELECT
        store_format,
        -- assumes items_json is a JSON array of {category, qty, unit_price} objects
        UNNEST(from_json(items_json,
            '[{"category":"VARCHAR","qty":"INTEGER","unit_price":"DOUBLE"}]')) AS item
    FROM transactions
    WHERE type = 'sale'
)
SELECT
    store_format,
    item.category,
    SUM(item.qty * item.unit_price) AS category_revenue
FROM exploded
GROUP BY store_format, item.category
ORDER BY store_format, category_revenue DESC;
Anomaly Report
SELECT
quality_score,
anomaly_flags,
COUNT(*) as count,
SUM(total_amount) as total_value
FROM transactions
WHERE is_anomaly = true
GROUP BY quality_score, anomaly_flags
ORDER BY count DESC;
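Because every DuckLake commit is recorded as a snapshot, the same report can be re-run against an earlier state of the table for before/after comparisons. A sketch, assuming the catalog from Step 2 (the snapshot function and version number are per the DuckLake time-travel docs; treat them as assumptions for your version):

```sql
-- List the snapshots DuckLake has recorded for this catalog
SELECT * FROM retail_analytics.snapshots();

-- Count anomalies as of an earlier snapshot (version 5 is illustrative)
SELECT COUNT(*) AS anomaly_count
FROM transactions AT (VERSION => 5)
WHERE is_anomaly = true;
```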
Store Performance Scorecard
SELECT
store_id,
store_city,
store_format,
store_sqft,
COUNT(*) as daily_txns,
SUM(total_amount) as daily_revenue,
ROUND(SUM(total_amount) / store_sqft * 365, 2) as annual_rev_per_sqft,
AVG(basket_size) as avg_basket,
SUM(CASE WHEN is_anomaly THEN 1 ELSE 0 END) as anomaly_count
FROM transactions
WHERE date = CURRENT_DATE AND type = 'sale'
GROUP BY ALL
ORDER BY annual_rev_per_sqft DESC;
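To expose the scorecard to dashboards without re-pasting the SQL, it can be saved as a view in the catalog (the view name is illustrative):

```sql
-- Illustrative: persist the scorecard query as a view
CREATE OR REPLACE VIEW store_scorecard AS
SELECT
    store_id,
    store_city,
    store_format,
    store_sqft,
    COUNT(*) AS daily_txns,
    SUM(total_amount) AS daily_revenue,
    ROUND(SUM(total_amount) / store_sqft * 365, 2) AS annual_rev_per_sqft,
    AVG(basket_size) AS avg_basket,
    SUM(CASE WHEN is_anomaly THEN 1 ELSE 0 END) AS anomaly_count
FROM transactions
WHERE date = CURRENT_DATE AND type = 'sale'
GROUP BY ALL;
```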
DuckLake Data Inlining
For near-real-time queries, DuckLake's data inlining feature is key. Rows from small inserts (like our micro-batches) are written into the catalog database itself rather than into separate Parquet files on S3, and DuckLake periodically flushes them out to full-size Parquet files. This gives you:
- Instant availability: Data is queryable the moment it's ingested
- No small-file overhead: Automatic compaction handles the rest
- Best of both worlds: Real-time inserts with analytical performance
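Inlining is configured at attach time with a row-count threshold. A sketch (the option name follows the DuckLake docs; treat it as an assumption for your version):

```sql
-- Inline inserts smaller than 1,000 rows into the catalog database
ATTACH 'ducklake:retail_analytics' AS retail_analytics
    (DATA_PATH 's3://my-retail-data/ducklake',
     DATA_INLINING_ROW_LIMIT 1000);
```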
What You've Built
50 Stores → Expanso Edge → S3 (Partitioned Parquet) → DuckLake → MotherDuck SQL
                 │                                       │
                 ├── Parse POS JSON                      ├── Auto-index new files
                 ├── Enrich (store metadata)             ├── Inline small batches
                 ├── Validate (anomaly flags)            ├── Partition pruning
                 ├── Batch (1000 rows / 10s)             └── Serverless scale
                 ├── Encode (Zstandard Parquet)
                 └── Ship (Hive partitioned)
- Edge processing: ~3.6 GB/day of Parquet (down from ~90 GB raw JSON)
- Query latency: sub-second on MotherDuck for most analytical queries
- Data freshness: ~10 seconds from POS terminal to queryable in MotherDuck
Next Step
Want the complete pipeline in one file? Copy, paste, deploy.