
Step 1: Generate Test Data

Learn how to create realistic synthetic log data for developing and testing your log enrichment pipeline. This foundation step sets up consistent, controllable data generation that simulates real application logs.

What You'll Build

In this step, you'll create a log generator that produces realistic application logs with:

  • Unique event IDs and request tracking
  • Realistic timestamps and log levels
  • Service identification and user context
  • Variable message content and severity levels
  • Controlled generation rates for testing

Why Start with Generated Data?

  • Consistency: Generated data provides predictable patterns for testing transformations
  • Control: Adjust message rates, formats, and content to test different scenarios
  • Safety: Develop without exposing real user data or production logs
  • Scalability: Test high-volume scenarios without impacting production systems

The Base Log Structure

We'll generate logs that match common application logging patterns:

{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "service": "demo-service",
  "message": "Demo log message from edge",
  "user_id": "user_123",
  "request_id": "b2c3d4e5-f6a7-8901-bcde-f12345678901"
}
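
When you check generated records outside the pipeline, it can help to have the same structure available in a script. The sketch below builds one record with the fields shown above in plain Python. `make_log` is a hypothetical helper name used only for illustration; it is not part of the pipeline tooling.

```python
import json
import uuid
from datetime import datetime, timezone

# Field names mirror the base log structure; make_log is an illustrative helper.
def make_log(level="INFO", service="demo-service", user_id="user_123"):
    return {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "service": service,
        "message": "Demo log message from edge",
        "user_id": user_id,
        "request_id": str(uuid.uuid4()),
    }

record = make_log()
print(json.dumps(record))
```

Each call produces fresh `id` and `request_id` values, so two records never share an event ID.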

Implementation

Basic Log Generator

Start with a simple generator that produces logs every 2 seconds:

step1-basic-generator.yaml
input:
  generate:
    interval: 2s
    mapping: |
      root.id = uuid_v4()
      root.timestamp = now()
      root.level = "INFO"
      root.service = "demo-service"
      root.message = "Demo log message from edge"
      root.user_id = "user_123"
      root.request_id = uuid_v4()

output:
  stdout: {}

Deploy and test this basic generator. The generate input creates messages continuously, which you can observe in stdout.

Expected output:

{"id":"a1b2c3d4-e5f6-7890-abcd-ef1234567890","timestamp":"2024-01-15T10:30:00Z","level":"INFO","service":"demo-service","message":"Demo log message from edge","user_id":"user_123","request_id":"b2c3d4e5-f6a7-8901-bcde-f12345678901"}

Enhanced Generator with Variety

Add realistic variation to make logs more representative of real applications:

step1-varied-generator.yaml
input:
  generate:
    interval: 1s
    mapping: |
      # Basic event identification
      root.id = uuid_v4()
      root.timestamp = now()
      root.request_id = uuid_v4()

      # Vary log levels with a realistic distribution:
      # 50% INFO, 20% WARN, 10% ERROR, 20% DEBUG
      let level = [
        "INFO", "INFO", "INFO", "INFO", "INFO",
        "WARN", "WARN",
        "ERROR",
        "DEBUG", "DEBUG"
      ].index(random_int() % 10)
      root.level = $level

      # Rotate between different services
      let services = ["auth-service", "payment-service", "user-service", "notification-service"]
      let service = $services.index(random_int() % $services.length())
      root.service = $service

      # Generate varied messages based on service and level.
      # The generate input has no incoming document, so the chosen values
      # are kept in variables rather than read back through `this`.
      root.message = match {
        $service == "auth-service" && $level == "INFO" => "User authentication successful"
        $service == "auth-service" && $level == "WARN" => "Failed login attempt detected"
        $service == "auth-service" && $level == "ERROR" => "Authentication service timeout"
        $service == "payment-service" && $level == "INFO" => "Payment processed successfully"
        $service == "payment-service" && $level == "WARN" => "Payment processing delayed"
        $service == "payment-service" && $level == "ERROR" => "Payment gateway connection failed"
        $service == "user-service" && $level == "INFO" => "User profile updated"
        $service == "user-service" && $level == "WARN" => "Profile validation warning"
        $service == "user-service" && $level == "ERROR" => "Database connection error"
        $service == "notification-service" && $level == "INFO" => "Email notification sent"
        $service == "notification-service" && $level == "WARN" => "SMS rate limit exceeded"
        $service == "notification-service" && $level == "ERROR" => "Notification service unavailable"
        _ => "Generic log message from " + $service
      }

      # Generate realistic user IDs
      let user_ids = ["user_123", "user_456", "user_789", "user_abc", "user_def"]
      root.user_id = $user_ids.index(random_int() % $user_ids.length())

output:
  stdout: {}

Deploy the enhanced generator. You will see varied output in stdout.

Expected output (varied):

{"id":"...","timestamp":"...","level":"INFO","service":"auth-service","message":"User authentication successful","user_id":"user_123","request_id":"..."}
{"id":"...","timestamp":"...","level":"WARN","service":"payment-service","message":"Payment processing delayed","user_id":"user_456","request_id":"..."}
{"id":"...","timestamp":"...","level":"ERROR","service":"user-service","message":"Database connection error","user_id":"user_789","request_id":"..."}
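
The level distribution in this generator comes from indexing a ten-slot list with a random number modulo 10, so each slot is picked with probability 1/10. As a quick check of the percentages, the following Python sketch counts the slots. `LEVEL_SLOTS` and `level_share` are illustrative names, not part of the pipeline.

```python
from collections import Counter

# Same ten slots as the mapping: 5 INFO, 2 WARN, 1 ERROR, 2 DEBUG.
LEVEL_SLOTS = ["INFO"] * 5 + ["WARN"] * 2 + ["ERROR"] + ["DEBUG"] * 2

def level_share(slots):
    """Return the percentage of slots occupied by each level."""
    counts = Counter(slots)
    return {level: 100 * n // len(slots) for level, n in counts.items()}

shares = level_share(LEVEL_SLOTS)
print(shares)  # {'INFO': 50, 'WARN': 20, 'ERROR': 10, 'DEBUG': 20}
```

To change the mix, add or remove slots; the list length sets the granularity of the percentages.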

Production-Like Generator

Create a generator that simulates real production patterns with additional fields and realistic data:

step1-production-generator.yaml
input:
  generate:
    interval: 500ms # Higher frequency for testing
    mapping: |
      # Core event data
      root.id = uuid_v4()
      root.timestamp = now()
      root.request_id = uuid_v4()

      # Realistic log level distribution: 60% INFO, 20% WARN, 15% ERROR, 5% FATAL
      let level_random = random_int() % 100
      let level = if $level_random < 60 {
        "INFO"
      } else if $level_random < 80 {
        "WARN"
      } else if $level_random < 95 {
        "ERROR"
      } else {
        "FATAL"
      }
      root.level = $level

      # Service identification
      let services = [
        {"name": "auth-service", "version": "1.2.3"},
        {"name": "payment-service", "version": "2.1.0"},
        {"name": "user-service", "version": "1.5.2"},
        {"name": "notification-service", "version": "3.0.1"},
        {"name": "analytics-service", "version": "1.0.0"}
      ]
      let selected_service = $services.index(random_int() % $services.length())
      let service = $selected_service.name
      root.service = $service
      root.service_version = $selected_service.version

      # Generate contextual messages. Numbers are converted with .string()
      # before concatenation, because + does not join strings and numbers.
      root.message = match {
        $service == "auth-service" && $level == "INFO" => "User login successful for session " + uuid_v4().slice(0, 8)
        $service == "auth-service" && $level == "WARN" => "Multiple failed login attempts from IP " + (random_int() % 255).string() + "." + (random_int() % 255).string() + ".xxx.xxx"
        $service == "auth-service" && $level == "ERROR" => "JWT token validation failed: expired token"
        $service == "payment-service" && $level == "INFO" => "Payment transaction completed: $" + (random_int() % 1000 + 10).string()
        $service == "payment-service" && $level == "WARN" => "Payment processing took longer than expected: " + (random_int() % 5000 + 1000).string() + "ms"
        $service == "payment-service" && $level == "ERROR" => "Credit card validation failed: invalid card number"
        $service == "user-service" && $level == "INFO" => "User profile updated successfully"
        $service == "user-service" && $level == "WARN" => "Profile image upload size exceeded limit: " + (random_int() % 20 + 5).string() + "MB"
        $service == "user-service" && $level == "ERROR" => "Database query timeout after " + (random_int() % 30 + 5).string() + " seconds"
        $service == "notification-service" && $level == "INFO" => "Email sent successfully to user"
        $service == "notification-service" && $level == "WARN" => "SMS delivery delayed due to carrier issues"
        $service == "notification-service" && $level == "ERROR" => "Push notification service connection failed"
        $service == "analytics-service" && $level == "INFO" => "Event tracking batch processed: " + (random_int() % 1000 + 100).string() + " events"
        $service == "analytics-service" && $level == "WARN" => "High memory usage detected: " + (random_int() % 20 + 80).string() + "%"
        $service == "analytics-service" && $level == "ERROR" => "Data ingestion pipeline failed"
        _ => "Generic operation completed in " + $service
      }

      # User context
      let user_ids = ["user_001", "user_002", "user_003", "user_004", "user_005", "user_006", "user_007", "user_008", "user_009", "user_010"]
      root.user_id = $user_ids.index(random_int() % $user_ids.length())

      # Request context and timing
      root.duration_ms = random_int() % 5000 + 50
      root.status_code = match {
        $level == "INFO" => [200, 201, 202].index(random_int() % 3)
        $level == "WARN" => [400, 401, 403, 429].index(random_int() % 4)
        $level == "ERROR" => [500, 502, 503, 504].index(random_int() % 4)
        _ => 500
      }

      # Environment context
      root.environment = "production"
      root.region = ["us-east-1", "us-west-2", "eu-west-1"].index(random_int() % 3)
      root.instance_id = "i-" + uuid_v4().slice(0, 8)

# Add file output for inspection
output:
  broker:
    pattern: fan_out
    outputs:
      # Console output for monitoring
      - stdout: {}

      # File output for detailed inspection
      - file:
          path: /tmp/generated-logs.jsonl
          codec: lines

# Metrics for monitoring generation
metrics:
  prometheus:
    prefix: log_generator

Deploy the production-like generator. Check the generated file:

tail -f /tmp/generated-logs.jsonl
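
The level selection in this generator turns a number from 0 to 99 into a level through a chain of cut-offs. The Python sketch below mirrors those cut-offs (60, 80, 95) and counts how many of the 100 possible values land on each level. `level_for` is an illustrative name.

```python
from collections import Counter

def level_for(n):
    """Map n in 0..99 to a level using the same cut-offs as the mapping."""
    if n < 60:
        return "INFO"
    if n < 80:
        return "WARN"
    if n < 95:
        return "ERROR"
    return "FATAL"

distribution = Counter(level_for(n) for n in range(100))
print(dict(distribution))  # {'INFO': 60, 'WARN': 20, 'ERROR': 15, 'FATAL': 5}
```

Comparing these expected shares against the counts in the generated file is a simple way to confirm the generator behaves as intended.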

Testing Different Scenarios

High-Volume Testing

Test how your pipeline handles high message volumes:

step1-high-volume.yaml
input:
  generate:
    interval: 10ms # 100 messages per second
    mapping: |
      root.id = uuid_v4()
      root.timestamp = now()
      root.level = "INFO"
      root.service = "load-test-service"
      root.message = "High volume test message " + uuid_v4().slice(0, 8)
      root.user_id = "user_" + (random_int() % 1000).string()
      root.request_id = uuid_v4()
      # Incrementing sequence number, useful for spotting dropped messages
      root.sequence = counter()

output:
  stdout: {}
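
The interval sets an upper bound on the message rate: one message per interval. The small Python sketch below does that arithmetic so you can size a test run in advance. The function names are illustrative, and the real throughput may be lower if the output cannot keep up.

```python
def messages_per_second(interval_ms):
    """Nominal rate for a generator that emits one message per interval."""
    return 1000 / interval_ms

def messages_in(duration_s, interval_ms):
    """Nominal number of messages produced over a test run."""
    return int(duration_s * messages_per_second(interval_ms))

print(messages_per_second(10))  # 100.0
print(messages_in(60, 10))      # 6000
```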

Error-Heavy Testing

Test error handling by generating many error messages:

step1-error-heavy.yaml
input:
  generate:
    interval: 1s
    mapping: |
      root.id = uuid_v4()
      root.timestamp = now()
      root.request_id = uuid_v4()

      # 70% errors for testing error handling
      let level = if random_int() % 10 < 7 { "ERROR" } else { "INFO" }
      root.level = $level

      root.service = "error-test-service"
      root.message = if $level == "ERROR" {
        "Simulated error: " + ["Database timeout", "Network unreachable", "Memory exhausted", "Invalid input", "Service unavailable"].index(random_int() % 5)
      } else {
        "Normal operation completed"
      }

      root.user_id = "user_" + (random_int() % 10).string()
      root.error_code = if $level == "ERROR" { random_int() % 5000 + 1000 } else { null }

output:
  stdout: {}

Common Generation Patterns

Time-Based Patterns

Generate logs that follow realistic time-based patterns:

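# Business hours simulation (inside the mapping)
let ts = now()
root.timestamp = $ts
let hour = $ts.format_timestamp("15", "UTC").number()
root.volume_multiplier = if $hour >= 9 && $hour <= 17 { 3 } else { 1 }

# The interval is a static config field, not a Bloblang expression.
# To generate more messages during business hours, set it per deployment
# through an environment variable (GENERATE_INTERVAL is an example name):
interval: ${GENERATE_INTERVAL:1s}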
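
The business-hours rule is an inclusive window on the hour of day, so it covers 09:00 through 17:59 UTC. The Python sketch below reproduces the rule to make the boundaries explicit. `volume_multiplier` here is an illustrative function named after the field.

```python
def volume_multiplier(hour):
    """Return 3 during business hours (09:00-17:59), otherwise 1."""
    return 3 if 9 <= hour <= 17 else 1

# Relative volume over a full day: 9 busy hours at 3x plus 15 quiet hours at 1x.
daily_units = sum(volume_multiplier(h) for h in range(24))
print(daily_units)  # 42
```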

User Behavior Simulation

Create realistic user session patterns:

# Simulate user sessions: pick one of 50 session IDs
let session_num = random_int() % 50
root.session_id = "session_" + $session_num.string()

# Session-based user consistency: each session always maps to the same user
root.user_id = "user_" + ($session_num % 10).string()
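
Session-based consistency means the same session ID must always resolve to the same user. One way to get that property for arbitrary session strings is to hash the ID and reduce it to a fixed number of users. The Python sketch below assumes SHA-256 and 100 users; both are illustrative choices, and `user_for_session` is a hypothetical helper.

```python
import hashlib

def user_for_session(session_id, user_count=100):
    """Deterministically map a session ID onto one of user_count users."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % user_count
    return "user_" + str(bucket)

first = user_for_session("session_7")
second = user_for_session("session_7")
print(first == second)  # True
```

Because the mapping depends only on the session ID, it stays stable across runs and machines.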

Service Dependencies

Model realistic service interaction patterns:

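# Service call chains: roughly one in three requests has a parent
let request_id = uuid_v4()
let has_parent = random_int() % 3 == 0
root.request_id = $request_id
root.parent_request_id = if $has_parent { uuid_v4() } else { null }
root.trace_id = if $has_parent { uuid_v4() } else { $request_id }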

Validation and Quality Checks

Verify Generated Data Quality

Check that your generated data meets requirements:

# Count messages by level
cat /tmp/generated-logs.jsonl | jq -r '.level' | sort | uniq -c

# Check timestamp distribution
cat /tmp/generated-logs.jsonl | jq -r '.timestamp' | head -20

# Verify unique IDs (the count should equal the total number of lines)
cat /tmp/generated-logs.jsonl | jq -r '.id' | sort | uniq | wc -l
wc -l < /tmp/generated-logs.jsonl

# Check service distribution
cat /tmp/generated-logs.jsonl | jq -r '.service' | sort | uniq -c
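
If you prefer a single script over several `jq` pipelines, the same checks can be sketched in Python. The example below assumes one JSON object per line with `id`, `level`, and `service` fields. `summarize` is an illustrative name, and the sample lines are made up for the demonstration.

```python
import json
from collections import Counter

def summarize(lines):
    """Count records, unique IDs, and records per level and per service."""
    records = [json.loads(line) for line in lines if line.strip()]
    return {
        "total": len(records),
        "unique_ids": len({r["id"] for r in records}),
        "levels": dict(Counter(r["level"] for r in records)),
        "services": dict(Counter(r["service"] for r in records)),
    }

# Made-up sample; the third line repeats an ID to show how duplicates surface.
sample = [
    '{"id": "a", "level": "INFO", "service": "auth-service"}',
    '{"id": "b", "level": "ERROR", "service": "user-service"}',
    '{"id": "a", "level": "INFO", "service": "auth-service"}',
]
summary = summarize(sample)
print(summary)
```

For a real run, pass the open file instead of the sample list, for example `summarize(open("/tmp/generated-logs.jsonl"))`. A gap between `total` and `unique_ids` points to duplicated event IDs.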

Monitor Generation Performance

Track generation metrics by watching the output file and checking message counts.
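
One rough way to turn message counts into a rate is to divide by the time span covered by the `timestamp` field. The Python sketch below assumes second-precision timestamps ending in `Z`, as in the expected output earlier in this step; timestamps with finer precision may need different parsing. `observed_rate` is an illustrative name.

```python
from datetime import datetime

def observed_rate(timestamps):
    """Messages per second across the span covered by the timestamps."""
    if len(timestamps) < 2:
        return None  # a rate needs at least two points
    parsed = sorted(datetime.fromisoformat(t.replace("Z", "+00:00")) for t in timestamps)
    span = (parsed[-1] - parsed[0]).total_seconds()
    if span <= 0:
        return None  # all messages share one timestamp
    return (len(parsed) - 1) / span

sample = ["2024-01-15T10:30:00Z", "2024-01-15T10:30:02Z", "2024-01-15T10:30:04Z"]
rate = observed_rate(sample)
print(rate)  # 0.5
```

A rate well below the nominal one for your interval suggests the output is slowing the generator down.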

Troubleshooting Generation Issues

Generator Not Starting

Problem: Pipeline fails to deploy

Solutions:

  1. Check YAML syntax:
     # Validate YAML
     python -c "import yaml; yaml.safe_load(open('step1-production-generator.yaml'))"
  2. Test simplified version:
     input:
       generate:
         interval: 1s
         mapping: 'root = {"test": "simple"}'
     output:
       stdout: {}

Generated Data Issues

Problem: Fields missing or incorrect format

Solutions:

  1. Test mapping logic:
     # Use Bloblang CLI to test mappings
     echo '{}' | bloblang 'root.level = ["INFO", "WARN"].index(0)'
  2. Add debug output:
     output:
       broker:
         pattern: fan_out
         outputs:
           - stdout: {}
           - file:
               path: /tmp/debug-generation.jsonl
               codec: lines

Performance Issues

Problem: Generation too slow or too fast

Solutions:

  1. Adjust interval:
     # Slower generation
     interval: 10s

     # Faster generation
     interval: 100ms
  2. Optimize mapping:
     # Pre-calculate arrays outside mapping
     root.service = ["service1", "service2", "service3"].index(random_int() % 3)

Real-World Applications

Development Environment

Use generated data to develop transformations without production access:

# Development data generator
input:
  generate:
    interval: 2s
    mapping: |
      # Mirror production log structure exactly
      root = {
        "timestamp": now(),
        "level": "INFO",
        "service": "dev-service",
        "message": "Development log entry",
        "version": "dev",
        "environment": "development"
      }

Load Testing

Generate high volumes to test pipeline capacity:

# Load test generator
input:
  generate:
    interval: 1ms # Very high frequency
    mapping: |
      root.id = uuid_v4()
      root.data = "x".repeat(1024) # 1KB message size
      root.timestamp = now()
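
Before running a load test, it is worth estimating how much data it will produce. The Python sketch below multiplies the payload size by the nominal message rate. It counts only the `data` field, so the real volume is somewhat higher once the ID, timestamp, and JSON keys are included. `payload_bytes_per_second` is an illustrative name.

```python
def payload_bytes_per_second(interval_ms, payload_bytes):
    """Nominal payload volume for one message of payload_bytes per interval."""
    return payload_bytes * 1000 // interval_ms

bps = payload_bytes_per_second(1, 1024)
print(bps, "bytes/s")                                   # 1024000 bytes/s
print(round(bps * 60 / (1024 * 1024), 1), "MiB/min")    # 58.6 MiB/min
```

At this rate a file output grows by roughly 3.4 GiB per hour, so check the available disk space or write to stdout only.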

Edge Case Testing

Generate problematic data to test error handling:

# Edge case generator
input:
  generate:
    interval: 5s
    mapping: |
      # Mix of valid and problematic data
      let valid = random_int() % 10 < 8
      root = if $valid {
        {
          "id": uuid_v4(),
          "level": "INFO",
          "message": "Normal message"
        }
      } else {
        {
          "malformed": true,
          "special_chars": "üñîçødé",
          "large_field": "x".repeat(10000),
          "null_value": null
        }
      }
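
Downstream code needs a way to tell the two record shapes apart. A minimal Python sketch of such a check follows, assuming that `id`, `level`, and `message` are the required fields, as in the valid branch above. `classify` and `REQUIRED_FIELDS` are illustrative names.

```python
REQUIRED_FIELDS = ("id", "level", "message")

def classify(record):
    """Return ("valid", []) or ("invalid", [missing field names])."""
    missing = [field for field in REQUIRED_FIELDS if record.get(field) is None]
    return ("valid", []) if not missing else ("invalid", missing)

good = {"id": "abc", "level": "INFO", "message": "Normal message"}
bad = {"malformed": True, "special_chars": "üñîçødé", "null_value": None}

print(classify(good))  # ('valid', [])
print(classify(bad))   # ('invalid', ['id', 'level', 'message'])
```

With the 80/20 split in the generator, about one record in five should come back as invalid.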

Key Takeaways

After completing this step, you understand:

✅ Data Generation: How to create realistic synthetic logs for testing
✅ Bloblang Basics: Using mapping functions for data transformation
✅ Testing Strategies: Different patterns for development and load testing
✅ Quality Validation: How to verify generated data meets requirements
✅ Performance Tuning: Adjusting generation rates for different scenarios

Next Steps

Your log generation foundation is ready. The next step adds lineage metadata to track data processing history.