Setup Environment for Event Deduplication

Before building the deduplication pipeline, you'll set up cache resources and prepare test data with realistic duplicates.

Prerequisites

This example requires its supporting services to be running, including the Redis instance used as the deduplication cache (referenced below as $REDIS_URL).

Before you begin, ensure these services are set up and running according to their respective guides, and complete the Local Development Setup guide for general environment configuration.

Step 1: Configure Example-Specific Variables

After setting up the core services, configure deduplication-specific variables:

# Analytics endpoint for processed events (optional)
export ANALYTICS_ENDPOINT="https://analytics.example.com/api/v1"

# Cache configuration
export DEDUP_TTL="1h"
export DEDUP_CACHE_SIZE="100000"

# Test webhook endpoint
export WEBHOOK_PORT="8080"

# Verify configuration
echo "Redis URL: $REDIS_URL"
echo "Cache TTL: $DEDUP_TTL"
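The two echo lines only print values, so an unset variable shows up as an empty string that is easy to miss. A minimal sketch of a stricter check follows; check_vars is a hypothetical helper, not part of the guide's tooling, and the variable list assumes REDIS_URL was exported during the core setup.

```shell
# Hypothetical helper: report every named variable that is unset or empty.
# Returns 1 if at least one variable is missing, 0 otherwise.
check_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Assumption: these are the variables the later steps rely on.
check_vars REDIS_URL DEDUP_TTL DEDUP_CACHE_SIZE WEBHOOK_PORT ||
  echo "Set the variables listed above before continuing."
```

ANALYTICS_ENDPOINT is left out of the list because the guide marks it as optional.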

Step 2: Prepare Test Data

Create realistic test data that demonstrates different types of duplicates you'll encounter in production:

# Create test data directory
mkdir -p dedup-test-data
cd dedup-test-data

# Create exact duplicate events (network retry scenario)
cat > exact-duplicates.json << 'EOF'
{"event_id":"evt_001","event_type":"user_signup","timestamp":"2025-01-15T10:00:00Z","user":{"id":"user_123","email":"[email protected]","name":"Alice Smith"}}
{"event_id":"evt_001","event_type":"user_signup","timestamp":"2025-01-15T10:00:00Z","user":{"id":"user_123","email":"[email protected]","name":"Alice Smith"}}
EOF

# Create semantic duplicate events (load balancer retry scenario)
cat > semantic-duplicates.json << 'EOF'
{"event_id":"evt_002","event_type":"user_signup","timestamp":"2025-01-15T10:00:05Z","user":{"id":"user_123","email":"[email protected]","name":"Alice Smith"}}
{"event_id":"evt_003","event_type":"user_signup","timestamp":"2025-01-15T10:00:07Z","user":{"id":"user_123","email":"[email protected]","name":"Alice Smith"}}
EOF

# Create unique events (should not be deduplicated)
cat > unique-events.json << 'EOF'
{"event_id":"evt_004","event_type":"user_signup","timestamp":"2025-01-15T10:01:00Z","user":{"id":"user_456","email":"[email protected]","name":"Bob Jones"}}
{"event_id":"evt_005","event_type":"purchase","timestamp":"2025-01-15T10:02:00Z","user":{"id":"user_123","email":"[email protected]"},"product":{"id":"prod_789","amount":99.99}}
EOF

# Create mixed scenario test data
cat > mixed-test-events.json << 'EOF'
{"event_id":"evt_001","event_type":"user_signup","timestamp":"2025-01-15T10:00:00Z","user":{"id":"user_123","email":"[email protected]","name":"Alice Smith"}}
{"event_id":"evt_001","event_type":"user_signup","timestamp":"2025-01-15T10:00:00Z","user":{"id":"user_123","email":"[email protected]","name":"Alice Smith"}}
{"event_id":"evt_002","event_type":"user_signup","timestamp":"2025-01-15T10:00:05Z","user":{"id":"user_123","email":"[email protected]","name":"Alice Smith"}}
{"event_id":"evt_004","event_type":"user_signup","timestamp":"2025-01-15T10:01:00Z","user":{"id":"user_456","email":"[email protected]","name":"Bob Jones"}}
{"event_id":"evt_005","event_type":"purchase","timestamp":"2025-01-15T10:02:00Z","user":{"id":"user_123","email":"[email protected]"},"product":{"id":"prod_789","amount":99.99}}
EOF

echo "Test data created successfully!"
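To confirm the files contain what you expect, it can help to compare the number of events in each file with the number of distinct event_id values. The sketch below assumes you are still in the dedup-test-data directory; summarize_events is a hypothetical helper that relies only on grep, sort, and wc.

```shell
# Hypothetical helper: count events and distinct event_id values
# in a newline-delimited JSON file (one event per line).
summarize_events() {
  file=$1
  # Each event sits on its own line, so matching lines equal events.
  total=$(grep -c '"event_id"' "$file")
  # Extract the event_id pairs, keep one copy of each, and count them.
  unique=$(grep -o '"event_id":"[^"]*"' "$file" | sort -u | wc -l | tr -d ' ')
  echo "$file: $total events, $unique distinct event_id values"
}

for f in exact-duplicates.json semantic-duplicates.json \
         unique-events.json mixed-test-events.json; do
  if [ -f "$f" ]; then
    summarize_events "$f"
  fi
done
```

For exact-duplicates.json the two counts differ (2 events, 1 distinct ID), while semantic-duplicates.json reports 2 events and 2 distinct IDs even though both events describe the same signup. That gap is why comparing event_id values alone cannot catch semantic duplicates.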

Step 3: Verify Cache Resource

Before proceeding, verify that the cache resource is working correctly for deduplication. You can use redis-cli to interact with your Redis instance.
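One way to run this check is sketched below. It pings the server, then writes the same test key twice with SET ... NX EX, which only succeeds when the key does not exist yet. The dedup:test: key prefix, the claim_event helper, the localhost fallback URL, and the 3600-second expiry (chosen to mirror DEDUP_TTL="1h") are all assumptions made for illustration, not names used by the pipeline.

```shell
# Assumption: REDIS_URL comes from the core setup; the fallback is illustrative only.
REDIS_URL="${REDIS_URL:-redis://localhost:6379}"

# Hypothetical helper: try to claim an event ID in the cache.
# SET ... NX writes only when the key is absent; EX sets an expiry in seconds.
# redis-cli prints OK when the key was written; any other reply means it was not.
claim_event() {
  reply=$(redis-cli -u "$REDIS_URL" SET "dedup:test:$1" 1 NX EX 3600 2>/dev/null)
  if [ "$reply" = "OK" ]; then
    echo "new"
  else
    echo "duplicate"
  fi
}

# Run the check only when redis-cli is installed and the server answers PING.
if command -v redis-cli >/dev/null 2>&1 &&
   [ "$(redis-cli -u "$REDIS_URL" PING 2>/dev/null)" = "PONG" ]; then
  claim_event evt_001                                   # first write: should report "new"
  claim_event evt_001                                   # second write: should report "duplicate"
  redis-cli -u "$REDIS_URL" TTL "dedup:test:evt_001"    # remaining lifetime in seconds
  redis-cli -u "$REDIS_URL" DEL "dedup:test:evt_001" >/dev/null   # remove the test key
else
  echo "Redis is not reachable at $REDIS_URL"
fi
```

If the second call also reports "new", the key is not persisting between writes, and the cache setup is worth rechecking before you continue.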

Ready for Step 1: Implement Hash-Based Deduplication