Step 2: Fingerprint-Based Deduplication for Semantic Duplicates

Hash-based deduplication is great for exact copies, but it fails if even one character differs. A more powerful technique is fingerprint-based deduplication, where you hash only the fields that define a unique business event.

This is perfect for catching "semantic" duplicates, like those from a load balancer retry, where the event_id and timestamp are different, but the core action is the same.

The Goal

You will modify your pipeline to create a "fingerprint" based only on the core business fields (event_type and user_id). This will allow it to detect that the two messages below are duplicates, even though they are not identical.

Event 1:

{"event_id": "abc-123", "timestamp": "10:30:01Z", "event_type": "login", "user_id": "alice"}

Event 2 (Semantic Duplicate):

{"event_id": "def-456", "timestamp": "10:30:03Z", "event_type": "login", "user_id": "alice"}
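The idea can be sketched outside the pipeline in a few lines of Python. The `fingerprint` helper below is hypothetical (it is not part of the pipeline); it simply shows that hashing a subset of fields makes the two events above collide on purpose:

```python
import hashlib
import json

def fingerprint(event: dict) -> str:
    """Hash only the business-critical fields, ignoring event_id/timestamp."""
    subset = {"event_type": event["event_type"], "user_id": event["user_id"]}
    # sort_keys makes the serialization deterministic regardless of field order
    return hashlib.sha256(json.dumps(subset, sort_keys=True).encode()).hexdigest()

event_1 = {"event_id": "abc-123", "timestamp": "10:30:01Z",
           "event_type": "login", "user_id": "alice"}
event_2 = {"event_id": "def-456", "timestamp": "10:30:03Z",
           "event_type": "login", "user_id": "alice"}

print(fingerprint(event_1) == fingerprint(event_2))  # True: same business fingerprint
```

A full-message hash of these two events would differ (the `event_id` and `timestamp` fields differ), which is exactly why Step 1's approach misses them.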

Implementation

  1. Start with the Previous Pipeline: Copy the deduplicator.yaml from Step 1 to a new file named fingerprint-dedup.yaml.

    cp deduplicator.yaml fingerprint-dedup.yaml
  2. Modify the Hashing Logic: Open fingerprint-dedup.yaml. You will modify the first mapping processor to create the hash from a subset of fields instead of the whole message.

    # Modify the first 'mapping' processor in fingerprint-dedup.yaml:
    # change the hash creation logic.
    - mapping: |
        root = this

        # 1. Create a new object containing ONLY the business-critical fields.
        let business_fingerprint = {
          "event_type": this.event_type,
          "user_id": this.user_id
        }

        # 2. Create the hash from this new object, not the whole message.
        #    (hex-encode so dedup_hash is a plain string downstream)
        root.dedup_hash = $business_fingerprint.format_json().hash("sha256").encode("hex")

    The rest of the pipeline (cache, check, drop) remains exactly the same. It still uses dedup_hash to check for duplicates; that hash is now just derived from fewer, more meaningful fields. One caveat: with a fingerprint this broad, any login by alice within the cache's TTL window is treated as a duplicate, so choose fingerprint fields and TTL to match how often a legitimate event can genuinely repeat.

  3. Deploy and Test:

    # --- Send two SEMANTICALLY identical messages ---
    curl -X POST http://localhost:8080/ingest \
    -H "Content-Type: application/json" \
    -d '{"event_id": "abc-123", "timestamp": "10:30:01Z", "event_type": "login", "user_id": "alice"}'

    curl -X POST http://localhost:8080/ingest \
    -H "Content-Type: application/json" \
    -d '{"event_id": "def-456", "timestamp": "10:30:03Z", "event_type": "login", "user_id": "alice"}'
  4. Verify: Check your logs. Even though the two messages were different, only the first one was processed. The second one was correctly identified as a semantic duplicate and dropped because its business fingerprint was identical to the first.
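For reference, the downstream cache/check/drop stages carried over from Step 1 might look roughly like this. This is a sketch only, assuming a memory cache resource named `dedup_cache`; your actual deduplicator.yaml from Step 1 is the source of truth:

```yaml
# Sketch only -- your deduplicator.yaml from Step 1 is authoritative.
pipeline:
  processors:
    # (the fingerprinting 'mapping' processor from step 2 goes here)
    - cache:
        resource: dedup_cache
        operator: add            # 'add' fails if the key already exists
        key: '${! json("dedup_hash") }'
    - mapping: |
        # If the cache add failed, this fingerprint was seen before: drop it.
        root = if errored() { deleted() }

cache_resources:
  - label: dedup_cache
    memory:
      default_ttl: 60s
```

The `add` operator is what makes this atomic: the first message to claim a fingerprint wins, and every later message with the same fingerprint errors on the cache step and is deleted.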

You have now implemented a more powerful deduplication strategy that can handle a wider range of real-world duplicate scenarios.