Step 2: Fingerprint-Based Deduplication for Semantic Duplicates
Hash-based deduplication is great for exact copies, but it fails if even one character is different. A more powerful technique is fingerprint-based deduplication, where you create a hash from only the fields that define a unique business event.
This is perfect for catching "semantic" duplicates, like those from a load balancer retry, where the event_id and timestamp are different, but the core action is the same.
The Goal
You will modify your pipeline to create a "fingerprint" based only on the core business fields (event_type and user_id). This will allow it to detect that the two messages below are duplicates, even though they are not identical.
Event 1:

```json
{"event_id": "abc-123", "timestamp": "10:30:01Z", "event_type": "login", "user_id": "alice"}
```

Event 2 (Semantic Duplicate):

```json
{"event_id": "def-456", "timestamp": "10:30:03Z", "event_type": "login", "user_id": "alice"}
```
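To make the distinction concrete before touching the pipeline, here is a small standalone Python sketch (the helper names `full_hash` and `fingerprint` are illustrative, not part of the pipeline) showing that a whole-message hash treats these two events as different, while a fingerprint built only from `event_type` and `user_id` treats them as the same:

```python
import hashlib
import json

event1 = {"event_id": "abc-123", "timestamp": "10:30:01Z",
          "event_type": "login", "user_id": "alice"}
event2 = {"event_id": "def-456", "timestamp": "10:30:03Z",
          "event_type": "login", "user_id": "alice"}

def full_hash(event):
    # Hash the entire message: any differing field changes the digest.
    return hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()

def fingerprint(event):
    # Hash only the business-critical fields that define the event.
    subset = {"event_type": event["event_type"], "user_id": event["user_id"]}
    return hashlib.sha256(json.dumps(subset, sort_keys=True).encode()).hexdigest()

print(full_hash(event1) == full_hash(event2))      # → False (event_id/timestamp differ)
print(fingerprint(event1) == fingerprint(event2))  # → True  (same business event)
```

Note the `sort_keys=True`: any fingerprinting scheme needs a deterministic serialization, otherwise the same logical event can produce different hashes.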
Implementation
- **Start with the Previous Pipeline:** Copy the `deduplicator.yaml` from Step 1 to a new file named `fingerprint-dedup.yaml`:

  ```shell
  cp deduplicator.yaml fingerprint-dedup.yaml
  ```
- **Modify the Hashing Logic:** Open `fingerprint-dedup.yaml` and change the first `mapping` processor so that the hash is built from a subset of fields instead of the whole message:

  ```yaml
  # Change the hash creation logic
  - mapping: |
      root = this
      # 1. Create a new object containing ONLY the business-critical fields.
      let business_fingerprint = {
        "event_type": this.event_type,
        "user_id": this.user_id
      }
      # 2. Create the hash from this new object, not the whole message.
      #    Bloblang variables declared with `let` are referenced with a `$`
      #    prefix; the digest is hex-encoded so it works as a cache key.
      root.dedup_hash = $business_fingerprint.format_json().hash("sha256").encode("hex")
  ```

  The rest of the pipeline (`cache`, `check`, `drop`) remains exactly the same. It still uses `dedup_hash` to check for duplicates, but that hash is now much smarter.
- **Deploy and Test:** Send two semantically identical messages:

  ```shell
  curl -X POST http://localhost:8080/ingest \
    -H "Content-Type: application/json" \
    -d '{"event_id": "abc-123", "timestamp": "10:30:01Z", "event_type": "login", "user_id": "alice"}'

  curl -X POST http://localhost:8080/ingest \
    -H "Content-Type: application/json" \
    -d '{"event_id": "def-456", "timestamp": "10:30:03Z", "event_type": "login", "user_id": "alice"}'
  ```
- **Verify:** Check your logs. Even though the two messages were different, only the first one was processed. The second was correctly identified as a semantic duplicate and dropped, because its business fingerprint was identical to the first.
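The cache/check/drop behavior the pipeline relies on can be approximated in a few lines of Python. This is a sketch using an in-memory set in place of the pipeline's cache resource (the `process` function and `seen` set are illustrative names, not pipeline APIs):

```python
import hashlib
import json

seen = set()  # stands in for the pipeline's cache resource

def process(event):
    """Return the event if it is new, or None if it is a semantic duplicate."""
    subset = {"event_type": event["event_type"], "user_id": event["user_id"]}
    dedup_hash = hashlib.sha256(
        json.dumps(subset, sort_keys=True).encode()
    ).hexdigest()
    if dedup_hash in seen:   # "check": fingerprint already cached
        return None          # "drop": discard the duplicate
    seen.add(dedup_hash)     # "cache": remember the fingerprint
    return event

e1 = {"event_id": "abc-123", "timestamp": "10:30:01Z",
      "event_type": "login", "user_id": "alice"}
e2 = {"event_id": "def-456", "timestamp": "10:30:03Z",
      "event_type": "login", "user_id": "alice"}

print(process(e1) is not None)  # → True: first login for alice is kept
print(process(e2) is None)      # → True: semantic duplicate is dropped
```

One caveat this sketch makes visible: with an unbounded `seen` set, *every* later login by the same user would be dropped, which is why real pipelines give cached fingerprints a TTL.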
You have now implemented a more powerful deduplication strategy that can handle a wider range of real-world duplicate scenarios.