Step 3: ID-Based Deduplication for Unique Identifiers
If your events are guaranteed to carry a unique identifier (like a UUID or a database sequence number), you can use the simplest and cheapest deduplication pattern: ID-based deduplication.
This pattern skips content hashing and instead uses the event's own ID as the deduplication key.
The Goal
You will simplify your deduplication logic by using the event_id field directly as the cache key, which avoids the cost of computing a hash for every event.
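To make the difference concrete, the sketch below contrasts the two ways of deriving a deduplication key, written in plain Python rather than pipeline configuration. The helper names fingerprint_key and id_key are hypothetical, and the SHA-256 fingerprint is only an assumption about what a content hash might look like, not the exact logic from Step 2.

```python
import hashlib
import json


def fingerprint_key(event: dict) -> str:
    """Content-based key: hash a canonical JSON encoding of the whole event.

    Assumption: this stands in for the fingerprinting step; the real
    pipeline may hash a different set of fields.
    """
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def id_key(event: dict) -> str:
    """ID-based key: reuse the identifier the event already carries."""
    return event["event_id"]


first = {"event_id": "abc-123", "message": "hello"}
second = {"event_id": "abc-123", "message": "hello again"}

# Different content gives different fingerprints, but the ID key is identical.
print(fingerprint_key(first) == fingerprint_key(second))  # → False
print(id_key(first) == id_key(second))                    # → True
```

The ID-based key needs only a field lookup, while the fingerprint has to serialize and hash the event, which is where the saving comes from.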
Implementation
-
Start with the Previous Pipeline: Copy fingerprint-dedup.yaml from Step 2 to a new file named id-dedup.yaml.
cp fingerprint-dedup.yaml id-dedup.yaml
-
Simplify the Key Creation: Open id-dedup.yaml. You will replace the complex fingerprinting mapping processor with a much simpler one that uses only the event_id.
Replace the first 'mapping' processor in id-dedup.yaml:
# This processor is now much simpler.
- mapping: |
    root = this
    # The key for deduplication is now just the event's own ID.
    root.dedup_key = this.event_id
You will also need to update the cache processor to use this new dedup_key field.
Update the 'cache' processor:
- cache:
    resource: dedup_cache # You can rename this to id_cache if you prefer
    operator: get
    key: ${! this.dedup_key } # Use the new key field
Finally, update the last mapping processor to also use the dedup_key.
Update the last 'mapping' processor:
- mapping: |
    let is_duplicate = meta("cache").exists()
    root = this
    # Variables declared with let are referenced with a $ prefix.
    if $is_duplicate {
      root = deleted()
    } else {
      _ = cache_set("dedup_cache", this.dedup_key, "seen")
    }
-
Deploy and Test:
# --- Send the SAME event twice ---
curl -X POST http://localhost:8080/ingest \
-H "Content-Type: application/json" \
-d '{"event_id": "abc-123", "message": "hello"}'
curl -X POST http://localhost:8080/ingest \
-H "Content-Type: application/json" \
-d '{"event_id": "abc-123", "message": "hello again"}'
-
Verify: Check your logs. Even though the message was different in the second request, it was dropped as a duplicate because it had the same event_id. Because no hash is computed, this method adds very little overhead, but it is only safe when you can rely on a unique identifier in every event.
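The behavior described above can also be modeled outside the pipeline. The following is a minimal sketch, assuming an in-memory dictionary stands in for dedup_cache; the IdDeduplicator class and its process method are hypothetical names used only for illustration and are not part of any pipeline API.

```python
from typing import Optional


class IdDeduplicator:
    """Toy model of the get-then-set cache flow, keyed on event_id."""

    def __init__(self) -> None:
        # Stands in for the dedup_cache resource (assumption: entries never expire).
        self._seen = {}

    def process(self, event: dict) -> Optional[dict]:
        key = event["event_id"]
        if key in self._seen:
            # Cache hit: the ID was seen before, so drop the event
            # (the counterpart of root = deleted() in the mapping).
            return None
        # Cache miss: record the ID and let the event through.
        self._seen[key] = "seen"
        return event


dedup = IdDeduplicator()
print(dedup.process({"event_id": "abc-123", "message": "hello"}))
print(dedup.process({"event_id": "abc-123", "message": "hello again"}))
```

The first call returns the event unchanged and the second returns None, mirroring the two curl requests: the message differs, but the shared event_id alone decides that the second event is a duplicate.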