
Step 3: ID-Based Deduplication for Unique Identifiers

If every event is guaranteed to carry a unique identifier (such as a UUID or a database sequence number), you can use the simplest and cheapest deduplication pattern: ID-based deduplication.

This pattern skips content hashing entirely and uses the event's own ID as the deduplication key.

The Goal

You will simplify your deduplication logic by using the event_id field directly as the cache key. This avoids computing a hash for every event, and the key no longer depends on the event's content.
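To make the difference between the two key strategies concrete, here is a minimal Python sketch. It is not part of the pipeline: the helper names `fingerprint_key` and `id_key` are hypothetical, and the SHA-256 over canonical JSON is only an assumed stand-in for whatever fingerprint Step 2 actually computes.

```python
import hashlib
import json


def fingerprint_key(event: dict) -> str:
    """Content-based key (assumed scheme): hash a canonical JSON encoding of the event."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def id_key(event: dict) -> str:
    """ID-based key: the event's own identifier, with no hashing step."""
    return str(event["event_id"])


first = {"event_id": "abc-123", "message": "hello"}
second = {"event_id": "abc-123", "message": "hello again"}

# Same ID, different content: the ID keys match, the fingerprints do not.
print("ID keys equal:         ", id_key(first) == id_key(second))
print("Fingerprint keys equal:", fingerprint_key(first) == fingerprint_key(second))
```

The trade-off is that an ID-based key treats any two events with the same event_id as duplicates, regardless of their content, so it is only appropriate when the ID really is unique per logical event.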

Implementation

  1. Start with the Previous Pipeline: Copy the fingerprint-dedup.yaml from Step 2 to a new file named id-dedup.yaml.

    cp fingerprint-dedup.yaml id-dedup.yaml
  2. Simplify the Key Creation: Open id-dedup.yaml. You will replace the complex fingerprinting mapping processor with a much simpler one that just uses the event_id.

    # Replace the first 'mapping' processor in id-dedup.yaml
    # This processor is now much simpler.
    - mapping: |
        root = this
        # The key for deduplication is now just the event's own ID.
        root.dedup_key = this.event_id

    You will also need to update the cache processor to use this new dedup_key field.

    # Update the 'cache' processor
    - cache:
        resource: dedup_cache # You can rename this to id_cache if you prefer
        operator: get
        key: ${! this.dedup_key } # Use the new key field

    Finally, update the last mapping processor to also use the dedup_key.

    # Update the last 'mapping' processor
    - mapping: |
        let is_duplicate = meta("cache").exists()
        root = this

        # Variables declared with 'let' are referenced with a '$' prefix.
        if $is_duplicate {
          root = deleted()
        } else {
          _ = cache_set("dedup_cache", this.dedup_key, "seen")
        }
  3. Deploy and Test:

    # --- Send the SAME event twice ---
    curl -X POST http://localhost:8080/ingest \
      -H "Content-Type: application/json" \
      -d '{"event_id": "abc-123", "message": "hello"}'

    curl -X POST http://localhost:8080/ingest \
      -H "Content-Type: application/json" \
      -d '{"event_id": "abc-123", "message": "hello again"}'
  4. Verify: Check your logs. Only the first event should appear. Even though the message differed in the second request, it was dropped as a duplicate because it carried the same event_id. This method is fast and inexpensive whenever you can rely on a unique identifier in your events.
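The check-then-set flow that the steps above configure can be sketched in a few lines of Python. This is an illustration of the logic only, not how the pipeline is implemented: the `IdDeduplicator` class is hypothetical, and an in-memory set stands in for the dedup_cache resource (a real cache would normally also expire old keys).

```python
from typing import Optional


class IdDeduplicator:
    """Drops any event whose event_id has already been seen."""

    def __init__(self) -> None:
        # In-memory stand-in for the dedup_cache resource (assumption).
        self._seen = set()

    def process(self, event: dict) -> Optional[dict]:
        key = str(event["event_id"])  # the dedup key is the event's own ID
        if key in self._seen:
            return None               # duplicate: drop the event
        self._seen.add(key)           # first sighting: remember the ID
        return event                  # pass the event through unchanged


dedup = IdDeduplicator()
events = [
    {"event_id": "abc-123", "message": "hello"},
    {"event_id": "abc-123", "message": "hello again"},
]

for event in events:
    outcome = "kept" if dedup.process(event) is not None else "dropped"
    print(outcome, event["event_id"], repr(event["message"]))
```

Run as is, the first event is reported as kept and the second as dropped, mirroring the two curl requests in the test step.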