Technique 3: Hashing with Metadata Extraction

This technique extends hashing. Instead of only replacing a field with its hash, you also extract and keep a non-sensitive piece of metadata from the field before deleting the original. It suits fields like email addresses, where the full address identifies a person but the domain alone is useful for analytics.

The Goal

You will hash the full email address for user counting, extract the email_domain for organizational analytics, and delete the original email.
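To make the goal concrete, here is a minimal Python sketch of the same transformation outside the pipeline. It is an illustration only, not the pipeline's implementation: the function name `pseudonymize_email`, the sample event fields, and the demo salt are all hypothetical, and HMAC-SHA256 is assumed as one reasonable way to combine the salt with the email.

```python
import hashlib
import hmac


def pseudonymize_email(event: dict, salt: str) -> dict:
    """Return a copy of event with email replaced by email_hash and email_domain."""
    email = event["email"]
    # Copy every field except the raw email address
    out = {k: v for k, v in event.items() if k != "email"}
    # Keyed SHA-256 (assumption): the digest depends on the secret salt
    out["email_hash"] = hmac.new(
        salt.encode(), email.encode(), hashlib.sha256
    ).hexdigest()
    # Keep only the part after "@" as the non-sensitive domain
    out["email_domain"] = email.split("@")[1]
    return out


# Hypothetical sample event; field names other than "email" are illustrative
sample = {"event_type": "login", "plan": "pro", "email": "alice@example.com"}
result = pseudonymize_email(sample, salt="demo-salt-not-for-production")
print(result)
```

The same email and salt always produce the same email_hash, which is what makes distinct-user counts possible, while a different salt produces an unrelated hash.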

Implementation

  1. Start with the Previous Pipeline: Copy the hash-ip.yaml from Step 2 to a new file named hash-email.yaml.

    cp hash-ip.yaml hash-email.yaml

    Note: Remember to set the EMAIL_SALT environment variable as described in the setup guide.

  2. Add the Email Hashing Logic: Open hash-email.yaml and add the email logic to the bottom of the existing mapping processor.

    # Add this to your 'mapping' processor in hash-email.yaml
    # --- Logic from previous steps ---
    # (The existing logic for payment deletion and IP hashing remains here)

    # --- START: New additions for Email Hashing ---

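    # Hash the full email address for unique user tracking.
    # HMAC-SHA256 keys the hash with EMAIL_SALT (plain "sha256" does not use a key);
    # encode("hex") stores the digest as a readable hex string.
    root.email_hash = this.email.hash("hmac_sha256", env("EMAIL_SALT")).encode("hex")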

    # Extract the domain for organizational analytics
    root.email_domain = this.email.split("@").index(1)

    # Delete the original email address field.
    # Reassigning root as a whole here would discard the fields set above.
    root.email = deleted()

    # --- END: New additions ---

  3. Deploy and Test:

    # Send the sample event data
    curl -X POST http://localhost:8080/events/ingest \
    -H "Content-Type: application/json" \
    -d @"$HOME/expanso-remove-pii/sample-event.json"

  4. Verify: Check your logs. The email field is gone, replaced by email_hash and email_domain. You can count unique users with COUNT(DISTINCT email_hash) and analyze user organizations with GROUP BY email_domain without storing raw email addresses downstream. Note that a salted hash is pseudonymized rather than anonymized data, so it may still count as personal data under GDPR.
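The two analytics queries mentioned in the Verify step can be tried on a few processed events. The following is a small sketch using Python's built-in sqlite3 module; the table name `events` and the placeholder hash values are assumptions for illustration, not output from the pipeline.

```python
import sqlite3

# Hypothetical processed events: only email_hash and email_domain remain
rows = [
    ("hash_a", "example.com"),
    ("hash_a", "example.com"),  # same user, second event
    ("hash_b", "example.com"),
    ("hash_c", "acme.org"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (email_hash TEXT, email_domain TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

# Unique users across all events
unique_users = conn.execute(
    "SELECT COUNT(DISTINCT email_hash) FROM events"
).fetchone()[0]

# Unique users per organization (email domain)
users_per_domain = conn.execute(
    "SELECT email_domain, COUNT(DISTINCT email_hash) FROM events "
    "GROUP BY email_domain ORDER BY email_domain"
).fetchall()

print(unique_users)      # 3
print(users_per_domain)  # [('acme.org', 1), ('example.com', 2)]
```

Neither query needs the original address: the hash is enough to tell users apart, and the domain is enough to group them.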