Step 2: Convert Avro to Parquet
While Avro is excellent for streaming, Parquet is the industry standard for efficient analytics and data warehousing. It uses a columnar storage format that is highly optimized for analytical queries.
This step teaches you how to convert the Avro data from Step 1 into the Parquet format.
The Goal
You will add processors to your pipeline to take the binary Avro data, convert it back to a structured JSON object, and then re-encode it into the binary Parquet format.
The "De-serialize -> Re-serialize" Pattern
Converting between two binary formats is a two-step process:
- De-serialize from Avro: Use the `from_avro` processor to convert the binary Avro data back into a temporary, structured JSON object.
- Re-serialize to Parquet: Use the `to_parquet` processor to convert that JSON object into the binary Parquet format, as sketched below.
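Taken on its own, the pattern is just two processors back to back. The following is a minimal sketch that reuses the `from_avro` and `to_parquet` processors and the `sensor.avsc` schema from the Implementation section below; it assumes your processors are listed under a `pipeline.processors` block, as the wording in that section suggests.

```yaml
pipeline:
  processors:
    # De-serialize: binary Avro -> structured object, using the schema from Step 1.
    - from_avro:
        schema_path: "file://./sensor.avsc"
    # Re-serialize: structured object -> binary Parquet, using the same schema.
    - to_parquet:
        schema_path: "file://./sensor.avsc"
```

Both processors point at the same schema file: the schema that encoded the Avro data is also what tells the Parquet encoder the structure of each record.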
Implementation
- Start with the Previous Pipeline: Copy `json-to-avro.yaml` from Step 1 to a new file named `avro-to-parquet.yaml`.

  ```bash
  cp json-to-avro.yaml avro-to-parquet.yaml
  ```

- Add the Conversion Logic: Open `avro-to-parquet.yaml`. You will add two new processors to the end of the `pipeline` section.

  Add these processors to your pipeline in `avro-to-parquet.yaml`:

  ```yaml
  # --- Processors from Step 1 ---
  # (The existing mapping and to_avro processors remain here)

  # --- START: New additions for Parquet Conversion ---

  # 1. DE-SERIALIZE: Convert the Avro binary data back to a structured object.
  #    We must use the same schema we used to encode it.
  - from_avro:
      schema_path: "file://./sensor.avsc"

  # 2. RE-SERIALIZE: Convert the structured object to the Parquet format.
  #    Parquet also uses the Avro schema to understand the data structure.
  - to_parquet:
      schema_path: "file://./sensor.avsc"
      # For analytics, Snappy is a common and efficient compression choice.
      default_compression: snappy

  # --- END: New additions ---
  ```

  Note: The final output of this pipeline is now Parquet data, so you would typically change the `output` to write to a system that expects Parquet, such as Amazon S3 or Google Cloud Storage (see the sketch after this list).

- Deploy and Observe: Watch the logs. The final output will be another stream of garbled binary text, but this time it will be in the highly compressed, columnar Parquet format.
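As a rough sketch of the output change mentioned in the note above, assuming your framework offers an S3-style output component (the `aws_s3` name and its `bucket` and `path` fields are illustrative assumptions, not confirmed by this guide), it might look like:

```yaml
# Hypothetical output block: swap in the Parquet-capable output your tool actually provides.
output:
  aws_s3:
    bucket: my-analytics-bucket   # illustrative bucket name
    path: sensors/data.parquet    # illustrative object key for the Parquet data
```

A Google Cloud Storage output would follow the same shape; the important point is that the downstream system now receives Parquet rather than JSON or Avro.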
Verification
The binary output in your console indicates the conversion ran end to end. A data lake query engine such as AWS Athena or Google BigQuery can read this Parquet data and run very fast analytical queries over it, because it only needs to read the columns relevant to a query rather than entire rows.
You have successfully transformed streaming data into a format optimized for big data analytics.