Step 2: Convert Avro to Parquet
While Avro is excellent for streaming, Parquet is the industry standard for efficient analytics and data warehousing. It uses a columnar storage format that is highly optimized for analytical queries.
This step teaches you how to convert the Avro data from Step 1 into the Parquet format.
The Goal
You will add processors to your pipeline to take the binary Avro data, convert it back to a structured JSON object, and then re-encode it into the binary Parquet format.
The "De-serialize -> Re-serialize" Pattern
Converting between two binary formats is a two-step process:
- De-serialize from Avro: Use the `from_avro` processor to convert the binary Avro data back into a temporary, structured JSON object.
- Re-serialize to Parquet: Use the `to_parquet` processor to convert that JSON object into the binary Parquet format, as sketched below.
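Taken on its own, the pattern is just two processors back to back. The following is a minimal sketch that reuses the `from_avro` and `to_parquet` processors and the `sensor.avsc` schema from the Implementation section below; it assumes your processors are listed under a `pipeline.processors` block, as the wording in that section suggests.

```yaml
pipeline:
  processors:
    # De-serialize: binary Avro -> structured object, using the schema from Step 1.
    - from_avro:
        schema_path: "file://./sensor.avsc"
    # Re-serialize: structured object -> binary Parquet, using the same schema.
    - to_parquet:
        schema_path: "file://./sensor.avsc"
```

Both processors point at the same schema file: the schema that encoded the Avro data is also what tells the Parquet encoder the structure of each record.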
Implementation
- Start with the Previous Pipeline: Copy `json-to-avro.yaml` from Step 1 to a new file named `avro-to-parquet.yaml`.

  ```bash
  cp json-to-avro.yaml avro-to-parquet.yaml
  ```

- Add the Conversion Logic: Open `avro-to-parquet.yaml`. You will add two new processors to the end of the `pipeline` section.

  Add these processors to your pipeline in `avro-to-parquet.yaml`:

  ```yaml
  # --- Processors from Step 1 ---
  # (The existing mapping and to_avro processors remain here)

  # --- START: New additions for Parquet Conversion ---

  # 1. DE-SERIALIZE: Convert the Avro binary data back to a structured object.
  #    We must use the same schema we used to encode it.
  - from_avro:
      schema_path: "file://./sensor.avsc"

  # 2. RE-SERIALIZE: Convert the structured object to the Parquet format.
  #    Parquet also uses the Avro schema to understand the data structure.
  - to_parquet:
      schema_path: "file://./sensor.avsc"
      # For analytics, Snappy is a common and efficient compression choice.
      default_compression: snappy

  # --- END: New additions ---
  ```

  Note: The final output of this pipeline is now Parquet data, so you would typically change the `output` to write to a system that expects Parquet, such as Amazon S3 or Google Cloud Storage (see the sketch after this list).

- Deploy and Observe: Watch the logs. The final output will be another stream of garbled binary text, but this time it will be in the highly compressed, columnar Parquet format.
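As a rough sketch of the output change mentioned in the note above, assuming your framework offers an S3-style output component (the `aws_s3` name and its `bucket` and `path` fields are illustrative assumptions, not confirmed by this guide), it might look like:

```yaml
# Hypothetical output block: swap in the Parquet-capable output your tool actually provides.
output:
  aws_s3:
    bucket: my-analytics-bucket   # illustrative bucket name
    path: sensors/data.parquet    # illustrative object key for the Parquet data
```

A Google Cloud Storage output would follow the same shape; the important point is that the downstream system now receives Parquet rather than JSON or Avro.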
Verification
The binary output in your console indicates the conversion ran end to end. A data lake query engine such as AWS Athena or Google BigQuery can read this Parquet data and run very fast analytical queries over it, because it only needs to read the columns relevant to a query rather than entire rows.
You have successfully transformed streaming data into a format optimized for big data analytics.