Step 1: Convert JSON to Avro
A common transformation is to convert from a text-based format like JSON to a schema-based, binary format like Avro. Avro is more compact and efficient for high-volume streaming, especially with systems like Kafka.
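To see the size difference concretely, here is a small Python sketch (illustrative only, not a real Avro encoder) comparing a JSON-encoded sensor reading against the same values packed as schema-style binary, where field names live in the schema rather than in every payload:

```python
import json
import struct

record = {"sensor_id": "sensor-1", "temperature": 25.5, "timestamp": 1704067200000}

# Text encoding: field names and punctuation travel with every message.
json_bytes = json.dumps(record).encode("utf-8")

# Schema-style binary: only the values are encoded; the schema (held by both
# producer and consumer) supplies the field names and types.
binary_bytes = (
    bytes([len(b"sensor-1")]) + b"sensor-1"  # length-prefixed string
    + struct.pack("<d", 25.5)                # 8-byte IEEE-754 double
    + struct.pack("<q", 1704067200000)       # 8-byte signed long
)

print(len(json_bytes), len(binary_bytes))  # the binary payload is roughly 3x smaller
```

The exact ratio depends on the data, but the binary form never pays per-message for field names, quotes, or decimal digits.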
The Goal
You will create a pipeline that takes a JSON message, validates it against an Avro schema, and converts it to the binary Avro format.
The "Prepare -> Convert" Pattern
- Define Schema: Avro requires a schema that describes the data. You'll create a simple `.avsc` file.
- Prepare: The incoming JSON data must be structured to perfectly match the schema. This often involves a `mapping` processor to ensure data types are correct (e.g., converting a timestamp string to an integer).
- Convert: Use the `to_avro` processor to perform the conversion.
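In plain Python terms, the Prepare step is just a type coercion before serialization. A rough sketch (the field names follow the schema defined below; the `prepare` helper is hypothetical):

```python
from datetime import datetime

def prepare(record: dict) -> dict:
    """Coerce the ISO-8601 timestamp string into epoch milliseconds,
    matching the schema's `long` timestamp field."""
    out = dict(record)
    ts = datetime.fromisoformat(out["timestamp"].replace("Z", "+00:00"))
    out["timestamp"] = int(ts.timestamp() * 1000)
    return out

prepared = prepare({
    "sensor_id": "sensor-1",
    "temperature": 25.5,
    "timestamp": "2024-01-01T00:00:00Z",
})
print(prepared["timestamp"])  # 1704067200000
```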
Implementation
- Define the Avro Schema: Create a file named `sensor.avsc`. This schema defines the structure for our sensor data. Notice that the `timestamp` is a `long` (a 64-bit integer), not a string. Per the Avro specification, the `logicalType` annotation belongs on the type itself, not on the field:

  ```json
  {
    "type": "record",
    "name": "SensorReading",
    "fields": [
      { "name": "sensor_id", "type": "string" },
      { "name": "temperature", "type": "double" },
      { "name": "timestamp", "type": { "type": "long", "logicalType": "timestamp-millis" } }
    ]
  }
  ```
- Create the Conversion Pipeline: Copy the following configuration into a file named `json-to-avro.yaml`.

  ```yaml
  name: json-to-avro-converter
  description: A pipeline that converts JSON to Avro.
  config:
    input:
      generate:
        interval: 1s
        mapping: |
          root = {
            "sensor_id": "sensor-1",
            "temperature": 25.5,
            "timestamp": now().ts_format_iso8601()
          }
    pipeline:
      processors:
        # 1. PREPARE: Ensure the data matches the Avro schema.
        #    The schema requires 'timestamp' to be a 'long' (integer),
        #    but our input is an ISO 8601 string. We must convert it.
        - mapping: |
            root = this
            root.timestamp = this.timestamp.parse_timestamp().unix_milli()
        # 2. CONVERT: Convert the prepared JSON object to Avro binary format.
        #    This processor replaces the message content with the binary data.
        - to_avro:
            schema_path: "file://./sensor.avsc"
    output:
      stdout:
        codec: lines
  ```

- Deploy and Observe: Watch the logs. The output will not be readable JSON. It will be a stream of binary Avro data, which is exactly what a system like Kafka would expect.
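For this schema, the Avro binary encoding that the conversion step produces is quite simple: a string is a zigzag-varint length followed by UTF-8 bytes, a double is 8 little-endian IEEE-754 bytes, and a long is a zigzag varint. A hand-rolled sketch of that encoding (for illustration only; a real pipeline should rely on the processor or an Avro library):

```python
import struct

def zigzag_varint(n: int) -> bytes:
    """Encode a signed integer the way Avro encodes longs:
    zigzag-map it to unsigned, then emit base-128 varint bytes."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        out.append((byte | 0x80) if z else byte)
        if not z:
            return bytes(out)

def encode_reading(sensor_id: str, temperature: float, timestamp_ms: int) -> bytes:
    raw = sensor_id.encode("utf-8")
    return (
        zigzag_varint(len(raw)) + raw      # string: varint length + bytes
        + struct.pack("<d", temperature)   # double: 8 bytes, little-endian
        + zigzag_varint(timestamp_ms)      # long: zigzag varint
    )

body = encode_reading("sensor-1", 25.5, 1704067200000)
print(len(body), body.hex())  # 23 bytes for the whole record
```

Note how small the record is: one length byte, eight name bytes, eight double bytes, and a six-byte varint timestamp.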
Verification
The output in your console will look like garbled text, because it is binary data. This confirms the conversion was successful. A downstream system with access to the same Avro schema would be able to read this binary data and understand its structure perfectly.
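To make the "downstream system" claim concrete, here is a minimal Python decoder for this schema's layout (a sketch under the same encoding assumptions as the schema above, not a general Avro reader):

```python
import struct

def zigzag_varint(n: int) -> bytes:
    """Encode a signed integer as an Avro-style zigzag varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        out.append((byte | 0x80) if z else byte)
        if not z:
            return bytes(out)

def read_long(buf: bytes, pos: int):
    """Decode one zigzag varint, returning (value, new_position)."""
    z = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return (z >> 1) ^ -(z & 1), pos

def decode_reading(buf: bytes) -> dict:
    n, pos = read_long(buf, 0)
    sensor_id = buf[pos:pos + n].decode("utf-8")
    pos += n
    temperature = struct.unpack_from("<d", buf, pos)[0]
    pos += 8
    timestamp, _ = read_long(buf, pos)
    return {"sensor_id": sensor_id, "temperature": temperature, "timestamp": timestamp}

# Re-create a binary payload, then read it back using only schema knowledge.
payload = (zigzag_varint(8) + b"sensor-1"
           + struct.pack("<d", 25.5)
           + zigzag_varint(1704067200000))
print(decode_reading(payload))
```

Because the field names and types come from the shared schema, the reader needs nothing from the payload except the raw bytes in the agreed order.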
You have successfully transformed JSON into a more efficient, schema-enforced binary format.