
Setup Environment for Format Transformation

Before building the complete format transformation solution, you'll set up schema registries and configure format definitions.

Prerequisites

This example builds on services from earlier guides. Before you begin, ensure a Kafka broker is running on localhost:9092 (the schema registry started in Step 2 connects to it), and that you have completed the Local Development Setup guide for general environment configuration.

Step 1: Configure Environment Variables

Set up the environment variables needed for multi-format processing:

# Format transformation configuration
export SCHEMA_REGISTRY_URL="http://localhost:8081"
export CLOUD_STORAGE_BUCKET="your-format-bucket"
export CLOUD_REGION="us-east-1"

# Verify environment setup
echo "Schema Registry: $SCHEMA_REGISTRY_URL"
echo "Storage Bucket: $CLOUD_STORAGE_BUCKET"
echo "Region: $CLOUD_REGION"
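If you script the pipeline from Python, the same settings can be read back with sensible fallbacks. A minimal sketch (the environment variable names match the exports above; the default values are the local-development assumptions from this guide):

```python
import os

def load_format_config() -> dict:
    """Read the format-transformation settings exported above,
    falling back to the local defaults used in this guide."""
    return {
        "schema_registry_url": os.environ.get("SCHEMA_REGISTRY_URL", "http://localhost:8081"),
        "storage_bucket": os.environ.get("CLOUD_STORAGE_BUCKET", "your-format-bucket"),
        "region": os.environ.get("CLOUD_REGION", "us-east-1"),
    }
```
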

Step 2: Start Schema Registry

Format transformation requires a schema registry to manage Avro schemas and ensure compatibility:

# Pull and start Confluent Schema Registry
# host.docker.internal lets the container reach a Kafka broker running on the host;
# localhost inside the container would point at the container itself
docker run -d \
  --name format-schema-registry \
  --add-host=host.docker.internal:host-gateway \
  -p 8081:8081 \
  -e SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS=host.docker.internal:9092 \
  -e SCHEMA_REGISTRY_HOST_NAME=schema-registry \
  -e SCHEMA_REGISTRY_LISTENERS=http://0.0.0.0:8081 \
  confluentinc/cp-schema-registry:latest

# Wait until the schema registry answers (polls every 2 seconds)
until curl -sf http://localhost:8081/subjects >/dev/null; do
  echo "Schema registry not ready, waiting..."
  sleep 2
done
echo "Schema registry is up"
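If you drive this setup from Python instead of the shell, the readiness check can be expressed as a small polling helper. A sketch (the probe is injectable, so the wait logic can be tested without a live registry; the urllib probe and timeout are assumptions based on the endpoint used above):

```python
import time
import urllib.request
from typing import Callable

def wait_until_ready(probe: Callable[[], bool], attempts: int = 15, delay: float = 2.0) -> bool:
    """Call probe() until it returns True or attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False

def registry_up(url: str = "http://localhost:8081/subjects") -> bool:
    """Probe the schema registry's subjects endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False
```

Usage would be `wait_until_ready(registry_up)` after starting the container.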

Step 3: Create Sample Format Definitions

Create the schema definitions that will guide our format transformations:

Avro Schema

Create sensor-data.avsc:

{
  "type": "record",
  "name": "SensorData",
  "namespace": "com.example.formats",
  "fields": [
    {"name": "sensor_id", "type": "string"},
    {"name": "location", "type": "string"},
    {"name": "temperature", "type": "double"},
    {"name": "humidity", "type": "double"},
    {"name": "timestamp", "type": "long"},
    {
      "name": "metadata",
      "type": {
        "type": "record",
        "name": "Metadata",
        "fields": [
          {"name": "device_type", "type": "string"},
          {"name": "firmware_version", "type": "string"}
        ]
      }
    }
  ]
}
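Records can be sanity-checked against this schema before transformation even without a full Avro library. A minimal hand-rolled sketch (not a substitute for real Avro validation; the type mapping covers only the primitives the schema above uses):

```python
# Maps the Avro primitive types used in sensor-data.avsc to Python types
PRIMITIVE_TYPES = {"string": str, "double": float, "long": int}

def check_record(record: dict, fields: list) -> list:
    """Return a list of problems found; an empty list means the record matches."""
    problems = []
    for field in fields:
        name, ftype = field["name"], field["type"]
        if name not in record:
            problems.append(f"missing field: {name}")
        elif isinstance(ftype, dict) and ftype.get("type") == "record":
            # Nested record (e.g. metadata): recurse into its fields
            problems += check_record(record[name], ftype["fields"])
        elif isinstance(ftype, str) and not isinstance(record[name], PRIMITIVE_TYPES[ftype]):
            problems.append(f"wrong type for {name}: expected {ftype}")
    return problems
```
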

Protobuf Schema

Create sensor-data.proto:

syntax = "proto3";

package com.example.formats;

import "google/protobuf/timestamp.proto";

message SensorData {
  string sensor_id = 1;
  string location = 2;
  double temperature = 3;
  double humidity = 4;
  google.protobuf.Timestamp timestamp = 5;
  Metadata metadata = 6;
}

message Metadata {
  string device_type = 1;
  string firmware_version = 2;
}

Step 4: Register Schemas

Register the Avro schema with the schema registry:

# Register the sensor data schema
# The registry expects {"schema": "<stringified schema>"}, not the raw .avsc file;
# jq wraps the file contents as a JSON string before posting
jq -n --rawfile schema sensor-data.avsc '{schema: $schema}' | \
  curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
    --data @- \
    http://localhost:8081/subjects/sensor-data-value/versions
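The double-encoding of the request body (the schema must be a stringified JSON value under a `schema` key) is easy to get wrong with raw curl. A Python sketch of the same registration (urllib-based; the subject name matches the command above):

```python
import json
import urllib.request

def registration_payload(schema: dict) -> bytes:
    """Build the registry's expected request body: the schema itself must be
    a *stringified* JSON value under the "schema" key."""
    return json.dumps({"schema": json.dumps(schema)}).encode()

def register_schema(base_url: str, subject: str, schema: dict) -> None:
    req = urllib.request.Request(
        f"{base_url}/subjects/{subject}/versions",
        data=registration_payload(schema),
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print("registered:", resp.read().decode())
```

Usage: `register_schema("http://localhost:8081", "sensor-data-value", schema_dict)` with the parsed contents of sensor-data.avsc.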

Step 5: Create Sample Data Files

Create sample data in different formats for testing:

JSON Sample Data

Create sample-sensor-data.json:

[
  {
    "sensor_id": "temp_42",
    "location": "warehouse_north",
    "temperature": 72.5,
    "humidity": 45.2,
    "timestamp": "2025-10-20T14:23:45.123Z",
    "metadata": {
      "device_type": "DHT22",
      "firmware_version": "1.2.3"
    }
  }
]
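Note that the JSON sample carries an ISO 8601 timestamp string, while the Avro schema declares `timestamp` as a `long` and the Protobuf schema uses `google.protobuf.Timestamp`, so the transformation steps will need to convert between representations. A minimal sketch of both conversions (epoch milliseconds as the Avro `long` encoding is an assumption; Protobuf's `Timestamp` is defined as seconds plus nanos):

```python
from datetime import datetime

def iso_to_epoch_millis(iso: str) -> int:
    """ISO 8601 string -> epoch milliseconds (candidate Avro long encoding)."""
    dt = datetime.fromisoformat(iso.replace("Z", "+00:00"))
    # Integer arithmetic avoids float rounding on the millisecond part
    return int(dt.timestamp()) * 1000 + dt.microsecond // 1000

def iso_to_proto_timestamp(iso: str) -> tuple:
    """ISO 8601 string -> (seconds, nanos) as in google.protobuf.Timestamp."""
    dt = datetime.fromisoformat(iso.replace("Z", "+00:00"))
    return int(dt.timestamp()), dt.microsecond * 1000
```
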

Next Steps

Your format transformation environment is now ready! Continue with:

  1. Step 1: JSON to Avro
  2. Step 2: Avro to Parquet
  3. Step 3: JSON to Protobuf
  4. Step 4: Auto-Detection