Step 4: Configure Batching
Sending one message at a time can be inefficient, especially when writing to files or sending data over a network, because every send pays a fixed overhead such as a file write or a network round trip. Batching collects a number of messages and sends them as a single group, which spreads that overhead across the whole batch.
The Goal
You will modify your pipeline's output to use a batching policy. Instead of writing every single log to a file as it arrives, the pipeline will collect 10 messages (or wait for 5 seconds) and then write them all at once.
The batching Block
Most output processors can be configured with a batching block that defines the policy. The two most common fields are:
- count: The number of messages to collect before sending.
- period: The maximum amount of time to wait before sending, even if the count has not been reached.
Whichever condition is met first triggers the send.
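The count-or-period rule can be sketched as a small Python simulation. This is an illustration only, not the pipeline's actual implementation: the function name simulate_batches is hypothetical, and the sketch assumes the period timer starts when the first message of a batch is buffered.

```python
def simulate_batches(arrivals, count, period):
    """Return (flush_time, messages) pairs for a count-or-period policy.

    arrivals: list of (time_in_seconds, message) tuples, sorted by time.
    Assumption: the period timer starts when the first message of a
    batch is buffered.
    """
    flushed = []
    batch = []
    deadline = None
    for t, msg in arrivals:
        # The period expired before this message arrived: flush first.
        if batch and t >= deadline:
            flushed.append((deadline, batch))
            batch = []
        # First message of a new batch starts the period timer.
        if not batch:
            deadline = t + period
        batch.append(msg)
        # The count limit was reached: flush immediately.
        if len(batch) >= count:
            flushed.append((t, batch))
            batch = []
    # Input stopped: the remainder goes out when the period expires.
    if batch:
        flushed.append((deadline, batch))
    return flushed


# One message per second: the 5-second period fires before count reaches 10.
slow = simulate_batches([(i, f"m{i}") for i in range(12)], count=10, period=5)
print([len(b) for _, b in slow])  # [5, 5, 2]

# Ten messages per second: the count of 10 fires before the period.
fast = simulate_batches([(i * 0.1, i) for i in range(25)], count=10, period=5)
print([len(b) for _, b in fast])  # [10, 10, 5]
```

In both runs the final, partial batch is sent only once its period expires, which is how leftover messages get flushed when the input stops.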
Implementation
- Start with the Previous Pipeline: Copy restructure-log.yaml from Step 3 to a new file named batched-output.yaml.

      cp restructure-log.yaml batched-output.yaml

- Add the Batching Logic: Open batched-output.yaml and replace the entire output section with a file output that includes a batching block.

      # Replace the 'output' section in batched-output.yaml
      output:
        file:
          path: /tmp/batched_logs.jsonl
          codec: lines
          batching:
            count: 10
            period: 5s

- Deploy and Test:

      # Create the output file's directory if it doesn't exist
      mkdir -p /tmp

- Verify: Watch the output file.

      tail -f /tmp/batched_logs.jsonl

  Instead of seeing a new line appear every second (the rate of the generate input), you will see lines appear in blocks. Because the input emits only one message per second, the 5-second period expires before the count reaches 10, so expect a block of about 5 lines roughly every 5 seconds. A block of 10 lines would appear only if the input produced 10 messages in under 5 seconds. If you were to stop the input, any remaining messages in the batch would be flushed after the 5-second period expires.
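As an alternative to watching tail by eye, the burst sizes can be measured with a short script. The following is a minimal sketch, assuming the output path from this step. The helper names read_new_lines and watch are hypothetical, and the polling interval is an arbitrary choice.

```python
import os
import time


def read_new_lines(path, offset):
    """Return (complete lines appended since offset, new offset)."""
    if not os.path.exists(path):
        return [], offset
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Consume only complete lines; a partial trailing line waits for the next poll.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8").splitlines()
    return lines, offset + end


def watch(path, interval=0.5, polls=60):
    """Poll the file and print how many lines arrived in each burst."""
    # Start at the current end of the file so old lines are not counted.
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    for _ in range(polls):
        lines, offset = read_new_lines(path, offset)
        if lines:
            print(f"{time.strftime('%H:%M:%S')}  +{len(lines)} lines")
        time.sleep(interval)
```

Calling watch("/tmp/batched_logs.jsonl") while the pipeline runs prints one timestamped entry per burst, so the batch size and the spacing between flushes can be read off directly.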
You have now implemented a batching policy, which is a critical technique for building high-throughput, cost-effective data pipelines.