Understanding Avro
Schema with the data — the Kafka format.
What Avro is, why the schema travels with the data, and how the schema registry handles evolution in distributed systems.
Schema-first, but inline.
Avro (Apache, 2009) is a row-oriented binary data format with schemas written in JSON. The trick: in Avro Object Container Files (.avro), the schema is embedded in the file header. A consumer can read the file without prior knowledge — schema and data ship together. In streaming systems like Kafka, the schema is replaced by a schema-ID that resolves through a Schema Registry.
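To make the "schema in the header" idea concrete, here is a minimal sketch that writes and then parses an Object Container File header in memory, using only the standard library. It assumes the layout described in the Avro specification (the 4-byte magic `Obj\x01`, a metadata map carrying `avro.schema`, then a 16-byte sync marker); the helper names are illustrative and not part of any Avro library.

```python
import io
import json
import os

MAGIC = b"Obj\x01"  # first four bytes of an Avro container file

def write_long(out, n):
    """Avro long: zig-zag encode, then base-128 varint (low 7 bits first)."""
    z = (n << 1) ^ (n >> 63)
    while z > 0x7F:
        out.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    out.write(bytes([z]))

def read_long(inp):
    shift = result = 0
    while True:
        byte = inp.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (result >> 1) ^ -(result & 1)
        shift += 7

def write_bytes(out, data):
    write_long(out, len(data))  # length prefix, then the raw bytes
    out.write(data)

def read_bytes(inp):
    return inp.read(read_long(inp))

def write_header(out, schema, sync):
    out.write(MAGIC)
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": b"null"}
    write_long(out, len(meta))          # one map block holding every entry
    for key, value in meta.items():
        write_bytes(out, key.encode())
        write_bytes(out, value)
    write_long(out, 0)                  # a zero count terminates the map
    out.write(sync)                     # 16-byte sync marker

def read_header(inp):
    if inp.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        count = read_long(inp)
        if count == 0:
            break
        if count < 0:                   # negative count: a byte size follows
            count = -count
            read_long(inp)
        for _ in range(count):
            key = read_bytes(inp).decode()
            meta[key] = read_bytes(inp)
    sync = inp.read(16)
    return json.loads(meta["avro.schema"]), meta.get("avro.codec", b"null"), sync

schema = {"type": "record", "name": "User",
          "fields": [{"name": "id", "type": "long"}]}
sync = os.urandom(16)
buf = io.BytesIO()
write_header(buf, schema, sync)
buf.seek(0)
recovered, codec, marker = read_header(buf)
print(recovered["name"], codec)
```

A reader that knows nothing about the data can recover the writer's schema from the first bytes of the file, which is all it needs to decode the blocks that follow.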
The schema language.
A JSON document describing records, primitives, arrays, maps, unions, enums. For example: {"type":"record","name":"User","fields":[{"name":"id","type":"long"},{"name":"email","type":["null","string"]}]}. Nullable fields are unions with null. Default values are what make evolution work: a field added without a default cannot be filled in when a reader meets data written with the older schema, so the change breaks that reader. The schema is more verbose than a Protobuf definition, but because it is plain JSON it is easy to inspect and to process with ordinary tooling.
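The role of defaults can be sketched in a few lines. This is a deliberately simplified model of Avro's schema resolution, working on already-decoded dicts rather than on binary data, and ignoring type promotion and aliases. The `plan` field and the `resolve` helper are made up for illustration.

```python
import json

# Schema the old producer wrote with.
WRITER_SCHEMA = json.loads("""
{"type": "record", "name": "User", "fields": [
  {"name": "id", "type": "long"},
  {"name": "email", "type": ["null", "string"]}
]}
""")

# Newer schema used by the reader: one added field, with a default.
READER_SCHEMA = json.loads("""
{"type": "record", "name": "User", "fields": [
  {"name": "id", "type": "long"},
  {"name": "email", "type": ["null", "string"], "default": null},
  {"name": "plan", "type": "string", "default": "free"}
]}
""")

def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema.

    Fields the reader does not declare are dropped; fields the record
    lacks are filled from the reader's default, or rejected if none exists.
    """
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return out

old_record = {"id": 42, "email": None}  # written before "plan" existed
print(resolve(old_record, READER_SCHEMA))
# {'id': 42, 'email': None, 'plan': 'free'}
```

Remove the default from `plan` and the same call raises, which is the breakage the paragraph above describes.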
The wire format.
The data itself is compact binary: zig-zag variable-length integers, length-prefixed strings, primitive types laid out in field order. No field tags; the decoder uses the schema to know which bytes are which. Result: payloads are often somewhat smaller than Protobuf's for the same data, because there is no per-field tag overhead, though the exact saving depends on the shape of the record. The catch: decoding requires knowing the writer's schema. Lose the schema, and the bytes are unreadable.
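A small sketch of the two primitives mentioned above, written against the encoding rules in the Avro specification (zig-zag plus base-128 varint for longs, a long length prefix for strings). The two-field layout (`id`, `name`) is a made-up example; the point is that the decoder is driven entirely by the field order it is given.

```python
def encode_long(n: int) -> bytes:
    z = (n << 1) ^ (n >> 63)            # zig-zag: small magnitudes stay small
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)   # 7 payload bits, high bit = "more"
        z >>= 7
    out.append(z)
    return bytes(out)

def decode_long(buf: bytes, pos: int = 0):
    shift = result = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (result >> 1) ^ -(result & 1), pos
        shift += 7

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

def decode_string(buf: bytes, pos: int = 0):
    length, pos = decode_long(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length

# A record is just its fields concatenated in schema order, with no tags.
FIELDS = [("id", decode_long), ("name", decode_string)]
payload = encode_long(42) + encode_string("alice")

pos, record = 0, {}
for name, decoder in FIELDS:
    record[name], pos = decoder(payload, pos)

print(payload.hex())  # 540a616c696365
print(record)         # {'id': 42, 'name': 'alice'}
```

Swap the order of `FIELDS` and the same bytes decode to nonsense or fail outright, which is why the payload is useless without the schema.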
Schema Registry.
Confluent's Schema Registry (source-available, with API-compatible alternatives such as Karapace and Apicurio) is the de facto way to ship Avro on Kafka. The producer registers the schema, gets back a 4-byte schema ID, and prepends a 5-byte header to every message: a zero magic byte followed by the ID in big-endian order. Consumers read the ID, fetch the schema (usually caching it), and decode. Compatibility checks happen at registration time: the registry refuses to register a schema that violates the subject's configured compatibility mode, which is backward compatibility by default.
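The framing used by Confluent's serializers can be sketched with `struct` alone: one magic byte with value 0, then the schema ID as a 4-byte big-endian integer, then the Avro payload. The in-memory `SCHEMAS` dict below stands in for a real registry lookup, and the ID 7 is made up.

```python
import struct

MAGIC_BYTE = 0
HEADER = struct.Struct(">bI")  # 1 magic byte + 4-byte big-endian schema ID

# Stand-in for the registry: schema ID -> schema JSON (hypothetical content).
SCHEMAS = {7: '{"type": "long"}'}

def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix an Avro-encoded payload with the 5-byte registry header."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + payload

def unframe(message: bytes):
    """Split a message into (schema_id, payload), validating the header."""
    if len(message) < HEADER.size:
        raise ValueError("message shorter than the 5-byte header")
    magic, schema_id = HEADER.unpack_from(message)
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[HEADER.size:]

message = frame(7, b"\x54")          # 0x54 is the Avro long 42
schema_id, payload = unframe(message)
print(message.hex())                 # 000000000754
print(schema_id, SCHEMAS[schema_id], payload)
```

A real consumer would replace the dict lookup with a cached HTTP call to the registry; the header parsing is the same.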
A worked record.
Schema: User with id (long), email (nullable string), createdAt (long, logicalType timestamp-millis). Data: { id: 42, email: "alice@example.com", createdAt: 1748736000000 }. Encoded: 26 bytes (1-byte varint for 42, 1-byte union branch index, 1-byte length plus 17 bytes of string, 6-byte varint timestamp). With Schema Registry framing: 1 magic byte + 4-byte schema ID + 26 bytes = 31 bytes on the wire. The same record as compact JSON: 63 bytes, more with whitespace. Roughly 2× smaller with no entropy-coding compression layer.
User record
JSON 63B → Avro 31B
Schema-driven binary, registry ID prefix.
5B header + 26B payload
= about 2× smaller than JSON
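Byte counts like these are easy to get wrong by hand, so here is a sketch that encodes the User record following the Avro binary rules and measures it. It assumes the union is declared as ["null", "string"] (so a string value is branch 1) and uses a made-up schema ID of 1 for the registry header; the JSON size is for the compact form with no whitespace.

```python
import json
import struct

def encode_long(n: int) -> bytes:
    z = (n << 1) ^ (n >> 63)            # zig-zag, then base-128 varint
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_user(user: dict) -> bytes:
    out = encode_long(user["id"])
    if user["email"] is None:
        out += encode_long(0)           # union branch 0: null, no payload
    else:
        data = user["email"].encode("utf-8")
        out += encode_long(1)           # union branch 1: string
        out += encode_long(len(data)) + data
    out += encode_long(user["createdAt"])  # timestamp-millis is a plain long
    return out

user = {"id": 42, "email": "alice@example.com", "createdAt": 1748736000000}

payload = encode_user(user)
framed = struct.pack(">bI", 0, 1) + payload          # magic byte + schema ID
as_json = json.dumps(user, separators=(",", ":")).encode("utf-8")

print(len(payload), len(framed), len(as_json))       # 26 31 63
```

The union branch index is the byte that hand counts tend to miss: every nullable field costs one extra byte even when a value is present.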
vs Protobuf, JSON, Parquet.
Protobuf: schemas in .proto files, field numbers for evolution, a per-field tag on the wire, no built-in registry. JSON: human-readable, much bigger, no schema attached to the data. Parquet: column-oriented, optimised for analytics scans, poorly suited to row-by-row writing. Avro is the right answer for Kafka and other streaming row-oriented pipelines; Parquet is the right answer for the data lake those streams write into. Many pipelines convert Avro → Parquet at ingestion time.