Lake House

Delta Lake

Topics backed by a JSON Schema, Apache Avro, or Protocol Buffers schema can be written as Delta Lake tables.

Tansu automatically maps each type used in the schema to an equivalent type in Apache Arrow, using Apache Parquet as a mezzanine format, with minor adjustments for Delta Lake.

The delta subcommand will automatically load environment variables from a file named .env in the current directory or any of its parents.
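For example, a .env file can supply the location and database instead of passing them on the command line. This is a sketch; the variable names match the [env: ...] hints shown in the options below, and the values are illustrative:

```shell
# Sketch of a .env file picked up by the delta subcommand.
# DATA_LAKE and DELTA_DATABASE are the environment variables named
# in the option help; the values here are illustrative only.
DATA_LAKE=s3://lake
DELTA_DATABASE=tansu
```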

Usage: tansu broker delta [OPTIONS] --location <LOCATION>

Options:
      --location <LOCATION>
          Apache Parquet files are written to this location, examples are: file://./lake or s3://lake/ [env: DATA_LAKE=s3://lake]
      --database <DATABASE>
          Delta database [env: DELTA_DATABASE=] [default: tansu]
      --records-per-second <RECORDS_PER_SECOND>
          Throttle the maximum number of records per second
  -h, --help
          Print help

Example

Start the broker:

tansu \
    broker \
    --storage-engine=sqlite://tansu.db \
    --schema-registry=file://./schema \
    delta \
    --location file://$(pwd)/lake

employee is a Protocol Buffers backed topic, with the following schema in schema/employee.proto:

syntax = "proto3";

message Key {
  int32 id = 1;
}

message Value {
  string name = 1;
  string email = 2;
}

Using the topic create command, create the employee topic:

tansu \
    topic \
    create \
    employee \
    --config tansu.lake.partition=meta.year,meta.month,meta.day \
    --config tansu.lake.normalize=true

The above command creates a tansu.employee table that is normalized and partitioned on the meta.year, meta.month and meta.day fields of the Kafka message:

config               | value
---------------------|------------------------------
tansu.lake.partition | meta.year,meta.month,meta.day
tansu.lake.normalize | true

Sample employee data is in data/employees.json:

[
  {
    "key": { "id": 12321 },
    "value": { "name": "Bob", "email": "bob@example.com" }
  },
  {
    "key": { "id": 32123 },
    "value": { "name": "Alice", "email": "alice@example.com" }
  }
]
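The sketch below recreates the sample file shown above and runs a quick sanity check before producing. The grep-based check is illustrative only, not part of the tansu CLI:

```shell
# Recreate the sample payload from above (an illustrative setup step).
mkdir -p data
cat > data/employees.json <<'EOF'
[
  {
    "key": { "id": 12321 },
    "value": { "name": "Bob", "email": "bob@example.com" }
  },
  {
    "key": { "id": 32123 },
    "value": { "name": "Alice", "email": "alice@example.com" }
  }
]
EOF

# Sanity check: every record should carry both a key and a value.
test "$(grep -c '"key"' data/employees.json)" \
  -eq "$(grep -c '"value"' data/employees.json)" && echo "payload OK"
```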

Publish the sample data onto the employee topic (using tansu cat):

tansu \
    cat \
    produce \
    --schema-registry=file://./schema \
    employee \
    data/employees.json

We can view the files created in ./lake with:

ls -1 $(find lake -type f)

Note that the tansu.employee table is partitioned on meta.year, meta.month and meta.day:

lake/tansu.employee/_delta_log/00000000000000000000.json
lake/tansu.employee/_delta_log/00000000000000000001.json
lake/tansu.employee/meta.year=2025/meta.month=12/meta.day=19/part-00000-da02a70c-af10-4f6b-a5d7-b3751c9f51cc-c000.parquet

To view the Delta Lake table in DuckDB with delta_scan:

duckdb :memory: "select * from delta_scan('./lake/tansu.employee');"

Giving the following output:

meta.partition | meta.timestamp          | meta.year | meta.month | meta.day | key.id | value.name | value.email
---------------|-------------------------|-----------|------------|----------|--------|------------|------------------
0              | 2025-12-19 13:59:06.212 | 2025      | 12         | 19       | 12321  | Bob        | bob@example.com
0              | 2025-12-19 13:59:06.212 | 2025      | 12         | 19       | 32123  | Alice      | alice@example.com

This and further examples of schema-backed topics using Delta Lake can be found here.
