Lake House
Delta Lake
Topics backed by JSON Schema, Apache Avro or Protocol Buffers schemas can be written as Delta Lake tables.
Tansu automatically maps each type used in the schema to an equivalent in Apache Arrow, using Apache Parquet as a mezzanine format with minor changes for Delta Lake.
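As a rough sketch of that mapping, the table below pairs the Protocol Buffers scalar types with their conventional Apache Arrow equivalents; this is illustrative only, not a dump of Tansu's internal type table:

```python
# Illustrative mapping of Protocol Buffers scalar types to their
# conventional Apache Arrow equivalents; Tansu's actual mapping may differ.
PROTO_TO_ARROW = {
    "int32": "int32",
    "int64": "int64",
    "float": "float32",
    "double": "float64",
    "bool": "boolean",
    "string": "utf8",
    "bytes": "binary",
}

def arrow_type(proto_type: str) -> str:
    """Look up the Arrow type name for a Protocol Buffers scalar type."""
    return PROTO_TO_ARROW[proto_type]

print(arrow_type("string"))  # utf8
```

Message types (such as the Key and Value messages used later in this example) become Arrow struct types containing the mapped fields.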
The delta subcommand will automatically load environment variables from a file named .env in the current directory or any of its parents.
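Because of this, the --location and --database options can instead be supplied via their environment variables; a minimal sketch of such a file (the values are illustrative):

```
# .env — picked up from the current directory or any parent
DATA_LAKE=file://./lake
DELTA_DATABASE=tansu
```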
Usage: tansu broker delta [OPTIONS] --location <LOCATION>
Options:
--location <LOCATION>
Apache Parquet files are written to this location, examples are: file://./lake or s3://lake/ [env: DATA_LAKE=s3://lake]
--database <DATABASE>
Delta database [env: DELTA_DATABASE=] [default: tansu]
--records-per-second <RECORDS_PER_SECOND>
Throttle the maximum number of records per second
-h, --help
Print help
Example
Start the broker:
tansu \
broker \
--storage-engine=sqlite://tansu.db \
--schema-registry=file://./schema \
delta \
--location file://$(pwd)/lake
employee is a Protocol Buffers backed topic, with the following schema/employee.proto:
syntax = 'proto3';
message Key {
int32 id = 1;
}
message Value {
string name = 1;
string email = 2;
}
Using the topic create command, create the employee topic:
tansu \
topic \
create \
employee \
--config tansu.lake.partition=meta.year,meta.month,meta.day \
--config tansu.lake.normalize=true
The above command will create a tansu.employee table that is normalized and partitioned on the meta.year, meta.month and meta.day of each Kafka message:
| config | value |
|---|---|
| tansu.lake.partition | meta.year,meta.month,meta.day |
| tansu.lake.normalize | true |
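The meta.year, meta.month and meta.day partition values come from the message's metadata; a sketch of how such a triple can be derived from a Kafka timestamp (milliseconds since the Unix epoch, assuming UTC) — illustrative, not Tansu's code:

```python
from datetime import datetime, timezone

# Derive meta.year/meta.month/meta.day partition values from a Kafka
# message timestamp in milliseconds since the Unix epoch (assuming UTC).
def partition_values(timestamp_ms: int) -> dict[str, int]:
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return {"meta.year": ts.year, "meta.month": ts.month, "meta.day": ts.day}

print(partition_values(1766152746212))
# {'meta.year': 2025, 'meta.month': 12, 'meta.day': 19}
```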
Sample employee data is in data/employees.json:
[
{
"key": { "id": 12321 },
"value": { "name": "Bob", "email": "bob@example.com" }
},
{
"key": { "id": 32123 },
"value": { "name": "Alice", "email": "alice@example.com" }
}
]
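Each record in the produce input pairs a Key with a Value message as plain JSON; a quick Python check of that shape, inlining the sample data above:

```python
import json

# The sample records from data/employees.json, inlined; each record
# carries a "key" and a "value" object matching the employee schema.
records = json.loads("""
[
  {"key": {"id": 12321}, "value": {"name": "Bob", "email": "bob@example.com"}},
  {"key": {"id": 32123}, "value": {"name": "Alice", "email": "alice@example.com"}}
]
""")

print([record["key"]["id"] for record in records])  # [12321, 32123]
```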
Publish the sample data to the employee topic (using tansu cat):
tansu \
cat \
produce \
--schema-registry=file://./schema \
employee \
data/employees.json
We can view the files created in ./lake with:
ls -1 $(find lake -type f)
Note that the tansu.employee table is partitioned on meta.year, meta.month and meta.day:
lake/tansu.employee/_delta_log/00000000000000000000.json
lake/tansu.employee/_delta_log/00000000000000000001.json
lake/tansu.employee/meta.year=2025/meta.month=12/meta.day=19/part-00000-da02a70c-af10-4f6b-a5d7-b3751c9f51cc-c000.parquet
To view the Delta Lake table in DuckDB with delta_scan:
duckdb :memory: "select * from delta_scan('./lake/tansu.employee');"
Giving the following output:
| meta.partition | meta.timestamp | meta.year | meta.month | meta.day | key.id | value.name | value.email |
|---|---|---|---|---|---|---|---|
| 0 | 2025-12-19 13:59:06.212 | 2025 | 12 | 19 | 12321 | Bob | bob@example.com |
| 0 | 2025-12-19 13:59:06.212 | 2025 | 12 | 19 | 32123 | Alice | alice@example.com |
This and further examples of schema backed topics using Delta Lake can be found here.