Formats
Formats are Metabolic's IO capabilities: they make reading and writing data trivial. Currently, the following physical formats are supported:
Iceberg (default Table format)
Delta Lake
Parquet Files
Json Files
CSV Files
Kafka Topics
Additionally, Metabolic provides a virtual format named catalog.
Batch vs Streaming formats
Unlike other processing engines such as Flink or KSQL, Metabolic doesn't restrict you to a specific format depending on whether you run your entities in Batch or Streaming. That said, Iceberg, Delta and Kafka are the preferred options if you plan to switch frequently between the two (for example, when following a Kappa Architecture) or to run batch and streaming jobs side by side.
Iceberg
Iceberg is the default Metabolic Table format, as it provides atomic operations over a data lake, allowing file-based storage to behave like a database.
Iceberg operates by using a catalog for both reading and writing data, rather than relying directly on the underlying filesystem. This approach provides a more flexible and robust way to manage large-scale datasets, as the catalog stores metadata about the tables, including their schema, partitions, and versions.
How to read an Iceberg Source
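As a rough sketch, an Iceberg source could be declared along the following lines. Apart from the iceberg format name and the catalog-based table addressing described above, every key name here is an illustrative assumption rather than Metabolic's documented schema:

```yaml
# Hypothetical source entry; key names other than the format are assumptions.
source:
  format: iceberg
  # Iceberg tables are addressed through the catalog, not the filesystem.
  table: my_catalog.analytics.events
```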
How to write an Iceberg Table
Iceberg supports the append, overwrite, upsert and delete write modes. Upsert needs an idColumn param to identify matching rows. The delete write mode only allows full table deletions, so both schema and data are fully removed.
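For instance, an upsert sink might be sketched as follows; the upsert mode and the idColumn param are documented above, while the remaining keys are illustrative assumptions:

```yaml
# Hypothetical sink entry; only `upsert` and `idColumn` are documented here.
sink:
  format: iceberg
  table: my_catalog.analytics.events
  writeMode: upsert     # assumed key name for the write mode
  idColumn: event_id    # identifies matching rows for the MERGE INTO
```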
Iceberg also supports Schema Evolution.
Schema Evolution support by Write Mode:

| Write Mode | Schema Evolution |
| --- | --- |
| Append | ✅ Yes |
| Overwrite (REPLACE) | ✅ Yes |
| Upsert (MERGE INTO) | ⚠️ Yes (only new columns) |
| Delete | N/A |
Delta Lake
Delta Lake is another powerful table format supported by Metabolic, with automatic optimizations built in.
How to read a Delta Source
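A minimal, hedged sketch of a Delta source; the path is illustrative and the key names are assumptions:

```yaml
# Hypothetical source entry reading a Delta table from object storage.
source:
  format: delta
  path: s3://my-bucket/delta/events
```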
How to write a Delta Table
Delta supports the append, overwrite, upsert and delete write modes. Upsert and delete need an idColumn param to identify matching rows. Upsert optionally supports an eventDtColumn to also match identical updates.
For example, to upsert on event_id while maintaining historical evolution (an event source without duplicates):
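A hedged sketch of that configuration; idColumn and eventDtColumn are the documented params, the surrounding keys are assumptions:

```yaml
# Hypothetical sink entry illustrating the documented upsert params.
sink:
  format: delta
  path: s3://my-bucket/delta/events
  writeMode: upsert
  idColumn: event_id       # rows are matched by event id
  eventDtColumn: event_dt  # identical (event_id, event_dt) updates are merged,
                           # while new timestamps are kept as history
```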
Parquet
Parquet is a very popular storage format, ideal for data lakes, as it efficiently compresses columnar data for later analysis in Hadoop-compatible ecosystems.
How to read a Parquet Source
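A minimal sketch, with an illustrative path and assumed key names:

```yaml
# Hypothetical source entry pointing at a directory of Parquet files.
source:
  format: parquet
  path: s3://my-bucket/raw/events/
```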
How to write a Parquet Table
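And the matching sink, under the same assumptions:

```yaml
# Hypothetical sink entry writing Parquet files to object storage.
sink:
  format: parquet
  path: s3://my-bucket/clean/events/
```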
Json
JSON is another storage format popular in the analytics community, as it safely serializes records while remaining very human-readable. Metabolic specifically uses the JSON Lines format.
How to read a JSON Source
Since JSON doesn't enforce validation when being written, some types can be malformed. In this case you can use the useStringPrimitives option to force all columns to string type.
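A sketch of a JSON source using that option; useStringPrimitives is the documented param, the other keys are assumptions:

```yaml
# Hypothetical source entry; only `useStringPrimitives` is documented here.
source:
  format: json
  path: s3://my-bucket/raw/events/   # JSON Lines files
  useStringPrimitives: true          # read every column as string type
```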
How to write a Json Table
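A minimal sink sketch under the same assumptions:

```yaml
# Hypothetical sink entry writing JSON Lines files.
sink:
  format: json
  path: s3://my-bucket/export/events/
```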
CSV
CSV is another storage format popular in the analytics community: it stores records in tabular form, making it very compatible with the broader audience that uses Microsoft Excel and its alternatives. Metabolic requires CSV files to provide a header in the first line.
How to read a CSV Source
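A hedged sketch; the path is illustrative and the header requirement comes from this page:

```yaml
# Hypothetical source entry reading a CSV file.
source:
  format: csv
  path: s3://my-bucket/raw/events.csv
  # The first line of the file must be a header, e.g.:
  #   event_id,event_dt,payload
```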
How to write a CSV Table
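And the matching sink sketch:

```yaml
# Hypothetical sink entry writing CSV files.
sink:
  format: csv
  path: s3://my-bucket/export/events/
```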
Kafka
Kafka topics are streaming storage for events.
How to read a Kafka Source
If your Kafka cluster is protected by authentication, you can add the kafkaSecret variable to pass a resolver. Currently Metabolic only supports AWS Secrets Manager as a resolver, but in general you need a JSON with the server, API key and secret.
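A sketch of an authenticated Kafka source; kafkaSecret is the documented variable, while the topic name and secret path are illustrative:

```yaml
# Hypothetical source entry; only `kafkaSecret` is documented here.
source:
  format: kafka
  topic: events-raw
  kafkaSecret: prod/kafka/credentials   # AWS Secrets Manager entry holding a
                                        # JSON with server, API key and secret
```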
How to write a Kafka Topic
name is required in streaming jobs in order to create a checkpoint.
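A sketch of a streaming sink; name is documented as required for checkpointing, the rest is assumed:

```yaml
# Hypothetical sink entry; `name` is documented as required in streaming jobs.
name: events_stream    # used to create the streaming checkpoint
sink:
  format: kafka
  topic: events-clean
```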