Formats
Formats are Metabolic's IO capabilities: they make reading and writing data trivial. Currently, the following physical formats are supported:
Delta Tables (default)
Parquet Files
JSON Files
CSV Files
Kafka Topics
Additionally, Metabolic provides a virtual format named catalog.
Batch vs Streaming formats
Unlike other processing engines such as Flink or KSQL, Metabolic doesn't restrict you to a specific format depending on whether you run your entities in batch or streaming mode. That said, Delta and Kafka are the preferred options if you plan to switch frequently between the two modes, for example when following a Kappa architecture.
Delta
Delta is the default and preferred format to use with Metabolic, as it provides atomic operations over a data lake, allowing file-based storage to behave like a database.
How to read a Delta Source
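Metabolic entities declare their inputs in a mapping config. Below is a minimal sketch, assuming a HOCON-style config where a source lists its format and path; the sources, name, format, and path keys are illustrative and may differ in your version:

```hocon
// Hypothetical source block: read a Delta table from the lake.
sources: [
  {
    name: "raw_events"              // alias for later transformations
    format: "delta"                 // optional, since Delta is the default
    path: "s3://my-lake/raw/events" // location of the Delta table
  }
]
```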
How to write a Delta Table
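A matching sink sketch under the same illustrative assumptions, with writeMode taking one of the values listed below:

```hocon
// Hypothetical sink block: persist the entity's output as a Delta table.
sink: {
  format: "delta"
  path: "s3://my-lake/clean/events"
  writeMode: "append"               // append, overwrite, upsert or delete
}
```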
Delta supports the append, overwrite, upsert, and delete write modes. Upsert and delete require an idColumn parameter to identify matching rows. Upsert optionally accepts an eventDtColumn to also match identical updates.
For example, to upsert on event_id while maintaining historical evolution (an event source without duplicates):
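A hedged sketch of such a sink, reusing the idColumn and eventDtColumn parameters described above (all other keys are illustrative):

```hocon
// Hypothetical upsert sink: rows matching both event_id and event_dt are
// treated as duplicates and replaced; new timestamps for the same id append.
sink: {
  format: "delta"
  path: "s3://my-lake/clean/events"
  writeMode: "upsert"
  idColumn: "event_id"      // identifies matching rows
  eventDtColumn: "event_dt" // also match on the event timestamp
}
```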
Parquet
Parquet is a very popular storage format, ideal for data lakes, as it efficiently compresses columnar data for later analysis in Hadoop-compatible ecosystems.
How to read a Parquet Source
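Under the same illustrative config conventions as the Delta examples, a Parquet source might look like:

```hocon
// Hypothetical source block: read Parquet files from the lake.
sources: [
  {
    name: "raw_sales"
    format: "parquet"
    path: "s3://my-lake/raw/sales"
  }
]
```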
How to write a Parquet Table
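And the corresponding sink sketch:

```hocon
// Hypothetical sink block: write the output as Parquet files.
sink: {
  format: "parquet"
  path: "s3://my-lake/clean/sales"
}
```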
JSON
JSON is another storage format popular in the analytics community, as it safely serializes records while maintaining a very human-readable interface. Metabolic specifically uses the JSON Lines subformat.
How to read a JSON Source
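A source sketch under the same illustrative conventions; note that the files are expected to be JSON Lines:

```hocon
// Hypothetical source block: read JSON Lines files.
sources: [
  {
    name: "raw_clicks"
    format: "json"
    path: "s3://my-lake/raw/clicks"
  }
]
```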
Since JSON doesn't enforce validation when being written, some types can be malformed. In this case you can use the useStringPrimitives option to force all columns to string type.
Since useStringPrimitives does not discriminate between columns, you can pair it with op.expr to convert specific columns back to their original types.
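A hedged sketch combining both options; the useStringPrimitives flag comes from this page, while the op.expr payload and column names are purely illustrative:

```hocon
// Hypothetical source block: read everything as strings, then cast back.
sources: [
  {
    name: "raw_clicks"
    format: "json"
    path: "s3://my-lake/raw/clicks"
    useStringPrimitives: true   // every column is read as string
    op.expr: [
      "CAST(amount AS DOUBLE) AS amount",           // restore numeric column
      "CAST(clicked_at AS TIMESTAMP) AS clicked_at" // restore timestamp column
    ]
  }
]
```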
How to write a JSON Table
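A matching sink sketch:

```hocon
// Hypothetical sink block: write the output as JSON Lines files.
sink: {
  format: "json"
  path: "s3://my-lake/exports/clicks"
}
```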
CSV
CSV is another storage format popular in the analytics community. It stores records in a tabular way, making it accessible to the broader audience that uses Microsoft Excel and its alternatives. Metabolic requires CSV files to provide a header in the first line:
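For instance, a minimal file could look like this (column names and values are purely illustrative):

```csv
event_id,event_dt,amount
1,2023-01-01T10:00:00Z,9.99
2,2023-01-01T10:05:00Z,4.50
```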
How to read a CSV Source
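A source sketch under the same illustrative conventions:

```hocon
// Hypothetical source block: read header-prefixed CSV files.
sources: [
  {
    name: "raw_budget"
    format: "csv"
    path: "s3://my-lake/raw/budget"
  }
]
```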
How to write a CSV Table
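And the corresponding sink sketch:

```hocon
// Hypothetical sink block: write the output as CSV files with a header.
sink: {
  format: "csv"
  path: "s3://my-lake/exports/budget"
}
```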
CSV sinks can be extremely useful at the end of entity transformations, as a deliverable for business stakeholders, but CSV is not recommended as an intermediate format when doing modular transformations.
Kafka
Kafka topics are streaming storage for events.
How to read a Kafka Source
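A source sketch under the same illustrative conventions; the topic and servers keys are assumptions about how a Kafka source is pointed at its brokers:

```hocon
// Hypothetical source block: consume events from a Kafka topic.
sources: [
  {
    name: "user_events"
    format: "kafka"
    topic: "user-events"                        // topic to subscribe to
    servers: ["broker-1:9092", "broker-2:9092"] // bootstrap servers
  }
]
```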
If your Kafka cluster is protected by authentication, you can add the kafkaSecret variable to pass a resolver. Currently Metabolic only supports AWS Secrets Manager as a resolver; in general, the secret is a JSON object containing the server, the API key, and the secret.
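A plausible secret payload stored in AWS Secrets Manager might look like the following; the exact field names are assumptions based on the description above:

```json
{
  "servers": "broker-1:9092,broker-2:9092",
  "api_key": "MY_API_KEY",
  "api_secret": "MY_API_SECRET"
}
```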
How to write a Kafka Topic
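A sink sketch under the same illustrative conventions, including the name field required for streaming checkpoints (see the note below) and the kafkaSecret resolver described above:

```hocon
// Hypothetical streaming entity: publish the output to a Kafka topic.
name: "enriched_events"     // required: used to create the checkpoint
sink: {
  format: "kafka"
  topic: "enriched-events"
  servers: ["broker-1:9092"]
  kafkaSecret: "my-kafka-credentials" // optional AWS Secrets Manager resolver
}
```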
name is required in streaming jobs in order to create a checkpoint.