Models

A Model is a tool to expose a data domain or part of it.

A domain is exposed through one or more models and declared using a Config file.

entities: [
    {
        name: My Domain Entity
        sources: [
            {
                catalog: data_lake.my_operational_table
                name: data_lake_my_operational_table_last_month
                op.filter: {
                   onColumn: period_month
                   from: ${df.start_of_month}
                   to: ${df.now}
                }
            }
        ]
        mapping: {
            sql: "SELECT period_month, count(distinct my_category) FROM data_lake_my_operational_table_last_month GROUP BY 1"
        }
        sink: {
            outputPath: ${dp.dl_gold_bucket}/my_analytical_table
            format: PARQUET
            writeMode: replace
            op.date_partition: {
                col: period_month
            }
        }
    }
]

Every Config file produces at least one entity. Each entity is defined by four sections: a Name, Source(s), Mapping(s) and a Sink.
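As a rough illustration of the four-section structure, the sketch below represents one entity as a plain Python dictionary and checks that all four sections are present. The helper `missing_sections` and the dictionary representation are assumptions made for this example; they are not part of Metabolic, which reads the Config file itself.

```python
from typing import List

# The four top-level keys used by each entity in the Config example above.
REQUIRED_SECTIONS = ("name", "sources", "mapping", "sink")


def missing_sections(entity: dict) -> List[str]:
    """Return the required sections that are absent from an entity (hypothetical helper)."""
    return [key for key in REQUIRED_SECTIONS if key not in entity]


# The example entity from this page, written as a Python dictionary.
entity = {
    "name": "My Domain Entity",
    "sources": [
        {
            "catalog": "data_lake.my_operational_table",
            "name": "data_lake_my_operational_table_last_month",
        }
    ],
    "mapping": {
        "sql": "SELECT period_month, count(distinct my_category) "
               "FROM data_lake_my_operational_table_last_month GROUP BY 1"
    },
    "sink": {
        "outputPath": "${dp.dl_gold_bucket}/my_analytical_table",
        "format": "PARQUET",
        "writeMode": "replace",
    },
}

print(missing_sections(entity))            # complete entity
print(missing_sections({"name": "Draft"}))  # entity with only a name
```

A check like this is only meant to make the shape of an entity concrete; it says nothing about how Metabolic validates its own configuration.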

Name

The human-readable string that best describes the entity this model produces. It is mostly used for logging and traceability.

Sources

In the Sources section you describe the entities in the Data Lake you are reading from. Each one is defined as:

An input path
A name
(Optional) A format. By default it's Delta.
(Optional) A list of source operations.

Before going into detail, it's worth noting that everything in Metabolic is a CTE.

A CTE (Common Table Expression) is the temporary result of a query that exists only within the context of a larger query. Much like a derived table, the result of a CTE is not stored and exists only for the duration of the query.

Metabolic will create a CTE from an input path and the specified format, and assign it a name. If operations are defined, it will apply them in order. The resulting CTE is what gets exposed to the Mapping.
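To make the idea concrete, here is a minimal sketch of a named, filtered source being read by a mapping query. It uses Python's built-in sqlite3 module rather than Spark, and the sample rows, the literal month bounds and the hand-written WITH clause are all assumptions for illustration; Metabolic's actual mechanism for building the CTE is not shown here.

```python
import sqlite3

# In-memory stand-in for the operational table in the Data Lake.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_operational_table (period_month TEXT, my_category TEXT)"
)
conn.executemany(
    "INSERT INTO my_operational_table VALUES (?, ?)",
    [
        ("2023-01", "a"),
        ("2023-02", "a"),
        ("2023-02", "b"),
        ("2023-02", "b"),
    ],
)

# The WITH clause plays the role of the named source after its filter
# operation; the final SELECT plays the role of the mapping SQL.
query = """
WITH data_lake_my_operational_table_last_month AS (
    SELECT *
    FROM my_operational_table
    WHERE period_month >= '2023-02' AND period_month <= '2023-02'
)
SELECT period_month, count(DISTINCT my_category)
FROM data_lake_my_operational_table_last_month
GROUP BY period_month
"""
rows = conn.execute(query).fetchall()
print(rows)

# The CTE is not stored: referencing its name in a later query fails.
try:
    conn.execute("SELECT * FROM data_lake_my_operational_table_last_month")
    cte_persisted = True
except sqlite3.OperationalError:
    cte_persisted = False
print(cte_persisted)
```

The mapping query only ever sees the source by its assigned name, which is why the SQL stays the same regardless of where the underlying data lives.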

Mappings

In the Mappings section you describe how to manipulate those sources into a final entity using SQL.

You can either write the SQL inline or reference an external .sql file.

Because all the sources are generated as CTEs, the SQL syntax is standard, giving you extreme portability without sacrificing power.

Be aware that advanced functions beyond the SQL standard differ between SQL flavours. Currently, Metabolic only supports the SparkSQL flavour.

Additionally, you can add a list of mapping operations.

Sink

In the Sink section you describe how to materialize the output. It is defined as:

An output path
A write mode
(Optional) A time partition key.
(Optional) A format. By default it's Delta.
(Optional) A list of sink operations.

Overrides

When executing a model in historical mode, operations that constrain the temporal input or output are overridden.
