Operations
An Operation (Op for short) is a recipe authored by Data Specialists (such as Data Engineers or Data Analysts) and made available to other technical profiles.
Source Ops
Filter Op
Restricts the input to a dynamic temporal window, based on a date time column. Works very well with Defaults placeholders.
op.filter = {
onColumn = "date_time"
from = ${df.start_of_yesterday}
to = ${df.now}
}
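As a sketch, the window can also be pinned to fixed ISO timestamps instead of placeholders (assuming literal values are accepted wherever a placeholder is; the values here are illustrative):

```
op.filter = {
  onColumn = "date_time"
  from = "2023-01-01T00:00:00Z"
  to = "2023-01-02T00:00:00Z"
}
```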
Prune Op
Restricts the input to a dynamic temporal window, based on its temporal partitions. Works very well with Defaults placeholders.
op.prune = {
from = ${df.start_of_yesterday}
to = ${df.now}
}
Dedupe Op
Removes duplicates based on the specified unique identifier (Id) criteria and the most recent Order criteria. This always returns one record per id group, so no duplicates remain in the output.
op.dedupe = {
idColumns = ["id"]
orderColumns = ["updated_at"]
order = "asc"
}
| Parameter | Description | Default |
| --- | --- | --- |
| idColumns | Column names to deduplicate your rows on. | Array with one value |
| order | Order in which the single record gets chosen. | desc |
| orderColumns | Column names to order matching duplicates. | Empty array |
This recipe is known as deduplication and allows you to obtain the last unique state of a source.
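For example, keeping the most recent state per id would order descending (a sketch; the column names are illustrative):

```
op.dedupe = {
  idColumns = ["id"]
  orderColumns = ["updated_at"]
  order = "desc"
}
```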
Expression Op (since 0.10.0+)
Adds a new column based on a select expression. A mini mapping before the mapping or any other ops.
op.expr = {
expressions = ["timestamp(replace(updated, 'Z', '.000Z')) as updated_at"]
}
A single expression is also supported, but not recommended (and it might be deprecated in future versions):
op.expr = {
expression = "date(updated_at)"
}
Drop Op (since 0.10.0+)
Removes existing columns. A mini mapping before the mapping or any other ops.
op.drop = {
columns = ["email", "phone"]
}
A single drop is also supported, but not recommended (and it might be deprecated in future versions):
op.drop = {
column = "id"
}
Demulti Op
Generates duplicate rows over a specified time dimension (Day, Month or Year), exposed in a new column named period.
op.demulti = {
dateColumn = "updated_at"
from = "${df.previous_first_day_of_year}"
to = ${df.now}
format = "Month"
idColumns = ["id"]
orderColumns = ["extracted_at"]
endOfMonth = true
}
| Parameter | Description | Type / Default |
| --- | --- | --- |
| dateColumn | Name of a timestamp column to generate the cohorts from. | Required String |
| from | Timestamp of the first cohort. | Required (ISO) |
| to | Timestamp of the last cohort. | Required (ISO) |
| format | Cohort period. | "Day", "Month" or "Year" |
| idColumns | Column names to deduplicate your rows on. | Array with one value |
| orderColumns | Column names to order matching duplicates. | Empty array |
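For instance, generating daily cohorts instead of monthly ones might look like this (a sketch; the column names are illustrative, and endOfMonth is assumed to matter only for Month cohorts):

```
op.demulti = {
  dateColumn = "created_at"
  from = ${df.start_of_yesterday}
  to = ${df.now}
  format = "Day"
  idColumns = ["id"]
  orderColumns = ["extracted_at"]
}
```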
Flatten Op
Flattens structures into columns.
op.flatten {
column = "_airbyte_data"
}
If column is not specified, it flattens all struct columns into parent_child column names with the right format. If specified, it flattens only that parent column, naming the new columns directly with the child names.
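Assuming an empty body is accepted when no column is given (an assumption, not confirmed by the source), flattening every struct column would be:

```
op.flatten {}
```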
Watermark Op
Constrains how far in time a source will be considered.
op.watermark {
onColumn: "created_at"
value: "60 seconds"
}
Generally used to limit the scope of the source to a certain time window. This is useful for sources that are not append-only, such as a table that receives new data but also holds old data that is no longer relevant. It is a must in streaming modes, especially when joining with other sources: it keeps the join window small so the job does not run out of memory waiting for a possibly delayed source.
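For instance, when joining two streams, each source could carry its own watermark sized to its expected delay (a sketch; the column names and values here are illustrative):

```
// on the fast, low-latency source
op.watermark {
  onColumn: "created_at"
  value: "60 seconds"
}

// on the slower source, allow for a larger delay
op.watermark {
  onColumn: "extracted_at"
  value: "10 minutes"
}
```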
Mapping Operations
Mapping Operations have all the context of the sources and can be executed before (preOps section) or after (postOps section) the SQL Mapping.
Intervals Op (Beta)
Generates intervalic states off two sources with unrelated time dimensions.
op.intervals = {
leftTableName = "clean_companies"
rightTableName = "raw_subscriptions"
leftIdColumnName = "id"
rightIdColumnName = "id"
leftWindowColumnName = "updated_at"
rightWindowColumnName = "extracted_at"
result: "companies_x_subscriptions"
}
| Parameter | Description | Required |
| --- | --- | --- |
| *TableName | Name of a source table. | Required |
| *IdColumnName | Name of the joining column. | Required |
| *WindowColumnName | Name of a timestamp column. | Required |
| result | Name of the resulting table. | Required |
This Operation produces a CTE with the two tables joined, yielding a row for each state that happened in the past.
Sink Operations (0.15.0+)
There are two groups of sink operations, transformative and operational:
Transformative Ops modify the contents, similarly to Source Ops:
Flatten Op
See the source Flatten Op for description and usage.
Operational Ops do not modify the contents but the behaviour of how sinks are applied:
Columns Partition Op
Defines which existing columns should be used for partitioning.
op.cols_partition {
cols = ["a","b","c"]
}
Date Partition Op
Defines which date columns should be used for partitioning into date time components automatically.
op.date_partition {
col = "updated_at"
}
Schema Management Op
If you want metabolic to make all changes backwards compatible, you can use the manage_schema op to automatically version your sinks.
op.manage_schema {}