---
title: "Understand Flink CDC API"
weight: 1
type: docs
aliases:
---
If you are planning to build your own Flink CDC connectors, or considering contributing to Flink CDC, you might want to have a deeper look at the APIs of Flink CDC. This document will go through some important concepts and interfaces in order to help you with your development.
## Event

An event under the context of Flink CDC is a special kind of record in Flink's data stream. It describes the captured changes in the external system on the source side, gets processed and transformed by internal operators built by Flink CDC, and is finally passed to the data sink, then written or applied to the external system on the sink side.
Each change event contains the table ID it belongs to, and the payload that the event carries. Based on the type of payload, we categorize events into these kinds:
### DataChangeEvent

`DataChangeEvent` describes data changes in the source. It consists of 5 fields:

- `Table ID`: table ID it belongs to
- `Before`: pre-image of the data
- `After`: post-image of the data
- `Operation type`: type of the change operation
- `Meta`: metadata of the change
For the operation type field, we pre-define 4 operation types:

- Insert: new data entry, with `before = null` and `after = new data`
- Delete: removal of data, with `before = removed data` and `after = null`
- Update: update of existing data, with `before = data before change` and `after = data after change`
- Replace:
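As an illustration, here is a minimal sketch of constructing the four kinds of data change events. It assumes the static factory methods on `DataChangeEvent` and the `BinaryRecordDataGenerator` helper from Flink CDC 3.x; check the exact signatures against the version you build on. Note that the list above leaves Replace's images unspecified, so the comment on the last event is an assumption.

```java
import org.apache.flink.cdc.common.data.RecordData;
import org.apache.flink.cdc.common.data.binary.BinaryStringData;
import org.apache.flink.cdc.common.event.DataChangeEvent;
import org.apache.flink.cdc.common.event.TableId;
import org.apache.flink.cdc.common.types.DataTypes;
import org.apache.flink.cdc.common.types.RowType;
import org.apache.flink.cdc.runtime.typeutils.BinaryRecordDataGenerator;

public class DataChangeEventExamples {

    public static void main(String[] args) {
        TableId tableId = TableId.tableId("mydb", "products");

        // Packs plain Java values into the binary row format carried by events.
        BinaryRecordDataGenerator generator =
                new BinaryRecordDataGenerator(RowType.of(DataTypes.INT(), DataTypes.STRING()));
        RecordData before = generator.generate(
                new Object[] {1, BinaryStringData.fromString("spare tire")});
        RecordData after = generator.generate(
                new Object[] {1, BinaryStringData.fromString("snow tire")});

        // Insert: before = null, after = new data.
        DataChangeEvent insert = DataChangeEvent.insertEvent(tableId, after);

        // Delete: before = removed data, after = null.
        DataChangeEvent delete = DataChangeEvent.deleteEvent(tableId, before);

        // Update: both the pre-image and the post-image are carried.
        DataChangeEvent update = DataChangeEvent.updateEvent(tableId, before, after);

        // Replace: assumed upsert-style change carrying only the post-image.
        DataChangeEvent replace = DataChangeEvent.replaceEvent(tableId, after);
    }
}
```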
### SchemaChangeEvent

`SchemaChangeEvent` describes schema changes in the source. Compared to `DataChangeEvent`, the payload of `SchemaChangeEvent` describes changes in the table structure of the external system, including:
- `AddColumnEvent`: new column in the table
- `AlterColumnTypeEvent`: type change of a column
- `CreateTableEvent`: creation of a new table; also used to describe the schema of a pre-emitted `DataChangeEvent`
- `DropColumnEvent`: removal of a column
- `RenameColumnEvent`: name change of a column
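A corresponding sketch for schema change events, again assuming the Flink CDC 3.x event classes and builders (`Schema.newBuilder()`, `Column.physicalColumn(...)`, and `AddColumnEvent.ColumnWithPosition` are taken from that API and may differ in your version):

```java
import java.util.Collections;

import org.apache.flink.cdc.common.event.AddColumnEvent;
import org.apache.flink.cdc.common.event.CreateTableEvent;
import org.apache.flink.cdc.common.event.TableId;
import org.apache.flink.cdc.common.schema.Column;
import org.apache.flink.cdc.common.schema.Schema;
import org.apache.flink.cdc.common.types.DataTypes;

public class SchemaChangeEventExamples {

    public static void main(String[] args) {
        TableId tableId = TableId.tableId("mydb", "products");

        // CreateTableEvent carries the full initial schema of the table.
        Schema schema =
                Schema.newBuilder()
                        .physicalColumn("id", DataTypes.INT())
                        .physicalColumn("name", DataTypes.STRING())
                        .primaryKey("id")
                        .build();
        CreateTableEvent createTable = new CreateTableEvent(tableId, schema);

        // AddColumnEvent describes one or more columns appended to the table.
        AddColumnEvent addColumn =
                new AddColumnEvent(
                        tableId,
                        Collections.singletonList(
                                new AddColumnEvent.ColumnWithPosition(
                                        Column.physicalColumn("weight", DataTypes.DOUBLE()))));
    }
}
```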
As you may have noticed, a data change event doesn't have its schema bound to it. This reduces the size of data change events and the overhead of serialization, but makes them not self-descriptive. Then how does the framework know how to interpret a data change event?
To resolve the problem, the framework adds a requirement to the flow of events: a `CreateTableEvent` must be emitted before any `DataChangeEvent` if a table is new to the framework, and a `SchemaChangeEvent` must be emitted before any `DataChangeEvent` if the schema of a table is changed. This requirement ensures that the framework is aware of the schema before processing any data changes.
{{< img src="/fig/flow-of-events.png" alt="Flow of Events" >}}
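To make the ordering concrete, a well-formed stream for a table that is new to the framework can be sketched as below, reusing the event classes assumed in the earlier examples:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.flink.cdc.common.data.RecordData;
import org.apache.flink.cdc.common.event.CreateTableEvent;
import org.apache.flink.cdc.common.event.DataChangeEvent;
import org.apache.flink.cdc.common.event.Event;
import org.apache.flink.cdc.common.event.TableId;
import org.apache.flink.cdc.common.schema.Schema;

public class EventOrdering {

    /** Builds the minimal legal sequence: the schema travels before the data. */
    static List<Event> wellFormedStream(TableId tableId, Schema schema, RecordData row) {
        return Arrays.asList(
                // 1. The table is new to the framework: emit its schema first.
                new CreateTableEvent(tableId, schema),
                // 2. Only now may data changes for this table follow.
                DataChangeEvent.insertEvent(tableId, row));
    }
}
```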
## Data Source

A data source works as a factory of `EventSource` and `MetadataAccessor`, constructing runtime implementations of the source that capture changes from the external system and provide its metadata.
`EventSource` is a Flink source that reads changes, converts them to events, then emits them to downstream Flink operators. You can refer to the Flink documentation to learn the internals and how to implement a Flink source.
`MetadataAccessor` serves as the metadata reader of the external system: it lists namespaces, schemas and tables, and provides the table schema (table structure) of a given table ID.
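Sketched as a skeleton, the source side of a custom connector could look like this. The `DataSource`, `EventSourceProvider`, and `MetadataAccessor` types and their method signatures are assumed from the Flink CDC 3.x `org.apache.flink.cdc.common.source` package; `MyDataSource` and `MyMetadataAccessor` are purely hypothetical names.

```java
import java.util.Collections;
import java.util.List;
import javax.annotation.Nullable;

import org.apache.flink.cdc.common.event.TableId;
import org.apache.flink.cdc.common.schema.Schema;
import org.apache.flink.cdc.common.source.DataSource;
import org.apache.flink.cdc.common.source.EventSourceProvider;
import org.apache.flink.cdc.common.source.MetadataAccessor;

/** Hypothetical connector entry point on the source side. */
public class MyDataSource implements DataSource {

    @Override
    public EventSourceProvider getEventSourceProvider() {
        // Typically wraps a regular Flink source that emits events, e.g.
        // FlinkSourceProvider.of(new MyEventSource()).
        throw new UnsupportedOperationException("plug in your Flink source here");
    }

    @Override
    public MetadataAccessor getMetadataAccessor() {
        return new MyMetadataAccessor();
    }

    /** Hypothetical metadata reader backed by the external system's catalog. */
    static class MyMetadataAccessor implements MetadataAccessor {

        @Override
        public List<String> listNamespaces() {
            return Collections.singletonList("default");
        }

        @Override
        public List<String> listSchemas(@Nullable String namespace) {
            return Collections.singletonList("mydb");
        }

        @Override
        public List<TableId> listTables(@Nullable String namespace, @Nullable String schemaName) {
            return Collections.singletonList(TableId.tableId("mydb", "products"));
        }

        @Override
        public Schema getTableSchema(TableId tableId) {
            // Query the external system and translate its table structure.
            throw new UnsupportedOperationException("read the schema from the external system");
        }
    }
}
```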
## Data Sink

Symmetrical with the data source, a data sink consists of `EventSink` and `MetadataApplier`, which write data change events and apply schema changes (metadata changes) to the external system.
`EventSink` is a Flink sink that receives change events from the upstream operator and applies them to the external system. Currently we only support Flink's Sink V2 API.
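For illustration, a Sink V2 writer on the event sink side might look like the following skeleton. `MyEventWriter` is hypothetical, and the `DataChangeEvent` accessors mentioned in the comments (`op()`, `before()`, `after()`) are assumptions to verify against your Flink CDC version.

```java
import java.io.IOException;

import org.apache.flink.api.connector.sink2.SinkWriter;
import org.apache.flink.cdc.common.event.DataChangeEvent;
import org.apache.flink.cdc.common.event.Event;

/** Hypothetical Sink V2 writer that applies change events to the external system. */
public class MyEventWriter implements SinkWriter<Event> {

    @Override
    public void write(Event element, Context context) throws IOException {
        if (element instanceof DataChangeEvent) {
            DataChangeEvent change = (DataChangeEvent) element;
            // Translate the change into a write against the external system,
            // e.g. dispatch on change.op() and use change.before() / change.after().
        }
        // Schema changes reach the external system through the MetadataApplier
        // described below, not through this writer.
    }

    @Override
    public void flush(boolean endOfInput) throws IOException {
        // Flush buffered writes so later schema changes see consistent data.
    }

    @Override
    public void close() throws IOException {
        // Release connections and other resources.
    }
}
```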
`MetadataApplier` will be used to handle schema changes. When the framework receives a schema change event from the source, after making some internal synchronizations and flushes, it will apply the schema change to the external system via this applier.
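A matching sketch of the applier side: the single-method `MetadataApplier` interface shown here follows the Flink CDC 3.x `org.apache.flink.cdc.common.sink` package (later versions add default methods around supported schema evolution types), and the DDL mapping is hypothetical.

```java
import org.apache.flink.cdc.common.event.AddColumnEvent;
import org.apache.flink.cdc.common.event.CreateTableEvent;
import org.apache.flink.cdc.common.event.SchemaChangeEvent;
import org.apache.flink.cdc.common.sink.MetadataApplier;

/** Hypothetical applier that maps schema change events to DDL on the external system. */
public class MyMetadataApplier implements MetadataApplier {

    @Override
    public void applySchemaChange(SchemaChangeEvent schemaChangeEvent) {
        if (schemaChangeEvent instanceof CreateTableEvent) {
            // Issue CREATE TABLE against the external system.
        } else if (schemaChangeEvent instanceof AddColumnEvent) {
            // Issue ALTER TABLE ... ADD COLUMN.
        }
        // ... handle the remaining schema change kinds similarly.
    }
}
```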