-
Notifications
You must be signed in to change notification settings - Fork 138
Brooklin Architecture
Brooklin is a Java server application that is typically deployed to a cluster of machines. Each machine can host one or more instances of Brooklin, each of which offers the exact same set of capabilities.
- The most fundamental concept in Brooklin is a
Datastream
. - A
Datastream
is a description of a data pipe between 2 systems; a source system from which data is consumed and a destination system to which this data is produced. - Brooklin allows you to create as many
Datastreams
as you need to set up different data pipes between various source and destination systems. - To support high scalability, Brooklin expects the data streamed between source and destination systems to be partitioned. If the data is not partitioned, it is treated as a stream with a single partition.
- Likewise, Brooklin breaks every partitioned
Datastream
into multipleDatastreamTasks
, each of which is limited to a subset of the total partitions, that are all processed concurrently for higher throughput.
-
Connector
is the abstraction representing modules that consume data from source systems. - Different
Connector
implementations can be written to support consuming data from different source systems. - To support producing data to different destinations,
Connectors
employ a different abstraction —TransportProviders
. - An example
Connector
implementation isKafkaConnector
, which is intended for consuming data from Kafka.
-
TransportProvider
is the abstraction representing modules that produce data to destination systems. - Different
TransportProvider
implementations can be written to support producing data to different source systems. - An example
TransportProvider
implementation isKafkaTransportProvider
, which is intended for producing data to Kafka.
- Brooklin
Coordinator
is the module responsible for managing the differentConnector
implementations. - There is only a single
Coordinator
object in every Brooklin server app instance. - A
Coordinator
can either be a leader or non-leader. - In a Brooklin cluster, only one
Coordinator
is designated as leader while the rest are non-leaders. - Brooklin employs the ZooKeeper election recipe for electing the leader
Coordinator
. - In addition to managing
Connectors
, the leaderCoordinator
is responsible for dividing the work among all theCoordinators
in the cluster (including itself) by assigningDatastreamTasks
to them. - The leader
Coordinator
can be configured to doDatastreamTask
assignment using different strategies (implementations ofAssignmentStrategy
). - An example
AssignmentStrategy
offered by Brooklin is theLoadbalancingStrategy
, which causes the leaderCoordinator
to evenly distribute all the availableDatastreamTasks
across allCoordinator
instances.
- Brooklin server application is typically deployed to one or more machines, all using ZooKeeper as the source of truth for
Datastream
andDatastreamTask
metadata. - Information about the different instances of Brooklin server app as well as their
DatastreamTask
assignments is also stored in ZooKeeper. - Every Brooklin instance exposes a REST endpoint — aka
Datastream Management Service (DMS)
— that offers CRUD operations onDatastreams
over HTTP.
A good way to understand the architecture of Brooklin is to go through the workflow of creating a new Datastream
.
The figure below illustrates the main steps involved in Datastream
creation.
-
A Brooklin client sends a
Datastream
creation request to a Brooklin cluster. -
The request is routed to the
Datastream Management Service
REST endpoint of any instance of the Brooklin server app. -
The
Datastream
data is verified and written to ZooKeeper under a certain znode that the leaderCoordinator
is watching for changes. -
The leader
Coordinator
gets notified of the newDatastream
znode creation. -
The leader
Coordinator
reads the metadata of the newly createdDatastream
and breaks it down into one or moreDatastreamTasks
. It also uses theAssignmentStrategy
of theConnector
specified in theDatastream
to assign the differentDatastreamTasks
to the availableCoordinator
instances, by writing the breakdown and assignment info to ZooKeeper. -
The affected
Coordinators
get notified of the newDatastreamTask
assignments created under their respective znodes, which they read and process immediately using the relevantConnectors
(i.e. consume from source, produce to destination).
In addition to leader Coordinator
election, Brooklin maintains the following pieces of info in ZooKeeper:
- Registered
Connector
types -
Datastream
andDatastreamTasks
metadata -
DatastreamTask
progress information (e.g. offsets/checkpoints) - Brooklin app instances
-
DatastreamTask
assignments of Brooklin app instances
The figure below illustrates the overall structure of the data Brooklin maintains in ZooKeeper.
- Each Brooklin cluster is mapped to one top-level znode.
-
/dms
is where the definitions of individualDatastreams
are presisted (in JSON) -
/connectors
- A sub-znode for every registered
Connector
type in this cluster is created under this znode. -
/connectors/<connector-type>/
is where the definitions of all theDatastreamTasks
handled by thisConnector
type are located, one znode perDatastreamTask
. - There are two child nodes under every
/connectors/<connector-type>/<datastreamtask>
znode:config
andstate
. -
config
stores the definitions of theDatastreamTasks
-
state
stores the progress info of theConnectors
(offsets/checkpoints)
- A sub-znode for every registered
-
/liveinstances
- This is where the different Brooklin instances create ephemeral znodes with incremental sequence numbers (created using the EPHEMERAL_SEQUENTIAL mode) for the purposes of leader
Coordinator
election. - The value associated with each sub-znode is the hostname of the corresponding Brooklin instance.
- This is where the different Brooklin instances create ephemeral znodes with incremental sequence numbers (created using the EPHEMERAL_SEQUENTIAL mode) for the purposes of leader
-
/instances
- This is where every Brooklin instance creates a persistent znode for itself. Each znode has a name composed of the concatenation of the instance hostname and its unique sequence number under
/liveinstances
. - Two sub-znodes are created under each Brooklin instance znode —
assignments
anderrors
. -
assignments
contains the names of all theDatastreamTasks
assigned to this Brooklin instance. -
errors
is where messages about errors encountered by this Brooklin instance are persisted.
- This is where every Brooklin instance creates a persistent znode for itself. Each znode has a name composed of the concatenation of the instance hostname and its unique sequence number under
- The leader
Coordinator
watches the/dms
znode for changes in order to get notified whenDatastreams
are created, deleted, or altered. - Leader or not, every
Coorindator
watches its corresponding znode under/instances
in order to get notified when changes are made to itsassignments
child znode. - The leader
Coordinator
assignsDatastreamTasks
to individual Brooklin instances by changing theassignment
znode under/instances/<instance-name>
.
- Home
- Brooklin Architecture
- Production Use Cases
- Developer Guide
- Documentation
- REST Endpoints
- Connectors
- Transport Providers
- Brooklin Configuration
- Test Driving Brooklin