Sample to show how to process different types of Avro subjects in a single topic #98
base: main
Conversation
Many thanks for the contribution.
I would suggest some changes to make it simpler to understand, and also to avoid hinting at any not-so-good practices.
The `.gitignore` in the subfolder is not required. There is a `.gitignore` at the top level; if anything is missing there, please update that one.
* Flink API: DataStream API
* Language: Java (11)

This example demonstrates how to serialize/deserialize Avro messages in Kafka when one topic stores multiple subject types.
Explain that this is specific to the Confluent Schema Registry.
* Flink version: 1.20
* Flink API: DataStream API
* Language: Java (11)
We usually add the connectors used in the example to this list. In this case, it's also important to mention that the example uses the Avro Confluent Schema Registry format.
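For instance, the list might become something like (exact wording is just a suggestion):

* Flink version: 1.20
* Flink API: DataStream API
* Flink connectors: Kafka
* Format: Avro with Confluent Schema Registry
* Language: Java (11)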
This example uses Avro-generated classes (more details [below](#using-avro-generated-classes)).

A `KafkaSource` produces a stream of Avro data objects (`SpecificRecord`), fetching the writer's schema from AWS Glue Schema Registry. The Avro Kafka message value must have been serialized using AWS Glue Schema Registry.
I think you mean Confluent Schema Registry?
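For reference, a minimal sketch of wiring a `KafkaSource` to the Confluent Schema Registry for a single Avro-generated type (`AirQuality` and the variable names are illustrative, not the PR's actual code):

```java
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroDeserializationSchema;

// AirQuality stands in for one of the Avro-generated classes in the example
KafkaSource<AirQuality> source = KafkaSource.<AirQuality>builder()
        .setBootstrapServers(bootstrapServers)
        .setTopics(sourceTopic)
        .setStartingOffsets(OffsetsInitializer.earliest())
        // Fetches the writer's schema from Confluent Schema Registry, not AWS Glue
        .setValueOnlyDeserializer(
                ConfluentRegistryAvroDeserializationSchema.forSpecific(AirQuality.class, schemaRegistryUrl))
        .build();
```

Note this built-in deserializer handles a single record type; handling multiple subject types in one topic is exactly why the PR defines a custom deserialization schema.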
env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
env.enableCheckpointing(60000);
}
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
This is not really required: STREAMING is already the default runtime mode.
env.execute("avro-one-topic-many-subjects");
}

private static void setupAirQualityGenerator(String bootstrapServers, String sourceTopic, String schemaRegistryUrl, Map<String, Object> schemaRegistryConfig, StreamExecutionEnvironment env) {
Even though having the data generator within the same Flink app works, we deliberately avoid doing this in any of the examples, because building jobs with multiple dataflows is strongly discouraged. We avoid any bad practice in the examples, so as not to suggest it may be a good idea.
I reckon it's more complicated, but you could add a separate module with a standalone Java application which generates the data. Something similar to what we do in this example, even though in that case it's Kinesis.
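A rough sketch of what such a standalone generator module could look like (topic name, endpoints, and the `randomAirQuality()`/`randomRoomTemperature()` helpers are hypothetical):

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

// Plain Java application in its own module, completely outside the Flink job
public class DataGeneratorApp {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", KafkaAvroSerializer.class.getName());
        props.put("schema.registry.url", "http://localhost:8081");
        // One subject per record class, so multiple Avro types can share the topic
        props.put("value.subject.name.strategy",
                "io.confluent.kafka.serializers.subject.RecordNameStrategy");

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            while (true) {
                // Hypothetical helpers returning instances of the Avro-generated classes
                producer.send(new ProducerRecord<>("sensor-data", randomAirQuality()));
                producer.send(new ProducerRecord<>("sensor-data", randomRoomTemperature()));
                Thread.sleep(1000);
            }
        }
    }
}
```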
 * strategies and event time extraction. However, for those scenarios to work
 * all subjects should have a standard set of fields.
 */
class Option {
Maybe you can use `org.apache.flink.types.SerializableOptional<T>`, which comes with Flink.
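For illustration, a sketch of how that might look. It assumes the wrapped values extend `SpecificRecordBase` (which is `Serializable` via `Externalizable`, satisfying the bound on `SerializableOptional<T>`); `deserializedRecord` and `out` are hypothetical names from the surrounding deserializer:

```java
import org.apache.avro.specific.SpecificRecordBase;
import org.apache.flink.types.SerializableOptional;

// Wrap a possibly-null record instead of defining a custom Option type
SerializableOptional<SpecificRecordBase> maybeRecord =
        SerializableOptional.ofNullable(deserializedRecord);

if (maybeRecord.isPresent()) {
    out.collect(maybeRecord.get());
}
```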
}

// Custom deserialization schema for handling multiple generic Avro record types
class OptionDeserializationSchema implements KafkaRecordDeserializationSchema<Option> {
Please move this to a top-level class for readability.
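As a top-level class it could look roughly like this. This is a sketch using Confluent's `KafkaAvroDeserializer` with `specific.avro.reader` enabled, emitting `SpecificRecordBase` directly rather than the PR's `Option` wrapper, so it is not the exact implementation:

```java
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import org.apache.avro.specific.SpecificRecordBase;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.util.Map;

public class OptionDeserializationSchema
        implements KafkaRecordDeserializationSchema<SpecificRecordBase> {

    private final String schemaRegistryUrl;
    private transient KafkaAvroDeserializer deserializer;

    public OptionDeserializationSchema(String schemaRegistryUrl) {
        this.schemaRegistryUrl = schemaRegistryUrl;
    }

    @Override
    public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<SpecificRecordBase> out) {
        if (deserializer == null) {
            // Created lazily on the task manager: KafkaAvroDeserializer is not serializable
            deserializer = new KafkaAvroDeserializer();
            deserializer.configure(
                    Map.of("schema.registry.url", schemaRegistryUrl,
                            "specific.avro.reader", true),
                    false /* not the key deserializer */);
        }
        // The registry provides the writer's schema; specific.avro.reader=true
        // yields instances of the Avro-generated classes
        out.collect((SpecificRecordBase) deserializer.deserialize(record.topic(), record.value()));
    }

    @Override
    public TypeInformation<SpecificRecordBase> getProducedType() {
        return TypeInformation.of(SpecificRecordBase.class);
    }
}
```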
}
}

class RecordNameSerializer<T> implements KafkaRecordSerializationSchema<T>
Move this to a top-level class as well.
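Along the same lines, a hedged sketch of the serializer as a top-level class, assuming Confluent's `KafkaAvroSerializer` with `RecordNameStrategy` (again, not the PR's exact code):

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.specific.SpecificRecordBase;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Map;

public class RecordNameSerializer<T extends SpecificRecordBase>
        implements KafkaRecordSerializationSchema<T> {

    private final String topic;
    private final String schemaRegistryUrl;
    private transient KafkaAvroSerializer serializer;

    public RecordNameSerializer(String topic, String schemaRegistryUrl) {
        this.topic = topic;
        this.schemaRegistryUrl = schemaRegistryUrl;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(T element, KafkaSinkContext context, Long timestamp) {
        if (serializer == null) {
            serializer = new KafkaAvroSerializer();
            serializer.configure(
                    Map.of("schema.registry.url", schemaRegistryUrl,
                            // RecordNameStrategy registers one subject per record class
                            "value.subject.name.strategy",
                            "io.confluent.kafka.serializers.subject.RecordNameStrategy"),
                    false /* not the key serializer */);
        }
        return new ProducerRecord<>(topic, serializer.serialize(topic, element));
    }
}
```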
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);

Properties applicationProperties = loadApplicationProperties(env).get(APPLICATION_CONFIG_GROUP);
String bootstrapServers = Preconditions.checkNotNull(applicationProperties.getProperty("bootstrap.servers"), "bootstrap.servers not defined");
The code building the dataflow is a bit hard to follow. I would suggest doing what we tend to do in other examples:

- In the runtime configuration, use a PropertyGroup for each source and sink, even if some configurations are repeated.
- Instantiate each Source and Sink in a local method, passing the `Properties` object which contains all the configuration for that specific component. Extract specific properties, like the topic name, within the method rather than directly in `main()`.
- Build the dataflow by just attaching the operators one after the other, using intermediate stream variables only when it helps readability.
- Avoid having methods that attach operators to the dataflow. Practically, any method which expects a `DataStream` or `StreamExecutionEnvironment` as a parameter should be avoided.
- If an operator implementation, like a map or a filter, is simple, try using a lambda and inlining it. If the operator implementation is complex, externalize the implementation to a separate class.

See examples here. We are not following these patterns in all examples, but we are trying to converge as much as possible. A sketch of the suggested `main()` shape follows below.
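To make the suggestion concrete, a sketch of that shape. The property group names and the `createSource()`/`createSink()` helpers are illustrative; `loadApplicationProperties()` is the helper already present in the PR:

```java
import java.util.Map;
import java.util.Properties;

import org.apache.avro.specific.SpecificRecordBase;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AvroOneTopicManySubjectsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Map<String, Properties> applicationProperties = loadApplicationProperties(env);

        // Each component is created in a local method from its own property group;
        // specifics like topic names are extracted inside createSource()/createSink()
        KafkaSource<SpecificRecordBase> source = createSource(applicationProperties.get("InputKafka"));
        KafkaSink<SpecificRecordBase> sink = createSink(applicationProperties.get("OutputKafka"));

        // Attach operators one after the other; simple logic stays inline as a lambda
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka source")
                .filter(record -> record != null)
                .sinkTo(sink);

        env.execute("avro-one-topic-many-subjects");
    }

    // createSource(Properties) and createSink(Properties) are the hypothetical local
    // methods described in the bullets above; their bodies are omitted here.
}
```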