feat: Add Kafka Connect Cloud Bigtable sink connector #2466

Status: Open. 76 commits to be merged into base branch main.

prawilny

This PR adds a Kafka Connect sink connector for Cloud Bigtable.

The code is meant to land in a different repository, but that repository hasn't been created yet, so we are bringing the code here for early review.
Because it targets another repo:

  • the GitHub Actions CI is modified so that it runs the sink's tests, instead of the sink being added as a new Maven submodule and plugged into the existing CI
  • the code is not a submodule in the root pom.xml nor does it use any information from outside of the directory
  • maven plugins' config is duplicated
  • there is another copy of the license in the directory

Things yet to be done (in future PRs):

  • Logical types support
  • More comprehensive integration tests
    • they might include more detailed compatibility checks against the Confluent sink
  • Use of kokoro in CI

@prawilny prawilny requested review from a team as code owners January 12, 2025 14:07

google-cla bot commented Jan 12, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. api: bigtable Issues related to the googleapis/java-bigtable API. labels Jan 12, 2025

Warning: This pull request is touching the following templated files:

  • .github/workflows/ci.yaml

@prawilny prawilny force-pushed the kafka_connect_bigtable_sink branch from e2f0361 to 40ae074 Compare January 13, 2025 11:20

@brandtnewton brandtnewton left a comment

This is shaping up nicely! All my comments are pretty minor. I did not review the Integration tests yet. I'll get back to you soon on how to handle logical types.

@prawilny
Author

@brandtnewton
I didn't resolve any of the conversations since I'm used to the reviewer doing that.

I also have a question: how do you want to review further commits (mainly logical types support, some more integration tests, and some minor tweaks throughout the codebase)? In this PR? In a new one? Or maybe do you want to create a new repository and have the PR(s) sent there? Please let me know.

@brandtnewton

@prawilny you can just add those commits to this PR. GitHub makes it easy to see only the changes since my last review

@brandtnewton brandtnewton left a comment

Thanks for all the changes! Feel free to add more changes here or in another PR, whatever is easier for you. See comments:

@prawilny
Author

@brandtnewton, please take another look.
The changes are mostly in the performance test setup and the compatibility tests, but a few lines (documentation strings) in the library code changed as well.

@prawilny
Author

@brandtnewton, please take a look.
Just about everything is done now. The main missing piece seems to be the CI config, but that is waiting on the creation of the new repository.

@brandtnewton brandtnewton left a comment

There might be some junk files in here but otherwise lgtm

@@ -0,0 +1,69 @@
diff --git a/google-cloud-bigtable-kafka-connect-sink/src/main/java/com/google/cloud/kafka/connect/bigtable/BigtableSinkTask.java b/google-cloud-bigtable-kafka-connect-sink/src/main/java/com/google/cloud/kafka/connect/bigtable/BigtableSinkTask.java
index 80bea508..3dcf2b27 100644
--- a/google-cloud-bigtable-kafka-connect-sink/src/main/java/com/google/cloud/kafka/connect/bigtable/BigtableSinkTask.java

is this file intentional?

Author

@prawilny prawilny Apr 1, 2025

Yes, see this section in the performance tests' README (google-cloud-bigtable-kafka-connect-sink/doc/performance/README.md):

### Tracing code
Tracing code is not present in the release code.
The code of tracing implementation is available in [`MessageTracer.java`](MessageTracer.java).
To enable tracing, put it into `sink/src/main/java/com/google/cloud/kafka/connect/bigtable/tracing` and apply [`enable_tracing.patch` git patch](enable_tracing.patch).

I think it is better to store the diff rather than the whole file, since a diff can be applied more easily to a changed file in the future to rerun the tests. If you prefer another approach, I'll change it.

@brandtnewton brandtnewton left a comment

Please add this extra header so we can monitor usage. It goes at around line 538.

import com.google.api.gax.rpc.FixedHeaderProvider;
...
    dataStubSettings.setHeaderProvider(
        FixedHeaderProvider.create("user-agent", "bigtable-java/kafka-connect-sink/" + PackageMetadata.getVersion()));

@prawilny
Author

prawilny commented Apr 1, 2025

I added the header to both the Admin and Data clients; hopefully that wasn't overzealous.

In one of your previous comments you mentioned "some junk files".
Let me explain the categories of not-strictly-necessary files, so that you can point to the reasoning you disagree with and I can remove the junk.

There are symlinks present in the integration-tests directory.
Some of them (log4j.properties, fake_service_key.json) are just resources.
Some of them (the config enums, *Config classes, PackageMetadata, the exception(s)) contain strings we want to use in the integration tests (be it "gcp.bigtable.project.id" or "com.google.cloud.kafka.connect.bigtable.exception.InvalidBigtableSchemaModificationException").
Some of them (JsonConverterFactory, NestedNullStructFactory, client constructors in BigtableSinkConfig) are helpers whose code is needed by both the integration tests and either release code or the unit tests.
The alternatives I came up with were:

  • the current symlink solution,
    • might make it impossible to run the integration tests on Windows,
    • does not touch the release code,
  • copypaste of the code,
    • changes to anything need to be manually propagated,
  • some additional maven module containing all the common code that both sink and integration-test modules would depend on,
    • complicates maven structure,
  • some kind of generation of the strings in a single place and populating the source files during the compilation
    • does not solve the problem with helpers,
    • risky since it touches the release code,

I think the symlinks are a surprisingly clean solution to this problem: we don't really care about running the integration tests on Windows, and the other solutions are more complicated in my opinion.
What do you think?
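
For readers unfamiliar with the symlink approach, here is a minimal sketch of the idea using only the JDK. It is not code from this PR: the class `SharedSourceLink`, the method `linkShared`, and the file names are hypothetical. It shows one file being exposed in a second directory through a relative symlink, and the failure mode that makes the approach fragile on Windows.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystemException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of the PR): shares one source file between two directories. */
public class SharedSourceLink {

  /**
   * Creates a relative symlink at {@code link} that points at {@code target}.
   * Both paths are assumed to be absolute. Returns false if the link could not
   * be created, e.g. on a file system without symlink support, on Windows
   * without the required privilege, or if {@code link} already exists.
   */
  public static boolean linkShared(Path target, Path link) throws IOException {
    Files.createDirectories(link.getParent());
    // A relative target keeps the link valid when the whole tree is moved or cloned elsewhere.
    Path relativeTarget = link.getParent().relativize(target);
    try {
      Files.createSymbolicLink(link, relativeTarget);
      return true;
    } catch (UnsupportedOperationException | FileSystemException e) {
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("symlink-demo");
    Path original = root.resolve("sink/Config.java");
    Files.createDirectories(original.getParent());
    Files.write(original, "class Config {}".getBytes(StandardCharsets.UTF_8));

    Path link = root.resolve("integration-tests/Config.java");
    if (linkShared(original, link)) {
      // Reading through the link yields the original's bytes, so only one copy is maintained.
      System.out.println(new String(Files.readAllBytes(link), StandardCharsets.UTF_8));
    } else {
      System.out.println("Symlinks are not available in this environment.");
    }
  }
}
```

The `false` branch is the Windows caveat from the list above: where symlink creation is refused, the second directory simply has no copy of the file.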

There are k8s manifests present in the doc/performance directory.
We could possibly put them elsewhere, but I couldn't come up with anything better than the repo.
Do you have a better idea?

I think all the other (git ls-files | grep -v \.java) files are either:

  • code license, READMEs, .gitignores and other well-known files
  • maven (and its plugins') configs (pom.xmls, sink/src/main/assembly/package.xml, license.header)
  • performance test documentation (everything in doc/performance)
  • test resources (in */src/test/resources)
    • I rechecked in ConfluentCompatibilityIT.java that all the files from integration-tests/src/test/resources/compatibility_test_cases are used.
  • Java ServiceLoader provider file (sink/src/main/resources/META-INF/services/org.apache.kafka.connect.sink.SinkConnector)
  • the symlinks described above
    • I rechecked that all the symlinks are needed now (in the process I found a single unused file, which I deleted).
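
An inventory like the one above can also be approximated without git. The sketch below is an illustration only, not code from this PR, and the class name `NonJavaInventory` is hypothetical. Unlike `git ls-files`, it walks the file system, so it also reports untracked files and anything under `.git`, and it matches on the `.java` suffix rather than on a grep pattern.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not part of the PR): lists entries that are not Java sources. */
public class NonJavaInventory {

  /**
   * Returns the sorted, '/'-separated relative paths of every non-directory
   * entry under {@code root} whose name does not end in ".java". Symlinks are
   * reported as entries and are not followed.
   */
  public static List<String> nonJavaFiles(Path root) throws IOException {
    try (Stream<Path> paths = Files.walk(root)) {
      return paths
          .filter(p -> !Files.isDirectory(p, LinkOption.NOFOLLOW_LINKS))
          .filter(p -> !p.getFileName().toString().endsWith(".java"))
          .map(p -> root.relativize(p).toString().replace('\\', '/'))
          .sorted()
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    // Defaults to the current directory when no argument is given.
    Path root = Paths.get(args.length > 0 ? args[0] : ".");
    nonJavaFiles(root).forEach(System.out::println);
  }
}
```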

So can I get an approval of this PR without any caveats, or do I still need to do something? The Kokoro config is still missing, but that will only happen in the new repo, right?

@prawilny prawilny requested a review from brandtnewton April 1, 2025 13:28
@brandtnewton brandtnewton left a comment

LGTM!

@brandtnewton

@prawilny gotcha, I didn't realize that those were symlinked files
