Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

feat: Add the MetricsCollector for client side metrics #1566

Merged
merged 194 commits into from
Feb 24, 2025

Conversation

danieljbruce
Copy link
Contributor

@danieljbruce danieljbruce commented Jan 20, 2025

Summary:

This PR adds the MetricsCollector which the rest of the client library will use to collect client side metrics to send to GCP backend for analysis. None of the methods will use MetricsCollector yet so this PR doesn't change client library behaviour in any way. It is just in the stage of building infrastructure to support client side metrics.

Background:

Below is a diagram of the new data structures that will be added.

CSM diagram 6 drawio

Changes:

Files that will contain utilities for unit tests and system tests:

common/client-side-metrics-attributes.ts: Contains interfaces for various client side metrics attributes.

common/logger.ts: Provides abstract classes for other test classes so that those test classes can easily log test results.

common/test-date-provider:: Exports a DateProvider class used in metrics collector tests that provides a predictable series of dates so that the metrics collector class used in the tests behaves consistently.

common/test-metrics-handler.ts: Exports a metrics handler for testing that simply collects results the metrics collector records instead of sending the recorded results to the GCP backend.

Files directly pertaining to the metrics collector:

src/client-side-metrics/metrics-collector.ts: Contains the new MetricsCollector class which will be created to collect metrics each time a method is called.

src/client-side-metrics/metrics-handler.ts: The metrics handler interface that the metrics collector can use. Users can also supply their own metrics handler for the metrics collector to use.

test/metrics-collector/metrics-collector.ts: A test file for the metrics collector that introduces a fake table with a fake method that records metrics using the metrics collector and ensures the right metrics from the method are getting recorded.

test/metrics-collector/typical-method-call.ts: The test in test/metrics-collector/metrics-collector.ts compares this file with the logger output of the metrics collector and ensures that these two series of logs are exactly the same.

test/metrics-collector/metrics-handler-fixture.ts: A file containing a constant that is compared against the data the metrics handler collects in the test.

Other changes:

package.json: Add new open telemetry packages

Suggestions for how to review this PR:

It may be a good idea to review test/metrics-tracer/typical-method-call.txt first because it contains a plain-text log of what the metrics tracer is actually recording for a method call. Looking at this file may reveal details regarding when the client is collecting certain metrics without having to understand the client metrics collector architecture itself.

Next Steps:

  1. The GCPMetricHandler was removed from this PR. Introduce it in another PR, pass the metrics-handler-fixture.ts constant into it and ensure it writes to the metrics dashboard with a system test.
  2. Write some system tests using this new test infrastructure that logs metrics collected and ensure the metrics collected are what we would expect.
  3. Add client side metrics to readRows calls
  4. Add client side metrics to other calls
  5. Add client_blocking_latencies, application_blocking_latencies, the only metrics not covered yet

*/
onOperationComplete?(
metrics: OnOperationCompleteMetrics,
attributes: OnOperationCompleteAttributes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is much cleaner! Although I'm still a little unsure why we need to divide the data up between OnOperationCompleteAttributes and OnOperationCompleteMetrics though, since it seems like they're all used in the same places. Could it be simplified a bit more if you combine them? Or am I missing something?

* @param {string} projectId The id of the project.
* @param {OperationOnlyAttributes} info Information about the completed operation.
*/
onOperationComplete(projectId: string, info: OperationOnlyAttributes) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as onAttemptComplete: don't we just need the status code here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could definitely pass in the streamingOperation parameter when we create the metrics collector instead so we don't need to pass it in here. projectId is a different story, because the projectId property on the client is only populated by calling auth just before the grpc call is made so it's not available when the metrics collector gets created.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok streamingOperation has been removed.

* Called when status information is received. Extracts zone and cluster information.
* @param {object} status The received status information.
*/
onStatusReceived(status: {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused why onStatusReceived is different from onMetadataReceived. Aren't they both parsing metadata fields? (I noticed the argument is called metadata)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you create a stream in Node and make a grpc call, the stream receives status and metadata events. The shape of the status event received matches the first argument of onStatusReceived method and the shape of the metadata event received matches the first argument of onMetadataReceived.

* @param {string} projectId The id of the project.
* @param {OnAttemptCompleteInfo} info Information about the completed attempt.
*/
onAttemptComplete(projectId: string, info: OnAttemptCompleteInfo) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does a StreamingState need to be passed here? and project id? shouldn't they be known in advance?

It seems to me that the only new information here would be the status code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could definitely pass in the streamingOperation parameter when we create the metrics collector instead so we don't need to pass it in here. projectId is a different story, because the projectId property on the client is only populated by calling auth just before the grpc call is made so it's not available when the metrics collector gets created.

}

/**
* An interface representing a tabular API surface, such as a Bigtable table.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this exist here? Isn't there already a class for the TabularAPISurface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An ITabularApiSurface parameter is provided when the metrics collector is created. We don't want to require this parameter to be a whole TabularApiSurface object because only a small subset of the properties are actually used and we don't want to restrict the OperationMetricsCollector further than we have to. In particular, this simplifies the test for the metrics collector where we create a simple FakeTable object which would otherwise have to be a whole TabularApiSurface object.

/**
* An interface representing a Date-like object. Provides a `getTime` method
* for retrieving the time value in milliseconds. Used for abstracting time
* in tests.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really necessary? There's not a way to mock a stdlib Date without this extra interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer needed since we removed the date provider.

* This information is used for recording metrics.
*/
interface OnAttemptCompleteInfo {
connectivityErrorCount: number;
Copy link

@daniel-sanche daniel-sanche Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this connectivityErrorCount supposed to be here? I see it's also in OnAttemptCompleteMetrics, which seems more natural. Is this a typo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. This is a typo and OnAttemptCompleteInfo is gone now because after other corrections it only contains one property so that property is just passed directly.

/**
* Whether the operation is a streaming operation or not.
*/
streamingOperation: StreamingState;
Copy link

@daniel-sanche daniel-sanche Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this part of OnAttemptComplete? Isn't it known in advance?

My understanding is the streaming flag is basically just to check if the user called the public read_row vs read_rows method, which should be known at call time? I'm not sure if that even applies to node though, since I'm not seeing a point read method.

How were you planning on populating this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is the streaming flag is basically just to check if the user called the public read_row vs read_rows method, which should be known at call time?

That's my understanding too. This is now populated when the metrics collector gets created.

Copy link

@daniel-sanche daniel-sanche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment to consider about de-duplicating the attempt/operation data. But overall looks good!

clientName: string;
attemptStatus: grpc.status;
streamingOperation: StreamingState;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking a lot better!

One more comment though: there's a lot of duplicated fields between OnOperationCompleteData and OnAttemptCompleteData. You should consider deduplicating the data so that each field exists in just one place (some fields belong at the operation level, and some are related to the specific attempt, and it could make sense to keep them separate)

In Python, both the Attempt and its associated Operation are passed in for on_attempt_complete. And the Operation holds a list of associated attempts, which is used to calculate retry_count

But that may be less applicable here since you're building these objects on the fly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had something similar before with StandardAttributes when the metrics and attributes were separate so I decided to de-duplicate the data with a new StandardData interface and then extend it.

@danieljbruce danieljbruce merged commit d475ef2 into main Feb 24, 2025
21 of 25 checks passed
@danieljbruce danieljbruce deleted the 359913994-first-PR-add-infrastructure branch February 24, 2025 15:55
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
api: bigtable Issues related to the googleapis/nodejs-bigtable API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants