Brave 5.4

@codefromthecrypt codefromthecrypt released this 27 Sep 14:51

Brave 5.4 notably introduces FinishedSpanHandler which can listen on all traced operations and mutate data. This enables features such as metrics aggregation, multiple rate or adaptive sampling, data cleansing and "firehose" export. FinishedSpanHandler was a direct result of features and use cases discussed at Zipkin workshops.

Thanks to @devinsba, @jcchavezs and @shakuzen for reviewing early versions of the code. As always, you can ask about this on our gitter channel.

Handling Finished Spans

By default, data recorded before a span finishes (Span.finish()) is
reported to Zipkin via what's passed to Tracing.Builder.spanReporter.
FinishedSpanHandler
can modify or drop data before it goes to Zipkin. It can even intercept
data that is not sampled for Zipkin.

FinishedSpanHandler can return false to drop spans that you never want
to see in Zipkin. Use this carefully: dropped spans should not have
children, as those children would refer to a parent that never arrives.

Here's an example of dropping SQL COMMENT spans so they don't clutter Zipkin.

tracingBuilder.addFinishedSpanHandler(new FinishedSpanHandler() {
  @Override public boolean handle(TraceContext context, MutableSpan span) {
    return !"comment".equals(span.name());
  }
});

Another example is redaction: you may need to scrub tags to ensure no
personal information like social security numbers end up in Zipkin.

tracingBuilder.addFinishedSpanHandler(new FinishedSpanHandler() {
  @Override public boolean handle(TraceContext context, MutableSpan span) {
    span.forEachTag((key, value) ->
      value.replaceAll("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}", "xxx-xx-xxxx")
    );
    return true; // retain the span
  }
});

An example of redaction is here
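
To check what the redaction pattern above matches without wiring up a tracer, here is a minimal standalone sketch. It assumes plain Java maps as a stand-in for span tags; the class name TagRedactionSketch and the sample tag values are made up for illustration and are not part of Brave.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone sketch: a plain map stands in for span tags; no Brave classes are used.
final class TagRedactionSketch {
  // Same pattern as the handler above: three digits, two digits, four digits.
  static final String SSN_PATTERN = "[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}";

  /** Returns a copy of the tags with anything shaped like a social security number masked. */
  static Map<String, String> redact(Map<String, String> tags) {
    Map<String, String> result = new LinkedHashMap<>();
    for (Map.Entry<String, String> tag : tags.entrySet()) {
      result.put(tag.getKey(), tag.getValue().replaceAll(SSN_PATTERN, "xxx-xx-xxxx"));
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> tags = new LinkedHashMap<>();
    tags.put("sql.query", "select * from users where ssn = '123-45-6789'");
    tags.put("http.path", "/users"); // no match, so this value is left untouched
    System.out.println(redact(tags));
  }
}
```

Note that the pattern only catches the dashed form. Values written without dashes would pass through, so a real deployment would likely need more than one pattern.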

Sampling locally

While Brave defaults to reporting 100% of data to Zipkin, many will use
a lower percentage like 1%. This is called sampling, and the decision is
maintained consistently throughout the trace, across requests. Sampling
has disadvantages. For example, statistics on sampled data are usually
misleading by nature of not observing all durations.

FinishedSpanHandler.alwaysSampleLocal() indicates whether the handler
should see all data, or just the data sent to Zipkin. You can override
it to return true to observe all operations.

Here's an example of metrics handling:

tracingBuilder.addFinishedSpanHandler(new FinishedSpanHandler() {
  @Override public boolean alwaysSampleLocal() {
    return true; // since we want to always see timestamps, we have to always record
  }

  @Override public boolean handle(TraceContext context, MutableSpan span) {
    if (namesToAlwaysTime.contains(span.name())) {
      registry.timer("spans", "name", span.name())
          .record(span.finishTimestamp() - span.startTimestamp(), MICROSECONDS);
    }
    return true; // retain the span
  }
});

An example of metrics handling is here

Future work

Larger sites often have multiple tracing systems which go at different rates. For example, 100% for edge one hop down and 1% for all traffic. These sites have large service graphs and so can't justify the cost of doing 100% data collection in the same way they handle the 1%. Some, like Yelp, use a "firehose" handler which always samples regardless of what's in headers, directing that to a custom, non-indexed Cassandra keyspace. Netflix have a need to propagate multiple sampling decisions, or triggers for decisions.

The combination of local sampling flags and FinishedSpanHandler allows all of these use cases to exist. While we don't have a plugin to perform this out-of-box at the moment, we do have a proof of concept test here. If you are interested, follow the Zipkin wiki for updates, notably those named "firehose", or join our gitter channel.

Prior art

The following discusses the influence of OpenCensus, a telemetry project that includes tracing and Zipkin support. This is to honor the project, not to say we are better than it: Brave simply has different feature drivers, which result in some hooks being more abstract. We thank the Census project for its efforts in tracing library design.

Local Sampling flag

The Census project has a concept of a SampledSpanStore. Typically, you configure a span name pattern or choose individual spans for local (in-process) storage. This storage is used to power administrative pages named zPages. For example, Tracez displays spans sampled locally which have errors or crossed a latency threshold.

The "sample local" vocab in Census was re-used in Brave to describe the act of keeping data that isn't necessarily going to end up in Zipkin.

The main difference in Brave is implementation. For example, the sampledLocal flag in Brave is a part of the TraceContext and so can be triggered when headers are parsed. This is needed to provide custom sampling mechanisms. Also, Brave has a small core library and doesn't package any specific storage abstraction, admin pages or latency processors. As such, we can't define specifically what sampled local means as Census can. All it means is that FinishedSpanHandler will see data for that trace context.
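
The interplay between the two sampling decisions can be pictured with a simplified model. This is a sketch under stated assumptions, not Brave's implementation: Context, Handler and LocalSamplingSketch are hypothetical stand-ins written for illustration, and the dispatch order is an assumption based on the description above (handlers see anything sampled remotely or locally, and only remotely sampled data continues on to Zipkin).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model only: these types are hypothetical stand-ins, not Brave classes.
final class LocalSamplingSketch {
  /** Stand-in for a trace context that carries two independent sampling decisions. */
  static final class Context {
    final boolean sampled;      // assumption: true means "report to Zipkin"
    final boolean sampledLocal; // assumption: true means "local handlers want this"

    Context(boolean sampled, boolean sampledLocal) {
      this.sampled = sampled;
      this.sampledLocal = sampledLocal;
    }
  }

  /** Stand-in for a finished span handler: return false to drop the span. */
  interface Handler {
    boolean handle(Context context, String spanName);
  }

  final List<Handler> handlers = new ArrayList<>();
  final List<String> reportedToZipkin = new ArrayList<>();

  void finish(Context context, String spanName) {
    // Neither decision is set: nothing was recorded, so nobody sees the span.
    if (!context.sampled && !context.sampledLocal) return;

    for (Handler handler : handlers) {
      if (!handler.handle(context, spanName)) return; // a handler dropped the span
    }

    // Only remotely sampled spans continue on to the reporter.
    if (context.sampled) reportedToZipkin.add(spanName);
  }

  public static void main(String[] args) {
    LocalSamplingSketch tracer = new LocalSamplingSketch();
    List<String> seenLocally = new ArrayList<>();
    tracer.handlers.add((context, name) -> {
      seenLocally.add(name);
      return true; // retain the span
    });

    tracer.finish(new Context(true, false), "sampled-remotely");
    tracer.finish(new Context(false, true), "sampled-locally-only");
    tracer.finish(new Context(false, false), "not-sampled");

    System.out.println("handlers saw: " + seenLocally);
    System.out.println("reported: " + tracer.reportedToZipkin);
  }
}
```

In this model, a span that is only sampled locally reaches the handlers but never the reporter, which is the property a metrics or "firehose" handler relies on.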

FinishedSpanHandler API

Brave had for a long time re-used Zipkin's Reporter library, which is like Census' SpanExporter insofar as they both allow pushing a specific format elsewhere, usually to a service. Brave's FinishedSpanHandler is a little more like Census' SpanExporter.Handler insofar as the structure includes the trace context, something we need access to in order to do things like advanced sampling.

FinishedSpanHandler takes a pair of (TraceContext, MutableSpan). MutableSpan in Brave is, well, mutable, and this allows you to cheaply change data. Also, it isn't a struct based on a proto, so it can hold references to objects like Exception. This allows us to render data into different formats such as Amazon's stack frames without having to guess what will be needed by parsing up front. As error parsing is deferred, overhead is lower in cases where errors are not recorded (as is the case for spans intentionally dropped).
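
The benefit of deferring error parsing can be illustrated with a small standalone sketch. The SpanData class and the export method below are hypothetical and written only for this example; they are not Brave types. The point is that the Throwable is kept as an object reference and is only rendered to text if the span is actually exported.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical illustration of deferred error rendering; not Brave's MutableSpan.
final class DeferredErrorSketch {
  static final class SpanData {
    String name;
    Throwable error; // held as an object, not as pre-rendered text
  }

  static int renderCount = 0; // counts how often the costly rendering ran

  /** Renders a stack trace to text. This is the work we want to avoid for dropped spans. */
  static String renderError(Throwable error) {
    renderCount++;
    StringWriter buffer = new StringWriter();
    PrintWriter writer = new PrintWriter(buffer);
    error.printStackTrace(writer);
    writer.flush();
    return buffer.toString();
  }

  /** Returns the exported text, or null when the span is dropped. */
  static String export(SpanData span, boolean keep) {
    if (!keep) return null; // dropped: the error is never rendered
    String result = span.name;
    if (span.error != null) result += "\n" + renderError(span.error);
    return result;
  }

  public static void main(String[] args) {
    SpanData span = new SpanData();
    span.name = "get";
    span.error = new IllegalStateException("boom");

    export(span, false); // dropped, nothing rendered
    System.out.println("renders after drop: " + renderCount);

    export(span, true); // kept, rendered once at export time
    System.out.println("renders after export: " + renderCount);
  }
}
```

Because rendering happens at export time, a different exporter could format the same Throwable differently without the recording side knowing in advance.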