Releases: openzipkin/brave
5.6.6
Brave 5.6.5
Brave 5.6.5 is a patch release, but there are some interesting updates since 5.6.3:
- RateLimitingSampler is safer: some edge cases are fixed, thanks to @pburm and @devinsba
- Tracer.currentSpanCustomizer() is lazier, which helps reduce overhead when it is never used. Thanks to @lambcode for investigation and work towards this
- kafka-client and kafka-streams now work with Kafka 2.2. Thanks to @jeqo for the compat work!
- kafka-streams now instruments filter. Thanks to @jorgheymans for the hard work and @jeqo for review!
- We exposed KafkaStreamsTracing.kafkaClientSupplier(). Thanks @neetkee for raising the concern and @jeqo for handling it
- brave-instrumentation-mysql6 instrumentation is deprecated for brave-instrumentation-mysql8
Note: 5.6.4 had a regression, so don't use it!
Brave 5.6
Brave 5.6 adds a rate-limited sampler, Local Root concept, Java Flight Recorder context correlation and improves Kafka Streams instrumentation.
Rate Limited Sampler
A rate per second can be a nice choice for low-traffic endpoints as it gives you surge protection. For example, you may never expect the endpoint to get more than 50 requests per second. If there was a sudden surge of traffic, to 5000 requests per second, you would still end up with 50 traces per second. Conversely, if you had a percentage, like 10%, the same surge would end up with 500 traces per second, possibly overloading your storage. Amazon X-Ray includes a rate-limited sampler (named Reservoir) for this purpose. We've taken the same approach with our RateLimitingSampler.
Here's how to limit all tracing to 10 per second.
tracingBuilder.sampler(RateLimitingSampler.create(10));
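To make the surge arithmetic above concrete, here is a minimal sketch in plain Java. It is not Brave's RateLimitingSampler implementation: the class FixedWindowSampler and its methods are hypothetical, and it assumes a naive fixed one-second window purely to compare a rate limit against a percentage.

```java
// Minimal sketch, NOT Brave's RateLimitingSampler: a fixed one-second window
// that admits at most `tracesPerSecond` requests. All names are hypothetical.
final class FixedWindowSampler {
  private static final long NANOS_PER_SECOND = 1_000_000_000L;

  private final int tracesPerSecond;
  private boolean started;
  private long windowStartNanos;
  private int usedInWindow;

  FixedWindowSampler(int tracesPerSecond) {
    this.tracesPerSecond = tracesPerSecond;
  }

  /** Returns true if the request arriving at {@code nowNanos} should be traced. */
  synchronized boolean isSampled(long nowNanos) {
    // Open a new window on first use, or once a full second has elapsed.
    if (!started || nowNanos - windowStartNanos >= NANOS_PER_SECOND) {
      started = true;
      windowStartNanos = nowNanos;
      usedInWindow = 0;
    }
    if (usedInWindow >= tracesPerSecond) return false;
    usedInWindow++;
    return true;
  }

  /** Counts traces kept when {@code requests} arrive evenly spread within one second. */
  static int keptInOneSecond(int requests, int tracesPerSecond) {
    FixedWindowSampler sampler = new FixedWindowSampler(tracesPerSecond);
    int kept = 0;
    for (int i = 0; i < requests; i++) {
      long nowNanos = i * (NANOS_PER_SECOND / requests); // always below one second
      if (sampler.isSampled(nowNanos)) kept++;
    }
    return kept;
  }

  /** Traces kept by a percentage-based sampler, for comparison. */
  static int keptByPercentage(int requests, int percent) {
    return requests * percent / 100;
  }

  public static void main(String[] args) {
    System.out.println(keptInOneSecond(5000, 50));  // rate limit during the surge
    System.out.println(keptByPercentage(5000, 10)); // 10% during the same surge
  }
}
```

A fixed window like this admits the first requests of each second and rejects the rest, so it makes no attempt at the fairness across large request volumes mentioned in the thanks for this feature.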
Thanks @devinsba for spiking the implementation and @anuraaga for suggesting how to be more fair when sampling large numbers of requests. Also appreciate review from @huydx and @zeagord
Kafka Streams Instrumentation
KafkaStreamsTracing now encapsulates the most common operations (e.g. map, mapValues, foreach and peek) with Stream Transformers and Processors. A common scenario is to mark the beginning and end of a step (or set of steps) in a stream process.
StreamsBuilder builder = new StreamsBuilder();
builder.stream(inputTopic)
       .transform(kafkaStreamsTracing.mark("beginning-complex-map"))
       .map(complexTransformation1)
       .filter(predicate)
       .map(complexTransformation2)
       .transform(kafkaStreamsTracing.mark("end-complex-transformation"))
       .to(outputTopic);
Many thanks to our Kafka Streams champions @jeqo @ImFlog and @artemyarulin!
Local Root
Local Root is an advanced concept useful for tool developers. Most end users will not use this feature, so feel free to skip this section.
A root span is the first span in a trace tree. There is also value knowing the subtrees representing work in a host. For example, a trace could start from a message consumption and result in 5 RPC requests. The root of that tree up until the client side of the RPC requests would be a local tree on one host. Each server that responds to the client requests would also have local trees until their work is complete. In other words, a distributed trace tree is a collection of smaller trees connected by network requests. Those smaller trees are "local trees" and have a "local root" that has similar properties within that host to the root of the trace.
By introducing a context field localRootId, we are able to process these local trees to achieve some pretty interesting results. For example, you can use this ID to squash intermediate spans that never left the process. You can use this to partition data so that the whole sub-tree reports at the same time. Probably the easiest example is tagging only once per hop.
Here's an example of using the local root ID to ensure environment constant tags are only added once per sub-tree: a cost-saving technique.
// while initialising the tracing component
--snip--
  .addFinishedSpanHandler(new FinishedSpanHandler() {
    @Override public boolean handle(TraceContext context, MutableSpan span) {
      if (context.isLocalRoot()) {
        // pretend these are sourced from the environment
        span.tag("env", "prod");
        span.tag("region", "east");
      }
      return true;
    }
  })
--snip--
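The partitioning idea mentioned earlier (reporting a whole sub-tree together) can be sketched in plain Java. This is only an illustration under assumptions: SpanRef and groupByLocalRoot are hypothetical names, and a real TraceContext carries much more than these two IDs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration; not part of Brave's api.
final class LocalTreeSketch {
  static final class SpanRef {
    final long spanId;
    final long localRootId;

    SpanRef(long spanId, long localRootId) {
      this.spanId = spanId;
      this.localRootId = localRootId;
    }
  }

  /** Groups span IDs by the local root they belong to, preserving arrival order. */
  static Map<Long, List<Long>> groupByLocalRoot(List<SpanRef> finished) {
    Map<Long, List<Long>> trees = new LinkedHashMap<>();
    for (SpanRef span : finished) {
      trees.computeIfAbsent(span.localRootId, id -> new ArrayList<>()).add(span.spanId);
    }
    return trees;
  }

  public static void main(String[] args) {
    List<SpanRef> finished = new ArrayList<>();
    finished.add(new SpanRef(1L, 1L)); // local root of the first local tree
    finished.add(new SpanRef(2L, 1L)); // child that never left the process
    finished.add(new SpanRef(4L, 4L)); // local root of a second local tree
    finished.add(new SpanRef(5L, 4L));
    System.out.println(groupByLocalRoot(finished));
  }
}
```

Each map entry is one local tree, so a reporter built this way could flush an entry as a unit once its local root finishes.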
It is understood this is an advanced feature. If it sparks any ideas, please reach out to us on gitter to collaborate more. Thanks very much to @felixbarny @basvanbeek @drolando @wu-sheng @llinder and @zeagord who gave feedback leading to this feature.
Java Flight Recorder
The Java Flight Recorder was added to OpenJDK 11 and provides a low-overhead data collection framework for troubleshooting applications. People analyse these recordings in JDK Mission Control 7.
One way to cross-correlate with Flight Recorder is to re-use the same approach we do with logging systems: basically staining trace identifiers with events of interest. Starting with Brave 5.6, if you are running JRE 11, you can easily integrate Zipkin Scope events by adding JfrScopeDecorator to your tracing setup.
We found out how to integrate trace IDs through prior art from @thegreystone (the JMC lead) who maintains a contrib package in the OpenTracing project. While our implementation differs from that, his work showed exactly what's needed and made ours easier.
Other stuff
- We added caching methods to TraceContext:
  - parentIdString
  - localRootIdString
  - spanIdString
- @tutufool fixed a bug which prevented accidentally abandoned spans from being reported to Zipkin
Brave 5.5
Brave 5.5 notably introduces Kafka Streams instrumentation, allowing traces to continue through processing stages like .flatMap(). It also inherits a great number of fixes since the last release.
Kafka Streams
For quite a while now, you could use KafkaTracing to wrap producers and consumers. How to address higher layers such as processing pipelines was left to documentation.
In early August 2018, @jeqo started an effort to trace Kafka Streams topologies. The general problem is that a trace can easily be lost between the start of a stream (say kinesis data) and eventual outcomes (such as http invocations). Not only do we need to trace the message production and consumption, but also stages in-between. This is no simple task.
Experience gathered through the next months, with folks like @ImFlog pitching in with feedback. Others became interested due to @jeqo's kafka lab. This demand built further leading to @llofberg building an example to help show that it works. With some later polish, we decided to release the first version of tracing Kafka Streams. Here's an example screen shot
To create a KafkaStreams instance with the tracing client supplier enabled, pass your topology and configuration like this:
KafkaStreams kafkaStreams = kafkaStreamsTracing.kafkaStreams(topology, streamsConfig);
This will ensure the general producer consumer parts work and the trace continues through stages. We are still working on nicely naming stages, including some upstream discussions @jeqo has with the Kafka project. For now, we have some hooks that allow you to label operations you'd like to see in the trace.
For example, you can wrap a processor like so:
builder.stream(inputTopic)
       .process(kafkaStreamsTracing.processor("my-favorite-name", customProcessor));
This is all documented in the README and as this is emerging technology any feedback in gitter is more than welcome. Please thank @jeqo for the months of effort and also contributors including @ImFlog and @llofberg for getting something into your hands.
Changes since last release
- Bug fixes to "b3" single header format. Thanks @zyfjeff
- Kafka client code no longer references internal types. Thanks @jeqo
- RxJava propagation no longer relies on internal types. Thanks to @akarnokd for advice
  - This also means fuseable functionality is unsupported
- gRPC tag propagation continues through intermediate (local) spans
- OSGi metadata added for new brave.handler types
Brave 5.4
Brave 5.4 notably introduces FinishedSpanHandler, which can listen on all traced operations and mutate data. This enables features such as metrics aggregation, multiple rate or adaptive sampling, data cleansing and "firehose" export. FinishedSpanHandler was a direct result of features and use cases discussed at Zipkin workshops.
Thanks to @devinsba @jcchavezs and @shakuzen for reviewing early versions of the code. As always, you can ask about this on our gitter channel
Handling Finished Spans
By default, data recorded before Span.finish() are reported to Zipkin via what's passed to Tracing.Builder.spanReporter. FinishedSpanHandler can modify or drop data before it goes to Zipkin. It can even intercept data that is not sampled for Zipkin.
FinishedSpanHandler can return false to drop spans that you never want to see in Zipkin. This should be used carefully as the spans dropped should not have children.
Here's an example of dropping SQL COMMENT spans so they don't clutter Zipkin.
tracingBuilder.addFinishedSpanHandler(new FinishedSpanHandler() {
  @Override public boolean handle(TraceContext context, MutableSpan span) {
    return !"comment".equals(span.name());
  }
});
Another example is redaction: you may need to scrub tags to ensure no personal information like social security numbers ends up in Zipkin.
tracingBuilder.addFinishedSpanHandler(new FinishedSpanHandler() {
  @Override public boolean handle(TraceContext context, MutableSpan span) {
    span.forEachTag((key, value) ->
      value.replaceAll("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}", "xxx-xx-xxxx")
    );
    return true; // retain the span
  }
});
An example of redaction is here
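To see what the redaction expression does in isolation, here is a small standalone sketch using only the JDK. The class RedactionSketch and its methods are hypothetical helpers, not part of Brave; it applies the same regular expression as the handler above to a plain map of tags.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it only exercises the regular expression.
final class RedactionSketch {
  // Same expression as the handler above, compiled once instead of per tag.
  private static final Pattern SSN = Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");

  /** Replaces anything shaped like a social security number with a mask. */
  static String redact(String value) {
    return SSN.matcher(value).replaceAll("xxx-xx-xxxx");
  }

  /** Returns a copy of the tags with every value redacted. */
  static Map<String, String> redactAll(Map<String, String> tags) {
    Map<String, String> result = new LinkedHashMap<>();
    tags.forEach((key, value) -> result.put(key, redact(value)));
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> tags = new LinkedHashMap<>();
    tags.put("sql.query", "select * from users where ssn = '123-45-6789'");
    tags.put("http.path", "/users");
    System.out.println(redactAll(tags));
  }
}
```

Compiling the pattern once is a small design choice: String.replaceAll recompiles the expression on every call, which adds up when a handler runs on every finished span.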
Sampling locally
While Brave defaults to reporting 100% of data to Zipkin, many will use a lower percentage like 1%. This is called sampling, and the decision is maintained consistently throughout the trace, across requests. Sampling has disadvantages. For example, statistics on sampled data are usually misleading by nature of not observing all durations.
FinishedSpanHandler.alwaysSampleLocal() indicates whether the handler should see all data, or just the data sent to Zipkin. You can override this to return true to observe all operations.
Here's an example of metrics handling:
tracingBuilder.addFinishedSpanHandler(new FinishedSpanHandler() {
  @Override public boolean alwaysSampleLocal() {
    return true; // since we want to always see timestamps, we have to always record
  }
  @Override public boolean handle(TraceContext context, MutableSpan span) {
    if (namesToAlwaysTime.contains(span.name())) {
      registry.timer("spans", "name", span.name())
          .record(span.finishTimestamp() - span.startTimestamp(), MICROSECONDS);
    }
    return true; // retain the span
  }
});
An example of metrics handling is here
Future work
Larger sites often have multiple tracing systems which go at different rates. For example, 100% for the edge and one hop down, and 1% for all traffic. These sites have large service graphs and so can't justify the cost of 100% data collection in the same way they handle the 1%. Some, like Yelp, use a "firehose" handler which always samples regardless of what's in headers, directing that to a custom, non-indexed Cassandra keyspace. Netflix have a need to propagate multiple sampling decisions, or triggers for decisions. The combination of local sampling flags and FinishedSpanHandler allows all of these use cases to exist. While we don't have a plugin to perform this out-of-box at the moment, we do have a proof of concept test here. If you are interested, follow the Zipkin wiki for updates, notably those named "firehose", or join our gitter channel.
Prior art
The below discusses the influence we had from OpenCensus, a telemetry project that includes tracing and Zipkin support. This is to honor the project, not to say we are better than it: basically Brave has different feature drivers which result in some hooks being more abstract. We thank the Census project for its efforts in tracing library design.
Local Sampling flag
The Census project has a concept of a SampledSpanStore. Typically, you configure a span name pattern or choose individual spans for local (in-process) storage. This storage is used to power administrative pages named zPages. For example, Tracez displays spans sampled locally which have errors or crossed a latency threshold.
The "sample local" vocab in Census was re-used in Brave to describe the act of keeping data that isn't going to necessarily end up in Zipkin.
The main difference in Brave is implementation. For example, the sampledLocal flag in Brave is a part of the TraceContext and so can be triggered when headers are parsed. This is needed to provide custom sampling mechanisms. Also, Brave has a small core library and doesn't package any specific storage abstraction, admin pages or latency processors. As such, we can't define specifically what sampled local means as Census could. All it means is that FinishedSpanHandler will see data for that trace context.
FinishedSpanHandler Api
Brave had for a long time re-used zipkin's Reporter library, which is like Census' SpanExporter in so far that they both allow pushing a specific format elsewhere, usually to a service. Brave's FinishedSpanHandler is a little more like Census' SpanExporter.Handler in so far as the structure includes the trace context.. something we need access to in order to do things like advanced sampling.
FinishedSpanHandler takes a pair of (TraceContext, MutableSpan). MutableSpan in Brave is.. well.. mutable, and this allows you to cheaply change data. Also, it isn't a struct based on a proto, so it can hold references to objects like Exception. This allows us to render data into different formats such as Amazon's stack frames without having to guess what will be needed by parsing up front. As error parsing is deferred, overhead is less in cases where errors are not recorded (such as is the case on spans intentionally dropped).
Brave 5.2
Brave 5.2 updates compatibility to Kafka 2.0, deprecates Span.remoteEndpoint and introduces ScopeDecorator.
Kafka 2.0 Compatibility
Many thanks to @jeqo for making our brave-instrumentation-kafka-clients compatible with version 2 while remaining compatible with version 1. This means you can transparently upgrade without affecting tracing!
Dropping zipkin2.Endpoint Span apis
For reasons of efficiency and library decoupling, we've deprecated Span.remoteEndpoint(zipkin2.Endpoint) for Span.remoteServiceName(String) and Span.remoteIpAndPort(String, int). This allows RPC instrumentation to indicate a remote service without affecting IP information and vice versa. All Brave-supported instrumentation now uses this mechanism.
End users will notice nothing notable for now, but future work will build on this (ex span metrics). Deprecated methods remain until Brave v6, which isn't currently planned.
ScopeDecorator
Brave 5.2 also introduces ScopeDecorator, which is a way to add things like log4j2 trace ID correlation without affecting which thread locals are used by Brave.
Formerly we did log4j2 integration by wrapping our thread local with something to synchronize log4j2's thread locals:
currentTraceContext = ThreadContextCurrentTraceContext.create(CurrentTraceContext.Default.create());
Now, we can do the same via decorating plugins, which is more elegant at least:
currentTraceContext = ThreadLocalCurrentTraceContext.newBuilder() // use thread local trace context
.addScopeDecorator(ThreadContextScopeDecorator.create()) // with log4j2 trace ID correlation
.build();
More importantly, you can swap out the thread-local backend without affecting your decorators:
currentTraceContext = RequestContextCurrentTraceContext.newBuilder() // use Armeria's request context
.addScopeDecorator(ThreadContextScopeDecorator.create()) // with log4j2 trace ID correlation
.build();
And flexibly, you can now weave in multiple aspects, such as our strict checker:
currentTraceContext = RequestContextCurrentTraceContext.newBuilder() // use Armeria's request context
.addScopeDecorator(ThreadContextScopeDecorator.create()) // with log4j2 trace ID correlation
.addScopeDecorator(StrictScopeDecorator.create()) // complain if closed on the wrong thread
.build();
The major design contributor on this was @anuraaga with lots of review help by @kojilin. Thanks both!
Sidebar on Zipkin Fukuoka workshop
The below discusses advanced trace context code developed at a Zipkin workshop in Fukuoka, Japan at LINE's office. Main contributions were from folks who work both on Brave and Armeria, an asynchronous RPC library. This all led to what you can do above, done by volunteers.. some on holiday! As a treat, we've also put some diagrams in.
Armeria's RequestContext
First, let's review request and response processing. In the case of Armeria, a RequestContext is set up to scope data about a request/response exchange. We use >>> and <<< to indicate the direction of the network, and to note that the same context is used for both directions.
┌──────────────────────────────────────────┐┌────────────────────────────────────────────┐
│ >>> Request Processing ││ <<< Response Processing │
└──────────────────────────────────────────┘└────────────────────────────────────────────┘
┌>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<┐
│ Armeria Server Request Context │
└──────────────────────────────────────╦──────────────────────────────────────────────────┘
║
║
.───────────────. ║
_.─' `──. ┌────▶║
╱ A client request ╲ │ ║┌───────────────────────┐┌───────────────────────┐
( context is a copy of )────┘ ║│>>> Request Processing ││<<< Response Processing│
`. the server request ,' ║└───────────────────────┘└───────────────────────┘
`──. _.─' ▽>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<┐
`─────────────' │ Armeria Client Request Context │
└──────────────────────────────────────────────────┘
As noted above, in the relevant setup, a client context can fork certain values from the server's context. Regardless of if any values are copied, it is important to note that client contexts are separate from server contexts: changes made to the client context do not affect the server.
Teaching Brave to use Armeria's RequestContext
By default, Brave uses thread local storage to hold the current span's trace context. Formerly, Armeria used lifecycle hooks to coordinate these two thread local stores. Starting in Armeria v0.69, Brave uses the above RequestContext model to store its TraceContext. Here's how it works:
When Brave scopes a trace context, for example a server span, it now writes an entry in Armeria's current RequestContext. When a client request context is created, it forks the last value seen on the server context. This allows asynchronous commands later to be associated with the correct position in the trace.
┌>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<┐
│ Armeria Server Request Context │
└──△111111111111111111△222222222222222222△111△11111111111111111111111111111111111111111111┘
│ │ ║ │ │◀─────┐ .───────────────.
┌○──────────────────┼────────────────╬─┼───●┐ │ _.──' `───.
│newScope(server 1) │ ║ │ │ │ ╱ The server's trace ╲
└───────────────────┼────────────────╬─┼────┘ └──────( context is retained even )
▲ ┌◇────────────────╬─◆┐ `. after it is descoped ,'
│ │newScope(local 2)║ │ `───. _.──'
│ └─────────────────╬──┘ `─────────────'
│ ▲ ▽>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<┐
│ │ │ Armeria Client Request Context │
.───────┴───────. ┌──┘ └222△3333333333333333△22222222222222222222222222222┘
_.─' `──.│ │ │
╱ Each trace scope sets ╲ ┌□────────────────■┐
( trace identifiers and ) ────────────▶ │newScope(client 3)│
`. reverts them on close ,' └──────────────────┘
`──. _.─'
`─────────────'
You'll see numbers like 22222211111 above. These show which span ID is present, highlighting values in the context that change over time. For example, 222 says it remained span ID 2, and 22211 says it changed from 2 to 1. In other words, this is a scrappy timeline diagram.
The more interesting part is that while scoping restores the prior ID on close, this is not the case for the initial server span. Since an Armeria RequestContext is provisioned per server request, there's no need to revert the trace IDs associated with that request. By not reverting, it also allows any response callbacks to be associated with the correct server span, too!
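The "set on open, revert on close" behaviour in the diagram can be sketched with a plain thread local. This is an assumption-laden illustration, not Brave's or Armeria's code: ScopeSketch, Scope and newScope are hypothetical names, and span IDs are plain strings.

```java
// Hypothetical sketch of scoping with a thread local; not Brave's implementation.
final class ScopeSketch {
  /** A scope that can be closed without a checked exception. */
  interface Scope extends AutoCloseable {
    @Override void close();
  }

  static final ThreadLocal<String> CURRENT_SPAN_ID = new ThreadLocal<>();

  /** Makes {@code spanId} current and returns a scope that restores the prior value. */
  static Scope newScope(String spanId) {
    final String previous = CURRENT_SPAN_ID.get();
    CURRENT_SPAN_ID.set(spanId);
    return () -> CURRENT_SPAN_ID.set(previous);
  }

  /** Records the current span ID at each step, like the timeline in the diagram. */
  static String timeline() {
    StringBuilder seen = new StringBuilder();
    try (Scope server = newScope("1")) {
      seen.append(CURRENT_SPAN_ID.get());   // inside the server scope
      try (Scope local = newScope("2")) {
        seen.append(CURRENT_SPAN_ID.get()); // inside the nested local scope
      }
      seen.append(CURRENT_SPAN_ID.get());   // reverted to the server span
    }
    seen.append(CURRENT_SPAN_ID.get());     // reverted to nothing
    return seen.toString();
  }

  public static void main(String[] args) {
    System.out.println(timeline());
  }
}
```

Note the last step: a thread local reverts to nothing once the server scope closes. The Armeria integration described above differs exactly there, since the server's trace context is retained in the RequestContext after it is descoped.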
Closing notes
Most people won't need to understand the mechanics described above, but it is helpful to those trying to understand a library in its own semantics. By having Brave use Armeria's context, it is making the integration more natural to the maintainers, so less error prone. Please look forward to Armeria 0.70 which will have even more integration!
This integration was the result of significant brainstorming, design and review by @anuraaga @kojilin and @trustin. Please thank them directly if you use any of this.. or even just like reading about it. If you have any questions or feedback, feel free to contact us on https://gitter.im/openzipkin/zipkin
Brave 5.1
Brave 5.1 adds RxJava 2 propagation, mysql8 and gRPC propagation interop.
Brave 5.1 no longer publishes 'io.zipkin.brave:brave-propagation-aws' as it has been moved to the other Amazon code as 'io.zipkin.aws:brave-propagation-aws'
RxJava 2 propagation
One of the three legs of Distributed Tracing is in-process propagation. This means carrying a trace from one side of the process to the other, hopping threads as necessary. Reactive frameworks can be particularly tricky with this, noted by LINE in their recent meetup.
Brave 5.1 bundles a context propagation hook which ensures all stages of reactive flows propagate trace identifiers. This includes any implicit Brave activities such as MDC logging synchronization.
You can download it via the maven coordinates 'io.zipkin.brave:brave-context-rxjava2'
To set this up, create CurrentTraceContextAssemblyTracking using the current trace context provided by your Tracing component.
contextTracking = CurrentTraceContextAssemblyTracking.create(
  tracing.currentTraceContext()
);
After your application-specific changes to RxJavaPlugins, enable trace context tracking like so:
contextTracking.enable();
Have a look at the project README for more.
The design of this library borrows heavily from https://github.com/akaita/RxJava2Debug and https://github.com/akarnokd/RxJava2Extensions. Thanks to @akaita and @akarnokd for their prior art and especially thanks to @kojilin for testing this in production!
MySQL 8 instrumentation
Many thanks to @anuraaga for developing trace integration for MySQL 8. The hooks are different from before, so make sure when you add the following to your connection URL, you also include the error hook!
First, depend on the maven artifact 'io.zipkin.brave:brave-instrumentation-mysql8'
Then, append the following to the end of your connection URL:
?queryInterceptors=brave.mysql8.TracingQueryInterceptor&exceptionInterceptors=brave.mysql8.TracingExceptionInterceptor
Have a look at the project README for more.
gRPC propagation interop
Brave includes gRPC instrumentation compatible with versions as early as 1.2.0. Starting with gRPC 1.4, an experimental tracing feature can be enabled which uses the OpenCensus library. By default, this writes binary data including trace IDs and the upstream service method.
Starting in Brave 5.1, our gRPC instrumentation can interop with this format, both reading and writing it. This means that tracing can work seamlessly between unlike libraries provided they report to the same tracing system.
As this feature is experimental, it is disabled by default. Enable it with GrpcTracing.Builder.grpcPropagationFormatEnabled.
http.route bug fix
Thanks to @xihw for noticing some edge cases where route-based span names didn't always appear where they should. These were due to thread visibility issues that are now solved and insulated with new integration tests.
Brave 5
Brave 5 is the same as the last release of Brave 4 except without any deprecated methods. We increased the major version number to acknowledge that those using deprecated methods will break.
Even though we have no formal EOL policy, you can consider Brave 5 a long-term release as it has a very stable api at this point. Brave 6 will be more interesting, but is unlikely to start this year.
Brave 4.19
Brave 4.19 includes new apis to streamline both routine and advanced tracing. Notably, it adds Span.error, escalating error handling from convention to api. Tracer.startScopedSpan eases synchronous tracing. Meanwhile, we add more tools for those instrumenting common libraries and frameworks.
These new features were added with care and after practice mandated it. If you have advice for us, please join us on gitter so that future versions continue to clarify and improve.
Adds Span.error and ErrorParser
Before, we didn't define an api to generically handle errors. Rather, it was framework specific.
For example, http operations had a clear error-handling api like so:
serverHandler.handleSend(context.response(), context.failure(), span);
Over time, we noticed a lot of copy-paste, especially when people used tools like AspectJ to automatically trace operations. For example, there were many variations on how to parse an error:
} catch (RuntimeException | Error e) {
  span.tag("error", e.getMessage());
  throw e;
}
If you look above, you'll notice a bug, which is that the message can be null. Bugs aside, some export to tracing systems with first-class errors.
@devinsba works on our Amazon X-Ray support, which includes an error type. He noticed that error handling by convention as opposed to by api meant backends like X-Ray result in less fidelity. As errors are a very big part of tracing, we could do better.
Thanks to efforts by Brian, existing work in Sleuth and review by @shakuzen, this is now sorted.
Rather than eagerly parsing exceptions, use Span.error(Throwable), which incidentally is more efficient when unsampled anyway.
} catch (RuntimeException | Error e) {
  span.error(e);
  throw e;
}
ErrorParser by default adds an "error" tag safely, based on the error message, or the simple type name if there's no message. You can override this with Tracing.Builder.errorParser. Other systems like http have been retrofitted over this api.
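The null-message bug and the fallback described above can be shown with a few lines of plain Java. This is a sketch of the described behaviour only; ErrorTagSketch and errorTagValue are hypothetical names, not Brave's ErrorParser code.

```java
// Hypothetical sketch of the fallback described above; not Brave's ErrorParser.
final class ErrorTagSketch {
  /** Returns the error message, or the simple type name when there is no message. */
  static String errorTagValue(Throwable error) {
    String message = error.getMessage();
    if (message != null) return message;
    return error.getClass().getSimpleName();
  }

  public static void main(String[] args) {
    System.out.println(errorTagValue(new IllegalStateException("connection reset")));
    System.out.println(errorTagValue(new NullPointerException())); // no message
  }
}
```

The second call is the case the copy-pasted catch blocks got wrong: getMessage() returns null there, so tagging it directly would pass a null value.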
Adds Tracer.startScopedSpan
Before, we required two operations to start a span and make it visible for synchronous tracing: Span.start() and Tracer.withSpanInScope(span):
// Start a new trace or a span within an existing trace representing an operation
Span span = tracer.nextSpan().name("encode").start();
// Put the span in "scope" so that downstream code such as loggers can see trace IDs
try (SpanInScope ws = tracer.withSpanInScope(span)) {
  --snip--
} finally {
  span.finish(); // note the scope is independent of the span. Always finish a span.
}
This two-step design was intentional, to decouple the activity of making a span visible from its lifecycle. People regularly doing instrumentation hit this concern often, notably when a callback completes a span.
For example, if the below code from Kafka tracing finished a span just because it exited the code block, it would be too early. The message doesn't actually send until later (hence the callback):
try (SpanInScope ws = tracing.tracer().withSpanInScope(span)) {
  return delegate.send(record, new TracingCallback(span, callback));
} catch (RuntimeException | Error e) {
  span.error(e).finish(); // finish as an exception means the callback won't finish the span
  throw e;
}
While sufficient, synchronous tracing still exists, and when that's the case the two-step dance can be laborious. Most notably, AspectJ/AOP/Annotation based tracing is routinely synchronous. How to address this without breaking compatibility took some time and input from prior art in Wingtips and OpenCensus. It also took some incentive, such as the lower overhead we can cause when we know something is single-threaded.
Here's what we came up with: Tracer.startScopedSpan
// Start a new trace or a span within an existing trace representing an operation
ScopedSpan span = tracer.startScopedSpan("encode");
try {
  // The span is in "scope" meaning downstream code such as loggers can see trace IDs
  return encoder.encode();
} catch (RuntimeException | Error e) {
  span.error(e); // Unless you handle exceptions, you might not know the operation failed!
  throw e;
} finally {
  span.finish(); // always finish the span
}
We took full advantage of the call-site givens to design the simplest api that has least overhead for the task. The "finish" operation above closes the scope and also marks the duration of the operation. There's no need to defer the start operation, or knowledge of the span name.
One design note is that ScopedSpan is intentionally not AutoCloseable: catch blocks happen after auto-close, which means errors would happen after the span reports. This is also not a thread-safe api as by definition this is for same-thread, synchronous tracing. If you find yourself limited, just switch to our normal Span api, which is more advanced.
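The ordering concern behind that design note is a property of Java's try-with-resources and can be observed without any tracing library. In this sketch the AutoCloseable merely stands in for a span; CloseOrderDemo is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates that try-with-resources closes the resource BEFORE the catch block runs.
final class CloseOrderDemo {
  static List<String> run() {
    List<String> events = new ArrayList<>();
    // The resource stands in for a span whose close() would finish and report it.
    try (AutoCloseable span = () -> events.add("close")) {
      events.add("work");
      throw new IllegalStateException("boom");
    } catch (Exception e) {
      // By now the resource is already closed, so it is too late to record the error.
      events.add("catch");
    }
    return events;
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

Because "close" is recorded before "catch", an auto-closed span would finish before any error could be attached to it, which is why an explicit finally block is used instead.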
Thanks very much to @devinsba and @shakuzen for design and code review. Thanks also notably to the OpenCensus project as this api intentionally resembles their SpanBuilder.startScopedSpan().
Deprecates Tracer.newTrace(SamplingFlags) for Tracer.withSampler
Before, we had an edge-case api for parameterized sampling: Tracer.newTrace(SamplingFlags). This deprecates that for a general purpose api that allows you to temporarily override the underlying trace ID sampler.
Ex.
// scope a tracer so that it always reports data
Tracer tracer = tracing.tracer().withSampler(Sampler.ALWAYS_SAMPLE);
Adds CurrentTraceContext.maybeScope
This section is very advanced, feel free to skip it!
Those instrumenting reactive frameworks like RxJava know that overhead can be staggering if not watched carefully. One problem area here is the act of scoping a trace context (ex so that MDC can see it) has overhead, yet there are numerous places where user-code can execute in a reactive flow.
For example, RxJava includes hooks to wrap types that represent an asynchronous functional composition like flowable.parallel().flatMap(Y).sequential()
. Assembly hooks for trace context can ensure every stage can see its trace identifiers.
However, other tools might also instrument these stages, including vert.x or even agent instrumentation, resulting in a doubling of overhead.
We added CurrentTraceContext.maybeScope to automate peeking at state to see if it should be applied or not. It should only be used in propagation-only wrappers, such as executor instrumentation or assembly hooks.
Here's an example from our pending RxJava2 support:
@Override protected void subscribeActual(CompletableObserver s) {
  try (Scope scope = currentTraceContext.maybeScope(assemblyContext)) {
    source.subscribe(new Observer(s, currentTraceContext, assemblyContext));
  }
}
See our javadoc for more!
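The "peek before scoping" idea can be sketched with a thread local and a write counter. This is a simplified model under assumptions, not the CurrentTraceContext implementation: MaybeScopeSketch and its members are hypothetical, and contexts are plain strings.

```java
import java.util.Objects;

// Hypothetical model of "peek first, scope only if needed"; not Brave's code.
final class MaybeScopeSketch {
  interface Scope extends AutoCloseable {
    Scope NOOP = () -> { };
    @Override void close();
  }

  static final ThreadLocal<String> CURRENT = new ThreadLocal<>();
  static int threadLocalWrites; // counts the work we would like to avoid

  /** Always writes the thread local on open and again on close. */
  static Scope newScope(String context) {
    String previous = CURRENT.get();
    CURRENT.set(context);
    threadLocalWrites++;
    return () -> {
      CURRENT.set(previous);
      threadLocalWrites++;
    };
  }

  /** Skips the writes when the requested context is already current. */
  static Scope maybeScope(String context) {
    if (Objects.equals(CURRENT.get(), context)) return Scope.NOOP;
    return newScope(context);
  }

  /** Simulates a second layer of instrumentation re-applying the same context. */
  static int writesForNestedScope(boolean useMaybeScope) {
    threadLocalWrites = 0;
    try (Scope outer = newScope("span-1")) {
      try (Scope inner = useMaybeScope ? maybeScope("span-1") : newScope("span-1")) {
        // user code would run here
      }
    }
    return threadLocalWrites;
  }

  public static void main(String[] args) {
    System.out.println(writesForNestedScope(false)); // both layers write
    System.out.println(writesForNestedScope(true));  // the inner layer is a no-op
  }
}
```

In a real setup the saving matters more than this counter suggests, since each scope operation may also synchronize logging contexts such as MDC.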
Other stuff
- Hides accidentally exposed constructors on Span and Tracing
  - These types were not designed to be extensible
- Deprecates SpanCustomizer.annotate(long timestamp, String value)
  - SpanCustomizer is a basic api, and has no access to a clock that can prevent skew. Use Span or reporter directly if you must backdate data.
- Fixes bug where vert.x span duration happened before response headers
- HttpHandler was refactored to ensure scope clears before reporting spans
  - This handles some edge cases where reporters might amplify by accidentally tracing
- Fixes bug where X-B3-Flags: 1 did not imply sampled when X-B3-Sampled was missing
- KafkaTracing.nextSpan(record) clears trace headers as we do in RabbitMQ
  - This prevents accidental re-processing of the same span
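The X-B3-Flags fix in the list above amounts to a small decision rule, sketched here in plain Java. This is a simplified illustration and not Brave's B3 extraction code: B3SamplingSketch is a hypothetical name, and it assumes X-B3-Sampled only carries "1" or "0".

```java
// Hypothetical, simplified decision rule; not Brave's B3 header parsing.
final class B3SamplingSketch {
  /**
   * Returns TRUE to sample, FALSE to not sample, or null when no decision
   * was propagated and the receiver should decide for itself.
   */
  static Boolean sampled(String sampledHeader, String flagsHeader) {
    if ("1".equals(flagsHeader)) return Boolean.TRUE; // debug flag implies sampled
    if ("1".equals(sampledHeader)) return Boolean.TRUE;
    if ("0".equals(sampledHeader)) return Boolean.FALSE;
    return null; // header missing: defer the decision
  }

  public static void main(String[] args) {
    System.out.println(sampled(null, "1"));  // flags alone are enough
    System.out.println(sampled("0", null));
    System.out.println(sampled(null, null));
  }
}
```

The first case is the one the fix addresses: the flags header is present while X-B3-Sampled is missing.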
Brave 4.18
Brave 4.18 packs in a lot of new features, notably a redesigned integration model for
servlet-based instrumentation, dependency linking for messaging spans, and a safer
customization implementation.
Much of this was driven by users testing Spring Cloud Sleuth 2.0, which is now based
on Brave. Many thanks to the Spring Boot 2 users testing this, and for their patience.
There were also significant design discussions for all of this, with individuals thanked
as we go along below.
Redo of integrations that layer over servlet
All servlet-based frameworks now collaborate with TracingFilter instead of replicating
its lifecycle in framework-specific ways. Even when the replicated code is less than
100 lines, there are a number of reasons, explained below, for this choice.
Problems with mixed instrumentation starting server spans
Before, frameworks stacked over servlet, such as JAX-RS and Spring WebMVC, had
their own full-stack instrumentation. Even if only a few dozen lines, starting
"server spans" independently per-framework caused issues in integration. For
example, some frameworks could be "cut off" prior to their handlers kicking in.
Some frameworks, notably JAX-RS, had very poor quality traces due to the lack of
available hooks. Finally, duplicate instrumentation was possible, especially in
environments like Spring Boot which support multiple coexisting frameworks. This
could materialize as multiple "spans" for the same server request unless
configured very carefully.
Using brave.servlet.TracingFilter whenever possible
The Micrometer team had similar problems trying to
reliably measure web request latency and found normalizing at the Servlet layer
helpful. This makes sense as the servlet layer is the most reused and tested code.
We expect the fewest new bugs there. Also, its pros and cons are well
understood, as is its configuration.
So, we now recommend using the normal brave.servlet.TracingFilter whenever
possible. We also provide brave.spring.webmvc.DelegatingTracingFilter for
those needing to bootstrap that way.
For example, here's a snippet from our legacy app example, which
sets up the tracing filter in XML.
<!-- ContextLoaderListener makes sure delegating filters can read the application context -->
<listener>
  <listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
</listener>
<!-- Add the delegate to the standard tracing filter and map it to all paths -->
<filter>
  <filter-name>tracingFilter</filter-name>
  <filter-class>brave.spring.webmvc.DelegatingTracingFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>tracingFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
Leveraging framework-specific data
While standardizing on Servlet where possible solves some problems, it doesn't
naturally know anything resting above it. For example, Jersey and Spring WebMVC
know about the controller that processed a request and the route that led to it.
We now employ lighter and less error-prone "span customizing" extensions for
frameworks that stack over servlet.
To make this concrete, let's take the legacy application example. If just using the
normal servlet instrumentation, you'd see a trace like this:
The above is skimpy on details but tells you what happened at the http abstraction. Let's
say you took that several-year-old application and added this bean to it.
<mvc:interceptors>
<bean class="brave.spring.webmvc.SpanCustomizingHandlerInterceptor" />
</mvc:interceptors>
As you'll see below, we've coordinated the layers such that the "mvc" component can contribute
tags and even a better span name with no effort on your part.
This isn't limited to Spring applications; it is available for anything that leverages
servlet libraries, including our 1st party instrumentation and whatever you contribute next.
For brave-instrumentation-jaxrs2
- TracingContainerFilter -> SpanCustomizingContainerFilter
For brave-instrumentation-jersey-server
- TracingApplicationEventListener -> SpanCustomizingApplicationEventListener
  - Use TracingApplicationEventListener when your container is native Jersey
  - Use
For brave-instrumentation-spring-webmvc
- TracingAsyncHandlerInterceptor -> SpanCustomizingAsyncHandlerInterceptor
- TracingHandlerInterceptor -> SpanCustomizingHandlerInterceptor
  - Note: SpanCustomizingHandlerInterceptor is only for Spring 2.5
For brave-instrumentation-sparkjava
- There are no span-customizing hooks right now. Let us know if you need them.
Credits roll
We only made it this far due to extended design and review discussions. Thanks very
much to @jplock (dropwizard) @jkschneider (micrometer) @marcingrzejszczak (sleuth) @wu-sheng (skywalking) and indirectly @nicmunroe as wingtips code and docs around servlet are fantastic.
Big changes can be intimidating, so special thanks also to Zipkin regulars
@shakuzen and @zeagord for taking time to review design points.
Messaging spans default to make dependency links
Before, using brave-instrumentation-kafka or brave-instrumentation-spring-rabbit
did not affect the Zipkin dependency graph. In other words, you wouldn't see any links
between the app producing a message to a broker and from the broker to a consumer.
This configuration oversight led to some support questions. Thanks for reporting them!
You can now set the remoteServiceName builder property to indicate the name of
your broker. These default to "kafka" and "rabbitmq" respectively.
For example, if using default rabbit instrumentation, the following trace:
Shows itself routing through the rabbitmq (default name) broker in the dependency diagram:
Thanks notably to @shakuzen and @jonathan-lo for reviewing this change.
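To see why a broker name matters for the dependency graph, consider this rough sketch of how a link could be derived from a messaging span. It is a stand-in and not Zipkin's linker: the Span and Kind types and the "frontend" and "backend" service names are made up for illustration. The assumption is that a producer span links its own service to the broker, and a consumer span links the broker to its own service.

```java
import java.util.List;

// Illustrative stand-in; not Zipkin's dependency linker.
class MessagingLinkSketch {
  enum Kind { PRODUCER, CONSUMER }

  /** Hypothetical span holding only what is needed to derive a link. */
  static final class Span {
    final Kind kind;
    final String localServiceName;
    final String remoteServiceName; // the broker, e.g. "rabbitmq"; null when unset

    Span(Kind kind, String localServiceName, String remoteServiceName) {
      this.kind = kind;
      this.localServiceName = localServiceName;
      this.remoteServiceName = remoteServiceName;
    }
  }

  /** Returns "parent -> child", or null when the broker is unnamed and no link can be made. */
  static String link(Span span) {
    if (span.remoteServiceName == null) return null;
    return span.kind == Kind.PRODUCER
        ? span.localServiceName + " -> " + span.remoteServiceName
        : span.remoteServiceName + " -> " + span.localServiceName;
  }

  public static void main(String[] args) {
    List<Span> spans = List.of(
        new Span(Kind.PRODUCER, "frontend", "rabbitmq"),
        new Span(Kind.CONSUMER, "backend", "rabbitmq"),
        new Span(Kind.PRODUCER, "frontend", null)); // broker name not set
    for (Span span : spans) {
      System.out.println(link(span));
    }
  }
}
```

In this model, the span without a remote service name yields no link at all, which matches the behavior described above from before the defaults were added.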
Tracer.currentSpanCustomizer() and Span.customizer() for safer customization
Our "user-safe" type is SpanCustomizer. This type will have zero changes for its
tenure in the version 4.x line. It also has no lifecycle support, which means it can't
abnormally end your traces. SpanCustomizer happened because we wanted an api that
platform developers needn't be worried about exposing to users or 3rd parties.
We noticed a glitch: it was easy to accidentally expose the underlying span to users,
because Span is a subtype of SpanCustomizer.
We've now fixed this and encourage using Tracer.currentSpanCustomizer() and
Span.customizer() to provide a SpanCustomizer where needed.
Thanks @yschimke and @zeagord for reviewing this!
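The glitch and the fix can be shown with stand-in types. This is a minimal sketch, not Brave's code: the two-method SpanCustomizer, the Span class and the "controller.class" tag key are simplified or invented for illustration. The point is that handing out the span itself lets a caller cast back to the lifecycle api, while a forwarding view does not.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-ins for illustration; not Brave's actual types.
class CustomizerViewSketch {
  /** The "user-safe" surface: naming and tagging only, no lifecycle. */
  interface SpanCustomizer {
    SpanCustomizer name(String name);
    SpanCustomizer tag(String key, String value);
  }

  /** A full span also controls lifecycle, which user code should not touch. */
  static final class Span implements SpanCustomizer {
    String name;
    final Map<String, String> tags = new LinkedHashMap<>();
    boolean finished;

    @Override public Span name(String name) { this.name = name; return this; }
    @Override public Span tag(String key, String value) { tags.put(key, value); return this; }
    void finish() { finished = true; }

    /** Returns a view that forwards customization but is not itself a Span. */
    SpanCustomizer customizer() {
      Span span = this;
      return new SpanCustomizer() {
        @Override public SpanCustomizer name(String name) { span.name(name); return this; }
        @Override public SpanCustomizer tag(String key, String value) {
          span.tag(key, value);
          return this;
        }
      };
    }
  }

  public static void main(String[] args) {
    Span span = new Span();
    SpanCustomizer leaky = span;             // hands out the span itself
    SpanCustomizer safe = span.customizer(); // hands out a forwarding view
    safe.tag("controller.class", "HomeController");

    System.out.println("leaky can be cast to Span: " + (leaky instanceof Span));
    System.out.println("safe can be cast to Span: " + (safe instanceof Span));
    System.out.println("tags seen by the span: " + span.tags);
  }
}
```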
Future work will introduce UnsafeSpan, which can rely on these methods to
ensure higher performance in single-threaded scenarios such as synchronous
calls. Watch the corresponding issue for updates.