Call tree summary, e.g. after each request #639

kolloch · 2020-03-22T12:19:10Z

Feature Request

Provide an easy way to print a call tree summary after each processed request or another meaningful unit of work in your application. The call tree should be based on the tracing spans. Events are ignored.

(I am writing this feature proposal after the nice encouragement by @hawkw)

Output

In the past, roughly the information that you would find in a flame graph but in text form has proven very helpful. Here is my first proposal for an output format:

 [# calls]    wall  ms  |     own ms  |  span name
              ----------------------------------------------------------------------
 [       1]   7,033.22 |      20,25  |  graphql
 [       1]      23.22 |       3,20  |  |- database_pool.get
 [       1]     611.21 |       5,23  |  |- elections.query     
 [      15]     511.21 |       3,23  |  |  |- election_options.query
 [      15]     311.21 |       5,22  |  |  |   |- election_comments.query
 [       1]     033.22 |      20,25  |  |- commit transaction

[# calls] the total number of calls for this call tree path.
wall ms the total wall time that a span with this call path was alive in ms (Subscriber::new_span until try_close). Edit: was sum ms before but this is misleading.
own ms the total wall time that executing was in a span with this call path (Subscriber::enter until Subscriber::leave).
span name The name from the Metadata. We could also add the callsite for disambiguation but this is probably not necessary. ~~some relatively short identifier for the span -- probably we should allow customizing this so that each user can create a function that creates an appropriate short name for each span~~.

The order of tree nodes: There should be only one entry for each call path but the spans should be sorted by the first time they were seen. Think of storing the children of each call path in something like a linked_hash_map. That way, the order of the children resembles the order in which they were called. For repeated calls, the order is often still quite readable since it is typical that some sub sequence of calls is simply repeated in a loop.

That, in practice, gives you a lovely outline of how your request was processed.

Crates

I think that we should create a new sub crate for this since the functionality is orthogonal to the rest and only users who want this should pay the price. Alternatively, it might be enabled by a feature on the tracing-subscriber crate.

The data model built by this subscriber might be useful for other summaries and might be extracted into another crate or a lib in the tracing-subscriber crate. I'd start with keeping the code together in one place, though.

Motivation

Let's assume a simple data model for the above example. Every election has several election_options for voting. Users can comment on every election_option with an election_comment.

Counts are useful: With the call tree above e.g. "N+1" problems become obvious. One can suspect that there is one query per election because there are multiple election_options.query calls for only one elections.query call. Fortunately, the comments do not seem to be queried individually since their cardinality is the same as the parent.
wall ms is useful: We do see where the majority of the latency comes from.
own ms is useful: We can confirm that our app mostly waits for the database. At least for async this should work well, for sync code we should rely on sub spans.

Proposal

A new configurable subscriber should be created for this. I assume that the Layer/Subscriber infrastructure in tracing-subscriber is a good match but I haven't looked into the details.

Introspection into spans:

summary roots: By default, the subscriber should start tracking at all root spans. ~~the subscriber should start tracking call paths for any spans that have a marker field, like e.g. summary_root~~. This should be overridable.
short names: The subscriber needs to get a short name for each span. This should be user configurable. (not necessary, the metadata for spans already includes a "name" that is suitable for this)

New spans which are not children of summary roots can be completely ignored.

When a summary root is try_closed, the summary as defined above should be printed.

Synthetic example

    use tracing::{info_span, info, instrument};

    #[instrument]
    fn repeated(repetition: i32) {
        info!("repetition: {}", repetition);        
    }

    #[instrument]
    fn nested() {
        for i in 1..=10 {
            repeated(i);
        }
    }

    #[instrument]
    fn request() {
        nested();
        repeated(-1);
        // Even though the name is the same, this is a different span.
        let name_clash_span = info_span!("nested");
        let _enter = name_clash_span.enter();
    }

    request();

should result in the following tree (times are obviously unrealistic):

 [# calls]    wall  ms  |     own ms  |  span name
              ----------------------------------------------------------------------
 [       1]       12.22 |       0,25  |  request
 [       1]       11.21 |       0,23  |  |- nested     
 [      10]       11.21 |       0,23  |  |  |- repeated
 [       1]        1.21 |       0,23  |  |- repeated     
 [       1]        0.27 |       0,25  |  |- nested

The text was updated successfully, but these errors were encountered:

hawkw · 2020-03-22T18:28:59Z

Thanks for the great feature request! I really appreciate the level of thought and detail that you've put into this issue; this is very helpful! :)

I believe @davidbarsky has been experimenting with something similar --- I'm not sure how much additional work is needed to get that crate ready for release, it may just be adding documentation and polish.

I think David's crate right now is currently closer to a tree-structured log than a flamegraph-like profile — it performs no aggregation of multiple instances of the same span, and includes events as children of spans. However, I think we could add a lot of the features you're proposing on top of what he's got, and it could probably expose some configuration options to change the level of aggregation, etc.

Here are some assorted notes on what you've written up:

sum ms the total wall time that a span with this call path was alive in ms (Subscriber::new_span until try_close).

own ms the total wall time that executing was in a span with this call path (Subscriber::enter until Subscriber::leave).

Note thatown ms is also going to be a sum, as enter and exit may be called multiple times over a span's lifetime (this is especially common in async code, where a future will be polled several times before completing).

span name some relatively short identifier for the span -- probably we should allow customizing this so that each user can create a function that creates an appropriate short name for each span.
[...]

short names: The subscriber needs to get a short name for each span. This should be user configurable.

Span metadata already has a name field which is the string literal provided in the span! macro. By convention, these are relatively short --- and, since they are string literals, they don't contain dynamically interpolated content the way that messages can. That should probably work for this purpose!

summary roots: By default, the subscriber should start tracking call paths for any spans that have a marker field, like e.g. summary_root. This should be overridable.

A much simpler option: we could just start by tracking any span that is a root (i.e. doesn't have a parent).

We could also add the ability to layer in a filter using the existing filter infrastructure, to allow selecting which roots should be tracked by disabling those we don't care about. I think users could easily configure the behavior you're describing, with a special summary_root marker field, using the existing filter infrastructure.

And some general open questions:

How to handle field values? If we're counting instances of a span, and we record instances of that span with different field values as unique spans, we may not do a whole lot of aggregation — we may end up with a bunch of spans which all have a count of 1 because they have different field values. On the other hand, it seems like a shame to discard values entirely... Is there some kind of aggregation we can do here, like counting the number of times each value occurred?
What about events? Your introductory paragraph suggested events should be ignored entirely. Since each event callsite is uniquely identifiable across instances of that event, we could perform some aggregation or counting of event instances as well. Might be worth thinking about eventually.

davidbarsky · 2020-03-22T18:48:45Z

I believe @davidbarsky has been experimenting with something similar --- I'm not sure how much additional work is needed to get that crate ready for release, it may just be adding documentation and polish.

Yup. I can release it now as an alpha to crates.io with minimal documentation, but before I feel comfortable saying its "ready", I'd like to add additional documentation.

I think David's crate right now is currently closer to a tree-structured log than a flamegraph-like profile — it performs no aggregation of multiple instances of the same span, and includes events as children of spans. However, I think we could add a lot of the features you're proposing on top of what he's got, and it could probably expose some configuration options to change the level of aggregation, etc.

I'd gladly accept these PRs. The proposal @kolloch has laid out is a direction I'd like to see tracing-tree move towards.

kolloch · 2020-03-24T20:24:09Z

Thank you for your interest and your thoughtful comments!

I think David's crate right now is currently closer to a tree-structured log than a flamegraph-like profile — it performs no aggregation of multiple instances of the same span, and includes events as children of spans. However, I think we could add a lot of the features you're proposing on top of what he's got, and it could probably expose some configuration options to change the level of aggregation, etc.

@hawkw @davidbarsky I just had a look at the code and I don't see much overlap, unfortunately. No aggregation, the output format is quite different, ... I might be wrong! Maybe when I start implementing I will realize that I need exactly what David wrote already ;)

Note thatown ms is also going to be a sum, as enter and exit may be called multiple times over a span's lifetime (this is especially common in async code, where a future will be polled several times before completing).

Yes, I am aware. sum ms might be misleading there. Even without repeated enter exit events, everything is a sum because the idea is to sum up by call path and call paths might be within a loop etc..

Span metadata already has a name field which is the string literal provided in the span! macro. By convention, these are relatively short --- and, since they are string literals, they don't contain dynamically interpolated content the way that messages can. That should probably work for this purpose!

Very good suggestion, sounds like a match. Thanks!

summary roots: By default, the subscriber should start tracking call paths for any spans that have a marker field, like e.g. summary_root. This should be overridable.

A much simpler option: we could just start by tracking any span that is a root (i.e. doesn't have a parent).

That is a good default, true.

We could also add the ability to layer in a filter using the existing filter infrastructure, to allow selecting which roots should be tracked by disabling those we don't care about. I think users could easily configure the behavior you're describing, with a special summary_root marker field, using the existing filter infrastructure.

Nice, I have to read up on the code to fully grasp what you mean.

And some general open questions:

* **How to handle field values?** If we're counting instances of a span, and we record instances of that span with different field values as unique spans, we may not do a whole lot of aggregation — we may end up with a bunch of spans which all have a count of 1 because they have different field values. On the other hand, it seems like a shame to discard values entirely... Is there some kind of aggregation we can do here, like counting the number of times each value occurred?

I am thinking of true call paths in which path entries correspond to code locations. Probably I could do the matching on callsite. It might be sometimes useful to distinguish spans by further means but I would only implement it if that happens to be common in practice. I'd guess by default not and there might be an option or so to include certain fields in the call path element identity?

* **What about events?** Your introductory paragraph suggested events should be ignored entirely. Since each event callsite is uniquely identifiable across instances of that event, we could perform some aggregation or counting of event instances as well. Might be worth thinking about eventually.

Yeah, true. We could e.g. count all warnings/errors at a call path.

The name of events is not very useful without having the source code side-by-side: definition of the name. Maybe we could use a certain special field value or do you have other ideas? In similar cases, I have seen the formatting string used as informal "name" -- with the placeholders still shown as placeholders. E.g. "this is an event: {}" for info!("this is an event: {}", something).

Unfortunately, that is not yet available within the Metadata. Since it would be simply a static string, it is probably okay to add it, though.

kolloch · 2020-12-20T18:41:40Z

Thanks to your encouragement, I finally got around putting something together:

https://github.com/kolloch/reqray

I think it should be useful at its current form. One implementation detail that I am uncomfortable with is the use of locking through the span extensions: Since this involves locking under the hood and I use extensions not only from the current span, but the parent and the root spans, I could imagine this results in deadlocks if another layer also uses these extensions.

Can I assume if I only use my own crate types in the extensions map that I am free from such interferences of other extensions?

Thanks!

davidbarsky · 2020-12-21T17:25:56Z

This looks really cool, @kolloch!

Can I assume if I only use my own crate types in the extensions map that I am free from such interferences of other extensions?

Yup. While the types placed within an extension can be shared between layers, I'm not sure how often it happens in practice due to those types being private.

kolloch · 2020-12-22T17:43:55Z

Thank you! The extensions(_mut) functions for SpanData acquire locks:

    fn extensions(&self) -> Extensions<'_> {
        Extensions::new(self.inner.extensions.read().expect("Mutex poisoned"))
    }

    fn extensions_mut(&self) -> ExtensionsMut<'_> {
        ExtensionsMut::new(self.inner.extensions.write().expect("Mutex poisoned"))
    }

The code I wrote originally acquired the extensions of the root/parent spans in addition to the "active" span. If other extensions did that as well and tried to acquire the locks in opposite order, it would result in deadlocks. I avoid that now by always releasing the lock before acquiring other extensions.

I guess that this should probably be documented in the SpanRef methods. I'll create a small PR...

Also, I am quite open to contribute the code of reqray back into the main tracing repo if there is any interest in that.

hawkw added kind/feature New feature or request kind/rfc A request for comments to discuss future changes labels Apr 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Call tree summary, e.g. after each request #639

Call tree summary, e.g. after each request #639

kolloch commented Mar 22, 2020 •

edited

Loading

hawkw commented Mar 22, 2020 •

edited

Loading

davidbarsky commented Mar 22, 2020

kolloch commented Mar 24, 2020

kolloch commented Dec 20, 2020

davidbarsky commented Dec 21, 2020

kolloch commented Dec 22, 2020

Call tree summary, e.g. after each request #639

Call tree summary, e.g. after each request #639

Comments

kolloch commented Mar 22, 2020 • edited Loading

Feature Request

Output

Crates

Motivation

Proposal

Synthetic example

hawkw commented Mar 22, 2020 • edited Loading

davidbarsky commented Mar 22, 2020

kolloch commented Mar 24, 2020

kolloch commented Dec 20, 2020

davidbarsky commented Dec 21, 2020

kolloch commented Dec 22, 2020

kolloch commented Mar 22, 2020 •

edited

Loading

hawkw commented Mar 22, 2020 •

edited

Loading