
proof of concept of tenancy #3605

Closed
wants to merge 3 commits into main from memory-storage-tenancy

Conversation

@esnible (Contributor) commented Mar 29, 2022

Updates: #3427

Demo of tenancy implementation

I have Jaeger running behind a proxy that adds an x-tenant header based on an Authorization header token or on a cookie. Each tenant sees only the traces posted with the same header value.

Only HTTP GET /api/traces/* and gRPC jaeger.api_v2.CollectorService/PostSpans are implemented.
Only the memory storage is implemented. There is no limit on the number of tenants allowed.
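For illustration, here is a minimal sketch of how the collector side could pull the tenant out of the incoming gRPC metadata on PostSpans and stash it on the context; the `tenantKey` type and helper names are hypothetical, not necessarily what this PR uses.

```go
package handler

import (
	"context"
	"fmt"

	"google.golang.org/grpc/metadata"
)

// tenantKey is an unexported context key type, so the stored tenant cannot
// collide with values set by other packages.
type tenantKey struct{}

// tenantFromGRPCMetadata reads the x-tenant header from the incoming gRPC
// metadata and attaches it to the context for the rest of the pipeline.
func tenantFromGRPCMetadata(ctx context.Context) (context.Context, error) {
	md, ok := metadata.FromIncomingContext(ctx)
	if !ok {
		return ctx, fmt.Errorf("missing gRPC metadata")
	}
	tenants := md.Get("x-tenant")
	if len(tenants) != 1 {
		return ctx, fmt.Errorf("exactly one x-tenant value expected, got %d", len(tenants))
	}
	return context.WithValue(ctx, tenantKey{}, tenants[0]), nil
}

// tenantFromContext returns the tenant previously attached by tenantFromGRPCMetadata.
func tenantFromContext(ctx context.Context) (string, bool) {
	tenant, ok := ctx.Value(tenantKey{}).(string)
	return tenant, ok
}
```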

Is this the right approach? This PR is offered as a proof of concept.

cc @pavolloffay

@codecov bot commented Mar 29, 2022

Codecov Report

Merging #3605 (e695f99) into main (784644c) will decrease coverage by 0.14%.
The diff coverage is 76.85%.

@@            Coverage Diff             @@
##             main    #3605      +/-   ##
==========================================
- Coverage   96.40%   96.25%   -0.15%     
==========================================
  Files         264      264              
  Lines       15509    15606      +97     
==========================================
+ Hits        14951    15022      +71     
- Misses        468      484      +16     
- Partials       90      100      +10     
Impacted Files Coverage Δ
cmd/collector/app/handler/grpc_handler.go 70.96% <40.00%> (-14.75%) ⬇️
plugin/storage/memory/memory.go 92.65% <74.64%> (-7.35%) ⬇️
cmd/collector/app/span_processor.go 98.41% <85.71%> (-1.59%) ⬇️
cmd/collector/app/options.go 98.91% <87.50%> (-1.09%) ⬇️
cmd/collector/app/model_consumer.go 100.00% <100.00%> (ø)
cmd/collector/app/root_span_handler.go 100.00% <100.00%> (ø)
cmd/query/app/http_handler.go 100.00% <100.00%> (ø)
plugin/storage/memory/options.go 100.00% <100.00%> (ø)
cmd/query/app/static_handler.go 94.01% <0.00%> (-3.00%) ⬇️
pkg/config/tlscfg/cert_watcher.go 94.73% <0.00%> (+2.10%) ⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@yurishkuro (Member)

The mechanism of passing the tenant via Context makes me rather uncomfortable. Since there are strong privacy implications associated with tenancy, passing it via Context in a "best effort" manner makes it pretty easy to violate those implications. On the other hand, there are steps we could take to protect against that, such as storage integration tests that write & query to ensure different tenants do not see each other's data, and also making the tenant key in the Context required, i.e. failing fast if it's undefined. The latter can be implemented as tenant-enforcing decorators to the SpanProcessor and Storage interfaces.
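To make the fail-fast idea concrete, a minimal sketch of such a decorator; the spanProcessor interface here is simplified and already assumes the context parameter discussed in this PR, and the context key name is illustrative.

```go
package app

import (
	"context"
	"errors"
)

// spanProcessor is a simplified stand-in for the collector's SpanProcessor,
// already carrying the context parameter discussed in this PR.
type spanProcessor interface {
	ProcessSpans(ctx context.Context, spans []any) error
}

// tenantContextKey stands in for whatever key the ingestion handlers use to
// store the tenant on the context.
type tenantContextKey struct{}

var errMissingTenant = errors.New("tenant is not set on the request context")

// tenantEnforcingProcessor fails fast when the tenant is undefined, so a
// misconfigured pipeline cannot silently mix data from different tenants.
type tenantEnforcingProcessor struct {
	wrapped spanProcessor
}

func (p tenantEnforcingProcessor) ProcessSpans(ctx context.Context, spans []any) error {
	if tenant, _ := ctx.Value(tenantContextKey{}).(string); tenant == "" {
		return errMissingTenant
	}
	return p.wrapped.ProcessSpans(ctx, spans)
}
```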

I also think the entry points to the backend should always invoke some "tenant strategy" that is configured centrally (only once), such that it could enforce that a tenant header is always populated or always ignored (for single-tenant sites), and in the future could be used to validate the tenant IDs.
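A rough sketch of what such a centrally configured strategy could look like; the Options struct, its field names, and the TenantFromRequest helper are assumptions for illustration, not an existing Jaeger API.

```go
package tenancy

import (
	"errors"
	"net/http"
)

// Options is a hypothetical, centrally configured tenant strategy; every entry
// point (HTTP and gRPC) would consult the same instance.
type Options struct {
	Enabled bool                // false for single-tenant sites: the header is ignored
	Header  string              // e.g. "x-tenant"
	Allowed map[string]struct{} // optional allow-list; empty means any non-empty tenant is accepted
}

var errUnknownTenant = errors.New("missing or unknown tenant")

// TenantFromRequest applies the strategy to an incoming HTTP request.
func (o Options) TenantFromRequest(r *http.Request) (string, error) {
	if !o.Enabled {
		return "", nil // tenancy disabled: nothing to extract or enforce
	}
	tenant := r.Header.Get(o.Header)
	if tenant == "" {
		return "", errUnknownTenant
	}
	if len(o.Allowed) > 0 {
		if _, ok := o.Allowed[tenant]; !ok {
			return "", errUnknownTenant
		}
	}
	return tenant, nil
}
```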

@pavolloffay (Member)

I agree with @yurishkuro's comment above. If we want to add tenancy support in the storage layer, I would prefer to pass the tenant explicitly; however, that would be a breaking change for the gRPC storage plugin. In fact, there will always be a breaking change, because the dependency store's write method does not pass a context.

That said, if we can make sure the tenant is always properly propagated, I am fine with the context approach as well.

@esnible (Contributor, Author) commented Apr 4, 2022

@yurishkuro @pavolloffay I wrote this PR as an exercise to make sure I understood the Jaeger internals well enough to write a proposal, and to size the effort.

Is the next step to write a proposal? I had hoped to get started on this before the April community call.

Here are my design thoughts:

  • Only one incoming queue (so no per-tenant metrics).
  • The current design of one spanstore.Writer is kept (no map[tenant]Writer).
    • Corollary: every Writer needs tenancy tests
    • Each storage can choose the best tenancy implementation (map of tenants for memory, tenant-name-in-the-index for Elasticsearch, tenant-name-in-a-column, etc.); a sketch of the memory-store approach follows this list.
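A minimal sketch of the "map of tenants" idea for the memory backend; the type and field names are illustrative and do not match the real memory plugin.

```go
package memory

import "sync"

// perTenantStore illustrates the "map of tenants" approach for the in-memory
// backend: one inner store per tenant, created lazily on first use.
type perTenantStore struct {
	mu      sync.Mutex
	tenants map[string]*singleTenantStore
}

// singleTenantStore holds data for exactly one tenant and is never handed to
// readers or writers acting for a different tenant.
type singleTenantStore struct {
	traces map[string][]byte // traceID -> serialized trace, simplified
}

func (s *perTenantStore) forTenant(tenant string) *singleTenantStore {
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.tenants[tenant]
	if !ok {
		st = &singleTenantStore{traces: map[string][]byte{}}
		s.tenants[tenant] = st
	}
	return st
}
```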

Flow of tenant information on post (new parameters in bold; a sketch of the queue hand-off follows this list)

  • network -> PostSpans(context, ...) -> ProcessSpans(..., SpansOptions{SpanTenant}) -> preProcessSpan(..., context) ; enqueueSpan(..., string)
  • queue -> processSpan(..., context) -> saveSpan(..., context) -> Writer.WriteSpan(context, ...)
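One detail worth making explicit in the flow above: the incoming request context is cancelled when PostSpans returns, so the tenant has to be carried with the queued item and re-attached to a fresh context before Writer.WriteSpan. A minimal sketch, with hypothetical names (queueItem, spanTenantKey), not the PR's exact types:

```go
package pipeline

import "context"

// queueItem carries the tenant across the queue boundary: the incoming request
// context dies once PostSpans returns, so the tenant is copied into the queued
// item and re-attached to a fresh context before the storage call.
type queueItem struct {
	tenant string
	span   any // stand-in for *model.Span
}

type spanTenantKey struct{}

func enqueue(ctx context.Context, span any, queue chan<- queueItem) {
	tenant, _ := ctx.Value(spanTenantKey{}).(string)
	queue <- queueItem{tenant: tenant, span: span}
}

func worker(queue <-chan queueItem, writeSpan func(ctx context.Context, span any) error) {
	for item := range queue {
		// fresh context for the storage call, carrying only the tenant
		ctx := context.WithValue(context.Background(), spanTenantKey{}, item.tenant)
		_ = writeSpan(ctx, item.span)
	}
}
```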

Currently only the V2 HTTP getTrace is implemented to pass tenant info from the header.

  • All of the other methods (getServices, getOperations) and the GRPC and the V3 interfaces need to do the same.
  • Corollary: Every reader needs tenancy tests
  • All four paths (v2/3 and HTTP/gRPC) need to be tested against at least the memory spanstore.Reader

Flow of tenant information on query (no new params, although context is used to pass information from headers)

network -> getTrace() etc -> queryService.GetTrace(context, ...) -> Reader.GetTrace(context, ...)
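A minimal sketch of the query-side propagation as HTTP middleware; the header name matches the x-tenant header mentioned above, while the handler and context key names are illustrative.

```go
package queryapp

import (
	"context"
	"net/http"
)

type queryTenantKey struct{}

// tenantPropagationHandler copies the x-tenant header into the request context
// so that queryService.GetTrace and spanstore.Reader.GetTrace can read it
// further down the call chain.
func tenantPropagationHandler(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tenant := r.Header.Get("x-tenant")
		if tenant == "" {
			http.Error(w, "missing x-tenant header", http.StatusUnauthorized)
			return
		}
		ctx := context.WithValue(r.Context(), queryTenantKey{}, tenant)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```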

We probably can't implement tenancy in all back-ends. The first release should target memory and Elasticsearch.

This PR does not yet have a way to tell Jaeger what the allowed tenants are. Flag with comma-separated list? File?

Each implementation of spanstore.Reader and spanstore.Writer needs a way to indicate whether it is tenant-aware, probably by implementing an interface:

// Tenanted is implemented by storage backends that can say whether a tenant name is acceptable to them.
type Tenanted interface {
	IsValid(tenant string) error
}

If the user specifies something like --tenants=dusenberg,packard,plymouth Jaeger could verify that the Reader and Writer can be cast into Tenanted and that the three strings were considered valid tenants for that storage implementation.
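A sketch of that startup check, assuming a hypothetical --tenants flag value and the Tenanted interface above; the function name and signature are illustrative.

```go
package app

import (
	"fmt"
	"strings"
)

// Tenanted mirrors the interface proposed above.
type Tenanted interface {
	IsValid(tenant string) error
}

// validateTenantsFlag splits the value of a hypothetical --tenants flag and
// asks the storage, if it implements Tenanted, whether each tenant is acceptable.
func validateTenantsFlag(flagValue string, storage any) error {
	t, ok := storage.(Tenanted)
	if !ok {
		return fmt.Errorf("configured storage backend is not tenant-aware")
	}
	for _, tenant := range strings.Split(flagValue, ",") {
		if err := t.IsValid(tenant); err != nil {
			return fmt.Errorf("tenant %q rejected by storage: %w", tenant, err)
		}
	}
	return nil
}
```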

@yurishkuro (Member)

Only one incoming queue (so no per-tenant metrics.)

I think we can iterate on this later, as I see benefits to having per-tenant queues for better isolation. Why no per-tenant metrics? If the tenant data is available, the number of queues doesn't matter; certain metrics can still be tagged with the tenant.

Corollary: Every reader needs tenancy tests

As I mentioned in earlier comments, I would like to have a decorator-type Reader/Writer that verifies that the tenant info is present in the Context for all methods. That doesn't guarantee that storage implementation would do the right thing, but it enforces the upstream pipeline to always pass the tenant (since we're not adding an explicit parameter to all of the pipeline APIs).
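For example, a guarding Writer decorator could look roughly like this; `writer` is a stand-in for spanstore.Writer and the context key is illustrative. A matching guardedReader would wrap every spanstore.Reader method the same way.

```go
package storage

import (
	"context"
	"errors"
)

// writer is a stand-in for spanstore.Writer (the real method takes *model.Span).
type writer interface {
	WriteSpan(ctx context.Context, span any) error
}

type guardTenantKey struct{}

var errNoTenant = errors.New("refusing storage call: no tenant in context")

// guardedWriter cannot make the backend itself tenant-safe, but it forces
// every upstream caller to supply a tenant before any span is written.
type guardedWriter struct {
	wrapped writer
}

func (g guardedWriter) WriteSpan(ctx context.Context, span any) error {
	if tenant, _ := ctx.Value(guardTenantKey{}).(string); tenant == "" {
		return errNoTenant
	}
	return g.wrapped.WriteSpan(ctx, span)
}
```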

Each implementation of spanstore.Reader and spanstore.Writer needs a way to indicate if it is tenant-aware.

Why is that? I think every integration test instantiation needs that, but I don't see who would use this tenant-aware information from the storage implementations.

This PR does not yet have a way to tell Jaeger what the allowed tenants are. Flag with comma-separated list? File?

We can add that incrementally, but we should have some central choke point (perhaps that decorator I mentioned above) that could validate tenants if necessary.

@yurishkuro (Member)

Is the next step to write a proposal? I had hoped to get started on this before the April community call.

I feel that this PR can work as a proposal, but at the same time I've always wanted to start including Architecture Decision Records (ADRs) in the repo, so it's not a bad idea to give it a spin.

@esnible (Contributor, Author) commented Apr 28, 2022

@yurishkuro I am restarting work on this PR today, in response to your comments.

This PR used a single spanstore.Writer. The next version will use one Writer per tenant.

Would you suggest all tenants share the same SpanProcessors?
Would you suggest spans for all tenants live in different queues before processing?

@yurishkuro (Member)

Why one writer per tenant? Are you trying to allow different storage configs per tenant?

My suggestion would be to start with the simplest form: everything identical and shared, but tenancy exists at the data level. Then we can start adding complications for better isolation, if there's interest.

esnible added 3 commits May 2, 2022 13:14
Signed-off-by: Ed Snible <snible@us.ibm.com>
Signed-off-by: Ed Snible <snible@us.ibm.com>
…jection of invalid tenants

Signed-off-by: Ed Snible <snible@us.ibm.com>
@esnible force-pushed the memory-storage-tenancy branch from b5bb192 to e695f99 on May 2, 2022 20:56
@yurishkuro (Member)

@esnible this PR is quickly getting out of hand in size & complexity; I strongly advise incremental changes. For example, you're doing a large-scale refactoring to pipe the Context through into more places. This should be done separately, and we can merge such a change to unblock the actual tenant work.

@esnible (Contributor, Author) commented May 3, 2022

@yurishkuro Thanks for your eyes on this.

I am happy to close this draft and create small PRs to do this work gradually.

The breaking change I am introducing is the addition of a Context parameter in ProcessSpan and ProcessSpans. That single change can be its own PR.
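For concreteness, a sketch of the before and after shapes of that change; the span types are abbreviated to keep the sketch self-contained and the names are not the exact existing signatures.

```go
package processor

import "context"

// current hook, without a context:
type ProcessSpan func(span any) // span stands in for *model.Span

// proposed replacement, threading the request context through in place
// (a breaking change for existing implementers):
type ProcessSpanWithContext func(ctx context.Context, span any)

// SpanProcessor.ProcessSpans would change the same way:
type SpanProcessor interface {
	ProcessSpans(ctx context.Context, spans []any) error
}
```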

  • Before I work on that PR, does it make sense to make that change, or introduce a new ProcessSpanV2 and new methods on the interfaces that take V2 versions?
  • Is it reasonable to let the processing pipeline take a context, similar to an HTTP middleware, or is it better to do what @pavolloffay suggested and introduce a new parameter that is just for tenancy?

@yurishkuro (Member)

Before I work on that PR, does it make sense to make that change, or introduce a new ProcessSpanV2 and new methods on the interfaces that take V2 versions?

No, I don't think we should do V2 interfaces; I would rather change/upgrade things in place to reduce the maintenance burden in the future (we're changing the internals, so there is no need to keep V1s).

Is it reasonable to let the processing pipeline take a context

It's a hard question to answer. My preference is to pass tenancy explicitly, if possible, but we have to weigh it against the amount of API churn it will introduce. From your PR it seems that a lot of churn is already needed anyway just to pass the context. As I mentioned, I don't mind the Context approach too much, but in this case we should be consistent end-to-end, which was not the case in your PR since some signatures were passing tenant explicitly.

So I think you, having looked at all the pipeline code, might be in the best position to argue which way is better.
