Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Remove the need for external tools for managing Elasticsearch #6283

Closed
5 tasks
yurishkuro opened this issue Dec 1, 2024 · 37 comments
Closed
5 tasks

Remove the need for external tools for managing Elasticsearch #6283

yurishkuro opened this issue Dec 1, 2024 · 37 comments
Labels
area/storage good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement storage/elasticsearch v2

Comments

@yurishkuro
Copy link
Member

yurishkuro commented Dec 1, 2024

PR #5797 is a good example of eliminating the need for schema initialization script for Cassandra by moving the logic directly into Jaeger binary.

We have three standalone tools for managing ES that we can also merge into Jaeger binary directly. All of them are already implemented in Go, so are fairly straightforward to embed in the Jaeger binary (fewer binary artifacts to release).

  • cmd/es-index-cleaner/ is used as a cron job to delete old ES indices. The logic can instead be executed from the Jaeger binary on a timer.
  • cmd/es-rollover/ - similar to cleaner but for rolling over indices.
  • esmapping-generator/ - just prints the index mappings to stdout (probably for manual installation)

Note: this issue is not about removing the existing binaries, which we still want to maintain until Jaeger v1 EOL in 2026. We just want to make them not required for operating Jaeger-v2.

Tasks

  • extends configuration for ES in Jaeger v2 to support additional parameters for the cleaner & rollover scripts, such as frequency of running, including enabled properties for each.
  • enable index cleaner capability in v2, update integration tests to use that
  • enable index rollover capability in v2, update integration tests to use that
  • add a sub-command es-mappings to jaeger v2 binary to print es mappings
  • update documentation accordingly to no longer reference the separate scripts
@yurishkuro yurishkuro added good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement labels Dec 1, 2024
@Rohanraj123
Copy link
Contributor

I am working on it

@akstron
Copy link
Contributor

akstron commented Dec 1, 2024

Hi @yurishkuro , for cmd/es-index-cleaner/, we would only require the filters to be added in the config, rest other fields used for client creation would already be available to us.
Below config should suffice it I believe:

type Cleaner struct {
	// Separator between date fragments.
	IndexDateSeparator string
	// Whether to filter archive indices.
	Archive bool
	// Whether to filter rollover indices.
	Rollover bool
	// Indices created before this date will be deleted.
	DeleteBeforeThisDate time.Time
}

Does it sound good?

@yurishkuro
Copy link
Member Author

IndexDateSeparator - if the main storage doesn't need this argument, why does cleaner need it?

Archive bool - archive storage is configured as separate storage in v2, so it can have its own cleaner if desired.

Rollover bool - not clear. What's the alternative?

DeleteBeforeThisDate time.Time - doesn't make sense, cleaner is a permanent periodic job, not something that takes an absolute time. We already have max span age parameter iirc, is it not sufficient to drive the behavior of the cleaner?

@akstron
Copy link
Contributor

akstron commented Dec 2, 2024

I mostly took out the filters used in the current implementation.

IndexDateSeparator - if the main storage doesn't need this argument, why does cleaner need it?

Yes, I don't think so we need it. There is DateLayout in index config, which we can probably use to find indices here.

Archive bool - archive storage is configured as separate storage in v2, so it can have its own cleaner if desired.

Makes sense. Thanks

Rollover bool - not clear. What's the alternative?

I was thinking maybe I can use the exact filter used for the rollover case, in the current implementation. But I probably have to read more about how this is being used here before commenting further. If you can provide me with some pointers, that would be really helpful? Thanks

DeleteBeforeThisDate time.Time - doesn't make sense, cleaner is a permanent periodic job, not something that takes an absolute time. We already have max span age parameter iirc, is it not sufficient to drive the behavior of the cleaner?

Was mostly referring to the time calculated before which we should delete indices. This calculation should actually be done based on a time.Duration i.e., spanMaxAge you are referring to.

Also, looks like rohan has already started working on it.

@akstron
Copy link
Contributor

akstron commented Dec 2, 2024

Also, shouldn't this be added as an extension?

@yurishkuro
Copy link
Member Author

it could be an extension but I think it's very closely coupled with the storage implementation that it makes more sense for it to be just an option in the config

@akstron
Copy link
Contributor

akstron commented Dec 3, 2024

Rollover bool - not clear. What's the alternative?

Just wanted to add this info. We won't need this config as well. Rollover indices can be treated as index-per-day pattern, where we just need to delete those based on the creationTime.
In conclusion, I believe maxAge should be more than enough for cleaning indices.

@akstron
Copy link
Contributor

akstron commented Dec 18, 2024

Hi @Rohanraj123 , let me know if you are working on it? I would like to take up merging cmd/es-index-cleaner/. Thanks

@Rohanraj123
Copy link
Contributor

Hi @Rohanraj123 , let me know if you are working on it? I would like to take up merging cmd/es-index-cleaner/. Thanks

yeah go ahead please!

@akstron
Copy link
Contributor

akstron commented Dec 23, 2024

Thanks, I will start working on it.

@akstron
Copy link
Contributor

akstron commented Dec 23, 2024

Hi @yurishkuro , I see that for cmd/es-index-cleaner, the IndicesClient wraps a simple http client underneath instead of wrapping the es go client library. Do you know if there is any particular reason for it? If not, should we refactor it first to use the es go client library?

@yurishkuro
Copy link
Member Author

ideally all logic should go through regular clients

@Manik2708
Copy link
Contributor

Manik2708 commented Dec 30, 2024

@yurishkuro I am little confused on the question, whether after this change, will we still be giving the cmds (For those who still want to keep these services in their control) or should we have to discard the cmd and move the code to pkg or a mixture of these two, keeping the cmd as it is but moving the code to a sharerd package for es and cmd?

@yurishkuro
Copy link
Member Author

@Manik2708 they should become sub-commands of jaeger binary, i.e. added to this list:

$ go run ./cmd/jaeger help
2024/12/30 11:06:11 application version: git-commit=, git-version=, build-date=
Jaeger backend v2

Usage:
  jaeger [flags]
  jaeger [command]

Available Commands:
  completion  Generate the autocompletion script for the specified shell
  components  Outputs available components in this collector distribution
  docs        Generates documentation
  help        Help about any command
  validate    Validates the config without running the collector
  version     Print the version.

@Manik2708
Copy link
Contributor

@yurishkuro Also is there any connection between RolloverFrequency and es-rollover?

@yurishkuro
Copy link
Member Author

@Manik2708 you have to check the code where/how it's used

@Manik2708
Copy link
Contributor

@yurishkuro Es Index Rollover runs for a single server, but while integrating it with Jaeger Binary we can have multiple servers. So should we always rollover run for all servers or we have to ask from user for through config file, like in this way

type Rollover struct {
...
// RunForAllServers enables rollover for all servers
RunForAllServers bool
// RunForServers enables rollover for the provided servers if RunForAllServers is false
RunForServers []string
}

@yurishkuro
Copy link
Member Author

@Manik2708 there are two ways

  1. Can we make the operation idempotent or no-op if already done? If we can then it's safe to run from all servers. This would be the preferred way. For example, Cassandra schema initialization works this way.
  2. If we must have a single server doing this, then we will have to use leader/follower election, for which we already have the code (it's only used with adaptive sampling, look for DistributedLock).

@Manik2708
Copy link
Contributor

@Manik2708 there are two ways

  1. Can we make the operation idempotent or no-op if already done? If we can then it's safe to run from all servers. This would be the preferred way. For example, Cassandra schema initialization works this way.
  2. If we must have a single server doing this, then we will have to use leader/follower election, for which we already have the code (it's only used with adaptive sampling, look for DistributedLock).

@yurishkuro As far I have understood from the code and by using it, all the operations of es-rollover cmd seems to be idempotent. Still I want to clear some doubts:

  1. Are all operations of es-rollover (cmd) idempotent?
  2. If yes then can we run it for all servers asynchronously? Like in this way:
for _, server := range configuration.Servers {
       go startEsRollover(configuration, server, logger)
}
  1. The init operation should be run only once (after setting jaeger) or everytime before running rollover?

@yurishkuro
Copy link
Member Author

  1. Are all operations of es-rollover (cmd) idempotent?

this is my question to you. Also the timing - if one instance initiates a rollover at :00min and then another at :05min, what's going to happen?

Why do you need to run goroutine per-server? There's only one cluster.

  1. init

As I understand init needs to execute only once, so you may need to add a check for it to verify if it's already been executed on a cluster. I would approach all of steps like that - check first, then run.

@Manik2708
Copy link
Contributor

@yurishkuro So as per my search:

  1. Init has already become idempotent by Make rollover idempotent #1407. It can also be seen in go:
    // check for reason, ignore already exist error
    if strings.Contains(errorMap["type"].(string), "resource_already_exists_exception") {
    return nil
    }
    }
    So we can call init everytime on calling rollover, init will always check and run (as said by your comment)
  2. Rollover: The operation in elastic search itself is idempotent that means Rollover doesn't happen on already rolled over data. So we can call rollover also everytime without taking care of idempotency. (This is also the reason no user using es-rollover faced the problem of idempotency as rollover has no affect on already rolled over data.
  3. Lookback: This only cleanes the data, so if it is already cleaned it will return without any error (logging that no index to that alias was present).
    The doubt which I am having is that: The official es client takes an array of servers, so this official client takes care of multiple nodes of the same cluster but the client used in es-rollover is a http client not an es-client. And as per my knowledge (please correct if I am wrong) in a cluster of es, communicating with one node is sufficient to make the change in cluster. The official client is for load balancing or following the leader/follower or any other similar algorithm. So I think using a single server with leader/follower election would be a good choice here rather than using all servers or running for all servers synchronously can also be an option with an e2e test with a cluster of 3 nodes.

@yurishkuro
Copy link
Member Author

yurishkuro commented Jan 6, 2025

correct, a command can be sent to any node.

@zzzk1
Copy link
Contributor

zzzk1 commented Jan 10, 2025

@yurishkuro For the esmapping-generator/, we should create a new command in cmd/internal/esmapping and reuse the logic from esmapping-generator/app in this command. After that, add internal/esmapping to all-in-one/main.go and jaeger/main.go using the following syntax: command.AddCommand(esmapping.Command()). Is that right?🤔

@yurishkuro
Copy link
Member Author

@zzzk1 yes. There was a PR #6327 for that, I just realized I never submitted my review for it.

@zzzk1
Copy link
Contributor

zzzk1 commented Jan 11, 2025

@Rohanraj123 Are you still working on esmapping-generator?

@Rohanraj123
Copy link
Contributor

Rohanraj123 commented Jan 11, 2025

@Rohanraj123 Are you still working on esmapping-generator?

Yes I was waiting for the review. I will (re)start working from today only.

@Manik2708
Copy link
Contributor

@yurishkuro This is what I am thinking of options of AutomatedRollover (requires UseILM and UseReadWriteAliases to be true).

type struct AutomatedRollover {
	Enabled bool
	PolicyName string // If user has created a policy and want to do rollover with that policy
	Conditons []string // Conditions like max_age etc.
	SkipDependencies bool // Whether user want to apply rollover on dependencies index or not
        DisableDeleteOperation bool // Whether user want to disable the delete oepration of automated rollover (by default it will be on)
}

And this as the flow:

Image

Please review this, if it looks good, I would start working on it!

@yurishkuro
Copy link
Member Author

@Manik2708

  • What is the official recommendation from ES when using ILM? Is the user expected to create read/write aliases, or is the policy doing that automatically?
  • I would treat policy name not as a signal for create/not-create, but just a default that the user can override, and Jaeger always checks if the policy exists and creates if it does not. I am worried about auto-creating the policy overall in terms of how much customization is required for that process - is the list of conditions completely sufficient to represent the ILM policy?

For your struct

  • I think enabled is unnecessary, we can treat the presence/absence of the config entry as the same bool.
  • I don't like fields SkipDependencies and DisableDeleteOperation, are they necessary?

@Manik2708
Copy link
Contributor

Manik2708 commented Jan 14, 2025

What is the official recommendation from ES when using ILM? Is the user expected to create read/write aliases, or is the policy doing that automatically?

As per my understanding it requires manually, please see this: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html#ilm-gs-alias-apply-policy It requires index.lifecycle.rollover_alias in the template.

I would treat policy name not as a signal for create/not-create, but just a default that the user can override, and Jaeger always checks if the policy exists and creates if it does not. I am worried about auto-creating the policy overall in terms of how much customization is required for that process - is the list of conditions completely sufficient to represent the ILM policy?

This idea seems great! We are concerned for only rollover action of policy which requires only condition (even if you will see the rollover part of e-rollover it is only asking for conditions) so I think conditions would be enough for automated rollover. Please see this: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-rollover.html#ilm-rollover-options

I think enabled is unnecessary, we can treat the presence/absence of the config entry as the same bool.

Could agree but have a question which is linked with the next point, if let's say SkipDependencies and DisableDeleteOperation are eliminated, what would then empty string array of Conditions would imply? I think in that case we have to keep this Enabled option.

I don't like fields SkipDependencies and DisableDeleteOperation, are they necessary?

For SkipDependencies, does that imply we should always rollover on dependency index?

For DisableDeleteOperation, delete is an action of ILM policy, which is similar to cleaning, should we always enable/disable it? Please see this for context of delete operation: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-delete.html

@yurishkuro
Copy link
Member Author

yurishkuro commented Jan 14, 2025

I think in that case we have to keep this Enabled option.

Your config could look like:

  backends:
    my_es:
      elasticsearch:
        ...
        index_maintenance:

I.e. even if this index_maintenance section is completely empty we can still tell that it exists or not. This is a common pattern in OTEL config, e.g. for enabling or not an HTTP endpoint.

Another high level question I have - do we even need to invest in custom ILM? Or should we be looking into data streams #4708, which may have different management process.

@Manik2708
Copy link
Contributor

I studied about data streams also, official docs suggest using ILM over data streams for rollover. https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html#data-streams-rollover

@Manik2708
Copy link
Contributor

@yurishkuro Then where should we head? ILM or data streams!

@yurishkuro
Copy link
Member Author

Then where should we head? ILM or data streams!

@Manik2708 my preference is data streams, but like you said they are not mutually exclusive.

@Manik2708
Copy link
Contributor

Ok, then let's try with data streams?

@Manik2708
Copy link
Contributor

@yurishkuro I feel a need of a helping PR for this issue (can be a prerequisite) which will be refactoring of e2e tests of Index Rollover and Cleaner to make them reusable for v2 e2e tests just like storage integration tests. This will help the contributors to test the functionality e2e and for reviewers it will make the job little easy as it will reduce the PR size (You can see the PR #6500 becoming very big due to this refactoring). Should I raise a PR or should it be included with the functionality itself?

@yurishkuro
Copy link
Member Author

smaller PRs are always better.

yurishkuro pushed a commit that referenced this issue Jan 23, 2025
<!--
!! Please DELETE this comment before posting.
We appreciate your contribution to the Jaeger project! 👋🎉
-->

## Which problem is this PR solving?
- part of #6283  

## Description of the changes
- Decoupled the Esmapping generation logic from
cmd/esmappnig-generator/main.go and kept it into reusable renderer
package and then reimplemented the subcommand in the
plugin/storage/es/esmapping-generator.go file and then registered in
jaeger/main.go.

## How was this change tested?
- go test ./... -count=1

## Checklist
- [X] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [X] I have signed all commits
- [ ] I have added unit tests for the new functionality
- [X] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: Rohanraj123 <rajrohan88293@gmail.com>
@yurishkuro
Copy link
Member Author

Per discussions elsewhere, we decided to limit this to esmappings command only, and ignore rollover so that it can be deprecated altogether in the future, as we switch towards support for data streams and ILM.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
area/storage good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement storage/elasticsearch v2
Projects
None yet
Development

No branches or pull requests

5 participants