Add Alternator TTL metrics to Alternator dashboard #1783

Closed
nyh opened this issue Aug 24, 2022 · 9 comments
Labels
area/alternator Alternator related Issues enhancement New feature or request

Comments

@nyh
Contributor

nyh commented Aug 24, 2022

In Scylla commit scylladb/scylladb@c262309 we added four metrics for the new Alternator TTL feature. The Alternator TTL feature runs background threads which look for expired items and delete them, and these metrics can be used to see that these threads have indeed been running, how often they scanned the table, how many items got deleted, etc. (the commit message linked above contains a longer description of each metric).

We should probably (?) add an Alternator TTL tab in the Alternator dashboard, with these metrics.

Please note that Alternator TTL is currently an "experimental" feature, so in the default case, all these metrics will be zero. I don't know how this experimental-ness should, or should not, affect the design of the monitoring dashboard.

@nyh nyh added enhancement New feature or request area/alternator Alternator related Issues labels Aug 24, 2022
@amnonh
Collaborator

amnonh commented Aug 24, 2022

I tried to find those metrics with 5.1 and 2022.2 and couldn't. Is there something specific I should do?
Also, what should the user look for in those metrics?

@nyh
Contributor Author

nyh commented Aug 24, 2022

As you can see in the above-linked commit, it did reach 5.1.
However, all the metrics are zero unless you:

  1. Enable the experimental feature (--experimental-features=alternator-ttl) and if you want to see a lot of activity quickly, increase its frequency (--alternator-ttl-period-in-seconds=1).
  2. Create an Alternator table with TTL enabled. The easiest way to do this without learning to use the DynamoDB API is just to run an existing test. For example, cd test/alternator; pytest test_ttl.py will connect to Scylla running on this machine (add the --url option to pytest to tell it Scylla is running elsewhere). This test will, among other things, create Scylla tables with TTL enabled and cause a bunch of interesting TTL activity to take place.
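For readers who would rather enable TTL on a table by hand than run the test suite, the two steps above roughly correspond to one UpdateTimeToLive request plus items carrying an expiration attribute. The following is only a minimal sketch of the request payloads, assuming Alternator follows the DynamoDB convention that the TTL attribute is a number holding the expiry time in seconds since the epoch. The table name, the key attribute "p", and the attribute name "expiration" are hypothetical, and nothing here actually sends a request.

```python
import json
import time


def ttl_request_body(table, attribute):
    """Body of a DynamoDB-style UpdateTimeToLive request (built, not sent)."""
    return {
        "TableName": table,
        "TimeToLiveSpecification": {"Enabled": True, "AttributeName": attribute},
    }


def expiring_item(key, attribute, seconds_from_now, now=None):
    """Item in DynamoDB JSON whose TTL attribute expires `seconds_from_now` from `now`.

    The key attribute name "p" is a hypothetical choice for this sketch.
    """
    now = time.time() if now is None else now
    # DynamoDB JSON encodes numbers as strings under the "N" type tag.
    return {"p": {"S": key}, attribute: {"N": str(int(now) + seconds_from_now)}}


body = ttl_request_body("ttl_demo", "expiration")
item = expiring_item("k1", "expiration", 5, now=1000.9)
print(json.dumps(body))
print(json.dumps(item))
```

With --alternator-ttl-period-in-seconds=1, an item written this way should be picked up by a scan shortly after its expiration time passes.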

@amnonh
Collaborator

amnonh commented Aug 24, 2022

@nyh my point was the metrics were missing, not that they were zero.
I'll follow your instructions and see what I get.

Please see my other comment about the user perspective. The main question is: will it be helpful, and how?
What should the user look for in these graphs, and what should the user do based on that?

@nyh
Contributor Author

nyh commented Aug 24, 2022

> @nyh my point was the metrics were missing, not that they were zero.

You're right. I checked, and today, for the "expiration service" to start at all, you need 1. Alternator to be enabled (alternator port configured) and 2. the TTL experimental feature to be turned on. If either of these isn't on, the "expiration service" is never started, and it never registers these Alternator TTL metrics.

Is this a problem? I was under the assumption that a missing metric is basically the same thing as a zero metric - especially after your recent patch which (if I remember correctly) drops zero metrics from the output.

> Please see my other comment about the user perspective. The main question is: will it be helpful, and how? What should the user look for in these graphs, and what should the user do based on that?

That's a good question. Here is what I think:

  • scylla_expiration_items_deleted is probably the most interesting - it counts the number of items actually deleted by TTL.
  • The counters scylla_expiration_scan_passes and scylla_expiration_scan_table can be used to see that the TTL feature is scanning (so it's enabled, even if it hasn't deleted anything so far), but to tell the truth they aren't very exciting metrics.
  • scylla_expiration_secondary_ranges_scanned is probably the least interesting metric - it counts the number of times that one node was down so another node took over its scanning work. It's good for verification in tests, but probably not useful to end users.

Maybe the scylla_expiration_items_deleted can be added as a single metric (not an entire tab) similar to other operations like DeleteItem. However, it's different in that it's not an actual API request, it's an internal decision to delete the item.
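To make the reading of the four metrics above concrete, here is a minimal sketch of how a script might turn their totals into a one-line status. The function name and the status strings are hypothetical; the metric names are the ones listed above, and a missing metric is treated the same as a zero one, as assumed earlier in this comment.

```python
TTL_METRICS = (
    "scylla_expiration_scan_passes",
    "scylla_expiration_scan_table",
    "scylla_expiration_items_deleted",
    "scylla_expiration_secondary_ranges_scanned",
)


def describe_ttl_state(totals):
    """Summarize Alternator TTL activity from a {metric name: total} mapping."""
    if not any(name in totals for name in TTL_METRICS):
        # The expiration service never started, so nothing was registered.
        return "no TTL metrics registered"
    # A metric that is absent is treated as zero.
    value = {name: totals.get(name, 0) for name in TTL_METRICS}
    scanning = (value["scylla_expiration_scan_passes"] > 0
                or value["scylla_expiration_scan_table"] > 0)
    if not scanning:
        return "registered, but no scans yet"
    if value["scylla_expiration_items_deleted"] == 0:
        state = "scanning, nothing deleted so far"
    else:
        state = "scanning and deleting expired items"
    if value["scylla_expiration_secondary_ranges_scanned"] > 0:
        state += " (took over scanning work from a down node)"
    return state


print(describe_ttl_state({"scylla_expiration_scan_passes": 2364}))
```

For a dashboard, this ordering suggests showing scylla_expiration_items_deleted first and keeping the scan counters as a secondary "is it running" signal.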

@amnonh
Collaborator

amnonh commented Aug 24, 2022

Most of the time it's fine not to report counters that are never used.
Especially around Alternator, it would be best to report these metrics only if Alternator is enabled.

After enabling and running the test I got:
scylla_expiration_scan_passes{shard="0"} 2364
The rest of the metrics are zero.
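For anyone repeating this check, the per-shard lines can be summed with a few lines of Python. This is a rough sketch that only parses metrics text already fetched by some other means. The helper name is hypothetical, the regular expression handles just the simple "name{labels} value" form shown above, and the shard="1" and scylla_reactor_utilization sample lines are made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as: scylla_expiration_scan_passes{shard="0"} 2364
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)'
)


def sum_expiration_metrics(text, prefix="scylla_expiration_"):
    """Sum every metric whose name starts with `prefix` across all shards."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match and match.group("name").startswith(prefix):
            totals[match.group("name")] += float(match.group("value"))
    return dict(totals)


# Only the shard="0" line comes from the output above; the rest is illustrative.
sample = '''# TYPE scylla_expiration_scan_passes counter
scylla_expiration_scan_passes{shard="0"} 2364
scylla_expiration_scan_passes{shard="1"} 10
scylla_reactor_utilization{shard="0"} 3.5
'''
totals = sum_expiration_metrics(sample)
print(totals)
```

Metrics that are absent from the text simply do not appear in the result, so callers can use totals.get(name, 0) to treat missing and zero alike.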

My option to remove empty counters is applied explicitly, but the idea is the same: don't report what is not needed.

@nyh
Contributor Author

nyh commented Aug 24, 2022

> After enabling and running the test I got: scylla_expiration_scan_passes{shard="0"} 2364. The rest of the metrics are zero.

Two of the other metrics, scylla_expiration_items_deleted and scylla_expiration_scan_table, should also be non-zero on at least one shard... If these two metrics are still zero, it might mean you didn't run the test against the same Alternator you are querying for metrics (make sure you ran "pytest", NOT test/cql-pytest/run, because the latter starts a new Alternator!), or that I have a bug in the metrics! I'll test this myself too; maybe you found a bug.

@amnonh
Collaborator

amnonh commented Aug 24, 2022

This is how I run it: pytest test_ttl.py --url http://172.17.0.2:8000

There's only one alternator

@nyh
Contributor Author

nyh commented Aug 25, 2022

@amnonh I know what happened :-) The tests in test_ttl.py are all very slow, so they are skipped by default. You need to add the "--runveryslow" option to pytest to actually run those tests :-)

I just wrote a test that verifies that these two metrics actually work when an item expires. The new test takes around one second, so I think I'll put it in, and I'll also consider reducing the TTL scanning period to below one second to make these tests even faster. I'll open an issue about these tests being skipped.
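The general shape of such a metrics test is "record the counter, trigger an expiration, then poll until the counter grows". The following is a generic sketch of that polling pattern only, not the actual test in the Scylla repository. The helper name wait_for_increase and the fake counter are hypothetical; a real test would read scylla_expiration_items_deleted instead.

```python
import time


def wait_for_increase(read_counter, baseline, timeout=5.0, interval=0.1):
    """Poll `read_counter()` until it exceeds `baseline`, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        value = read_counter()
        if value > baseline:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"counter stayed at {value} for {timeout} seconds")
        time.sleep(interval)


# Stand-in for reading the metric: stays at 0 twice, then reports 3.
fake_readings = iter([0, 0, 3])
result = wait_for_increase(lambda: next(fake_readings), baseline=0,
                           timeout=1.0, interval=0.01)
print(result)
```

A shorter scanning period mainly shortens the time spent inside this loop, which is why it makes the tests faster.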

@amnonh amnonh added this to the monitoring 4.1 milestone Aug 25, 2022
@amnonh
Collaborator

amnonh commented Aug 25, 2022

Fixed by #1782

@amnonh amnonh closed this as completed Aug 25, 2022
psarna added a commit to scylladb/scylladb that referenced this issue Sep 22, 2022
…metrics)' from Nadav Har'El

We had quite a few tests for Alternator TTL in test/alternator, but most
of them did not run as part of the usual Jenkins test suite, because
they were considered "very slow" (and require a special "--runveryslow"
flag to run).

In this series we enable six tests which run quickly enough to run by
default, without an additional flag. We also make them even quicker -
the six tests now take around 2.5 seconds.

I also noticed that we don't have a test for the Alternator TTL metrics
- and added one.

Fixes #11374.
Refs scylladb/scylla-monitoring#1783

Closes #11384

* github.com:scylladb/scylladb:
  test/alternator: insert test names into Scylla logs
  rest api: add a new /system/log operation
  alternator ttl: log warning if scan took too long.
  alternator,ttl: allow sub-second TTL scanning period, for tests
  test/alternator: skip fewer Alternator TTL tests
  test/alternator: test Alternator TTL metrics
@amnonh amnonh modified the milestones: monitoring 4.1, Monitoring 4.1 Nov 2, 2022