Recording rule and adhoc query produce different (floating point) result #2951

Closed
StephanErb opened this issue Jul 14, 2017 · 10 comments

@StephanErb
Contributor

What did you do?
Define an aggregation rule for an expensive query.

job:azure_costs_eur:sum =
    sum(azure_costs_eur)

The underlying data looks something like this (simplified label set). Please note the large number of digits.

azure_costs_eur{resourcegroup="first-resource-group", category="vms"} 100.000049858328046624874
azure_costs_eur{resourcegroup="other-resource-group", category="network"} 0.00002994732437320617

The metric has lots of dimensions:

count(azure_costs_eur)
694

The metric has not changed within the last hour:

sum(increase(azure_costs_eur[1h]))
0

What did you expect to see?
The underlying metric consists of several very slow-moving counters. I therefore expect the aggregation rule and the ad-hoc query to produce the same results.

What did you see instead? Under which circumstances?

Plotting the ad-hoc query produces a flat line, whereas plotting the aggregated time series produces a non-linear one.

[screenshot: prom-adhoc]

[screenshot: prom-rule]

The actual difference is small but still noticeable.
[screenshot: prod-diff]

Looking at the data, this probably boils down to rounding errors in floating-point math. But why does the result differ between the recording rule and the ad-hoc query?

Environment

  • System information:

    Linux 3.16.0-4-amd64 x86_64

  • Prometheus version:

    prometheus, version 1.7.1 (branch: master, revision: 3afb3ff)
    build user: root@0aa1b7fc430d
    build date: 20170612-11:44:05
    go version: go1.8.3

  • Prometheus configuration file:

global:
  evaluation_interval: 15s

scrape_configs:
  - job_name: azure-cost
    scrape_interval: 1m
    scrape_timeout: 1m
    serverset_sd_configs:
    - servers:
      - 10.x.x.5:2181
      - 10.x.x.6:2181
      - 10.x.x.7:2181
      paths:
      - /path/to/exporter/
      timeout: 30s
    relabel_configs:
     ....
@brian-brazil
Contributor

Queries run on the second; recording rules can happen at any millisecond within the second.

@StephanErb
Contributor Author

The data has been scraped regularly but all values and dimensions have been constant over the last hour. All data comes from a single target.

Can it still cause trouble if the rule evaluation happens concurrently with the scraping/ingesting? Or what are you implying with the second/millisecond precision?

@brian-brazil
Contributor

Hmm, you haven't demonstrated that the value hasn't changed. Try "changes" rather than "increase".

@StephanErb
Contributor Author

Same result for changes. There are no changes in the data source

sum(changes(azure_costs_eur[1h]))
0

but several changes in the aggregate

changes(job:azure_costs_eur:sum[1h])
211

@brian-brazil
Contributor

Is the count() consistent?

@StephanErb
Contributor Author

Yes, the count is consistent.

I managed to track this down further. In the following examples I am in the query explorer of the UI and hit enter a couple of times in quick succession.

As expected, the aggregated result does not change as it all happens within one rule evaluation interval:

job:azure_costs_eur:sum
x.650125930708
x.650125930708
x.650125930708

After a rule evaluation interval has passed, I get a slightly different result even though the underlying data has not changed:

sum(changes(azure_costs_eur[10m]))
0

job:azure_costs_eur:sum
x.65012593069
x.65012593069
x.65012593069

If I do this for the non-aggregated series, I get a different result on each query evaluation:

sum(azure_costs_eur)
x.650125930697
x.650125930704
x.6501259307
x.650125930693

Sorting the result set beforehand gives a consistent result though, as we make the same floating-point error every time:

sum(sort(azure_costs_eur))
x.6501259307
x.6501259307
x.6501259307

Taken together, this would explain why we get a different rule evaluation result every time, and thus lots of changes in the resulting time series.

I believe it is fine that Prometheus makes these slight mistakes. I will have to correct my incoming data instead.
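
For reference, a minimal Go sketch of the effect described above (the values are illustrative, not the actual cost data): float64 addition is not associative, so summing the same values in a different order can give a result that differs in the last digit.

// order.go: the same three values summed in two different orders produce
// results that differ in the last digit, because float64 addition is not
// associative. With ~694 series spanning several orders of magnitude, each
// ordering accumulates its own rounding error, which is why
// sum(sort(azure_costs_eur)) is stable while sum(azure_costs_eur) is not.
package main

import "fmt"

func main() {
    a, b, c := 0.1, 0.2, 0.3

    fmt.Println((a + b) + c) // 0.6000000000000001
    fmt.Println((b + c) + a) // 0.6
}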

@brian-brazil
Contributor

This looks like normal floating point inaccuracy.

@StephanErb
Contributor Author

Thanks for your help!

(I will opt for the mailing list next time. It looked like a bug to me at first, which is why I jumped to the tracker.)

@juliusv
Member

juliusv commented Jul 16, 2017

@StephanErb In case you were still wondering why the result was stable for ad-hoc queries but not for recording rules: within a single range query, all the individual time resolution steps share the same ordering of the underlying time series, because the series are attached to the AST of the expression in a particular order during the query preparation phase and are then used in that order at every time step. Rules are individual instant queries executed at every rule evaluation cycle, so multiple rule evaluations don't share the same underlying series order.
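
A small Go sketch of that distinction (my illustration, not Prometheus code): a single ordering reused for every step gives a bit-identical sum at each step, while a fresh ordering per evaluation can round differently. Go's randomized map iteration order stands in for the varying series order; the label sets and values are made up.

// eval_order.go: simulate a range query (one ordering reused per step)
// versus repeated instant queries (a fresh ordering per evaluation).
package main

import "fmt"

func main() {
    // Hypothetical per-series values, keyed by their label set.
    series := map[string]float64{
        `{resourcegroup="a"}`: 100.000049858328,
        `{resourcegroup="b"}`: 0.00002994732437320617,
        `{resourcegroup="c"}`: 0.1,
        `{resourcegroup="d"}`: 0.2,
        `{resourcegroup="e"}`: 0.3,
    }

    sum := func(order []string) float64 {
        total := 0.0
        for _, k := range order {
            total += series[k]
        }
        return total
    }

    // "Range query": fix the ordering once and reuse it at every step.
    var fixed []string
    for k := range series {
        fixed = append(fixed, k)
    }
    for step := 0; step < 3; step++ {
        fmt.Printf("range step %d:      %.15f\n", step, sum(fixed)) // identical at every step
    }

    // "Rule evaluations": build a fresh ordering each time. Go randomizes
    // map iteration order, so the order (and hence the rounding) may differ.
    for eval := 0; eval < 3; eval++ {
        var fresh []string
        for k := range series {
            fresh = append(fresh, k)
        }
        fmt.Printf("rule evaluation %d: %.15f\n", eval, sum(fresh)) // last digits may wander
    }
}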

@lock
lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.