Recording rule and adhoc query produce different (floating point) result #2951

Closed
StephanErb opened this issue Jul 14, 2017 · 10 comments

@StephanErb
Contributor

What did you do?
Define an aggregation rule for an expensive query.

job:azure_costs_eur:sum =
    sum(azure_costs_eur)

The underlying data looks something like this (simplified label set). Please note the large number of digits.

azure_costs_eur{resourcegroup="first-resource-group", category="vms"} 100.000049858328046624874
azure_costs_eur{resourcegroup="other-resource-group", category="network"} 0.00002994732437320617

The metric has lots of dimensions:

count(azure_costs_eur)
694

The metric has not changed within the last hour:

sum(increase(azure_costs_eur[1h]))
0

What did you expect to see?
The underlying metric consists of several very slow-moving counters. I therefore expect the aggregation rule and the ad-hoc query to produce the same results.

What did you see instead? Under which circumstances?

Plotting the ad-hoc query produces a flat line, whereas plotting the aggregated time series produces a non-linear one.

[screenshot: prom-adhoc]

[screenshot: prom-rule]

The actual difference is small but still noticeable.
[screenshot: prod-diff]

Looking at the data, this probably boils down to rounding errors in floating-point math. But why does the result differ between the recording rule and the ad-hoc query?

Environment

  • System information:

    Linux 3.16.0-4-amd64 x86_64

  • Prometheus version:

    prometheus, version 1.7.1 (branch: master, revision: 3afb3ff)
    build user: root@0aa1b7fc430d
    build date: 20170612-11:44:05
    go version: go1.8.3

  • Prometheus configuration file:

global:
  evaluation_interval: 15s

scrape_configs:
  - job_name: azure-cost
    scrape_interval: 1m
    scrape_timeout: 1m
    serverset_sd_configs:
    - servers:
      - 10.x.x.5:2181
      - 10.x.x.6:2181
      - 10.x.x.7:2181
      paths:
      - /path/to/exporter/
      timeout: 30s
    relabel_configs:
     ....
@brian-brazil
Contributor

Queries run on the second; recording rules can happen at any millisecond within the second.

@StephanErb
Contributor Author

The data has been scraped regularly but all values and dimensions have been constant over the last hour. All data comes from a single target.

Can it still cause trouble if the rule evaluation happens concurrently with the scraping/ingesting? Or what are you implying with the second/millisecond precision?

@brian-brazil
Contributor

Hmm, you haven't demonstrated that the value hasn't changed. Try "changes" rather than "increase".

@StephanErb
Contributor Author

Same result for changes. There are no changes in the data source

sum(changes(azure_costs_eur[1h]))
0

but several changes in the aggregate

changes(job:azure_costs_eur:sum[1h])
211

@brian-brazil
Contributor

Is the count() consistent?

@StephanErb
Contributor Author

Yes, the count is consistent.

I managed to track this down further. In the following examples I am in the query explorer of the UI and hit enter a couple of times in quick succession.

As expected, the aggregated result does not change as it all happens within one rule evaluation interval:

job:azure_costs_eur:sum
x.650125930708
x.650125930708
x.650125930708

After a rule evaluation interval has passed, I get a slightly different result even though the underlying data has not changed:

sum(changes(azure_costs_eur[10m]))
0

job:azure_costs_eur:sum
x.65012593069
x.65012593069
x.65012593069

If I do this for the non-aggregated series, I get a different result on each query evaluation:

sum(azure_costs_eur)
x.650125930697
x.650125930704
x.6501259307
x.650125930693

Sorting the result set beforehand gives a consistent result though, as we make the same floating-point error every time:

sum(sort(azure_costs_eur))
x.6501259307
x.6501259307
x.6501259307

Taken together, this would explain why we get a different rule evaluation result every time, and thus lots of changes in the resulting time series.

I believe it is fine that Prometheus makes these slight mistakes. I will have to correct my incoming data instead.
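
For reference, a minimal Go sketch of the effect described above (the values are illustrative, not the actual cost data): float64 addition is not associative, so summing the same values in a different order can give a result that differs in the last digit.

// order.go: the same three values summed in two different orders produce
// results that differ in the last digit, because float64 addition is not
// associative. With ~694 series spanning several orders of magnitude, each
// ordering accumulates its own rounding error, which is why
// sum(sort(azure_costs_eur)) is stable while sum(azure_costs_eur) is not.
package main

import "fmt"

func main() {
    a, b, c := 0.1, 0.2, 0.3

    fmt.Println((a + b) + c) // 0.6000000000000001
    fmt.Println((b + c) + a) // 0.6
}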

@brian-brazil
Contributor

This looks like normal floating point inaccuracy.

@StephanErb
Contributor Author

Thanks for your help!

(I will opt for the mailing list next time. It looked like a bug to me at first, which is why I jumped to the tracker.)

@juliusv
Member

juliusv commented Jul 16, 2017

@StephanErb In case you were still wondering why the result was stable for ad-hoc queries but not for recording rules: within a single range query, all the individual time resolution steps share the same ordering of the underlying time series, because the series are attached to the AST of the expression in a particular order during the query preparation phase and are then used in that order at every time step. Rules are individual instant queries executed at every rule evaluation cycle, so multiple rule evaluations don't share the same underlying series order.
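
A small Go sketch of that distinction (my illustration, not Prometheus code): a single ordering reused for every step gives a bit-identical sum at each step, while a fresh ordering per evaluation can round differently. Go's randomized map iteration order stands in for the varying series order; the label sets and values are made up.

// eval_order.go: simulate a range query (one ordering reused per step)
// versus repeated instant queries (a fresh ordering per evaluation).
package main

import "fmt"

func main() {
    // Hypothetical per-series values, keyed by their label set.
    series := map[string]float64{
        `{resourcegroup="a"}`: 100.000049858328,
        `{resourcegroup="b"}`: 0.00002994732437320617,
        `{resourcegroup="c"}`: 0.1,
        `{resourcegroup="d"}`: 0.2,
        `{resourcegroup="e"}`: 0.3,
    }

    sum := func(order []string) float64 {
        total := 0.0
        for _, k := range order {
            total += series[k]
        }
        return total
    }

    // "Range query": fix the ordering once and reuse it at every step.
    var fixed []string
    for k := range series {
        fixed = append(fixed, k)
    }
    for step := 0; step < 3; step++ {
        fmt.Printf("range step %d:      %.15f\n", step, sum(fixed)) // identical at every step
    }

    // "Rule evaluations": build a fresh ordering each time. Go randomizes
    // map iteration order, so the order (and hence the rounding) may differ.
    for eval := 0; eval < 3; eval++ {
        var fresh []string
        for k := range series {
            fresh = append(fresh, k)
        }
        fmt.Printf("rule evaluation %d: %.15f\n", eval, sum(fresh)) // last digits may wander
    }
}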

@lock
lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.