OpenCost metrics interfere with OpenShift's "degraded control plane" detection? #249

kastl-ars · 2025-01-23T13:40:53Z

Dear OpenCost maintainers,

since last week we have been seeing a degradation warning on our OpenShift cluster, as only 50% of the apiservers are reported as responding.

Turns out this seems to be related to metrics exposed by OpenCost, scraped by Prometheus and then returned by the query used for this degradation detection.

We have explicitly disabled the emission of pod annotations, namespace annotations and KSM v1 metrics, and the error vanished.

  opencost:
    metrics:
      serviceMonitor:
        enabled: true
      kubeStateMetrics:
        emitPodAnnotations: false
        emitNamespaceAnnotations: false
        emitKsmV1Metrics: false

The following lines appeared in the deployment:

        - name: EMIT_POD_ANNOTATIONS_METRIC
          value: 'false'
        - name: EMIT_NAMESPACE_ANNOTATIONS_METRIC
          value: 'false'
        - name: EMIT_KSM_V1_METRICS
          value: 'false'

I would like to see this added to the documentation that @mittal-ishaan was working on IIRC.

The query that went wrong was this:

count(kube_pod_labels{label_app="openshift-kube-apiserver", label_apiserver="true", namespace="openshift-kube-apiserver" })

Before we introduced the workaround described above, this returned 6 pods, while only three were really running. Hence the degradation warning as only 50% were working...
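
To illustrate how the count can double, here is a minimal Python sketch. It assumes that OpenCost re-emits `kube_pod_labels` series for the same pods under a different `job` label, so Prometheus ends up with two series per pod; that is a guess about the mechanism, not something verified here. The pod and job names are made up for illustration.

```python
# Minimal simulation of PromQL count() over duplicated series.
# Assumption: each pod appears once per scrape job; all names are hypothetical.

def make_series(pods, jobs):
    """Build one label set per (pod, job) pair, similar to stored series."""
    return [
        {
            "__name__": "kube_pod_labels",
            "namespace": "openshift-kube-apiserver",
            "label_app": "openshift-kube-apiserver",
            "label_apiserver": "true",
            "pod": pod,
            "job": job,
        }
        for pod in pods
        for job in jobs
    ]


def count(series, **matchers):
    """Count series whose labels equal every matcher, like count(metric{...})."""
    return sum(
        all(s.get(k) == v for k, v in matchers.items()) for s in series
    )


pods = ["kube-apiserver-master-0", "kube-apiserver-master-1", "kube-apiserver-master-2"]
selector = {
    "label_app": "openshift-kube-apiserver",
    "label_apiserver": "true",
    "namespace": "openshift-kube-apiserver",
}

only_ksm = make_series(pods, jobs=["kube-state-metrics"])
with_opencost = make_series(pods, jobs=["kube-state-metrics", "opencost"])

print(count(only_ksm, **selector))       # 3
print(count(with_opencost, **selector))  # 6
```

Because the selector in the query does not pin a `job`, both copies of each pod match, which would explain 6 instead of 3.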

Kind Regards,
Johannes

kastl-ars · 2025-01-24T08:11:48Z

Hmmm, this seems to completely break any cost calculation in OpenCost. After setting this, there are no more metrics visible. I enabled the emitNamespaceAnnotations again, let's see if this changes something...

kastl-ars · 2025-01-24T12:06:14Z

Even after re-enabling emitNamespaceAnnotations (by removing the line in values.yaml that set it to false) and, a little later, emitPodAnnotations, I can no longer see any costs in OpenCost for the last couple of hours. Removing the emitKsmV1Metrics: false setting makes OpenCost show values again almost instantaneously, but the OpenShift degradation warning comes back as well...

kastl-ars · 2025-01-29T11:44:06Z

As just stated in #252 I am not sure if this issue should rather go to the opencost repository, as it seems (to me, with the knowledge I have today...) like not just a problem of disabling some things on OpenShift, but a general problem of OpenCost not working on OpenShift without interfering with OpenShift itself?

mittal-ishaan · 2025-01-29T12:22:10Z

Hi @kastl-ars
I will look into this issue. Can you tell me the opencost version that you are using right now?

kastl-ars · 2025-01-29T12:40:00Z

Thank you! We are using the latest chart version 1.43.1.

kastl-ars · 2025-01-29T12:40:44Z

The more pressing issue would be #252 as a wrong CPU count sounds more problematic. But my guess is they are related...

mittal-ishaan · 2025-01-29T12:43:10Z

Thank you,
Sure. Let me check that too

kastl-ars · 2025-02-10T09:53:55Z

Any news on this?

mittal-ishaan · 2025-02-10T10:32:13Z

So sorry, did not get time for it this week. Checking right now

mittal-ishaan · 2025-02-10T12:05:43Z

I am able to reproduce this at my end.

count(kube_pod_labels{label_app="openshift-kube-apiserver", label_apiserver="true", namespace="openshift-kube-apiserver" })

returns 6 for me while there are only 3 API server pods, and I get the 50% API server degradation warning.
The workaround you suggested should be ideal in this case, because the opencost job will then no longer emit these metrics and the query will return the correct count of 3. I will look into why the cost is not visible when emitKsmV1Metrics is disabled.

mittal-ishaan · 2025-02-10T12:07:52Z

This also leads to the number of CPUs being doubled on the overview page, as stated in #252.

kastl-ars · 2025-02-10T12:26:47Z

Thanks for looking into this, glad you could reproduce this!

mittal-ishaan · 2025-02-10T19:07:50Z

Hi @kastl-ars

One thing to point out here as I was looking into this further is that setting

opencost:
  metrics:
    kubecostMetrics:
      emitKsmV1Metrics: false

does not break anything, and I am still able to view the cost in the UI. Can you confirm this if possible? I am not able to reproduce what you were experiencing here.

kastl-ars · 2025-02-11T06:03:41Z

Hi @mittal-ishaan

not sure if it is intentional, but you have a wrong key in your YAML snippet. According to the values.yaml, it should be kubeStateMetrics instead of kubecostMetrics.

opencost:
  metrics:
    kubeStateMetrics: # <-- here
      emitKsmV1Metrics: false

I would be surprised if with your snippet above the errors (apiservers degraded, number of CPUs wrong) were gone.
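
A mistyped key like this is easy to miss, because Helm generally accepts values keys that no template reads unless the chart ships a schema that rejects them. As a rough illustration, the following Python sketch flags keys under opencost.metrics that are not in an allow-list. The allow-list only contains keys that show up in this thread and is not the chart's full schema; the function name is hypothetical.

```python
# Sketch: flag keys under opencost.metrics that are not in an allow-list.
# Assumption: the allow-list holds only keys seen in this thread,
# not the complete set of keys the chart supports.

KNOWN_METRICS_KEYS = {"serviceMonitor", "kubeStateMetrics", "config"}


def unknown_metrics_keys(values):
    """Return keys under values['opencost']['metrics'] that are not known."""
    metrics = values.get("opencost", {}).get("metrics", {})
    return sorted(set(metrics) - KNOWN_METRICS_KEYS)


wrong = {"opencost": {"metrics": {"kubecostMetrics": {"emitKsmV1Metrics": False}}}}
right = {"opencost": {"metrics": {"kubeStateMetrics": {"emitKsmV1Metrics": False}}}}

print(unknown_metrics_keys(wrong))  # ['kubecostMetrics']
print(unknown_metrics_keys(right))  # []
```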

I have been trying this snippet in our cluster for a week. I get metrics (i.e. there are bars shown in OpenCost), but all of the values are zero. There is a total of $700, but all of it is from __idle__. None of the running workloads has a value; all of them show $0.

mittal-ishaan · 2025-02-11T06:13:46Z

Ahh my bad here. Got mixed up in two different things.
Thank you. will look into this

kastl-ars · 2025-02-18T10:07:57Z

Any news? Sorry for the hassle, but this is currently rendering OpenCost unusable on our OpenShift clusters...

mittal-ishaan · 2025-02-24T23:39:56Z

Hi @kastl-ars
so sorry for missing this.
I am trying to find a workaround here. Could you try the following helm values along with your custom values:

opencost:
  metrics:
    serviceMonitor:
      enabled: true
    kubeStateMetrics:
      emitKsmV1Metrics: true
    config:
      enabled: true
      disabledMetrics:
        - cluster:hyperthread_enabled_nodes
        - deployment_match_labels
        - kube_job_status_failed
        - kube_namespace_labels
        - kube_node_labels
        - kube_node_status_allocatable
        - kube_node_status_capacity
        - kube_persistentvolume_capacity_bytes
        - kube_persistentvolumeclaim_info
        - kube_persistentvolumeclaim_resource_requests_storage_bytes
        - kube_pod_container_resource_requests
        - kube_pod_container_status_terminated_reason
        - kube_pod_labels
        - kube_pod_owner
  prometheus:
    kubeRBACProxy: true
    createMonitoringClusterRoleBinding: true
    createMonitoringResourceReaderRoleBinding: true
    monitoringServiceAccountName: prometheus-k8s
    monitoringServiceAccountNamespace: openshift-monitoring
    external:
      enabled: true
      url: https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091
    internal:
      enabled: false
  exporter:
    extraVolumeMounts:
      - name: configs
        mountPath: /var/configs/metrics.json
        subPath: metrics.json
  ui:
    extraVolumeMounts:
      - name: empty-var-www
        mountPath: /var/www
      # - name: opencost-ui-nginx-config-volume
      #   mountPath: /etc/nginx/conf.d/default.nginx.conf
      #   subPath: default.nginx.conf

extraVolumes:
  - name: empty-var-www
    emptyDir: {}
  - name: configs
    configMap:
      name: custom-metrics
  # - name: opencost-ui-nginx-config-volume
  #   configMap:
  #     name: opencost-ui-nginx-config

Please check whether the costs are still missing with these values. On my side they seem to fix both the CPU count on the overview page and the degradation warnings. I am testing on my end too and will update in some time.
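
One way to sanity-check a disabledMetrics list against a query that OpenShift relies on is sketched below in Python. The list is copied from the values above and the query is the one from the original report. The regex is only a rough heuristic for spotting metric names (identifiers directly followed by `{`), not a PromQL parser, and the function names are hypothetical.

```python
import re

# Copied from the disabledMetrics list in the values above.
DISABLED_METRICS = [
    "cluster:hyperthread_enabled_nodes",
    "deployment_match_labels",
    "kube_job_status_failed",
    "kube_namespace_labels",
    "kube_node_labels",
    "kube_node_status_allocatable",
    "kube_node_status_capacity",
    "kube_persistentvolume_capacity_bytes",
    "kube_persistentvolumeclaim_info",
    "kube_persistentvolumeclaim_resource_requests_storage_bytes",
    "kube_pod_container_resource_requests",
    "kube_pod_container_status_terminated_reason",
    "kube_pod_labels",
    "kube_pod_owner",
]

QUERY = (
    'count(kube_pod_labels{label_app="openshift-kube-apiserver", '
    'label_apiserver="true", namespace="openshift-kube-apiserver" })'
)


def metric_names(query):
    """Heuristic: treat identifiers directly followed by '{' as metric names."""
    return set(re.findall(r"([a-zA-Z_:][a-zA-Z0-9_:]*)\s*\{", query))


def still_emitted(query, disabled):
    """Metrics used by the query that the disabled list does not cover."""
    return sorted(metric_names(query) - set(disabled))


print(still_emitted(QUERY, DISABLED_METRICS))  # []
print(still_emitted(QUERY, []))                # ['kube_pod_labels']
```

An empty result means every metric the query selects is in the disabled list, so OpenCost should no longer contribute series to it.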

kastl-ars · 2025-02-25T06:05:53Z

Thanks for digging into this, @mittal-ishaan. I will test this and report back.

kastl-ars · 2025-02-25T06:07:57Z

OK, I immediately see OpenCost displaying actual data for today after using the workaround. I'll check if the warnings and the wrong CPU count reappear over the course of today.

kastl-ars · 2025-02-26T06:23:55Z

Seems like the workaround works. I have not seen any errors or wrong CPU counts on 4 clusters, and so far all 4 clusters are reporting values in the OpenCost UI.

So, the question is, how to get this into the chart properly? And how to document this properly? I think the docs need an OpenShift section... :-(

mittal-ishaan · 2025-02-27T16:51:59Z

That sounds great, thank you @kastl-ars
100% agree on getting an OpenShift section into the OpenCost docs. I will try to get that in if I get some time.
To get this into the chart, I was thinking of adding a values-openshift.yaml file as an example values file, to show users the minimum configuration needed to make it work in an OpenShift environment. There is already a lot that the user has to configure for even the simplest OpenShift install.

kastl-ars · 2025-02-28T05:50:18Z

Having a values-openshift.yaml is a nice start, yes. Setting all of those things just by switching some openshift.enabled to true seems error prone and tedious. Hundreds of if conditions in dozens of templates...

kastl-ars · 2025-03-10T07:25:53Z

We have implemented this workaround in all of our OpenShift clusters, and so far the "degraded apiserver" error as well as the bogus CPU numbers are gone. So I daresay it works.

kastl-ars mentioned this issue Jan 29, 2025

OpenCost metrics interfere with OpenShift's CPU count for the cluster #252

Closed

mittal-ishaan marked this as a duplicate of #252 Feb 10, 2025

mittal-ishaan mentioned this issue Feb 10, 2025

fix: ksm metrics were getting duplicated in openshift env using in-cluster prometheus kubecost/cost-analyzer-helm-chart#3864

Merged
