OpenCost metrics interfere with OpenShift's "degraded control plane" detection? #249

Open
kastl-ars opened this issue Jan 23, 2025 · 23 comments

Comments

@kastl-ars
Contributor

Dear OpenCost maintainers,

since last week, our OpenShift cluster has been showing a degradation warning claiming that only 50% of the apiservers are responding.

Turns out this seems to be related to metrics exposed by OpenCost, scraped by Prometheus and then returned by the query used for this degradation detection.

We have explicitly disabled the emission of pod annotations, namespace annotations and KSM v1 metrics, and the error vanished.

  opencost:
    metrics:
      serviceMonitor:
        enabled: true
      kubeStateMetrics:
        emitPodAnnotations: false
        emitNamespaceAnnotations: false
        emitKsmV1Metrics: false

The following lines appeared in the deployment:

        - name: EMIT_POD_ANNOTATIONS_METRIC
          value: 'false'
        - name: EMIT_NAMESPACE_ANNOTATIONS_METRIC
          value: 'false'
        - name: EMIT_KSM_V1_METRICS
          value: 'false'

I would like to see this added to the documentation that @mittal-ishaan was working on IIRC.

The query that went wrong was this:

count(kube_pod_labels{label_app="openshift-kube-apiserver", label_apiserver="true", namespace="openshift-kube-apiserver" })

Before we introduced the workaround described above, this returned 6 pods, while only three were really running. Hence the degradation warning as only 50% were working...

Kind Regards,
Johannes
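
The doubling described above can be reproduced with a small model. This is only a sketch of the suspected mechanism, assuming the same `kube_pod_labels` series are exposed once by kube-state-metrics and once by OpenCost; the pod names and job names below are hypothetical, not taken from a real cluster.

```python
# Hypothetical series, modelled on the situation in this issue: the same three
# apiserver pods are exposed by two scrape jobs, so six series are stored.
PODS = ["kube-apiserver-master-0", "kube-apiserver-master-1", "kube-apiserver-master-2"]
JOBS = ["kube-state-metrics", "opencost"]  # job names are assumptions

series = [
    {
        "__name__": "kube_pod_labels",
        "namespace": "openshift-kube-apiserver",
        "pod": pod,
        "label_app": "openshift-kube-apiserver",
        "label_apiserver": "true",
        "job": job,
    }
    for job in JOBS
    for pod in PODS
]


def count_series(series):
    """Mimics count(...): every matching series counts once."""
    return len(series)


def count_distinct_pods(series):
    """Mimics count(count by (namespace, pod) (...)): one per pod."""
    return len({(s["namespace"], s["pod"]) for s in series})


print(count_series(series))         # 6
print(count_distinct_pods(series))  # 3
```

A plain `count(...)` sees every series, so a second exporter of the same metric name inflates the result, while grouping by namespace and pod first would collapse the duplicates.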

@kastl-ars
Contributor Author

Hmmm, this seems to completely break any cost calculation in OpenCost. After setting this, there are no more metrics visible. I enabled the emitNamespaceAnnotations again, let's see if this changes something...

@kastl-ars
Contributor Author

kastl-ars commented Jan 24, 2025

Even after re-enabling emitNamespaceAnnotations (by removing the line that set the attribute to false in values.yaml), and emitPodAnnotations a little later, I can no longer see any costs in OpenCost for the last couple of hours. Removing the line that disables emitKsmV1Metrics makes OpenCost show values again almost instantaneously, but the OpenShift degradation warning is back as well...

@kastl-ars
Contributor Author

As just stated in #252, I am not sure whether this issue should rather go to the opencost repository. It seems (to me, with the knowledge I have today...) to be not just a matter of disabling some things on OpenShift, but a general problem of OpenCost not working on OpenShift without interfering with OpenShift itself.

@mittal-ishaan
Contributor

Hi @kastl-ars
I will look into this issue. Can you tell me the opencost version that you are using right now?

@kastl-ars
Contributor Author

Thank you! We are using the latest chart version 1.43.1.

@kastl-ars
Contributor Author

The more pressing issue would be #252 as a wrong CPU count sounds more problematic. But my guess is they are related...

@mittal-ishaan
Contributor

Thank you,
Sure. Let me check that too

@kastl-ars
Contributor Author

Any news on this?

@mittal-ishaan
Contributor

So sorry, did not get time for it this week. Checking right now

@mittal-ishaan
Contributor

I am able to reproduce this at my end.

count(kube_pod_labels{label_app="openshift-kube-apiserver", label_apiserver="true", namespace="openshift-kube-apiserver" })

is giving me a value of 6 while there are only 3 pods, and I get the 50% API server degradation warning.
The workaround you suggested should be ideal in this case, as the opencost job will then no longer be pushing these metrics and we would get the correct count of 3. I will look into why the cost is not visible when disabling emitKsmV1Metrics.
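
To see which scrape job contributes the extra series, the same selector can be grouped by the `job` label. The following is a rough sketch against the Prometheus HTTP API, not something taken from this thread: the base URL is a placeholder, the response body is a hypothetical sample, and authentication (which OpenShift's monitoring stack normally requires) is left out.

```python
import json
from urllib.parse import urlencode

# Same selector as the query above, grouped by the scrape job.
QUERY = (
    'count by (job) (kube_pod_labels{label_app="openshift-kube-apiserver", '
    'label_apiserver="true", namespace="openshift-kube-apiserver"})'
)


def build_query_url(base_url, query):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def counts_per_job(response_body):
    """Map each job label to its series count from an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        sample["metric"].get("job", "<no job>"): int(float(sample["value"][1]))
        for sample in payload["data"]["result"]
    }


# Hypothetical response body, shaped like a Prometheus instant-query result.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "kube-state-metrics"}, "value": [1739180000, "3"]},
            {"metric": {"job": "opencost"}, "value": [1739180000, "3"]},
        ],
    },
})

print(build_query_url("https://prometheus.example", QUERY))  # placeholder host
print(counts_per_job(SAMPLE))  # {'kube-state-metrics': 3, 'opencost': 3}
```

If the real result shows two jobs with three series each, that would support the reading that the OpenCost target duplicates what kube-state-metrics already exports.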

@mittal-ishaan
Contributor

This also leads to the number of CPUs being doubled on the overview page, as stated in #252.

@mittal-ishaan marked this as a duplicate of #252 on Feb 10, 2025

@kastl-ars
Contributor Author

Thanks for looking into this, glad you could reproduce this!

@mittal-ishaan
Contributor

Hi @kastl-ars

One thing to point out here as I was looking into this further is that setting

opencost:
  metrics:
    kubecostMetrics:
      emitKsmV1Metrics: false

does not break anything and I am still able to view the cost in the UI. Can you confirm this if possible? I am not able to reproduce what you were experiencing here

@kastl-ars
Contributor Author

Hi @mittal-ishaan

not sure if it is intentional, but you have a wrong key in your YAML snippet. According to the values.yaml, it should be kubeStateMetrics instead of kubecostMetrics.

opencost:
  metrics:
    kubeStateMetrics: # <-- here
      emitKsmV1Metrics: false

I would be surprised if with your snippet above the errors (apiservers degraded, number of CPUs wrong) were gone.

I have been trying this snippet in our cluster for a week, I get metrics (i.e. there are bars shown in OpenCost), but all of the values are just zero. There is a total of 700$, but all of it is from __idle__. There are no values for any of the workloads running, all of them just show 0$.
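
A mistyped key like the one above is easy to miss because, as far as I know, Helm does not reject unknown values keys unless the chart ships a values.schema.json. A small check against the chart's default values can flag such keys. This is a sketch only: `CHART_DEFAULTS` is a hypothetical subset built from keys named in this thread, not the real values.yaml.

```python
# Hypothetical subset of the chart defaults, using only keys named in this
# thread. The None values are placeholders, not the chart's real defaults.
CHART_DEFAULTS = {
    "opencost": {
        "metrics": {
            "serviceMonitor": {"enabled": None},
            "kubeStateMetrics": {
                "emitPodAnnotations": None,
                "emitNamespaceAnnotations": None,
                "emitKsmV1Metrics": None,
            },
        }
    }
}


def unknown_keys(user_values, defaults, path=""):
    """Return dotted paths in user_values that have no counterpart in defaults."""
    found = []
    for key, value in user_values.items():
        dotted = f"{path}.{key}" if path else key
        if key not in defaults:
            found.append(dotted)
        elif isinstance(value, dict) and isinstance(defaults[key], dict):
            found.extend(unknown_keys(value, defaults[key], dotted))
    return found


mistyped = {"opencost": {"metrics": {"kubecostMetrics": {"emitKsmV1Metrics": False}}}}
corrected = {"opencost": {"metrics": {"kubeStateMetrics": {"emitKsmV1Metrics": False}}}}

print(unknown_keys(mistyped, CHART_DEFAULTS))   # ['opencost.metrics.kubecostMetrics']
print(unknown_keys(corrected, CHART_DEFAULTS))  # []
```

In practice the defaults would be loaded from the chart's own values.yaml rather than written by hand.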

@mittal-ishaan
Contributor

Ahh my bad here. Got mixed up in two different things.
Thank you. will look into this

@kastl-ars
Contributor Author

Any news? Sorry for the hassle, but this is currently rendering OpenCost unusable on our OpenShift clusters...

@mittal-ishaan
Contributor

Hi @kastl-ars
so sorry for missing this.
I am trying to find a workaround here. Could you try the following helm values along with your custom values:

opencost:
  metrics:
    serviceMonitor:
      enabled: true
    kubeStateMetrics:
      emitKsmV1Metrics: true
    config:
      enabled: true
      disabledMetrics:
        - cluster:hyperthread_enabled_nodes
        - deployment_match_labels
        - kube_job_status_failed
        - kube_namespace_labels
        - kube_node_labels
        - kube_node_status_allocatable
        - kube_node_status_capacity
        - kube_persistentvolume_capacity_bytes
        - kube_persistentvolumeclaim_info
        - kube_persistentvolumeclaim_resource_requests_storage_bytes
        - kube_pod_container_resource_requests
        - kube_pod_container_status_terminated_reason
        - kube_pod_labels
        - kube_pod_owner
  prometheus:
    kubeRBACProxy: true
    createMonitoringClusterRoleBinding: true
    createMonitoringResourceReaderRoleBinding: true
    monitoringServiceAccountName: prometheus-k8s
    monitoringServiceAccountNamespace: openshift-monitoring
    external:
      enabled: true
      url: https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091
    internal:
      enabled: false
  exporter:
    extraVolumeMounts:
      - name: configs
        mountPath: /var/configs/metrics.json
        subPath: metrics.json
  ui:
    extraVolumeMounts:
      - name: empty-var-www
        mountPath: /var/www
      # - name: opencost-ui-nginx-config-volume
      #   mountPath: /etc/nginx/conf.d/default.nginx.conf
      #   subPath: default.nginx.conf

extraVolumes:
  - name: empty-var-www
    emptyDir: {}
  - name: configs
    configMap:
      name: custom-metrics
  # - name: opencost-ui-nginx-config-volume
  #   configMap:
  #     name: opencost-ui-nginx-config

and see whether you are still unable to get the cost? For me this seems to fix the CPU count on the overview page and the degraded warnings. I am testing on my end too and will update in some time.

@kastl-ars
Contributor Author

Thanks for digging into this, @mittal-ishaan. I will test this and report back.

@kastl-ars
Contributor Author

OK, I immediately see OpenCost displaying actual data for today after using the workaround. I'll check if the warnings and the wrong CPU count reappear over the course of today.

@kastl-ars
Contributor Author

Seems like the workaround works. I have not seen any errors or wrong CPU counts on 4 clusters. And thus far all 4 clusters are reporting values in the OpenCost UI.

So, the question is, how to get this into the chart properly? And how to document this properly? I think the docs need an OpenShift section... :-(

@mittal-ishaan
Contributor

That sounds great, thank you @kastl-ars
100% agree on getting an OpenShift section in the opencost docs. I will try to get that in if I get some time.
To get this into the chart, I was thinking of adding a values-openshift.yaml file as an example values file, to tell users that they need at least these configurations to make it work in an OpenShift environment. There is already a lot that the user has to configure for even the simplest OpenShift install.

@kastl-ars
Contributor Author

Having a values-openshift.yaml is a nice start, yes. Setting all of those things just by switching some openshift.enabled to true seems error prone and tedious. Hundreds of if conditions in dozens of templates...
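
An overlay file works because Helm layers values files: with several `-f` files, later ones take precedence, nested maps are merged key by key, and lists are replaced wholesale. The sketch below mimics that layering on plain dictionaries to show how a values-openshift.yaml could sit on top of site-specific values. The `site_values` content is hypothetical, the OpenShift excerpt is shortened from the workaround in this thread, and Helm's special handling of `null` (deleting a key) is not modelled.

```python
import copy


def merge_values(base, overlay):
    """Merge overlay into a copy of base: maps recurse, everything else is replaced."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical site-specific values.
site_values = {
    "opencost": {
        "metrics": {"serviceMonitor": {"enabled": True}},
        "prometheus": {"internal": {"enabled": True}},
    }
}

# Shortened excerpt of the workaround, as it might appear in values-openshift.yaml.
openshift_values = {
    "opencost": {
        "metrics": {
            "kubeStateMetrics": {"emitKsmV1Metrics": True},
            "config": {
                "enabled": True,
                "disabledMetrics": ["kube_pod_labels", "kube_pod_owner"],
            },
        },
        "prometheus": {"internal": {"enabled": False}, "external": {"enabled": True}},
    }
}

effective = merge_values(site_values, openshift_values)
print(effective["opencost"]["prometheus"]["internal"]["enabled"])     # False
print(effective["opencost"]["metrics"]["serviceMonitor"]["enabled"])  # True
```

The overlay only states what differs on OpenShift, which avoids the many template conditionals an `openshift.enabled` switch would need.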

@kastl-ars
Contributor Author

We have implemented this workaround in all of our OpenShift clusters, and so far the "degraded apiserver" error as well as the bogus CPU numbers are gone. So I daresay it works.
