Deviation from expected behavior:
The metadata.labels for the ServiceMonitor rook-ceph-mgr are being reset inconsistently. We have enabled monitoring for the CephCluster and configured labels accordingly. While the labels are consistently applied to the ServiceMonitor rook-ceph-exporter, they are absent for rook-ceph-mgr.
Sharing the label and monitoring config from the CephCluster:

```yaml
labels:
  monitoring:
    prom: local
monitoring:
  enabled: true
```
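For context, this is roughly where the snippet above sits in our CephCluster CR. The metadata values below are placeholders; only the labels and monitoring sections are taken from our actual config:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph        # placeholder name
  namespace: rook-ceph   # placeholder namespace
spec:
  labels:
    monitoring:
      prom: local        # expected to be propagated to the ServiceMonitors
  monitoring:
    enabled: true
```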
This is impacting our monitoring because the Prometheus Operator uses a selector:

```yaml
serviceMonitorSelector:
  matchLabels:
    prom: local
```
The image below demonstrates how the label reset on rook-ceph-mgr is impacting our monitoring.
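To confirm the state when the alerts disappear, a couple of kubectl checks (assuming the default rook-ceph namespace) show whether the label is still present and whether the ServiceMonitor is still selected:

```sh
# Show the current labels on both ServiceMonitors
kubectl -n rook-ceph get servicemonitor rook-ceph-mgr rook-ceph-exporter --show-labels

# List only the ServiceMonitors that still match the Prometheus selector
kubectl -n rook-ceph get servicemonitor -l prom=local
```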
Expected behavior:
We expect the ServiceMonitor rook-ceph-mgr to consistently have these labels present. Their absence is impacting alerts and overall monitoring effectiveness.
How to reproduce it (minimal and precise):
I'm not sure how this issue is triggered. I've noticed it in other clusters but haven't had the chance to investigate thoroughly. I suspect it might be related to the cluster reconciliation process. I'm sharing logs here for your reference, which correlate with the label reset.
2024-07-15 06:29:06.153499 I | operator: rook-ceph-operator-config-controller done reconciling
2024-07-15 17:05:43.154650 I | operator: rook-ceph-operator-config-controller done reconciling
2024-07-15 17:53:42.867843 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2024-07-15 17:53:47.241130 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:53:47.241232 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:53:47.241315 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:53:47.241396 I | clusterdisruption-controller: osd "rook-ceph-osd-15" is down but no node drain is detected
2024-07-15 17:53:47.241481 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:53:47.241558 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:53:47.241635 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:53:47.241721 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:53:47.241826 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:53:47.241900 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:53:47.241971 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:53:47.242046 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:53:47.242125 I | clusterdisruption-controller: osd "rook-ceph-osd-23" is down but no node drain is detected
2024-07-15 17:53:47.242205 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:53:47.242290 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:53:47.242370 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:53:47.242449 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:53:47.242527 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:53:47.242605 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:53:55.460718 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2024-07-15 17:54:02.348999 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:02.349122 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:54:02.349226 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:02.349346 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:02.349453 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:02.349568 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:02.349675 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:02.349808 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:02.349921 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:02.350031 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:02.350136 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:02.350246 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:02.350362 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:02.350472 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:02.350574 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:02.350682 I | clusterdisruption-controller: osd "rook-ceph-osd-23" is down but no node drain is detected
2024-07-15 17:54:02.350804 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:02.350914 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:02.351020 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:02.351131 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:02.351233 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:02.351335 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:02.351442 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:02.351550 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:02.351651 I | clusterdisruption-controller: osd "rook-ceph-osd-15" is down but no node drain is detected
2024-07-15 17:54:02.351762 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:02.351869 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:10.581584 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
2024-07-15 17:54:17.461769 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:17.461895 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:17.461998 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:17.462100 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:17.462205 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:17.462310 I | clusterdisruption-controller: osd "rook-ceph-osd-23" is down but no node drain is detected
2024-07-15 17:54:17.462410 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:17.462516 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:17.462617 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:17.462724 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:17.462831 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:17.462936 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:17.463037 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:17.463143 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:17.463244 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:17.463341 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:17.463454 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:17.463558 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:17.463661 I | clusterdisruption-controller: osd "rook-ceph-osd-15" is down but no node drain is detected
2024-07-15 17:54:17.463771 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:17.463870 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:17.463985 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:17.464087 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:54:17.464184 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:17.464285 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:17.464391 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:17.464489 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:32.577341 I | clusterdisruption-controller: osd "rook-ceph-osd-15" is down but no node drain is detected
2024-07-15 17:54:32.577465 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:32.577562 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:32.577655 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:32.577754 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:54:32.577849 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:32.577945 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:32.578033 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:32.578122 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:32.578209 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:32.578298 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:32.578387 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:32.578485 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:32.578572 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:32.578655 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:32.578737 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:32.578850 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:32.578941 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:32.579025 I | clusterdisruption-controller: osd "rook-ceph-osd-23" is down but no node drain is detected
2024-07-15 17:54:32.579111 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:32.579200 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:32.579284 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:32.579370 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:32.579459 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:32.579556 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:32.579639 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:32.579726 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:40.361870 I | op-mon: marking mon "j" out of quorum
2024-07-15 17:54:40.362021 W | cephclient: skipping adding mon "j" to config file, detected out of quorum
2024-07-15 17:54:40.369929 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2024-07-15 17:54:40.370067 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2024-07-15 17:54:40.429126 I | clusterdisruption-controller: osd is down in failure domain "ops-k8s-storage-1-reg-test-com". pg health: "all PGs in cluster are clean"
2024-07-15 17:54:40.429165 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-2-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-2-reg-test-com"
2024-07-15 17:54:40.469809 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-3-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-3-reg-test-com"
2024-07-15 17:54:40.528277 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-4-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-4-reg-test-com"
2024-07-15 17:54:40.578689 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-5-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-5-reg-test-com"
2024-07-15 17:54:40.698470 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-6-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-6-reg-test-com"
2024-07-15 17:54:40.698506 W | op-mon: mon "j" not found in quorum, waiting for timeout (599 seconds left) before failover
2024-07-15 17:54:40.723018 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-7-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-7-reg-test-com"
2024-07-15 17:54:40.786116 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-8-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-8-reg-test-com"
2024-07-15 17:54:40.861689 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-9-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-9-reg-test-com"
2024-07-15 17:54:40.914391 I | clusterdisruption-controller: deleting the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
2024-07-15 17:54:41.090329 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:41.090414 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:41.090493 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:41.090577 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:41.090646 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:41.090717 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:41.090797 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:41.090871 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:41.090945 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:41.091013 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:41.091089 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:41.091159 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:41.091224 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:41.091301 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:41.091371 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:41.091440 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:54:41.091507 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:41.091585 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:41.091652 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:41.091721 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:41.091802 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:41.091874 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:41.091943 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:41.092018 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:41.092084 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:41.519234 I | clusterdisruption-controller: osd is down in failure domain "ops-k8s-storage-1-reg-test-com". pg health: "cluster is not fully clean. PGs: [{StateName:active clean Count:338} {StateName:active clean laggy Count:6}]"
2024-07-15 17:54:42.578079 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:42.578205 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:42.578294 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:42.578381 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:42.578466 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:42.578548 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:42.578633 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:42.578714 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:42.578803 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:42.578887 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:42.578966 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:42.579052 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:42.579136 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:42.579219 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:42.579298 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:42.579385 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:42.579465 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:42.579545 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:42.579629 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:42.579714 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:42.579801 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:42.579882 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:42.579961 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:42.580041 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:43.015593 I | clusterdisruption-controller: osd is down in failure domain "ops-k8s-storage-1-reg-test-com". pg health: "cluster is not fully clean. PGs: [{StateName:stale active clean Count:260} {StateName:active clean Count:73} {StateName:stale active clean laggy Count:8} {StateName:active clean laggy Count:3}]"
2024-07-15 17:55:13.023272 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:55:13.023351 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:55:13.023419 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:55:13.456604 I | clusterdisruption-controller: osd is down in failure domain "ops-k8s-storage-3-reg-test-com". pg health: "all PGs in cluster are clean"
2024-07-15 17:55:13.456638 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-1-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-1-reg-test-com"
2024-07-15 17:55:13.468982 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-3-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-3-reg-test-com"
2024-07-15 17:55:26.154483 I | op-mon: marking mon "j" back in quorum
2024-07-15 17:55:26.157068 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2024-07-15 17:55:26.157161 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2024-07-15 17:55:26.175798 I | op-mon: mon "j" is back in quorum, removed from mon out timeout list
2024-07-15 17:55:44.004127 I | clusterdisruption-controller: all PGs are active clean. Restoring default OSD pdb settings
2024-07-15 17:55:44.004140 I | clusterdisruption-controller: creating the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
2024-07-15 17:55:44.017383 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-1-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-1-reg-test-com"
2024-07-15 17:55:44.058718 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-2-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-2-reg-test-com"
2024-07-15 17:55:44.084699 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-4-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-4-reg-test-com"
2024-07-15 17:55:44.151186 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-5-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-5-reg-test-com"
2024-07-15 17:55:44.175744 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-6-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-6-reg-test-com"
2024-07-15 17:55:44.192532 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-7-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-7-reg-test-com"
2024-07-15 17:55:44.199887 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-8-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-8-reg-test-com"
2024-07-15 17:55:44.208876 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-9-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-9-reg-test-com"
2024-07-16 03:42:20.084641 I | operator: rook-ceph-operator-config-controller done reconciling
I was reviewing the code to get some insight and noticed that the implementation of applyCephExporterLabels in exporter.go differs from the corresponding code in mgr.go. Specifically, there is a difference highlighted in the following code segment:
I'm not certain whether this difference is responsible for the issue, but it is worth considering.
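As a stopgap until the root cause is clear, we re-apply the label manually. This is only a workaround sketch (namespace assumed to be rook-ceph), and the operator may overwrite the label again on the next reconcile:

```sh
# Temporary workaround: put the selector label back on the mgr ServiceMonitor
kubectl -n rook-ceph label servicemonitor rook-ceph-mgr prom=local --overwrite
```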
File(s) to submit:
Cluster CR (custom resource), typically called cluster.yaml, if necessary
Logs to submit:
Operator's logs, if necessary
Crashing pod(s) logs, if necessary
To get logs, use kubectl -n <namespace> logs <pod name>
When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI.
Read GitHub documentation if you need help.
Cluster Status to submit:
Output of kubectl commands, if necessary
To get the health of the cluster, use kubectl rook-ceph health
To get the status of the cluster, use kubectl rook-ceph ceph status
For more details, see the Rook kubectl Plugin
Environment:
OS (e.g. from /etc/os-release): Rocky Linux
Kernel (e.g. uname -a):
Cloud provider or hardware configuration:
Rook version (use rook version inside of a Rook Pod): v1.14.3
Storage backend version (e.g. for ceph do ceph -v): 18.2.2
Kubernetes version (use kubectl version): v1.27.9
Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): 1.29.5
Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK