
Inconsistent Reset/Removal of metadata.labels for serviceMonitor rook-ceph-mgr #14472

Open · mohitrajain opened this issue Jul 18, 2024 · 2 comments

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
The metadata.labels on the ServiceMonitor rook-ceph-mgr are reset inconsistently. We have enabled monitoring on the CephCluster and configured monitoring labels accordingly. The labels are consistently applied to the ServiceMonitor rook-ceph-exporter, but they are intermittently missing from rook-ceph-mgr.

Sharing the metadata of both ServiceMonitors here:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  generation: 2
  labels:
    team: rook
  name: rook-ceph-mgr
  namespace: rook-ceph

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  generation: 2
  labels:
    prom: local
    team: rook
  name: rook-ceph-exporter
  namespace: rook-ceph

The label and monitoring configuration on the CephCluster is:

    labels:
      monitoring:
        prom: local
    monitoring:
      enabled: true

This impacts our monitoring because the Prometheus operator only discovers ServiceMonitors that match the following selector (see the sketch after the snippet):

      serviceMonitorSelector:
        matchLabels:
          prom: local
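
To make the impact concrete, here is a minimal Go sketch of matchLabels semantics. It is illustrative only (neither Rook's nor the Prometheus operator's actual code): every key/value pair in the selector must be present on the ServiceMonitor's labels, so once prom: local is missing from rook-ceph-mgr that ServiceMonitor is no longer selected.

package main

import "fmt"

// matchesSelector mimics matchLabels semantics: every key/value pair in the
// selector must be present, with an equal value, in the object's labels.
func matchesSelector(selector, objectLabels map[string]string) bool {
    for k, v := range selector {
        if objectLabels[k] != v {
            return false
        }
    }
    return true
}

func main() {
    selector := map[string]string{"prom": "local"}

    // Labels observed on the two ServiceMonitors above.
    mgrLabels := map[string]string{"team": "rook"}
    exporterLabels := map[string]string{"prom": "local", "team": "rook"}

    fmt.Println("rook-ceph-mgr selected:", matchesSelector(selector, mgrLabels))           // false
    fmt.Println("rook-ceph-exporter selected:", matchesSelector(selector, exporterLabels)) // true
}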

A screenshot attached to the issue demonstrates how the label reset on rook-ceph-mgr is impacting our monitoring.

Expected behavior:
We expect the ServiceMonitor rook-ceph-mgr to consistently retain the configured labels. Their absence is impacting alerts and overall monitoring effectiveness.

How to reproduce it (minimal and precise):
I'm not sure what triggers this issue. I've noticed it in other clusters as well but haven't had the chance to investigate thoroughly; I suspect it is related to the cluster reconciliation process. The operator logs below correlate with the time of the label reset.


2024-07-15 06:29:06.153499 I | operator: rook-ceph-operator-config-controller done reconciling
2024-07-15 17:05:43.154650 I | operator: rook-ceph-operator-config-controller done reconciling
2024-07-15 17:53:42.867843 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2024-07-15 17:53:47.241130 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:53:47.241232 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:53:47.241315 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:53:47.241396 I | clusterdisruption-controller: osd "rook-ceph-osd-15" is down but no node drain is detected
2024-07-15 17:53:47.241481 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:53:47.241558 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:53:47.241635 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:53:47.241721 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:53:47.241826 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:53:47.241900 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:53:47.241971 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:53:47.242046 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:53:47.242125 I | clusterdisruption-controller: osd "rook-ceph-osd-23" is down but no node drain is detected
2024-07-15 17:53:47.242205 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:53:47.242290 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:53:47.242370 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:53:47.242449 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:53:47.242527 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:53:47.242605 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:53:55.460718 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2024-07-15 17:54:02.348999 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:02.349122 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:54:02.349226 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:02.349346 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:02.349453 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:02.349568 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:02.349675 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:02.349808 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:02.349921 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:02.350031 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:02.350136 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:02.350246 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:02.350362 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:02.350472 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:02.350574 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:02.350682 I | clusterdisruption-controller: osd "rook-ceph-osd-23" is down but no node drain is detected
2024-07-15 17:54:02.350804 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:02.350914 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:02.351020 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:02.351131 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:02.351233 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:02.351335 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:02.351442 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:02.351550 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:02.351651 I | clusterdisruption-controller: osd "rook-ceph-osd-15" is down but no node drain is detected
2024-07-15 17:54:02.351762 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:02.351869 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:10.581584 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
2024-07-15 17:54:17.461769 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:17.461895 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:17.461998 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:17.462100 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:17.462205 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:17.462310 I | clusterdisruption-controller: osd "rook-ceph-osd-23" is down but no node drain is detected
2024-07-15 17:54:17.462410 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:17.462516 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:17.462617 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:17.462724 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:17.462831 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:17.462936 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:17.463037 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:17.463143 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:17.463244 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:17.463341 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:17.463454 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:17.463558 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:17.463661 I | clusterdisruption-controller: osd "rook-ceph-osd-15" is down but no node drain is detected
2024-07-15 17:54:17.463771 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:17.463870 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:17.463985 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:17.464087 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:54:17.464184 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:17.464285 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:17.464391 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:17.464489 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:32.577341 I | clusterdisruption-controller: osd "rook-ceph-osd-15" is down but no node drain is detected
2024-07-15 17:54:32.577465 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:32.577562 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:32.577655 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:32.577754 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:54:32.577849 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:32.577945 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:32.578033 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:32.578122 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:32.578209 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:32.578298 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:32.578387 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:32.578485 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:32.578572 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:32.578655 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:32.578737 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:32.578850 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:32.578941 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:32.579025 I | clusterdisruption-controller: osd "rook-ceph-osd-23" is down but no node drain is detected
2024-07-15 17:54:32.579111 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:32.579200 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:32.579284 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:32.579370 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:32.579459 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:32.579556 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:32.579639 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:32.579726 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:40.361870 I | op-mon: marking mon "j" out of quorum
2024-07-15 17:54:40.362021 W | cephclient: skipping adding mon "j" to config file, detected out of quorum
2024-07-15 17:54:40.369929 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2024-07-15 17:54:40.370067 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2024-07-15 17:54:40.429126 I | clusterdisruption-controller: osd is down in failure domain "ops-k8s-storage-1-reg-test-com". pg health: "all PGs in cluster are clean"
2024-07-15 17:54:40.429165 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-2-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-2-reg-test-com"
2024-07-15 17:54:40.469809 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-3-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-3-reg-test-com"
2024-07-15 17:54:40.528277 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-4-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-4-reg-test-com"
2024-07-15 17:54:40.578689 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-5-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-5-reg-test-com"
2024-07-15 17:54:40.698470 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-6-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-6-reg-test-com"
2024-07-15 17:54:40.698506 W | op-mon: mon "j" not found in quorum, waiting for timeout (599 seconds left) before failover
2024-07-15 17:54:40.723018 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-7-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-7-reg-test-com"
2024-07-15 17:54:40.786116 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-8-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-8-reg-test-com"
2024-07-15 17:54:40.861689 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-9-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-9-reg-test-com"
2024-07-15 17:54:40.914391 I | clusterdisruption-controller: deleting the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
2024-07-15 17:54:41.090329 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:41.090414 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:41.090493 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:41.090577 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:41.090646 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:41.090717 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:41.090797 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:41.090871 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:41.090945 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:41.091013 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:41.091089 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:41.091159 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:41.091224 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:41.091301 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:41.091371 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:41.091440 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:54:41.091507 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:41.091585 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:41.091652 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:41.091721 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:41.091802 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:41.091874 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:41.091943 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:41.092018 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:41.092084 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:41.519234 I | clusterdisruption-controller: osd is down in failure domain "ops-k8s-storage-1-reg-test-com". pg health: "cluster is not fully clean. PGs: [{StateName:active clean Count:338} {StateName:active clean laggy Count:6}]"
2024-07-15 17:54:42.578079 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:42.578205 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:42.578294 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:42.578381 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:42.578466 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:42.578548 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:42.578633 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:42.578714 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:42.578803 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:42.578887 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:42.578966 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:42.579052 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:42.579136 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:42.579219 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:42.579298 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:42.579385 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:42.579465 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:42.579545 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:42.579629 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:42.579714 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:42.579801 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:42.579882 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:42.579961 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:42.580041 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:43.015593 I | clusterdisruption-controller: osd is down in failure domain "ops-k8s-storage-1-reg-test-com". pg health: "cluster is not fully clean. PGs: [{StateName:stale active clean Count:260} {StateName:active clean Count:73} {StateName:stale active clean laggy Count:8} {StateName:active clean laggy Count:3}]"
2024-07-15 17:55:13.023272 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:55:13.023351 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:55:13.023419 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:55:13.456604 I | clusterdisruption-controller: osd is down in failure domain "ops-k8s-storage-3-reg-test-com". pg health: "all PGs in cluster are clean"
2024-07-15 17:55:13.456638 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-1-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-1-reg-test-com"
2024-07-15 17:55:13.468982 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-3-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-3-reg-test-com"
2024-07-15 17:55:26.154483 I | op-mon: marking mon "j" back in quorum
2024-07-15 17:55:26.157068 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2024-07-15 17:55:26.157161 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2024-07-15 17:55:26.175798 I | op-mon: mon "j" is back in quorum, removed from mon out timeout list
2024-07-15 17:55:44.004127 I | clusterdisruption-controller: all PGs are active clean. Restoring default OSD pdb settings
2024-07-15 17:55:44.004140 I | clusterdisruption-controller: creating the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
2024-07-15 17:55:44.017383 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-1-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-1-reg-test-com"
2024-07-15 17:55:44.058718 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-2-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-2-reg-test-com"
2024-07-15 17:55:44.084699 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-4-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-4-reg-test-com"
2024-07-15 17:55:44.151186 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-5-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-5-reg-test-com"
2024-07-15 17:55:44.175744 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-6-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-6-reg-test-com"
2024-07-15 17:55:44.192532 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-7-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-7-reg-test-com"
2024-07-15 17:55:44.199887 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-8-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-8-reg-test-com"
2024-07-15 17:55:44.208876 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-9-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-9-reg-test-com"
2024-07-16 03:42:20.084641 I | operator: rook-ceph-operator-config-controller done reconciling

While reviewing the code for insight, I noticed that the implementation of applyCephExporterLabels in exporter.go differs from the corresponding code in mgr.go. Specifically, the exporter path contains the following call:

cephv1.GetMonitoringLabels(cephCluster.Spec.Labels).OverwriteApplyToObjectMeta(&serviceMonitor.ObjectMeta)

I'm not certain if this difference is responsible for the issue, but it is worth considering.
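
For context, below is a minimal sketch of the overwrite semantics that call provides. It is not the actual Rook implementation: getMonitoringLabels and overwriteApplyToObjectMeta are stand-ins for the cephv1.GetMonitoringLabels(...).OverwriteApplyToObjectMeta(...) call quoted above, and the starting state (a ServiceMonitor whose metadata carries only its default label) is an assumption about how the monitoring label could end up missing. The point is only that an overwrite-style apply run on every reconcile would restore the selector label, whereas a code path that never re-applies it leaves the ServiceMonitor unlabeled.

package main

import (
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// getMonitoringLabels stands in for cephv1.GetMonitoringLabels: it returns the
// labels configured under labels.monitoring in the CephCluster spec. A fixed
// map is used here purely for illustration.
func getMonitoringLabels() map[string]string {
    return map[string]string{"prom": "local"}
}

// overwriteApplyToObjectMeta unconditionally re-asserts every monitoring label
// on the object's metadata, i.e. overwrite semantics.
func overwriteApplyToObjectMeta(labels map[string]string, meta *metav1.ObjectMeta) {
    if meta.Labels == nil {
        meta.Labels = map[string]string{}
    }
    for k, v := range labels {
        meta.Labels[k] = v
    }
}

func main() {
    // Assumed starting point: a reconcile left the rook-ceph-mgr
    // ServiceMonitor metadata with only the operator's default label.
    sm := metav1.ObjectMeta{
        Name:      "rook-ceph-mgr",
        Namespace: "rook-ceph",
        Labels:    map[string]string{"team": "rook"}, // monitoring label missing
    }
    fmt.Println("before:", sm.Labels)

    // Re-applying the monitoring labels with overwrite semantics, as the
    // exporter path does on each reconcile, restores the selector label.
    overwriteApplyToObjectMeta(getMonitoringLabels(), &sm)
    fmt.Println("after overwrite apply:", sm.Labels)
}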

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:

  • Operator's logs, if necessary

  • Crashing pod(s) logs, if necessary

    To get logs, use kubectl -n <namespace> logs <pod name>
    When pasting logs, always surround them with backticks or use the insert-code button in the GitHub UI.
    Read GitHub documentation if you need help.

Cluster Status to submit:

  • Output of kubectl commands, if necessary

    To get the health of the cluster, use kubectl rook-ceph health
    To get the status of the cluster, use kubectl rook-ceph ceph status
    For more details, see the Rook kubectl Plugin

Environment:

  • OS (e.g. from /etc/os-release): Rocky Linux
  • Kernel (e.g. uname -a):
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod): v1.14.3
  • Storage backend version (e.g. for ceph do ceph -v): 18.2.2
  • Kubernetes version (use kubectl version): v1.27.9
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): 1.29.5
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK
travisn (Member) commented Jul 22, 2024

@mohitrajain Could this be the same issue as #14477? There is a PR in progress to fix that.

mohitrajain (Author) commented

@travisn The behavior is the same as in the issue you mentioned.
