Deviation from expected behavior:
The metadata.labels for the ServiceMonitor rook-ceph-mgr are being reset inconsistently. We have enabled monitoring for the CephCluster and configured labels accordingly. While the labels are consistently applied to the ServiceMonitor rook-ceph-exporter, they are absent for rook-ceph-mgr.
Sharing the label and monitoring config from the CephCluster:

```yaml
labels:
  monitoring:
    prom: local
monitoring:
  enabled: true
```
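For context, this is roughly where the snippet above sits in our CephCluster CR. The metadata values below are placeholders; only the labels and monitoring sections are taken from our actual config:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph        # placeholder name
  namespace: rook-ceph   # placeholder namespace
spec:
  labels:
    monitoring:
      prom: local        # expected to be propagated to the ServiceMonitors
  monitoring:
    enabled: true
```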
This is impacting our monitoring because the Prometheus Operator uses a selector:

```yaml
serviceMonitorSelector:
  matchLabels:
    prom: local
```
The image below demonstrates how the label reset on rook-ceph-mgr is impacting our monitoring.
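To confirm the state when the alerts disappear, a couple of kubectl checks (assuming the default rook-ceph namespace) show whether the label is still present and whether the ServiceMonitor is still selected:

```sh
# Show the current labels on both ServiceMonitors
kubectl -n rook-ceph get servicemonitor rook-ceph-mgr rook-ceph-exporter --show-labels

# List only the ServiceMonitors that still match the Prometheus selector
kubectl -n rook-ceph get servicemonitor -l prom=local
```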
Expected behavior:
We expect the ServiceMonitor rook-ceph-mgr to consistently have these labels present. Their absence is impacting alerts and overall monitoring effectiveness.
How to reproduce it (minimal and precise):
I'm not sure how this issue is triggered. I've noticed it in other clusters but haven't had the chance to investigate thoroughly. I suspect it might be related to the cluster reconciliation process. I'm sharing logs here for your reference, which correlate with the label reset.
2024-07-15 06:29:06.153499 I | operator: rook-ceph-operator-config-controller done reconciling
2024-07-15 17:05:43.154650 I | operator: rook-ceph-operator-config-controller done reconciling
2024-07-15 17:53:42.867843 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2024-07-15 17:53:47.241130 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:53:47.241232 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:53:47.241315 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:53:47.241396 I | clusterdisruption-controller: osd "rook-ceph-osd-15" is down but no node drain is detected
2024-07-15 17:53:47.241481 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:53:47.241558 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:53:47.241635 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:53:47.241721 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:53:47.241826 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:53:47.241900 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:53:47.241971 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:53:47.242046 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:53:47.242125 I | clusterdisruption-controller: osd "rook-ceph-osd-23" is down but no node drain is detected
2024-07-15 17:53:47.242205 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:53:47.242290 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:53:47.242370 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:53:47.242449 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:53:47.242527 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:53:47.242605 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:53:55.460718 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2024-07-15 17:54:02.348999 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:02.349122 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:54:02.349226 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:02.349346 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:02.349453 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:02.349568 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:02.349675 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:02.349808 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:02.349921 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:02.350031 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:02.350136 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:02.350246 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:02.350362 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:02.350472 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:02.350574 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:02.350682 I | clusterdisruption-controller: osd "rook-ceph-osd-23" is down but no node drain is detected
2024-07-15 17:54:02.350804 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:02.350914 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:02.351020 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:02.351131 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:02.351233 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:02.351335 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:02.351442 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:02.351550 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:02.351651 I | clusterdisruption-controller: osd "rook-ceph-osd-15" is down but no node drain is detected
2024-07-15 17:54:02.351762 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:02.351869 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:10.581584 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
2024-07-15 17:54:17.461769 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:17.461895 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:17.461998 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:17.462100 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:17.462205 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:17.462310 I | clusterdisruption-controller: osd "rook-ceph-osd-23" is down but no node drain is detected
2024-07-15 17:54:17.462410 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:17.462516 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:17.462617 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:17.462724 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:17.462831 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:17.462936 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:17.463037 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:17.463143 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:17.463244 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:17.463341 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:17.463454 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:17.463558 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:17.463661 I | clusterdisruption-controller: osd "rook-ceph-osd-15" is down but no node drain is detected
2024-07-15 17:54:17.463771 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:17.463870 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:17.463985 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:17.464087 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:54:17.464184 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:17.464285 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:17.464391 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:17.464489 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:32.577341 I | clusterdisruption-controller: osd "rook-ceph-osd-15" is down but no node drain is detected
2024-07-15 17:54:32.577465 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:32.577562 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:32.577655 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:32.577754 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:54:32.577849 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:32.577945 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:32.578033 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:32.578122 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:32.578209 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:32.578298 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:32.578387 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:32.578485 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:32.578572 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:32.578655 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:32.578737 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:32.578850 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:32.578941 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:32.579025 I | clusterdisruption-controller: osd "rook-ceph-osd-23" is down but no node drain is detected
2024-07-15 17:54:32.579111 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:32.579200 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:32.579284 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:32.579370 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:32.579459 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:32.579556 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:32.579639 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:32.579726 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:40.361870 I | op-mon: marking mon "j" out of quorum
2024-07-15 17:54:40.362021 W | cephclient: skipping adding mon "j" to config file, detected out of quorum
2024-07-15 17:54:40.369929 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2024-07-15 17:54:40.370067 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2024-07-15 17:54:40.429126 I | clusterdisruption-controller: osd is down in failure domain "ops-k8s-storage-1-reg-test-com". pg health: "all PGs in cluster are clean"
2024-07-15 17:54:40.429165 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-2-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-2-reg-test-com"
2024-07-15 17:54:40.469809 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-3-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-3-reg-test-com"
2024-07-15 17:54:40.528277 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-4-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-4-reg-test-com"
2024-07-15 17:54:40.578689 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-5-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-5-reg-test-com"
2024-07-15 17:54:40.698470 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-6-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-6-reg-test-com"
2024-07-15 17:54:40.698506 W | op-mon: mon "j" not found in quorum, waiting for timeout (599 seconds left) before failover
2024-07-15 17:54:40.723018 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-7-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-7-reg-test-com"
2024-07-15 17:54:40.786116 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-8-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-8-reg-test-com"
2024-07-15 17:54:40.861689 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-9-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-9-reg-test-com"
2024-07-15 17:54:40.914391 I | clusterdisruption-controller: deleting the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
2024-07-15 17:54:41.090329 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:41.090414 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:41.090493 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:41.090577 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:41.090646 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:41.090717 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:41.090797 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:41.090871 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:41.090945 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:41.091013 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:41.091089 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:41.091159 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:41.091224 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:41.091301 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:41.091371 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:41.091440 I | clusterdisruption-controller: osd "rook-ceph-osd-19" is down but no node drain is detected
2024-07-15 17:54:41.091507 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:41.091585 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:41.091652 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:41.091721 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:41.091802 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:41.091874 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:41.091943 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:41.092018 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:41.092084 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:41.519234 I | clusterdisruption-controller: osd is down in failure domain "ops-k8s-storage-1-reg-test-com". pg health: "cluster is not fully clean. PGs: [{StateName:active clean Count:338} {StateName:active clean laggy Count:6}]"
2024-07-15 17:54:42.578079 I | clusterdisruption-controller: osd "rook-ceph-osd-12" is down but no node drain is detected
2024-07-15 17:54:42.578205 I | clusterdisruption-controller: osd "rook-ceph-osd-6" is down but no node drain is detected
2024-07-15 17:54:42.578294 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:54:42.578381 I | clusterdisruption-controller: osd "rook-ceph-osd-24" is down but no node drain is detected
2024-07-15 17:54:42.578466 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:54:42.578548 I | clusterdisruption-controller: osd "rook-ceph-osd-14" is down but no node drain is detected
2024-07-15 17:54:42.578633 I | clusterdisruption-controller: osd "rook-ceph-osd-10" is down but no node drain is detected
2024-07-15 17:54:42.578714 I | clusterdisruption-controller: osd "rook-ceph-osd-18" is down but no node drain is detected
2024-07-15 17:54:42.578803 I | clusterdisruption-controller: osd "rook-ceph-osd-9" is down but no node drain is detected
2024-07-15 17:54:42.578887 I | clusterdisruption-controller: osd "rook-ceph-osd-20" is down but no node drain is detected
2024-07-15 17:54:42.578966 I | clusterdisruption-controller: osd "rook-ceph-osd-22" is down but no node drain is detected
2024-07-15 17:54:42.579052 I | clusterdisruption-controller: osd "rook-ceph-osd-25" is down but no node drain is detected
2024-07-15 17:54:42.579136 I | clusterdisruption-controller: osd "rook-ceph-osd-8" is down but no node drain is detected
2024-07-15 17:54:42.579219 I | clusterdisruption-controller: osd "rook-ceph-osd-17" is down but no node drain is detected
2024-07-15 17:54:42.579298 I | clusterdisruption-controller: osd "rook-ceph-osd-11" is down but no node drain is detected
2024-07-15 17:54:42.579385 I | clusterdisruption-controller: osd "rook-ceph-osd-26" is down but no node drain is detected
2024-07-15 17:54:42.579465 I | clusterdisruption-controller: osd "rook-ceph-osd-3" is down but no node drain is detected
2024-07-15 17:54:42.579545 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-15 17:54:42.579629 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down but no node drain is detected
2024-07-15 17:54:42.579714 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
2024-07-15 17:54:42.579801 I | clusterdisruption-controller: osd "rook-ceph-osd-16" is down but no node drain is detected
2024-07-15 17:54:42.579882 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:54:42.579961 I | clusterdisruption-controller: osd "rook-ceph-osd-21" is down but no node drain is detected
2024-07-15 17:54:42.580041 I | clusterdisruption-controller: osd "rook-ceph-osd-13" is down but no node drain is detected
2024-07-15 17:54:43.015593 I | clusterdisruption-controller: osd is down in failure domain "ops-k8s-storage-1-reg-test-com". pg health: "cluster is not fully clean. PGs: [{StateName:stale active clean Count:260} {StateName:active clean Count:73} {StateName:stale active clean laggy Count:8} {StateName:active clean laggy Count:3}]"
2024-07-15 17:55:13.023272 I | clusterdisruption-controller: osd "rook-ceph-osd-2" is down but no node drain is detected
2024-07-15 17:55:13.023351 I | clusterdisruption-controller: osd "rook-ceph-osd-7" is down but no node drain is detected
2024-07-15 17:55:13.023419 I | clusterdisruption-controller: osd "rook-ceph-osd-5" is down but no node drain is detected
2024-07-15 17:55:13.456604 I | clusterdisruption-controller: osd is down in failure domain "ops-k8s-storage-3-reg-test-com". pg health: "all PGs in cluster are clean"
2024-07-15 17:55:13.456638 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-host-ops-k8s-storage-1-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-1-reg-test-com"
2024-07-15 17:55:13.468982 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-3-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-3-reg-test-com"
2024-07-15 17:55:26.154483 I | op-mon: marking mon "j" back in quorum
2024-07-15 17:55:26.157068 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2024-07-15 17:55:26.157161 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2024-07-15 17:55:26.175798 I | op-mon: mon "j" is back in quorum, removed from mon out timeout list
2024-07-15 17:55:44.004127 I | clusterdisruption-controller: all PGs are active clean. Restoring default OSD pdb settings
2024-07-15 17:55:44.004140 I | clusterdisruption-controller: creating the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
2024-07-15 17:55:44.017383 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-1-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-1-reg-test-com"
2024-07-15 17:55:44.058718 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-2-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-2-reg-test-com"
2024-07-15 17:55:44.084699 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-4-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-4-reg-test-com"
2024-07-15 17:55:44.151186 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-5-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-5-reg-test-com"
2024-07-15 17:55:44.175744 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-6-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-6-reg-test-com"
2024-07-15 17:55:44.192532 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-7-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-7-reg-test-com"
2024-07-15 17:55:44.199887 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-8-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-8-reg-test-com"
2024-07-15 17:55:44.208876 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-host-ops-k8s-storage-9-reg-test-com" with maxUnavailable=0 for "host" failure domain "ops-k8s-storage-9-reg-test-com"
2024-07-16 03:42:20.084641 I | operator: rook-ceph-operator-config-controller done reconciling
I was reviewing the code to get some insight and noticed that the implementation of applyCephExporterLabels in exporter.go differs from the corresponding code in mgr.go. Specifically, there is a difference highlighted in the following code segment:
I'm not certain whether this difference is responsible for the issue, but it is worth considering.
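As a stopgap until the root cause is clear, we re-apply the label manually. This is only a workaround sketch (namespace assumed to be rook-ceph), and the operator may overwrite the label again on the next reconcile:

```sh
# Temporary workaround: put the selector label back on the mgr ServiceMonitor
kubectl -n rook-ceph label servicemonitor rook-ceph-mgr prom=local --overwrite
```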
File(s) to submit:
Cluster CR (custom resource), typically called cluster.yaml, if necessary
Logs to submit:
Operator's logs, if necessary
Crashing pod(s) logs, if necessary
To get logs, use kubectl -n <namespace> logs <pod name>
When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI.
Read GitHub documentation if you need help.
Cluster Status to submit:
Output of kubectl commands, if necessary
To get the health of the cluster, use kubectl rook-ceph health
To get the status of the cluster, use kubectl rook-ceph ceph status
For more details, see the Rook kubectl Plugin
Environment:
OS (e.g. from /etc/os-release): Rocky Linux
Kernel (e.g. uname -a):
Cloud provider or hardware configuration:
Rook version (use rook version inside of a Rook Pod): v1.14.3
Storage backend version (e.g. for ceph do ceph -v): 18.2.2
Kubernetes version (use kubectl version): v1.27.9
Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): 1.29.5
Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK