Improve the Node/Worker Pool Overview dashboard #6595

istvanballok · 2022-08-29T14:25:27Z

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

Improve the Node/Worker Pool Overview dashboard

The query of the panel
"Average # of Pods on each Node in Worker Group(s) $worker_group"
was incorrect: it was off by 1.

The new query uses count instead of sum because actually we want to
count the pods.

When the 2 vectors are merged with the operator to enrich the series
with the worker group label, we should add 0 instead of 1 to avoid
manipulating the result.

Note that we used the new Prometheus feature to format this query:

avg by (label_worker_gardener_cloud_pool) (
    count by (node) (kube_pod_info{type="shoot"})
    on (node) group_right ()
    sum by (label_worker_gardener_cloud_pool, node) (
      0 * kube_node_labels{label_worker_gardener_cloud_pool=~"etcd. "}
    )
)

The query of the panel
"Number of Nodes in Worker Group(s) $worker_group"
could be simplified.

The query of the panel
"Average Memory Usage of Nodes in Worker Group(s) $worker_group"
was slightly improved.

The query of the panel
"Average CPU Usage of Nodes in Worker Group(s) $worker_group"
was improved to exclude the iowait and steal CPU times.

The "Failed Connection Tracking Inserts" panel was missing the
rate function for the counter metric.

The queries of the "Overview of all nodes" table were also formatted
and improved in a similar way.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

/cc @rickardsjp

Release note:

Improve the Node/Worker Pool Overview dashboard

The query of the panel "Average # of Pods on each Node in Worker Group(s) $worker_group" was incorrect: it was off by 1. The new query uses count instead of sum because actually we want to count the pods. When the 2 vectors are merged with the operator to enrich the series with the worker group label, we should add 0 instead of 1 to avoid manipulating the result. Note that we used the new Prometheus feature to format this query: ``` avg by (label_worker_gardener_cloud_pool) ( count by (node) (kube_pod_info{type="shoot"}) on (node) group_right () sum by (label_worker_gardener_cloud_pool, node) ( 0 * kube_node_labels{label_worker_gardener_cloud_pool=~"etcd. "} ) ) ``` The query of the panel "Number of Nodes in Worker Group(s) $worker_group" could be simplified. The query of the panel "Average Memory Usage of Nodes in Worker Group(s) $worker_group" was slightly improved. The query of the panel "Average CPU Usage of Nodes in Worker Group(s) $worker_group" was improved to exclude the iowait and steal CPU times. The "Failed Connection Tracking Inserts" panel was missing the rate function for the counter metric. The queries of the "Overview of all nodes" table were also formatted and improved in a similar way. Co-authored-by: Wesley Bermbach <[email protected]> Co-authored-by: Istvan Zoltan Ballok <[email protected]> Co-authored-by: Jeremy Rickards <[email protected]>

ialidzhikov · 2022-08-30T09:49:59Z

charts/seed-monitoring/charts/grafana/dashboards/owners/node-pool-dashboard.json

@@ -553,51 553,51 @@
 ],
 "targets": [
 {
- "expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by (node)  on (node) group_right sum(kube_node_labels{label_worker_gardener_cloud_pool=~\"$worker_group\"}) by (label_worker_gardener_cloud_pool, node) - 1",
+ "expr": " sum by (node) (rate(node_cpu_seconds_total{mode!~\"idle|iowait|steal\"}[$__rate_interval]))\n on (node) group_right ()\n  sum by (label_worker_gardener_cloud_pool, node) (\n 0 * kube_node_labels{label_worker_gardener_cloud_pool=~\"$worker_group\"}\n )",


Do we need the additional whitespace at the beginning of the expression (is it part of the fancy formatting or can it be removed)?

The leading whitespace is part of the formatting.
This is the new Prometheus formatting feature: prometheus/prometheus#11039

ialidzhikov

/approve

gardener-prow · 2022-08-30T12:15:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ialidzhikov, rickardsjp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ialidzhikov]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ialidzhikov · 2022-08-31T15:26:42Z

charts/seed-monitoring/charts/grafana/dashboards/owners/node-pool-dashboard.json

@@ -46,7 46,7 @@
 "steppedLine": false,
 "targets": [
 {
- "expr": "avg(sum(rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by (node)  on (node) group_right sum(kube_node_labels{label_worker_gardener_cloud_pool=~\"$worker_group\"}) by (label_worker_gardener_cloud_pool, node) - 1) by (label_worker_gardener_cloud_pool)",
+ "expr": "avg by (label_worker_gardener_cloud_pool) (\n sum by (node) (rate(node_cpu_seconds_total{mode!~\"idle|iowait|steal\"}[$__rate_interval]))\n  on (node) group_right ()\n sum by (label_worker_gardener_cloud_pool, node) (\n 0 * kube_node_labels{label_worker_gardener_cloud_pool=~\"$worker_group\"}\n )\n)",


There are also other occurrences of

node_cpu_seconds_total{mode!~\"idle\"}

gardener/charts/seed-monitoring/charts/grafana/dashboards/owners/node-details-dashboard.json

Lines 265 to 273 in d31e21c

{

"exemplar": true,

"expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\", node=~\"$Node\"}[$__rate_interval]))",

"format": "time_series",

"interval": "",

"intervalFactor": 1,

"legendFormat": "Used",

"refId": "A"

},

Don't they need to be adapted to make sure that all dashboards show the same CPU usage value? Otherwise I expect some minor differences.

gardener-prow bot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension labels Aug 29, 2022

gardener-prow bot requested review from plkokanov and rfranzke August 29, 2022 14:25

gardener-prow bot added cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 29, 2022

rickardsjp approved these changes Aug 30, 2022

View reviewed changes

gardener-prow bot assigned rickardsjp Aug 30, 2022

gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Aug 30, 2022

ialidzhikov reviewed Aug 30, 2022

View reviewed changes

gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 30, 2022

gardener-prow bot merged commit 6fa306e into gardener:master Aug 30, 2022

ialidzhikov reviewed Aug 31, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve the Node/Worker Pool Overview dashboard #6595

Improve the Node/Worker Pool Overview dashboard #6595

istvanballok commented Aug 29, 2022 •

edited

Loading

ialidzhikov Aug 30, 2022 •

edited

Loading

istvanballok Aug 30, 2022

ialidzhikov left a comment

gardener-prow bot commented Aug 30, 2022

ialidzhikov Aug 31, 2022

	{
	"exemplar": true,
	"expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\", node=~\"$Node\"}[$__rate_interval]))",
	"format": "time_series",
	"interval": "",
	"intervalFactor": 1,
	"legendFormat": "Used",
	"refId": "A"
	},

Improve the Node/Worker Pool Overview dashboard #6595

Improve the Node/Worker Pool Overview dashboard #6595

Conversation

istvanballok commented Aug 29, 2022 • edited Loading

ialidzhikov Aug 30, 2022 • edited Loading

Choose a reason for hiding this comment

istvanballok Aug 30, 2022

Choose a reason for hiding this comment

ialidzhikov left a comment

Choose a reason for hiding this comment

gardener-prow bot commented Aug 30, 2022

ialidzhikov Aug 31, 2022

Choose a reason for hiding this comment

istvanballok commented Aug 29, 2022 •

edited

Loading

ialidzhikov Aug 30, 2022 •

edited

Loading