Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the Node/Worker Pool Overview dashboard #6595

Merged
merged 1 commit into from
Aug 30, 2022

Conversation

istvanballok
Copy link
Contributor

@istvanballok istvanballok commented Aug 29, 2022

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

Improve the Node/Worker Pool Overview dashboard

The query of the panel
"Average # of Pods on each Node in Worker Group(s) $worker_group"
was incorrect: it was off by 1.

The new query uses count instead of sum because actually we want to
count the pods.

When the 2 vectors are merged with the operator to enrich the series
with the worker group label, we should add 0 instead of 1 to avoid
manipulating the result.

Note that we used the new Prometheus feature to format this query:

avg by (label_worker_gardener_cloud_pool) (
    count by (node) (kube_pod_info{type="shoot"})
    on (node) group_right ()
    sum by (label_worker_gardener_cloud_pool, node) (
      0 * kube_node_labels{label_worker_gardener_cloud_pool=~"etcd. "}
    )
)

The query of the panel
"Number of Nodes in Worker Group(s) $worker_group"
could be simplified.

The query of the panel
"Average Memory Usage of Nodes in Worker Group(s) $worker_group"
was slightly improved.

The query of the panel
"Average CPU Usage of Nodes in Worker Group(s) $worker_group"
was improved to exclude the iowait and steal CPU times.

The "Failed Connection Tracking Inserts" panel was missing the
rate function for the counter metric.

The queries of the "Overview of all nodes" table were also formatted
and improved in a similar way.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

/cc @rickardsjp

Release note:

Improve the Node/Worker Pool Overview dashboard

The query of the panel
"Average # of Pods on each Node in Worker Group(s) $worker_group"
was incorrect: it was off by 1.

The new query uses count instead of sum because actually we want to
count the pods.

When the 2 vectors are merged with the   operator to enrich the series
with the worker group label, we should add 0 instead of 1 to avoid
manipulating the result.

Note that we used the new Prometheus feature to format this query:

```
avg by (label_worker_gardener_cloud_pool) (
    count by (node) (kube_pod_info{type="shoot"})
    on (node) group_right ()
    sum by (label_worker_gardener_cloud_pool, node) (
      0 * kube_node_labels{label_worker_gardener_cloud_pool=~"etcd. "}
    )
)
```

The query of the panel
"Number of Nodes in Worker Group(s) $worker_group"
could be simplified.

The query of the panel
"Average Memory Usage of Nodes in Worker Group(s) $worker_group"
was slightly improved.

The query of the panel
"Average CPU Usage of Nodes in Worker Group(s) $worker_group"
was improved to exclude the iowait and steal CPU times.

The "Failed Connection Tracking Inserts" panel was missing the
rate function for the counter metric.

The queries of the "Overview of all nodes" table were also formatted
and improved in a similar way.

Co-authored-by: Wesley Bermbach <[email protected]>
Co-authored-by: Istvan Zoltan Ballok <[email protected]>
Co-authored-by: Jeremy Rickards <[email protected]>
@gardener-prow gardener-prow bot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension labels Aug 29, 2022
@gardener-prow gardener-prow bot added cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 29, 2022
@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Aug 30, 2022
@@ -553,51 553,51 @@
],
"targets": [
{
"expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by (node) on (node) group_right sum(kube_node_labels{label_worker_gardener_cloud_pool=~\"$worker_group\"}) by (label_worker_gardener_cloud_pool, node) - 1",
"expr": " sum by (node) (rate(node_cpu_seconds_total{mode!~\"idle|iowait|steal\"}[$__rate_interval]))\n on (node) group_right ()\n sum by (label_worker_gardener_cloud_pool, node) (\n 0 * kube_node_labels{label_worker_gardener_cloud_pool=~\"$worker_group\"}\n )",
Copy link
Member

@ialidzhikov ialidzhikov Aug 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the additional whitespace at the beginning of the expression (is it part of the fancy formatting or can it be removed)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The leading whitespace is part of the formatting.
This is the new Prometheus formatting feature: prometheus/prometheus#11039

image

image

Copy link
Member

@ialidzhikov ialidzhikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@gardener-prow
Copy link
Contributor

gardener-prow bot commented Aug 30, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ialidzhikov, rickardsjp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 30, 2022
@gardener-prow gardener-prow bot merged commit 6fa306e into gardener:master Aug 30, 2022
@@ -46,7 46,7 @@
"steppedLine": false,
"targets": [
{
"expr": "avg(sum(rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by (node) on (node) group_right sum(kube_node_labels{label_worker_gardener_cloud_pool=~\"$worker_group\"}) by (label_worker_gardener_cloud_pool, node) - 1) by (label_worker_gardener_cloud_pool)",
"expr": "avg by (label_worker_gardener_cloud_pool) (\n sum by (node) (rate(node_cpu_seconds_total{mode!~\"idle|iowait|steal\"}[$__rate_interval]))\n on (node) group_right ()\n sum by (label_worker_gardener_cloud_pool, node) (\n 0 * kube_node_labels{label_worker_gardener_cloud_pool=~\"$worker_group\"}\n )\n)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also other occurrences of

node_cpu_seconds_total{mode!~\"idle\"}

{
"exemplar": true,
"expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\", node=~\"$Node\"}[$__rate_interval]))",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "Used",
"refId": "A"
},

Don't they need to be adapted to make sure that all dashboards show the same CPU usage value? Otherwise I expect some minor differences.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/monitoring Monitoring (including availability monitoring and alerting) related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants