
Increase periodicity in domainstats migration metrics #12441

Merged
4 commits merged into kubevirt:main from the migration-metrics branch on Sep 24, 2024

Conversation

@machadovilaca (Member) commented Jul 23, 2024

What this PR does

Before this PR:

Currently migration data is reported with the domainstats collector every 30 seconds.
This data includes the migrated and remaining memory bytes, the memory transfer rate
and the dirty rate.

Such a long period for an ephemeral job reduces visibility into its progress.
Users are not able to follow the data migration in "real time", and, for most VMs,
the 30-second period causes the data to be reported only once during the job,
and sometimes only after it has ended.

After this PR:

To report the data more frequently, we moved the migration stats to a separate collector.
This collector registers an event handler that tracks changes in the VMI status and starts
a job to gather migration information when a new migration begins. The job runs
every 5 seconds (configurable).
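As a rough illustration of this pattern, here is a minimal, self-contained Go sketch (the names, such as migrationScraper, are hypothetical; this is not the PR's actual code): a per-migration polling job that scrapes stats on a configurable interval, started when a migration begins and stopped when the VMI status reports it as finished.

package main

import (
    "fmt"
    "time"
)

// migrationScraper is a hypothetical stand-in for the per-VMI job described
// above: it polls migration stats on a configurable interval until stopped.
type migrationScraper struct {
    interval time.Duration
    stop     chan struct{}
}

func newMigrationScraper(interval time.Duration) *migrationScraper {
    return &migrationScraper{interval: interval, stop: make(chan struct{})}
}

// run polls until Stop is called (e.g. when the VMI status update handler
// sees the migration end).
func (s *migrationScraper) run(scrape func()) {
    ticker := time.NewTicker(s.interval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            scrape()
        case <-s.stop:
            return
        }
    }
}

func (s *migrationScraper) Stop() { close(s.stop) }

func main() {
    // In the real collector the trigger would be the VMI status event handler;
    // here the loop is started and stopped manually for illustration.
    s := newMigrationScraper(5 * time.Second)
    go s.run(func() { fmt.Println("scrape migration stats from libvirt") })

    time.Sleep(11 * time.Second)
    s.Stop()
}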

jira-ticket: https://issues.redhat.com/browse/CNV-44897

Fixes #

Why we need it and why it was done in this way

The following tradeoffs were made:

The following alternatives were considered:

Links to places where the discussion took place:

Special notes for your reviewer

Checklist

This checklist is not enforced, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note

Increase periodicity in domainstats migration metrics

@kubevirt-bot (Contributor) commented:

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jul 23, 2024
@machadovilaca (Member, Author) commented:

/test all

@kubevirt-bot kubevirt-bot added size/XL sig/buildsystem Denotes an issue or PR that relates to changes in the build system. labels Jul 23, 2024
@kubevirt-bot kubevirt-bot added the sig/observability Denotes an issue or PR that relates to observability. label Jul 23, 2024
@machadovilaca (Member, Author) commented:

/retest

@machadovilaca force-pushed the migration-metrics branch 3 times, most recently from f1b4be0 to a14f0b3 on July 25, 2024 14:40
@machadovilaca force-pushed the migration-metrics branch 2 times, most recently from dd66a66 to 1daba6d on July 25, 2024 14:46
@machadovilaca (Member, Author) commented:

/test all

@machadovilaca machadovilaca marked this pull request as ready for review July 25, 2024 15:00
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 25, 2024
@machadovilaca (Member, Author) commented:

/cc @sradco @assafad @avlitman

@kubevirt-bot kubevirt-bot requested a review from sradco July 25, 2024 15:01
@machadovilaca (Member, Author) commented:

@enp0s3 since we need to know when a VMIM starts and finishes, virt-handler now needs access to the virtualmachineinstancemigrations resource, so I added get, list and watch permissions.

I tried this implementation in virt-controller first, to avoid adding this overhead and these permissions, but virt-controller, unlike virt-handler, is not able to find and connect to the sockets to fetch the info from libvirt.

Can you please take a look?
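For context, a hedged sketch of what such a read-only rule for virtualmachineinstancemigrations could look like, expressed with the standard k8s.io/api/rbac/v1 types (illustrative only; this is not the exact place or form in which KubeVirt declares its virt-handler permissions):

package main

import (
    "fmt"

    rbacv1 "k8s.io/api/rbac/v1"
)

func main() {
    // Hypothetical sketch of the extra permission described above:
    // read-only access for virt-handler to virtualmachineinstancemigrations.
    rule := rbacv1.PolicyRule{
        APIGroups: []string{"kubevirt.io"},
        Resources: []string{"virtualmachineinstancemigrations"},
        Verbs:     []string{"get", "list", "watch"},
    }
    fmt.Printf("%+v\n", rule)
}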

@machadovilaca (Member, Author) commented:

/retest

@machadovilaca (Member, Author) commented:

/retest

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 6, 2024
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 6, 2024
@fossedihelm (Contributor) left a comment:


Thank you @machadovilaca!
I think we are close :)

@@ -28,6 +28,7 @@ import (
 	k6tv1 "kubevirt.io/api/core/v1"

 	"kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/stats"
+	"kubevirt.io/kubevirt/tests/libmonitoring"
Contributor:

I am not sure we want the tests pkg to be imported in the production one.
I remember we wanted to avoid such things. @vladikr thoughts?

Member:

Yes!

Contributor:

So @machadovilaca I think we should remove the second commit, or if you prefer you can create a new /testing pkg under pkg/monitoring/metrics/virt-handler and move the metricMatcher there.
Consider that this pattern is already used elsewhere, i.e. https://github.com/kubevirt/kubevirt/tree/main/pkg/virt-controller/watch/testing
Thank you!

Member Author:

I created it under pkg/monitoring/metrics but named it 'tests'; there are some errors in e2e though,
so I will rename it too.

Contributor:

Sorry, I missed the last commit

Member Author:

updated

pkg/monitoring/metrics/virt-handler/metrics.go (outdated; resolved)
@machadovilaca force-pushed the migration-metrics branch 2 times, most recently from 8f40ace to f20851e on September 17, 2024 10:27
@machadovilaca (Member, Author) commented:

pull-kubevirt-e2e-k8s-1.29-sig-monitoring ✔️

@fossedihelm (Contributor) left a comment:


@machadovilaca Thank you! One final doubt from my side

Comment on lines 106 to 118
values, err := q.scrapeDomainStats()
if err != nil {
log.Log.Reason(err).Errorf("failed to scrape domain stats for VMI %s/%s", q.vmi.Namespace, q.vmi.Name)
return
}

r := result{
vmi: q.vmi.Name,
namespace: q.vmi.Namespace,

domainJobInfo: *values.MigrateDomainJobInfo,
timestamp: time.Now(),
}

q.mutex.Lock()
defer q.mutex.Unlock()
q.results.Value = r
q.results = q.results.Next()
Contributor:

I'm wondering if we should lock before the q.scrapeDomainStats().
Let me explain:

  1. a metric.Collect() call is executed while we are scraping the domain; it will lock the queue because it is reading results through q.all().
  2. the migration finishes at that moment, and the key is deleted from the handler
  3. the queue lock is released so the last results from queue.collect() are written
  4. these results are never consumed

Maybe it's an edge case, since on the next tick queue.collect() will notice that the migration is finished and will cancel the ticker, but those results are never picked up.
Are we fine with that?
Another (maybe worse) case is if an error occurs while we are reading the informer during point 2, which will cause the vmi to be dropped from the vmiStats loop while the migration is still in progress.
Is this something that was considered, and is it an "acceptable" risk?
Thank you

Member Author:

  • For the first point, I think it's fine to lose that last bit of data: in terms of user observability of the migration, not having that data right before the migration finishes won't affect the outcome significantly (and locking before the scrape can lead to much longer locking times).

  • In terms of removing the vmi, it should still be fine: the previous data is still collected, and on the next vmi status update, if the migration is still progressing, it will be re-added.

so IMO we should be fine as we are
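To make the trade-off being discussed concrete, here is a standalone, simplified Go sketch (the queue type, field names, and scrape function are stand-ins, not the PR's code): scraping outside the lock keeps the critical section short but allows the "last result never consumed" window described above, while scraping inside the lock closes that window at the cost of blocking readers for the whole scrape.

package main

import (
    "fmt"
    "sync"
    "time"
)

// Simplified stand-in for the collector's per-VMI queue, used only to
// illustrate the locking trade-off; not KubeVirt's actual types.
type queue struct {
    mutex   sync.Mutex
    results []string
}

// scrape simulates a slow libvirt domain-stats scrape.
func (q *queue) scrape() string {
    time.Sleep(50 * time.Millisecond)
    return fmt.Sprintf("sample@%s", time.Now().Format(time.RFC3339Nano))
}

// collectNarrow is the merged approach: scrape outside the lock, so readers
// are only blocked for the short append, at the cost of the race described
// above (a result written after the last read may never be consumed).
func (q *queue) collectNarrow() {
    v := q.scrape() // no lock held during the slow scrape

    q.mutex.Lock()
    defer q.mutex.Unlock()
    q.results = append(q.results, v)
}

// collectWide locks before the scrape: no result can be written after a
// concurrent reader finished, but every reader now waits for the full scrape.
func (q *queue) collectWide() {
    q.mutex.Lock()
    defer q.mutex.Unlock()
    q.results = append(q.results, q.scrape())
}

func main() {
    q := &queue{}
    q.collectNarrow()
    q.collectWide()

    q.mutex.Lock()
    defer q.mutex.Unlock()
    fmt.Println(q.results)
}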

vmiStore cache.Store
vmiStats map[string]*queue

mutex *sync.Mutex
Member:

This could just be sync.Mutex, or perhaps even sync.RWMutex;
there can be multiple readers in this case, I think.

Member Author:

  • in our use case we expect to have just 1 reader

  • Collect() can't be a read-lock as it also changes the data struct to delete the key from a finished migration. I thought about read-locking when creating the output result and then a write-lock to remove the key, but this wouldn't work well because the isActive value can become stale between the reading and the write-lock.
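A hypothetical, simplified illustration of the staleness problem mentioned above (the collector, queue, and isActive names are stand-ins, not the PR's code): releasing a read lock before acquiring the write lock leaves a window in which the state that was checked can change.

package main

import "sync"

// Hypothetical, simplified types showing the check-then-act race that a
// read-lock/write-lock split would introduce.
type queue struct{ isActive bool }

type collector struct {
    rw       sync.RWMutex
    vmiStats map[string]*queue
}

func (c *collector) collectSplitLock(key string) {
    c.rw.RLock()
    q, exists := c.vmiStats[key]
    active := exists && q.isActive
    c.rw.RUnlock()

    // Window: another goroutine can flip isActive or delete the key here,
    // so `active` may already be stale when we act on it below.

    if !active {
        c.rw.Lock()
        delete(c.vmiStats, key)
        c.rw.Unlock()
    }
}

func main() {
    c := &collector{vmiStats: map[string]*queue{"vmi": {isActive: true}}}
    c.collectSplitLock("vmi")
}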

Contributor:

question: same for the queue struct?

Member Author:

updated for queue too

@machadovilaca force-pushed the migration-metrics branch 2 times, most recently from c716d4e to 76f01e5 on September 19, 2024 12:47
@fossedihelm (Contributor) commented:

@machadovilaca migration failure is relevant

@machadovilaca (Member, Author) commented Sep 20, 2024

> @machadovilaca migration failure is relevant

These migration metrics are now timestamped in the output,
so they appear in the <metric_name>{<labels...>} <value> <timestamp> format,
which the libinfra package was not expecting.

Updated it to handle both cases.
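For illustration only (this is not the actual libinfra change), a small Go sketch of a regular expression that accepts a metric line both with and without the optional trailing timestamp; the metric name used here is made up.

package main

import (
    "fmt"
    "regexp"
)

// Matches a Prometheus text-format metric line with an optional trailing
// timestamp, e.g.:
//   kubevirt_example_metric{label="a"} 42
//   kubevirt_example_metric{label="a"} 42 1727128800000
var metricLine = regexp.MustCompile(`^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?P<labels>\{[^}]*\})?\s+(?P<value>[^\s]+)(\s+(?P<timestamp>\d+))?$`)

func main() {
    for _, line := range []string{
        `kubevirt_example_metric{label="a"} 42`,
        `kubevirt_example_metric{label="a"} 42 1727128800000`,
    } {
        m := metricLine.FindStringSubmatch(line)
        fmt.Println(m[metricLine.SubexpIndex("value")], m[metricLine.SubexpIndex("timestamp")])
    }
}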

Before this commit, migration data was reported with the domainstats collector
every 30 seconds. The data includes the migrated and remaining memory bytes,
the memory transfer rate and the dirty rate.

Such a long period for an ephemeral job reduces visibility into its progress.
Users were not able to follow the data migration in "real time", and, for most
VMs, the 30-second period caused the data to be reported only once during the job,
and sometimes only after it ended.

To report the data more frequently, we moved the migration stats to a separate collector.
This collector registers an event handler that tracks changes in the VMI status and starts
a job to gather migration information when a new migration begins. The job runs
every 5 seconds (configurable).

Signed-off-by: João Vilaça <[email protected]>
@fossedihelm (Contributor) commented:

Thank you @machadovilaca!
/lgtm
@vladikr I leave it to you to unhold if everything is ok, thanks

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 20, 2024
@kubevirt-commenter-bot commented:

Required labels detected, running phase 2 presubmits:
/test pull-kubevirt-e2e-windows2016
/test pull-kubevirt-e2e-kind-1.27-vgpu
/test pull-kubevirt-e2e-kind-sriov
/test pull-kubevirt-e2e-k8s-1.30-ipv6-sig-network
/test pull-kubevirt-e2e-k8s-1.29-sig-network
/test pull-kubevirt-e2e-k8s-1.29-sig-storage
/test pull-kubevirt-e2e-k8s-1.29-sig-compute
/test pull-kubevirt-e2e-k8s-1.29-sig-operator
/test pull-kubevirt-e2e-k8s-1.30-sig-network
/test pull-kubevirt-e2e-k8s-1.30-sig-storage
/test pull-kubevirt-e2e-k8s-1.30-sig-compute
/test pull-kubevirt-e2e-k8s-1.30-sig-operator

@machadovilaca (Member, Author) commented:

/retest

@vladikr (Member) commented Sep 23, 2024

/unhold

@machadovilaca Thanks!

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 23, 2024
@kubevirt-commenter-bot commented:

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-bot (Contributor) commented Sep 24, 2024

@machadovilaca: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations | ce37ad4 | link | true | /test pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations
pull-kubevirt-e2e-arm64 | 8f40ace | link | false | /test pull-kubevirt-e2e-arm64

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kubevirt-commenter-bot commented:

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-bot kubevirt-bot merged commit 4da8b87 into kubevirt:main Sep 24, 2024
40 checks passed
Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
dco-signoff: yes - Indicates the PR's author has DCO signed all their commits.
lgtm - Indicates that a PR is ready to be merged.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/buildsystem - Denotes an issue or PR that relates to changes in the build system.
sig/observability - Denotes an issue or PR that relates to observability.
sig/scale
size/XL

9 participants