
Increase periodicity in domainstats migration metrics #12441

Merged
4 commits merged into kubevirt:main from the migration-metrics branch on Sep 24, 2024

Conversation

@machadovilaca (Member) commented Jul 23, 2024

What this PR does

Before this PR:

Currently migration data is reported with the domainstats collector every 30 seconds.
This data includes the migrated and remaining memory bytes, the memory transfer rate
and the dirty rate.

Such a long period for an ephemeral job reduces visibility into its progress.
Users are not able to follow the data migration in "real time", and, for most VMs,
the 30-second period causes the data to be reported only once during the job,
and sometimes only after it has ended.

After this PR:

To report the data more frequently, we moved the migration stats to a separate collector.
This collector registers an event handler that tracks changes in the VMI status and starts
a job to gather migration information when a new migration begins. The job runs
every 5 seconds (configurable).
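As a rough illustration of this pattern, here is a minimal, self-contained Go sketch (the names, such as migrationScraper, are hypothetical; this is not the PR's actual code): a per-migration polling job that scrapes stats on a configurable interval, started when a migration begins and stopped when the VMI status reports it as finished.

package main

import (
    "fmt"
    "time"
)

// migrationScraper is a hypothetical stand-in for the per-VMI job described
// above: it polls migration stats on a configurable interval until stopped.
type migrationScraper struct {
    interval time.Duration
    stop     chan struct{}
}

func newMigrationScraper(interval time.Duration) *migrationScraper {
    return &migrationScraper{interval: interval, stop: make(chan struct{})}
}

// run polls until Stop is called (e.g. when the VMI status update handler
// sees the migration end).
func (s *migrationScraper) run(scrape func()) {
    ticker := time.NewTicker(s.interval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            scrape()
        case <-s.stop:
            return
        }
    }
}

func (s *migrationScraper) Stop() { close(s.stop) }

func main() {
    // In the real collector the trigger would be the VMI status event handler;
    // here the loop is started and stopped manually for illustration.
    s := newMigrationScraper(5 * time.Second)
    go s.run(func() { fmt.Println("scrape migration stats from libvirt") })

    time.Sleep(11 * time.Second)
    s.Stop()
}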

jira-ticket: https://issues.redhat.com/browse/CNV-44897

Fixes #

Why we need it and why it was done in this way

The following tradeoffs were made:

The following alternatives were considered:

Links to places where the discussion took place:

Special notes for your reviewer

Checklist

This checklist is not enforced, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note

Increase periodicity in domainstats migration metrics

@kubevirt-bot (Contributor) commented:

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jul 23, 2024
@machadovilaca (Member, Author) commented:

/test all

@kubevirt-bot kubevirt-bot added size/XL sig/buildsystem Denotes an issue or PR that relates to changes in the build system. labels Jul 23, 2024
@kubevirt-bot kubevirt-bot added the sig/observability Denotes an issue or PR that relates to observability. label Jul 23, 2024
@machadovilaca (Member, Author) commented:

/retest

@machadovilaca force-pushed the migration-metrics branch 3 times, most recently from f1b4be0 to a14f0b3 on July 25, 2024 14:40
@machadovilaca force-pushed the migration-metrics branch 2 times, most recently from dd66a66 to 1daba6d on July 25, 2024 14:46
@machadovilaca (Member, Author) commented:

/test all

@machadovilaca machadovilaca marked this pull request as ready for review July 25, 2024 15:00
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 25, 2024
@machadovilaca (Member, Author) commented:

/cc @sradco @assafad @avlitman

@kubevirt-bot kubevirt-bot requested a review from sradco July 25, 2024 15:01
@machadovilaca (Member, Author) commented:

@enp0s3 since we need to know when a VMIM starts and finishes, virt-handler now needs access to the virtualmachineinstancemigrations resource, so I added get, list and watch permissions.

I tried this implementation in virt-controller first, to avoid adding this overhead and these permissions, but virt-controller, unlike virt-handler, is not able to find and connect to the sockets to fetch the info from libvirt.

Can you please take a look?
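For context, a hedged sketch of what such a read-only rule for virtualmachineinstancemigrations could look like, expressed with the standard k8s.io/api/rbac/v1 types (illustrative only; this is not the exact place or form in which KubeVirt declares its virt-handler permissions):

package main

import (
    "fmt"

    rbacv1 "k8s.io/api/rbac/v1"
)

func main() {
    // Hypothetical sketch of the extra permission described above:
    // read-only access for virt-handler to virtualmachineinstancemigrations.
    rule := rbacv1.PolicyRule{
        APIGroups: []string{"kubevirt.io"},
        Resources: []string{"virtualmachineinstancemigrations"},
        Verbs:     []string{"get", "list", "watch"},
    }
    fmt.Printf("%+v\n", rule)
}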

@machadovilaca (Member, Author) commented:

/retest

@machadovilaca (Member, Author) commented:

/retest

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 6, 2024
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 6, 2024
@fossedihelm (Contributor) left a comment:


Thank you @machadovilaca!
I think we are close :)

@@ -28,6 +28,7 @@ import (
 	k6tv1 "kubevirt.io/api/core/v1"

 	"kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/stats"
+	"kubevirt.io/kubevirt/tests/libmonitoring"
Contributor:

I am not sure we want the tests pkg to be imported in the production one.
I remember we wanted to avoid such things. @vladikr thoughts?

Member:

Yes!

Contributor:

So @machadovilaca I think we should remove the second commit, or if you prefer you can create a new /testing pkg under pkg/monitoring/metrics/virt-handler and move the metricMatcher there.
Consider that this pattern is already used elsewhere, i.e. https://github.com/kubevirt/kubevirt/tree/main/pkg/virt-controller/watch/testing
Thank you!

Member Author:

I created it under pkg/monitoring/metrics but named it 'tests'; there are some errors in e2e though,
so I will rename it too.

Contributor:

Sorry, I missed the last commit

Member Author:

updated

pkg/monitoring/metrics/virt-handler/metrics.go (outdated; resolved)
@machadovilaca force-pushed the migration-metrics branch 2 times, most recently from 8f40ace to f20851e on September 17, 2024 10:27
@machadovilaca (Member, Author) commented:

pull-kubevirt-e2e-k8s-1.29-sig-monitoring ✔️

@fossedihelm (Contributor) left a comment:


@machadovilaca Thank you! One final doubt from my side

Comment on lines 106 to 118
values, err := q.scrapeDomainStats()
if err != nil {
log.Log.Reason(err).Errorf("failed to scrape domain stats for VMI %s/%s", q.vmi.Namespace, q.vmi.Name)
return
}

r := result{
vmi: q.vmi.Name,
namespace: q.vmi.Namespace,

domainJobInfo: *values.MigrateDomainJobInfo,
timestamp: time.Now(),
}

q.mutex.Lock()
defer q.mutex.Unlock()
q.results.Value = r
q.results = q.results.Next()
Contributor:

I'm wondering if we should lock before the q.scrapeDomainStats().
Let me explain:

  1. a metric.Collect() call is executed while we are scraping the domain; it will lock the queue because it is reading results through q.all().
  2. the migration finishes at that moment, and the key is deleted from the handler
  3. the queue lock is released so the last results from queue.collect() are written
  4. these results are never consumed

Maybe it's an edge case, since on the next tick queue.collect() will notice that the migration is finished and will cancel the ticker, but those results are never picked up.
Are we fine with that?
Another (maybe worse) case is if an error occurs while we are reading the informer during point 2, which will cause the vmi to be dropped from the vmiStats loop while the migration is still in progress.
Is this something that was considered, and is it an "acceptable" risk?
Thank you

Member Author:

  • For the first point, I think it's fine to lose that last bit of data: in terms of user observability of the migration, not having that data right before the migration finishes won't affect the outcome significantly (and locking before the scrape can lead to much longer locking times).

  • In terms of removing the vmi, it should still be fine: the previous data is still collected, and on the next vmi status update, if the migration is still progressing, it will be re-added.

so IMO we should be fine as we are
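To make the trade-off being discussed concrete, here is a standalone, simplified Go sketch (the queue type, field names, and scrape function are stand-ins, not the PR's code): scraping outside the lock keeps the critical section short but allows the "last result never consumed" window described above, while scraping inside the lock closes that window at the cost of blocking readers for the whole scrape.

package main

import (
    "fmt"
    "sync"
    "time"
)

// Simplified stand-in for the collector's per-VMI queue, used only to
// illustrate the locking trade-off; not KubeVirt's actual types.
type queue struct {
    mutex   sync.Mutex
    results []string
}

// scrape simulates a slow libvirt domain-stats scrape.
func (q *queue) scrape() string {
    time.Sleep(50 * time.Millisecond)
    return fmt.Sprintf("sample@%s", time.Now().Format(time.RFC3339Nano))
}

// collectNarrow is the merged approach: scrape outside the lock, so readers
// are only blocked for the short append, at the cost of the race described
// above (a result written after the last read may never be consumed).
func (q *queue) collectNarrow() {
    v := q.scrape() // no lock held during the slow scrape

    q.mutex.Lock()
    defer q.mutex.Unlock()
    q.results = append(q.results, v)
}

// collectWide locks before the scrape: no result can be written after a
// concurrent reader finished, but every reader now waits for the full scrape.
func (q *queue) collectWide() {
    q.mutex.Lock()
    defer q.mutex.Unlock()
    q.results = append(q.results, q.scrape())
}

func main() {
    q := &queue{}
    q.collectNarrow()
    q.collectWide()

    q.mutex.Lock()
    defer q.mutex.Unlock()
    fmt.Println(q.results)
}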

vmiStore cache.Store
vmiStats map[string]*queue

mutex *sync.Mutex
Member:

This could just be sync.Mutex, or perhaps even sync.RWMutex;
there can be multiple readers in this case, I think.

Member Author:

  • in our use case we expect to have just 1 reader

  • Collect() can't be a read-lock as it also changes the data struct to delete the key from a finished migration. I thought about read-locking when creating the output result and then a write-lock to remove the key, but this wouldn't work well because the isActive value can become stale between the reading and the write-lock.
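A hypothetical, simplified illustration of the staleness problem mentioned above (the collector, queue, and isActive names are stand-ins, not the PR's code): releasing a read lock before acquiring the write lock leaves a window in which the state that was checked can change.

package main

import "sync"

// Hypothetical, simplified types showing the check-then-act race that a
// read-lock/write-lock split would introduce.
type queue struct{ isActive bool }

type collector struct {
    rw       sync.RWMutex
    vmiStats map[string]*queue
}

func (c *collector) collectSplitLock(key string) {
    c.rw.RLock()
    q, exists := c.vmiStats[key]
    active := exists && q.isActive
    c.rw.RUnlock()

    // Window: another goroutine can flip isActive or delete the key here,
    // so `active` may already be stale when we act on it below.

    if !active {
        c.rw.Lock()
        delete(c.vmiStats, key)
        c.rw.Unlock()
    }
}

func main() {
    c := &collector{vmiStats: map[string]*queue{"vmi": {isActive: true}}}
    c.collectSplitLock("vmi")
}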

Contributor:

question: same for the queue struct?

Member Author:

updated for queue too

@machadovilaca force-pushed the migration-metrics branch 2 times, most recently from c716d4e to 76f01e5 on September 19, 2024 12:47
@fossedihelm (Contributor) commented:

@machadovilaca migration failure is relevant

@machadovilaca (Member, Author) commented Sep 20, 2024

> @machadovilaca migration failure is relevant

These migration metrics are now timestamped in the output,
so they appear in the <metric_name>{<labels...>} <value> <timestamp> format,
which the libinfra package was not expecting.

Updated it to handle both cases.
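For illustration only (this is not the actual libinfra change), a small Go sketch of a regular expression that accepts a metric line both with and without the optional trailing timestamp; the metric name used here is made up.

package main

import (
    "fmt"
    "regexp"
)

// Matches a Prometheus text-format metric line with an optional trailing
// timestamp, e.g.:
//   kubevirt_example_metric{label="a"} 42
//   kubevirt_example_metric{label="a"} 42 1727128800000
var metricLine = regexp.MustCompile(`^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?P<labels>\{[^}]*\})?\s+(?P<value>[^\s]+)(\s+(?P<timestamp>\d+))?$`)

func main() {
    for _, line := range []string{
        `kubevirt_example_metric{label="a"} 42`,
        `kubevirt_example_metric{label="a"} 42 1727128800000`,
    } {
        m := metricLine.FindStringSubmatch(line)
        fmt.Println(m[metricLine.SubexpIndex("value")], m[metricLine.SubexpIndex("timestamp")])
    }
}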

Before this commit, migration data was reported with the domainstats collector
every 30 seconds. The data includes the migrated and remaining memory bytes,
the memory transfer rate and the dirty rate.

Such a long period for an ephemeral job reduces visibility into its progress.
Users were not able to follow the data migration in "real time", and, for most
VMs, the 30-second period caused the data to be reported only once during the job,
and sometimes only after it ended.

To report the data more frequently, we moved the migration stats to a separate collector.
This collector registers an event handler that tracks changes in the VMI status and starts
a job to gather migration information when a new migration begins. The job runs
every 5 seconds (configurable).

Signed-off-by: João Vilaça <[email protected]>
@fossedihelm (Contributor) commented:

Thank you @machadovilaca!
/lgtm
@vladikr I leave it to you to unhold if everything is ok, thanks

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 20, 2024
@kubevirt-commenter-bot commented:

Required labels detected, running phase 2 presubmits:
/test pull-kubevirt-e2e-windows2016
/test pull-kubevirt-e2e-kind-1.27-vgpu
/test pull-kubevirt-e2e-kind-sriov
/test pull-kubevirt-e2e-k8s-1.30-ipv6-sig-network
/test pull-kubevirt-e2e-k8s-1.29-sig-network
/test pull-kubevirt-e2e-k8s-1.29-sig-storage
/test pull-kubevirt-e2e-k8s-1.29-sig-compute
/test pull-kubevirt-e2e-k8s-1.29-sig-operator
/test pull-kubevirt-e2e-k8s-1.30-sig-network
/test pull-kubevirt-e2e-k8s-1.30-sig-storage
/test pull-kubevirt-e2e-k8s-1.30-sig-compute
/test pull-kubevirt-e2e-k8s-1.30-sig-operator

@machadovilaca (Member, Author) commented:

/retest

@vladikr (Member) commented Sep 23, 2024

/unhold

@machadovilaca Thanks!

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 23, 2024
@kubevirt-commenter-bot commented:

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-bot (Contributor) commented Sep 24, 2024

@machadovilaca: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations | ce37ad4 | link | true | /test pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations
pull-kubevirt-e2e-arm64 | 8f40ace | link | false | /test pull-kubevirt-e2e-arm64

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kubevirt-commenter-bot commented:

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-bot kubevirt-bot merged commit 4da8b87 into kubevirt:main Sep 24, 2024
40 checks passed
Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
dco-signoff: yes - Indicates the PR's author has DCO signed all their commits.
lgtm - Indicates that a PR is ready to be merged.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/buildsystem - Denotes an issue or PR that relates to changes in the build system.
sig/observability - Denotes an issue or PR that relates to observability.
sig/scale
size/XL

9 participants