Increase periodicity in domainstats migration metrics #12441
Conversation
Skipping CI for Draft Pull Request.
/test all
/retest
Force-pushed from f1b4be0 to a14f0b3.
Force-pushed from dd66a66 to 1daba6d.
/test all
@enp0s3 Since we need to get information on when the VMIM is started and finished, virt-handler now needs to have access to that information. I tried this implementation in virt-controller first, to avoid adding this overhead and these permissions, but virt-controller, unlike virt-handler, is not capable of finding and connecting to the sockets to fetch the info from libvirt. Can you please take a look?
/retest
Force-pushed from 1daba6d to ffed11f.
Thank you @machadovilaca!
I think we are close :)
@@ -28,6 +28,7 @@ import (
 	k6tv1 "kubevirt.io/api/core/v1"

 	"kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/stats"
+	"kubevirt.io/kubevirt/tests/libmonitoring"
I am not sure we want the tests pkg to be imported into the production one. I remember we wanted to avoid such things. @vladikr thoughts?
Yes!
So @machadovilaca, I think we should remove the second commit, or, if you prefer, you can create a new /testing pkg under pkg/monitoring/metrics/virt-handler and move the metricMatcher there. Consider that this pattern is already used elsewhere, e.g. https://github.com/kubevirt/kubevirt/tree/main/pkg/virt-controller/watch/testing
Thank you!
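For illustration, a minimal sketch of what a matcher in such a /testing package could look like. The package and function names here (metricstesting, HaveMetricValue) are hypothetical, not the actual libmonitoring helper; the point is that test-only matchers live outside the production import graph:

```go
package metricstesting

import (
	"fmt"

	"github.com/onsi/gomega/types"
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// metricMatcher is a hypothetical gomega matcher for a collected gauge value.
type metricMatcher struct {
	expected float64
}

// HaveMetricValue returns a matcher asserting that a prometheus.Metric carries
// the given gauge value.
func HaveMetricValue(expected float64) types.GomegaMatcher {
	return &metricMatcher{expected: expected}
}

func (m *metricMatcher) Match(actual interface{}) (bool, error) {
	metric, ok := actual.(prometheus.Metric)
	if !ok {
		return false, fmt.Errorf("expected prometheus.Metric, got %T", actual)
	}
	var d dto.Metric
	if err := metric.Write(&d); err != nil {
		return false, err
	}
	return d.GetGauge().GetValue() == m.expected, nil
}

func (m *metricMatcher) FailureMessage(actual interface{}) string {
	return fmt.Sprintf("expected metric with gauge value %v, got %v", m.expected, actual)
}

func (m *metricMatcher) NegatedFailureMessage(actual interface{}) string {
	return fmt.Sprintf("expected metric not to have gauge value %v", m.expected)
}
```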
I created it under pkg/monitoring/metrics but named it 'tests'; there were some errors in e2e though. Will rename it too.
Sorry, I missed the last commit
updated
Force-pushed from 8f40ace to f20851e.
Signed-off-by: machadovilaca <[email protected]>
Signed-off-by: machadovilaca <[email protected]>
Force-pushed from f20851e to be7121b.
pull-kubevirt-e2e-k8s-1.29-sig-monitoring ✔️
@machadovilaca Thank you! One final doubt from my side
	// Scrape the domain stats for this VMI; on failure, log and skip this tick.
	values, err := q.scrapeDomainStats()
	if err != nil {
		log.Log.Reason(err).Errorf("failed to scrape domain stats for VMI %s/%s", q.vmi.Namespace, q.vmi.Name)
		return
	}

	// Snapshot the migration job info together with the scrape timestamp.
	r := result{
		vmi:       q.vmi.Name,
		namespace: q.vmi.Namespace,

		domainJobInfo: *values.MigrateDomainJobInfo,
		timestamp:     time.Now(),
	}

	// The queue is locked only for the write into the ring buffer.
	q.mutex.Lock()
	defer q.mutex.Unlock()
	q.results.Value = r
	q.results = q.results.Next()
I'm wondering if we should lock before the q.scrapeDomainStats(). Let me explain:

1. a metric Collect() is executed while we are scraping the domain; it locks the queue because it is reading results through q.all()
2. the migration finishes at that moment, and the key is deleted from the handler
3. the queue lock is released, so the last results from queue.collect() are written
4. these results are never consumed

Maybe it's an edge case, since on the next tick queue.collect() will notice that the migration is finished and will cancel the ticker, but those results are never picked up. Are we fine with that?

Another (maybe worse) case is if an error occurs while we are reading the informer during point 2, which would cause the VMI to be dropped from the vmiStats loop while the migration is still in progress. Is this something that was considered, and is it an "acceptable" risk?

Thank you
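For reference, this is roughly what the suggested ordering would look like, reusing the fields and methods from the snippet quoted above (a sketch only, not the PR's code):

```go
// Sketch: take the queue lock before scraping, so a concurrent Collect()
// cannot run between the scrape and the write. The names q.mutex, q.results,
// q.vmi and q.scrapeDomainStats come from the snippet above; everything else
// is an assumption.
func (q *queue) collectLockedFirst() {
	q.mutex.Lock()
	defer q.mutex.Unlock() // held for the whole scrape: closes the race, but lengthens lock times

	values, err := q.scrapeDomainStats()
	if err != nil {
		log.Log.Reason(err).Errorf("failed to scrape domain stats for VMI %s/%s", q.vmi.Namespace, q.vmi.Name)
		return
	}

	q.results.Value = result{
		vmi:           q.vmi.Name,
		namespace:     q.vmi.Namespace,
		domainJobInfo: *values.MigrateDomainJobInfo,
		timestamp:     time.Now(),
	}
	q.results = q.results.Next()
}
```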
- For the first point, I think it's fine to lose that last bit of data: in terms of user observability of the migration, not having that data right before the migration finishes won't affect the outcome significantly (and locking before the scrape can lead to much longer locking times).
- In terms of removing the VMI, it should still be fine: the previous data is still collected, and on the next VMI status update, if the migration is still progressing, it will be re-added.

So IMO we are fine as we are.
	vmiStore cache.Store
	vmiStats map[string]*queue

	mutex *sync.Mutex
This could just be sync.Mutex, or rather even sync.RWMutex; there can be multiple readers in this case, I think.
- In our use case we expect to have just one reader.
- Collect() can't take a read lock, as it also changes the data structure to delete the key of a finished migration. I thought about read-locking while creating the output result and then taking a write lock to remove the key, but this wouldn't work well because the isActive value can become stale between the read and the write lock.
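To make the write-lock requirement concrete, a rough sketch of a Collect() that both reads the buffered results and prunes finished migrations under a single lock (field and method names follow the snippets in this thread; the exact signatures are assumptions):

```go
// Sketch only: not the PR's exact implementation.
func (h *handler) Collect() []result {
	h.mutex.Lock() // a full write lock, because the map may be mutated below
	defer h.mutex.Unlock()

	var all []result
	for key, q := range h.vmiStats {
		all = append(all, q.all()...)
		if !q.isActive() {
			// Pruning here is what rules out an RLock: with a read lock,
			// isActive could go stale between reading the results and a
			// later, separately taken write lock that deletes the key.
			delete(h.vmiStats, key)
		}
	}
	return all
}
```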
Question: same for the queue struct?
updated for queue too
Force-pushed from c716d4e to 76f01e5.
@machadovilaca migration failure is relevant
Force-pushed from 76f01e5 to 6cc9920.
These migration metrics are now timestamped in the output; updated to handle both cases.
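For illustration, attaching the recorded scrape time to the exposed sample can be done with client_golang's NewMetricWithTimestamp; the metric name and labels below are examples only, not necessarily the ones this PR emits:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var migratedBytesDesc = prometheus.NewDesc(
	"kubevirt_vmi_migration_data_processed_bytes", // example metric name
	"Migrated bytes for an in-progress VMI migration.",
	[]string{"namespace", "name"}, nil,
)

// timestampedSample builds a const metric and stamps it with the time the
// stats were scraped, rather than the time of the Prometheus scrape.
func timestampedSample(ts time.Time, value float64, namespace, name string) (prometheus.Metric, error) {
	m, err := prometheus.NewConstMetric(migratedBytesDesc, prometheus.GaugeValue, value, namespace, name)
	if err != nil {
		return nil, err
	}
	return prometheus.NewMetricWithTimestamp(ts, m), nil
}
```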
Before this commit, migration data was reported with the domainstats collector every 30 seconds. The data includes the migrated and remaining memory bytes, the memory transfer rate and the dirty rate. Such a long periodicity for an ephemeral job reduces visibility into its progress. Users were not able to follow the data migration in "real time", and, for most VMs, the 30-second periodicity caused the data to only be reported once during the job, and sometimes only after it had ended. To increase the periodicity, we moved the migration stats to a separate collector. This collector has an event handler that tracks changes in the VMI status and starts a job to gather migration information when a new migration begins. This job runs every 5 seconds (configurable). Signed-off-by: João Vilaça <[email protected]>
Signed-off-by: João Vilaça <[email protected]>
Force-pushed from 6cc9920 to 6a2df53.
Thank you @machadovilaca!
Required labels detected, running phase 2 presubmits:
/retest
/unhold @machadovilaca Thanks!
/retest-required
@machadovilaca: The following tests failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/retest-required
What this PR does
Before this PR:
Currently, migration data is reported with the domainstats collector every 30 seconds.
This data includes the migrated and remaining memory bytes, the memory transfer rate
and the dirty rate.
Such a long periodicity for an ephemeral job reduces visibility into its progress.
Users are not able to follow the data migration in "real time", and, for most VMs,
the 30-second periodicity causes the data to only be reported once during the job,
and sometimes only after it has ended.
After this PR:
To increase the periodicity, the migration stats are moved to a separate collector.
This collector has an event handler that tracks changes in the VMI status and starts
a job to gather migration information when a new migration begins. This job runs
every 5 seconds (configurable).
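For a rough picture of the flow described above (all names here are illustrative, not the PR's actual types): a VMI informer event handler notices a migration starting and launches a per-VMI poller that scrapes migration stats every 5 seconds until told to stop.

```go
package migrationdomainstats

import (
	"sync"
	"time"

	k6tv1 "kubevirt.io/api/core/v1"
)

type poller struct {
	vmi  *k6tv1.VirtualMachineInstance
	stop chan struct{}
}

type handler struct {
	mutex   sync.Mutex
	pollers map[string]*poller
}

// onVMIUpdate is meant to be registered as the UpdateFunc of a VMI informer.
func (h *handler) onVMIUpdate(_, newObj interface{}) {
	vmi, ok := newObj.(*k6tv1.VirtualMachineInstance)
	if !ok || vmi.Status.MigrationState == nil || vmi.Status.MigrationState.Completed {
		return
	}

	h.mutex.Lock()
	defer h.mutex.Unlock()

	key := vmi.Namespace + "/" + vmi.Name
	if _, exists := h.pollers[key]; exists {
		return // this migration is already being tracked
	}

	p := &poller{vmi: vmi, stop: make(chan struct{})}
	h.pollers[key] = p
	go p.run(5 * time.Second) // default period; configurable in the PR
}

func (p *poller) run(period time.Duration) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// scrape the migration domain stats here (migrated/remaining bytes,
			// transfer rate, dirty rate) and buffer them for the next Collect()
		case <-p.stop:
			return
		}
	}
}
```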
jira-ticket: https://issues.redhat.com/browse/CNV-44897
Fixes #
Why we need it and why it was done in this way
The following tradeoffs were made:
The following alternatives were considered:
Links to places where the discussion took place:
Special notes for your reviewer
Checklist
This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.
Release note