fix(resource metrics): throttle resource metrics handling error message, ensure metrics when restarting or resetting metrics #14226

Open
wants to merge 4 commits into
base: release-58

Conversation

thalesmg
Contributor

@thalesmg thalesmg commented Nov 14, 2024

Fixes https://emqx.atlassian.net/browse/EMQX-13496

Release version: v/e5.8.3

Summary

If handling a "hot" metric such as `matched` breaks for any reason, the logs are likely to be flooded with identical error messages. This PR throttles that error message.
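The throttling idea can be illustrated with a minimal sketch. This is written in Python for illustration only and is not the code in this PR (EMQX is written in Erlang). The class name `LogThrottle`, its parameters, and the window length are all hypothetical.

```python
import time


class LogThrottle:
    """Allow at most one log line per key within each time window (hypothetical helper)."""

    def __init__(self, window_seconds, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock            # injectable so the behaviour can be tested
        self.last_emitted = {}        # key -> time of the last emitted message
        self.suppressed = {}          # key -> number of messages dropped so far

    def allow(self, key):
        """Return True if a message for `key` should be logged now."""
        now = self.clock()
        last = self.last_emitted.get(key)
        if last is None or now - last >= self.window_seconds:
            self.last_emitted[key] = now
            return True
        # Inside the window: drop the message but remember that it happened.
        self.suppressed[key] = self.suppressed.get(key, 0) + 1
        return False


def bump_metric(throttle, counters, name, log=print):
    """Bump a counter; on failure, log the error only if the throttle allows it."""
    try:
        counters[name] += 1
    except KeyError:
        if throttle.allow(("metric_bump_failed", name)):
            log(f"failed to bump metric {name!r}: metric not found")
```

With a 60-second window, a `matched` counter that fails on every message would produce one log line per minute instead of one per message. Keying the throttle by message kind and metric name keeps unrelated errors from suppressing each other.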

For reasons that are still unknown (for example, the emqx_metrics_worker process
might die), metrics may be lost for a running resource, and later attempts to bump
them result in errors. As a mitigation, we ensure such metrics are created when the
resource is restarted or its metrics are reset, so that either action recreates them.
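The mitigation amounts to making metric creation idempotent and running it on the restart and reset paths. The following Python sketch shows that pattern under stated assumptions: `MetricsStore` and its methods are hypothetical stand-ins and do not mirror the `emqx_metrics_worker` API.

```python
class MetricsStore:
    """Toy in-memory counter store (hypothetical, for illustration only)."""

    def __init__(self):
        self.counters = {}  # resource_id -> {metric_name: value}

    def ensure_created(self, resource_id, names):
        """Create any missing counters; existing counters keep their values."""
        counters = self.counters.setdefault(resource_id, {})
        for name in names:
            counters.setdefault(name, 0)

    def inc(self, resource_id, name, amount=1):
        """Bump a counter; raises KeyError if the counter was lost."""
        self.counters[resource_id][name] += amount

    def reset(self, resource_id, names):
        """Reset counters to zero, recreating them first if they are missing."""
        self.ensure_created(resource_id, names)
        for name in names:
            self.counters[resource_id][name] = 0

    def drop(self, resource_id):
        """Simulate the unexplained loss of a resource's metrics."""
        self.counters.pop(resource_id, None)
```

In this sketch, once `drop` has removed a resource's counters, every `inc` fails until `reset` (or an explicit `ensure_created`, as a restart would call) runs. Without the `ensure_created` call inside `reset`, the reset itself would leave the counters missing and the errors would continue.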

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

  • Added tests for the changes
  • Added property-based tests for code which performs user input validation
  • Changed lines covered in coverage report
  • Change log has been added to changes/(ce|ee)/(feat|perf|fix|breaking)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • Created PR to emqx-docs if documentation update is required, or link to a follow-up jira ticket
  • Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • If changed package build workflow, pass this action (manual trigger)
  • Change log has been added to changes/ dir for user-facing artifacts update

…rce or resetting their metrics

Fixes https://emqx.atlassian.net/browse/EMQX-13496

The reasons for the original issue that led to `emqx_metrics_worker` losing all metrics
are still unknown.  Here, we introduce some mitigations to avoid more drastic measures
such as restarting the node.
@thalesmg thalesmg force-pushed the 20241114-r58-resource-missing-metrics branch from 11aeb4d to e0ea863 Compare November 14, 2024 15:12
@thalesmg thalesmg changed the title fix(resource metrics): throttle resource metrics handling error message fix(resource metrics): throttle resource metrics handling error message, ensure metrics when restarting or resetting metrics Nov 14, 2024
@thalesmg thalesmg marked this pull request as ready for review November 14, 2024 17:07
@thalesmg thalesmg requested a review from a team as a code owner November 14, 2024 17:07
@zmstone
Member

zmstone commented Nov 14, 2024

Maybe it's an action leaked in the cache?
