Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add failpoint for nospace on puts #16018

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tjungblu
Copy link
Contributor

@tjungblu tjungblu commented Jun 6, 2023

This CR introduces a new failput that will trigger a member to report no space.


This might need some more legwork, it's the first time I'm adding a failpoint here.

member := clus.Procs[rand.Int()%len(clus.Procs)]
for member.IsRunning() {
lg.Info("Setting up gofailpoint", zap.String("failpoint", f.Name()))
err := member.Failpoints().Setup(ctx, f.Name(), "return")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

over time, this might render the cluster entirely unusable - what's the best way to turn this off for a given member? After x-minutes? Only if there's still quorum?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Robustness tests iteration (create cluster, run traffic, inject failpoint and delete cluster) takes usually 5s, max a minute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems I was out of the loop for too long, are the longer nightly linearizability tests not a thing anymore?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are a thing, but we run 100x 5s iteration

if err != nil {
panic(err)
}
if v.LessThan(version.V3_6) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this sensible? I believe this should be easy to backport to 3.5 and 3.4 however

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would not be needed if you used code from goPanicFailpoint that checks list of failpoints exposed by gofail

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added and reused, thanks

lg.Info("Setting up gofailpoint", zap.String("failpoint", f.Name()))
err := member.Failpoints().Setup(ctx, f.Name(), "return")
if err != nil {
lg.Info("goFailpoint setup failed", zap.String("failpoint", f.Name()), zap.Error(err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to assert that failpoint was executed at least once. Please follow #14729 on how to do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, it's much easier than sleep testing

@serathius
Copy link
Member

Please give me more context on why you want to introduce this failpoint. I want to make sure that we not only code it, but also address the overall issues with no space alarms.

@tjungblu
Copy link
Contributor Author

tjungblu commented Jun 7, 2023

Please give me more context on why you want to introduce this failpoint. I want to make sure that we not only code it, but also address the overall issues with no space alarms.

I was briefly seeing a panic after playing around the authbackend implementation paths, which led me to run the linearization test for longer while dropping out a node at random with this alarm. I'll add a more specific test once I've figured out how to properly repro this...

(setting to draft in the meantime)

@tjungblu tjungblu marked this pull request as draft June 7, 2023 12:38
tjungblu added a commit to tjungblu/etcd that referenced this pull request Jun 14, 2023
Adding a new flag to retain e2e etcd process logs after stop and
saving next to the visualized model.

Spun out of etcd-io#16018 where I used it for easier local debugging on model
violations.

Signed-off-by: Thomas Jungblut <[email protected]>
tjungblu added a commit to tjungblu/etcd that referenced this pull request Jun 14, 2023
Adding a new flag to retain e2e etcd process logs after stop and
saving next to the visualized model.

Spun out of etcd-io#16018 where I used it for easier local debugging on model
violations.

Signed-off-by: Thomas Jungblut <[email protected]>
tjungblu added a commit to tjungblu/etcd that referenced this pull request Jun 14, 2023
Adding a new flag to retain e2e etcd process logs after stop and
saving next to the visualized model.

Spun out of etcd-io#16018 where I used it for easier local debugging on model
violations. Fixes etcd-io#15079 partially.

Signed-off-by: Thomas Jungblut <[email protected]>
tjungblu added a commit to tjungblu/etcd that referenced this pull request Jun 14, 2023
Adding a new flag to retain e2e etcd process logs after stop and
saving next to the visualized model.

Spun out of etcd-io#16018 where I used it for easier local debugging on model
violations. Fixes etcd-io#15079 partially.

Signed-off-by: Thomas Jungblut <[email protected]>
tjungblu added a commit to tjungblu/etcd that referenced this pull request Jun 15, 2023
Adding a set of functions which retain e2e etcd process logs after stop and
saving next to the visualized model during robustness tests.

Spun out of etcd-io#16018 where I used it for easier local debugging on model
violations. Fixes etcd-io#15079 partially.

Signed-off-by: Thomas Jungblut <[email protected]>
@stale
Copy link

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@jmhbnz
Copy link
Member

jmhbnz commented Jan 18, 2024

Discussed during sig-etcd triage meeting. @tjungblu do you have capacity to resolve conflicts and finish this off?

@tjungblu
Copy link
Contributor Author

Sure, do you guys want to keep the failpoint? I can remove the remainder.

@tjungblu
Copy link
Contributor Author

rebased, updated and removed the remainder of unrelated changes

@tjungblu tjungblu marked this pull request as ready for review January 19, 2024 14:07
This CR introduces a new failput that will trigger a member to report
no space.

Signed-off-by: Thomas Jungblut <[email protected]>
@k8s-ci-robot
Copy link

@tjungblu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-unit-test-amd64 2213f06 link true /test pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64 2213f06 link true /test pull-etcd-unit-test-arm64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants