test: cleanup killing of microVMs during teardown #4660

roypat · 2024-07-01T13:47:24Z

Changes

Sort out some spurious warnings about "No such process" when killing Firecracker during test teardown. This is done in two steps:

Turn the warning into a hard error (e.g. test failure)
Analyze each failure, and mark the ones that are expected (for example because we expect the Firecracker process to exit during test execution, and thus cannot kill it again during teardown).

Reason

This allows us to catch issues such as #4662

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

If a specific issue led to this PR, this PR closes the issue.
The description of changes is clear and encompassing.
Any required documentation changes (code and docs) are included in this
PR.
API changes follow the Runbook for Firecracker API changes.
User-facing changes are mentioned in CHANGELOG.md.
All added/changed functionality is tested.
New TODOs link to an issue.
Commits meet
contribution quality standards.

This functionality cannot be added in rust-vmm.

codecov · 2024-07-01T13:59:21Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.11%. Comparing base (e439ef8) to head (853ab6a).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4660    /-   ##
=======================================
  Coverage   82.11%   82.11%           
=======================================
  Files         255      255           
  Lines       31261    31261           
=======================================
  Hits        25670    25670           
  Misses       5591     5591

Flag	Coverage Δ
4.14-c5n.metal	`79.60% <ø> (-0.01%)`	⬇️
4.14-m5n.metal	`79.59% <ø> (-0.01%)`	⬇️
4.14-m6a.metal	`78.96% <ø> ( 0.13%)`	⬆️
4.14-m6g.metal	`76.63% <ø> (ø)`
4.14-m6i.metal	`?`
4.14-m7g.metal	`76.63% <ø> (ø)`
5.10-c5n.metal	`82.12% <ø> (ø)`
5.10-m5n.metal	`82.11% <ø> ( <0.01%)`	⬆️
5.10-m6a.metal	`81.54% <ø> ( 0.11%)`	⬆️
5.10-m6g.metal	`79.41% <ø> (ø)`
5.10-m6i.metal	`82.10% <ø> (ø)`
5.10-m7g.metal	`79.41% <ø> (ø)`
6.1-c5n.metal	`82.12% <ø> (-0.01%)`	⬇️
6.1-m5n.metal	`82.11% <ø> (ø)`
6.1-m6a.metal	`81.55% <ø> ( 0.13%)`	⬆️
6.1-m6g.metal	`79.40% <ø> (-0.01%)`	⬇️
6.1-m6i.metal	`82.10% <ø> (ø)`
6.1-m7g.metal	`79.41% <ø> ( <0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

If no Firecracker process with `self.firecracker_pid` is found, this can have two reasons: - Firecracker died unexpectedly. In this case, we want to log a test failure - Firecracker exited expectedly (say, in a negative test for CLI arguments). In this case, we want to be explicit about that in the test (by setting `self._killed` to `True`, or similar). Thus, just silently logging an error message in this scenario is undesirable, as it hides genuine errors. Instead, log the firecracker process log_data and re-raise the exception. Follow up commits will fix up the second case above. Signed-off-by: Patrick Roy <[email protected]>

tests/framework/microvm.py

There's library methods for killing processes in python, so let's just use those instead of shelling out. Also move the `os.kill` call into the try-except block, to get logging in case Firecracker unexpected died before we could kill it. The only change in behavior is that we no longer silently ignore the case where we fail to kill Firecracker because it is already dead. Signed-off-by: Patrick Roy <[email protected]>

The comment about "it being too late to assert" was outdated, as it only applied back when `kill` was called inside of destructors, and before we configured pytest to stop ignoring exceptions in destructors. Nowadays, we call `kill` inside of test teardown, were it is very possible indeed to raise exceptions. So do that. Signed-off-by: Patrick Roy <[email protected]>

- utils.check_output already asserts `rc == 0` - `output.find("firecracker") == -1` is an unidiomatic version of "firecracker" not in output. Signed-off-by: Patrick Roy <[email protected]>

The open-coded subprocess.Popen sequence in Microvm.spawn was equivalent to `utils.check_output(..., shell=False)`, so just use that. Signed-off-by: Patrick Roy <[email protected]>

Call this method on microvms that have met an untimely demise during test execution (for example, because we are specifically testing that firecracker exits correctly after receiving a signal). The function raises an exception if the Firecracker process is not actually dead, to avoid leaking processes. Signed-off-by: Patrick Roy <[email protected]>

tests/integration_tests/functional/test_api.py

tests/integration_tests/functional/test_error_code.py

This is so that we do not try to kill them again in `Microvm.kill`, which would result in a test failure. Also remove various variants of "wait for Firecracker process to go away", as `mark_killed` has a `wait_for_termination` inside of it. Signed-off-by: Patrick Roy <[email protected]>

Use a library function instead of calling out to a shell. Signed-off-by: Patrick Roy <[email protected]>

In `test_signals.py::test_sigxfsz_handler`, we seem to need to wait on average 6 seconds. Prior to unifying all the "wait for FC to die" tests, this one just had an endless loop, and the retry parameters passed to `wait_process_termination` which it is using now instead do not provide sufficient waiting. So simply increase the waiting in `wait_process_termination` to 10s. Signed-off-by: Patrick Roy <[email protected]>

Document how we use `depends_on` to make sure that a failure of the Style step does not prevent all the other integration tests from running. Signed-off-by: Patrick Roy <[email protected]>

All tests that formerly set this property now use `mark_killed`. The only other location that set this value was `kill()` itself, to avoid errors when kill() ends up being called twice. However, this is already managed by the `Microvm._killed` attribute, so this guarding is not necessary. Thus, remove the `expect_kill_by_signal`, as it is useless. Signed-off-by: Patrick Roy <[email protected]>

A PID is an integer, so return it as one. Having it be a string means it cannot be directly passed to things such as `os.kill`. Signed-off-by: Patrick Roy <[email protected]>

Explicitly call out that the whole "grep for jailer_id" dance is a regression test. Signed-off-by: Patrick Roy <[email protected]>

Use os.kill in test_serial_io instead of using a shell to invoke `kill`. Signed-off-by: Patrick Roy <[email protected]>

roypat force-pushed the kill-cleanup branch 7 times, most recently from cbcff7e to a0829de Compare July 2, 2024 14:31

roypat mentioned this pull request Jul 2, 2024

fix(test): Stop guest from panicking in boottime tests on AMD #4662

Merged

9 tasks

roypat force-pushed the kill-cleanup branch from a0829de to 33ccd4d Compare July 2, 2024 15:03

roypat force-pushed the kill-cleanup branch from 33ccd4d to 8f6ed1f Compare July 2, 2024 15:26

roypat self-assigned this Jul 2, 2024

roypat marked this pull request as ready for review July 2, 2024 15:47

roypat added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Jul 2, 2024

roypat requested review from zulinx86 and pb8o July 2, 2024 15:50

pb8o reviewed Jul 2, 2024

View reviewed changes

tests/framework/microvm.py Show resolved Hide resolved

tests/framework/microvm.py Outdated Show resolved Hide resolved

tests/framework/microvm.py Outdated Show resolved Hide resolved

tests/framework/microvm.py Show resolved Hide resolved

roypat added 5 commits July 3, 2024 12:47

test: simplify assertion that ensures Firecracker got killed

dc0c123

- utils.check_output already asserts `rc == 0` - `output.find("firecracker") == -1` is an unidiomatic version of "firecracker" not in output. Signed-off-by: Patrick Roy <[email protected]>

test: use utils.check_output for daemonization

da65f48

The open-coded subprocess.Popen sequence in Microvm.spawn was equivalent to `utils.check_output(..., shell=False)`, so just use that. Signed-off-by: Patrick Roy <[email protected]>

roypat force-pushed the kill-cleanup branch 3 times, most recently from cff853b to 799b4d8 Compare July 3, 2024 12:56

zulinx86 previously approved these changes Jul 4, 2024

View reviewed changes

tests/integration_tests/functional/test_api.py Show resolved Hide resolved

tests/integration_tests/functional/test_error_code.py Outdated Show resolved Hide resolved

roypat added 3 commits July 4, 2024 16:13

refactor(test) simplify utils.wait_process_termination

30b6e2d

Use a library function instead of calling out to a shell. Signed-off-by: Patrick Roy <[email protected]>

roypat added 5 commits July 4, 2024 16:14

docs(ci): document usage of depends_on in pipeline generation

bea7f1c

Document how we use `depends_on` to make sure that a failure of the Style step does not prevent all the other integration tests from running. Signed-off-by: Patrick Roy <[email protected]>

refactor(test): Have Microvm.screen_pid return int instead of str

c35e97d

A PID is an integer, so return it as one. Having it be a string means it cannot be directly passed to things such as `os.kill`. Signed-off-by: Patrick Roy <[email protected]>

refactor(test): Clarify logic in Microvm.kill()

3543d9e

Explicitly call out that the whole "grep for jailer_id" dance is a regression test. Signed-off-by: Patrick Roy <[email protected]>

refactor(test): prefer os.kill over shelling out

551c0bc

Use os.kill in test_serial_io instead of using a shell to invoke `kill`. Signed-off-by: Patrick Roy <[email protected]>

roypat dismissed zulinx86’s stale review via 551c0bc July 4, 2024 15:14

roypat force-pushed the kill-cleanup branch from f563030 to 551c0bc Compare July 4, 2024 15:14

Merge branch 'main' into kill-cleanup

a87774b

roypat requested review from zulinx86 and pb8o July 4, 2024 15:22

zulinx86 approved these changes Jul 4, 2024

View reviewed changes

pb8o approved these changes Jul 8, 2024

View reviewed changes

Merge branch 'main' into kill-cleanup

853ab6a

roypat merged commit 4100110 into firecracker-microvm:main Jul 8, 2024
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: cleanup killing of microVMs during teardown #4660

test: cleanup killing of microVMs during teardown #4660

roypat commented Jul 1, 2024 •

edited

Loading

codecov bot commented Jul 1, 2024 •

edited

Loading

test: cleanup killing of microVMs during teardown #4660

test: cleanup killing of microVMs during teardown #4660

Conversation

roypat commented Jul 1, 2024 • edited Loading

Changes

Reason

License Acceptance

PR Checklist

codecov bot commented Jul 1, 2024 • edited Loading

Codecov Report

roypat commented Jul 1, 2024 •

edited

Loading

codecov bot commented Jul 1, 2024 •

edited

Loading