-
Notifications
You must be signed in to change notification settings - Fork 221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI Rework: More Aggressive Culling Of FPGA Resources, Slack/PR Notifications #1316
Conversation
Started up the workflow monitor for CI run 3625480853 |
Something went wrong in the workflow monitor for CI run 3625480853. Verify CI instances are terminated properly. Must be checked before submitting the PR. |
7ac2d51
to
afb44e5
Compare
570e3a6
to
a337f1e
Compare
While not working 100%. This PR is now ready for the 1st round of reviews. |
8ec629e
to
113120a
Compare
cc4ccc4
to
39dc33d
Compare
This may be adjacent to this PR in particular, but if the aim is to reduce CI costs, you can also try a more aggressive filtering approach right? For example, I think manager/python library changes will run the full gamut of scala tests and chipyard tests, but is that really necessary? |
I agree. However, I'm treating that as future work |
2bbdac2
to
273f6b2
Compare
|
Workflow monitor started for CI run: 3633857500 |
|
Something went wrong in the workflow monitor for CI run 3655073569. Verify CI instances are terminated properly. Must be checked before submitting the PR. Exception Message:
Traceback Message:
|
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
17 similar comments
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR. |
Something went wrong in the workflow monitor for CI run 3661037859. Verify CI instances are terminated properly. Must be checked before submitting the PR. Exception Message:
Traceback Message:
|
796207a
to
46880ce
Compare
@t14916 This should fully work now. Sorry for the delay. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This PR attempts to fix/alleviate some CI issues that have caused resource leakages (of both manager and FPGA instances).
Existing issues found:
It does the following:
workflow_monitor.py
andsetup_workflow_monitor.py
ci_variables.py
. Thus, instead of setting anenv.sh
file that holds needed env. variables, a newci_env
dict is created (see theci_variables.py
changes) and set for the correspondingrun()
calls.check_and_terminate_select_instances
that has a 45m timeout on FPGA instances (seeplatform_lib.py
for more info).try/catch
to check if failures happen (at least it should hopefully send the PR comment if something fails).ci_variables.py
- A newci_env
dict is populated with all env. variables used throughout CI. Instead of using python variable names for environment variables (which don't have the same naming convention) use the env. variable name for the key in the dict. All code locations are changed to account for this. This dict is type checked so that all keys are considered present. Future CI type checking will happen in a followup PR.firesim-cleanup.yml
- This workflow now sends a Slack message if something goes wrong. This is tied to thecull-instances
script now returning an error if it culls something it expected not to cull.cull-old-ci-insts.py
- This now explicitly checks for FPGA instances that are live for over 1hr. Also now exits non-zero if something was terminated unexpectedly (triggers a slack message).{github_}common.py
- Moved GitHub specific items to a new filegithub_common.py
to avoid a circular dependency with theplatform_lib
file. A new issues URL and issue function is added for PR comment posts.platform_lib.py
:find_select_ci_instances
func. for each platform to find "select" CI instances (i.e. FPGA instances) and return them.check_and_terminate_select_instances
func. to find "select" instances and terminate them. If instances are terminated then it sends a PR comment.There are now multiple checks for FPGA instances (for extra extra extra security):
Related PRs / Issues
N/A
UI / API Impact
N/A
Verilog / AGFI Compatibility
N/A
Contributor Checklist
changelog:<topic>
label?ci:fpga-deploy
label?Please Backport
label?Reviewer Checklist (only modified by reviewer)
Note: to run CI on PRs from forks, comment
@Mergifyio copy main
and manage the change from the new PR.changelog:<topic>
label?