Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pool: fix infinite loop with --start-cycle-point #5604

Merged
merged 2 commits into from
Jun 29, 2023

Conversation

oliver-sanders
Copy link
Member

  • Closes pool: infinite loop on startup #5603
  • Fixes an infinite loop which could occur where one or more task sequences terminate before the start point.
  • This triggered a bug whereby the runahead limit was disabled on startup causing an infinite spawning bug.

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

* Closes cylc#5603
* Fixes an infinite loop which could occur where one or more task sequences
  terminate before the start point.
* This triggered a bug whereby the runahead limit was disabled on
  startup causing an infinite spawning bug.
@oliver-sanders oliver-sanders added the bug Something is wrong :( label Jun 28, 2023
@oliver-sanders oliver-sanders added this to the cylc-8.2.0 milestone Jun 28, 2023
@oliver-sanders oliver-sanders self-assigned this Jun 28, 2023
@oliver-sanders oliver-sanders changed the title 5603 pool: fix infinite loop with --start-cycle-point Jun 28, 2023
Comment on lines 702 to 703
self._configure_contact()
await self.configure()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated: Swap the order of these two so that the contact file gets written earlier.

This reduces the window of opportunity for two schedulers to start up for the same workflow. Taking a look at the contact file fields, I can't see anything set during Scheduler.configure which is used in _configure_contact so the order should be arbitrary.

seq.get_first_point(self.config.start_point)
for seq in self.config.sequences
}
if point is not None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem was caused by a None value in this list. Due to the logic lower down, this caused the runahead limit point to be set to None.

* Write the contact file earlier to minimise the window of opportunity
  for two schedulers to start up for the same workflow.
@@ -1807,7 1807,11 @@ async def _shutdown(self, reason: BaseException) -> None:
sys.stdout.flush()
sys.stderr.flush()

if self.contact_data and self.task_job_mgr:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to the previous, because contact_data is set a little earlier, this test needed bodging a little as we could get into the situation where the workflow failed to startup, but did configure its contact file triggering remote-tidy to run on shutdown which caused the database to be created when it shouldn't have been.

Test changed to only run when the database is present, which is a better test, as remote-tidy needs database info to function.

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good.

Copy link
Member

@wxtim wxtim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well that's a horrid bug!

@wxtim wxtim merged commit 8acb8e7 into cylc:master Jun 29, 2023
@oliver-sanders oliver-sanders deleted the 5603 branch June 29, 2023 10:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pool: infinite loop on startup
3 participants