How to prevent a future sub-graph from running #6221

Closed
hjoliver opened this issue Jul 10, 2024 · 5 comments

Comments

@hjoliver
Member

hjoliver commented Jul 10, 2024

We recently kind-of agreed that:

Remove-like functionality is the right way to chop off a bit of graph

However, the cylc remove extension proposal does not achieve that, except for currently active tasks:

  • for past tasks (n<0) it erases flow history, primarily to allow easy re-run in the same flow
  • for active tasks (n=0) it does chop off the downstream graph
  • for future tasks (n>0) it does nothing, and punts the problem to skip mode

[Aside: it's a pity we didn't call it cylc erase]

Skip mode is natural for "skipping over" a bunch of tasks that can easily be identified as a group (e.g. a whole cycle, or a family). But it is not so good for the fairly common use case of preventing an arbitrary future side-graph from running downstream of a particular task. By "arbitrary" I mean, in particular, that I may not have foreseen the need for this and so the entire sub-graph was not configured in advance to be in a family expressly to make use of skip mode easy.

Example: in a multi-model workflow, external circumstances dictate that I no longer need to run model-x in the next cycle, nor, by implication, its entire post-processing and product-generation side-graph. The natural way to do this is to simply force-expire next/model-x (expire means: for external reasons we no longer need to run this task).

However, this will cause a future stall if we did not have the foresight to set model-x:expire? as optional.

Here I am deliberately and knowingly chopping off future graph for a good reason, so a future stall is extremely unhelpful. I can't prevent the unwanted stall; I have to wait for it to happen before I can deal with it.


[Note this is not a contrived use case - it is exactly like clock-expire in every respect except that the external reason for expiration is not linked to the clock time - and hence potentially not easily identified as an expire use case before starting the workflow.]


Proposal

A two-step intervention that makes the potential danger clear to the user:

  1. cylc set --out=expire next/model-x
    • the scheduler expires the task and warns that next/model-x did not complete its required outputs
    • without further intervention this will cause a future stall when the flow reaches the task
  2. cylc remove --expire next/model-x
    • prompt: warning: this will cut the graph off at next/model-x, do you really want to do this?
    • has the same effect as removing the task once the future stall has occurred, so the DB must record the removal rather than simply erase the history (which would cause the task to run again when the flow reaches it)
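
To illustrate why step 2 has to record the removal rather than erase history, here is a minimal Python sketch. It is a toy model only: `ToyFlowDB`, its method names and the returned strings are all hypothetical and are not Cylc's real database or API.

```python
class ToyFlowDB:
    """Hypothetical stand-in for the scheduler's flow history (not Cylc's DB)."""

    def __init__(self):
        # task id -> recorded state ("expired" or "removed")
        self.history = {}

    def expire(self, task):
        """Step 1: the task is expired without its required outputs."""
        self.history[task] = "expired"

    def erase(self, task):
        """Erase history: afterwards it is as if the task was never seen."""
        self.history.pop(task, None)

    def record_removal(self, task):
        """Step 2 as proposed: keep a record that the task was removed."""
        self.history[task] = "removed"

    def on_flow_arrival(self, task):
        """What happens when the flow reaches `task` (toy outcomes only)."""
        state = self.history.get(task)
        if state is None:
            return "run"    # no history, so the task runs as normal
        if state == "expired":
            return "stall"  # incomplete task left behind: future stall
        return "cut"        # recorded removal: graph ends here, no stall


db = ToyFlowDB()
db.expire("next/model-x")
after_expire = db.on_flow_arrival("next/model-x")

db.erase("next/model-x")
after_erase = db.on_flow_arrival("next/model-x")

db.record_removal("next/model-x")
after_removal = db.on_flow_arrival("next/model-x")
```

In this toy model, erasing the history puts the task back in line to run, which is the opposite of what was intended; only a recorded removal cuts the graph without a stall.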
@hjoliver hjoliver added this to the 8.x milestone Jul 10, 2024
@oliver-sanders oliver-sanders added the question Flag this as a question for the next Cylc project meeting. label Jul 12, 2024
@oliver-sanders
Member

oliver-sanders commented Jul 12, 2024

At Cylc 7 we only had explicit task "subtraction" [1]. I.e., if you wanted to skip a chain of tasks or a sub-system within a workflow, you had to group them explicitly, e.g. by using a family. These Cylc 7 use cases can all be handled by skip-mode, which can match the same behaviour.

This issue suggests a new mechanism for implicit task "subtraction" where we determine the tasks to "subtract" by traversing downstream of the selected task(s) [2]. This avoids the need to pre-group tasks for intervention purposes. Presumably for your use cases, it is not possible to pre-empt the chains you want to "subtract" ruling out explicit "subtraction"?

In a strange way, this is actually a similar problem to "reflow" in that both are, in effect, performing implicit task selection by traversing downstream from selected task(s). Reflow implicitly determines the tasks to run, whereas implicit "subtraction" determines the tasks not to run.

Implicit "subtraction" shares a similar difficulty with implicit reflow regarding downstream consequences. E.g., if we explicitly "subtract" the task r, then we will implicitly subtract the chain x => y => z as intended:

a => r => x => y => z
a => housekeep

However, if we add this inter-cycle dependency:

x[-P1] => x

Or this intra-cycle dependency:

z => housekeep

Then the workflow will subsequently stall as a result of the "subtraction". Cylc doesn't presently have a graph-traversal utility to inform the user what is downstream of a selected task, and most real-world graphs are difficult to inspect graphically, so at present it is very hard for the user to tell whether a "subtraction" operation will cause a subsequent stall or not.
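
As a rough illustration of what such a traversal utility might report, here is a minimal Python sketch. It is not Cylc code: the edge-list representation and the function names `downstream` and `stall_risks` are assumptions made for this example, and the check is a heuristic, not the scheduler's actual stall logic.

```python
from collections import deque

# Toy graph from the example above; each pair is (upstream, downstream).
EDGES = [
    ("a", "r"), ("r", "x"), ("x", "y"), ("y", "z"),
    ("a", "housekeep"),
]


def downstream(edges, start):
    """Return every task reachable downstream of `start` (excluding it)."""
    children = {}
    for up, down in edges:
        children.setdefault(up, set()).add(down)
    seen, queue = set(), deque([start])
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def stall_risks(edges, start):
    """Tasks in the cut-off set that also have a parent outside it.

    Such a task can still be spawned by its other parent and would then
    wait on the subtracted branch, which is the future stall in question.
    """
    cut = downstream(edges, start) | {start}
    return {
        down for up, down in edges
        if down in cut and down != start and up not in cut
    }
```

With the graph as first written, `stall_risks(EDGES, "r")` is empty. Appending `("z", "housekeep")` flags `housekeep`, because `a` can still spawn it. The inter-cycle case can be expressed the same way by using cycle-qualified IDs as node names, e.g. an edge `("1/x", "2/x")` flags `2/x` when `2/r` is subtracted.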

So how do we avoid the potential for an unintended future stall?

We have talked about termination mechanisms to restrict the scope of reflow, and these could also apply here. However, these methods rely on grouping, which defeats the object of implicit task selection.

[1] Note: Using "subtract" in place of "remove" to avoid confusion with cylc remove. By "subtract" I mean subtracting a task from the graph.
[2] Note: The implementation might not actually traverse the graph, but that is, in effect, exactly what it is doing.

@hjoliver
Member Author

hjoliver commented Jul 16, 2024

> This issue suggests a new mechanism for implicit task "subtraction" where we determine the tasks to "subtract" by traversing downstream of the selected task(s) [2]. This avoids the need to pre-group tasks for intervention purposes.

Yes, but I don't entirely agree with the characterization of this as "implicit", although I understand what you mean.

a => b => c => d & e  # with a currently running

If I explicitly expire c, and I'm not a complete idiot (which is arguable, to be fair), then I have explicitly told the scheduler not to run c and, by obvious implication, anything downstream of c.

That this should cause a stall is predicated on the following:

  • the intervention may have been a mistake
  • and (1) (if it was a mistake) silently omitting a bit of the graph is bad

But I'm countering this with the following:

  • the intervention likely was not a mistake
  • and (2) (if it was not a mistake) causing an unwanted future stall is bad

In fact (1) and (2) could be equally bad - they could both radically delay throughput at a time when I'm not around to fix the problem - by (1) restarting and triggering after a premature shutdown; or (2) removing the expired task causing the unwanted stall.

So I'm just saying our response to an intervention - especially when the consequences may occur later in time - should not be entirely based on assuming the user might have been wrong to do it in the first place.

> Presumably for your use cases, it is not possible to pre-empt the chains you want to "subtract" ruling out explicit "subtraction"?

I presume by pre-empt you mean pre-emptively configuring a family that covers the whole sub-graph? Well I suppose it's always possible, but that isn't really good enough. The point is, I might not have done so, for whatever reason. Task families are primarily for inheritance of runtime settings, so it's perfectly reasonable not to think ahead and create additional families just in case they may be needed for particular interventions at run time. During development in particular, it's useful to be able to run or block completely arbitrary sub-graphs.

@hjoliver
Member Author

> So how do we avoid the potential for an unintended future stall?

Unfortunately we can't know if the user "understands" what they are doing when they perform a potentially dangerous intervention. But despite that, such interventions are sometimes necessary.

So I think the best we can do is to warn the user and require explicit opt-in after the warning is issued.

(And by the way, even premature shutdown is not that hard to recover from in Cylc 8 - it's certainly not on the same scale as mistaken use of rm -rf for example).
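
A minimal Python sketch of the warn-then-opt-in idea, for illustration only. The function name `confirm_cut` and the `[y/N]` convention are assumptions, not an existing Cylc interface; the warning text is the one from the proposal above.

```python
def confirm_cut(task, answer_fn=input):
    """Warn about cutting the graph at `task`; return True only on opt-in.

    `answer_fn` receives the prompt and returns the user's reply. It
    defaults to the built-in `input`, and can be swapped out in tests.
    """
    prompt = (
        f"warning: this will cut the graph off at {task}, "
        "do you really want to do this? [y/N] "
    )
    # Anything other than an explicit yes is treated as a refusal.
    return answer_fn(prompt).strip().lower() in ("y", "yes")
```

Defaulting to "no" means an accidental Enter leaves the graph untouched, so the dangerous path is only taken deliberately.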

@hjoliver
Member Author

hjoliver commented Jul 16, 2024

> Then the workflow will subsequently stall as a result of the "subtraction". Cylc doesn't presently have a graph-traversal utility to inform the user what is downstream of a selected task, and most real-world graphs are difficult to inspect graphically, so at present it is very hard for the user to tell whether a "subtraction" operation will cause a subsequent stall or not.

You raise a good point there ... posting a new issue: #6237

@hjoliver
Member Author

I'm closing this, not because it's invalid, but because the way I put the problem unfortunately created the impression of a competition between "branch cutting" and "skipping" as ways to prevent a future sub-graph from running. What really matters is that we need the ability to head off future stalls (due to future final-incomplete tasks) without having to wait for the inevitable stall to actually happen. That ability would enable branch cutting as an option where appropriate, but without circumventing our output completion safeguards. I'll put up a new issue for that.

@hjoliver hjoliver added superseded and removed question Flag this as a question for the next Cylc project meeting. labels Sep 24, 2024
@hjoliver hjoliver removed this from the 8.x milestone Sep 24, 2024