feat(scheduler): Add max elapsed duration for model load/unload #5819

Merged

merged 11 commits into SeldonIO:v2 on Aug 5, 2024
Conversation

sakoush
Member

@sakoush sakoush commented Aug 4, 2024

Previously we had a fixed elapsed duration, defaulting to 15 minutes, for model load retries. For loading large LLMs (e.g. Llama 3 70B), this fixed max elapsed time effectively allowed no retries, as a single load attempt alone could take most of that limit given the size of the model.

This change introduces the following:

  • Allow defining a max elapsed duration for model load (defaults to 2 hours) and unload (defaults to 15 minutes).
  • Increase the timeout of an individual load call to 1 hour.
  • Reduce the number of retries from 10 to 5.

On a retry, HF (Hugging Face) can resume a download that did not complete within a single call.

Note that this change does not yet make these values user-configurable.

Fixes: INFRA-1114 (internal).
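For context, a minimal sketch of how these defaults could be wired together, assuming the cenkalti/backoff package (the source of backoff.RetryNotify referenced below); names like loadModel and the constants here are illustrative, not the actual scheduler code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

const (
	maxElapsedTimeLoad   = 2 * time.Hour    // overall retry budget for load (new default)
	maxElapsedTimeUnload = 15 * time.Minute // overall retry budget for unload
	perCallLoadTimeout   = 1 * time.Hour    // timeout for a single load call
	maxLoadRetryCount    = 5                // reduced from 10
)

// loadModel is a stand-in for the scheduler's call to the model server.
func loadModel(ctx context.Context) error {
	// ... issue the load request here ...
	return errors.New("transient failure")
}

func loadWithRetries() error {
	exp := backoff.NewExponentialBackOff()
	exp.MaxElapsedTime = maxElapsedTimeLoad // stop retrying once 2h have elapsed overall

	// Bound the retries by count as well as by elapsed time.
	policy := backoff.WithMaxRetries(exp, maxLoadRetryCount)

	op := func() error {
		// Each individual attempt gets its own 1h timeout.
		ctx, cancel := context.WithTimeout(context.Background(), perCallLoadTimeout)
		defer cancel()
		return loadModel(ctx)
	}

	notify := func(err error, next time.Duration) {
		fmt.Printf("load failed: %v, retrying in %s\n", err, next)
	}

	return backoff.RetryNotify(op, policy, notify)
}

func main() {
	fmt.Println(loadWithRetries())
}
```

The PR itself uses its own backOffWithMaxCount wrapper (discussed below) rather than backoff.WithMaxRetries, but the overall shape is the same: a per-call timeout inside an operation retried under both an elapsed-time and a count limit.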

@sakoush sakoush requested a review from lc525 as a code owner August 4, 2024 12:46
@sakoush sakoush added the v2 label Aug 4, 2024
@@ -47,6 +47,10 @@ const (
maxElapsedTimeReadySubServiceBeforeStart = 15 * time.Minute // 15 mins is the default MaxElapsedTime
// period for subservice ready "cron"
periodReadySubService = 60 * time.Second
// max time to wait for a model server to load a model, including retrues
Member
nit: s/retrues/retries/

@@ -112,7 +111,7 @@ func (b *backOffWithMaxCount) Reset() {
}

func (b *backOffWithMaxCount) NextBackOff() time.Duration {
-	if b.currentCount >= b.maxCount {
+	if b.currentCount >= b.maxCount-1 {
Member
Just to confirm: does this now retry once if b.maxCount is one? Or do we now consider the initial function call as part of maxCount?

Member Author
backoff.RetryNotify will run the function at least once; if an error is returned, it then uses the backoff policy to decide on retries. From that perspective the previous code was slightly wrong, and it is now fixed in the test as well (TestBackOffPolicyWithMaxCount).
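For illustration, a minimal sketch of a counting wrapper in the spirit of backOffWithMaxCount, assuming the cenkalti/backoff package; this is not the PR's actual implementation, and whether the initial call counts toward the limit is exactly the semantic discussed below:

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// countingBackOff is an illustrative wrapper: it delegates delays to an inner
// policy but returns backoff.Stop after a fixed number of retries.
type countingBackOff struct {
	inner        backoff.BackOff
	maxCount     int
	currentCount int
}

func (b *countingBackOff) Reset() {
	b.currentCount = 0
	b.inner.Reset()
}

func (b *countingBackOff) NextBackOff() time.Duration {
	// NextBackOff is only consulted after a failed attempt, so the operation has
	// already run once before the first check here. In this sketch, `>= maxCount-1`
	// bounds the total number of runs to maxCount (1 initial + maxCount-1 retries).
	if b.currentCount >= b.maxCount-1 {
		return backoff.Stop
	}
	b.currentCount++
	return b.inner.NextBackOff()
}

func main() {
	policy := &countingBackOff{inner: backoff.NewExponentialBackOff(), maxCount: 3}

	attempts := 0
	op := func() error {
		attempts++
		return errors.New("still failing")
	}
	notify := func(err error, next time.Duration) {
		fmt.Printf("attempt %d failed: %v, next retry in %s\n", attempts, err, next)
	}

	// RetryNotify always runs op at least once, then consults NextBackOff after
	// each failure; backoff.Stop ends the retries.
	_ = backoff.RetryNotify(op, policy, notify)
	fmt.Println("total attempts:", attempts) // 3 with the counting used in this sketch
}
```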

Member

@lc525 lc525 Aug 5, 2024
I guess the question is whether, for a config parameter named something like max*RetryCount, one would expect a maximum of max*RetryCount retries, or that the function runs a total of max*RetryCount times. Either way should be fine as long as we're explicit to users (when this becomes configurable externally).

Member Author

@sakoush sakoush Aug 5, 2024
I added a ticket to create docs clarifying these semantic differences and what we actually do in Core 2. Currently the behaviour is: a maximum of max*RetryCount retries, bounded by the maximum elapsed duration.

Member

@lc525 lc525 left a comment
lgtm, agreed that we should probably make this configurable

@sakoush sakoush merged commit 4b233f9 into SeldonIO:v2 Aug 5, 2024
4 checks passed