feat(scheduler): Add max elapsed duration for model load/unload #5819

sakoush · 2024-08-04T12:46:28Z

Previously, model load retries were bounded by a fixed max elapsed duration that defaulted to 15 minutes. For large LLMs (e.g. llama3 70b) this left no room for any retries, because the size of the model means a single load can take up most or all of that limit.

This change introduces the following:

Allow defining a max elapsed duration for model load (defaults to 2 hours) and unload (defaults to 15 minutes).
Increase the timeout of an individual load call to 1 hour.
Reduce the default number of retries from 10 to 5.

On a retry, hf can resume a download that did not finish completely within a single call.

Note that this change does not yet make these values configurable by the user.

Fixes: INFRA-1114 (internal).

lc525 · 2024-08-05T08:31:44Z

scheduler/cmd/agent/main.go

@@ -47,6 +47,10 @@ const (
 	maxElapsedTimeReadySubServiceBeforeStart = 15 * time.Minute // 15 mins is the default MaxElapsedTime
 	// period for subservice ready "cron"
 	periodReadySubService = 60 * time.Second
+	// max time to wait for a model server to load a model, including retrues


nit: s/retrues/retries/

lc525 · 2024-08-05T08:43:18Z

scheduler/pkg/agent/client_utils.go

@@ -112,7 +111,7 @@ func (b *backOffWithMaxCount) Reset() {
 }

 func (b *backOffWithMaxCount) NextBackOff() time.Duration {
-	if b.currentCount >= b.maxCount {
+	if b.currentCount >= b.maxCount-1 {


Just to confirm: does this now retry once if b.maxCount is one? Or do we now consider the initial function call as part of the "maxCount"?

backoff.RetryNotify runs the function at least once and, if an error is returned, uses the backoff policy to decide on retries. From that perspective the code above was slightly wrong; it is now fixed, and the test TestBackOffPolicyWithMaxCount has been updated as well.
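The counting difference between the two conditions in the hunk can be sketched with the standard library only. This is a simplified stand-in, not the real `backOffWithMaxCount` or the backoff library: `maxCountPolicy`, `retry`, `countCalls`, and the `inclusive` flag are hypothetical names, and `retry` only imitates the "run once, then consult the policy after each failure" behaviour described above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// stop is a sentinel meaning "no more retries", standing in for backoff.Stop.
const stop time.Duration = -1

// maxCountPolicy is a hypothetical stand-in for backOffWithMaxCount.
type maxCountPolicy struct {
	maxCount     int
	currentCount int
	inclusive    bool // true: compare against maxCount-1, as in the "+" line of the hunk
}

// NextBackOff returns stop once the count limit is reached, otherwise a zero wait.
func (p *maxCountPolicy) NextBackOff() time.Duration {
	limit := p.maxCount
	if p.inclusive {
		limit = p.maxCount - 1
	}
	if p.currentCount >= limit {
		return stop
	}
	p.currentCount++
	return 0 // no real waiting in this sketch
}

// retry runs op at least once and consults the policy only after a failure.
func retry(op func() error, p *maxCountPolicy) error {
	for {
		err := op()
		if err == nil {
			return nil
		}
		d := p.NextBackOff()
		if d == stop {
			return err
		}
		time.Sleep(d)
	}
}

// countCalls reports how many times an always-failing op is invoked.
func countCalls(maxCount int, inclusive bool) int {
	calls := 0
	_ = retry(func() error {
		calls++
		return errors.New("load failed")
	}, &maxCountPolicy{maxCount: maxCount, inclusive: inclusive})
	return calls
}

func main() {
	fmt.Println(countCalls(1, false), countCalls(1, true)) // 2 1
	fmt.Println(countCalls(5, false), countCalls(5, true)) // 6 5
}
```

In this sketch the `>= maxCount` form yields maxCount retries on top of the initial call, while the `>= maxCount-1` form yields maxCount calls in total. How the real parameter should be described to users is left to the discussion in this thread.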

I guess the question is whether, for a config parameter named something like max*RetryCount, one would expect a maximum of max*RetryCount retries, or that the function runs a total of max*RetryCount times. Either way should be fine as long as we're explicit to users (when this becomes configurable externally).

I added a ticket to create docs clarifying these semantic differences and what we actually do in Core 2.
Currently we have the former: a maximum of max*RetryCount retries, bounded by the maximum elapsed duration.

lc525

lgtm, agreed that we should probably make this configurable

sakoush added 8 commits August 4, 2024 12:31

add max load elapsed time in client settings

57c711a

add default maz elapsed time to 2 hours

923d42e

increase default a single load operation timeout to an hour

f5828d8

adjust test after api change

f736553

remove outdate comment

3912468

add max unload elapsed time, defaulting to 15 minutes including retries.

2ba222a

add test coverage

d13234a

fix fmt

d220e47

sakoush requested a review from lc525 as a code owner August 4, 2024 12:46

sakoush added the v2 label Aug 4, 2024

lc525 reviewed Aug 5, 2024

lc525 approved these changes Aug 5, 2024

sakoush added 3 commits August 5, 2024 12:47

fix spelling mistake

73120b2

reduce the numner of retries to 5 by default

b430239

add rename test

00b425b

sakoush merged commit 4b233f9 into SeldonIO:v2 Aug 5, 2024
4 checks passed
