[ci-matrix] Distribute unknown packages between shards. #339
base: main
Conversation
Currently all new packages which are not in the timing files are assigned to a single bucket, the one with the shortest duration. This happens because the duration of "unknown" packages is set to 0, but in reality it might be longer. Empty packages with no tests also have a 0s duration, yet there is little overhead to run them. To avoid all such packages piling into a single bucket, change how they are distributed between shards: if a package's run duration is lower than a threshold, assign it using a hash function instead of the minimal bucket.
P.S. We might want to make this feature opt-in by providing the threshold via a flag.
Thank you for the PR! This would be a great thing to improve. It has been a little while since I've looked at this code. I believe the only special case it was handling previously was when no packages had any timing data; in that case it used round robin to distribute the packages. Generalizing it so that all packages with 0 timing data are round-robin distributed seems like a great improvement. A couple of questions for you about this approach. What's the reason for using a threshold of 5ms instead of 0? Is it common for your packages to start out with very few tests? What do you think about an approach like this:
Thanks for the reply!
It is just an arbitrary number which I picked to treat a test as negligible. Even 100 tests of 5ms each won't increase the duration of a bucket significantly, so it might make sense to distribute those among buckets. But I feel we might want to make this value configurable via a flag. What do you think?
100% to this. It is better to avoid adding additional tests to the "longest" bucket. I was thinking: is it enough just to exclude one bucket, or some arbitrary share of buckets such as 20% but no less than 1? A few comments on round robin vs consistent assignment, since we need to agree which path to choose. I see two ways: round robin, or consistent hash-based assignment.
While the cache hit rate for 0s packages is not important, new packages which are not in the distribution file are also treated as 0s, but in reality they will run longer. If the 2nd approach is used, it is better to stay consistent and assign a specific package to the same bucket so the Go build cache can be reused. To stay consistent, the round robin approach cannot be used, because a new package would shift all packages after it by one. Let me know if you think this is something you'd like to adopt and I can make changes to this PR.
After some investigation I've found that the best way is just to ignore packages without tests at all. @dnephin, what do you think about excluding packages without tests from the matrix?
One important detail, that may not be obvious, is that the packages are sorted by elapsed time before attempting to place them into buckets. This means that packages with 0 elapsed time will always be last. I think that makes it safe to round-robin them, because only new packages or packages with no tests could come later. It will never impact existing packages with non-zero elapsed time.
How do we know if the package has no tests, or if the package is new and has no previous runtime? I'm not sure if this is safe, because it sounds like it could accidentally skip tests in new packages.
Do you think it is common for projects to have that many packages without tests? I've seen projects with a few small packages with no tests (maybe 10-20 packages), but generally the time to compile those packages is negligible compared to the time required to run all the other tests.
Yeah, I agree with you, a standard project probably won't have many packages without tests. On the flip side, empty packages can be excluded without any other changes. I'll update this PR soon to support the round robin approach.
func consistentBucket(pkg string, n uint) int {
	h := md5.New()
	io.WriteString(h, pkg)
	hash := fmt.Sprintf("%x", h.Sum(nil))
	decimal, _ := strconv.ParseUint(hash[:10], 16, 64)
	return int(decimal) % int(n)
}
Maybe use maphash.Hash{}?
Suggested change:

var hashSeed = maphash.MakeSeed()

func consistentBucket(pkg string, n uint) int {
	return int(maphash.String(hashSeed, pkg) % uint64(n))
}
maphash was designed specifically for this case.
Pros:
- fastest hash in the standard library.
- zero allocations.
Thanks!
Currently all new packages which are not in the timing files are assigned to a single bucket, the one with the shortest duration. This happens because the duration of "unknown" packages is set to 0, but in reality it might be longer. There are also empty packages with no tests which have a 0s duration, but there is little overhead to run those tests.
This patch changes how those packages are distributed between shards, to avoid all of them being added to a single bucket. If a package's run duration is lower than the threshold, it is assigned using a hash function instead of the minimal bucket. The hash function is consistent, which means the same package is assigned to the same bucket as long as the package's name and the number of buckets do not change. This helps to reuse the Go build cache.
The bucket is chosen by calculating the md5 hash of the package's name, taking part of the hash, converting it to an integer, and taking the modulus by the bucket count. Only part of the hash is converted to an integer to avoid overflow: converting a full md5 hex digest to an integer might give a result bigger than int64. I've tested the distribution of this method and it seems to be pretty balanced; the 1st column shows how many packages were assigned to each bucket. This approach works best when the sharding strategy is not updated on each test run but is reused for an extended period (e.g. a week), because new tests will be distributed among shards.