
Investigate higher CPU benchmark stdev on the bare metal server for WebPageReplay tests
Closed, Resolved (Public)

Description

Over the last few days we have had multiple alerts for the standard deviation (stdev) of the CPU benchmark on the bare metal server. The graph looks like this:

Screenshot 2023-09-07 at 08.27.44.png (1×2 px, 377 KB)

It looks like something happened in early September. Let me investigate what's going on.

Event Timeline

It was hard to actually see any change on the server. Yesterday I enabled collectd on a test server to verify that it wouldn't add any overhead. It looked good, so today I enabled it on the bare metal server. I plan to do the same on the cloud servers that will be pushed on Monday.

One thing I did see was that there were no pushes on the server at that time (no upgrade, no changes). However, one thing that looks suspicious is that we started to run Wiki Loves Monuments at the same time as the stdev increased. If that's the cause, it would mean that we run some JavaScript at the same time as we do the CPU benchmark. Maybe we collect the metrics too early?

I've been working on this today to make the measurement better. I tried using a worker, the way we do in our RUM, but that can be blocked by CSP, so I think it's better to run the CPU benchmark just before the test starts. I haven't implemented that yet though.
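To illustrate the idea of running the benchmark before the test starts, here is a minimal sketch in Python. It is not the actual implementation (the real benchmark runs as JavaScript in the browser); the names `cpu_benchmark`, `run_with_benchmark` and the key `cpuBenchmarkMs` are hypothetical, and the busy loop is only a stand-in for the real workload.

```python
import time
from typing import Callable


def cpu_benchmark(iterations: int = 200_000) -> float:
    """Time a fixed busy loop and return the elapsed time in milliseconds.

    Hypothetical stand-in for the real CPU benchmark: the same amount of
    work on every run, so a slower or busier CPU gives a higher number.
    """
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i * i
    return (time.perf_counter() - start) * 1000.0


def run_with_benchmark(test: Callable[[], object]) -> dict:
    """Run the CPU benchmark first, then the test.

    The benchmark finishes before the test starts, so nothing the test
    does (for example JavaScript on the page) can overlap with it.
    """
    benchmark_ms = cpu_benchmark()
    result = test()
    return {"cpuBenchmarkMs": benchmark_ms, "result": result}
```

Calling `run_with_benchmark(lambda: "page loaded")` returns the benchmark time together with whatever the test returns. The point of the ordering is that the benchmark number no longer depends on what the page under test happens to execute.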

I updated the physical machine today with the latest kernel and other updates and rebooted it, hoping that will make a difference.

Hmm, this is annoying. I ran the same tests on another bare metal server and I can see that the standard deviation there is lower, testing the exact same page using WebPageReplay. The only difference I know of is that the other server is running an older version of Ubuntu.

Screenshot 2023-09-18 at 11.21.57.png (1×2 px, 222 KB)

I've been going through the processes on the server and everything looks OK. One problem is that it takes quite a long time before I get feedback on whether a change on the server makes any difference, and I plan to fix that in T346637.

Something happened there at the end of August, but I haven't been able to trace it back to any change made at that time. I also tried to reproduce unstable metrics by running the CPU benchmark independently of any page, but locally that shows the same value as today. I'll probably push a change where we collect the CPU metric before any test runs, just to be 100% sure there's no mix-up in how the metric is collected.

This is strange: the stdev went back down, and it seems to correlate almost exactly with the month of September. It went up on the first of September and went down on the first of October:

Screenshot 2023-10-04 at 08.37.18.png (1×2 px, 610 KB)
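One way to check a calendar-month correlation like this without reading it off a graph is to group the benchmark samples by month and compare the standard deviation per group. The sketch below assumes the samples are available as `(date, value)` pairs; the function name `stdev_by_month` and that input shape are assumptions, not how the metrics are actually stored.

```python
from collections import defaultdict
from datetime import date
from statistics import stdev
from typing import Dict, Iterable, List, Tuple


def stdev_by_month(
    samples: Iterable[Tuple[date, float]],
) -> Dict[Tuple[int, int], float]:
    """Return the sample standard deviation per (year, month).

    Months with fewer than two samples are skipped, since the sample
    standard deviation is undefined for a single value.
    """
    groups: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for day, value in samples:
        groups[(day.year, day.month)].append(value)
    return {
        month: stdev(values)
        for month, values in sorted(groups.items())
        if len(values) >= 2
    }
```

If the September group stands out clearly from August and October while the per-month means stay similar, that supports the idea that something tied to the month itself (rather than a change on the server) caused the extra variation.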

This is not an issue anymore.