
Investigate snapshot restore performance on host kernel >= 5.4 #2129

Closed
serban300 opened this issue Sep 11, 2020 · 13 comments
Labels: Priority: Medium

@serban300 (Contributor)

The snapshot restore operation is slower on AMD than on Intel.

We should investigate why, and either fix the root cause or change the test_snapshot_resume_latency test so that it checks different values for AMD and Intel.
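
For reference, a rough sketch of how just this test might be run locally; the devtool entry point and the pytest -k filter are assumptions about the local setup, not something taken from this issue:

# Run only the snapshot resume latency test from the integration suite.
# Exact path/filter may differ; adjust to the current tree.
./tools/devtool test -- -k test_snapshot_resume_latency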

@serban300 (Contributor, Author) commented Sep 11, 2020

This is not related to AMD; it seems to be related to the host kernel version. On 5.4 I'm getting restore times of around 30 ms, while on 4.19 they are around 4 ms. I'm getting similar results on both AMD and Intel.

@serban300 (Contributor, Author)

On host kernel 5.4, the difference between 4 ms and 30 ms appears to be caused by the jailer.

@serban300 (Contributor, Author)

On host kernel 5.4 the problem seems to be caused by the cgroups: when I start the jailer without applying the cgroups, the overhead disappears.

@serban300 (Contributor, Author)

The entire overhead comes from the KVM_CREATE_VM ioctl:

When running within the jailer cgroup:

ioctl(12, KVM_CREATE_VM, 0)             = 14 <0.027335>

When running without cgroups:

ioctl(12, KVM_CREATE_VM, 0)             = 14 <0.000352>

The difference is about 27ms.
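
(For context, per-syscall timings like the ones above can be captured with strace's -T flag; a sketch, where the firecracker invocation is illustrative and not the exact command used here:)

# -f follows forked processes, -T appends the time spent in each syscall,
# -e trace=ioctl restricts the output to ioctl calls.
strace -f -T -e trace=ioctl -o ioctl.trace ./firecracker --api-sock /tmp/fc.sock
grep KVM_CREATE_VM ioctl.trace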

@serban300 (Contributor, Author)

For now I have traced the overhead to the kvm_arch_post_init_vm() function in the host kernel. I will dig deeper.

@serban300 (Contributor, Author) commented Sep 18, 2020

Tracing the overhead further down the host kernel call stack:

kvm_arch_post_init_vm()
-> kvm_mmu_post_init_vm()
-> kvm_vm_create_worker_thread()
-> kvm_vm_worker_thread()
-> cgroup_attach_task_all()
-> cgroup_attach_task()
-> cgroup_migrate()
-> cgroup_migrate_execute()
-> cpuset_can_attach()
-> percpu_down_write(&cpuset_rwsem)
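
(A call graph like the one above can be reproduced with the kernel's function_graph tracer; the trace-cmd invocation below is a sketch and assumes trace-cmd is installed and kvm_arch_post_init_vm is available as a filter function on the running kernel:)

# Record a function_graph trace rooted at kvm_arch_post_init_vm while the
# test binary runs, then inspect the durations of the nested calls.
trace-cmd record -p function_graph -g kvm_arch_post_init_vm -F ./test
trace-cmd report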

@serban300 (Contributor, Author)

Looks like the overhead was introduced by this kernel patch.

More specifically, these two commits:

sched/core: Prevent race condition between cpuset and __sched_setscheduler()
cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem

@serban300 (Contributor, Author)

Actually, to be even more specific, it looks like the overhead was introduced only by the following commit:
cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem

I tried using cpuset_mutex instead of the percpu_rwsem and the overhead disappeared.

@sandreim (Contributor)

The issue is reproducible on Intel/AMD.

@serban300 (Contributor, Author)

Just a quick update: I stumbled upon some documentation, and it looks like this is how a percpu_rwsem is supposed to work. Quoting from https://github.com/torvalds/linux/blob/master/Documentation/locking/percpu-rw-semaphore.rst:

Locking for reading is very fast, it uses RCU and it avoids any atomic instruction in the lock and unlock path. On the other hand, locking for writing is very expensive, it calls synchronize_rcu() that can take hundreds of milliseconds.

@serban300 (Contributor, Author) commented Sep 29, 2020

I managed to reproduce the issue with this simple Rust executable:

use kvm_ioctls::Kvm;
use std::time::{Instant, Duration};
use std::thread;

fn main() {
    // Without this sleep the issue does not reproduce (see the note below).
    thread::sleep(Duration::from_millis(500));

    let kvm = Kvm::new().unwrap();

    // Time only the KVM_CREATE_VM ioctl issued by create_vm().
    let start = Instant::now();
    kvm.create_vm().unwrap();
    let elapsed = start.elapsed();
    println!("Elapsed: {:.2?}", elapsed);
}

and this small script that emulates what the jailer does:

#!/bin/bash

# Create a cpuset cgroup similar to the one the jailer sets up.
cgcreate -g cpuset:/firecracker
echo 0-15 >> /sys/fs/cgroup/cpuset/firecracker/cpuset.cpus
echo 0 >> /sys/fs/cgroup/cpuset/firecracker/cpuset.mems

# Move the current shell into the cgroup, then run the test binary.
echo $$ > /sys/fs/cgroup/cpuset/firecracker/tasks && ./test

Note that if we don't add the extra sleep:

    thread::sleep(Duration::from_millis(500));

the issue doesn't reproduce.
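
To build and run the repro (a sketch; the binary and script names are illustrative, and Cargo.toml is assumed to list kvm-ioctls as a dependency):

# Build the Rust repro and place it where the script expects it.
cargo build --release
cp target/release/kvm_repro ./test   # "kvm_repro" is an assumed binary name
sudo ./repro.sh                      # the cgroup script above, saved as repro.sh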

@iulianbarbu added the Priority: Medium label and removed the Priority: Low label on Oct 5, 2020
@serban300 (Contributor, Author)

When a writer requests access to some shared data, RCU schedules it on a queue and then waits for at least one so-called "grace period" to pass. A grace period is a period in which RCU waits for all previously acquired read locks to be released; it ends after all the CPUs have gone through a quiescent state. Grace periods can be quite long.

When we start the process (echo $$ > /sys/fs/cgroup/cpuset/firecracker/tasks && ./test), a write lock request is performed and a grace period starts. If we don't add the extra thread::sleep(Duration::from_millis(500));, then by the time we reach kvm.create_vm().unwrap(); the previous grace period is still in progress, so AFAIU RCU applies an optimization where the new lock request is "merged" into the same grace period as the previous one, and we don't have to wait for a new grace period.

Otherwise, if we do thread::sleep(Duration::from_millis(500));, the grace period generated by the first lock request ends, and kvm.create_vm().unwrap(); has to ask for a new grace period, ending up waiting 20-30 ms.

This is the desired RCU behavior. And it looks like the cgroup_init() code has a related optimization to drive down the boot time; unfortunately, it only works before the first use of the rcu_sync.

I don't think there's any easy way around this issue.
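
To summarize the two scenarios (the shell line is the same in both cases; only the presence of the sleep in the test binary differs, and the timings are the ones reported earlier in this thread):

# Without the sleep: the write-side acquisition triggered by KVM_CREATE_VM is
# merged into the grace period started by the cgroup migration just before it,
# so the issue does not reproduce.
echo $$ > /sys/fs/cgroup/cpuset/firecracker/tasks && ./test   # fast (sub-millisecond)

# With the 500 ms sleep: that grace period has already ended, so KVM_CREATE_VM
# has to wait for a brand-new one.
echo $$ > /sys/fs/cgroup/cpuset/firecracker/tasks && ./test   # ~20-30 ms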

@serban300 changed the title from "Investigate snapshot restore performance on AMD" to "Investigate snapshot restore performance on host kernel >= 5.4" on Oct 6, 2020
@serban300 removed their assignment on Jun 29, 2022
@dianpopa self-assigned this on Aug 1, 2022
@dianpopa (Contributor) commented Nov 14, 2022

We have tracked down the root cause of this issue to the cgroups v1 implementation on 5.x kernels with x >= 4.

We were able to replicate the findings from the investigation above and traced the latency impact to the cgroup_attach_task_all function, which is only used by cgroups v1 to attach the current task to all of the parent's cgroups.
This issue does not have a quick fix, since it lies in the host kernel design.
Indeed, the issue does not reproduce on a cgroups-v2-enabled host.

  • cgroups v2:
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.000734>
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.000719>
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.000750>

  • cgroups v1:
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.026044>
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.045146>
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.036045>

Another thing worth noting is that the v1 measurements also vary a lot from run to run.

Since this issue originates in the host kernel design, we currently recommend that users running the snapshot functionality on host kernels 5.4 and newer do so on cgroups-v2-enabled hosts.
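
For completeness, one way to check whether a host is running the unified (v2) cgroup hierarchy; the kernel command-line option mentioned below applies to systemd-based hosts:

# cgroup2fs means the unified (v2) hierarchy is mounted at /sys/fs/cgroup;
# tmpfs indicates the legacy (v1) layout.
stat -fc %T /sys/fs/cgroup/

# On systemd-based hosts, cgroups v2 can be enabled by booting with:
#   systemd.unified_cgroup_hierarchy=1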

The snapshot resume latency results for the 5.10 kernel on the currently supported x86 platforms can be found here:
m5d and m6a.
