
Investigate snapshot restore performance on host kernel >= 5.4 #2129

Closed
serban300 opened this issue Sep 11, 2020 · 13 comments
Labels: Priority: Medium

@serban300 (Contributor)

The snapshot restore operation is slower on AMD than on Intel.

We should investigate why, and either fix the root cause or change the test_snapshot_resume_latency test so that it checks different values for AMD and Intel.
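
For reference, a rough sketch of how just this test might be run locally; the devtool entry point and the pytest -k filter are assumptions about the local setup, not something taken from this issue:

# Run only the snapshot resume latency test from the integration suite.
# Exact path/filter may differ; adjust to the current tree.
./tools/devtool test -- -k test_snapshot_resume_latency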

@serban300 (Contributor, Author) commented Sep 11, 2020

This is not related to AMD; it seems to be related to the host kernel version. On 5.4 I'm getting restore times of around 30 ms, while on 4.19 they are around 4 ms. I'm getting similar results on both AMD and Intel.

@serban300 (Contributor, Author)

On host kernel 5.4, the difference between 4 ms and 30 ms appears to be caused by the jailer.

@serban300 (Contributor, Author)

On host kernel 5.4 the problem seems to be caused by the cgroups: when I start the jailer without applying the cgroups, the overhead disappears.

@serban300 (Contributor, Author)

The entire overhead comes from the KVM_CREATE_VM ioctl:

When running within the jailer cgroup:

ioctl(12, KVM_CREATE_VM, 0)             = 14 <0.027335>

When running without cgroups:

ioctl(12, KVM_CREATE_VM, 0)             = 14 <0.000352>

The difference is about 27ms.
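
(For context, per-syscall timings like the ones above can be captured with strace's -T flag; a sketch, where the firecracker invocation is illustrative and not the exact command used here:)

# -f follows forked processes, -T appends the time spent in each syscall,
# -e trace=ioctl restricts the output to ioctl calls.
strace -f -T -e trace=ioctl -o ioctl.trace ./firecracker --api-sock /tmp/fc.sock
grep KVM_CREATE_VM ioctl.trace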

@serban300 (Contributor, Author)

For now I have traced the overhead to the kvm_arch_post_init_vm() function in the host kernel. I will dig deeper.

@serban300 (Contributor, Author) commented Sep 18, 2020

Tracing the overhead further down the host kernel call stack:

kvm_arch_post_init_vm()
-> kvm_mmu_post_init_vm()
-> kvm_vm_create_worker_thread()
-> kvm_vm_worker_thread()
-> cgroup_attach_task_all()
-> cgroup_attach_task()
-> cgroup_migrate()
-> cgroup_migrate_execute()
-> cpuset_can_attach()
-> percpu_down_write(&cpuset_rwsem)
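
(A call graph like the one above can be reproduced with the kernel's function_graph tracer; the trace-cmd invocation below is a sketch and assumes trace-cmd is installed and kvm_arch_post_init_vm is available as a filter function on the running kernel:)

# Record a function_graph trace rooted at kvm_arch_post_init_vm while the
# test binary runs, then inspect the durations of the nested calls.
trace-cmd record -p function_graph -g kvm_arch_post_init_vm -F ./test
trace-cmd report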

@serban300 (Contributor, Author)

Looks like the overhead was introduced by this kernel patch.

More specifically, these two commits:

sched/core: Prevent race condition between cpuset and __sched_setscheduler()
cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem

@serban300 (Contributor, Author)

Actually, to be even more specific, it looks like the overhead was introduced only by the following commit:
cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem

I tried using cpuset_mutex instead of the percpu_rwsem and the overhead disappeared.

@sandreim (Contributor)

The issue is reproducible on Intel/AMD.

@serban300 (Contributor, Author)

Just a quick update: I stumbled upon some documentation, and it looks like this is how a percpu_rwsem is supposed to work. Quoting from https://github.com/torvalds/linux/blob/master/Documentation/locking/percpu-rw-semaphore.rst:

Locking for reading is very fast, it uses RCU and it avoids any atomic instruction in the lock and unlock path. On the other hand, locking for writing is very expensive, it calls synchronize_rcu() that can take hundreds of milliseconds.

@serban300 (Contributor, Author) commented Sep 29, 2020

I managed to reproduce the issue with this simple Rust executable:

use kvm_ioctls::Kvm;
use std::time::{Instant, Duration};
use std::thread;

fn main() {
    // Without this sleep the issue does not reproduce (see the note below).
    thread::sleep(Duration::from_millis(500));

    let kvm = Kvm::new().unwrap();

    // Time only the KVM_CREATE_VM ioctl issued by create_vm().
    let start = Instant::now();
    kvm.create_vm().unwrap();
    let elapsed = start.elapsed();
    println!("Elapsed: {:.2?}", elapsed);
}

and this small script that emulates what the jailer does:

#!/bin/bash

# Create a cpuset cgroup similar to the one the jailer sets up.
cgcreate -g cpuset:/firecracker
echo 0-15 >> /sys/fs/cgroup/cpuset/firecracker/cpuset.cpus
echo 0 >> /sys/fs/cgroup/cpuset/firecracker/cpuset.mems

# Move the current shell into the cgroup, then run the test binary.
echo $$ > /sys/fs/cgroup/cpuset/firecracker/tasks && ./test

Note that if we don't add the extra sleep:

    thread::sleep(Duration::from_millis(500));

the issue doesn't reproduce.
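
To build and run the repro (a sketch; the binary and script names are illustrative, and Cargo.toml is assumed to list kvm-ioctls as a dependency):

# Build the Rust repro and place it where the script expects it.
cargo build --release
cp target/release/kvm_repro ./test   # "kvm_repro" is an assumed binary name
sudo ./repro.sh                      # the cgroup script above, saved as repro.sh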

@iulianbarbu added the Priority: Medium label and removed the Priority: Low label on Oct 5, 2020
@serban300 (Contributor, Author)

When a writer requests access to some shared data, RCU schedules it on a queue and then waits for at least one so-called "grace period" to pass. A grace period is a period in which RCU waits for all previously acquired read locks to be released; it ends after all the CPUs have gone through a quiescent state. Grace periods can be quite long.

When we start the process (echo $$ > /sys/fs/cgroup/cpuset/firecracker/tasks && ./test), a write lock request is performed and a grace period starts. If we don't add the extra thread::sleep(Duration::from_millis(500));, then by the time we reach kvm.create_vm().unwrap(); the previous grace period is still in progress, so AFAIU RCU applies an optimization where the new lock request is "merged" into the same grace period as the previous one, and we don't have to wait for a new grace period.

Otherwise, if we do thread::sleep(Duration::from_millis(500));, the grace period generated by the first lock request ends, and kvm.create_vm().unwrap(); has to ask for a new grace period, ending up waiting 20-30 ms.

This is the desired RCU behavior. And it looks like the cgroup_init() code has a related optimization to drive down the boot time; unfortunately, it only works before the first use of the rcu_sync.

I don't think there's any easy way around this issue.
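
To summarize the two scenarios (the shell line is the same in both cases; only the presence of the sleep in the test binary differs, and the timings are the ones reported earlier in this thread):

# Without the sleep: the write-side acquisition triggered by KVM_CREATE_VM is
# merged into the grace period started by the cgroup migration just before it,
# so the issue does not reproduce.
echo $$ > /sys/fs/cgroup/cpuset/firecracker/tasks && ./test   # fast (sub-millisecond)

# With the 500 ms sleep: that grace period has already ended, so KVM_CREATE_VM
# has to wait for a brand-new one.
echo $$ > /sys/fs/cgroup/cpuset/firecracker/tasks && ./test   # ~20-30 ms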

@serban300 changed the title from "Investigate snapshot restore performance on AMD" to "Investigate snapshot restore performance on host kernel >= 5.4" on Oct 6, 2020
@serban300 removed their assignment on Jun 29, 2022
@dianpopa self-assigned this on Aug 1, 2022
@dianpopa (Contributor) commented Nov 14, 2022

We have tracked down the root cause of this issue to the cgroups v1 implementation on 5.x kernels with x >= 4.

We were able to replicate the findings from the investigation above and traced the latency impact to the cgroup_attach_task_all function, which is only used by cgroups v1 to attach the current task to all of the parent's cgroups.
This issue does not have a quick fix, since it lies in the host kernel design.
Indeed, the issue does not reproduce on a cgroups-v2-enabled host.

  • cgroups v2:
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.000734>
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.000719>
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.000750>

  • cgroups v1:
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.026044>
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.045146>
ioctl(13, KVM_CREATE_VM, 0)             = 14 <0.036045>

Another thing worth noting is that the v1 measurements also vary a lot from run to run.

Since this issue originates in the host kernel design, we currently recommend that users running the snapshot functionality on host kernels 5.4 and newer do so on cgroups-v2-enabled hosts.
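
For completeness, one way to check whether a host is running the unified (v2) cgroup hierarchy; the kernel command-line option mentioned below applies to systemd-based hosts:

# cgroup2fs means the unified (v2) hierarchy is mounted at /sys/fs/cgroup;
# tmpfs indicates the legacy (v1) layout.
stat -fc %T /sys/fs/cgroup/

# On systemd-based hosts, cgroups v2 can be enabled by booting with:
#   systemd.unified_cgroup_hierarchy=1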

The snapshot resume latency results for the 5.10 kernel on the currently supported x86 platforms can be found here:
m5d and m6a.
