CPU Virtualization in QEMU

Overview

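KVM (Kernel-based Virtual Machine) is a kernel-based virtual machine monitor developed by the Israeli startup Qumranet after CPUs gained hardware virtualization support.

KVM is a general virtualization framework: besides x86, architectures such as ARM have their own implementations. The core KVM code therefore lives under virt/kvm in the kernel tree. It is the code common to all CPU architectures and forms the core of the kernel module kvm.ko.

Architecture-specific code lives under arch/; for example, the x86 code is in arch/x86/kvm. A single architecture may have several implementations: x86 has both an Intel and an AMD one, so the x86 directory contains Intel's vmx.c (for Intel VT-x) and AMD's svm.c (for AMD-V), which are built into kvm-intel.ko and kvm-amd.ko respectively. ioapic.c and lapic.c hold the interrupt controller code. This source layout is common in other Linux kernel subsystems as well.

Every virtualization implementation (Intel and AMD) registers a kvm_x86_ops structure with the KVM module. As a result, some KVM functions are only thin shells: they first call a kvm_arch_xxx function, i.e. the architecture-specific function, and if kvm_arch_xxx needs implementation-specific behavior, it calls the corresponding callback in the kvm_x86_ops structure.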
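The layering described above (generic shell, architecture function, vendor callback) can be sketched in plain user-space C. This is only a model under stated assumptions: every name prefixed with demo_ is hypothetical, and the real kvm_x86_ops has far more members.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct kvm_x86_ops. */
struct demo_x86_ops {
    int (*hardware_enable)(void);
    int (*vm_init)(int vm_id);
};

/* One global table, filled in by whichever "vendor module" registers. */
static struct demo_x86_ops demo_ops;

/* Vendor callbacks (illustrative only). */
static int demo_vmx_hardware_enable(void) { return 0; }
static int demo_vmx_vm_init(int vm_id)    { return vm_id >= 0 ? 0 : -1; }

static const struct demo_x86_ops demo_vmx_ops = {
    .hardware_enable = demo_vmx_hardware_enable,
    .vm_init         = demo_vmx_vm_init,
};

/* Registration: the vendor module hands its table to the common code. */
static void demo_register_ops(const struct demo_x86_ops *ops)
{
    assert(ops != NULL && ops->vm_init != NULL);
    demo_ops = *ops;
}

/* Architecture layer: forwards to the vendor callback. */
static int demo_arch_init_vm(int vm_id)
{
    return demo_ops.vm_init(vm_id);
}

/* Generic layer: a thin shell around the architecture function. */
static int demo_create_vm(int vm_id)
{
    return demo_arch_init_vm(vm_id);
}
```

After demo_register_ops(&demo_vmx_ops), a call to demo_create_vm() ends up in demo_vmx_vm_init() without the generic layer knowing which vendor is behind the table.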

Relationship between kvm_intel.ko and kvm.ko:

[Figure: user space reaches kvm.ko through /dev/kvm; kvm_intel.ko hands vmx_x86_ops to kvm.ko through kvm_init/kvm_exit]

VM Creation

VM creation on the QEMU side

QEMU's KVM support is implemented mainly in kvm-all.c, whose initialization function is kvm_init().

qemu
    accel
        kvm
            kvm-all.c
        xen
            xen-all.c
        ...

When QEMU is started with the --enable-kvm option on the command line, qemu_init() handles it as follows:

    case QEMU_OPTION_enable_kvm:
        olist = qemu_find_opts("machine");
        qemu_opts_parse_noisily(olist, "accel=kvm", false);
        break;

This adds accel=kvm to the machine option list. The initialization path then calls configure_accelerator(current_machine), which reads the accel value from the machine options, looks up the matching accelerator type, and calls accel_init_machine.

    int accel_init_machine(AccelState *accel, MachineState *ms)
    {
        AccelClass *acc = ACCEL_GET_CLASS(accel);   /* class of the selected accelerator type (here: kvm) */
        int ret;

        ms->accelerator = accel;
        *(acc->allowed) = true;
        ret = acc->init_machine(ms);                /* call the accelerator's init_machine hook */
        if (ret < 0) {
            ms->accelerator = NULL;
            *(acc->allowed) = false;
            object_unref(OBJECT(accel));
        } else {
            object_set_accelerator_compat_props(acc->compat_props);
        }
        return ret;
    }

So which function is init_machine for accel=kvm?

    #define TYPE_KVM_ACCEL ACCEL_CLASS_NAME("kvm")    /* TYPE_KVM_ACCEL expands to "kvm-accel" */

In kvm-all.c, the init_machine hook is set while the kvm_accel_type structure is being set up:

    static void kvm_accel_class_init(ObjectClass *oc, void *data)
    {
        AccelClass *ac = ACCEL_CLASS(oc);
        ac->name = "KVM";
        ac->init_machine = kvm_init;        /* the KVM accelerator's init_machine hook is kvm_init() */
        ac->has_memory = kvm_accel_has_memory;
        ac->allowed = &kvm_allowed;
        ...
    }

    /* Definition of the kvm_accel_type structure */
    static const TypeInfo kvm_accel_type = {
        .name = TYPE_KVM_ACCEL,
        .parent = TYPE_ACCEL,
        .instance_init = kvm_accel_instance_init,
        .class_init = kvm_accel_class_init,
        .instance_size = sizeof(KVMState),
    };

    static void kvm_type_init(void)
    {
        type_register_static(&kvm_accel_type);    /* register kvm_accel_type */
    }

    type_init(kvm_type_init);
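The path from accel=kvm to the init_machine hook can be modeled without QEMU's object system. The sketch below is a simplification under stated assumptions: the demo_ names, the flat class array and the "bad" accelerator are all hypothetical; only the control flow (look up by name, set allowed, call the hook, roll back on failure) follows accel_init_machine() as quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of an accelerator class. */
struct demo_accel_class {
    const char *name;
    int (*init_machine)(void);
    int *allowed;                  /* flag set while the accelerator is in use */
};

static int demo_kvm_allowed;
static int demo_bad_allowed;
static int demo_kvm_init(void)    { return 0; }    /* pretend initialization worked */
static int demo_broken_init(void) { return -1; }   /* pretend initialization failed */

static struct demo_accel_class demo_classes[] = {
    { "kvm", demo_kvm_init,    &demo_kvm_allowed },
    { "bad", demo_broken_init, &demo_bad_allowed },
};

/* Find the class registered under the given accel= value. */
static struct demo_accel_class *demo_find_accel(const char *name)
{
    for (size_t i = 0; i < sizeof(demo_classes) / sizeof(demo_classes[0]); i++)
        if (strcmp(demo_classes[i].name, name) == 0)
            return &demo_classes[i];
    return NULL;
}

/* Same shape as accel_init_machine(): set allowed, call hook, roll back on error. */
static int demo_accel_init(const char *name)
{
    struct demo_accel_class *acc = demo_find_accel(name);
    int ret;

    if (acc == NULL)
        return -1;
    assert(acc->init_machine != NULL);
    *acc->allowed = 1;
    ret = acc->init_machine();
    if (ret < 0)
        *acc->allowed = 0;
    return ret;
}
```

Calling demo_accel_init("kvm") leaves demo_kvm_allowed set, while a failing hook clears its flag again, which is the same rollback the QEMU function performs.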

The kvm_init() function in kvm-all.c:

    static int kvm_init(MachineState *ms)
    {
        /* code omitted */
        s = KVM_STATE(ms->accelerator);
        /* code omitted */
        s->fd = qemu_open("/dev/kvm", O_RDWR);    /* open /dev/kvm and keep the fd */
        /* code omitted */
        do {
            /* ioctl on the /dev/kvm fd: KVM_CREATE_VM asks kvm.ko to create a VM */
            ret = kvm_ioctl(s, KVM_CREATE_VM, type);
        } while (ret == -EINTR);
        /* code omitted */
        ret = kvm_arch_init(ms, s);               /* architecture-specific initialization */
        /* code omitted */
        return ret;
    }

Call sequence:

    kvm_init
        qemu_open, opens /dev/kvm
        kvm_ioctl(KVM_CREATE_VM)
        kvm_arch_init

The main job of kvm_init() is to use the ioctl interfaces exposed through /dev/kvm to create a virtual machine inside the kernel's KVM. One QEMU process corresponds to one VM.

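VM creation on the KVM side

The main entry point of the kernel KVM module is kvm_main.c. The analysis below uses the KVM plus Intel combination as the example, so all architecture-specific parts refer to Intel:

    Linux
        virt
            kvm
                kvm_main.c, eventfd.c, etc.
        arch
            x86
                kvm
                    x86.c, vmx, svm, etc.

Data structures

In the kernel KVM module, struct kvm represents one virtual machine.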

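Initializing /dev/kvm

kvm_init() registers the /dev/kvm device for QEMU to access and sets up the corresponding ops (operation callbacks).

    vmx_init
        kvm_init
            misc_register(&kvm_dev)

On x86, KVM's ops object is kvm_x86_ops.

The global variable kvm_x86_ops is defined in arch/x86/kvm/x86.c: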

struct kvm_x86_ops kvm_x86_ops __read_mostly;
EXPORT_SYMBOL_GPL(kvm_x86_ops);

struct kvm_x86_ops is a collection of function pointers; their concrete values come from vmx_x86_ops.

    struct kvm_x86_ops {
        int (*hardware_enable)(void);
        void (*hardware_disable)(void);
        void (*hardware_unsetup)(void);
        bool (*cpu_has_accelerated_tpr)(void);
        bool (*has_emulated_msr)(u32 index);
        void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

        unsigned int vm_size;
        int (*vm_init)(struct kvm *kvm);
        void (*vm_destroy)(struct kvm *kvm);
        /* many more function pointers omitted */
    };

In vmx.c, vmx_init passes vmx_init_ops to kvm_init:

    r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx),
                 __alignof__(struct vcpu_vmx), THIS_MODULE);

The part that really matters is vmx_x86_ops, which vmx_init_ops references through its runtime_ops member. Both are defined in arch/x86/kvm/vmx/vmx.c:

    static struct kvm_x86_init_ops vmx_init_ops __initdata = {
        .cpu_has_kvm_support = cpu_has_kvm_support,
        .disabled_by_bios = vmx_disabled_by_bios,
        .check_processor_compatibility = vmx_check_processor_compat,
        .hardware_setup = hardware_setup,
        .runtime_ops = &vmx_x86_ops,
    };

vmx_x86_ops is itself a static global object with the following contents:

    static struct kvm_x86_ops vmx_x86_ops __initdata = {
        .hardware_unsetup = hardware_unsetup,
        .hardware_enable = hardware_enable,
        .hardware_disable = hardware_disable,
        .cpu_has_accelerated_tpr = report_flexpriority,
        .has_emulated_msr = vmx_has_emulated_msr,
        .vm_size = sizeof(struct kvm_vmx),
        .vm_init = vmx_vm_init,
        /* omitted... */
    };

kvm_main.c defines global objects for the KVM misc device and for the file operations of the character device, of VM fds and of vCPU fds, so that the kernel can respond to requests from user space.

    static struct file_operations kvm_vcpu_fops = {
        .release        = kvm_vcpu_release,
        .unlocked_ioctl = kvm_vcpu_ioctl,
        .mmap           = kvm_vcpu_mmap,
        .llseek         = noop_llseek,
        KVM_COMPAT(kvm_vcpu_compat_ioctl),
    };

    static struct file_operations kvm_vm_fops = {
        .release        = kvm_vm_release,
        .unlocked_ioctl = kvm_vm_ioctl,
        .llseek         = noop_llseek,
        KVM_COMPAT(kvm_vm_compat_ioctl),
    };

    static struct file_operations kvm_chardev_ops = {
        .unlocked_ioctl = kvm_dev_ioctl,
        .llseek         = noop_llseek,
        KVM_COMPAT(kvm_dev_ioctl),
    };

    static struct miscdevice kvm_dev = {
        KVM_MINOR,
        "kvm",
        &kvm_chardev_ops,
    };

    /* set inside kvm_init() */
    kvm_preempt_ops.sched_in = kvm_sched_in;
    kvm_preempt_ops.sched_out = kvm_sched_out;

kvm_dev_ioctl:

    ioctl command                 Handler / purpose
    KVM_GET_API_VERSION
    KVM_CREATE_VM                 Creates a VM: kvm_dev_ioctl_create_vm() --> kvm_create_vm()
    KVM_CHECK_EXTENSION           Checks for an extension: kvm_vm_ioctl_check_extension_generic()
    KVM_GET_VCPU_MMAP_SIZE        Returns the size of the memory shared between QEMU and KVM

kvm_vm_ioctl:

    ioctl command                 Handler / purpose
    KVM_CREATE_VCPU               Creates a vCPU: kvm_vm_ioctl_create_vcpu
    KVM_ENABLE_CAP                kvm_vm_ioctl_enable_cap_generic
    KVM_SET_USER_MEMORY_REGION    kvm_vm_ioctl_set_memory_region
    KVM_GET_DIRTY_LOG             kvm_vm_ioctl_get_dirty_log
    KVM_REGISTER_COALESCED_MMIO
    KVM_IRQFD                     kvm_irqfd
    KVM_IOEVENTFD                 kvm_ioeventfd
    KVM_CREATE_DEVICE             kvm_ioctl_create_device
    KVM_CHECK_EXTENSION           kvm_vm_ioctl_check_extension_generic

kvm_vcpu_ioctl:

    ioctl command                 Handler / purpose
    KVM_RUN                       Runs the vCPU: kvm_arch_vcpu_ioctl_run()
    KVM_GET_REGS
    KVM_SET_REGS

[Figure: relationship between kvm_dev_ioctl, kvm_vm_ioctl and kvm_vcpu_ioctl]

CPU creation in QEMU

[Figure: inheritance hierarchy of the CPU models in QEMU]

The x86 CPUs supported by QEMU are all defined in target/i386/cpu.c, in the builtin_x86_defs array of type X86CPUDefinition:

    /* Base definition for a CPU model */
    typedef struct X86CPUDefinition {
        const char *name;
        uint32_t level;
        uint32_t xlevel;
        /* vendor is zero-terminated, 12 character ASCII string */
        char vendor[CPUID_VENDOR_SZ + 1];
        int family;
        int model;
        int stepping;
        FeatureWordArray features;
        const char *model_id;
        CPUCaches *cache_info;

        /* Use AMD EPYC encoding for apic id */
        bool use_epyc_apic_id_encoding;

        /*
         * Definitions for alternative versions of CPU model.
         * List is terminated by item with version == 0.
         * If NULL, version 1 will be registered automatically.
         */
        const X86CPUVersionDefinition *versions;
    } X86CPUDefinition;

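Where:

    X86CPUDefinition member             Purpose
    name                                Name of the CPU
    level                               Highest basic function number supported by CPUID
    xlevel                              Highest extended function number supported by CPUID
    vendor, family, model, stepping     Basic CPU information
    features                            Array recording the CPU features
    model_id                            Full name of the CPU

The builtin_x86_defs array: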

    static X86CPUDefinition builtin_x86_defs[] = {
        {
            .name = "qemu64",
            .level = 0xd,
            .vendor = CPUID_VENDOR_AMD,
            .family = 6,
            .model = 6,
            .stepping = 3,
            .features[FEAT_1_EDX] =
                PPRO_FEATURES |
                CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |
                CPUID_PSE36,
            .features[FEAT_1_ECX] =
                CPUID_EXT_SSE3 | CPUID_EXT_CX16,
            .features[FEAT_8000_0001_EDX] =
                CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
            .features[FEAT_8000_0001_ECX] =
                CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM,
            .xlevel = 0x8000000A,
            .model_id = "QEMU Virtual CPU version " QEMU_HW_VERSION,
        },
        ... /* more than 2000 lines of definitions follow */
    };
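A table like builtin_x86_defs is typically consumed by looking up a model by name. The following sketch shows that idea with a reduced structure; struct demo_cpu_def, demo_find_cpu() and the "demo-cpu" entry are hypothetical, and only the qemu64 values are taken from the definition above. It is not how QEMU itself resolves CPU models.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced version of X86CPUDefinition. */
struct demo_cpu_def {
    const char *name;
    uint32_t level;     /* highest basic CPUID function */
    uint32_t xlevel;    /* highest extended CPUID function */
    int family, model, stepping;
};

static const struct demo_cpu_def demo_defs[] = {
    { .name = "qemu64", .level = 0xd, .xlevel = 0x8000000A,
      .family = 6, .model = 6, .stepping = 3 },
    /* Made-up second entry so the lookup has something to skip. */
    { .name = "demo-cpu", .level = 0x4, .xlevel = 0x80000008,
      .family = 6, .model = 1, .stepping = 0 },
};

/* Linear search by model name; returns NULL for unknown models. */
static const struct demo_cpu_def *demo_find_cpu(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(demo_defs) / sizeof(demo_defs[0]); i++)
        if (strcmp(demo_defs[i].name, name) == 0)
            return &demo_defs[i];
    return NULL;
}
```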

QEMU instantiates a virtual x86 CPU through struct X86CPU:

[Figure: struct X86CPU]

[Figure: function call path for vCPU creation in QEMU]

The kvm_init_vcpu() code in QEMU:

    int kvm_init_vcpu(CPUState *cpu)
    {
        /* parts of the code are omitted; only the relevant lines are kept */
        ret = kvm_get_vcpu(s, kvm_arch_vcpu_id(cpu));           /* issues KVM_CREATE_VCPU to create the vCPU */

        mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);    /* query the size of the shared memory region */

        /* mmap the vCPU fd to map the shared memory; the KVM-side handler is kvm_vcpu_mmap() */
        cpu->kvm_run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                            cpu->kvm_fd, 0);

        ret = kvm_arch_init_vcpu(cpu);
        return ret;
    }

CPU creation in KVM

[Figure: vCPU creation path in KVM]

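Data shared between QEMU and KVM

QEMU and KVM frequently need to share data. For example, KVM places VM exit information into shared memory, and QEMU reads it from there. This shared region is set up when QEMU creates a vCPU.

In kvm_init_vcpu(), QEMU calls kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0), which returns the size of the memory shared between QEMU and KVM.

The KVM function that handles this request is: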

    static long kvm_dev_ioctl(struct file *filp,
                              unsigned int ioctl, unsigned long arg)
    {
        /* code omitted */
        case KVM_GET_VCPU_MMAP_SIZE:
            if (arg)
                goto out;
            r = PAGE_SIZE;     /* struct kvm_run */
    #ifdef CONFIG_X86
            r += PAGE_SIZE;    /* pio data page */
    #endif
    #ifdef CONFIG_KVM_MMIO
            r += PAGE_SIZE;    /* coalesced mmio ring page */
    #endif
            break;
        /* code omitted */
    out:
        return r;
    }

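ioctl(KVM_GET_VCPU_MMAP_SIZE) may return a size of one, two or three pages. The first page holds kvm_run, the structure used for basic data exchange between QEMU and KVM. The second page stores the data transferred when the VM accesses I/O ports. The last page is used for coalesced MMIO.

QEMU then mmaps the shared memory. On the kernel side the mapping is backed by the following code: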

    static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
    {
        struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
        struct page *page;

        if (vmf->pgoff == 0)
            page = virt_to_page(vcpu->run);
    #ifdef CONFIG_X86
        else if (vmf->pgoff == KVM_PIO_PAGE_OFFSET)
            page = virt_to_page(vcpu->arch.pio_data);
    #endif
    #ifdef CONFIG_KVM_MMIO
        else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
            page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
    #endif
        else
            return kvm_arch_vcpu_fault(vcpu, vmf);
        get_page(page);
        vmf->page = page;
        return 0;
    }

    static const struct vm_operations_struct kvm_vcpu_vm_ops = {
        .fault = kvm_vcpu_fault,
    };

    static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
    {
        vma->vm_ops = &kvm_vcpu_vm_ops;
        return 0;
    }

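When QEMU calls mmap on the vCPU fd (an anonymous file), only virtual address space is allocated, and the operations of that address range are set to kvm_vcpu_vm_ops. This operations table has a single callback, the fault handler kvm_vcpu_fault. It is called when QEMU touches the shared memory and triggers a page fault. As the code shows, the kernel then ties the corresponding data page to QEMU's virtual address space.

    Shared memory page accessed    What is actually accessed
    page 1                         kvm_vcpu->run
    page 2                         kvm_vcpu->arch.pio_data
    page 3                         kvm->coalesced_mmio_ring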
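The size calculation and the page-offset dispatch above can be restated as two small pure functions. This is a user-space sketch, not kernel code: the demo_ names are hypothetical, the two boolean flags stand in for the CONFIG_X86 and CONFIG_KVM_MMIO build options, and the offsets 1 and 2 are assumed x86 values that should be checked against the uapi headers of the kernel in use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed x86 values; verify against your kernel's uapi headers. */
#define DEMO_PIO_PAGE_OFFSET            1
#define DEMO_COALESCED_MMIO_PAGE_OFFSET 2

enum demo_backing {
    DEMO_BACKING_RUN,         /* vcpu->run */
    DEMO_BACKING_PIO,         /* vcpu->arch.pio_data */
    DEMO_BACKING_MMIO_RING,   /* vcpu->kvm->coalesced_mmio_ring */
    DEMO_BACKING_ARCH,        /* left to the architecture hook */
};

/* Bytes spanned by the vCPU mapping: one mandatory page plus optional ones. */
static size_t demo_vcpu_mmap_size(size_t page_size, bool has_pio_page,
                                  bool has_coalesced_mmio)
{
    size_t size;

    assert(page_size > 0);
    size = page_size;              /* struct kvm_run */
    if (has_pio_page)
        size += page_size;         /* PIO data page */
    if (has_coalesced_mmio)
        size += page_size;         /* coalesced MMIO ring page */
    return size;
}

/* Which kernel object backs a fault at the given page offset. */
static enum demo_backing demo_resolve_fault(unsigned long pgoff)
{
    if (pgoff == 0)
        return DEMO_BACKING_RUN;
    if (pgoff == DEMO_PIO_PAGE_OFFSET)
        return DEMO_BACKING_PIO;
    if (pgoff == DEMO_COALESCED_MMIO_PAGE_OFFSET)
        return DEMO_BACKING_MMIO_RING;
    return DEMO_BACKING_ARCH;
}
```

With a 4096-byte page and both optional pages present, the mapping spans three pages, and a first touch of each page resolves to a different backing object.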

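Running a vCPU

Running a vCPU from QEMU

Every vCPU has a corresponding VMCS (Virtual Machine Control Structure), the key data structure that Intel x86 processors use to record vCPU state for CPU virtualization. The physical address of the VMCS is passed as the operand to VMX instructions. A VMCS can be in one of the following four states:

  • Inactive: the VMCS has only been allocated and initialized, or VMCLEAR has been executed on it.
  • Working: the state after the CPU executes VMPTRLD on a VMCS, or after a VM exit; the CPU is still in VMX root mode at this point.
  • Active: the VMCS was loaded with VMPTRLD and the same CPU has since executed VMPTRLD for another vCPU; this is the state of the earlier VMCS.
  • Controlling: the state after the CPU executes VMLAUNCH on a VMCS and is running in VMX non-root mode.

Section 31.6 of the Intel SDM describes the steps required to get a virtual machine running: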

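  1. Allocate a 4 KB-aligned VMCS region in non-pageable memory. Its size is reported by the IA32_VMX_BASIC MSR. In KVM this is done mainly by vmx_create_vcpu calling alloc_vmcs.
  2. Initialize the revision identifier of the VMCS region (bits 30:0 of its first 4 bytes), which is also obtained from the IA32_VMX_BASIC MSR, and clear bit 31 of the first 4 bytes. In KVM this is done in alloc_vmcs_cpu.
  3. Execute VMCLEAR with the physical address of the VMCS as the operand. This sets the current CPU's working-VMCS pointer to FFFFFFFF_FFFFFFFFH. After the instruction completes, check that RFLAGS.CF = 0 and RFLAGS.ZF = 0. In KVM this is done mainly by loaded_vmcs_clear, which eventually calls vmcs_clear.
  4. Execute VMPTRLD with the physical address of the VMCS. The CPU's working-VMCS pointer now points to the physical address of the VMCS region. In KVM this is done by vmx_vcpu_load calling vmcs_load.
  5. Execute VMWRITE to initialize the host-state area of the VMCS. After a VM exit, this area is used to restore the host's CPU state and context. It covers the control registers (CR0, CR3 and CR4), the segment registers (CS, SS, DS, ES, FS, GS, TR), RSP, RIP and some MSRs. In KVM this is done mainly in vmx_vcpu_setup.
  6. Execute VMWRITE to initialize the VM-exit control, VM-entry control and VM-execution control areas of the VMCS. Some of these fields must be set according to what the VMX capability MSRs report, for example when an MSR reports that certain bits can only be 0 on the current CPU. In KVM this is done mainly in vmx_vcpu_setup.
  7. Execute VMWRITE to initialize the guest-state area. When the CPU enters VMX non-root mode, it builds its context from this data. In KVM this is done mainly in vmx_vcpu_reset.
  8. The guest-state settings must satisfy the following conditions:
     a. If the VM is to emulate a complete OS booting from the BIOS, the guest state must be set to the power-on state of a physical CPU.
     b. Guest state that the VMM cannot intercept must be set correctly, such as the general-purpose registers, the CR2 control register, the debug registers and the floating-point registers.
  9. Execute VMLAUNCH to put the CPU into VMX non-root mode. If this fails, RFLAGS.CF or RFLAGS.ZF is set. In KVM this is done in vmx_vcpu_run.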

The thread routine of a vCPU thread in QEMU is:

    static void *qemu_kvm_cpu_thread_fn(void *arg)
    {
        /* omitted */
        r = kvm_init_vcpu(cpu);
        kvm_init_cpu_signals(cpu);

        /* signal CPU creation */
        cpu->created = true;
        qemu_cond_signal(&qemu_cpu_cond);
        qemu_guest_random_seed_thread_part2(cpu->random_seed);

        do {
            if (cpu_can_run(cpu)) {
                r = kvm_cpu_exec(cpu);    /* core of vCPU execution */
                if (r == EXCP_DEBUG) {
                    cpu_handle_guest_debug(cpu);
                }
            }
            qemu_wait_io_event(cpu);      /* when the vCPU cannot run, wait on the cpu->halt_cond condition */
        } while (!cpu->unplug || cpu_can_run(cpu));
        /* omitted */
        return NULL;
    }

The core of vCPU execution in QEMU is kvm_cpu_exec(), which is itself built around a do {} while () loop.

    int kvm_cpu_exec(CPUState *cpu)
    {
        /* omitted */
        do {
            /* omitted */
            kvm_arch_pre_run(cpu, run);
            run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
            attrs = kvm_arch_post_run(cpu, run);

            switch (run->exit_reason) {
            case KVM_EXIT_IO:
                DPRINTF("handle_io\n");
                /* Called outside BQL */
                kvm_handle_io(run->io.port, attrs,
                              (uint8_t *)run + run->io.data_offset,
                              run->io.direction,
                              run->io.size,
                              run->io.count);
                ret = 0;
                break;
            case KVM_EXIT_MMIO:
                DPRINTF("handle_mmio\n");
                /* Called outside BQL */
                address_space_rw(&address_space_memory,
                                 run->mmio.phys_addr, attrs,
                                 run->mmio.data,
                                 run->mmio.len,
                                 run->mmio.is_write);
                ret = 0;
                break;
            /* omitted */
            case KVM_EXIT_SYSTEM_EVENT:
            default:
                DPRINTF("kvm_arch_handle_exit\n");
                ret = kvm_arch_handle_exit(cpu, run);
                break;
            }
        } while (ret == 0);
        /* omitted */
        return ret;
    }

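kvm_arch_pre_run first performs some preparation before the run, such as injecting NMI and SMI interrupts. QEMU then issues ioctl(KVM_RUN) on the vCPU to start it. While handling this ioctl, the KVM module executes the corresponding VMX instruction, switches the physical CPU that runs this vCPU from VMX root mode to VMX non-root mode, and begins executing guest code. If an event inside the guest causes a VM exit, control returns to KVM. If KVM cannot handle the exit, it is passed on to QEMU: when ioctl(KVM_RUN) returns, QEMU calls kvm_arch_post_run for some initial processing and then uses the data in kvm_run, the memory shared between QEMU and KVM, to determine the exit reason and react to it. For an I/O exit, kvm_handle_io dispatches the access and eventually calls the callback of the device that registered the I/O port. Much of the data in kvm_run is used here. If the exit was caused by an MMIO access, address_space_rw is called; it finds the device that registered the MMIO region and calls the corresponding callback.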
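The loop described above can be modeled without a kernel. In the sketch below, a scripted array of exit reasons plays the role of the guest, demo_kvm_run() stands in for ioctl(KVM_RUN), and the exit_reason field plays the role of kvm_run->exit_reason. All demo_ names and the three exit reasons are hypothetical simplifications; the real loop handles many more cases.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical exit reasons; the real constants are KVM_EXIT_*. */
enum demo_exit { DEMO_EXIT_IO, DEMO_EXIT_MMIO, DEMO_EXIT_SHUTDOWN };

struct demo_vcpu {
    const enum demo_exit *script;   /* exits the fake "kernel" will report */
    size_t len, pos;
    enum demo_exit exit_reason;     /* plays the role of kvm_run->exit_reason */
    int io_handled, mmio_handled;
};

/* Stand-in for ioctl(KVM_RUN): "runs" the guest until the next exit. */
static int demo_kvm_run(struct demo_vcpu *v)
{
    assert(v->pos < v->len);
    v->exit_reason = v->script[v->pos++];
    return 0;
}

/* Same shape as kvm_cpu_exec(): keep looping while exits are handled (ret == 0). */
static int demo_cpu_exec(struct demo_vcpu *v)
{
    int ret;

    do {
        if (demo_kvm_run(v) < 0)
            return -1;
        switch (v->exit_reason) {
        case DEMO_EXIT_IO:
            v->io_handled++;        /* real code: kvm_handle_io() */
            ret = 0;
            break;
        case DEMO_EXIT_MMIO:
            v->mmio_handled++;      /* real code: address_space_rw() */
            ret = 0;
            break;
        default:
            ret = 1;                /* leave the loop and let the caller decide */
            break;
        }
    } while (ret == 0);
    return ret;
}
```

Feeding the loop a script of I/O, MMIO, I/O and shutdown exits makes it re-enter the fake guest three times and return only on the last one.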

[Figure: relationship between QEMU, KVM and the VM]

Running a vCPU in KVM

kvm_vcpu_ioctl

The request is handled by kvm_vcpu_ioctl; the bulk of the work is eventually done by vcpu_run() in arch/x86/kvm/x86.c:

    kvm_vcpu_ioctl
        kvm_arch_vcpu_ioctl_run
            vcpu_run

    static struct file_operations kvm_vcpu_fops = {
        .release        = kvm_vcpu_release,
        .unlocked_ioctl = kvm_vcpu_ioctl,
        .mmap           = kvm_vcpu_mmap,
        .llseek         = noop_llseek,
        KVM_COMPAT(kvm_vcpu_compat_ioctl),
    };

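How does kvm_vcpu_ioctl() make sure that the request comes from the right vCPU thread? It first checks that the caller shares the address space of the process that created the VM, and then, for KVM_RUN, whether the thread running this vCPU has changed: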

    if (vcpu->kvm->mm != current->mm)
        return -EIO;

    switch (ioctl) {
    case KVM_RUN: {
        struct pid *oldpid;
        r = -EINVAL;
        if (arg)
            goto out;
        oldpid = rcu_access_pointer(vcpu->pid);    /* the thread running this vCPU may have changed */
        if (unlikely(oldpid != task_pid(current))) {
            /* The thread running this VCPU changed. */
            struct pid *newpid;

            r = kvm_arch_vcpu_run_pid_change(vcpu);
            if (r)
                break;

            newpid = get_task_pid(current, PIDTYPE_PID);
            rcu_assign_pointer(vcpu->pid, newpid);  /* thread changed: point vcpu->pid at the current task's pid */
            if (oldpid)
                synchronize_rcu();
            put_pid(oldpid);
        }
        /* Author's note: vCPU characteristics could be collected here and the
         * thread running the vCPU marked; but if the vCPU characteristics are
         * collected, is marking the thread still necessary? */
        r = kvm_arch_vcpu_ioctl_run(vcpu);          /* enter the architecture-specific vCPU run code */
        trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
        break;
    }
    /* remaining cases omitted */
    }

kvm_arch_vcpu_ioctl_run

Next comes kvm_arch_vcpu_ioctl_run(); the x86 version is analyzed here:

    int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
    {
        struct kvm_run *kvm_run = vcpu->run;
        int r;

        vcpu_load(vcpu);
        /* code omitted */
        if (kvm_run->immediate_exit)
            r = -EINTR;
        else
            r = vcpu_run(vcpu);    /* the main work happens in vcpu_run() */

    out:
        kvm_put_guest_fpu(vcpu);
        if (kvm_run->kvm_valid_regs)
            store_regs(vcpu);
        post_kvm_run_save(vcpu);
        kvm_sigset_deactivate(vcpu);

        vcpu_put(vcpu);
        return r;
    }

vcpu_load and vcpu_put

vcpu_load loads a vCPU onto the current physical CPU; vcpu_put does the opposite.

KVM defines a per-CPU variable, kvm_running_vcpu, that records which vCPU (if any) is currently running on each physical CPU.

    static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);

vcpu_load() mainly assigns kvm_running_vcpu:

    /*
     * Switches to specified vcpu, until a matching vcpu_put()
     */
    void vcpu_load(struct kvm_vcpu *vcpu)
    {
        int cpu = get_cpu();                         /* disable preemption and return this CPU's id */

        __this_cpu_write(kvm_running_vcpu, vcpu);    /* set the per-CPU kvm_running_vcpu to this vCPU */
        preempt_notifier_register(&vcpu->preempt_notifier);
        kvm_arch_vcpu_load(vcpu, cpu);
        put_cpu();                                   /* re-enable preemption */
    }
    EXPORT_SYMBOL_GPL(vcpu_load);

vcpu_put() is the counterpart of vcpu_load(), and the two are used in pairs.

    void vcpu_put(struct kvm_vcpu *vcpu)
    {
        preempt_disable();
        kvm_arch_vcpu_put(vcpu);
        preempt_notifier_unregister(&vcpu->preempt_notifier);
        __this_cpu_write(kvm_running_vcpu, NULL);
        preempt_enable();
    }
    EXPORT_SYMBOL_GPL(vcpu_put);
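The bookkeeping done by the vcpu_load()/vcpu_put() pair can be illustrated with an ordinary array indexed by CPU number in place of the per-CPU variable. This is a single-threaded user-space model with hypothetical demo_ names; it ignores preemption, preempt notifiers and the architecture hooks.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4

struct demo_lvcpu {
    int id;
    int cpu;    /* physical CPU it is loaded on, or -1 */
};

/* One slot per physical CPU, standing in for the per-CPU kvm_running_vcpu. */
static struct demo_lvcpu *demo_running[DEMO_NR_CPUS];

/* Model of vcpu_load(): record the vCPU as running on the given CPU. */
static void demo_vcpu_load(struct demo_lvcpu *v, int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    demo_running[cpu] = v;
    v->cpu = cpu;
}

/* Model of vcpu_put(): clear the slot again. */
static void demo_vcpu_put(struct demo_lvcpu *v)
{
    assert(v->cpu >= 0 && demo_running[v->cpu] == v);
    demo_running[v->cpu] = NULL;
    v->cpu = -1;
}

/* Which vCPU, if any, is loaded on this physical CPU. */
static struct demo_lvcpu *demo_get_running(int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    return demo_running[cpu];
}
```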

vcpu_run

    static int vcpu_run(struct kvm_vcpu *vcpu)
    {
        /* omitted */
        for (;;) {
            if (kvm_vcpu_running(vcpu)) {
                r = vcpu_enter_guest(vcpu);    /* the vCPU can run: enter the guest */
            } else {
                r = vcpu_block(kvm, vcpu);     /* the vCPU cannot run: ignoring polling, call schedule() and give up the CPU */
            }

            if (r <= 0)
                break;
            /* omitted */
        }
        /* omitted */
        return r;
    }

    /*
     * Two things are checked:
     * 1. whether mp_state in vcpu.arch is KVM_MP_STATE_RUNNABLE
     * 2. apf.halted in vcpu.arch, which says whether the guest needs a memory
     *    page that the host has swapped out; if the vCPU is paused because of
     *    async page fault, it cannot run either
     */
    static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
    {
        if (is_guest_mode(vcpu))
            kvm_x86_ops.nested_ops->check_events(vcpu);

        return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
                !vcpu->arch.apf.halted);
    }

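If vcpu_run decides that the vCPU cannot run at the moment, it calls vcpu_block, which calls kvm_vcpu_block. Leaving the polling mechanism aside, kvm_vcpu_block calls schedule() to request rescheduling and give up the CPU.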
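The interplay between the runnable check, guest entry and blocking can be traced with a small model. Everything prefixed with demo_ is hypothetical: demo_enter_guest() returns 1 a fixed number of times and then 0, and demo_block() pretends that a wakeup immediately makes the vCPU runnable, which is a strong simplification of the scheduling that really takes place.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_mp_state { DEMO_MP_RUNNABLE, DEMO_MP_HALTED };

struct demo_kvcpu {
    enum demo_mp_state mp_state;
    bool apf_halted;      /* waiting for a page the host swapped out */
    int entries_left;     /* guest entries that return 1 before one returns 0 */
    int entered, blocked;
};

/* Mirrors the two conditions checked by kvm_vcpu_running(). */
static bool demo_vcpu_running(const struct demo_kvcpu *v)
{
    return v->mp_state == DEMO_MP_RUNNABLE && !v->apf_halted;
}

/* Stand-in for vcpu_enter_guest(): 1 = keep looping, 0 = go to user space. */
static int demo_enter_guest(struct demo_kvcpu *v)
{
    v->entered++;
    return v->entries_left-- > 0 ? 1 : 0;
}

/* Stand-in for vcpu_block(): pretend a wakeup makes the vCPU runnable. */
static int demo_block(struct demo_kvcpu *v)
{
    v->blocked++;
    v->mp_state = DEMO_MP_RUNNABLE;
    v->apf_halted = false;
    return 1;
}

/* Same loop shape as vcpu_run(): stop as soon as a step returns <= 0. */
static int demo_vcpu_run(struct demo_kvcpu *v)
{
    int r;

    assert(v->entries_left >= 0);
    for (;;) {
        r = demo_vcpu_running(v) ? demo_enter_guest(v) : demo_block(v);
        if (r <= 0)
            break;
    }
    return r;
}
```

Starting from a halted vCPU with entries_left set to 2, the loop blocks once, enters the guest three times, and returns 0 on the third entry.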

    vcpu_block
        kvm_vcpu_block
            schedule

    void kvm_vcpu_block(struct kvm_vcpu *vcpu)
    {
        /* omitted */
        for (;;) {
            set_current_state(TASK_INTERRUPTIBLE);

            if (kvm_vcpu_check_block(vcpu) < 0)
                break;

            waited = true;
            schedule();
        }
        /* omitted */
    }

vcpu_enter_guest

If this function returns 1, vcpu_run() stays in its for loop; otherwise control returns to user space.

    /*
     * Returns 1 to let vcpu_run() continue the guest execution loop without
     * exiting to the userspace.  Otherwise, the value will be returned to the
     * userspace.
     */
    static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
    {
        /* omitted */
        r = kvm_mmu_reload(vcpu);
        if (unlikely(r)) {
            goto cancel_injection;
        }

        preempt_disable();                          /* disable preemption */

        kvm_x86_ops.prepare_guest_switch(vcpu);     /* save host state so the host keeps working after the guest exits */

        /*
         * Disable IRQs before setting IN_GUEST_MODE.  Posted interrupt
         * IPI are then delayed after guest entry, which ensures that they
         * result in virtual interrupt delivery.
         * External interrupt requests are disabled on this CPU from here on.
         */
        local_irq_disable();
        vcpu->mode = IN_GUEST_MODE;                 /* enter guest mode */
        /* omitted */
        trace_kvm_entry(vcpu->vcpu_id);             /* traces kvm entry; kvm exit is traced in vmx_vcpu_run() */
        /* omitted */
        exit_fastpath = kvm_x86_ops.run(vcpu);      /* calls vmx_vcpu_run() */
        /* omitted */
        vcpu->arch.last_vmentry_cpu = vcpu->cpu;
        vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());

        vcpu->mode = OUTSIDE_GUEST_MODE;            /* leave guest mode */
        smp_wmb();

        kvm_x86_ops.handle_exit_irqoff(vcpu);       /* after the exit, handle external interrupts */

        /*
         * Consume any pending interrupts, including the possible source of
         * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
         * An instruction is required after local_irq_enable() to fully unblock
         * interrupts on processors that implement an interrupt shadow, the
         * stat.exits increment will do nicely.
         */
        kvm_before_interrupt(vcpu);
        local_irq_enable();
        ++vcpu->stat.exits;                         /* count the exit */
        local_irq_disable();
        kvm_after_interrupt(vcpu);

        if (lapic_in_kernel(vcpu)) {
            s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
            if (delta != S64_MIN) {
                trace_kvm_wait_lapic_expire(vcpu->vcpu_id, delta);
                vcpu->arch.apic->lapic_timer.advance_expire_delta = S64_MIN;
            }
        }

        local_irq_enable();
        preempt_enable();
        /* omitted */
        /* no external interrupt is left to handle at this point; this mostly
         * records statistics about why the guest exited */
        r = kvm_x86_ops.handle_exit(vcpu, exit_fastpath);
        return r;

    cancel_injection:
        if (req_immediate_exit)
            kvm_make_request(KVM_REQ_EVENT, vcpu);
        kvm_x86_ops.cancel_injection(vcpu);
        if (unlikely(vcpu->arch.apic_attention))
            kvm_lapic_sync_from_vapic(vcpu);
    out:
        return r;
    }

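This function calls into vmx_vcpu_run for the given kvm_vcpu. By the time vmx_vcpu_run returns, one full round of VM entry and VM exit has already taken place.

vcpu->mode can take the following values:

    enum {
        OUTSIDE_GUEST_MODE,
        IN_GUEST_MODE,
        EXITING_GUEST_MODE,
        READING_SHADOW_PAGE_TABLES,
    };

While the CPU runs in guest mode, interrupts are disabled on the host side, so the CPU executing guest code does not take external interrupts directly. Instead, an external interrupt forces the CPU out of guest mode and into VMX root mode. External interrupts are handled before handle_exit, so when handle_exit later deals with an external-interrupt exit there is nothing substantial left to do, and it only updates statistics.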

vmx_vcpu_run

    static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
    {
        fastpath_t exit_fastpath;
        struct vcpu_vmx *vmx = to_vmx(vcpu);
        unsigned long cr3, cr4;

    reenter_guest:
        /* Record the guest's net vcpu time for enforced NMI injections. */
        if (unlikely(!enable_vnmi &&
                     vmx->loaded_vmcs->soft_vnmi_blocked))
            vmx->loaded_vmcs->entry_time = ktime_get();

        /* Don't enter VMX if guest state is invalid, let the exit handler
           start emulation until we arrive back to a valid state */
        if (vmx->emulation_required)
            return EXIT_FASTPATH_NONE;

        if (vmx->ple_window_dirty) {
            vmx->ple_window_dirty = false;
            vmcs_write32(PLE_WINDOW, vmx->ple_window);
        }

        /*
         * We did this in prepare_switch_to_guest, because it needs to
         * be within srcu_read_lock.
         */
        WARN_ON_ONCE(vmx->nested.need_vmcs12_to_shadow_sync);

        if (kvm_register_is_dirty(vcpu, VCPU_REGS_RSP))
            vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
        if (kvm_register_is_dirty(vcpu, VCPU_REGS_RIP))
            vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);

        cr3 = __get_current_cr3_fast();
        if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
            vmcs_writel(HOST_CR3, cr3);
            vmx->loaded_vmcs->host_state.cr3 = cr3;
        }

        cr4 = cr4_read_shadow();
        if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
            vmcs_writel(HOST_CR4, cr4);
            vmx->loaded_vmcs->host_state.cr4 = cr4;
        }

        /* When single-stepping over STI and MOV SS, we must clear the
         * corresponding interruptibility bits in the guest state. Otherwise
         * vmentry fails as it then expects bit 14 (BS) in pending debug
         * exceptions being set, but that's not correct for the guest debugging
         * case. */
        if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
            vmx_set_interrupt_shadow(vcpu, 0);

        kvm_load_guest_xsave_state(vcpu);

        pt_guest_enter(vmx);

        atomic_switch_perf_msrs(vmx);

        if (enable_preemption_timer)
            vmx_update_hv_timer(vcpu);

        if (lapic_in_kernel(vcpu) &&
            vcpu->arch.apic->lapic_timer.timer_advance_ns)
            kvm_wait_lapic_expire(vcpu);

        /*
         * If this vCPU has touched SPEC_CTRL, restore the guest's value if
         * it's non-zero. Since vmentry is serialising on affected CPUs, there
         * is no need to worry about the conditional branch over the wrmsr
         * being speculatively taken.
         */
        x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);

        /* The actual VMENTER/EXIT is in the .noinstr.text section. */
        vmx_vcpu_enter_exit(vcpu, vmx);

        /*
         * We do not use IBRS in the kernel. If this vCPU has used the
         * SPEC_CTRL MSR it may have left it on; save the value and
         * turn it off. This is much more efficient than blindly adding
         * it to the atomic save/restore list. Especially as the former
         * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
         *
         * For non-nested case:
         * If the L01 MSR bitmap does not intercept the MSR, then we need to
         * save it.
         *
         * For nested case:
         * If the L02 MSR bitmap does not intercept the MSR, then we need to
         * save it.
         */
        if (unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL)))
            vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);

        x86_spec_ctrl_restore_host(vmx->spec_ctrl, 0);

        /* All fields are clean at this point */
        if (static_branch_unlikely(&enable_evmcs))
            current_evmcs->hv_clean_fields |=
                HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;

        if (static_branch_unlikely(&enable_evmcs))
            current_evmcs->hv_vp_id = vcpu->arch.hyperv.vp_index;

        /* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
        if (vmx->host_debugctlmsr)
            update_debugctlmsr(vmx->host_debugctlmsr);

    #ifndef CONFIG_X86_64
        /*
         * The sysexit path does not restore ds/es, so we must set them to
         * a reasonable value ourselves.
         *
         * We can't defer this to vmx_prepare_switch_to_host() since that
         * function may be executed in interrupt context, which saves and
         * restore segments around it, nullifying its effect.
         */
        loadsegment(ds, __USER_DS);
        loadsegment(es, __USER_DS);
    #endif

        vmx_register_cache_reset(vcpu);

        pt_guest_exit(vmx);

        kvm_load_host_xsave_state(vcpu);

        vmx->nested.nested_run_pending = 0;
        vmx->idt_vectoring_info = 0;

        if (unlikely(vmx->fail)) {
            vmx->exit_reason = 0xdead;
            return EXIT_FASTPATH_NONE;
        }

        vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
        if (unlikely((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY))
            kvm_machine_check();

        trace_kvm_exit(vmx->exit_reason, vcpu, KVM_ISA_VMX);

        if (unlikely(vmx->exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
            return EXIT_FASTPATH_NONE;

        vmx->loaded_vmcs->launched = 1;
        vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);

        vmx_recover_nmi_blocking(vmx);
        vmx_complete_interrupts(vmx);

        if (is_guest_mode(vcpu))
            return EXIT_FASTPATH_NONE;

        exit_fastpath = vmx_exit_handlers_fastpath(vcpu);
        if (exit_fastpath == EXIT_FASTPATH_REENTER_GUEST) {
            if (!kvm_vcpu_exit_request(vcpu)) {
                /*
                 * FIXME: this goto should be a loop in vcpu_enter_guest,
                 * but it would incur the cost of a retpoline for now.
                 * Revisit once static calls are available.
                 */
                if (vcpu->arch.apicv_active)
                    vmx_sync_pir_to_irr(vcpu);
                goto reenter_guest;
            }
            exit_fastpath = EXIT_FASTPATH_EXIT_HANDLED;
        }

        return exit_fastpath;
    }

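The function first writes some VMCS fields according to the vCPU state and then puts the CPU into guest mode, at which point the CPU starts executing guest code. In older kernels this was done with the inline assembly ASM_VMX_VMLAUNCH, and the address reached on exit was vmx_return; in the code shown above, entry and exit happen inside vmx_vcpu_enter_exit().

vCPU Exit

x86 architecture

vCPU exit events are handled by kvm_x86_ops.handle_exit(), which is called in arch/x86/kvm/x86.c:

    static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
    {
        /* omitted */
        r = kvm_x86_ops.handle_exit(vcpu, exit_fastpath);
    }

Exit reasons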

#define VMX_EXIT_REASONS_FAILED_VMENTRY         0x80000000

#define EXIT_REASON_EXCEPTION_NMI       0
#define EXIT_REASON_EXTERNAL_INTERRUPT  1
#define EXIT_REASON_TRIPLE_FAULT        2
#define EXIT_REASON_INIT_SIGNAL         3

#define EXIT_REASON_INTERRUPT_WINDOW    7
#define EXIT_REASON_NMI_WINDOW          8
#define EXIT_REASON_TASK_SWITCH         9
#define EXIT_REASON_CPUID               10
#define EXIT_REASON_HLT                 12
#define EXIT_REASON_INVD                13
#define EXIT_REASON_INVLPG              14
#define EXIT_REASON_RDPMC               15
#define EXIT_REASON_RDTSC               16
#define EXIT_REASON_VMCALL              18
#define EXIT_REASON_VMCLEAR             19
#define EXIT_REASON_VMLAUNCH            20
#define EXIT_REASON_VMPTRLD             21
#define EXIT_REASON_VMPTRST             22
#define EXIT_REASON_VMREAD              23
#define EXIT_REASON_VMRESUME            24
#define EXIT_REASON_VMWRITE             25
#define EXIT_REASON_VMOFF               26
#define EXIT_REASON_VMON                27
#define EXIT_REASON_CR_ACCESS           28
#define EXIT_REASON_DR_ACCESS           29
#define EXIT_REASON_IO_INSTRUCTION      30
#define EXIT_REASON_MSR_READ            31
#define EXIT_REASON_MSR_WRITE           32
#define EXIT_REASON_INVALID_STATE       33
#define EXIT_REASON_MSR_LOAD_FAIL       34
#define EXIT_REASON_MWAIT_INSTRUCTION   36
#define EXIT_REASON_MONITOR_TRAP_FLAG   37
#define EXIT_REASON_MONITOR_INSTRUCTION 39
#define EXIT_REASON_PAUSE_INSTRUCTION   40
#define EXIT_REASON_MCE_DURING_VMENTRY  41
#define EXIT_REASON_TPR_BELOW_THRESHOLD 43
#define EXIT_REASON_APIC_ACCESS         44
#define EXIT_REASON_EOI_INDUCED         45
#define EXIT_REASON_GDTR_IDTR           46
#define EXIT_REASON_LDTR_TR             47
#define EXIT_REASON_EPT_VIOLATION       48
#define EXIT_REASON_EPT_MISCONFIG       49
#define EXIT_REASON_INVEPT              50
#define EXIT_REASON_RDTSCP              51
#define EXIT_REASON_PREEMPTION_TIMER    52
#define EXIT_REASON_INVVPID             53
#define EXIT_REASON_WBINVD              54
#define EXIT_REASON_XSETBV              55
#define EXIT_REASON_APIC_WRITE          56
#define EXIT_REASON_RDRAND              57
#define EXIT_REASON_INVPCID             58
#define EXIT_REASON_VMFUNC              59
#define EXIT_REASON_ENCLS               60
#define EXIT_REASON_RDSEED              61
#define EXIT_REASON_PML_FULL            62
#define EXIT_REASON_XSAVES              63
#define EXIT_REASON_XRSTORS             64
#define EXIT_REASON_UMWAIT              67
#define EXIT_REASON_TPAUSE              68

vmx_handle_exit()

An exit eventually reaches vmx_handle_exit(), which dispatches it to the matching handler according to the exit reason:

    /*
     * The exit handlers return 1 if the exit was handled fully and guest execution
     * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
     * to be done to userspace and return 0.
     */
    static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
        [EXIT_REASON_EXCEPTION_NMI]           = handle_exception_nmi,        /* exceptions and non-maskable interrupts (NMI) */
        [EXIT_REASON_EXTERNAL_INTERRUPT]      = handle_external_interrupt,   /* always returns 1, does nothing specific */
        [EXIT_REASON_TRIPLE_FAULT]            = handle_triple_fault,         /* always returns 0, KVM exit: shutdown */
        [EXIT_REASON_NMI_WINDOW]              = handle_nmi_window,           /* always returns 1 */
        [EXIT_REASON_IO_INSTRUCTION]          = handle_io,                   /* I/O instructions */
        [EXIT_REASON_CR_ACCESS]               = handle_cr,                   /* control register access */
        [EXIT_REASON_DR_ACCESS]               = handle_dr,                   /* debug register access */
        [EXIT_REASON_CPUID]                   = kvm_emulate_cpuid,           /* emulates CPUID; works on EAX and related registers */
        [EXIT_REASON_MSR_READ]                = kvm_emulate_rdmsr,           /* emulates RDMSR; essentially works on EAX */
        [EXIT_REASON_MSR_WRITE]               = kvm_emulate_wrmsr,           /* emulates WRMSR; works on MSRs */
        [EXIT_REASON_INTERRUPT_WINDOW]        = handle_interrupt_window,     /* always returns 1 */
        [EXIT_REASON_HLT]                     = kvm_emulate_halt,            /* HLT instruction, halts the CPU */
        [EXIT_REASON_INVD]                    = handle_invd,                 /* calls kvm_emulate_instruction */
        [EXIT_REASON_INVLPG]                  = handle_invlpg,               /* calls kvm_skip_emulated_instruction */
        [EXIT_REASON_RDPMC]                   = handle_rdpmc,                /* x86 RDPMC instruction, reads PMU registers */
        [EXIT_REASON_VMCALL]                  = handle_vmcall,               /* VMCALL instruction, calls kvm_emulate_hypercall */
        [EXIT_REASON_VMCLEAR]                 = handle_vmx_instruction,
        [EXIT_REASON_VMLAUNCH]                = handle_vmx_instruction,
        [EXIT_REASON_VMPTRLD]                 = handle_vmx_instruction,
        [EXIT_REASON_VMPTRST]                 = handle_vmx_instruction,
        [EXIT_REASON_VMREAD]                  = handle_vmx_instruction,
        [EXIT_REASON_VMRESUME]                = handle_vmx_instruction,
        [EXIT_REASON_VMWRITE]                 = handle_vmx_instruction,
        [EXIT_REASON_VMOFF]                   = handle_vmx_instruction,
        [EXIT_REASON_VMON]                    = handle_vmx_instruction,      /* handle_vmx_instruction always returns 1 */
        [EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,  /* register update, returns 1 */
        [EXIT_REASON_APIC_ACCESS]             = handle_apic_access,          /* APIC access */
        [EXIT_REASON_APIC_WRITE]              = handle_apic_write,           /* returns 1 */
        [EXIT_REASON_EOI_INDUCED]             = handle_apic_eoi_induced,     /* always returns 1 */
        [EXIT_REASON_WBINVD]                  = handle_wbinvd,               /* register operation */
        [EXIT_REASON_XSETBV]                  = handle_xsetbv,               /* register operation */
        [EXIT_REASON_TASK_SWITCH]             = handle_task_switch,          /* emulates a hardware task switch */
        [EXIT_REASON_MCE_DURING_VMENTRY]      = handle_machine_check,        /* always returns 1 */
        [EXIT_REASON_GDTR_IDTR]               = handle_desc,
        [EXIT_REASON_LDTR_TR]                 = handle_desc,
        [EXIT_REASON_EPT_VIOLATION]           = handle_ept_violation,        /* EPT violation (fault on a guest-physical address); also touches NMI blocking state */
        [EXIT_REASON_EPT_MISCONFIG]           = handle_ept_misconfig,        /* EPT misconfiguration */
        [EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,                /* PAUSE */
        [EXIT_REASON_MWAIT_INSTRUCTION]       = handle_mwait,                /* emulates MWAIT as a NOP */
        [EXIT_REASON_MONITOR_TRAP_FLAG]       = handle_monitor_trap,         /* returns 1 */
        [EXIT_REASON_MONITOR_INSTRUCTION]     = handle_monitor,              /* emulates MONITOR as a NOP */
        [EXIT_REASON_INVEPT]                  = handle_vmx_instruction,
        [EXIT_REASON_INVVPID]                 = handle_vmx_instruction,
        [EXIT_REASON_RDRAND]                  = handle_invalid_op,           /* returns 1 */
        [EXIT_REASON_RDSEED]                  = handle_invalid_op,
        [EXIT_REASON_PML_FULL]                = handle_pml_full,             /* returns 1 */
        [EXIT_REASON_INVPCID]                 = handle_invpcid,              /* memory related, PCIDs */
        [EXIT_REASON_VMFUNC]                  = handle_vmx_instruction,      /* returns 1 */
        [EXIT_REASON_PREEMPTION_TIMER]        = handle_preemption_timer,     /* returns 1 */
        [EXIT_REASON_ENCLS]                   = handle_encls,                /* returns 1 */
    };
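The table above is a sparse array of function pointers indexed by exit reason, built with C designated initializers. A reduced sketch of the same technique follows. The demo_ names are hypothetical, only three reasons are modeled (numbered as in the list of exit reasons), and returning -1 for a missing handler is a convention invented for this example, not what KVM does.

```c
#include <assert.h>
#include <stddef.h>

/* A few exit reasons, numbered as in the list of exit reasons. */
#define DEMO_EXIT_REASON_TRIPLE_FAULT 2
#define DEMO_EXIT_REASON_CPUID        10
#define DEMO_EXIT_REASON_HLT          12
#define DEMO_NR_REASONS               16

struct demo_xvcpu {
    int cpuid_count;
    int shutdown;
};

/* Handled fully in the "kernel": the guest may resume. */
static int demo_handle_cpuid(struct demo_xvcpu *v)
{
    v->cpuid_count++;
    return 1;
}

/* Needs user space: record what happened and return 0. */
static int demo_handle_triple_fault(struct demo_xvcpu *v)
{
    v->shutdown = 1;
    return 0;
}

/* Sparse table built with designated initializers; gaps stay NULL. */
static int (*const demo_handlers[DEMO_NR_REASONS])(struct demo_xvcpu *) = {
    [DEMO_EXIT_REASON_TRIPLE_FAULT] = demo_handle_triple_fault,
    [DEMO_EXIT_REASON_CPUID]        = demo_handle_cpuid,
};

/* 1: resume the guest; 0: return to user space; -1: no handler (demo convention). */
static int demo_handle_exit(struct demo_xvcpu *v, unsigned int reason)
{
    assert(v != NULL);
    if (reason >= DEMO_NR_REASONS || demo_handlers[reason] == NULL)
        return -1;
    return demo_handlers[reason](v);
}
```

The bounds check and the NULL check matter because a sparse table leaves every unlisted slot as a null pointer.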

Causes of VM exit

Many events and instructions can cause a VM exit. Some of them are permanently enabled; others can be switched on or off through the VMCS control fields.

Unconditional reasons for VM exit include:

  • CPUID
  • RDMSR and WRMSR unless MSR bitmap is used
  • most of VMX instructions
  • INIT signal
  • SIPI signal - does not result in exit if the processor is not in wait-for-SIPI state
  • triple fault
  • task switches (hardware, including
  • VM entry failure

There are too many controllable exit reasons to describe each one separately, but most of them can be classified as one of:

  • interrupts or interrupt windows

  • I/O ports access

  • memory access - controlled by EPT

  • HLT/PAUSE and pre-emption timer - useful for multiple VMs running on one physical CPU

  • changes to descriptor tables and control registers

  • APIC access

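kvm_userspace_exit

In virt/kvm/kvm_main.c, kvm_vcpu_ioctl() handles KVM_RUN by calling kvm_arch_vcpu_ioctl_run(), the architecture-specific vCPU run function. When that function returns, the kernel KVM layer can no longer handle the vCPU's situation on its own, and control has to go back to QEMU, that is, from kernel mode to user mode.

    r = kvm_arch_vcpu_ioctl_run(vcpu);
    trace_kvm_userspace_exit(vcpu->run->exit_reason, r);

The kernel calls the return from kvm_arch_vcpu_ioctl_run() a userspace exit. The possible reasons are: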

#define KVM_EXIT_UNKNOWN          0
#define KVM_EXIT_EXCEPTION        1
#define KVM_EXIT_IO               2
#define KVM_EXIT_HYPERCALL        3
#define KVM_EXIT_DEBUG            4
#define KVM_EXIT_HLT              5
#define KVM_EXIT_MMIO             6
#define KVM_EXIT_IRQ_WINDOW_OPEN  7
#define KVM_EXIT_SHUTDOWN         8
#define KVM_EXIT_FAIL_ENTRY       9
#define KVM_EXIT_INTR             10
#define KVM_EXIT_SET_TPR          11
#define KVM_EXIT_TPR_ACCESS       12
#define KVM_EXIT_S390_SIEIC       13
#define KVM_EXIT_S390_RESET       14
#define KVM_EXIT_DCR              15 /* deprecated */
#define KVM_EXIT_NMI              16
#define KVM_EXIT_INTERNAL_ERROR   17
#define KVM_EXIT_OSI              18
#define KVM_EXIT_PAPR_HCALL       19
#define KVM_EXIT_S390_UCONTROL    20
#define KVM_EXIT_WATCHDOG         21
#define KVM_EXIT_S390_TSCH        22
#define KVM_EXIT_EPR              23
#define KVM_EXIT_SYSTEM_EVENT     24
#define KVM_EXIT_S390_STSI        25
#define KVM_EXIT_IOAPIC_EOI       26
#define KVM_EXIT_HYPERV           27
#define KVM_EXIT_ARM_NISV         28
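On the userspace side, these codes are typically consumed by a loop that runs the VCPU, inspects the exit reason and then either resumes the guest or stops. The sketch below models only that control flow and does not talk to /dev/kvm: `struct demo_run` stands in for the shared `struct kvm_run`, the exit reasons are fed in from an array instead of coming from ioctl(KVM_RUN), and all `demo_` names are hypothetical. The four KVM_EXIT_ values are copied from the list above; real code would take them from `<linux/kvm.h>`.

```c
#include <assert.h>
#include <stddef.h>

/* Values copied from the list above; normally provided by <linux/kvm.h>. */
#define KVM_EXIT_IO        2
#define KVM_EXIT_HLT       5
#define KVM_EXIT_MMIO      6
#define KVM_EXIT_SHUTDOWN  8

/* Stand-in for struct kvm_run, reduced to the one field used here. */
struct demo_run {
    int exit_reason;
};

/* Counters so the effect of the dispatch can be observed. */
struct demo_stats {
    int io;
    int mmio;
    int other;
};

/* Returns 1 to run the VCPU again, 0 to leave the VCPU loop. */
static int demo_handle_userspace_exit(const struct demo_run *run,
                                      struct demo_stats *st)
{
    assert(run != NULL && st != NULL);
    switch (run->exit_reason) {
    case KVM_EXIT_IO:        /* a device model would emulate port I/O here */
        st->io++;
        return 1;
    case KVM_EXIT_MMIO:      /* ... or a memory-mapped register access */
        st->mmio++;
        return 1;
    case KVM_EXIT_SHUTDOWN:  /* the guest is gone, stop running it */
        return 0;
    default:                 /* anything this sketch does not model */
        st->other++;
        return 0;
    }
}

/*
 * Replays a recorded sequence of exit reasons through the dispatcher.
 * Returns the exit reason that ended the loop, or -1 if the sequence
 * ran out while the loop still wanted to continue.
 */
static int demo_vcpu_loop(const int *reasons, size_t n, struct demo_stats *st)
{
    struct demo_run run;
    size_t i;

    for (i = 0; i < n; i++) {
        run.exit_reason = reasons[i];  /* real code: ioctl(vcpu_fd, KVM_RUN) */
        if (!demo_handle_userspace_exit(&run, st))
            return run.exit_reason;
    }
    return -1;
}
```

Feeding the sequence IO, MMIO, IO, SHUTDOWN, IO into `demo_vcpu_loop` counts two I/O exits and one MMIO exit and then stops at KVM_EXIT_SHUTDOWN without looking at the last entry.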

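VCPU scheduling

Modern machines are usually symmetric multiprocessing (SMP) systems, and the operating system is generally free to schedule a VCPU onto any physical CPU. Running a VCPU on different physical CPUs affects the performance of the virtual machine: as long as the VCPU stays on the same physical CPU, re-entering the guest only takes a VMRESUME instruction, but switching to a different physical CPU requires the VMCLEAR, VMPTRLD and VMLAUNCH instructions.

Simplified steps for scheduling a VCPU onto a different physical CPU (what KVM actually does is more involved):

  1. On the source physical CPU, execute VMCLEAR. This guarantees that the VMCS data cached by that CPU is flushed to memory.
  2. On the destination physical CPU, execute VMPTRLD with the physical address of the VCPU's VMCS as the operand.
  3. On the destination physical CPU, execute VMLAUNCH.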

Each physical CPU has a per-CPU variable, current_vmcs, that points to a VMCS structure. It is defined in vmx.c:

DEFINE_PER_CPU(struct vmcs *, current_vmcs);

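Each VCPU is also allocated a VMCS of its own. It is created in vmx_create_vcpu and stored in the vmcs member of the loaded_vmcs that belongs to the VCPU's vcpu_vmx structure. Scheduling VCPUs essentially means sharing a physical CPU's per-CPU current_vmcs pointer among all VCPUs: at any given moment it points to the VMCS of one of them.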

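  1. The kernel calls vcpu_load to associate VCPU1 with PCPU1. On the first ioctl(KVM_RUN), vcpu_load is called at the start of kvm_vcpu_ioctl. When the VCPU thread is scheduled back in, the association is made in kvm_sched_in instead, which goes through kvm_arch_vcpu_load to the implementation-specific load function (such as vmx_vcpu_load).
  2. While PCPU1 is executing guest code, the current thread cannot be preempted or interrupted, but an interrupt can still trigger a VM exit, which returns control from the guest to the host. After the exit and some necessary work, interrupts and preemption are enabled again, so PCPU1 may go on to schedule another thread or VCPU.
  3. When VCPU1's thread is preempted, kvm_sched_out is called. If the scheduler later runs VCPU1 on physical CPU 2 rather than PCPU1, VCPU1's state has to be associated with PCPU2.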
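The bookkeeping behind these steps can be modelled in a few lines. The sketch below is a simplified model under stated assumptions, not the vmx.c implementation: `demo_current_vmcs` plays the role of the per-CPU current_vmcs pointer, the VMX instructions are represented only by the state changes they stand for, and `struct demo_vmcs`, its `cpu` and `launched` fields and all `demo_` functions are hypothetical names. It shows why staying on the same physical CPU allows VMRESUME, while a move forces the VMCLEAR, VMPTRLD, VMLAUNCH sequence.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS 2

/* Hypothetical model of one VCPU's VMCS and where it currently lives. */
struct demo_vmcs {
    int cpu;       /* physical CPU the VMCS was last loaded on, -1 if none */
    int launched;  /* 1 once VMLAUNCH has been done since the last VMCLEAR */
};

/* Models DEFINE_PER_CPU(struct vmcs *, current_vmcs). */
static struct demo_vmcs *demo_current_vmcs[DEMO_NR_CPUS];

enum demo_entry { DEMO_VMLAUNCH, DEMO_VMRESUME };

/* Models vcpu_load: bind the VCPU's VMCS to the given physical CPU. */
static void demo_vcpu_load(struct demo_vmcs *vmcs, int cpu)
{
    assert(vmcs != NULL && cpu >= 0 && cpu < DEMO_NR_CPUS);

    if (vmcs->cpu != cpu) {
        /* Step 1, "VMCLEAR" on the source CPU: drop its cached copy. */
        if (vmcs->cpu >= 0 && demo_current_vmcs[vmcs->cpu] == vmcs)
            demo_current_vmcs[vmcs->cpu] = NULL;
        vmcs->launched = 0;  /* a cleared VMCS must be launched again */
        vmcs->cpu = cpu;
    }

    /* Step 2, "VMPTRLD" on the destination CPU: make this VMCS current. */
    demo_current_vmcs[cpu] = vmcs;
}

/* Step 3: pick the instruction needed to enter the guest. */
static enum demo_entry demo_vcpu_enter(struct demo_vmcs *vmcs)
{
    assert(vmcs != NULL);
    if (vmcs->launched)
        return DEMO_VMRESUME;
    vmcs->launched = 1;
    return DEMO_VMLAUNCH;
}
```

A VMCS that starts with `cpu = -1` is launched once on CPU 0 and resumed on later entries. After `demo_vcpu_load(&vmcs, 1)` the pointer for CPU 0 is cleared, the pointer for CPU 1 refers to the VMCS, and the next entry needs VMLAUNCH again, which is the extra cost of migration described above.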
