
OpenHCL: Evolving Azure’s virtualization model

Azure Boost is a revolutionary accelerator system designed by Microsoft that offloads server virtualization processes traditionally performed by the hypervisor and host OS onto purpose-built software and hardware. This offloading frees up CPU resources for virtual machines, resulting in improved performance and a secure foundation for your cloud workloads. 

In this blog, we will talk about some of the advances we’ve made within the Azure Host OS that allow us to provide the industry-leading benefits of Azure Boost and improve the security of our customers with other features. Azure Host OS (aka Cloud Host), if you recall, is a purpose-built minimal version of Windows that powers Azure in the data center. These Azure Host advancements, in conjunction with Azure Boost, have enabled features like Confidential VMs and Trusted Launch, improved IO performance, hardened security, and introduced VM compatibility for seamless feature delivery. These features are powered by a completely new, transparent para-virtualized layer that runs within each guest VM instance, named “OpenHCL”. OpenHCL is a para-virtualization layer built from the ground up in the Rust programming language. Rust is designed with strong memory safety principles, making it ideally suited for the virtualization layer.

Chris Oo from our team gave a talk on OpenHCL at the Linux Plumbers Conference 2024, which covers the technical design in more detail. The talk, titled “OpenHCL: A Linux based paravisor for Confidential VMs”, is available [here].

In the upcoming sections, we’ll start by exploring the virtualization landscape and how Azure’s infrastructure has evolved over time to take advantage of modern hardware architectures. We’ll then talk about the internals of this para-virtualized layer and how it supports some of the core Azure features that our customers depend on.

Virtualization models

Azure Host OS provides core virtualization services for managing compute and memory resources, as well as virtualizing devices for VMs. Under the hood, it partitions physical hardware into logically separated virtual environments, each with its own dedicated (virtual) processors, memory, and view of devices (storage, networking).

Traditional device virtualization

In traditional virtualization architecture, the host operating system handles most of the communication between the guest operating system (VM) and the underlying physical hardware (CPU, memory, device IO). For example, if the VM wishes to perform a network or storage operation (e.g., send a packet over the network, or read/write data to storage), the guest communicates with the host OS (over a shared channel called VMBus) and the host facilitates the IO operation on the guest’s behalf.

This device virtualization model is referred to as a para-virtualized IO model [wiki]. The guest OS is “enlightened”, or aware that it is running virtualized, and runs special drivers to communicate with the host. This model is simple, efficient, and widely used across most cloud providers.
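To make the para-virtualized flow concrete, here is a minimal, self-contained Rust sketch of the pattern: a guest-side driver posts IO requests onto a shared channel and a host-side backend services them on the guest’s behalf. The types and names (IoRequest, SharedRing, and so on) are illustrative stand-ins, not VMBus or Hyper-V APIs.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Illustrative request descriptor: the guest describes the IO it wants,
// and the host performs it on the guest's behalf.
#[derive(Debug)]
struct IoRequest {
    id: u32,
    op: &'static str, // e.g. "net-send" or "disk-read"
    len: usize,
}

// A toy stand-in for a shared guest/host ring (VMBus-like channel).
#[derive(Default)]
struct SharedRing {
    queue: Mutex<VecDeque<IoRequest>>,
    signal: Condvar,
}

fn main() {
    let ring = Arc::new(SharedRing::default());

    // "Host backend": drains requests and completes them for the guest.
    let host_ring = Arc::clone(&ring);
    let host = thread::spawn(move || {
        for _ in 0..3 {
            let mut q = host_ring.queue.lock().unwrap();
            while q.is_empty() {
                q = host_ring.signal.wait(q).unwrap();
            }
            let req = q.pop_front().unwrap();
            println!("host: completed {:?}", req);
        }
    });

    // "Guest driver": enqueues requests and signals the host side.
    for id in 0..3 {
        let req = IoRequest { id, op: "net-send", len: 1500 };
        ring.queue.lock().unwrap().push_back(req);
        ring.signal.notify_one();
    }

    host.join().unwrap();
}
```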

One drawback of this model is that there is significant interaction with the host OS to perform IO, which can add latency, affect throughput, or result in noisy-neighbor side effects. The performance of this model can be significantly improved by allowing the guest VM to directly access the PCIe device instead of relying on the host for communication. Bypassing the host OS data path allows for lower latency, reduced jitter, and improved VM responsiveness. This is typically called “discrete device assignment” in Microsoft documentation, and is sometimes referred to as the accelerated device model.

Figure 1: Traditional virtualization architectures rely on the Host OS to carry out device virtualization on behalf of the guest.

Accelerated Device IO

As explained above, to achieve higher IO performance the virtualization stack supports a directly assigned device, or accelerated IO, mode, where VMs can directly access and communicate with devices without Host intervention. If the VM wishes to perform an IO operation, the guest leverages special drivers that live within its context to communicate directly with the physical device.

Returning to the example above, if the VM needs to perform a network operation, it can do so more efficiently by communicating over the direct path to the network device. This VM is considered fully enlightened: it possesses the right drivers for direct communication with the device hardware. The direct data path reduces overhead in comparison to the additional translations found in the para-virtualized IO model. This leads to improved performance and throughput comparable to physical devices running without virtualization.

Discrete Device Assignment (DDA) and Single Root I/O Virtualization (SR-IOV) are two types of accelerated device models used in virtualization. DDA assigns an entire device to a VM and is mostly used in GPU assignment scenarios to provide VMs full access to the GPU’s capabilities for workloads such as AI training and inferencing. SR-IOV divides a single physical device’s resources into multiple virtual interfaces for different VMs. SR-IOV is typically used for network and storage IO devices, as it allows multiple virtual machines to share the same physical hardware resources most efficiently.
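As a concrete, platform-specific illustration of the SR-IOV idea, on a Linux host the virtual functions of a physical function are typically created by writing a count to the device’s sriov_numvfs attribute in sysfs. The sketch below is a generic Linux example with a hypothetical PCI address; it is not the Hyper-V/Azure Host mechanism described above.

```rust
use std::fs;
use std::path::Path;

/// Generic Linux illustration: create `num_vfs` SR-IOV virtual functions
/// for the physical function at the given PCI address (hypothetical address).
fn enable_sriov_vfs(pci_addr: &str, num_vfs: u32) -> std::io::Result<()> {
    let dev = Path::new("/sys/bus/pci/devices").join(pci_addr);

    // sriov_totalvfs reports how many VFs the hardware can expose.
    let total: u32 = fs::read_to_string(dev.join("sriov_totalvfs"))?
        .trim()
        .parse()
        .unwrap_or(0);
    assert!(num_vfs <= total, "device only supports {total} VFs");

    // Writing a count to sriov_numvfs asks the PF driver to spawn that many
    // VFs, each of which can then be assigned to a different VM.
    fs::write(dev.join("sriov_numvfs"), num_vfs.to_string())?;
    Ok(())
}

fn main() {
    // Hypothetical PCI address; requires root and SR-IOV capable hardware.
    if let Err(e) = enable_sriov_vfs("0000:3b:00.0", 4) {
        eprintln!("failed to enable VFs: {e}");
    }
}
```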

Some examples in the Azure fleet today include GPU acceleration via Discrete Device Assignment, Accelerated Networking via SR-IOV, and NVMe Direct VMs for storage.

In the next section, we will talk about OpenHCL, the next evolution of device IO virtualization.

OpenHCL: A privileged guest compatibility layer

Building on the advancements of the accelerated model, we introduced OpenHCL, a new virtualization layer that can transparently provide guest VMs with facilities such as accelerated IO and other security features. This lightweight virtualization environment runs with elevated privileges within the guest virtual machine, isolated from the guest operating system. Instead of sharing para-virtualized components exposed by host interfaces, each VM runs its own virtualization instance, which enhances security isolation and efficiency. As we’ll discuss below, OpenHCL is essential for Azure Boost guest compatibility scenarios, in which VMs require the appropriate drivers and orchestration to leverage performance enhancements from Boost’s NVMe storage and MANA network accelerated devices.

This environment consists of two main components: a minimal Linux kernel and a Rust-based VMM that provides device emulation and I/O translation. This layer equips VMs with the necessary software and drivers to light up functionality such as SR-IOV device assignment for Azure Boost network and storage optimized accelerators without needing any change in the guest OS. This is hugely beneficial to our customers, who can now use the same VM image while getting the benefits of Azure Boost, continuing to show how much Microsoft invests in application compatibility.

Figure 2: Guest compatibility virtualization software supports Azure Boost hardware acceleration & virtualized TPM.

To do this, we leverage Virtual Secure Mode (VSM) technology, a set of Hyper-V capabilities that enable new security boundaries (or “virtual trust levels”) within a VM context. By creating a new isolated Virtual Trust Level (VTL2) within the guest environment, we establish a higher-privilege execution environment that can transparently host code in the VM. This allows us to run privileged security functionality like a virtual TPM for Trusted Launch VMs and a paravisor for Azure Confidential VMs (we’ll cover these topics in later sections). Within this layer, we can also run device virtualization facilities that enlighten VMs to communicate with Azure Boost hardware.

The VSM isolation model and reduced data path from VM to device add protective measures by providing more robust multi-tenant isolation and reducing the Trusted Computing Base (TCB) on the Azure Host. By confining the virtualization stack to the tenant’s VM and reducing dependencies on the Host for IO operations, we can eliminate shared host components, which narrows the potential attack surface and enhances security. Shifting the architecture from the host providing a para-virtualized interface to each VM instance running its own virtualization layer additionally allows for greater performance isolation and efficiency. To reiterate, since it’s so important to the OpenHCL architecture: each VM receives its own para-virtualized layer and doesn’t share anything with the Host or other VMs. This isolation hugely improves the customer VM experience.

Zooming into the components that make up this layer, the VTL2 environment consists of a newly written, Rust-based virtualization stack running on a minimal Linux kernel that provides device emulation and I/O translation. Rust has emerged as one of the leading memory-safe systems programming languages. Rust’s memory safety and type system features help prevent common vulnerabilities like buffer overflows and dangling pointers, and its concurrency model enhances security in multi-threaded environments by preventing data races. These robust security benefits make Rust especially advantageous for sensitive workloads. Together these components make up the para-virtualized VTL2 environment that underpins some of Azure’s key technologies.
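To illustrate why this matters for a virtualization stack that touches guest state from many threads, consider the following small, self-contained example. Rust refuses to compile code that mutates shared data without synchronization, so the sketch shares a purely illustrative counter behind Arc<Mutex<...>>; the scenario (counting intercepted guest exits) is invented for illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Illustrative shared state, e.g. a count of intercepted guest exits.
    // Without the Mutex, concurrent mutation would be a data race, and
    // Rust's Send/Sync rules would reject the program at compile time.
    let exits = Arc::new(Mutex::new(0u64));

    let workers: Vec<_> = (0..4)
        .map(|_| {
            let exits = Arc::clone(&exits);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    *exits.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for w in workers {
        w.join().unwrap();
    }

    // Always prints 4000: the type system guarantees exclusive access on mutation.
    println!("total intercepted exits: {}", exits.lock().unwrap());
}
```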

In the next section we’ll describe some of the uses of this technology in Azure Boost, Trusted Launch VMs, and Azure Confidential VMs. This virtualization environment was first introduced with Trusted Launch VMs and was later extended to introduce additional capabilities around I/O compatibility and paravisor support for Azure Boost and Azure Confidential VMs respectively.

OpenHCL in Azure Boost

As mentioned, Azure Boost is Microsoft’s hardware acceleration solution that offers industry-leading network and storage optimization via the Microsoft Azure Network Adapter (MANA) and NVMe storage, by offloading networking and storage operations onto specialized FPGA hardware and software.

Offloading networking and storage tasks onto dedicated Azure Boost hardware frees up CPU for guest VMs and eliminates I/O virtualization bottlenecks. The result is a network capable of 200 Gbps bandwidth via Microsoft’s next-generation network interface, the Microsoft Azure Network Adapter (MANA); local storage operations reaching 17.3 GBps with 3.8 million IOPS; and remote storage operations reaching 12.5 GBps throughput with 650K IOPS. Enhancing Azure’s infrastructure by isolating it from hypervisor and host resources boosts performance while reducing latency and jitter.

Using the OpenHCL para-virtualized layer, VMs receive the necessary MANA and NVMe drivers and virtual functions to bootstrap accelerated IO connections. As a result, the guest can begin direct communication with the specialized Azure Boost hardware. For enlightened guest VMs, which come with pre-installed drivers and VMBus support to communicate with the accelerated hardware, the model sets up the initial communication and reduces latency and downtime for networking and storage devices, since it allows guest VMs to fall back to the software networking path if the accelerated path is disconnected.

For unenlightened guest VMs that come with only the default inbox virtualization drivers, OpenHCL transparently provides the necessary drivers to enable these guests to communicate with the new accelerated hardware without the need to install new images or update the operating system. This allows existing VM types to get the power of Azure Boost with no changes to their images. To achieve optimal performance, we recommend adding the appropriate drivers to the VM image.
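Conceptually, the compatibility path boils down to translation: VTL2 accepts the same para-virtualized request the guest’s inbox driver already knows how to issue and re-expresses it as a command for the accelerated device. The sketch below is a deliberately simplified, hypothetical model of that idea; the types and fields are invented for illustration and do not reflect OpenHCL’s actual interfaces.

```rust
/// What an unenlightened guest's inbox storage driver would submit
/// (illustrative shape only).
struct ParavirtReadRequest {
    sector: u64,
    sector_count: u32,
}

/// A rough stand-in for an NVMe read submission (not a full NVMe command).
#[derive(Debug)]
struct NvmeReadCommand {
    namespace_id: u32,
    start_lba: u64,
    block_count: u16,
}

/// The kind of translation a VTL2 compatibility layer performs on the
/// guest's behalf before submitting to the accelerated device.
fn translate(req: &ParavirtReadRequest) -> NvmeReadCommand {
    NvmeReadCommand {
        namespace_id: 1,
        start_lba: req.sector,
        // NVMe encodes "number of logical blocks" as a zero-based field.
        block_count: (req.sector_count - 1) as u16,
    }
}

fn main() {
    let guest_request = ParavirtReadRequest { sector: 2048, sector_count: 8 };
    let nvme_cmd = translate(&guest_request);
    println!("would submit to Boost NVMe device: {:?}", nvme_cmd);
}
```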

Azure Boost VM SKUs are available today in preview across a variety of VM series to optimize for the demands of varying workloads. To learn more, see Overview of Azure Boost | Microsoft Learn.

OpenHCL in Trusted Launch VMs

The OpenHCL virtualization layer also helped launch Trusted Launch for Azure Generation 2 VMs. Trusted Launch VMs introduced a virtual Trusted Platform Module (vTPM) and secure boot with guest attestation. Secure Boot establishes a “root of trust” and verifies that only VMs with properly signed OS code can boot, preventing rootkits and bootkits from infecting the OS boot process with malware. A virtual Trusted Platform Module is virtualized hardware that serves as a dedicated storage vault for keys and measurements. The vTPM measures and seals the VM’s entire boot chain (UEFI, OS, system, drivers), which allows the guest VM to perform remote guest attestation. Everything from the firmware through the OS drivers is “measured” and chained to a hardware root of trust. The VM can then establish trust with a third party by cryptographically “attesting”, or proving, its boot integrity and compliance.
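The mechanism behind “measuring and chaining” is a hash chain: each boot component’s digest is folded into a register as new = SHA-256(old || digest), so the final value depends on every component and its order. Below is a minimal sketch of that extend operation (it uses the sha2 crate and is not the vTPM’s actual implementation).

```rust
// Minimal illustration of the TPM-style "extend" hash chain used in measured
// boot; requires the `sha2` crate. This is not the vTPM's actual code.
use sha2::{Digest, Sha256};

/// Fold a new measurement into the running register value:
/// new_pcr = SHA-256(old_pcr || SHA-256(measurement))
fn extend(pcr: &[u8; 32], measurement: &[u8]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(pcr);
    hasher.update(Sha256::digest(measurement));
    hasher.finalize().into()
}

fn main() {
    // Start from an all-zero register, as a real PCR does at reset.
    let mut pcr = [0u8; 32];

    // Each stage of the boot chain is measured before it runs; changing any
    // stage (or its order) yields a completely different final value.
    for stage in ["UEFI firmware", "boot loader", "OS kernel", "drivers"] {
        pcr = extend(&pcr, stage.as_bytes());
    }

    println!("final measurement: {}", hex(&pcr));
}

/// Tiny hex helper to keep the example dependency-light.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}
```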

Leveraging the privileged OpenHCL VTL2 layer allows us to run a virtualized TPM and execute remote attestation processes directly from within the guest VM. A virtual TPM cannot run and perform attestation at the same privilege level as the rest of the guest operating system, because it stores and persists secrets.

OpenHCL in Azure confidential VMs

Confidential VMs provide enhanced security features that allow customers to protect their most sensitive data in use by performing computation inside a hardware-based, attested Trusted Execution Environment (TEE). The Trusted Execution Environment is a secure, isolated environment that prevents unauthorized access or modification of applications and data while in use. This increases the security level of organizations that manage sensitive and regulated data.

Azure confidential VMs use the concept of a paravisor to implement enlightenments on behalf of the guest OS so that the guest OS can run mostly unmodified inside a CVM across various hardware providers. With the paravisor, the guest OS does not need to be fully enlightened to run confidentially in Azure, meaning that we can support older OS versions on Azure confidential VMs. Without this paravisor, Azure confidential VM support would be limited to specific OS versions with the necessary features. This allows for easier “lift and shift” of sensitive workloads.

OpenHCL is an implementation of the paravisor for Confidential VMs in Azure that will soon be available in the Azure fleet. Similar to the virtualized TPM on Trusted Launch VMs, the OpenHCL VTL2 partition is used to securely host the guest paravisor firmware layer for confidential VM support. As mentioned above, OpenHCL effectively allows guest VMs to run as confidential VMs in Azure, adding support across a wide variety of guest OSes and confidential hardware providers.

To learn more about Confidential VMs, visit our other blog: Confidential VMs on Azure - Microsoft Community Hub.

Learn more

In this blog, we've explored the evolution of Azure's virtualization architecture, which helps power industry-leading technologies like Azure Boost, Trusted Launch VMs, and Azure confidential VMs. We've outlined key benefits of this model across hardware acceleration, security isolation, performance, and seamless feature compatibility. As you read through the blog and the links within, if you have any questions, please feel free to comment below.

Source: https://techcommunity.microsoft.com/t5/windows-os-platform-blog/openhcl-evolving-azure-s-virtualization-model/ba-p/4248345
