AVArcher
Virtual Infrastructure Specialists

 

VMware

VMware RSS Feed


09/08/2010 09:11 PM
HPC Application Performance on ESX 4.1: Stream

Recently VMware has seen increased interest in migrating High Performance Computing (HPC) applications to virtualized environments. This is due to the many advantages virtualization brings to HPC, including consolidation, support for heterogeneous OSes, ease of application development, security, job migration, and cloud computing. Currently only a subset of HPC applications virtualize well from a performance perspective. Our long-term goal is to extend this to all HPC apps, recognizing that large-scale apps with the lowest latency and highest bandwidth requirements will be the most challenging. Users who run HPC apps are traditionally very sensitive to performance overhead, so it is important to quantify the performance cost of virtualization and properly weigh it against the advantages.

Compared to commercial apps (databases, web servers, and so on), which are VMware’s bread and butter, HPC apps place their own set of requirements on the platform (OS/hypervisor/hardware) in order to execute well. Two common ones are low-latency networking (since a single app is often spread across a cluster of machines) and high memory bandwidth. This article is the first in a series that will explore these and other HPC performance subjects. Our goal will always be to determine what works, what doesn’t, and how to get more of the former.

The benchmark reported on here is Stream, a standard tool designed to measure memory bandwidth. It is a “worst case” micro-benchmark; real applications will not achieve higher memory bandwidth.

Configuration

All tests were performed on an HP DL380 with two Intel X5570 processors, 48 GB memory (12 × 4 GB DIMMs), and four 1-GbE NICs (Intel Pro/1000 PT Quad Port Server Adapter) connected to a switch. Guest and native OS is RHEL 5.5 x86_64. Hyper-threading is enabled in the BIOS, so 16 logical processors are available. Processors and memory are split between two NUMA nodes. A pre-GA lab version of ESX 4.1 was used, build 254859.

Test Results

The OpenMP version of Stream is used. It is built using a compiler switch as follows:

gcc -O2 -fopenmp stream.c -o stream

The number of simultaneous threads is controlled by an environment variable:

export OMP_NUM_THREADS=8

The array size (N) and number of iterations (NTIMES) are hard-wired in the code as N=10^8 (for a single machine) and NTIMES=40. The large array size ensures that the processor cache provides little or no benefit. Stream reports maximum memory bandwidth performance in MB/sec for four tests: copy, scale, add, and triad. Here M stands for 1 million (10^6), not 2^20. Here are the native results, as a function of the number of threads:

Table 1. Native memory bandwidth, MB/s

Threads        1        2        4        8       16
Copy        6388    12163    20473    26957    26312
Scale       5231    10068    17208    25932    26530
Add         7070    13274    21481    29081    29622
Triad       6617    12505    21058    29328    29889
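
The scaling in Table 1 can be put into numbers with a short calculation. The sketch below is not part of the original test harness; it takes the Triad row of Table 1 and computes speedup and parallel efficiency relative to the single-thread run. The function name `scaling` is chosen here for illustration only.

```python
# Triad bandwidth from Table 1 (native, MB/s), keyed by thread count.
triad_native = {1: 6617, 2: 12505, 4: 21058, 8: 29328, 16: 29889}


def scaling(results):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    base = results[1]
    return {t: (bw / base, bw / (base * t)) for t, bw in results.items()}


table = scaling(triad_native)
for threads, (speedup, efficiency) in sorted(table.items()):
    print(f"{threads:2d} threads: speedup {speedup:4.2f}x, "
          f"efficiency {efficiency:5.1%}")
```

Efficiency here is simply speedup divided by thread count, so a value well below 100% indicates that additional threads are waiting on memory rather than adding bandwidth.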

Note that the scaling starts to fall off after two threads and the memory links are essentially saturated at 8 threads. This is one reason why HPC apps often do not see much benefit from enabling Hyper-Threading.

To achieve the maximum aggregate memory bandwidth in a virtualized environment, two virtual machines (VMs) with 8 vCPUs each were used. This is appropriate only for modeling apps that can be split across multiple machines. One instance of Stream with N=5×10^7 was run in each VM simultaneously, so the total amount of memory accessed was the same as in the native test. The advanced configuration option preferHT=1 was used (see below). Bandwidths reported by the VMs are summed to get the total. The results are shown in Table 2: the bandwidth is slightly greater than in the corresponding native case.
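
As a quick check that the two-VM runs touch the same amount of memory as the native run, the array footprints can be computed directly. This is a minimal sketch that assumes Stream's usual layout of three arrays of 8-byte double-precision elements; the article itself does not state the element type, and the names below are illustrative.

```python
BYTES_PER_ELEMENT = 8   # assumption: arrays hold 8-byte doubles
ARRAYS = 3              # assumption: Stream's three working arrays (a, b, c)
MB = 10**6              # the article uses M = 1 million, not 2**20


def footprint_bytes(n):
    """Total bytes occupied by the benchmark arrays for array size n."""
    return ARRAYS * n * BYTES_PER_ELEMENT


native = footprint_bytes(10**8)        # single native instance
per_vm = footprint_bytes(5 * 10**7)    # one instance in each of two VMs

print(f"native:  {native / MB:.0f} MB")
print(f"per VM:  {per_vm / MB:.0f} MB")
print(f"two VMs: {2 * per_vm / MB:.0f} MB")
```

Under these assumptions the footprint is far larger than any processor cache, which is consistent with the statement that the cache provides little or no benefit.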

Table 2. Virtualized total memory bandwidth, MB/s, 2 VMs, preferHT=1

Total threads        2        4        8       16
Copy             12535    22526    27606    27104
Scale            10294    18824    26781    26537
Add              13578    24182    30676    30537
Triad            13070    23476    30449    30010
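
To compare Table 2 against Table 1 column by column, the ratio of virtualized to native bandwidth can be computed for each kernel. The sketch below only re-uses the published numbers; the dictionary layout and variable names are illustrative choices, and the second row of each table is keyed as the Scale kernel.

```python
THREADS = (2, 4, 8, 16)

# Native rows from Table 1 (the 1-thread column is omitted) and the
# two-VM totals from Table 2, all in MB/s.
native = {
    "Copy":  (12163, 20473, 26957, 26312),
    "Scale": (10068, 17208, 25932, 26530),
    "Add":   (13274, 21481, 29081, 29622),
    "Triad": (12505, 21058, 29328, 29889),
}
virtual = {
    "Copy":  (12535, 22526, 27606, 27104),
    "Scale": (10294, 18824, 26781, 26537),
    "Add":   (13578, 24182, 30676, 30537),
    "Triad": (13070, 23476, 30449, 30010),
}

# Ratio > 1.0 means the two VMs together delivered more than native.
ratios = {
    kernel: [v / n for v, n in zip(virtual[kernel], native[kernel])]
    for kernel in native
}

print("Kernel " + "".join(f"{t:>8d}" for t in THREADS))
for kernel, row in ratios.items():
    print(f"{kernel:<7}" + "".join(f"{r:8.3f}" for r in row))
```

On these figures the ratio is at least 1.0 in every cell. The gap is largest at 4 total threads (roughly 9 to 13 percent) and under 6 percent elsewhere.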

It is apparent that the Linux “first-touch” memory placement policy, together with the simplicity of the Stream algorithm, is enough to ensure that nearly all memory accesses in the native tests are “local” (that is, the processor each thread runs on and the memory it accesses both belong to the same NUMA node). In ESX 4.1, NUMA information is not passed to the guest OS and (by default) 8-vCPU VMs are scheduled across NUMA nodes in order to take advantage of more physical cores. This means that about half of the memory accesses will be “remote”, and that in the default configuration one or two VMs must produce significantly less bandwidth than the native tests.

Setting preferHT=1 tells the ESX scheduler to count logical processors (hardware threads) instead of cores when determining whether a given VM can fit on a NUMA node. In this case that forces both the memory and the CPUs of an 8-vCPU VM to be scheduled on a single NUMA node. This guarantees that all memory accesses are local, so the aggregate bandwidth of two VMs can equal or exceed native bandwidth.

Note that a single VM cannot match this bandwidth. It will get either half of it (because it is using the resources of only one NUMA node) or about 70% (because half the memory accesses are remote). In both native and virtual environments, the maximum bandwidth of purely remote memory accesses is about half that of purely local accesses. On machines with more NUMA nodes, remote memory bandwidth may be lower and the importance of memory locality even greater.
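
The “about 70%” figure is roughly what a simple mixing model predicts. The sketch below is an illustration rather than a description of the hardware: it assumes local and remote accesses do not overlap in time, so the time per byte is the weighted sum of the local and remote per-byte times. The function and variable names are hypothetical.

```python
def effective_bandwidth(local_bw, remote_bw, remote_fraction):
    """Bandwidth seen when remote_fraction of the bytes come from remote memory.

    Assumes accesses are serialized, so per-byte times add up
    (a weighted harmonic mean of the two bandwidths).
    """
    local_fraction = 1.0 - remote_fraction
    time_per_byte = local_fraction / local_bw + remote_fraction / remote_bw
    return 1.0 / time_per_byte


local = 1.0    # purely local bandwidth, normalized
remote = 0.5   # purely remote bandwidth is about half of local (from the text)

mixed = effective_bandwidth(local, remote, remote_fraction=0.5)
print(f"half remote accesses: {mixed:.0%} of local bandwidth")
```

With half of the accesses remote, this model gives about two thirds of the purely local bandwidth, in line with the measured figure. Real memory controllers overlap local and remote traffic, so the model should be read only as a rough estimate.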

Summary

In both native and virtualized environments, equivalent maximum memory bandwidth can be achieved as long as the application is written or configured to use only local memory. For native environments this means relying on the Linux “first-touch” memory placement policy (for simple apps) or implementing explicit mechanisms in the code (usually difficult if the code wasn’t designed for NUMA). For virtual environments a different mindset is needed: the application needs to be able to run across multiple machines, with each VM sized to fit on a NUMA node. On machines with hyper-threading enabled, preferHT=1 needs to be set for the larger VMs.

If these requirements can be met, then a valuable feature of virtualization is that the app needs no NUMA awareness at all; NUMA scheduling is taken care of by the hypervisor (for all apps, not just for those where Linux is able to align threads and memory on the same NUMA node). For apps where these requirements can’t be met (ones that need a large single-instance OS), current development focus is on relaxing these requirements so they are more like native, while retaining the above advantage for small VMs.
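
The sizing rule above can be illustrated with a small check. This is a hypothetical sketch of the counting rule as the article describes it, not the ESX scheduler's actual logic; `NumaNode`, `fits_on_node`, and `prefer_ht` are names invented for the example. The node shape (4 cores, 2 hardware threads per core) follows from the test machine's two processors and 16 logical processors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumaNode:
    cores: int
    threads_per_core: int


def fits_on_node(vcpus, node, prefer_ht=False):
    """Return True if a VM with this many vCPUs is counted as fitting on one node.

    Without prefer_ht only physical cores are counted; with prefer_ht
    logical processors (hardware threads) are counted instead.
    """
    capacity = node.cores * node.threads_per_core if prefer_ht else node.cores
    return vcpus <= capacity


node = NumaNode(cores=4, threads_per_core=2)  # one socket of the test machine

for vcpus in (4, 8):
    print(f"{vcpus} vCPUs: default={fits_on_node(vcpus, node)}, "
          f"preferHT={fits_on_node(vcpus, node, prefer_ht=True)}")
```

Under this rule an 8-vCPU VM is counted as fitting on a single node only when hardware threads are counted, which is why the setting matters for the larger VMs.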

       Download VMware Products  | Privacy  | Update Feed Preferences 
        Copyright © 2010 VMware, Inc. All rights reserved.

09/08/2010 09:00 AM
Redirecting a USB boot device from View Client might make the system unresponsive or unusable (1021409)

09/08/2010 09:00 AM
Sharing a composite USB device can shut down the guest operating system (1021413)

09/08/2010 09:00 AM
Events might not be recorded in the event database if the View Connection Server service crashes (1021461)

09/08/2010 09:00 AM
Configuring cipher suites and security protocols on a View Connection Server instance or security server (1021466)

09/08/2010 09:00 AM
View Agent 4.5 blocks BlackBerry devices from being redirected (1021545)

09/08/2010 09:00 AM
Determining whether a local desktop is checked out (1022342)

09/08/2010 09:00 AM
Data Collection Tool fails with a runtime error when run in the Japanese locale (1025685)

If you run Data Collection Tool for View Client in the Japanese locale, View Client might fail with the following error message:   vdm-support.vbs(1384,3)


09/08/2010 09:00 AM
Installing a Broadcom 2045 Bluetooth 2.0 USB device driver crashes a client system (1025715)

If you use View Client to share a local Bluetooth device with a desktop session, and attempt to use the Found New Hardware wizard to install the Broadcom


09/08/2010 09:00 AM
Changing the maximum number of event records that View Administrator can display (1026196)

To improve performance, View Administrator displays only the most recent 2000 events from the event and event_data tables. You can change this limit by


09/08/2010 09:00 AM
USB redirection over RGS fails for HP thin clients running Windows XPe (1026375)

If you attempt to share a USB device from an HP thin client running Windows XP Embedded to a remote system over the HP Remote Graphics Software (RGS) connection


09/08/2010 09:00 AM
Changed behavior of debug and full level logging in View 4.5 (1026428)

For View versions earlier than 4.5, if you turn off debug or full level logging, View Manager continues to write to the debug log files, but the content is the same as that of the


09/08/2010 09:00 AM
The user_events view does not retrieve desktop reconnection events (1026480)

The user_events view for the event database does not retrieve desktop reconnection events, which can cause the vdmadmin -I -report events -view user_events command or the


09/08/2010 09:00 AM
Safari 5.0.1 does not run on a Windows XP guest with PCoIP Smartcard authentication enabled in View Agent (1026696)

09/08/2010 09:00 AM
Redirecting logs to a RAM drive on a diskless thin client (1026741)


© 2002-2010 AVArcher