Using vCenter Operations Manager to help in performance testing

I’ve recently been involved in performance testing a SharePoint 2013 farm. Along the way I discovered a few things about what you can use vCenter Operations Manager (vCops) for during performance testing, including which metrics to look at.

Setup

The setup we used for testing was what Microsoft calls a Medium Farm topology. It consists of 2 web frontends, 2 application servers for search and index, and 1 database server. In front of the web frontends we placed a Citrix NetScaler for load balancing, SSL offload and such. Each server runs Windows 2012, SharePoint is 2013 and Microsoft SQL Server is 2012. The SharePoint front page has a few web parts that were built specifically for this site.

Initial configuration:

  • Web servers: 16 vCPUs and 16 GB RAM each
  • Application servers: 8 vCPUs and 8 GB RAM each
  • Database server: 8 vCPUs and 16 GB RAM

All of these ran on a single vSphere 5.0 host with 4 CPUs of 8 cores each, giving us 32 cores, and 196 GB of RAM. So at this point we were overcommitting CPU somewhat, with 56 vCPUs provisioned. This was not the final destination host for these servers; on the final hosts we wouldn’t overcommit. However, we were certain that any bottleneck we saw here would also be present at the farm’s final destination.
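To make the overcommit numbers concrete, here is a small sketch of the arithmetic. The dictionary layout and variable names are just my own illustration; the counts and sizes are the ones listed above.

```python
# Farm layout as described in the post; the key names are illustrative only.
farm = {
    "web": {"count": 2, "vcpus": 16},
    "app": {"count": 2, "vcpus": 8},
    "db": {"count": 1, "vcpus": 8},
}

host_sockets = 4
cores_per_socket = 8

# Physical cores on the host (hyper-threading is not counted here).
physical_cores = host_sockets * cores_per_socket

# Total vCPUs provisioned across all VMs in the farm.
provisioned_vcpus = sum(vm["count"] * vm["vcpus"] for vm in farm.values())

# vCPUs provisioned per physical core.
overcommit_ratio = provisioned_vcpus / physical_cores

print(f"{provisioned_vcpus} vCPUs on {physical_cores} cores "
      f"= {overcommit_ratio:.2f}:1 overcommit")
```

With the initial configuration that works out to 56 vCPUs on 32 cores, or 1.75 vCPUs per physical core.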

One of the requirements was that the front page should load in less than 2 seconds with up to 2000 users in 5 minutes. The tests were run from Visual Studio and simulated users viewing the front page, loading all the JavaScript, stylesheets and so on. To make sure vCops would get at least one full 5-minute interval under full load, we ran each test for a full 15 minutes.

Inside Windows we could see the CPU maxing out, to the point where system processes were complaining, and we saw load times of over 30 seconds for the front page.
The first thing that happened was that someone looked inside Windows and said more CPU was needed. I, however, was looking at vCenter performance and only seeing a max of roughly 40% CPU usage. When looking at vCops I saw Usage maxing out at 35.63% but a Demand of 54.52%.

Demand - Usage 20 February

I interpreted that result as the VM requesting more CPU than VMware was giving it. However, it was decided to give the web servers 8 more vCPUs instead of scaling them down to 8 vCPUs.

Running the test again with 24 vCPU web servers yielded the exact same result: load times of more than 30 seconds and CPU inside Windows maxing out. Looking at vCops we saw this:

Demand - Usage 21 February

An even lower % Usage and Demand. At this point I was thinking we were overcommitting too much, so I went back into vCops and pulled out the %READY counters as well, for both Feb 20 and 21.

Demand - Usage - Ready 20 February

Demand - Usage - Ready 21 February

What that told us was that at 16 vCPUs %READY maxed out at around 14%, which is pretty bad. Adding 8 more vCPUs made that jump to 28.75%, meaning that close to 1/3 of the time the machine was ready to run but couldn’t get access to a CPU on the host. One funny thing about the first picture is that you can tell when the 8 vCPUs were added, since that moved the “idle” %READY from around 6% to 14.87%.

The following Monday we scaled the machines down from 24 to 8 vCPUs and ran the test again. Load times were still around 30 seconds, so we didn’t really solve the problem, but looking at vCops we saw a completely different picture:

Demand - Usage - Ready 24 February

The graph shows %READY dropping from roughly 12% to around 1%, and the funny thing here is that while we’re running the test, %READY drops even further.

At this point we decided to give Michael Monberg from VMware a call to ask what to look for in vCops. He showed us this neat trick: when you’re looking at performance metrics for a given VM and have some graphs from the VM showing, like the ones above, you can do this:

Click on the + sign near the health tree

health tree

It expands and shows you:

click server

The host, the VM and the datastore. If you then single-click the host, you get a new Metric Selector:

metric selector

This one belongs to the host, so by expanding CPU Usage I could select the Core Utilization metric and add it to the graph page. That gave us graphs from the VM and the host at the same time. Here are the graphs from Feb 20, 21 and 24 with the new metric added:

Feb 20, 16 vCPUs

Core util 20 Feb

Feb 21, 24 vCPUs

Core util 21 Feb

Feb 24, 8 vCPUs

Core util 24 Feb

From that we could tell that with 16 and 24 vCPUs the host was totally maxed out on its physical cores, whereas with 8 vCPUs we only used around 66%. So while the VM metric showed only 35% CPU usage, the host was maxed out, and thus adding the 8 extra vCPUs had no positive effect on the VM.

When we ran the test with 8 vCPUs we got the exact same load times, but we were no longer crossing NUMA nodes, which is what Microsoft recommends. See my blog post on NUMA and vNUMA.

This blog post was written to show some of the nice things you can do with vCenter Operations, and I hope you found it useful.

NUMA and vNUMA

 

*disclaimer* This post is based on my findings so you might experience different stats, but the general guidelines are sound to my knowledge *disclaimer*

NUMA, or Non-Uniform Memory Access, has been around in Intel processors since 2008, when they introduced their Nehalem processors.

NUMA and vNUMA

Since vSphere 5.0, VMware has supported exposing the NUMA topology of the underlying processors to the guest OS (vNUMA). It’s enabled automatically if you create a VM with more than 8 vCPUs; the only requirement is that your VM is at hardware version 8. You can manually edit the advanced settings so that the NUMA topology is exposed even with a lower number of vCPUs. However, it is strongly recommended that you clearly understand how that impacts your VM before you do that.

2x8 numa sockets and core

CPU hot-add and vNUMA

One thing you have to be aware of when deploying machines that cross NUMA boundaries is that if you enable CPU hot-add on the guest, VMware hides the NUMA information from the guest OS. That means the guest OS can no longer place applications and their memory on the same NUMA node in a smart way. So for performance-intensive systems I would recommend that you turn off the CPU hot-add feature. Hot-add memory, however, still works.

The images show Coreinfo run on a machine with 4 CPUs of 4 cores each, before and after CPU hot-add was enabled.

Coreinfo 4x4 without hotplug

4x4 with hot-add CPU

Coreinfo 4x4 with hotplug
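The rules from the last two sections can be summed up in a few lines. This is only a sketch of the rule of thumb as I’ve described it, not VMware’s actual logic; the function name and parameters are made up for illustration, and manual overrides via advanced settings are not modelled.

```python
def vnuma_exposed(vcpus: int, hardware_version: int, cpu_hot_add: bool,
                  min_vcpus: int = 9) -> bool:
    """Rule of thumb from this post for whether the guest sees a NUMA topology.

    min_vcpus=9 encodes "more than 8 vCPUs"; treat it as an assumption.
    """
    # Enabling CPU hot-add hides the NUMA information from the guest.
    if cpu_hot_add:
        return False
    # Otherwise the VM needs hardware version 8 and enough vCPUs.
    return hardware_version >= 8 and vcpus >= min_vcpus


# A 16 vCPU VM at hardware version 8, with and without CPU hot-add.
print(vnuma_exposed(16, 8, cpu_hot_add=False))
print(vnuma_exposed(16, 8, cpu_hot_add=True))
```

The point of writing it down like this is that CPU hot-add trumps everything else: it doesn’t matter how many vCPUs the VM has if hot-add is turned on.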

Crossing a NUMA boundary

NUMA is a memory thing, but you can cross a NUMA boundary in more than one way. If you have a 4-way machine with 8-core CPUs and 128 GB of memory, that gives a NUMA boundary at 8 vCPUs and 32 GB of RAM. If you create a VM with more than 8 vCPUs OR more than 32 GB of RAM, that VM might have to access memory attached to another CPU.
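As a rough sketch, the boundary check looks like this. It assumes one NUMA node per physical socket and memory spread evenly across the sockets, and it ignores hypervisor overhead, so the real usable size of a node will be a bit smaller. The class and function names are just for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    sockets: int
    cores_per_socket: int
    memory_gb: int

    @property
    def node_cores(self) -> int:
        # Assumption: one NUMA node per physical socket.
        return self.cores_per_socket

    @property
    def node_memory_gb(self) -> float:
        # Assumption: memory is distributed evenly across the sockets.
        return self.memory_gb / self.sockets


def fits_in_one_node(host: Host, vcpus: int, vm_memory_gb: float) -> bool:
    """True if the VM stays within a single NUMA node on BOTH CPU and memory."""
    return vcpus <= host.node_cores and vm_memory_gb <= host.node_memory_gb


host = Host(sockets=4, cores_per_socket=8, memory_gb=128)
print(host.node_cores, host.node_memory_gb)
print(fits_in_one_node(host, vcpus=8, vm_memory_gb=32))
print(fits_in_one_node(host, vcpus=8, vm_memory_gb=64))
print(fits_in_one_node(host, vcpus=16, vm_memory_gb=16))
```

Note that either limit on its own is enough to push the VM across the boundary, which is why a small-vCPU VM with a lot of memory can still end up accessing remote memory.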

If you cross a NUMA boundary there is a penalty, which can be shown in part by Coreinfo.

numa penalty

This example is a 4-way guest with 8 cores on each socket for a total of 32 cores, but it runs on a 2×10 core system, so 12 of the cores run on hyper-threads. It shows the penalty you incur by accessing memory that’s located on another node, and Coreinfo has deemed local memory access to be the fastest. I have, however, seen some systems where Node 0 to Node 0 was slower than Node 1 to Node 1.

In normal environments that might not really be a problem, as access to storage or networking is way slower than accessing memory on another CPU. But in high-performance clusters this is something you should consider when building your physical infrastructure. I’ve spoken to one customer who had seen up to a 35% degradation in overall performance from crossing a NUMA boundary, and because of that they only recommend VMs that fit into 1 physical CPU and its memory.

Understand your hardware configuration

One thing you really need to pay attention to is your hardware configuration. Setting the wrong Socket and Core configs in VMware compared to your physical hardware configuration can decrease performance a lot. Mark Achtemichuk has a nice blogpost which shows how different performance can be by selecting various settings in vSphere for Number of virtual sockets and Number of cores per socket.

On top of that, your hardware vendor might not be aware that the config they’re selling you crosses NUMA boundaries. For example, I’ve bought 2-CPU blades from Dell that were built on 4-way motherboards so I could add more CPUs later. But to keep the cost down, Dell used 4 GB memory modules and distributed them evenly across all the banks on the motherboard, meaning that half my memory can now only be accessed with a penalty. If memory performance is needed in your environment, the cost of using 8 GB memory modules might be worth it.

Microsoft and NUMA support

Since the Enterprise and Datacenter editions of Windows Server 2003, Windows Server has been NUMA-aware, meaning the OS can schedule its threads so that processes accessing the same areas of memory are put onto the same NUMA node. This helps reduce the penalty seen when crossing a NUMA boundary. Microsoft SQL Server has had NUMA support since 2005, but it seems that it wasn’t fully supported until 2008 R2. For SQL Server it means that each database engine is started on its own NUMA node. Even if there is only 1 database engine, it will attempt to start that engine on the second node. This is because the OS allocates its memory on the first node, so Node 0 has less memory available to other applications than the other nodes.

For SharePoint farms, Microsoft’s best practice actually advises you NOT to cross NUMA boundaries, but to adopt a scale-out approach instead. Of course that means selling more SharePoint licenses, but make sure you test how your performance is impacted when crossing a NUMA boundary on a SharePoint farm.

Relaxed Co-scheduling a clarification – Danish myth busted.

I know this is old news, but at a recent meeting at VMware, 4 different customers agreed that they had been taught something different about co-scheduling by instructors in Denmark.

The myth that all 4 of us believed in states that:

“An 8 vCPU machine can only get CPU time if 8 cores are available to the scheduler”, so if only 7 cores were available the machine would wait, which you could see from the CPU ready counters.

However, since ESX 3.x VMware has done something called “relaxed co-scheduling”, meaning that if a machine has 8 vCPUs but only runs a single-threaded application, it can run even if fewer than 8 cores are available to the scheduler.

See Duncan’s post about this from 2008!

We’ve done a load test of a SharePoint 2013 farm consisting of 2 front-end servers, each with 24 vCPUs, running on a 4-way server with 8 cores in each processor, on vSphere 5.0.

loadtest showing cpu ready


Had strict co-scheduling still existed, we should have seen CPU ready percentages above 50%; instead we got a peak at 33%. That, however, shows that we’re still overcommitting the host too much.
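Here is a back-of-the-envelope sketch of where the 50% comes from. It is a deliberately crude model of the myth, not of the real ESXi scheduler: it assumes every VM wants all of its vCPUs all of the time, that the scheduler shares time fairly, and that nothing else runs on the host. The function name is made up for illustration.

```python
def strict_coscheduling_min_ready(physical_cores: int, vcpus_per_vm: int,
                                  busy_vms: int) -> float:
    """Lower bound on the ready fraction if a VM could only run when ALL of
    its vCPUs can be placed on free cores at the same moment (the myth)."""
    # How many of the busy VMs fit on the host at the same time.
    concurrent = min(busy_vms, physical_cores // vcpus_per_vm)
    if concurrent == 0:
        raise ValueError("a single VM has more vCPUs than the host has cores")
    # With fair sharing, each VM runs concurrent/busy_vms of the time
    # and sits ready, waiting for cores, the rest of the time.
    return 1 - concurrent / busy_vms


# Two 24 vCPU front-ends on 32 cores: only one fits at a time.
print(strict_coscheduling_min_ready(32, 24, 2))
# Two 8 vCPU front-ends on 32 cores: both fit, no forced waiting.
print(strict_coscheduling_min_ready(32, 8, 2))
```

In this model the two 24 vCPU front-ends would each wait at least half the time, and the other VMs on the host would only push that number higher.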

First post.

Why another blog, you might ask?

Well, I don’t have any ambitions of becoming a top 25 blogger, gunning for fame and all that. But I do occasionally come across stuff that I think others might find interesting, and figured that I might as well write it down somewhere. So why not use a free blogging service for that?

So welcome to my blog. I hope to have some content over time, but I won’t promise stuff weekly or even monthly 🙂

Terkel