Using vCenter Operations Manager to help in performance testing

I’ve recently been involved in performance testing a SharePoint 2013 farm. This has led to some discoveries about what you can use vCenter Operations Manager (vCops) for when you do performance testing, including which metrics you should look at.

Setup

The setup we used for testing was what Microsoft calls a Medium Farm topology. It consists of 2 web frontends, 2 application servers for search and index, and 1 database server. In front of the web frontends we placed a Citrix NetScaler for load balancing, SSL offload and such. Each server runs Windows Server 2012, SharePoint is 2013 and Microsoft SQL Server is 2012. The SharePoint front page has a few custom-built web parts.

Initial configuration:

  • Webservers: 16 vCPUs, 16 GB RAM
  • Application servers: 8 vCPUs, 8 GB RAM
  • Database server: 8 vCPUs, 16 GB RAM

All of these ran on a single vSphere 5.0 host with 4 CPUs of 8 cores each, giving us 32 cores, and 196 GB of RAM. So at this point we were overcommitting on CPU somewhat, having 56 vCPUs provisioned. This host was not the final destination for these servers; on the final hosts we wouldn’t overcommit. However, we were certain that any bottleneck we saw here would also be present at the final destination for the farm.
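To make the overcommit arithmetic explicit, here is a minimal Python sketch. The VM names are hypothetical labels for the five servers; the vCPU counts and host layout are the ones listed above.

```python
# vCPU allocation per VM, taken from the initial configuration.
# The VM names are hypothetical and used only for illustration.
vcpus_per_vm = {
    "web-1": 16,
    "web-2": 16,
    "app-1": 8,
    "app-2": 8,
    "db-1": 8,
}

# Host layout: 4 physical CPUs with 8 cores each.
sockets = 4
cores_per_socket = 8

physical_cores = sockets * cores_per_socket
provisioned_vcpus = sum(vcpus_per_vm.values())
overcommit_ratio = provisioned_vcpus / physical_cores

print(f"{provisioned_vcpus} vCPUs on {physical_cores} cores "
      f"= {overcommit_ratio:.2f}:1 overcommit")
# prints "56 vCPUs on 32 cores = 1.75:1 overcommit"
```

Changing the two webserver entries to 24 shows how the ratio grows when more vCPUs are handed out on the same host.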

One of the requirements was that the front page should load in less than 2 seconds, with up to 2000 users in 5 minutes. The tests were run from Visual Studio and simulated users viewing the front page, loading all the JavaScript, stylesheets and so on. To make sure vCops got data covering a full load for its 5-minute intervals, we ran the test for a full 15 minutes.
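The reasoning behind the 15-minute run can be sketched in a few lines of Python. This assumes that vCops collects in fixed 5-minute intervals and that the test can start anywhere inside one of them; the function name and the offsets are made up for illustration.

```python
def full_intervals_covered(start_offset_s: int, duration_s: int,
                           interval_s: int = 300) -> int:
    """Count collection intervals that lie entirely inside the test run.

    start_offset_s is how many seconds after an interval boundary the
    test starts (0 means it starts exactly on a boundary).
    """
    # First interval boundary at or after the test start (integer ceiling).
    first_boundary = -(-start_offset_s // interval_s) * interval_s
    # Last interval boundary at or before the test end.
    last_boundary = ((start_offset_s + duration_s) // interval_s) * interval_s
    return max(0, (last_boundary - first_boundary) // interval_s)


# A 5-minute run starting 2 minutes into an interval fills no interval completely.
print(full_intervals_covered(start_offset_s=120, duration_s=5 * 60))   # prints 0
# A 15-minute run with the same start still fills two complete intervals.
print(full_intervals_covered(start_offset_s=120, duration_s=15 * 60))  # prints 2
```

Under that assumption, a run of 15 minutes always contains at least two intervals that saw nothing but full load, no matter when it starts.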

Inside Windows we could see the CPU maxing out, to the point where system processes were complaining, and we saw load times of over 30 seconds for the front page.
The first thing that happened was that someone looked inside Windows and said more CPU was needed. I, however, was looking at vCenter performance and only seeing a max of roughly 40% CPU usage. Looking at vCops, I saw Usage maxing out at 35.63% but a Demand of 54.52%.

Demand - Usage 20 February

I interpreted that result as the VM actually requesting more CPU than VMware was giving it. However, it was decided to try giving 8 more vCPUs to the webservers instead of scaling them down to 8 vCPUs.

Running the test again with 24 vCPU webservers yielded the exact same result: load times of more than 30 seconds and CPU inside Windows maxing out. Looking at vCops we saw this:

Demand - Usage 21 February

An even lower % Usage and Demand. My thinking was that we were now overcommitting too much, so going back into vCops I pulled out the %READY counters as well, for both Feb 20 and 21.

Demand - Usage - Ready 20 February

Demand - Usage - Ready 21 February

What that told us was that at 16 vCPUs %READY maxed out at around 14%, which is kinda bad. Adding 8 more vCPUs made that jump to 28.75%, meaning that close to 1/3 of the time the machine was ready to run but couldn’t get access to a CPU on the host. One funny thing about the first picture is that you can tell when the 8 vCPUs were added, since that moved the “idle” %READY from around 6% to 14.87%.
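To get a feel for what those percentages mean in wall-clock terms, here is a small Python sketch that turns a %READY value into waiting time over the 15-minute test run. It assumes the peak value held for the whole run, which it did not, so treat the result as an upper bound. The function name is made up for illustration.

```python
def ready_wait_seconds(ready_pct: float, window_seconds: float) -> float:
    """Seconds spent ready-but-waiting if ready_pct held for the whole window."""
    return ready_pct * window_seconds / 100.0


TEST_RUN_SECONDS = 15 * 60  # the test ran for a full 15 minutes

# Peak %READY values read from the graphs (14% is the rounded figure).
for label, pct in [("16 vCPUs", 14.0), ("24 vCPUs", 28.75)]:
    waited = ready_wait_seconds(pct, TEST_RUN_SECONDS)
    print(f"{label}: up to {waited:.0f} s of {TEST_RUN_SECONDS} s waiting for a physical CPU")
```

At 28.75% that is more than four minutes of a 15-minute run in which the VM wanted to run but was not scheduled.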

The following Monday we scaled the machines down from 24 to 8 vCPUs and ran the test again. Load times were still at around 30 seconds, so we didn’t really solve the problem, but looking at vCops we saw a completely different picture:

Demand - Usage - Ready 24 February

The graph shows %READY dropping from roughly 12% to around 1%, and the funny thing here is that %READY drops even further while we’re running the test.

At this point we decided to give Michael Monberg from VMware a call to ask what to look for in vCops. He showed us this neat trick: when you are looking at performance metrics for a given VM and have some graphs from the VM showing, like the ones above, you can do this:

Click on the + sign near the health tree

health tree

It expands and shows you:

click server

The host, the VM and the datastore. If you then single-click on the host, you get a new Metric Selector:

metric selector

This one is for the host, so by expanding CPU Usage I could select the Core Utilization metric and add it to the graph page. That gave us graphs from the VM and the host at the same time. Here are the graphs from Feb 20, 21 and 24, with the new metric added:

Feb 20, 16 vCPUs

Core util 20 feb

Feb 21, 24 vCPUs

Core util 21 feb

Feb 24, 8 vCPUs

Core util 24 feb

From that we could tell that with 16 and 24 vCPUs the host was totally maxed out on its physical cores, whereas with 8 vCPUs we only used around 66%. So while the VM metric only showed 35% CPU usage, the host was maxed out, and thus adding the 8 extra vCPUs had no positive effect on the VM.

When we ran the test with 8 vCPUs and got the exact same results, we actually weren’t crossing NUMA nodes; staying within a single NUMA node is what Microsoft recommends. See my blog post on NUMA and vNUMA here.
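The NUMA point can be checked with a quick Python sketch. It assumes one NUMA node per physical CPU, so 8 cores per node on this host, and it ignores hyperthreading; the function name is made up for illustration.

```python
import math

# Assumption: one NUMA node per physical socket, 8 cores each (the test host).
CORES_PER_NUMA_NODE = 8


def numa_nodes_spanned(vcpus: int, cores_per_node: int = CORES_PER_NUMA_NODE) -> int:
    """Minimum number of NUMA nodes needed to give every vCPU its own core."""
    return math.ceil(vcpus / cores_per_node)


# The three webserver sizes tried during the tests.
for vcpus in (8, 16, 24):
    nodes = numa_nodes_spanned(vcpus)
    verdict = "fits in one node" if nodes == 1 else f"spans {nodes} nodes"
    print(f"{vcpus} vCPUs: {verdict}")
```

Under these assumptions, only the 8 vCPU configuration stays inside a single node; the 16 and 24 vCPU webservers span 2 and 3 nodes respectively.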

This blog post was written to show some of the nice things you can do with vCenter Operations, and I hope you found it useful.