I’ve recently been involved in performance testing a SharePoint 2013 farm. This has led to some discoveries on what you can use vCops for when you do performance testing, including what metrics you should look at.
The setup we used for testing, was what Microsoft calls a Medium Farm topology. It consists of 2 Web Frontends, 2 application servers for search and index and 1 database server. In front of the web frontends we’ve placed a Citrix Netscaler for load balancing, SSL off load and such. Each server runs Windows 2012, the SharePoint is 2013 and Microsoft SQL is 2012. The SharePoint has a few webparts on the front page that’s been built for the webpage.
- Webservers: 16 vCpu’s 16 Gb ram
- Application Servers: 8 vCpu’s 8 Gb ram
- Database Server: 8 vCpu’s and 16 Gb ram
All these were run on a single vSphere 5.0 host with 4 CPU’s and 8 cores each, and 196 Gb of Ram, giving us 32 cores. So at this point we were overcomitting on CPU somewhat, having 56 vCpu’s provisioned. These were not the final destination hosts of these servers, at the final destination hosts we wouldn’t overcommit. However we were certain that if we saw bottlenecks here then they would also be present at the final destination for the farm.
Inside Windows we could see the CPU maxing out, to a point where system processes were complaining, and we saw load times of over 30 seconds for the front page.
First thing that happened was someone was looking inside windows and saying more CPU is needed. I, however was looking at vCenter performance and only seeing a max of roughly 40% CPU usage. When looking at vCops i saw a Usage maxing out at 35.63% but a demand of 54.52%.
I interpreted that result as, the VM actually requested more CPU than vmware was giving it. However it was decided to try and give 8 more vCpu’s to the Webservers, instead of scaling them down to 8 vCpu’s
Running the test again with 24 vCpu webservers yielded the exact same result, more than 30 second load times and CPU inside windows maxing out. Looking at vCops we saw this:
An even lower % Usage and Demand. I’m thinking we’re overcomitting too much now, and going back into vCops i pulled out the %Ready counters as well, for both feb 20 and 21.
What that told us was that at 16 vCpu’s %READY maxed out at around 14%, which is kinda bad. And adding 8 more vCpu’s that jumps to 28.75%, meaning close 1/3 of the time the machine was ready, but couldn’t get access to a CPU on the host. One funny thing you can tell from the first picture, is that you can tell when the 8 vCpu’s were added, since that moved the “idle” %READY from around 6% to 14.87%.
The following Monday we scaled the machines down from 24 to 8 vCpus’s and ran the test again. Load times were still at around 30 seconds, so we didn’t really solve the problem, but looking at vCops we saw a completely different picture
The graphs shows the %READY dropping from roughly 12% to around 1%, and funny thing here is that, while we’re running the test the %READY drops even further.
At this time we decided to give Michael Monberg from Vmware a call, to ask about what to look for in vCops. He showed us this neat trick, when looking at performance metrics for a given VM, and have some graphs from the VM showing, like the ones above, you can do this:
Click on the + sign near the health tree
It expands and shows you:
The host, the VM and the Datastore, if you then single click on the host, you get a new Metric Selector
But one from the Host, so expanding the CPU Usage i could select the Core Utilization metric and add that to the graph page. That now gave us graphs from the VM and the Host and the same time. Showing the graphs from Feb 20, 21 and 24 but with the new metric added:
Feb 20, 16 vCpu’s
Feb 21th 24 vCpu’s
Feb 24th 8 vCpu’s
From that we could tell that with 16 and 24 vCpu’s the host was totally maxed out on its physical cores, where on 8 vCpu’s we only used around 66%. So when the VM metric only shows 35% CPU usage the Host was maxed out, and thus adding the 8 extra vCpu’s had no positive effect on the VM.
When we ran the test with 8 vCpu’s and got the exact same results, we actually weren’t crossing NUMA nodes, which is what Microsoft recommends. See my blog post on NUMA and vNUMA here
This blog post was written to show some of the nice things you can do with vCenter Operations, and i hope you found it useful.