Smart Sizing: 2 – Analysis

In my last post, I offered a brief discussion of some of the tools one might use to collect data on an existing environment, with the intention of moving some / all of the workloads therein to another set of new infrastructure… This is a task I have often had to accomplish when helping customers move to new servers, off arrays and onto hyperconverged infrastructure (specifically vSAN), or to cloud-based IaaS solutions (such as VMware Cloud on AWS).

As mentioned in that last post, there are LOTS of tools that may be used to collect data, and I only mentioned a few – LiveOptics, RVTools, vROps, etc. There are more, not to mention tools like HCIBench, which can be used to stress test HCI infrastructure. This post, however, is going to focus on what I believe is the most important aspect of any sizing exercise – understanding your workloads. I will be analyzing the collected data with an eye toward migrating into VMC on AWS.

Credit where credit is due – I have to admit that for a long time I would simply aggregate all the collected data for an environment and generate an average workload profile: average configured vCPU, vRAM, VMDK, etc. However, one of my colleagues, Chris Saunders, demonstrated to me one day how to use Python and Pandas to get smarter at actually understanding your workload distribution. Chris is a fellow SE on the VMware Cloud team, and you may know him from the “All Systems Go” podcast (https://open.spotify.com/show/2poerhSf1cy7Z5mAHr1JXr). What he showed me in Python I am hoping to show you how to do in Excel… and perhaps in a future post I will endeavor to get fancier with Python myself!

Data Gaps…

A short note about data gaps – or those data points that you are / were not able to capture. Some of the tools mentioned in my previous post may not be available to you… And therefore, some of the data you are hoping to collect may not be collected. That may be OK – it is not always easy to know everything about an environment, how it is being used, availability requirements, performance requirements, etc. Far too often the individual or team that owns or runs the application may have requested too much / too little CPU or memory, or they aren’t available to answer questions.

Sizing for a new environment is as much about understanding your knowledge gaps and addressing them as it is about simply gathering data and using arithmetic. I like to think of it as a discussion about unknowns… as an employee of a software vendor, I am frequently faced with unknowns – whether they be SLAs, availability requirements, security concerns – simply because the individual or team I am working with doesn’t have immediate access to the application owners, or there are restrictions in the environment that prevent the use of a tool that might give us the data we are looking for.

If or when you find there are blanks you can’t fill in – say for example, storage IOPS – make some assumptions. It’s OK if they aren’t correct – as long as the folks you are working with understand which variables you have made assumptions for, you can have a discussion about different / better assumptions.
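One way to keep those assumptions visible is to flag every value that was filled in rather than measured. The snippet below is only a sketch of that idea: the VM names, the `iops` field, and the placeholder value of 50 are all made up for illustration and do not come from any tool mentioned here.

```python
# Hypothetical inventory rows; None marks a value the collection could not capture.
vms = [
    {"name": "app01", "vcpu": 4, "iops": 350},
    {"name": "db01", "vcpu": 8, "iops": None},
    {"name": "web01", "vcpu": 2, "iops": None},
]

ASSUMED_IOPS = 50  # placeholder assumption, meant to be challenged in the discussion


def fill_gaps(rows, field, assumed):
    """Return copies of rows with missing values filled in and flagged as assumed."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the collected data stays untouched
        missing = new_row[field] is None
        if missing:
            new_row[field] = assumed
        new_row[field + "_assumed"] = missing
        filled.append(new_row)
    return filled


filled_vms = fill_gaps(vms, "iops", ASSUMED_IOPS)
assumed_names = [row["name"] for row in filled_vms if row["iops_assumed"]]
print("IOPS assumed for:", assumed_names)
```

The list of flagged names is what I would bring to the conversation, so nobody mistakes an assumption for a measurement.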

Over-Allocation

Now that you have your data collected, and it is sitting in a dashboard, CSV or Excel file, why is it not enough to simply add up the total of vCPU, vRAM, and VMDK? A simple answer – over-allocation. That simply means that your virtual machines aren’t actually using or needing all the resources they may have been configured with. ESXi is one of the most advanced resource schedulers available, and it does an excellent job of time-slicing between virtual machines. Furthermore, it is possible for virtual machines to share memory pages, as well as have disks that have been thin-provisioned. All of which means it is possible for your physical environment to provide the necessary resources for your workloads to run in a satisfactory manner even though the environment may be much smaller than if those workloads were deployed individually on their own servers.

Therefore, in addition to understanding the configured resources for your individual virtual machines, it is also highly beneficial to know the actual utilization of those virtual machines, so an appropriate virtual-to-physical ratio may be calculated. The sizer for VMC on AWS includes some generally accepted averages for CPU and memory over-allocation – it is OK to use these averages, but personally I like to ensure my client understands the variables in use, as well as any assumptions that may be used (see above).

In the examples that follow, I have simply used the defaults for some of the utilization numbers.
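To make the virtual-to-physical ratio concrete, here is a small arithmetic sketch. The configured totals (2626 vCPU, 6971 GB vRAM) are the ones from the example environment in this post. The physical core count, physical RAM, and memory utilization are purely hypothetical numbers I picked for illustration; they are not the sizer's defaults.

```python
# Configured totals from the example environment in this post.
total_vcpu = 2626
total_vram_gb = 6971

# Hypothetical source cluster and utilization; assumptions for illustration only.
physical_cores = 640        # e.g. 20 hosts x 32 cores
physical_ram_gb = 8192      # e.g. 20 hosts x ~410 GB
avg_mem_utilization = 0.60  # fraction of configured vRAM actually active

# Virtual-to-physical ratios: how over-allocated is the current environment?
vcpu_to_core = total_vcpu / physical_cores
vram_to_pram = total_vram_gb / physical_ram_gb

# Memory the workloads actually appear to need, given the assumed utilization.
active_mem_gb = total_vram_gb * avg_mem_utilization

print(f"vCPU : pCore ratio  {vcpu_to_core:.2f} : 1")
print(f"vRAM : pRAM ratio   {vram_to_pram:.2f} : 1")
print(f"Active memory (GB)  {active_mem_gb:.0f}")
```

Swapping in measured utilization for the assumed figure is what turns this from a guess into a sizing input.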

Aggregate View

I like to start by simply generating an overview of the environment – how many virtual machines, as well as the range and average of the configured resources for vCPU, vRAM, and consumed disk space. The table below, taken from a recent analysis I performed, is an example of such an overview.

VM count: 756

| | Min | Max | Sum | Avg |
|---|---|---|---|---|
| vCPU | 1 | 29 | 2626 | 3.5 |
| vRAM (GB) | 2 | 128 | 6971 | 9.2 |
| Used Disk Space (GB) | 1.3 | 1698 | 54283 | 72 |

This allows me to immediately get a grasp on the environment – how big a box are we going to be building? Also, the MAX values allow me to see if there’s anything that might be too big to move, or at least something we might need to take into consideration. In the case above, I can see right away there is at least 1 virtual machine with 128 GB of vRAM configured – that might be a topic of discussion. I can also see there is at least 1 virtual machine consuming over 1TB of disk – again, a possible topic of discussion.
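For anyone who would rather script this overview than build it in Excel, the same min / max / sum / average summary can be sketched in a few lines of plain Python. The rows and field names below are hypothetical stand-ins for whatever your collection tool exports.

```python
from statistics import mean

# Hypothetical sample rows standing in for an exported VM inventory.
vms = [
    {"name": "app01", "vcpu": 4, "vram_gb": 16, "used_disk_gb": 120.5},
    {"name": "db01", "vcpu": 16, "vram_gb": 128, "used_disk_gb": 1698.0},
    {"name": "web01", "vcpu": 1, "vram_gb": 2, "used_disk_gb": 1.3},
    {"name": "web02", "vcpu": 2, "vram_gb": 4, "used_disk_gb": 40.2},
]


def summarize(rows, field):
    """Min, max, sum and average of one numeric field across all rows."""
    values = [row[field] for row in rows]
    return {
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "avg": mean(values),
    }


summary = {f: summarize(vms, f) for f in ("vcpu", "vram_gb", "used_disk_gb")}

print("VM count:", len(vms))
for field, stats in summary.items():
    print(field, stats)
```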

I will usually run a quick sizing scenario using the average values for the entire environment to generate an upper bound on the size of the required infrastructure. I use the publicly available VMC on AWS sizing tool – found here: https://vmc.vmware.com/sizer.

In this case, using those values, I come up with the following:

  • VMC Server Pack
    • 14 ESXi nodes
  • Recommended Calculation Is
    • Memory bound
  • Instance Type
    • i3
  • Used Storage
    • 78.81 TB
  • Free Storage with vSAN Policy (FTT-2, RAID-6)
    • 44.23 TB

Workload Profile View

Many times, getting an average is enough. I have begun to find, however, that the more detail I can extract from an environment, the better the conversation I can generate with the application owner(s), clients, stakeholders, etc. Pivot Tables are a wonderful tool for extracting some additional level of detail out of the environment, and I will use them to build something like the following.

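| Profile by vCPU | VM Count | Average of Provisioned Memory (GB) | Average of Guest VM Disk Used (GB) |
|---|---|---|---|
| 1-4 | 620 | 6 | 69 |
| 5-8 | 95 | 20 | 52 |
| 9-12 | 21 | 25 | 172 |
| 13-16 | 10 | 34 | 168 |
| 17-20 | 4 | 22 | 67 |
| 21-24 | 5 | 30 | 128 |
| 25-29 | 1 | 16 | 119 |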
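The grouping a Pivot Table does here can also be sketched in plain Python: put each VM into a vCPU band, then average memory and disk per band. The sample rows and field names are hypothetical. Note that this sketch uses uniform bands of 4 vCPU, so a 29 vCPU machine would land in a "29-32" band rather than the "25-29" row shown above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; real data would come from your collection export.
vms = [
    {"vcpu": 2, "vram_gb": 4, "used_disk_gb": 40},
    {"vcpu": 4, "vram_gb": 8, "used_disk_gb": 100},
    {"vcpu": 8, "vram_gb": 16, "used_disk_gb": 60},
    {"vcpu": 6, "vram_gb": 24, "used_disk_gb": 44},
    {"vcpu": 12, "vram_gb": 32, "used_disk_gb": 170},
]

BAND_WIDTH = 4  # vCPU per band: 1-4, 5-8, 9-12, ...


def vcpu_band(vcpu, width=BAND_WIDTH):
    """Return the (low, high) vCPU band that a VM falls into."""
    low = ((vcpu - 1) // width) * width + 1
    return (low, low + width - 1)


# Group the VMs by band.
groups = defaultdict(list)
for vm in vms:
    groups[vcpu_band(vm["vcpu"])].append(vm)

# One workload profile per band, ordered by band.
profiles = []
for (low, high), members in sorted(groups.items()):
    profiles.append({
        "profile": f"{low}-{high}",
        "vm_count": len(members),
        "avg_vram_gb": mean(m["vram_gb"] for m in members),
        "avg_used_disk_gb": mean(m["used_disk_gb"] for m in members),
    })

for profile in profiles:
    print(profile)
```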

Now we begin to see a much clearer picture of what is actually in the environment… Furthermore, each row can now become its own profile to be used in running a sizing exercise. As you can see, each row has different average values for configured vRAM, consumed storage, etc. Here is a screenshot using the first profile of 620 virtual machines:

Again, using the VMC Sizer, I built another scenario, this time creating separate workload profiles for each of the lines above, and received the following:

  • VMC Server Pack
    • 13 ESXi nodes
  • Recommended Calculation Is
    • Memory bound
  • Instance Type
    • i3
  • Used Storage
    • 78.71 TB
  • Free Storage with vSAN Policy (FTT-2, RAID-6)
    • 37.38 TB

OK – not much difference in this case, though it did drop the host count by 1 node.

‘Pilot-Light’ View

I usually run one additional scenario. Many customers are interested in a DR solution, where we only need as many hosts as are necessary to hold the replicated data. For this scenario I change the configured vCPU and vRAM, as well as the CPU and memory utilization, because I want to force the sizer to consider only storage. Below is a screenshot of a sample workload profile: you can see I have left the storage numbers alone, but minimized the CPU and RAM configurations. Once again, I am only including a screenshot of the details using the first profile of 620 virtual machines from above.

Now we see that the node count has dropped to only 8 hosts – this would represent the necessary “pilot light” environment to host the replication of the virtual machines. In the case of a DR event, the customer can begin to turn on workloads, and Elastic DRS will automatically scale the environment out to accommodate the virtual machines as their CPU and memory utilization expand.

  • VMC Server Pack
    • 8 ESXi nodes
  • Recommended Calculation Is
    • Storage bound
  • Instance Type
    • i3
  • Used Storage
    • 78.79 TB
  • Free Storage with vSAN Policy (FTT-2, RAID-6)
    • 2.77 TB

Outliers

Finally, it is common to find a handful of workloads that don’t conform to the average, or in fact skew the average due to their size. These are frequently large application or database servers which are either consuming a large amount of storage, or else have large CPU / memory resources allocated to them.

I find it important to identify those workloads to the customer / end user, and have a conversation as to whether they are going to be part of the migration, or if there are opportunities to reconfigure them. In my next post I will provide a couple examples of this.
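As a rough illustration of how such workloads might be picked out, the sketch below flags any VM whose value for a field is more than five times the median for that field. The factor of 5, the VM names, and the sample values are all arbitrary assumptions of mine; a fixed threshold or a percentile cut would work just as well.

```python
from statistics import median

# Hypothetical sample rows; names and values are made up for illustration.
vms = [
    {"name": "web01", "vcpu": 2, "vram_gb": 4, "used_disk_gb": 40},
    {"name": "web02", "vcpu": 4, "vram_gb": 8, "used_disk_gb": 72},
    {"name": "app01", "vcpu": 2, "vram_gb": 8, "used_disk_gb": 65},
    {"name": "db01", "vcpu": 16, "vram_gb": 128, "used_disk_gb": 1698},
    {"name": "app05", "vcpu": 29, "vram_gb": 16, "used_disk_gb": 90},
    {"name": "web03", "vcpu": 4, "vram_gb": 8, "used_disk_gb": 55},
]


def find_outliers(rows, field, factor=5):
    """Names of rows whose field exceeds `factor` times the median (arbitrary rule)."""
    cutoff = factor * median(row[field] for row in rows)
    return [row["name"] for row in rows if row[field] > cutoff]


outliers = {
    field: find_outliers(vms, field)
    for field in ("vcpu", "vram_gb", "used_disk_gb")
}

for field, names in outliers.items():
    print(field, "outliers:", names)
```

The output is a short list of names per resource, which is about the right size for a conversation with the application owners.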

Summary of Sizing

Now that we have run a collection and analyzed the data, we can have a conversation with the customer. Depending on their intent / desires, they are going to have to provision at least 8, and perhaps as many as 14, hosts in VMware Cloud on AWS, using the i3 instance type.

In a future post, I will provide some additional real-life examples, including the Excel files.
