Designing for Performance: Memory and CPU best practices for Skylake/Xeon Scalable processors

A big part of designing effective, scalable compute environments is understanding and applying basic performance concepts to your applications, alongside Intel CPU and memory selection and configuration best practices. This article will focus on virtualized workloads, since they are the most common case and there are more variables to plan against.

Understanding Requirements

Questions you should be asking before embarking on this journey:

  • What is your current virtual-to-physical core oversubscription ratio, and do observed CPU ready time and application performance indicate it should be lower or higher?
  • Do you have established benchmark baselines from current or previous hardware?
  • How does your application function, and does it benefit more from a scale-up or scale-out approach?
  • Do you have licensing, security, or functional restrictions on use that require partitioning different workloads to different clusters?
  • What are your target subscription rates (e.g. % utilized), n x redundancy targets, and operational agility needs (think patching/update frequency)?

Memory Population Guidelines

Once you understand what you’re trying to build toward, it’s time to select the ideal memory layout. Xeon Scalable processors have a total of 6 memory channels per socket, up from 4 on the previous-generation Xeon v4 (Broadwell) parts. This means that for 2-socket boxes (the most common), you should be procuring memory in sets of 12 identical DIMMs, with a maximum of 24 per server (this varies by make/model, so be careful!). For servers with 4 sockets, memory should be populated in minimum sets of 24, with a maximum of 48. Following this theme, what are the common resulting configurations in a 2-socket configuration?

  • 96GB = 12x8GB
  • 192GB = 12x16GB
  • 288GB = 12x16GB 12x8GB
  • 384GB = 24x16GB or 12x32GB
  • 480GB = 12x32GB 12x8GB
  • 576GB = 12x32GB 12x16GB
  • 768GB = 24x32GB

Note: Most people aren’t purchasing 64GB DIMMs yet outside of specialized workloads or high-density deployments, because they haven’t come close enough to price parity with 2x 32GB DIMMs, but there’s nothing stopping you from following this format for any memory configuration your heart desires; the short sketch below enumerates the balanced layouts this pattern produces.
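If you’d rather generate these combinations than memorize them, here’s a minimal sketch of the full-bank logic in Python. The DIMM sizes, the 24-slot count, and the output format are illustrative assumptions; always check your server’s memory configuration guide.

```python
# Sketch: enumerate "balanced" 2-socket Skylake memory configurations, i.e.
# capacities reachable with full banks of 12 DIMMs (6 channels x 2 sockets).
# DIMM sizes and the 24-slot limit are assumptions; check your platform's guide.

from itertools import combinations_with_replacement

DIMM_SIZES_GB = [8, 16, 32, 64]   # common RDIMM capacities (assumption)
SLOTS = 24                        # 12 slots per socket on a typical 2-socket box
BANK = 12                         # one DIMM per channel, both sockets

def balanced_configs(max_banks=SLOTS // BANK):
    """Yield (total_GB, description) for every full-bank combination."""
    for n_banks in range(1, max_banks + 1):
        for sizes in combinations_with_replacement(DIMM_SIZES_GB, n_banks):
            total = BANK * sum(sizes)
            desc = " + ".join(f"{BANK}x{size}GB" for size in sizes)
            yield total, desc

for total, desc in sorted(set(balanced_configs())):
    print(f"{total:>5} GB = {desc}")
```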

Notice you’re not seeing the previously common configurations of 256GB or 512GB? If you’re being quoted a 2-socket Skylake/Xeon Scalable configuration and see those capacities, it means you’re forgoing a third of your memory bandwidth, because only 4 of the 6 channels per processor are in use. You may not notice it, but you’re being shortchanged nonetheless! This causes real, measurable performance differences at the application layer between properly configured and reduced-bandwidth servers, especially for workloads that churn through memory I/O, like databases or large memory-hungry apps.

Ok, so why should we populate in increments of 6 per processor if it isn’t a hard-and-fast requirement? Think of a processor’s access to memory like RAID 0: if you complete a full bank of DIMMs, you’re effectively spreading access and throughput across the channels in a near-linear fashion. Rough math tells us that 1 DIMM per processor would yield roughly 1/6th of the available bandwidth, 4 would yield roughly 4/6ths, and so on. The notable exception to this rule is 5 of 6 DIMMs populated, which causes a dramatic performance cliff before recovering at 6/6.
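As a quick illustration of that rough math, the toy loop below prints the near-linear estimate and flags the 5-of-6 exception. The numbers are purely illustrative; the measured figures live in the Dell study referenced below.

```python
# Rough channel-scaling math from the paragraph above. Purely illustrative:
# real results depend on DIMM ranks and interleaving (see the Dell study).

CHANNELS = 6  # memory channels per Skylake/Xeon Scalable socket

for dimms in range(1, CHANNELS + 1):
    estimate = dimms / CHANNELS  # naive linear scaling estimate
    caveat = "  <- real-world cliff here due to unbalanced interleaving" if dimms == 5 else ""
    print(f"{dimms} DIMM(s) per socket -> ~{estimate:.0%} of peak bandwidth{caveat}")
```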

Dell maintains impeccable memory studies for every recent generation of hardware released, and its Skylake study is a highly encouraged read. The findings apply to the majority of server configurations, since they all share a common Intel architecture. http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2018/01/31/skylake-memory-study

[Figure 1: Relative memory bandwidth by DIMM population. Credit: Dell]

Figure 1 above provides a visual explanation of relative memory bandwidth by population, and the linked page contains a wealth of additional information, including various advanced concepts.

CPU Selection Guidelines

Next up, it’s time to understand processor options. Here there’s no simple formula, and we need to apply a bit more contextual analysis than simply meeting capacity requirements. The Xeon Scalable processor family currently has 35 (!) unique processor models (excluding some derivatives), and they cover a wide range of speeds and capacities:

  • 4 to 28 cores per socket
  • 1.7GHz to 3.6GHz base frequency
  • Different memory speeds, hyperthreading, cache, etc.

If memory is pretty cut and dried once you understand the architecture and the process around it, choosing the right processor type is far from it. How do we digest this much information and make the right selection? Let’s go back to the original questions from the start of the article and apply them to processor selection with a few rules of thumb:

Oversubscription

V:P oversubscription ratios vary heavily by the type of hosted system, but here are a few good starting points, ranging from conservative to aggressive. All figures are virtual cores against physical cores, not logical hyperthreaded cores; a quick worked example follows the list.

  • Database: 1:1 to 3:1
  • Web/Application/Ops: 3:1 to 6:1
  • VDI: 6:1 to 12:1 (some go even crazier here)
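To make the ratios concrete, here’s a tiny sketch that turns a vCPU demand into the physical cores you’d need at a given ratio. The demands and ratios are example inputs mirroring the list above, not recommendations.

```python
# Turn a vCPU demand into required physical cores at a target v:p ratio.
# Demands and ratios are example inputs, not sizing recommendations.

workloads = {
    # workload: (vCPUs required, target v:p ratio)
    "database": (200, 2),
    "web_app_ops": (1000, 5),
    "vdi": (2400, 10),
}

for name, (vcpus, ratio) in workloads.items():
    print(f"{name}: {vcpus} vCPU at {ratio}:1 -> {vcpus / ratio:.0f} physical cores")
```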

Clock speed

  • Database: As high as you can afford. Under most circumstances, the money put into hardware pales in comparison to RDBMS licensing for the likes of SQL Server and Oracle.
  • Web/App/Ops: Anything in the 2.4-2.7GHz range. Unless an application has been specifically benchmarked and measured against processor types, this is a good middle ground to be safe in.
  • If the application(s) are truly scale-out, you can simply buy based on whatever price-per-GHz model works best, as sketched below.
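For that scale-out, price-per-GHz case, a minimal comparison might look like the sketch below. The core counts and base clocks match Intel’s published specs for these SKUs, but the prices are placeholder figures; substitute your actual quotes.

```python
# Compare SKUs by aggregate clock per dollar for scale-out workloads.
# Core counts and base clocks follow Intel's public specs; prices are
# placeholders only and should be replaced with your real quotes.

skus = {
    # name: (cores per socket, base GHz, placeholder price in USD)
    "Xeon Gold 6126": (12, 2.6, 1800),
    "Xeon Gold 6150": (18, 2.7, 3400),
}

for name, (cores, ghz, price) in skus.items():
    print(f"{name}: {cores * ghz / price:.4f} aggregate GHz per dollar")
```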

Bringing It All Together

Lastly, you’ll need to marry these two data points together. Assuming you know your total vCPU and vRAM capacity requirements, the simple math is to divide those requirements by each capacity option. Let’s use a hypothetical scenario:

  • Workload Type: General Web/App/Ops Server Virtualization
  • Target Subscription Ratio: 5:1
  • Target Clock Speed: 2.4-2.7GHz
  • Total vCPUs: 1000
  • Total vRAM: 4000GB

For this example, I’m going to arbitrarily choose some middle-ground CPUs and run some math against them, assuming 2-socket configurations.

  • Xeon Gold 6126 = 12 Cores Per Socket @ 2.6GHz, 24 Cores Total. 1000 (vCPU) / 5 (v:p) / 24 (cores) = 8.33 Servers
  • Xeon Gold 6150 = 18 Cores Per Socket @ 2.7GHz, 36 Cores Total. 1000 (vCPU) / 5 (v:p) / 36 (cores) = 5.56 Servers
  • Xeon Platinum 8168 = 24 Cores Per Socket @ 2.7GHz, 48 Cores Total. 1000 (vCPU) / 5 (v:p) / 48 (cores) = 4.17 Servers

Now against memory:

  • 4000 (vRAM) / 192 (GB/server) = 20.83 Servers
  • 4000 (vRAM) / 384 (GB/server) = 10.42 Servers
  • 4000 (vRAM) / 576 (GB/server) = 6.94 Servers
  • 4000 (vRAM) / 768 (GB/server) = 5.21 Servers

The objective of this exercise is to find compatible options with roughly equivalent total server counts across processor and memory. Having one primary constraint far away from the other can drive up solution costs if you’re not careful; by keeping the resulting server count for each factor close together, you net the ideal price/capacity balance against your performance requirements. Based on the above data, we can conclude that a 768GB memory configuration is ideal and marries up well with the 6150 processor option. Obviously there are many options in between, and there may be better ones with deeper analysis, ideally aided by some kind of spreadsheet or capacity analysis tool; a small sketch of that logic follows.
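Here’s a minimal sketch of that spreadsheet-style logic in Python, using the scenario’s numbers. The SKU core counts come from the list above and the memory capacities from the earlier configurations; treat it as a starting point rather than a finished capacity tool.

```python
# Sketch of the "marry CPU and memory" math for the 1000 vCPU / 4000 GB vRAM
# scenario at a 5:1 v:p ratio: find the CPU/memory pairing whose two
# resulting server counts land closest together.

import math

VCPUS, VRAM_GB, RATIO = 1000, 4000, 5

cpu_options = {  # name: total physical cores in a 2-socket server
    "2x Xeon Gold 6126": 24,
    "2x Xeon Gold 6150": 36,
    "2x Xeon Platinum 8168": 48,
}
mem_options_gb = [192, 384, 576, 768]

def servers_by_cpu(cores): return VCPUS / RATIO / cores
def servers_by_mem(gb):    return VRAM_GB / gb

pairings = []
for cpu, cores in cpu_options.items():
    for gb in mem_options_gb:
        c, m = servers_by_cpu(cores), servers_by_mem(gb)
        pairings.append((abs(c - m), cpu, gb, math.ceil(max(c, m))))

for gap, cpu, gb, servers in sorted(pairings)[:3]:
    print(f"{cpu} with {gb} GB -> {servers} servers (constraint gap {gap:.2f})")
```

Running this ranks the 6150 with 768GB first, matching the conclusion above.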

Let’s take this to the extreme, shall we? What if we wanted to shoot for maximum density with this same workload using the Xeon Platinum 8180 processors?

  • Xeon Platinum 8180 = 28 Cores Per Socket @ 2.5GHz, 56 Cores Total. 1000 (vCPU) / 5 (v:p) / 56 (cores) = 3.57 Servers
  • 4000 (vRAM) / 1152 (GB/server) = 3.47 Servers

*Note: 1152GB is accomplished with 12x64GB 12x32GB

You may find this is ideal in space constrained datacenters, but the price premium for the 8180 processors and 64GB DIMMs may be discouraging.

That seems … laborious.

Yep, it sure does. On the upside, building a simple spreadsheet to automate this logic makes the process quick and effective, and it only took 5-10 minutes to put together.

Want to take it to the next level? Getting comfortable with concepts like conditional formatting, VLOOKUP, or better yet INDEX and MATCH functions within Excel will allow you to do cool things like filter options and aid in balanced configuration selection. Fiscal analysis could be added to this framework if needed, or you could even auto-build the whole thing from something as simple as a raw RVTools export; a rough sketch of that idea follows.
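As a rough illustration of the RVTools idea, here’s a hedged sketch that aggregates demand from an export. The sheet name "vInfo" and the "Powerstate", "CPUs", and "Memory" (MB) columns are assumptions based on common RVTools versions, and the filename is hypothetical; adjust both to match your export.

```python
# Sketch: derive total vCPU / vRAM demand from an RVTools Excel export.
# Sheet and column names are assumptions; they vary between RVTools versions.

import pandas as pd

def demand_from_rvtools(xlsx_path: str) -> tuple[int, float]:
    vinfo = pd.read_excel(xlsx_path, sheet_name="vInfo")
    powered_on = vinfo[vinfo["Powerstate"] == "poweredOn"]  # column name assumption
    total_vcpus = int(powered_on["CPUs"].sum())
    total_vram_gb = powered_on["Memory"].sum() / 1024       # RVTools reports MB
    return total_vcpus, total_vram_gb

# vcpus, vram_gb = demand_from_rvtools("rvtools_export.xlsx")  # hypothetical file
```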

What’s Next?

Once you have a grasp of properly selecting and designing a compute environment to satisfy your proposed workloads, you’re off to the races. But beyond that, there are many other considerations, a few of which I’ll briefly touch on here:

  • These design principles only account for initial builds, but in all reality, a large-scale virtualized environment is a living, breathing entity. How are you going to track the effectiveness of your design, react to changing workload requirements, and plan and scale for the future? Sure, you could get partway there by whipping up a fancy spreadsheet, but the reality is you’ll likely need some kind of software instrumentation to enable and support your people and technology going forward.
  • Start standardizing. Nobody wants a datacenter full of snowflakes, and maximizing the cost and performance effectiveness of individual sets of workloads is exciting, but ultimately re-doing your design for every add/change/remove is not operationally feasible. Run this process on adequate representative sets in your environment, and select a standard that’s a good trade-off between “effectiveness” and “uniqueness.” Perhaps you can have 1-3 standards for your virtual environment? Choose wisely, but know they can always be re-visited.
  • Consider specialized workloads and how you need to accommodate them from a performance standpoint. There are numerous advanced concepts, such as NUMA locality, to keep in mind when selecting the right memory and CPU configuration. If CPU1 needs to access memory attached to CPU2, the request crosses the inter-socket interconnect (UPI on Xeon Scalable, QPI on earlier generations), which becomes a bottleneck. This matters for massive databases or memory-hungry apps.
  • Know you do NOT need to build dedicated clusters for Oracle or MSSQL. There are numerous legally and contractually defensible ways to partition within a larger cluster. This comes as a surprise to most organizations – or they’re otherwise uncomfortable or unwilling to put the measures in place to make this viable.
  • Considerations for cluster failover capacity, operational requirements, etc. need to be factored in. What do you want your primary constraint to be? Memory is the easiest to see, but oversubscribing it can result in memory swapping, which is bad for your environment even if only temporary. If you have to pull more hosts than planned against your n x failover capacity, do you want a bump in CPU oversubscription and ready time, or memory swapping to disk? Do you have the sophistication to measure failover capacity in a rapidly changing environment? These are crucial considerations when working within a budget; a back-of-the-envelope sketch follows this list.
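Here’s a back-of-the-envelope sketch of the failover headroom question from the last bullet. The host specs and the single-host reserve are example inputs, not recommendations.

```python
# Sketch: usable cluster capacity after holding back N+x failover headroom.
# Host specs and the one-host reserve are example inputs, not recommendations.

def usable_capacity(hosts: int, cores_per_host: int, ram_gb_per_host: int,
                    vp_ratio: float, failover_hosts: int = 1):
    surviving = hosts - failover_hosts
    return {
        "vcpu_capacity": surviving * cores_per_host * vp_ratio,
        "vram_capacity_gb": surviving * ram_gb_per_host,  # assumes no RAM oversubscription
    }

print(usable_capacity(hosts=6, cores_per_host=36, ram_gb_per_host=768, vp_ratio=5))
# -> {'vcpu_capacity': 900, 'vram_capacity_gb': 3840}
```

Note that six of the example hosts no longer cover the 1000 vCPU / 4000GB scenario once one is held in reserve, which is exactly the trade-off described above.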

I hope this article helped bring some of the concepts associated with compute and memory selection and architecture down to earth in a practical manner. Please don’t hesitate to reach out regarding the basic or advanced concepts in this article; I welcome feedback and participation.


