Designing for Performance: Memory and CPU best practices for Skylake/Xeon Scalable processors

A big part of designing effective, scalable compute environments is understanding and applying basic performance concepts to your applications, alongside Intel CPU and memory selection and configuration best practices. This article will focus on virtualized workloads, since they are the most common case and there are more variables to plan against.

Understanding Requirements

Questions you should be asking before embarking on this journey:

  • What is your current virtual-to-physical core oversubscription ratio, and do observed CPU ready time and application performance indicate it should be lower or higher?
  • Do you have established benchmark baselines from current or previous hardware?
  • How does your application function, and does it benefit more from a scale-up or scale-out approach?
  • Do you have licensing, security, or functional restrictions on use that require partitioning different workloads to different clusters?
  • What are your target subscription rates (e.g. % utilized), n x redundancy targets, and operational agility needs (think patching/update frequency)?

Memory Population Guidelines

Once you understand what you’re trying to build toward, it’s time to select the ideal memory layout. Xeon Scalable processors have a total of 6 memory channels per socket, up from 4 on the previous-generation Xeon v4 (Broadwell) parts. This means that for 2-socket boxes (the most common), you should be procuring memory in sets of 12 identical DIMMs, with a maximum of 24 per server (this varies by make/model, so be careful!). For servers with 4 sockets, memory should be populated in minimum sets of 24, with a maximum of 48. Following this theme, what are the common resulting configurations in a 2-socket configuration?

  • 96GB = 12x8GB
  • 192GB = 12x16GB
  • 288GB = 12x16GB 12x8GB
  • 384GB = 24x16GB or 12x32GB
  • 480GB = 12x32GB 12x8GB
  • 576GB = 12x32GB 12x16GB
  • 768GB = 24x32GB

Note: Most people aren’t purchasing 64GB DIMMs yet outside of specialized workloads or high-density deployments, because they haven’t come close enough to price parity with 2x 32GB DIMMs, but there’s nothing stopping you from following this format for any memory configuration your heart desires; the short sketch below enumerates the balanced layouts this pattern produces.
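If you’d rather generate these combinations than memorize them, here’s a minimal sketch of the full-bank logic in Python. The DIMM sizes, the 24-slot count, and the output format are illustrative assumptions; always check your server’s memory configuration guide.

```python
# Sketch: enumerate "balanced" 2-socket Skylake memory configurations, i.e.
# capacities reachable with full banks of 12 DIMMs (6 channels x 2 sockets).
# DIMM sizes and the 24-slot limit are assumptions; check your platform's guide.

from itertools import combinations_with_replacement

DIMM_SIZES_GB = [8, 16, 32, 64]   # common RDIMM capacities (assumption)
SLOTS = 24                        # 12 slots per socket on a typical 2-socket box
BANK = 12                         # one DIMM per channel, both sockets

def balanced_configs(max_banks=SLOTS // BANK):
    """Yield (total_GB, description) for every full-bank combination."""
    for n_banks in range(1, max_banks + 1):
        for sizes in combinations_with_replacement(DIMM_SIZES_GB, n_banks):
            total = BANK * sum(sizes)
            desc = " + ".join(f"{BANK}x{size}GB" for size in sizes)
            yield total, desc

for total, desc in sorted(set(balanced_configs())):
    print(f"{total:>5} GB = {desc}")
```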

Notice you’re not seeing the previously common configurations of 256GB or 512GB? If you’re being quoted a 2-socket Skylake/Xeon Scalable configuration and see those capacities, it means you’re forgoing a third of your memory bandwidth, because only 4 of the 6 channels per processor are in use. You may not notice it, but you’re being shortchanged nonetheless! This causes real, measurable performance differences at the application layer between properly configured and reduced-bandwidth servers, especially for workloads that churn through memory I/O, like databases or large memory-hungry apps.

Ok, so why should we populate in increments of 6 per processor if it isn’t a hard-and-fast requirement? Think of a processor’s access to memory like RAID 0: if you complete a full bank of DIMMs, you’re effectively spreading access and throughput across the channels in a near-linear fashion. Rough math tells us that 1 DIMM per processor would yield roughly 1/6th of the available bandwidth, 4 would yield roughly 4/6ths, and so on. The notable exception to this rule is 5 of 6 DIMMs populated, which causes a dramatic performance cliff before recovering at 6/6.
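As a quick illustration of that rough math, the toy loop below prints the near-linear estimate and flags the 5-of-6 exception. The numbers are purely illustrative; the measured figures live in the Dell study referenced below.

```python
# Rough channel-scaling math from the paragraph above. Purely illustrative:
# real results depend on DIMM ranks and interleaving (see the Dell study).

CHANNELS = 6  # memory channels per Skylake/Xeon Scalable socket

for dimms in range(1, CHANNELS + 1):
    estimate = dimms / CHANNELS  # naive linear scaling estimate
    caveat = "  <- real-world cliff here due to unbalanced interleaving" if dimms == 5 else ""
    print(f"{dimms} DIMM(s) per socket -> ~{estimate:.0%} of peak bandwidth{caveat}")
```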

Dell maintains impeccable memory studies for every recent generation of hardware released, and its Skylake study is a highly encouraged read. The findings apply to the majority of server configurations, since they all share a common Intel architecture. http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2018/01/31/skylake-memory-study

[Figure 1: Relative memory bandwidth by DIMM population. Credit: Dell]

Figure 1 above provides a visual explanation of relative memory bandwidth by population, and the linked page contains a wealth of additional information, including various advanced concepts.

CPU Selection Guidelines

Next up, it’s time to understand processor options. Here there’s no simple formula, and we need to apply a bit more contextual analysis than simply meeting capacity requirements. The Xeon Scalable processor family currently has 35 (!) unique processor models (excluding some derivatives), and they cover a wide range of speeds and capacities:

  • 4 to 28 cores per socket
  • 1.7GHz to 3.6GHz base frequency
  • Different memory speeds, hyperthreading, cache, etc.

If memory is pretty cut and dried once you understand the architecture and the process around it, choosing the right processor type is far from it. How do we digest this much information and make the right selection? Let’s go back to the original questions from the start of the article and apply them to processor selection with a few rules of thumb:

Oversubscription

V:P oversubscription ratios vary heavily by the type of hosted system, but here are a few good starting points, ranging from conservative to aggressive. All figures are virtual cores against physical cores, not logical hyperthreaded cores; a quick worked example follows the list.

  • Database: 1:1 to 3:1
  • Web/Application/Ops: 3:1 to 6:1
  • VDI: 6:1 to 12:1 (some go even crazier here)
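To make the ratios concrete, here’s a tiny sketch that turns a vCPU demand into the physical cores you’d need at a given ratio. The demands and ratios are example inputs mirroring the list above, not recommendations.

```python
# Turn a vCPU demand into required physical cores at a target v:p ratio.
# Demands and ratios are example inputs, not sizing recommendations.

workloads = {
    # workload: (vCPUs required, target v:p ratio)
    "database": (200, 2),
    "web_app_ops": (1000, 5),
    "vdi": (2400, 10),
}

for name, (vcpus, ratio) in workloads.items():
    print(f"{name}: {vcpus} vCPU at {ratio}:1 -> {vcpus / ratio:.0f} physical cores")
```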

Clock speed

  • Database: As high as you can afford. Under most circumstances, the money put into hardware pales in comparison to RDBMS licensing for the likes of SQL Server and Oracle.
  • Web/App/Ops: Anything in the 2.4-2.7GHz range. Unless an application has been specifically benchmarked and measured against processor types, this is a good middle ground to be safe in.
  • If the application(s) are truly scale-out, you can simply buy based on whatever price-per-GHz model works best, as sketched below.
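For that scale-out, price-per-GHz case, a minimal comparison might look like the sketch below. The core counts and base clocks match Intel’s published specs for these SKUs, but the prices are placeholder figures; substitute your actual quotes.

```python
# Compare SKUs by aggregate clock per dollar for scale-out workloads.
# Core counts and base clocks follow Intel's public specs; prices are
# placeholders only and should be replaced with your real quotes.

skus = {
    # name: (cores per socket, base GHz, placeholder price in USD)
    "Xeon Gold 6126": (12, 2.6, 1800),
    "Xeon Gold 6150": (18, 2.7, 3400),
}

for name, (cores, ghz, price) in skus.items():
    print(f"{name}: {cores * ghz / price:.4f} aggregate GHz per dollar")
```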

Bringing It All Together

Lastly, you’ll need to marry these two data points together. Assuming you know your total vCPU and vRAM capacity requirements, the simple math is to divide those requirements by each capacity option. Let’s use a hypothetical scenario:

  • Workload Type: General Web/App/Ops Server Virtualization
  • Target Subscription Ratio: 5:1
  • Target Clock Speed: 2.4-2.7GHz
  • Total vCPUs: 1000
  • Total vRAM: 4000GB

For this example, I’m going to arbitrarily choose some middle-ground CPUs and run some math against them, assuming 2-socket configurations.

  • Xeon Gold 6126 = 12 Cores Per Socket @ 2.6GHz, 24 Cores Total. 1000 (vCPU) / 5 (v:p) / 24 (cores) = 8.33 Servers
  • Xeon Gold 6150 = 18 Cores Per Socket @ 2.7GHz, 36 Cores Total. 1000 (vCPU) / 5 (v:p) / 36 (cores) = 5.56 Servers
  • Xeon Platinum 8168 = 24 Cores Per Socket @ 2.7GHz, 48 Cores Total. 1000 (vCPU) / 5 (v:p) / 48 (cores) = 4.17 Servers

Now against memory:

  • 4000 (vRAM) / 192 (GB/server) = 20.83 Servers
  • 4000 (vRAM) / 384 (GB/server) = 10.42 Servers
  • 4000 (vRAM) / 576 (GB/server) = 6.94 Servers
  • 4000 (vRAM) / 768 (GB/server) = 5.21 Servers

The objective of this exercise is to find compatible options with roughly equivalent total server counts across processor and memory. Having one primary constraint far away from the other can drive up solution costs if you’re not careful; by keeping the resulting server count for each factor close together, you net the ideal price/capacity balance against your performance requirements. Based on the above data, we can conclude that a 768GB memory configuration is ideal and marries up well with the 6150 processor option. Obviously there are many options in between, and there may be better ones with deeper analysis, ideally aided by some kind of spreadsheet or capacity analysis tool; a small sketch of that logic follows.
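Here’s a minimal sketch of that spreadsheet-style logic in Python, using the scenario’s numbers. The SKU core counts come from the list above and the memory capacities from the earlier configurations; treat it as a starting point rather than a finished capacity tool.

```python
# Sketch of the "marry CPU and memory" math for the 1000 vCPU / 4000 GB vRAM
# scenario at a 5:1 v:p ratio: find the CPU/memory pairing whose two
# resulting server counts land closest together.

import math

VCPUS, VRAM_GB, RATIO = 1000, 4000, 5

cpu_options = {  # name: total physical cores in a 2-socket server
    "2x Xeon Gold 6126": 24,
    "2x Xeon Gold 6150": 36,
    "2x Xeon Platinum 8168": 48,
}
mem_options_gb = [192, 384, 576, 768]

def servers_by_cpu(cores): return VCPUS / RATIO / cores
def servers_by_mem(gb):    return VRAM_GB / gb

pairings = []
for cpu, cores in cpu_options.items():
    for gb in mem_options_gb:
        c, m = servers_by_cpu(cores), servers_by_mem(gb)
        pairings.append((abs(c - m), cpu, gb, math.ceil(max(c, m))))

for gap, cpu, gb, servers in sorted(pairings)[:3]:
    print(f"{cpu} with {gb} GB -> {servers} servers (constraint gap {gap:.2f})")
```

Running this ranks the 6150 with 768GB first, matching the conclusion above.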

Let’s take this to the extreme, shall we? What if we wanted to shoot for maximum density with this same workload using the Xeon Platinum 8180 processors?

  • Xeon Platinum 8180 = 28 Cores Per Socket @ 2.5GHz, 56 Cores Total. 1000 (vCPU) / 5 (v:p) / 56 (cores) = 3.57 Servers
  • 4000 (vRAM) / 1152 (GB/server) = 3.47 Servers

*Note: 1152GB is accomplished with 12x64GB 12x32GB

You may find this is ideal in space constrained datacenters, but the price premium for the 8180 processors and 64GB DIMMs may be discouraging.

That seems … laborious.

Yep, it sure does. On the upside, building a simple spreadsheet to automate this logic makes the process quick and effective, and it only took 5-10 minutes to put together.

Want to take it to the next level? Getting comfortable with concepts like conditional formatting, VLOOKUP, or better yet INDEX and MATCH functions within Excel will allow you to do cool things like filter options and aid in balanced configuration selection. Fiscal analysis could be added to this framework if needed, or you could even auto-build the whole thing from something as simple as a raw RVTools export; a rough sketch of that idea follows.
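As a rough illustration of the RVTools idea, here’s a hedged sketch that aggregates demand from an export. The sheet name "vInfo" and the "Powerstate", "CPUs", and "Memory" (MB) columns are assumptions based on common RVTools versions, and the filename is hypothetical; adjust both to match your export.

```python
# Sketch: derive total vCPU / vRAM demand from an RVTools Excel export.
# Sheet and column names are assumptions; they vary between RVTools versions.

import pandas as pd

def demand_from_rvtools(xlsx_path: str) -> tuple[int, float]:
    vinfo = pd.read_excel(xlsx_path, sheet_name="vInfo")
    powered_on = vinfo[vinfo["Powerstate"] == "poweredOn"]  # column name assumption
    total_vcpus = int(powered_on["CPUs"].sum())
    total_vram_gb = powered_on["Memory"].sum() / 1024       # RVTools reports MB
    return total_vcpus, total_vram_gb

# vcpus, vram_gb = demand_from_rvtools("rvtools_export.xlsx")  # hypothetical file
```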

What’s Next?

Once you have a grasp of properly selecting and designing a compute environment to satisfy your proposed workloads, you’re off to the races. But beyond that, there are many other considerations, a few of which I’ll briefly touch on here:

  • These design principles only account for initial builds, but in all reality, a large-scale virtualized environment is a living, breathing entity. How are you going to track the effectiveness of your design, react to changing workload requirements, and plan and scale for the future? Sure, you could get partway there by whipping up a fancy spreadsheet, but the reality is you’ll likely need some kind of software instrumentation to enable and support your people and technology going forward.
  • Start standardizing. Nobody wants a datacenter full of snowflakes, and maximizing the cost and performance effectiveness of individual sets of workloads is exciting, but ultimately re-doing your design for every add/change/remove is not operationally feasible. Run this process on adequate representative sets in your environment, and select a standard that’s a good trade-off between “effectiveness” and “uniqueness.” Perhaps you can have 1-3 standards for your virtual environment? Choose wisely, but know they can always be re-visited.
  • Consider specialized workloads and how you need to accommodate them from a performance standpoint. There are numerous advanced concepts, such as NUMA locality, to keep in mind when selecting the right memory and CPU configuration. If CPU1 needs to access memory attached to CPU2, the request crosses the inter-socket interconnect (UPI on Xeon Scalable, QPI on earlier generations), which becomes a bottleneck. This matters for massive databases or memory-hungry apps.
  • Know you do NOT need to build dedicated clusters for Oracle or MSSQL. There are numerous legally and contractually defensible ways to partition within a larger cluster. This comes as a surprise to most organizations – or they’re otherwise uncomfortable or unwilling to put the measures in place to make this viable.
  • Considerations for cluster failover capacity, operational requirements, etc. need to be factored in. What do you want your primary constraint to be? Memory is the easiest to see, but oversubscribing it can result in memory swapping, which is bad for your environment even if only temporary. If you have to pull more hosts than planned against your n x failover capacity, do you want a bump in CPU oversubscription and ready time, or memory swapping to disk? Do you have the sophistication to measure failover capacity in a rapidly changing environment? These are crucial considerations when working within a budget; a back-of-the-envelope sketch follows this list.
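Here’s a back-of-the-envelope sketch of the failover headroom question from the last bullet. The host specs and the single-host reserve are example inputs, not recommendations.

```python
# Sketch: usable cluster capacity after holding back N+x failover headroom.
# Host specs and the one-host reserve are example inputs, not recommendations.

def usable_capacity(hosts: int, cores_per_host: int, ram_gb_per_host: int,
                    vp_ratio: float, failover_hosts: int = 1):
    surviving = hosts - failover_hosts
    return {
        "vcpu_capacity": surviving * cores_per_host * vp_ratio,
        "vram_capacity_gb": surviving * ram_gb_per_host,  # assumes no RAM oversubscription
    }

print(usable_capacity(hosts=6, cores_per_host=36, ram_gb_per_host=768, vp_ratio=5))
# -> {'vcpu_capacity': 900, 'vram_capacity_gb': 3840}
```

Note that six of the example hosts no longer cover the 1000 vCPU / 4000GB scenario once one is held in reserve, which is exactly the trade-off described above.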

I hope this article helped bring some of the concepts associated with compute and memory selection and architecture down to earth in a practical manner. Please don’t hesitate to reach out regarding the basic or advanced concepts in this article; I welcome feedback and participation.


