Mission-critical systems deliver top availability and redundancy

The need for mission-critical servers is usually evident when any downtime has a significant financial impact, as in large-scale e-commerce, banking, or stock-exchange systems.

It is equally evident when the service involves life-and-death decisions, as in medical, air-traffic-control, or search-and-rescue operations.

What, then, sets a mission-critical server apart from a “standard” server?

Almost all “standard” servers have redundancy features such as hot-swap redundant dual power supplies, multiple hot-swap cooling fans, hot-swap storage devices in some form of Redundant Array of Independent Disks (RAID), and Error-Correcting Code (ECC) memory. This is the established set of redundancy features for the standard dual-socket x86 server. Beyond these features, however, a dual-socket server is built on one large “motherboard”; a component failure there brings the whole server, and with it the service, down. How is this weakness addressed in mission-critical servers?

The mission-critical server SPARC M12-2s, manufactured by Fujitsu, is built from up to 16 building blocks (BBs), connected to each other by a high-speed crossbar bus that joins up to 32 CPUs into one interconnect and delivers up to 5.4 TB/s of peak bandwidth.

M12-2s building-block chassis

Full-scale M12-2s system with 32 CPUs

The new CPU and Memory Unit (CMU) released in September 2020 has 24 DIMM slots, bringing each BB up to 48 DIMM slots and allowing 48 × 64 GB, or 3 TB, of RAM per BB. The maximum total memory in a full server build with 16 BBs is now 48 TB.
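As a brief, hedged illustration: an administrator can verify the installed BBs, CMUs, CPUs, and DIMMs from the service processor. The sketch below assumes an XSCF shell session on an M12-2s; showhardconf and showpparstatus are standard XSCF commands, though the exact output layout varies with firmware version.

    XSCF> showhardconf        # list every BB, CMU, CPU and DIMM in the system
    XSCF> showpparstatus -a   # show the state of each physical partition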

 

Fujitsu SPARC M12-2s redundancy features not found in a standard server.

In addition to the standard features listed above, the Fujitsu SPARC M12-2s includes the following redundancy features, which together create the mission-critical system:

•   Capacity on Demand (CoD). A SPARC M12-2s server can grow while in full operation: CPU cores, whole CPUs, memory, whole building-block chassis, and individual PCIe cards can all be hot-added while the system is running applications for users. Much of the downtime normally incurred for “upgrades” is thus done away with.

•   A “motherboard” for every CPU and Memory Unit (CMU). The system can be expanded to 32 CMU system boards, two CMUs in every BB. A fault in one CMU does not fatally stop the system.

•   The high-speed interconnect (the crossbar) effectively creates one single server out of the up to 32 CPUs and their memory units. Individual CMUs can be configured into separate physical partitions or joined into one Solaris environment spanning all 32 CPUs. The crossbar also allows a failed BB to be disconnected for service: Solaris can release and disconnect from the failed unit, so one BB can be repaired while the other 15 BBs keep working. An intelligent system design installs one more BB than required, providing the headroom to use this feature.

•   Solaris 11 multipathing. A Fujitsu SPARC M12-2s server with two or more building-block chassis can create multiple redundant network and storage I/O paths from applications through multiple network or storage controllers to the same resource. The resource then remains available even if one path to it breaks (see the sketch after this list).

•   The Zettabyte File System. ZFS is the default file system for Solaris 11. On a mirrored or RAID-style pool it continuously verifies that all copies of the stored data are identical, protecting against data corruption caused by disk-sector flaws. Sysadmins can perform data scrubs and take snapshots at will (also shown in the sketch after this list).

•   Dual SAS storage controllers in every building-block chassis allow mirrored system volumes across dual device paths, ensuring access should one of the paths fail. This feature is in addition to the hot-swap RAID function.

•   Memory mirroring. The memory subsystem can be set up in mirror mode, always storing application execution data in two different RAM DIMMs.

•   Instruction retry. If an instruction execution fails, the CPU retries the instruction, ensuring that a spurious event such as cosmic radiation does not cause an application to fail.

•   Failed CPU-core replacement. If a CPU core develops an error, it is retired and replaced by a core that was not previously licensed/activated.

•   ECC/parity single-bit error correction within most of the CPU-core data paths.

•   The CPU’s level-3 cache is sliced into four sections, so only 25% of the cache becomes inactive in case of a failure.

•   66,000 data-integrity checkpoints within the CPU.

•   Redundant “lights-out” management. Every building-block chassis is equipped with dual Ethernet ports for lights-out management, allowing the unit to be redundantly connected to dual network switches/paths. Additionally, the individual lights-out management modules are interconnected and configured in a master, secondary, and slave relationship.

•   The lights-out management module is a hot-swap component, replaceable while the main server is running.

•   Hot-swap PCIe cards. Because Solaris can withdraw from a PCIe card and the cards sit in hot-swap carriers, there is no need to shut down the system to replace a PCIe card.

•   Vapor and Liquid Loop Cooling (VLLC) for the CPUs, with a radiator built into the chassis and dual liquid pumps for redundancy. Yes, the server includes a radiator!

•   Four 80 PLUS Platinum hot-swap power supplies in each BB.
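
To make the multipathing and ZFS items above concrete, here is a minimal sketch of the Solaris 11 commands involved. The interface names (net0, net1) and the pool and dataset names (rpool, rpool/export/home) are placeholders chosen for this example; ipadm, stmsboot, zpool, and zfs are the standard Solaris 11 administration tools.

    # Network multipathing: group two interfaces into one IPMP group,
    # so the address survives the loss of either physical path.
    ipadm create-ipmp ipmp0
    ipadm add-ipmp -i net0 -i net1 ipmp0
    ipadm create-addr -T static -a 192.0.2.10/24 ipmp0/v4

    # Storage multipathing: enable MPxIO so that both SAS controllers
    # present one multipathed device per LUN (takes effect after a reboot).
    stmsboot -e

    # ZFS integrity: scrub a mirrored pool and take a snapshot at will.
    zpool scrub rpool                             # re-verify every copy of every block
    zpool status rpool                            # watch scrub progress and errors
    zfs snapshot rpool/export/home@before-change  # instant point-in-time snapshot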


Uptime of 99.999% is still five minutes of downtime per year.

Five minutes of downtime per year corresponds to 99.999% uptime of a service. That is roughly the time a normal server needs to reboot once a year. The goal of five-nines uptime was therefore not achievable with anything less than a cluster, and it is still impossible to achieve while simultaneously performing necessary security patching on any operating system, with one exception: Solaris 11.
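The arithmetic behind that figure: a year is about 365.25 × 24 × 60 ≈ 525,960 minutes, and the allowed 0.001% of downtime is 525,960 × 0.00001 ≈ 5.3 minutes.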

With the fast-reboot feature implemented in Solaris 11, a server typically shuts down in 15 seconds and reboots in about one minute. Patching is applied to a snapshot (boot environment) of the boot partition while all normal applications keep running; the patched snapshot then becomes the live boot partition at the next reboot.
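A minimal sketch of such a patch cycle, assuming a Solaris 11 system with IPS packaging; the boot-environment name sru-patch is a placeholder chosen for this example:

    # Apply updates into a new boot environment while applications keep running.
    pkg update --be-name sru-patch

    # Confirm the new environment and mark it as the default for the next boot.
    beadm list
    beadm activate sru-patch

    # Fast reboot into the patched environment; -f requests fast reboot
    # on SPARC (it is the default behaviour on x86).
    reboot -f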

Together these features do away with almost all service degradation during a patch cycle. A Solaris-based service can deliver five-nines uptime with carefully planned and implemented patch and reboot cycles, in combination with a mission-critical server hardware platform. With Fujitsu SPARC M12-2s servers, Solaris delivers extraordinary uptime.

 

 Summary:

Fujitsu mission-critical servers deliver Reliability, Availability, and Serviceability (RAS) comparable to the mainframe systems Fujitsu also builds.

Fujitsu mission-critical servers are your best option for building non-stop services with Solaris running on redundant, hot-swappable, and dynamically reconfigurable server components.

Larger servers with more memory are required to accelerate future services such as in-memory databases. x86-based hypervisors do not scale beyond a two-socket server with 6 TB of memory, so a large database needs a mission-critical server to hold the entire database in memory.

Investing in larger mission-critical servers reduces the size of your network infrastructure, giving you the ability to invest in higher-speed connections for the same cost.

Investing in mission-critical servers enables your IT department to deliver five-nines uptime.

(Solaris and Java are trademarks of Oracle Corporation.)

(SPARC is a trademark of SPARC International, Inc.)

(UNIX is a trademark of The Open Group.)
