Tailor your basket to match your eggs
High Availability and Disaster Recovery on Power – IBM has an impressive portfolio to offer
Antony Steel, Belisama
Over the last 10 years, IT operations have evolved to the point where critical applications are rarely hosted on the same Frame (Server) or, in many cases, even in the same Data Centre. However, I have found that this development tends to be piecemeal rather than driven by a detailed review: one that examines the application requirements for High Availability / Disaster Recovery (HA/DR) and then matches these requirements to the solutions available across all the infrastructure.
For many years IBM has been recognised as a leader in HA and DR solutions for workloads on Power, with solutions designed to meet the availability requirements of critical enterprise applications. In the last few years it has expanded its portfolio to include protection for the “less critical” applications in the Data Centre. These are the applications that can afford a slightly longer outage and/or have less stringent requirements around data loss. In particular, if you are looking for a simple and reasonably priced HA or DR solution, IBM now has VMRM HA and VMRM DR (more details below).
It is worth noting that in the 2019 ITIC Global Server, Hardware, OS Reliability Survey (Mid-year update), 66% of the respondents said that increasing workloads are impacting negatively on their reliability and only 18% said that they hadn’t experienced a decline in reliability due to increasing workloads. The same survey deals with the estimated costs of outages which, while not under consideration here, need to be taken into account when pricing your HA or DR solution(s).
Now that IBM has a more comprehensive portfolio of HA/DR solutions, it is a good time to review what is available, what has changed and how these options will match your application availability requirements.
While the primary focus of HA/DR solutions is to work around failures in the infrastructure, these tools are equally useful in managing around maintenance and upgrade tasks. For example, PowerHA includes a tool on AIX to manage ifixes and Service Packs across the cluster. Over the last few years, PowerHA development has been focused on its ease of use and has successfully countered the perception that PowerHA is difficult to manage.
Range of options
Typically, organisations have a range of applications with related (but differing) Service Level Agreements (SLAs). These are usually based on both the Recovery Time Objective (RTO: the time before clients can access the application again) and the Recovery Point Objective (RPO: the point to which data can be recovered, that is, the last transaction saved). The diagram below shows how they fit together. To match this, IBM has a number of solutions, which can either work together or independently, to meet your different SLAs. It has also become more important to understand the range of solutions, as it is rare to find only one operating system running on your Power infrastructure.
figure 1 - RPO and RTO
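To make the two measures concrete, here is a minimal Python sketch that derives the achieved RPO and RTO from an incident timeline and checks them against an SLA. The function names, the timestamps and the SLA targets are all hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_saved: datetime, failure: datetime) -> timedelta:
    """Data-loss window: from the last recoverable transaction to the failure."""
    return failure - last_saved


def achieved_rto(failure: datetime, restored: datetime) -> timedelta:
    """Outage window: from the failure until clients can access the application again."""
    return restored - failure


def meets_objective(achieved: timedelta, objective: timedelta) -> bool:
    """An objective is met when the achieved window does not exceed it."""
    return achieved <= objective


# Hypothetical incident timeline (illustrative values only).
last_saved = datetime(2019, 9, 1, 10, 55)  # last transaction safely stored
failure = datetime(2019, 9, 1, 11, 0)      # Frame fails
restored = datetime(2019, 9, 1, 11, 40)    # application available again

rpo = achieved_rpo(last_saved, failure)
rto = achieved_rto(failure, restored)

# Hypothetical SLA: lose at most 15 minutes of data, be back within 30 minutes.
rpo_ok = meets_objective(rpo, timedelta(minutes=15))
rto_ok = meets_objective(rto, timedelta(minutes=30))
print(f"RPO {rpo} ok={rpo_ok}, RTO {rto} ok={rto_ok}")
```

In this made-up timeline the data-loss window is within the target but the outage is not, which is the kind of gap an HA/DR review is meant to expose.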
Addressing the cost of these solutions, which in most cases includes the duplication of some expensive infrastructure, is not easy. However, to be prepared, an organisation needs to be able to estimate realistically what some of the more common failure scenarios would cost its business. Fortunately the other side of the equation, the setup cost, is becoming easier to control by using some of the newer features of the products. For example, licenses now only need to be activated when needed, and resources can be freed as required by automating the shutdown of less critical workloads.
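One simple way to put a figure on the failure side of that equation is an expected annual cost per scenario: frequency multiplied by outage length multiplied by hourly cost. The sketch below assumes that model. The scenario names and every number in it are placeholders, not survey data or measurements.

```python
from dataclasses import dataclass


@dataclass
class FailureScenario:
    name: str
    events_per_year: float  # how often the scenario is expected to occur
    outage_hours: float     # expected outage per event
    cost_per_hour: float    # business cost of one hour of outage

    def expected_annual_cost(self) -> float:
        return self.events_per_year * self.outage_hours * self.cost_per_hour


def total_expected_cost(scenarios: list[FailureScenario]) -> float:
    """Sum the expected annual cost over all scenarios."""
    return sum(s.expected_annual_cost() for s in scenarios)


# Placeholder figures: outage cost without any HA/DR solution in place.
unprotected = [
    FailureScenario("frame failure", 0.5, 8.0, 10_000.0),
    FailureScenario("maintenance window", 4.0, 2.0, 10_000.0),
]

# The same scenarios with shorter outages, assuming an HA solution is in place.
protected = [
    FailureScenario("frame failure", 0.5, 0.5, 10_000.0),
    FailureScenario("maintenance window", 4.0, 0.25, 10_000.0),
]

annual_saving = total_expected_cost(unprotected) - total_expected_cost(protected)
print(f"Expected annual saving to weigh against the solution cost: {annual_saving:,.0f}")
```

The saving can then be compared with the annual cost of the HA or DR solution itself. A real assessment would use the organisation's own failure history and outage costs.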
The Power HA/DR portfolio now consists of:
Live Partition Mobility or LPM (more of a useful tool for Administrators to move workloads for maintenance and some types of failure);
Simplified Remote Restart (SRR);
IBM Virtual Machine Recovery Manager HA (VMRM HA), which automates the management of SRR;
IBM Virtual Machine Recovery Manager DR (VMRM DR), evolved from IBM Geographically Dispersed Resiliency for IBM Power Systems;
PowerHA – now for AIX, IBM i and Linux;
PowerHA Enterprise Edition; and
Geographic Logical Volume Manager (GLVM) – an old AIX feature that may have a future in the cloud.
Live Partition Mobility
LPM is the PowerVM component that allows administrators to move running workloads from one Power system to another. It is more of an administrative tool than an HA solution, has only a few configuration requirements and offers huge benefits for little effort.
This is not really considered to be an HA solution, but I mention it for completeness and note that the following 3 options will not work if your Power environment is not configured for LPM.
Simplified Remote Restart
figure 2 - Simplified Remote Restart
This is the simplest option. It is configured via the HMC, NovaLink or PowerVC and enables an LPAR to be restarted on another Frame if the current Frame fails. If managed by PowerVC, the placement of the restarted LPARs can be controlled; otherwise it is a manual process.
SRR is operating system agnostic and only recognises Frame failures.
IBM VM Recovery Manager
figure 3 - IBM VM Recovery Manager
VMRM HA is designed to automate and manage Simplified Remote Restart within the Data Centre.
It is also Operating System agnostic and can be scripted to go beyond just responding to Frame failures, using AIX or Linux agents to perform further monitoring. The distribution or placement of the virtual machines across the infrastructure can also be controlled.
IBM VM Recovery Manager DR
figure 4 - IBM VM Recovery Manager DR
VMRM DR is similar to VMRM HA, but extends beyond the Data Centre. It relies on replication being configured at the storage layer, which it then controls. It supports a wide range of IBM and non-IBM storage solutions.
There is an additional cool feature that allows you to easily run your DR testing by starting the LPAR in DR using cloned LUNs and thus not impacting production availability.
PowerHA – now for AIX, IBM i and Linux (again!)
figure 5 - PowerHA
This option really consists of 3 separate products, one for each Operating System. Each shares the common logic of moving resources around the infrastructure to mask failures and allow applications to be restarted and made available for client access. PowerHA is also a useful tool for moving workloads around for maintenance and now includes an impressive array of management, monitoring and testing tools.
PowerHA Enterprise Edition
figure 6 - PowerHA Enterprise Edition
PowerHA Enterprise Edition extends the AIX and IBM i options by introducing sites (Data Centres), controlling the replication of data and coordinating the behaviour of the clusters in each Data Centre. In most of the clusters I have worked on, replication is managed by the storage subsystem, with just a few using Geographic Logical Volume Manager (GLVM). It has the same management, monitoring and testing tools as PowerHA.
Geographic Logical Volume Manager
figure 7 - Geographic Logical Volume Manager
GLVM is part of the AIX Logical Volume Manager and allows replication to be configured over IP. By itself, GLVM offers no control or automation, but when integrated with PowerHA EE it becomes a fully managed DR solution replicating Logical Volumes over IP (synchronously or asynchronously).
However, watch this space as GLVM has an interesting future in the cloud.
Comparing the features
figure 8 - feature comparison
Note: GLVM is not included in the table as it is a feature of AIX, which can be managed by PowerHA for cross-site storage replication over IP.
HA/DR maintenance and management
As discussed above, there has been a perception that the configuration and management of PowerHA is complex. However, over the last 3-5 years:
IBM development has sought and acted on feedback from users;
IBM has invested heavily in a management GUI and Smart Assists (bundled solutions for “common” applications) to make the installation and management simpler and less prone to errors;
Feedback from support and analysis of the Support databases mean that common mistakes and any inconsistencies are checked for during installation and the regular cluster synchronisation; and
The testing process has been simplified and parts automated as experience has shown that a well tested and maintained cluster is not only less likely to fail, but instils greater confidence.
Where to start?
For “Enterprise” Customers, greater virtualisation of Power workloads has muddied the water and added to this perceived complexity. In reality, however, most organisations have a small number of “classes” of applications based on the SLAs, that is, the RTOs and RPOs mentioned above. The problem can often be simplified by grouping each Application into an “Availability Class”, which is then mapped to the appropriate HA / DR option. This is an ongoing process, to be updated as the landscape evolves.
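The grouping described above can be sketched in a few lines of Python. The class names, the RTO/RPO thresholds and the option attached to each class are assumptions made up for illustration. They are not IBM sizing guidance, and a real mapping would come out of the review itself.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class AvailabilityClass:
    name: str
    delivers_rto: timedelta  # worst-case recovery time the class is designed for
    delivers_rpo: timedelta  # worst-case data-loss window the class is designed for
    option: str              # HA/DR option chosen for this class


# Ordered from least to most stringent. All thresholds are hypothetical.
CLASSES = [
    AvailabilityClass("less critical", timedelta(hours=1), timedelta(minutes=15), "VMRM HA / VMRM DR"),
    AvailabilityClass("critical", timedelta(minutes=10), timedelta(minutes=1), "PowerHA / PowerHA Enterprise Edition"),
]


def classify(required_rto: timedelta, required_rpo: timedelta) -> Optional[AvailabilityClass]:
    """Return the least stringent class that still satisfies both requirements.

    Returns None when no class is tight enough, which flags the application
    for individual review rather than forcing it into a class.
    """
    for cls in CLASSES:
        if cls.delivers_rto <= required_rto and cls.delivers_rpo <= required_rpo:
            return cls
    return None


# Hypothetical applications with their SLA requirements (RTO, RPO).
applications = {
    "reporting": (timedelta(hours=4), timedelta(hours=1)),
    "billing": (timedelta(minutes=15), timedelta(minutes=5)),
    "trading": (timedelta(minutes=1), timedelta(0)),
}

assignments = {}
for app, (rto, rpo) in applications.items():
    cls = classify(rto, rpo)
    assignments[app] = cls.name if cls else None
    print(app, "->", f"{cls.name} ({cls.option})" if cls else "needs individual review")
```

Choosing the least stringent class that still meets the SLA keeps each application on the simplest option that satisfies it. Re-running the mapping whenever SLAs or the infrastructure change fits the ongoing process described above.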
Belisama, IBM and the partner ecosystem can advise on options, help you review your HA/DR requirements, run a proof of concept and help with the design, installation, testing and maintenance of the solution. At Belisama we also have the HA/DR skills to assist with the planning, implementation, testing, training, and maintenance / upgrading of existing solutions.
The summary of each of the above options is by necessity very brief and we would be happy to provide further information and assistance as required.
• “My years at the coalface – L2 support@IBM” - Red Steel
• Implementing IBM VM Recovery Manager for IBM Power Systems, SG24-8426-00, August 2019
• IBM Geographically Dispersed Resiliency for IBM Power Systems, SG24-8382-00, February 2017
• IBM PowerHA SystemMirror V7.2.2 and V7.2.3 for IBM AIX and Linux, SG24-8434-00, December 2018