Components of High Availability – Managing Data in a Hybrid Network

High availability is a buzzword that many application and hardware vendors like to throw around to get you to purchase their products. Many different options are available to achieve high availability, and there also seem to be a number of definitions and variations that help vendors sell their products as high availability solutions.

When it comes right down to it, however, high availability simply means providing services with maximum uptime by avoiding unplanned downtime. Often, disaster recovery (DR) is also closely lumped into discussions of high availability, but DR encompasses the business and technical processes used to recover once a disaster has happened.

Defining a high availability plan usually starts with a service level agreement (SLA). At its most basic, an SLA defines the services and metrics that must be met for the availability and performance of an application or service. Often, an SLA is created for an IT department or service provider to deliver a specific level of service. An example of this might be an SLA for a Microsoft Exchange server. The SLA for an Exchange server might have uptime metrics on how much time during the month the mailboxes need to be available to end users, or it might define performance metrics for the amount of time it takes for email messages to be delivered.

When determining what goes into an SLA, two other factors need to be considered. However, you will often see them discussed only in the context of disaster recovery, even though they are important for designing a highly available solution. These factors are the recovery point objective (RPO) and the recovery time objective (RTO).

An RTO is the length of time an application can be unavailable before service must be restored to meet the SLA. For example, a single component failure would have an RTO of less than five minutes, and a full-site failure might have an RTO of three hours. An RPO is essentially the amount of data, measured as a window of time, that can be lost in the event of a failure. For example, in a single server or component failure, the RPO would be 0, but in a site failure, the RPO might allow for up to 20 minutes of lost data.
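The RTO and RPO examples above can be written down as data and checked against an actual incident. The following is a minimal sketch, not a prescribed method: the names `RecoveryObjective` and `meets_objective` are hypothetical, and the "less than five minutes" RTO is treated here as an upper bound of five minutes.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    """Recovery targets for one failure scenario (hypothetical structure)."""
    scenario: str
    rto: timedelta  # longest acceptable outage
    rpo: timedelta  # largest acceptable window of lost data


def meets_objective(objective: RecoveryObjective,
                    outage: timedelta,
                    data_lost: timedelta) -> bool:
    """Return True when both the outage and the data loss stay within target."""
    return outage <= objective.rto and data_lost <= objective.rpo


# Targets taken from the examples in the text.
component_failure = RecoveryObjective(
    "single component", rto=timedelta(minutes=5), rpo=timedelta(0))
site_failure = RecoveryObjective(
    "full site", rto=timedelta(hours=3), rpo=timedelta(minutes=20))
```

A site outage of two hours with 15 minutes of lost data would satisfy `site_failure`, while any data loss at all would violate `component_failure`.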

SLAs, on the other hand, are usually expressed as the percentage of time the application is available. These percentages are often referred to by the number of nines they include. So, if someone told you that a router needs a rating of five nines (99.999 percent), that would mean the router could be down for only 5.26 minutes a year. Table 13.1 shows some of the different nines ratings and how much downtime each rating allows.

TABLE 13.1 Availability percentages

Availability rating                 Allowed unplanned downtime/year
99 percent (two nines)              3.65 days
99.9 percent (three nines)          8.76 hours
99.99 percent (four nines)          52.56 minutes
99.999 percent (five nines)         5.26 minutes
99.9999 percent (six nines)         31.5 seconds
99.99999 percent (seven nines)      3.15 seconds
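The figures in Table 13.1 follow from simple arithmetic: the allowed downtime is the unavailable fraction of a year. The sketch below assumes a 365-day year, and the function name `allowed_downtime_minutes` is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for rating in (99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
        minutes = allowed_downtime_minutes(rating)
        print(f"{rating}% -> {minutes:.4f} minutes/year")
```

Converting the results to days, hours, or seconds reproduces the table entries after rounding; for example, 99 percent allows 5,256 minutes, which is 3.65 days.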

Two important factors that affect an SLA are the mean time between failures (MTBF) and the mean time to recovery (MTTR). To reduce the amount of unplanned downtime, the time between failures must be increased, and the time it takes to recover must be reduced. Modifying these two factors will be addressed in the next several sections of this chapter.
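One common way to relate these two factors to availability is the steady-state approximation: availability = MTBF / (MTBF + MTTR). This formula is not stated in the text above; it is offered here as a widely used rule of thumb, and the sample numbers in the usage note are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1.

    Uses the common approximation MTBF / (MTBF + MTTR), which is an
    assumption brought in for illustration, not a formula from the text.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def percent(fraction: float) -> str:
    """Format an availability fraction as a percentage string."""
    return f"{fraction * 100:.4f}%"
```

For a hypothetical server that fails on average every 1,000 hours and takes 1 hour to recover, comparing `availability(1000, 1)` with `availability(2000, 1)` and `availability(1000, 0.5)` shows that doubling the time between failures and halving the recovery time each raise availability.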

Achieving High Availability

Windows Server 2022 is described as the most secure, reliable, stable, and mature version of Windows Server to date. Although similar claims have been made for previous versions, Windows Server 2022 improves on them in a number of ways.

An honest look at the feature set and real-world use should prove that this latest version of Windows provides the most suitable foundation for creating a highly available solution. However, more than just good software is needed to be able to offer high availability for applications.

In today’s technology world, there are many ways to set up and manage a high availability network. Since the AZ-800 and AZ-801 exams cover both onsite servers and Azure, we will talk about setting up high availability using these two methods. Many third-party companies offer high availability solutions, but we will focus on onsite and Azure setups.

High Availability Foundation

Just as a house needs a good foundation, a highly available Windows server needs a stable and reliable hardware platform on which to run. Although Windows Server 2022 will technically run on desktop-class hardware, high availability is more easily achieved with server-class hardware. What differentiates desktop-class from server-class hardware? Server-class hardware has more management and monitoring features built into it so that the health of the hardware can be monitored and maintained.

Another big difference is that server-class hardware has redundancy options. Server-class hardware often has options to protect against drive failures, such as RAID controllers, and against power supply failures, such as multiple power supplies. Enterprise-class servers have even more protection.

More needs to be done than just installing Windows Server 2022 to ensure that the applications remain running with the best availability possible. Just as a house needs maintenance and upkeep to keep the structure in proper repair, so too does a server. In the case of a highly available server, this means patch management.

Installing Patches

Microsoft releases monthly updates to fix security problems in its software, covering both the operating system and applications. To ensure that your highly available applications are protected against known vulnerabilities, these patches need to be applied in a timely manner during a scheduled maintenance window. Also, to address stability and performance issues, updates and service packs are released regularly for many applications, such as Microsoft SQL Server, Exchange Server, and SharePoint Portal Server. Many companies have a set schedule (daily, weekly, or monthly) for applying these patches and updates after they are tested and approved.

Desired Configuration Management (DCM), an option in Microsoft Configuration Manager, is a great tool for helping to validate that your cluster nodes are patched. It can leverage the Configuration Manager (SCCM) client to collect the list of installed patches and to help report across the enterprise on compliance with desired system states, based on the software installed.
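The idea behind this kind of compliance reporting can be illustrated with a simple comparison of each node's installed patches against a required baseline. This sketch does not use any Configuration Manager API; the function name, node names, and patch identifiers are all hypothetical placeholders.

```python
def missing_patches(required: set[str],
                    installed_by_node: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each node name to the sorted list of required patches it lacks."""
    return {
        node: sorted(required - installed)
        for node, installed in installed_by_node.items()
    }


# Placeholder identifiers only; these are not real update numbers.
required = {"PATCH-A", "PATCH-B", "PATCH-C"}
inventory = {
    "node1": {"PATCH-A", "PATCH-B", "PATCH-C"},
    "node2": {"PATCH-A"},
}

report = missing_patches(required, inventory)
noncompliant = [node for node, gaps in report.items() if gaps]
```

Here `node2` would be flagged because it lacks two of the required patches, which is the kind of drift between cluster nodes that a compliance report is meant to surface.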

To continue with the house analogy, if you were planning to have the master bath remodeled, would you rather hire a college student on spring break looking to make some extra money to do the job or a seasoned artisan? Of course, you would want someone with experience and a proven record of accomplishment to remodel your master bath.

Likewise, for any work that needs to be done on your highly available applications, it’s best to hire only highly qualified individuals. This is why obtaining a Microsoft certification is an excellent start to becoming qualified to configure a highly available server properly. Even so, there is no substitute for real-life, hands-on experience.

Working with highly available configurations in a lab and in production will help you know not only what configurations are available but also how the changes should be made.

For example, it may be possible to use failover clustering for a DNS server, but in practice DNS replication may be easier to support and require less expensive hardware in order to provide high availability. This is something you would know only if you had enough experience to make this decision.

As with your house, once you have a firm and stable foundation built by skilled artisans and a maintenance plan has been put into place, you need to ascertain what more is needed. If you can’t achieve enough uptime with proper server configuration and mature operational processes, a cluster may be needed.

Windows Server 2022 provides two types of high availability: failover clustering and network load balancing (NLB). Failover clustering is used for applications and services such as SQL Server and Exchange Server. Network load balancing is used for network-based services such as web and FTP servers.
