Chapter 10 High Availability

Study and lecture notes

Terminology

High availability (HA) is a design strategy. The strategy is simple: try to ensure that users keep access to services.

MTBF: Mean time between failures, is a measure of how reliable a hardware product or component is. For most components, the measure is typically in thousands or even tens of thousands of hours between failures. For example, a hard disk drive may have a mean time between failures of 300,000 hours.
MTTR: Mean time to recover, is the average time that a device will take to recover from any failure. Examples of such devices range from self-resetting fuses (where the MTTR would be very short, probably seconds), up to whole systems which have to be repaired or replaced.
RPO: Recovery point objective, is defined by business continuity planning. It is the maximum targeted period in which data might be lost from an IT service due to a major incident.
RTO: Recovery time objective, is the targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.
ECC memory: Error Correction Code memory, along with supporting motherboard hardware, can detect and sometimes correct bit errors caused by electrical disturbances. Error correction is carried out in hardware, so the OS does not need to support it.
RAID: Only RAID levels above 0 provide fault tolerance. RAID 1 and RAID 5 are common server disk configurations. RAID 6 and RAID 1+0 (RAID 10) are becoming more common.
Hot-swappable disks:
The feature must also be supported by the OS.
To avoid downtime, the disk needs to be in a RAID array.
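
MTBF and MTTR are often combined into a single availability figure using the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below is only an illustration of that formula: the 8-hour MTTR is a made-up figure, and the function names are my own.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a component is expected to be up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # MTBF taken from the hard disk example above; MTTR of 8 hours is assumed.
    a = availability(300_000, 8)
    print(f"availability: {a:.6f}")
    print(f"expected downtime: {downtime_hours_per_year(a):.2f} hours/year")
```

A shorter MTTR raises availability just as a longer MTBF does, which is why hot-swappable parts matter even when the hardware is reliable.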

Round robin

Sometimes referred to as a poor man’s load balancer, round robin uses multiple DNS records to resolve the same host name to multiple IP addresses. For example, suppose three servers share the domain name www.it.com, with IP addresses 192.168.0.1, 192.168.0.2, and 192.168.0.3. After the first client query (which gets 192.168.0.1), the order becomes 192.168.0.2, 192.168.0.3, 192.168.0.1, so the next client uses 192.168.0.2, and so on.
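
The rotation described above can be sketched in a few lines. This is a toy model of the record ordering only, not how a real DNS server is implemented; the class name RoundRobinDNS is hypothetical.

```python
from collections import deque


class RoundRobinDNS:
    """Toy model: one host name backed by several A records."""

    def __init__(self, name, addresses):
        self.name = name
        self._records = deque(addresses)

    def resolve(self):
        """Return the current record order, then rotate it for the next query."""
        answer = list(self._records)
        self._records.rotate(-1)  # move the first address to the end
        return answer


if __name__ == "__main__":
    dns = RoundRobinDNS("www.it.com", ["192.168.0.1", "192.168.0.2", "192.168.0.3"])
    for _ in range(4):
        # Clients normally use the first address in the answer.
        print(dns.resolve()[0])
    # prints 192.168.0.1, 192.168.0.2, 192.168.0.3, 192.168.0.1 (one per line)
```

Note that nothing in the model checks whether an address is reachable, which is exactly the first limitation listed below.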

In Windows Server 2008, round robin is enabled by default.

Characteristics:

No recognition of a down server.
Cached client records: a client keeps using a cached address until the cached record expires.
All servers have equal priority: the workload cannot be assigned according to server performance.

Network load balancing (NLB)

The servers in an NLB cluster share the load of incoming requests based on rules you can define. A server cluster is sometimes referred to as a server farm.

Characteristics:

From the client perspective, a server cluster appears on the network as a single device with a single name and IP address (the virtual IP address).
The servers can filter traffic based on the port number.
Each server can be assigned a priority number.
Well suited to TCP/IP-based applications such as web servers and streaming media servers, where data can be easily replicated among the participating servers and is not changed by users.
Not advisable if the data being accessed on the servers requires exclusive access, as with database, file, print, and email applications.
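
To make the interaction between port rules and host priority concrete, here is a simplified model in Python. It is a sketch under several assumptions: the hash (CRC32 of the client IP) merely stands in for NLB's real distribution algorithm, the single-host mode uses the host priority rather than a per-rule handling priority, and all class and function names are hypothetical.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PortRule:
    start_port: int
    end_port: int
    mode: str  # "multiple", "single", or "disabled"


@dataclass(frozen=True)
class Host:
    name: str
    priority: int  # lower number means higher priority
    up: bool = True


def pick_host(hosts, rules, client_ip, port):
    """Return the host that handles the request, or None if it is not served."""
    live = sorted((h for h in hosts if h.up), key=lambda h: h.priority)
    if not live:
        return None
    rule = next((r for r in rules if r.start_port <= port <= r.end_port), None)
    if rule is None:
        return live[0]  # traffic without a rule goes to the default host
    if rule.mode == "disabled":
        return None  # the port range is blocked
    if rule.mode == "single":
        return live[0]  # fault tolerance only: highest-priority live host
    # "multiple": spread clients over all live hosts; the same client IP
    # always maps to the same host while the set of live hosts is unchanged.
    return live[zlib.crc32(client_ip.encode()) % len(live)]


if __name__ == "__main__":
    hosts = [Host("web1", 1), Host("web2", 2)]
    rules = [PortRule(80, 80, "multiple"), PortRule(443, 443, "single")]
    for ip in ("10.0.0.5", "10.0.0.6", "10.0.0.7"):
        print(ip, "->", pick_host(hosts, rules, ip, 80).name)
```

The multiple-host branch gives scalability, while the single-host branch gives only failover, which matches the distinction between the two filtering modes.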

Tasks to create an NLB cluster

Create a new cluster
Select a host and network interface to participate in the cluster
Configure the host priority/host ID
Set the cluster IP address
Set the cluster name and operation mode
Configure port rules
Configure a DNS record for the cluster
Add additional servers to the cluster

Failover Clusters

Failover clusters consist of two or more servers, usually of identical configuration, that access common storage media. Typically, storage is in the form of a SAN.

A failover cluster is well suited to back-end database applications, file-sharing servers, messaging servers, and other applications that are both mission critical and deal with dynamic read/write data.

Clustered application: An application or service that is installed on two or more servers that participate in a failover cluster. Also called a clustered service.
Cluster server: A Windows Server 2008 server that participates in a failover cluster. A cluster server is also referred to as a cluster node or cluster member.
Active node: A cluster member that is responding to client requests for a network application or service. Also referred to as an active server.
Passive node: A cluster member that is not currently responding to client requests for a clustered application but is in standby mode to do so if the active node fails. Also referred to as a passive server.
Standby mode: A cluster node that is not active is said to be in standby mode.
Quorum: The cluster configuration data that specifies the status of each node (active or passive) for each of the clustered applications. The quorum is also used to determine, in the event of server or communication failure, whether the cluster is to remain online and which servers should continue to participate in the cluster.
Cluster heartbeat: Communication between cluster nodes that provides the status of each of the cluster members. The cluster heartbeat, or lack of it, informs the cluster when a server is no longer communicating. The cluster heartbeat information communicates the state of each node to the cluster quorum.
Witness disk: Shared storage used to store the cluster configuration data and to help determine the cluster quorum.
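
How heartbeat, quorum, and witness disk fit together can be sketched as a simple vote count. This is a simplified model assuming a majority-based quorum (each node has one vote and the witness disk optionally adds one); the 5-second timeout and all names are assumptions for illustration, not values taken from Windows Server.

```python
def live_nodes(last_heartbeat, now, timeout=5.0):
    """Nodes whose most recent heartbeat arrived within the timeout (seconds)."""
    return {node for node, seen in last_heartbeat.items() if now - seen <= timeout}


def has_quorum(nodes_online, total_nodes, use_witness=False, witness_online=False):
    """True if the online votes form a strict majority of all votes."""
    total_votes = total_nodes + (1 if use_witness else 0)
    votes = nodes_online + (1 if use_witness and witness_online else 0)
    return votes * 2 > total_votes


if __name__ == "__main__":
    # Timestamps (in seconds) of the last heartbeat received from each node.
    heartbeats = {"node1": 100.0, "node2": 96.0, "node3": 90.0}
    alive = live_nodes(heartbeats, now=101.0)
    print("alive:", sorted(alive))
    print("cluster stays online:", has_quorum(len(alive), len(heartbeats)))
```

With only two nodes, losing one leaves 1 of 2 votes, which is not a majority; adding a witness vote makes it 2 of 3, which is why a witness disk is useful for clusters with an even number of nodes.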

Best practices

1. High availability software solutions are augmented by hardware fault tolerance; make sure your server hardware has fault-tolerant features such as hot-swappable RAID disk drives and redundant power supplies.
2. Use round-robin load balancing for an easy and quick load balancing solution but be aware of its limitations: no recognition of a down server, cached client records, and lack of prioritization.
3. Use network load balancing for applications in which the data being accessed is easily replicated among servers and is not changed by users.
4. Before using any clustering solution, make sure your OS version and updates are consistent among all servers.

5. Before creating your NLB cluster, make sure that DNS is set up correctly; a zone for the FQDN of the cluster must exist and A records for each server and the cluster name must exist.
6. It is recommended that you use multiple NICs on your server whereby one NIC is dedicated to non-cluster related communication.
7. Create port rules to ensure the cluster only accepts communication for services that are specifically offered by all cluster members.
8. Use the Multiple host filtering mode option on your NLB cluster to provide scalability; use the Single host filtering mode to provide fault tolerance without scalability.
9. Use failover clusters to provide the highest level of fault tolerance.
10. Be sure to choose the quorum model that best supports your failover cluster configuration.
11. Server components used in a failover cluster should meet the Certified Windows Server 2008 requirements.
12. For best disk performance in your failover cluster, use SAS, Fibre Channel, or iSCSI storage technologies.
13. Run the cluster validation wizard before you create a new cluster and again periodically after your cluster is running to revalidate the configuration.

Lab Reflection:

1. Problems I met in the NLB labs:

The two servers cannot ping or communicate with each other.

Solution: When you use dual NICs on each server, make sure all four interfaces are in the same VLAN. In VirtualBox, give the “internal network” the same name on every interface.

Cannot open the web page.
Solution: If you use one of the servers as the DNS server, make sure the DNS server setting on every NIC (on both the servers and the client machine) points to the IP address of the dedicated NLB NIC on the DNS server, not the NIC used for client requests.
On both servers, go to the NIC configuration, IPv4, Advanced, DNS tab, and uncheck “Register this connection’s address in DNS”; check it only on the NIC dedicated to internal communication.
If you use NLB for the IIS server, you cannot test the connection to the web page from the server itself; you have to test from a client machine.
Open question: on VirtualBox, does only multicast mode work?
If a host cannot ping the virtual IP or cannot open the full cluster FQDN, which is used for the web service, try switching the operation mode between unicast and multicast to see whether that solves the problem.

2. Other software on Linux:

Pacemaker: Pacemaker is high-availability cluster software developed as an open source project, mainly by people who work for SUSE Linux. It helps you make sure that essential services on your network get the best availability possible.

What hardware and software do I need to run Pacemaker?

You need at least two servers that run Linux; each server is called a node in the cluster. Currently, Pacemaker is able to support up to 16 nodes, but some people run it on clusters that have hundreds of nodes.

Virtually all Linux distributions are supported. But if you need Enterprise support, Novell’s SUSE Linux Enterprise Server is currently the only Linux distribution that has that ability.

Features:

Detection and recovery of machine and application-level failures
Supports practically any redundancy configuration
Supports both quorate and resource-driven clusters
Configurable strategies for dealing with quorum loss (when multiple machines fail)
Supports application startup/shutdown ordering, regardless of which machine(s) the applications are on
Supports applications that must/must-not run on the same machine
Supports applications which need to be active on multiple machines
Supports applications with multiple modes (e.g. master/slave)
Provably correct response to any failure or cluster state.

Reference

http://searchitchannel.techtarget.com/feature/FAQ-Pacemaker-high-availability-cluster-technology

http://clusterlabs.org/

http://virtuallyhyper.com/2013/04/load-balancing-iis-sites-with-nlb/

MCTS Guide to Microsoft Windows Server 2008 Applications Infrastructure Configuration, by John E. Tucker, Darrel Nerove, and Greg Tomsho