Configure High Availability with Corosync and Pacemaker

High Availability covers all the techniques and methods used to improve the availability of a system or a service and to increase its fault tolerance: hardware redundancy, clustering, hot replication of data either in hardware (RAID 1 and RAID 5) or in software (snapshots, DRBD), and crisis scenarios (degraded mode, emergency plans). In a large company it can be a full-time responsibility. In this article you will see how to implement one facet of this problem: an active/passive cluster that ensures the availability of an application service.

Concerning GNU/Linux, two software stacks that can manage a cluster infrastructure were tested for this article:

  • Heartbeat, which has proven itself but is limited: no clusters of more than 2 nodes, and no management of resources or of rules for switching from one node to another.
  • Corosync and Pacemaker: the choice of the Red Hat distribution, and the one detailed in the rest of this article.

We set up a representative model composed of two Debian Wheezy virtual machines, each with 4 network interfaces, running an Apache service reached through an IP address managed by the cluster.

The following diagram shows the network layout:

[Network diagram: vm-node1 and vm-node2, each with bond0 (eth0 + eth1) on the private network 10.20.13.0/30 and bond1 (eth2 + eth3) on the service network 192.168.1.0/24]

The eth0 and eth1 interfaces are part of a logical link aggregation; they let the cluster check the status of the other node. Together with the other node they form a private network, 10.20.13.0/255.255.255.252. The eth2 and eth3 interfaces are part of another logical aggregation; they provide the service to the outside on the 192.168.1.0/24 network.

Logical aggregation (also called bonding) provides additional redundancy: if the eth0 network adapter burns out, traffic still passes through eth1. It can be configured in active/passive mode or in load-balancing mode.

Here is the configuration of the interfaces on the vm-node1 machine, in “/etc/network/interfaces”:

auto bond0
iface bond0 inet static
  address 10.20.13.1
  netmask 255.255.255.252
  bond_mode active-backup
  bond_miimon 100
  bond_downdelay 200
  bond_updelay 200
  slaves eth0 eth1

auto bond1
iface bond1 inet static
  address 192.168.1.61
  netmask 255.255.255.0
  gateway 192.168.1.1
  bond_mode active-backup
  bond_miimon 100
  bond_downdelay 200
  bond_updelay 200
  slaves eth2 eth3

And the configuration of the bonding module in “/etc/modprobe.d/bonding”:

alias bond0 bonding 
alias bond1 bonding
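
Once the interfaces file is in place, the aggregates can be brought up and checked, for example on vm-node1. This is only a quick sketch; it assumes the ifenslave package (which provides the bonding hooks for /etc/network/interfaces) is installed:

apt-get install ifenslave
ifup bond0 bond1              # bring both aggregates up
cat /proc/net/bonding/bond0   # shows the active slave and the MII status of eth0 and eth1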

The network configuration of the vm-node2 machine is symmetrical, with bond0 at 10.20.13.2 and bond1 at 192.168.1.62. Once the network configuration is done, we can move on to the cluster itself. First, install Corosync and Pacemaker on Debian as follows:

apt-get install corosync pacemaker

Then configure Corosync. It manages the cluster infrastructure, that is, the state of the nodes and their operation as a group. For this we have to generate an authentication key that is shared by all the nodes of the cluster. The corosync-keygen utility generates this key from pseudo-random keyboard input; the key must then be secured and copied to the other nodes.

# generation of the key from vm-node1
corosync-keygen
mv authkey /etc/corosync/authkey
chown root:root /etc/corosync/authkey
chmod 400 /etc/corosync/authkey

# copy of the key to vm-node2
scp /etc/corosync/authkey root@10.20.13.2:/etc/corosync/authkey

Corosync offers the concept of rings for communication between the nodes. For this model we defined two rings: ring0, the default communication ring, which uses the private network, and ring1, a backup ring that passes through the switches with the rest of the traffic. Corosync lets you define the rings as network address/netmask pairs rather than host IP addresses. This is handy because the same configuration file, here “/etc/corosync/corosync.conf”, can be deployed on all nodes without any change.

totem {
    version: 2

    # How long before declaring a token lost (ms)
    token: 3000

    # How many token retransmits before forming a new configuration
    token_retransmits_before_loss_const: 10

    # How long to wait for joining messages in the membership protocol (ms)
    join: 60

    # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
    consensus: 3600

    # Turn off the virtual synchrony filter
    vsftype: none

    # Number of messages that may be sent by one processor on receipt of the token
    max_messages: 20

    # Limit generated nodeids to 31-bits (positive signed integers)
    clear_node_high_bit: yes

    # Disable encryption
    secauth: off

    # How many threads to use for encryption/decryption
    threads: 0

    # Optionally assign a fixed nodeid (integer)
    # nodeid: 1234

    # This specifies the mode of redundant ring, which may be none, active, or passive.
    rrp_mode: passive

    interface {
        ringnumber: 0
        bindnetaddr: 10.20.13.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }

    interface {
        ringnumber: 1
        bindnetaddr: 192.168.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5407
    }
}

amf {
    mode: disabled
}

service {
    # Load the Pacemaker Cluster Resource Manager
    ver:  0
    name: pacemaker
}

aisexec {
    user:  root
    group: root
}

logging {
    fileline: off
    to_stderr: yes
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
        tags: enter|leave|trace1|trace2|trace3|trace4|trace6
    }
}
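
With the same corosync.conf and authkey deployed on both nodes, Corosync can be started and the rings checked. A minimal sketch (on Debian, the init script may also require START=yes in /etc/default/corosync):

# on both nodes
/etc/init.d/corosync start
corosync-cfgtool -s   # each ring should be reported as active with no faults
crm_mon -1            # both nodes should appear as Online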

At this point the cluster infrastructure is in place, but it does not yet manage any resource. That is the role of Pacemaker.

Our model has the following operational constraints:

  1. In the normal case, the resources (the Apache service and the cluster IP address) run on the server vm-node1.
  2. The Apache service and the cluster IP address must run on the same server, otherwise the service is unreachable.
  3. If the Apache service crashes on the primary server, it switches to the secondary server.
  4. If the primary server can no longer reach the Internet gateway, it switches to the secondary server.

Pacemaker provides text-mode utilities to interact with the cluster:

  • crm manages every aspect of the configuration.
  • crm_mon displays the state of the cluster.
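
For example, the following commands give a quick view of the configuration and of the cluster state:

crm configure show   # dump the current cluster configuration
crm_mon -1           # print the cluster state once and exit (without -1 it keeps refreshing)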

First we define the global configuration: we disable STONITH (Shoot The Other Node In The Head) and the quorum. STONITH is the ability to kill the other node if it no longer responds to the cluster infrastructure, and a quorum cannot work on a cluster of fewer than 3 nodes.

property stonith-enabled=false
property no-quorum-policy=ignore
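
These directives can be passed to crm as one-liners, which commit them immediately; for instance:

crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore
crm configure show   # verify that both properties are set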

We can now define our first resource: the cluster IP address, attached to the active node.

primitive vip ocf:heartbeat:IPaddr2 params ip=192.168.1.60 cidr_netmask=24 nic="bond1" op monitor interval="10s"
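
Once the resource is started, the address should be visible on the active node only (a quick check with iproute2):

ip addr show bond1 | grep 192.168.1.60   # the cluster IP is attached to bond1 on the active node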

Then the Apache resource, the critical service we want to provide in this model:

primitive httpd ocf:heartbeat:apache params configfile="/etc/apache2/apache2.conf" statusurl="http://localhost/server-status" op start timeout="60s" op stop timeout="60s" op monitor timeout="20s"

Starting and stopping Apache is now managed by the cluster, so we have to remove the automatic start of the service:

update-rc.d -f apache2 remove

Notice that this goes beyond the simple definition of an Apache resource: the statusurl attribute lets Pacemaker use the Apache status page to decide on a failover. So do not forget to configure this URL in Apache:

<Location /server-status>
    SetHandler server-status    
    Order deny,allow    
    Deny from all    
    Allow from 127.0.0.1
</Location>
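
This page relies on Apache's mod_status module, which is usually enabled by default on Debian. If in doubt, it can be checked on the node currently running the httpd resource; a sketch:

a2enmod status                        # enable mod_status if it is not already
crm resource stop httpd
crm resource start httpd              # let the cluster restart Apache with the new configuration
curl http://localhost/server-status   # should return the Apache status page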

Since we built the configuration step by step, crm_mon may report errors on some resources because they were not operational at the time. A failure counter keeps track of these warnings; we can reset it for the httpd resource with the following command:

crm resource cleanup httpd
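
The failure counters themselves can be inspected before or after the cleanup, for example with:

crm_mon -1 -f                                # -f / --failcounts shows the failure counters
crm resource failcount httpd show vm-node1   # failcount of httpd on a given node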

At this stage we have a cluster IP address and an HTTP resource, but not necessarily on the same node. The vip resource will switch if its node goes down. The httpd resource will switch if its node goes down or if the Apache service fails (monitoring via the /server-status URL).

Let’s go further and force both resources to run on the same node:

colocation httpd-with-vip inf: httpd vip

And we would also like the normal case to hold: the resources run on vm-node1, our primary node:

location preferred-node vip 100: vm-node1
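
A simple way to test these two constraints is to put the primary node in standby and then bring it back online (crmsh node commands; a sketch):

crm node standby vm-node1   # both resources should move together to vm-node2
crm_mon -1                  # verify that vip and httpd now run on vm-node2
crm node online vm-node1    # with the preference score of 100 (and default stickiness), they come back to vm-node1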

Finally, we add a failover condition: if the active node can no longer reach the Internet gateway, we want to move the resources to the other node. For this we define a resource of type ping that runs on all nodes (through the concept of a cloned resource). Then a location rule is added to switch if the active node no longer sees the gateway.

primitive ping-gateway ocf:pacemaker:ping params host_list="192.168.1.1" multiplier="1000" op monitor interval="10s"
clone cloned-ping-gateway ping-gateway
location vip-needs-gateway vip rule -inf: not_defined pingd or pingd lte 0
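
To check this last rule, the loss of the gateway can be simulated on the active node, for example with a temporary firewall rule (a sketch; adapt the address and remove the rule afterwards):

# on the active node: drop traffic to the gateway to simulate the failure
iptables -A OUTPUT -d 192.168.1.1 -j DROP
crm_mon -1                                  # after the monitor interval, vip and httpd move to the other node
iptables -D OUTPUT -d 192.168.1.1 -j DROP   # restore connectivity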

This is our operational model; we hope it helps you build your own.