So far, we have been researching ways to give an OpenStack cloud deployment as many high-availability (H.A) features as possible.

Before the Folsom release, H.A features were not built into the OpenStack service components.
Following a large number of requests from the OpenStack community, H.A is being addressed as part of the project starting with the Folsom release. The features are still being introduced and tested, and there are not many production deployments yet. With the help and feedback of the community, however, the OpenStack developers believe that by the time the next version (Grizzly) is released, the H.A features will be automated and ready for production from the get-go.

Getting into the details of the H.A features available in Folsom:
Instead of reinventing the wheel, OpenStack decided to go with a proven and robust H.A provider: Pacemaker. With more than half a decade of production deployments, Pacemaker is an established way to provide H.A for a wide range of services.

Specifically looking at the technologies involved with OpenStack, the role of H.A would be to prevent:

  • System downtime — the unavailability of a user-facing service beyond a specified maximum amount of time, and
  • Data loss — the accidental deletion or destruction of data.

In the end, the focus is to eliminate single points of failure in the cluster architecture.
A few examples:

  • Redundancy of network components, such as switches and routers,
  • Redundancy of applications and automatic service migration,
  • Redundancy of storage components,
  • Redundancy of facility services such as power, air conditioning, fire protection, and others.
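To make the effect of redundancy concrete, here is a minimal Python sketch of the usual back-of-the-envelope calculation. It assumes that the redundant copies fail independently and that a single surviving copy is enough to keep the service up; both are simplifications, and the function names and the 99% figure are illustrative only.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` redundant copies of one component.

    Assumption: failures are independent and one surviving copy suffices.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The group is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


def yearly_downtime_hours(availability: float) -> float:
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    # Illustrative figure: each copy is assumed to be 99% available.
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} replica(s): availability={a:.6f}, "
              f"downtime={yearly_downtime_hours(a):.2f} h/year")
```

In practice failures are rarely fully independent (shared power, shared switches), which is exactly why the list above covers every layer of the stack.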

Pacemaker relies on the Corosync project for reliable cluster communications. Corosync implements the Totem single-ring ordering and membership protocol and provides UDP and InfiniBand based messaging, quorum, and cluster membership to Pacemaker.
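The idea behind quorum can be illustrated with a short Python sketch. This is not Corosync's implementation, only a simple majority rule under the assumption that every node carries one vote; real cluster stacks offer further options that the sketch ignores.

```python
def has_quorum(total_votes: int, present_votes: int) -> bool:
    """Return True if a partition holds a strict majority of all votes.

    Assumption: one vote per node, simple majority rule.
    """
    if total_votes < 1:
        raise ValueError("total_votes must be at least 1")
    if not 0 <= present_votes <= total_votes:
        raise ValueError("present_votes must be between 0 and total_votes")
    return present_votes > total_votes // 2


if __name__ == "__main__":
    # A 3-node cluster that splits into partitions of 2 nodes and 1 node:
    print("partition of 2:", has_quorum(3, 2))
    print("partition of 1:", has_quorum(3, 1))
```

Requiring a strict majority means that at most one partition can keep running services after a network split, which protects against two halves of the cluster writing to the same data.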

An OpenStack high-availability configuration uses existing native Pacemaker resource agents (RAs), such as those managing MySQL databases or virtual IP addresses; existing third-party RAs, such as the one for RabbitMQ; and native OpenStack RAs, such as those managing the OpenStack Identity and Image Services.

Even though high-availability features exist for native OpenStack components and external services, they are not automated in the project yet, so whatever H.A features a cloud deployment needs must be installed and configured manually.

A quick summary of how a Pacemaker setup would look:
[Figure: pacemaker-cluster]

Pacemaker creates a cluster of nodes and uses Corosync to establish communication between them.

Besides working with RabbitMQ, Pacemaker can also bring H.A features to a MySQL cluster. The steps would be:

  • configuring a DRBD (Distributed Replicated Block Device) device for use by MySQL,
  • configuring MySQL to use a data directory residing on that DRBD device,
  • selecting and assigning a virtual IP address (VIP) that can freely float between cluster nodes,
  • configuring MySQL to listen on that IP address,
  • managing all resources, including the MySQL daemon itself, with the Pacemaker cluster manager.
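The steps above imply a start order among the managed resources: the DRBD device must be available before the filesystem holding the MySQL data directory can be mounted, and MySQL needs both that filesystem and the virtual IP it listens on. The Python sketch below models only that dependency reasoning. It is not Pacemaker configuration, the resource names are hypothetical, and it needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical resource names. Each key maps to the set of resources
# that must already be running before the key itself can start.
START_AFTER = {
    "drbd_device": set(),
    "virtual_ip": set(),
    "mysql_filesystem": {"drbd_device"},
    "mysql_daemon": {"mysql_filesystem", "virtual_ip"},
}


def start_order(deps):
    """Return one valid start order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def stop_order(deps):
    """Resources are stopped in the reverse of the start order."""
    return list(reversed(start_order(deps)))


if __name__ == "__main__":
    print("start:", " -> ".join(start_order(START_AFTER)))
    print("stop: ", " -> ".join(stop_order(START_AFTER)))
```

On failover, a cluster manager would apply the same logic on the surviving node: stop whatever is still running in reverse order, then start everything again in dependency order.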

More information can be found at:
DRBD
RabbitMQ
Towards a highly available (HA) open cloud: an introduction to production OpenStack
Stone-IT
Corosync