Folsom

Installing OpenStack, Quantum problems

During the following weeks we plan to expand more on the subject of setting up an OpenStack cloud using Quantum.
For now we have been experimenting with different Quantum functionality and settings.
At first Quantum might look like a black box, not because of its complexity, but because it deals with several different plugins and protocols; if a person is not familiar with them, it becomes hard to understand why Quantum is there in the first place.

In a nutshell, Quantum's role is to provide an interface for configuring the networking of multiple VMs in a cluster.
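To get a feel for what that interface looks like, here is roughly how a network and a subnet are created with the quantum CLI (the names and CIDR below are made up for the example):

[sourcecode]
# create a network and attach a subnet to it
quantum net-create demo-net
quantum subnet-create demo-net 10.5.5.0/24 --name demo-subnet
# list what was created
quantum net-list
[/sourcecode]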

In the last few years the lines between system, network and virtualization admins have become really blurry.
The classical Unix admin is pretty much nonexistent nowadays, since most services are offered in the cloud, in virtualized environments.
And since everything seems to be migrating over to the cloud, some network principles that were applied to physical networks in the past sometimes don't translate very well to virtualized networks.

Later we’ll have some posts explaining what technologies and techniques underlie the network configuration of a cloud, in our case focusing specifically on OpenStack and Quantum.

With that being said, below are a few errors that came up during the configuration of Quantum:

1. ERROR [quantum.agent.dhcp_agent] Unable to sync network state.

This error is most likely caused by a misconfiguration of the RabbitMQ server.
A few ways to debug the issue:
Check if the file /etc/quantum/quantum.conf on the controller node (where the quantum server is installed) has the proper rabbit credentials.
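The relevant options look roughly like this (the values are examples; they have to match whatever was set when RabbitMQ was configured):

[sourcecode]
# /etc/quantum/quantum.conf
rabbit_userid = guest
rabbit_password = guest
rabbit_port = 5672
[/sourcecode]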

By default RabbitMQ runs on port 5672, so run:

[sourcecode]
netstat -an | grep 5672
[/sourcecode]

and check if the RabbitMQ server is up and running.

On the network node (where the quantum agents are installed), also check that /etc/quantum/quantum.conf has the proper rabbit credentials.

If you are running a multihost setup, make sure the rabbit_host variable points to the IP where the RabbitMQ server is located.
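Something along these lines (the IP is an example; use the address of your controller):

[sourcecode]
# /etc/quantum/quantum.conf on the network node
rabbit_host = 192.168.0.11
[/sourcecode]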

Just to be safe, check that you have connectivity on the management network by pinging all the hosts in the cluster, and restart both the quantum and rabbitmq servers as well as the quantum agents.
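On Ubuntu the restarts would look something like this (service names can vary depending on the distro and which agents are installed):

[sourcecode]
# on the controller
service rabbitmq-server restart
service quantum-server restart
# on the network node
service quantum-dhcp-agent restart
service quantum-l3-agent restart
[/sourcecode]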

2. ERROR [quantum.agent.l3_agent] Error running l3_nat daemon_loop

This error requires a very simple fix; however, it was very difficult to find information about the problem online.
Luckily, I found one thread on the Fedora project mailing list explaining the problem in more detail.

This error is due to the fact that keystone authentication is not working.
A quick explanation: the l3 agent makes use of the quantum http client to interface with the quantum service.
This requires keystone authentication. If it fails, the l3 agent will not be able to communicate with the service.
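Given that, it is worth double checking the keystone credentials the l3 agent uses in /etc/quantum/l3_agent.ini (the values below are examples; use your own keystone settings):

[sourcecode]
# /etc/quantum/l3_agent.ini
auth_url = http://192.168.0.11:35357/v2.0
auth_region = RegionOne
admin_tenant_name = service
admin_user = quantum
admin_password = servicepass
[/sourcecode]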

Also check if the quantum server itself is up and running.
By default the server listens on port 9696:

[sourcecode]
root@folsom-controller:/home/senecacd# netstat -an | grep 9696
tcp 0 0 0.0.0.0:9696 0.0.0.0:* LISTEN
tcp 0 0 192.168.0.11:9696 192.168.0.12:40887 ESTABLISHED
[/sourcecode]

If nothing shows up, it is because the quantum server is down; try restarting the service to see if the problem goes away:

[sourcecode]
service quantum-server restart
[/sourcecode]

You can also try to reach the quantum server from the network node (in a multihost scenario):

[sourcecode]
root@folsom-network:/home/senecacd# nmap -p 9696 192.168.0.11

Starting Nmap 5.21 ( http://nmap.org ) at 2013-01-28 08:07 PST
Nmap scan report for folsom-controller (192.168.0.11)
Host is up (0.00038s latency).
PORT STATE SERVICE
9696/tcp open unknown
MAC Address: 00:0C:29:0C:F0:8C (VMware)

Nmap done: 1 IP address (1 host up) scanned in 0.04 seconds
[/sourcecode]

3. ERROR [quantum.agent.l3_agent] Error running l3_nat daemon_loop – rootwrap error

I didn't come across this bug myself, but I found a few people running into this issue.
Kieran already wrote a good blog post explaining the problem and how to fix it.

You can check the bug discussion here.

4. Bad floating ip request: Cannot create floating IP and bind it to Port , since that port is owned by a different tenant.

This is just a problem of mixed credentials.
Kieran documented the solution for the issue here.
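In practice the mix-up usually comes from the environment variables the quantum client reads, which may point at a different tenant than the one that owns the port. Checking them is a good start (the values are examples):

[sourcecode]
export OS_USERNAME=demo
export OS_TENANT_NAME=demo
export OS_PASSWORD=secret
export OS_AUTH_URL=http://192.168.0.11:5000/v2.0/
[/sourcecode]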

There is also a post on the OpenStack wiki talking about the problem.

Conclusion

This should help fix the problems that might arise with a Quantum installation.
If anybody knows about any other issues with Quantum or has any suggestions about the problems listed above please let us know!

Also check the official guide for other common errors and fixes.

OpenStack High Availability Features

So far we have been researching possible solutions to give an OpenStack cloud deployment as many high availability features as possible.

Before the Folsom release, H.A features were not built into the OpenStack service components.
Following a large number of requests from the OpenStack community, starting with the Folsom release H.A is being addressed as part of the project. The features are still being introduced and are in a test phase, and there aren't many production deployments out there yet, but with the help and feedback of the community the OpenStack developers believe that by the time the next version (Grizzly) is released, its H.A features will be automated and ready to go into production from the get-go.

Getting into the details of the H.A features available in Folsom:
Instead of reinventing the wheel, OpenStack decided to go with a proven and robust H.A provider already available in the market: Pacemaker. With more than half a decade of production deployments, Pacemaker is a proven solution when it comes to providing H.A features to a vast range of services.

Specifically looking at the technologies involved with OpenStack, the role of H.A would be to prevent:

  • System downtime — the unavailability of a user-facing service beyond a specified maximum amount of time, and
  • Data loss — the accidental deletion or destruction of data.

In the end the focus is to eliminate single points of failure in the cluster architecture.
A few examples:

  • Redundancy of network components, such as switches and routers,
  • Redundancy of applications and automatic service migration,
  • Redundancy of storage components,
  • Redundancy of facility services such as power, air conditioning, fire protection, and others.

Pacemaker relies on the Corosync project for reliable cluster communications. Corosync implements the Totem single-ring ordering and membership protocol and provides UDP and InfiniBand based messaging, quorum, and cluster membership to Pacemaker.
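The cluster communication itself is defined in Corosync's configuration; a minimal totem section would look something like this (the addresses are examples and depend on your management network):

[sourcecode]
# /etc/corosync/corosync.conf (fragment)
totem {
  version: 2
  secauth: off
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.0.0   # network used for cluster messaging
    mcastaddr: 226.94.1.1
    mcastport: 5405
  }
}
[/sourcecode]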

An OpenStack high-availability configuration uses existing native Pacemaker resource agents (RAs), such as those managing MySQL databases or virtual IP addresses, existing third-party RAs (such as for RabbitMQ), and native OpenStack RAs (such as those managing the OpenStack Identity and Image Services).
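As an illustration, registering the Identity service with Pacemaker through its OpenStack RA would look roughly like this (a sketch assuming the openstack-resource-agents scripts are installed as ocf:openstack:keystone; the credentials and URL are examples):

[sourcecode]
crm configure primitive p_keystone ocf:openstack:keystone \
  params config="/etc/keystone/keystone.conf" \
  os_auth_url="http://192.168.0.11:5000/v2.0/" \
  os_username="admin" os_password="secret" os_tenant_name="admin" \
  op monitor interval="30s" timeout="30s"
[/sourcecode]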

Even though high availability features exist for native OpenStack components and external services, they are not yet automated in the project, so whatever H.A features are needed in the cloud deployment have to be installed and configured manually.

A quick summary of how a Pacemaker setup would look:
[Diagram: Pacemaker cluster]

Pacemaker creates a cluster of nodes and uses Corosync to establish communication between them.
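Once the nodes have joined, the state of the cluster and its resources can be checked from any node:

[sourcecode]
# print the cluster status once and exit
crm_mon -1
[/sourcecode]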

Besides working with RabbitMQ, Pacemaker can also bring H.A features to a MySQL cluster; the steps, with a configuration sketch after the list, would be:

  • configuring a DRBD (Distributed Replicated Block Device) device for use by MySQL,
  • configuring MySQL to use a data directory residing on that DRBD device,
  • selecting and assigning a virtual IP address (VIP) that can freely float between cluster nodes,
  • configuring MySQL to listen on that IP address,
  • managing all resources, including the MySQL daemon itself, with the Pacemaker cluster manager.
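Put together in the crm shell, the resources would look roughly like this (a sketch only; the resource names, DRBD device and VIP are examples):

[sourcecode]
# crm configure
primitive p_drbd_mysql ocf:linbit:drbd params drbd_resource="mysql" op monitor interval="15s"
ms ms_drbd_mysql p_drbd_mysql meta master-max="1" clone-max="2" notify="true"
primitive p_fs_mysql ocf:heartbeat:Filesystem params device="/dev/drbd0" directory="/var/lib/mysql" fstype="ext4"
primitive p_ip_mysql ocf:heartbeat:IPaddr2 params ip="192.168.0.100" cidr_netmask="24"
primitive p_mysql ocf:heartbeat:mysql op monitor interval="20s"
group g_mysql p_fs_mysql p_ip_mysql p_mysql
# MySQL must run where DRBD is primary, and only after it is promoted
colocation c_mysql_on_drbd inf: g_mysql ms_drbd_mysql:Master
order o_drbd_before_mysql inf: ms_drbd_mysql:promote g_mysql:start
[/sourcecode]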

More information can be found at:

  • DRBD
  • RabbitMQ
  • Towards a highly available (HA) open cloud: an introduction to production OpenStack
  • Stone-IT
  • Corosync