OpenStack issue: nova-network instance has no IP address using FlatDHCPManager or VlanManager

Occasionally, when working with an OpenStack installation that uses legacy Nova networking with FlatDHCPManager or VlanManager, you may encounter an issue where an instance does not correctly take the private IP it was assigned. When this happens, you obviously won’t be able to ping the instance on the network, and you will also likely see cloud-init hang because it cannot contact the metadata server. This issue is often caused by the dnsmasq instance that nova-network manages failing to pick up updates to its list of virtual-MAC-to-IP-address mappings.

Verify nova-network receives the IP

When an OpenStack instance is created, nova-network receives a message that tells it to update dnsmasq’s configuration with the new mapping. This mapping is used to assign the IP address to the instance via DHCP when it boots. nova-network handles this message by updating the contents of /var/lib/nova/networks/nova-<bridgename>.conf. If you open that file in your favorite text editor, you should see contents like this:
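(The entries below are illustrative. The file is a dnsmasq hosts file with one MAC,hostname,IP line per instance on that network, so your MAC addresses, instance names, and IPs will differ.)

    fa:16:3e:1a:2b:3c,instance-00000003.novalocal,10.0.0.3
    fa:16:3e:4d:5e:6f,instance-00000004.novalocal,10.0.0.4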

If you see your IP address listed, continue.

Restart dnsmasq

dnsmasq is managed and started by nova-network; however, stopping nova-network typically doesn’t stop the dnsmasq processes it spawned. First, stop the nova-network service:
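On a stock Ubuntu packaging of Nova, that looks something like this (substitute your distribution’s service manager if it differs):

    sudo service nova-network stop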

At this point, you should still see dnsmasq processes in the process table. Kill them.
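Something along these lines works. Note that pkill will match every dnsmasq on the box, so if this host also runs dnsmasq for anything else (local DNS, libvirt’s default network), kill only the PIDs attached to your nova bridge instead:

    ps aux | grep [d]nsmasq    # confirm the leftover processes
    sudo pkill dnsmasq         # kill them all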

Then, start nova-network, and after a few seconds you should see the dnsmasq instances again:
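Again assuming the stock service scripts:

    sudo service nova-network start
    ps aux | grep [d]nsmasq    # a pair of dnsmasq processes should reappear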

Finally, reboot your instance, and you should see it grab its IP address from DHCP.

Recovering perma-down Nova services in OpenStack Kilo

In OpenStack Kilo with RabbitMQ, you may periodically run across services that simply do not start and register correctly. There is no issue with the service configuration, and no matter how many times you try to restart them, they simply don’t come back. (I frequently hit this issue with nova-compute and nova-network on my compute nodes.) If you check the service log, you’ll probably notice a lot of logs like this occurring around every 90 seconds:

If you’re lucky enough to have debug and verbose modes enabled, you might also catch this:

What’s happening is that OpenStack, via the oslo.messaging subsystem, is trying to create a queue that already exists, because it wasn’t cleaned up previously. This procedure has awful, terrible error handling (but you’re running OpenStack, so you already knew that everything in OpenStack has awful, terrible error handling). It assumes this failure must be a race condition, and keeps retrying indefinitely to create the queue. The solution is to delete the queue and restart the service, so the service can create the queue correctly.

Delivery queues in OpenStack are named <subsystem>.<nodename>, so if your compute node is named compute01 (to pick an example), your nova-compute and nova-network queues will be named compute.compute01 and network.compute01, respectively. (Nova services don’t prefix their queue names with nova-; other services generally do.)

If you prefer to use the GUI, which is easy for small deployments, you can find the queue under the Queues tab of the RabbitMQ management interface. If you prefer the CLI (the GUI becomes entirely unusable past a few thousand queues), use rabbitmqadmin:
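For example, to delete the stale queues for a compute node named compute01 (a placeholder; substitute your node name, and add -u/-p credentials if your broker requires them):

    rabbitmqadmin delete queue name=compute.compute01
    rabbitmqadmin delete queue name=network.compute01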

Afterwards, restart the service, and you should see it functioning normally.

Applying a “definition of done” to infrastructure engineering

The DevOps movement has had many incredibly positive effects on IT systems engineering as a discipline. Direct work with software development teams has led many infrastructure engineers to adopt practices that have been standard operating procedure in the software development realm for decades. Much of this practice has centered on adopting and evolving the technology that allows Agile development teams to quickly and confidently achieve rapid change. Infrastructure-as-code, whether through traditional configuration management or containerization, allows complicated platforms to be expressed as versioned artifacts, without the bureaucratic overhead of an ITIL-style CMDB and manual release management processes. Simultaneously, continuous integration systems allow us to trivially test for regressions in functionality, performance, and security.

Significantly less attention has been paid to the ways that Agile teams manage schedules. This is critical for technologists to understand because, according to a well-known study by McKinsey and the University of Oxford, the average large IT project runs 45% over budget and 7% over schedule. These risks are well known. One major reason for the surge in cloud computing is that, because most organizations share critical personnel between operations work and project work, operational issues can create unforeseen bottlenecks on key project staff. (The Phoenix Project covered this concept at length.) But even with those distractions removed, we must also keep pace by using estimation methods compatible with Agile engineering styles.

Particularly in Scrum methodology, the Definition of Done is integral to this process. Peter Stevens summarizes the concept quite succinctly:

At its most basic level, a definition of Done creates a shared understanding of what it means to be finished, so everybody in the project means the same thing when they say “it’s done”. More subtly, the definition of Done is an expression of the team’s quality standards. A more rigorous definition of Done will be associated with higher quality software. Generally the team will become more productive (“have a higher velocity”) as their definition of Done becomes more stringent, because they will spend less time fixing old problems. Rework all but disappears.

Old problems? Rework? These things are in no way foreign to anyone who has built any kind of technology infrastructure. Acceptance criteria, even loosely defined ones, ensure everyone is on the same page about project progress.

Stevens’ sample Definition of Done looked like this:

  1. Potentially releasable build available for download
  2. Summary of changes updated to include newly implemented features
  3. Inactive/unimplemented features hidden or greyed out (not executable)
  4. Unit tests written and green
  5. Source code committed on server
  6. Jenkins built version and all tests green
  7. Code review completed (or pair-programmed)
  8. How to Demo verified before presentation to Product Owner
  9. Ok from Product Owner

Infrastructure has some different requirements. A Definition of Done for an infrastructure task might have some of the following:

  1. Service Level Agreement determined
  2. Infrastructure repeatable through code
  3. Continuous integration tests for (2) written and passing
  4. Metrics and logs aggregated for rapid problem diagnosis
  5. Automated monitoring alerts for availability and performance problems
  6. Documentation and architecture diagrams completed
  7. Run books written for investigating outages
  8. Automated backups of service data
    1. Automated verification of backups
  9. Guidelines established for capacity planning and scaling
    1. Launch-day capacity plan completed
  10. Full and partial service failure behaviors tested
  11. Operations staff provided basic training on the service

(Alternatively, depending on just how closely your development and operations teams work together, you might work together directly on the same sprint goals and share a single Definition of Done that takes these operations-oriented facets into account.)

The concept of a checklist is far from new; Tom Limoncelli even wrote an entire book about how to improve individual productivity by making effective use of them. But the Definition of Done’s emphasis on team communication and understanding makes it clear that this is a crucial concept for high-performing DevOps organizations. A good Definition of Done should include input from infrastructure, product owner, development, security, and risk management teams, as well as higher-level layers of the business. In a post very much worth reading, Mitch Lacey outlines a clear process for helping arrive at a mutually-understood Definition of Done.

When the team discusses these items together, everyone understands that each of these facets has an impact on the schedule, and discussions happen around what those impacts are. All stakeholders have agreed on the value of each of these aspects of the deliverable, and have discussed how much work is actually appropriate to arrive at Done.

Christian Vos actually proposes writing two Definitions of Done: one for minimum acceptance, and one for continued maturity of the project. (In other words: it’s okay to ship without batteries included, as long as everyone involved is aware.) Particularly in Lean shops where the uptake of new features is not known until those features are deployed and observed, this can be valuable to avoid building unnecessary resiliency, instrumentation, or scale into the system before it’s needed.

The Definition of Done is a powerful process which can be invaluable for helping team members arrive at shared understanding. Working together to arrive at the Definition of Done from an infrastructure perspective allows organizations to understand system operability, coordinate and resolve conflicting priorities, and schedule features completely and correctly so they do not need to be revisited on future development sprints.

Estimating is crucial, but not all estimates are time-driven

If you ask two developers how long it will take to implement a feature, you’ll get four different answers.

Over the past few months, I’ve been following a lot of the Twitter discussion between Agile advocates like Woody Zuill and Bob Marshall, who range from outright proponents to tepid supporters of the #NoEstimates movement, and enterprise managers like Glen Alleman and Peter Kretzman, who approach the movement with a comic disdain somewhat reminiscent of Waldorf and Statler, The Muppet Show’s resident hecklers, who are quite keen to let Fozzie Bear know he isn’t very good at being funny.

Zuill is a huge proponent of the abolition of software estimates in favor of a different way of management that does not require them. Marshall holds a more nuanced position based around Goldratt’s Theory of Constraints, which posits that businesses should understand the areas that actually bottleneck their growth and focus their efforts there. Alleman and Kretzman are staunch believers that a business simply cannot, under any circumstances, make intelligent and informed decisions without at least coarse-grained estimates of the time needed to achieve a goal. (It’s also critical to understand that, from all perspectives, estimates and deadlines are not the same thing.)

All of these are valid perspectives when applied to the right domain. It’s also clear that there are places where time-based estimation is a very bad fit. One such example is any area replete with “unknown unknowns,” like product development. Key to successfully developing a product is identifying the correct product/market fit, and stamping a time estimate on that treats the outcome as a foregone conclusion; you don’t know how many times you’ll need to start over. Most R&D-oriented deliverables, which often rely on a creative spark that cannot be forced, fall into the same pattern.

Lagging vs. leading indicators

Most successes and failures can be predicted through the proper use of the right metrics. These metrics fall into one of two categories:

  • Lagging indicators are metrics that represent the current or past state of something. These are typically numbers that can unquestionably be used to measure progress towards a goal. Some examples of lagging indicators would be total revenue booked last quarter, the market share of a product, or the number of new services contracts signed.
  • Leading indicators are metrics that begin to trend in a certain direction before a corresponding lagging indicator. Thus, they can be used to predict the future value of a lagging indicator.

Some metrics can be both leading and lagging indicators, and because of this duality they are often very valuable. A signed services contract is both a discrete result in itself and a predictor of future revenue. Likewise, because projects comprise interdependent queues of work, a schedule slip indicates both that work expected to be finished already has not been, and that work expected to be done at some future date will likely be late. Project lateness is a leading indicator that a company will lose its edge to a competitor who can deliver their version on time. That’s bad. It’s also a leading indicator of complete project failure, where a project is abandoned because it will never meet its goals in a way that provides a positive return on investment. That’s very bad.

This is precisely why schedule adherence is such a valuable indicator, and why project managers dwell on it obsessively. However, some kinds of work don’t benefit much from lagging indicators, and other leading indicators might serve better.

Estimate based on your key indicators

If you know that time is a bad metric for your work, try to estimate your work based on one of your other key performance indicators. If you are trying to determine a product/market fit, consider trying to estimate how many business development meetings you’ll need in order to determine whether or not your strategy works. You might estimate how many high-level user stories you need to collect before you can begin to write code for a product. Your estimate might be wrong — that’s one of the points of estimating — but it allows you to meaningfully measure your progress. Just as importantly, it allows you to report that progress to other people who are invested in the outcome.

Time is the de facto estimation currency of most organizations, so if it’s a bad fit for you, remember that at some point, you’ll probably need to exchange at an unfavorable rate.

Calling bullshit on “code is not the asset”

The climate of technology discussions is increasingly being dominated by annoying platitudes, cookie-cutter maxims which eschew all nuance in favor of cultural memes. Repeated frequently enough, they become indistinguishable from truth.

If you’re not reading Gareth Rushgrove’s DevOps Weekly newsletter, you really should be. It’s a tremendously useful aggregation of reading materials that, while rarely immediately applicable, provoke deep mental dialogue on the ways that problems can be approached. One such item was Dan North’s Microservices: software that fits in your head, which is an excellent slide deck on microservice architectures and patterns.

But there was this little nugget buried inside:

[Slide from North’s deck asserting that the code is not the asset, but rather the cost]


What startles me about this slide is that DevOps was, in large part, a direct reaction to this mentality being so pervasive in Information Technology. IT is a cost center, said executives, and we must take every opportunity to minimize the damage that it causes. Like whack-a-mole hammers we must stamp out creativity wherever we find it, and institute strict governance processes to ensure that these costs stay low.

Never mind that the most effective large-scale IT environments are the ones that understand how to leverage their previous investments as a first-class platform on which to build their innovations. As one well-known example involving physical assets, Amazon created Amazon Web Services as a way to earn revenue from its existing computing capacity, which sat mostly idle outside of the holiday shopping season. It has since grown to become the largest web hosting platform in the world.

Somewhere along the line, somebody forgot to consider that code can be an asset too. Technology companies, especially small startups, frequently pivot after discovering that their existing technology can be quickly adapted to fulfill a market need that wasn’t previously anticipated. And there’s no better example of this than one of the most transformative technologies being adopted today: Docker.

Docker, by far the most widely adopted tool for managing application containers, began as an internal tool at dotCloud, a then little-known PaaS competing with Heroku, AppFog, and other hosts. The pivot happened because, after dotCloud released Docker as a public project and began to speak about it, people recognized the value of the thing itself: a value independent of the specific business problem dotCloud was trying to solve when they wrote Docker.

As North points out, the costs associated with developing software are quite substantial. However, we must be mindful that the code is not the cost itself; it is undeniably an asset, but one with liquidity and depreciation that must be managed like any other asset. This complexity is extremely difficult to manage and isn’t well-adapted to snappy bullet-point aphorisms.

Lean thinking teaches us to limit work in progress, and kanban teaches us that we tie up our capital whenever we invest in materials that aren’t used to produce a good that will sell quickly. It’s crucial that we distinguish useless raw materials from the machining infrastructure we’ve purchased, customized, and created to streamline the production process. With software, they can both look the same.

XWiki Google Apps authentication with Nginx and Lua

XWiki is a really terrific open-source wiki package that, in my opinion, is the only freely available package that comes even close to the functionality of Atlassian’s Confluence. I recently wanted to integrate XWiki with single sign-on provided by Google Apps, but there are no XWiki plugins that work directly with Google Apps OAuth. Instead, we’ll use a custom Lua authenticator with Nginx to handle the authentication redirects and then pass authentication headers to XWiki.

In this post, I’ll be going into how I configured this scheme on an Ubuntu 12.04 LTS system. 14.04 should work without significant modifications.


One caveat: if you need to switch between multiple Google Apps accounts, logging out via the link in XWiki does not work. There’s probably a trivial workaround that I haven’t bothered to find yet.


Before beginning, you’ll need these in order to follow along:

  1. A public-facing Ubuntu/Debian server with a functioning XWiki installation. This article will assume this server is running on the default HTTP port 8080.
  2. An XWiki administrator account with a username that matches your Google Apps username (the portion before the @ symbol). If you do not create this, you will be locked out of administration once you enable Google Apps login.
  3. A verified Google Apps domain with at least one user account. I’ll use example.com as the domain in this article’s examples.
  4. A permanent hostname for the XWiki server, to be used for OAuth2 callbacks. I’ll use wiki.example.com in examples.
  5. An SSL certificate for the site, issued by a trusted Certificate Authority and installed on the XWiki server. I’ll keep it under /etc/ssl/certs/ in examples. I’ll also assume that your certificate is a PEM file containing the key, the server certificate, and the certificate chain concatenated into a single file. If this is not how you store your certificates, you’ll need to update your Nginx configuration accordingly.

You should not have Nginx preinstalled on your server. We are going to build our own Nginx with the Lua module installed. (If you have a custom-built Nginx package with a recent Lua module version compiled in already, feel free to use it, of course.)

Create the OAuth credentials

OAuth differs from traditional username/password authentication in that an application using OAuth never sees the username or password the user provides. Instead, the application redirects to a third-party login server that verifies the user’s credentials. Once the user is verified, Google’s OAuth systems issue a callback to your application confirming that the user is correctly authenticated. To make this work, you need to tell Google’s servers a little bit about your XWiki installation.

Create a Google Developer project

Log into the Google Developers Console using your Google Apps account. Once you are logged in, click the Create Project button in the middle of the screen. Name your project whatever you like, then click Create to finish account creation. The Google Developers Console should now take you inside your newly-created project.

Configure a consent screen

Before you can create an OAuth client ID, you need to configure a consent screen. This is the screen that’s shown to users after they log into Google, asking them to grant certain account privileges to your application.

From the menu on the left side of the screen, click APIs & auth to expand the sub-menu, then click Consent screen. Under Email address, select your email address. Under Product name, enter a product name that will be shown on the consent screen for your applications. All other fields are optional. Once you’ve finished filling in all the fields, click Save to create your consent screen.

Create an OAuth Client ID

From the menu on the left side of the screen, click APIs & auth to expand the sub-menu, then click Credentials. Locate the OAuth heading, then click the Create new Client ID button. The Create Client ID dialog will appear. Enter the following parameters:

  • Application type: Web application
  • Authorized JavaScript origins: your wiki’s public base URL (https://wiki.example.com in our example)
  • Authorized redirect URIs: the OAuth callback URL on that host (the exact path depends on how the Lua authentication script is configured)

The client ID should now appear on the right side of the screen. Note the Client ID and Client secret fields. You’ll need both of these values later to configure authentication in Nginx.

Install Lua and CJSON

Begin by installing the Lua libraries. We’ll be using LuaJIT with Nginx for performance. We also need some security libraries in order to have HTTPS support in Lua.
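On Ubuntu 12.04, something like the following pulls in LuaJIT, its development headers, and the OpenSSL headers. The package names are from memory, so adjust if apt complains:

    sudo apt-get update
    sudo apt-get install -y luajit libluajit-5.1-dev lua5.1 libssl-dev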

Next, download and build the Lua CJSON library. The current version is 2.1.0 as of this writing.
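A sketch of the build, using the GitHub release tarball. The download URL and the LuaJIT include path are assumptions; point LUA_INCLUDE_DIR at wherever your LuaJIT headers actually landed:

    cd /tmp
    wget -O lua-cjson-2.1.0.tar.gz https://github.com/mpx/lua-cjson/archive/2.1.0.tar.gz
    tar xzf lua-cjson-2.1.0.tar.gz
    cd lua-cjson-2.1.0
    make LUA_INCLUDE_DIR=/usr/include/luajit-2.0
    sudo make install LUA_INCLUDE_DIR=/usr/include/luajit-2.0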


Build a custom Nginx with Lua scripting

Download lua-nginx-module and extract the sources, so the module can be found by the Nginx configure script:
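For example, fetching a release of the module from GitHub into a build directory (the version here is a placeholder; grab whatever release is current):

    mkdir -p ~/build && cd ~/build
    wget -O lua-nginx-module-0.9.15.tar.gz https://github.com/openresty/lua-nginx-module/archive/v0.9.15.tar.gz
    tar xzf lua-nginx-module-0.9.15.tar.gz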

Install some Nginx build dependencies:
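On Ubuntu these are roughly:

    sudo apt-get install -y build-essential libpcre3-dev zlib1g-dev libssl-dev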

Then download, extract, and configure Nginx:
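A sketch of the build, assuming the Lua module was extracted into ~/build as above and the LuaJIT headers and library live in the usual Ubuntu locations (adjust the paths if yours differ). The --prefix matches the /opt/nginx-1.7.10 path used later in this article:

    cd ~/build
    wget http://nginx.org/download/nginx-1.7.10.tar.gz
    tar xzf nginx-1.7.10.tar.gz
    cd nginx-1.7.10

    export LUAJIT_LIB=/usr/lib/x86_64-linux-gnu
    export LUAJIT_INC=/usr/include/luajit-2.0

    ./configure --prefix=/opt/nginx-1.7.10 \
                --with-http_ssl_module \
                --add-module=../lua-nginx-module-0.9.15
    make
    sudo make install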

Download the authentication module and configure Nginx

Agora Games has kindly published an Nginx Lua script that can be used to support OAuth2 authentication. However, at the time of this publication, it doesn’t support a crucial feature that we need — the ability to set HTTP headers based on OAuth login status. We’re going to pull that from eschwim’s fork.
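Something like this drops the script somewhere Nginx can read it. The repository name is my assumption based on the fork mentioned above, so verify it before cloning:

    sudo mkdir -p /etc/nginx/lua
    cd /etc/nginx/lua
    sudo git clone https://github.com/eschwim/nginx-google-oauth.git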

With the script in place, we’re going to configure Nginx. Create a virtual host in /etc/nginx/nginx.conf with the following configuration:
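Here’s a minimal sketch of the kind of server block involved. The ngo_* variable names and the access.lua entry point follow my recollection of the nginx-google-oauth README, and the header name must match whatever you configure in XWiki later, so treat this as a starting point to check against the script you actually downloaded rather than a drop-in configuration:

    worker_processes  1;
    events { worker_connections 1024; }

    http {
      # Make the OAuth script and its Lua dependencies resolvable
      lua_package_path "/etc/nginx/lua/nginx-google-oauth/?.lua;;";

      server {
        listen 443 ssl;
        server_name wiki.example.com;

        ssl_certificate     /etc/ssl/certs/wiki.example.com.pem;
        ssl_certificate_key /etc/ssl/certs/wiki.example.com.pem;

        # Values from the Google Developers Console; ngo_token_secret is any
        # long random string used to sign the session cookie.
        set $ngo_client_id     "YOUR_CLIENT_ID.apps.googleusercontent.com";
        set $ngo_client_secret "YOUR_CLIENT_SECRET";
        set $ngo_token_secret  "some-long-random-string";
        set $ngo_domain        "example.com";

        # Force every request through the Google OAuth check
        access_by_lua_file /etc/nginx/lua/nginx-google-oauth/access.lua;

        location / {
          # Hand the authenticated username to XWiki; the variable set by the
          # forked script may be named differently, so check its source.
          proxy_set_header X-Forwarded-User $ngo_user;
          proxy_pass http://127.0.0.1:8080;
        }
      }
    }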

With the configuration in place, start Nginx with sudo /opt/nginx-1.7.10/sbin/nginx -c /etc/nginx/nginx.conf.

(You should, of course, configure Nginx to start with your init system of choice, like Upstart or runit, so Nginx will start automatically when your server reboots. That configuration is beyond the scope of this article.)

Install the XWiki headers authentication module

From your wiki’s administration page, locate Extension Manager from the menu on the left, then click Add Extensions. Search for Headers Authenticator for XWiki. Locate the plugin in the table at the bottom and click Install, then wait for installation to complete.

On my XWiki 6.4.1 installation, this process never completed successfully. It kept downloading the file into a temp directory over and over and wouldn’t stop until I forcibly restarted the XWiki service. I had to download the plugin jar, manually place it into /usr/lib/xwiki/WEB-INF/lib, and restart the service.
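If you hit the same problem, the manual workaround is roughly this; the jar filename is a placeholder for whatever artifact you download from the extension’s page:

    sudo cp authenticator-headers-<version>.jar /usr/lib/xwiki/WEB-INF/lib/
    sudo service tomcat7 restart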

Configure XWiki for headers authentication

Now configure XWiki to use the headers that Nginx is feeding it. Add the following to /etc/xwiki/xwiki.cfg:
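The properties look roughly like the following. The authenticator class and property names are from my memory of the Headers Authenticator documentation, and the header name must match what your Nginx configuration sets, so double-check both against the extension’s docs:

    # Use the headers authenticator instead of the default form login
    xwiki.authentication.authclass=com.xwiki.authentication.headers.XWikiHeadersAuthenticator
    # Header that marks a request as authenticated, and header carrying the user ID
    xwiki.authentication.headers.auth_field=X-Forwarded-User
    xwiki.authentication.headers.id_field=X-Forwarded-User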

Finally, restart your XWiki Tomcat container with service tomcat7 restart (or whatever is appropriate for your installation type).

Wrapping up

When you browse to your wiki’s hostname, you should now see a Google Apps login screen. After providing your login credentials, you should be prompted to grant basic account information to the Google Developer app that you created earlier. Once you authorize the app to use your credentials, you should see your account logged into XWiki automatically.

sensu-run: test Sensu checks with token substitution/interpolation

When I’m configuring Sensu checks, especially things that make direct use of variables in my Sensu configuration, I’ve gotten annoyed by the fact that testing them is more difficult than it needs to be. I’ve hacked up a very quick and dirty tool called sensu-run for testing arbitrary commands and standalone checks. Give it a try and see how it works!

Permanently setting FQDN in Google Compute Engine

Unlike Amazon’s EC2, Google Compute Engine allows you to choose the names for your instances, and takes meaningful actions with those names — like setting the hostname on the system for you. Unfortunately, this only affects the short hostname, not the fully-qualified domain name (FQDN) of the host, which can complicate some infrastructures. To set the FQDN at instance launch, we’ll need some startup script magic.

This script snippet checks for the domain or fqdn custom attribute on your instance and applies it to the host after the system receives a DHCP response. It’s based on Google’s own set-hostname hook included with the Google Startup Scripts package. Of course, you’ll need to bake this into your base GCE system image using Packer or another similar tool.

Place the following into /etc/dhcp/dhclient-exit-hooks.d/zzz-set-fqdn:
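Here’s the general shape of the hook, as a sketch. It reads a custom fqdn (or domain) attribute from the metadata server and reapplies the hostname and /etc/hosts mapping after each DHCP event; the attribute names and the /etc/hosts handling are assumptions you’ll want to adapt to your environment:

    # /etc/dhcp/dhclient-exit-hooks.d/zzz-set-fqdn
    # Sourced by dhclient-script after the lease is configured, so $reason
    # and $new_ip_address are available and the metadata server is reachable.

    get_attr() {
        curl -s -f -H 'Metadata-Flavor: Google' \
            "http://metadata.google.internal/computeMetadata/v1/instance/attributes/$1"
    }

    case "$reason" in
        BOUND|RENEW|REBIND|REBOOT)
            fqdn="$(get_attr fqdn)"
            if [ -z "$fqdn" ]; then
                domain="$(get_attr domain)"
                [ -n "$domain" ] && fqdn="$(hostname -s).${domain}"
            fi
            if [ -n "$fqdn" ]; then
                shortname="${fqdn%%.*}"
                hostname "$shortname"
                # Refresh the /etc/hosts entry so `hostname -f` returns the FQDN
                sed -i "/$fqdn/d" /etc/hosts
                echo "$new_ip_address $fqdn $shortname" >> /etc/hosts
            fi
            ;;
    esac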


Using Google Compute Engine service accounts with Fog

Google Compute Engine has a great little feature, similar to EC2’s instance IAM roles, where you can create an instance-specific service account at instance creation. This account has the privileges you specify, and the auth token is accessible automagically through the instance metadata.

Unfortunately, Fog doesn’t support this very well. It expects you to pass in an email address and a key to access the Google Compute Engine APIs, neither of which you have yet. However, you can construct the client yourself, using a Google::APIClient::ComputeServiceAccount for authorization, and pass it in. This code snippet should help:
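Here’s roughly what that looks like with the old (0.8-era) google-api-client gem. The :google_client parameter name is my best recollection of how Fog accepted a pre-built client at the time; verify it against your Fog version and the issue linked below:

    require 'fog/google'
    require 'google/api_client'

    # Build an API client that authenticates as the instance's built-in
    # service account via the metadata server (no email/key material needed).
    client = Google::APIClient.new(
      application_name:    'my-app',     # placeholder
      application_version: '0.0.1'
    )
    client.authorization = Google::APIClient::ComputeServiceAccount.new
    client.authorization.fetch_access_token!

    # Hand the pre-authorized client to Fog instead of an email address and key.
    compute = Fog::Compute.new(
      provider:       'Google',
      google_project: 'my-project-id',   # placeholder
      google_client:  client
    )

    puts compute.servers.map(&:name)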

Follow Fog issue #2945 and assume this post to be outdated when it gets closed.

Replace annual reviews with individual retrospectives

In the past several decades, and particularly in the past few years, many forward-thinking managers have come to the conclusion that traditional yearly performance appraisals are a waste of time at best, or a net negative to morale at worst. This is a philosophy supported by many bright management thinkers, including W. Edwards Deming:

Evaluation of performance, merit rating, or annual review… The idea of a merit rating is alluring. The sound of the words captivates the imagination: pay for what you get; get what you pay for; motivate people to do their best, for their own good. The effect is exactly the opposite of what the words promise.

Bob Sutton and Huggy Rao, authors of Scaling Up Excellence, wrote in their book about Adobe’s experiences eliminating yearly performance appraisals from their organization:

Since the new system was implemented, involuntary departures have increased by 50%: this is because, as Morris explained, the new system requires executives and managers to have regular “tough discussions” with employees who are struggling with performance issues—rather than putting them off until the next performance review cycle comes around. In contrast, voluntary attrition at Adobe has dropped 30% since the “check-ins” were introduced; not only that, of those employees who opt to leave the company, a higher percentage of them are “non-regrettable” departures.

Clearly, many managers and their organizations have found annual performance reviews to be an ineffective tool for managing teams. But what if we took the annual performance review, and were able to humanize it as a tool for good?

Retrospectives: a human approach

Performance reviews are a terrible source of anxiety and stress. A year’s worth of judgment, and the consequences of that judgment, are compressed and handed down in an instant. It’s often as nerve-wracking for the manager as for the subordinate.

So, when I worked as a manager, I used my annual meetings to do something slightly unconventional: to forsake any judgments or value propositions, and instead remind my staff of their accomplishments over the last year. An anniversary, if you will.

In technology, we rarely get the opportunity to think in time periods greater than a few months. If you work within an Agile shop, you might think in two-week sprints. A year ago is a world away, and for someone mired in a difficult project, it can be hard to slog through the impostor syndrome and remember everything they did for the organization. We can’t always see our professional development at a macro level, and an outside perspective with a long view can help figure out where we’re going.

The goal of management should be not just to improve short-term productivity, but to align the company’s goals with the career development goals of its employees over the long term. Removing annual performance appraisals is a great step toward removing unnecessary stress from the workplace, but aligning employees’ long-term career goals with the organization’s is still crucial to maintaining an effective team.
