Summary

There was approximately 25 minutes of server downtime, from 2016-08-05 09:46 UTC to 2016-08-05 10:09 UTC, during an nginx upgrade. The root cause was a change in nginx's default behaviour for IPv6 sockets, which left the server no longer accepting IPv4 connections.

Details

The version of nginx installed on the server was quite old. Before upgrading, I read through the release notes for changes that might affect us, and copied them to a text file for easy reference in case something went wrong. Nothing in the changelog looked like it would affect us (though crucially, I didn't fully understand all the mentioned changes...).

I updated nginx while it was running (to keep the downtime short). Once the update had finished, I restarted nginx. It reloaded and everything seemed to be ok. I then realised that the old process was still running, and the new version hadn't started. After I killed the old process and started the new one, clojars.org was no longer accessible.

I searched for possible answers, and a change that was mentioned in the changelog came up: ipv6only is now on by default for IPv6 sockets. This meant that the IPv6 listening socket accepted only IPv6 connections, so IPv4 clients could not connect. If you were connecting to Clojars over IPv6, you would have had no connectivity issues. I updated the nginx server config to turn off ipv6only, and service was restored.
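
For illustration, a fix of this kind might look roughly like the fragment below. This is a sketch and not the actual Clojars configuration: the port and the rest of the server block are assumptions, and only the ipv6only parameter on the listen directive comes from the incident itself.

```nginx
server {
    # Assumed port, for illustration only.
    # ipv6only=off makes the wildcard IPv6 socket dual-stack,
    # so it accepts both IPv6 and IPv4 connections.
    listen [::]:80 ipv6only=off;

    # Alternative (use instead of the line above, not together with it):
    # leave ipv6only at its default and add a separate IPv4 listener.
    # listen 80;
    # listen [::]:80;

    server_name clojars.org;
}
```

nginx only applies ipv6only when the socket is first created, so a change like this needs a full restart of nginx and not just a reload.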

Learnings

When updating software, make sure that we understand all of the changes in the changelog, especially ones which change default behaviour.
If possible, test upgrades on a test server before applying them to the production server.

Posted Aug 05, 2016 - 10:21 UTC

Resolved

This incident has been resolved.

Posted Aug 05, 2016 - 10:06 UTC

Monitoring

Nginx problems appear to have been resolved (IPv6 issues)

Posted Aug 05, 2016 - 10:05 UTC

Investigating

Working on restoring access after nginx upgrade.

Posted Aug 05, 2016 - 09:42 UTC