Deployments without downtime: High availability in the DevOps Services Web IDE

This blog post references DevOps Services, which has been deprecated. To get the latest tools for agile projects, use IBM Bluemix Continuous Delivery.

With cloud-based applications, there is always pressure to deliver faster. Development teams face technology pressure to quickly respond to security problems, web browser changes, and other technology dependencies. At the same time there is business pressure to deliver fixes to customers more quickly, and to provide early exposure to new features to gather critical feedback. The downside is that deployments often mean disruption and downtime for customers. In response to these pressures, the IBM Bluemix DevOps Services team is making deep architectural and technology changes within its applications to enable zero-downtime deployments.

One of the most complex technology changes to support zero-downtime deployments was in the DevOps Services Web IDE. The Web IDE is a large Java enterprise application that required a noticeable outage for every upgrade, and its team was under intense pressure to deliver more frequently. The following chart shows deployment frequency and average deployment outage duration for the Web IDE over a six-month period.


As you can see, deployment frequency increased from one deployment per month in September and October to five in February. Worse, because the user base and user data steadily grew, each deployment took longer. From December to February, the average downtime was over 30 minutes per month. As much as customers want the latest features and fixes, they don’t want their development tools to fail in order to get them.

In April, the Web IDE was migrated to a new high-availability architecture. This architecture adds full redundancy through all tiers of the app, so that deployments can occur at the flip of a proxy switch. There are now at least two instances of the Web IDE running at any given time. The app servers sit behind an nginx load balancer that is able to switch between them with no noticeable delay. Persistence is handled by a separate disk node that is shared by both instances of the Web IDE and uses a RAID configuration to protect against individual disk failure. At the front end, authentication is handled by a redundant pair of Apache proxies. This means we can not only upgrade without downtime, but also perform maintenance such as kernel updates on any single piece of the stack without downtime. The following diagram provides an overview of the new architecture:


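To make the "flip of a proxy switch" idea concrete, here is a minimal Python sketch of an active/standby switch. It is only a toy model, not the nginx configuration the team uses: the `Backend` and `ProxySwitch` names, the `healthy` flag, and the rule that a flip is refused when the target is unhealthy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """One app server instance (hypothetical model, not a real nginx object)."""
    name: str
    healthy: bool = True


class ProxySwitch:
    """Sends all traffic to a single active backend; flip() changes which one."""

    def __init__(self, primary: Backend, standby: Backend) -> None:
        self.primary = primary
        self.standby = standby
        self.active = primary  # default configuration: traffic goes to primary

    def flip(self) -> Backend:
        """Point traffic at the other backend, but only if it is healthy."""
        target = self.standby if self.active is self.primary else self.primary
        if not target.healthy:
            raise RuntimeError(f"refusing to route to unhealthy backend {target.name}")
        self.active = target
        return target

    def route(self, request: str) -> str:
        """Simulate handling a request on whichever backend is active."""
        return f"{self.active.name} handled {request}"
```

Because both instances are always running, the flip is just a change of pointer; no server has to start up while customers wait.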
Upgrades are fully automated and orchestrated using UrbanCode Deploy, meaning both production servers are upgraded with the click of a button. First, the standby server is upgraded and tested, then nginx routes customer traffic to that server. Finally, the primary server is upgraded and tested, then traffic is switched back to the primary. This process puts the system back into its default configuration, ready for the next deployment or emergency fail-over.
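The upgrade sequence above can be sketched in a few lines of Python. This is an illustration of the ordering only, not the actual UrbanCode Deploy process: the `zero_downtime_upgrade` function and its `upgrade`, `smoke_test`, and `route_to` callbacks are hypothetical stand-ins, and aborting when a test fails is an assumption about how such a process would reasonably behave.

```python
from typing import Callable, List


def zero_downtime_upgrade(
    upgrade: Callable[[str], None],
    smoke_test: Callable[[str], bool],
    route_to: Callable[[str], None],
) -> List[str]:
    """Upgrade standby, switch traffic to it, upgrade primary, switch back.

    Traffic is only moved to a server after it has been upgraded and has
    passed its tests, so customers always reach a working instance.
    """
    log: List[str] = []
    for server in ("standby", "primary"):
        upgrade(server)
        log.append(f"upgrade {server}")
        if not smoke_test(server):
            # Assumed behavior: stop here and leave traffic where it is.
            log.append(f"abort: {server} failed tests")
            return log
        log.append(f"test {server}")
        route_to(server)
        log.append(f"route {server}")
    return log
```

Note that a successful run ends with traffic on the primary, which is the default configuration the post describes. If the standby fails its tests, the sketch never touches the routing, so customers stay on the old but working primary.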

After one final downtime in April to transfer to the new architecture, the team’s velocity increased noticeably. The team has delivered nine production upgrades so far in July 2015 alone, surpassing the total number of deployments from September to December 2014. Since April, some deployments have occurred while the site was in maintenance mode for other reasons, but none have required any downtime. Customers are getting the fixes and features they need much faster, without the traditional disruption that deployments used to imply. As the following chart shows, the duration of deployment outages has dropped to zero, and deployment frequency is increasing.


While this is great progress, the team continues to focus on optimizing the delivery process. The combination of a high-availability architecture and a fully automated deployment process takes much of the traditional friction out of software delivery. With these fundamentals in place, there is no major barrier to delivering production updates daily, or even hourly, based on business and technology needs. For a deeper look at the changes the team is making to support continuous delivery, see these slides from a recent talk I gave on the subject. For more updates, you can also follow @bluemixdevops or me, @jarthorne, on Twitter.

John Arthorne
Sr. Development Manager – Bluemix DevOps Services Web IDE & Eclipse Orion



via IBM Bluemix DevOps Services

July 6, 2017 at 06:51PM