How to Achieve Zero-Downtime Sitecore Deployments (Architecture) – Part I


Is it possible to devise a high-availability architecture using Sitecore that can avoid downtime and broken functionality during deployments?

Well, this article discusses the potential problems you may encounter during deployments and proposes a system architecture to achieve this goal.

So, what are the main problems that affect the availability of the website during the deployment?

  1. Code/Markup/Config updates will cause the application pool to restart.
  2. Publishing new sublayouts can be problematic. For example, publishing the sublayout items before the code and markup are deployed is enough to produce the yellow screen of death.
  3. Rebuilding indexes can cause your search and listing pages to stop working until the rebuild process is complete.

The following architecture describes how to address the problems mentioned above and avoid any downtime during the deployment.

The Architecture

System Architecture

This proposed architecture is based on the multi-instance environment documented in the Sitecore scaling guide, apart from having a web database per CD server. For simplicity, the diagram illustrates the architecture with only two CD servers; however, the CD servers can scale out as needed based on the performance requirements.

The Content Management Setup

This is a standard CM configuration in which the CM server(s) are connected to all databases (core, master, web CD1 and web CD2). A publishing target is created for each web database individually; a configuration sketch is shown below.
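As a rough illustration (not taken from the article), the extra web databases could be wired up on the CM server as follows. Every server, database and user name below is a placeholder, and the database nodes are trimmed; in practice you would copy the full <database id="web"> definition from the default Sitecore configuration and simply rename it.

```xml
<!-- App_Config/ConnectionStrings.config on the CM server (placeholder values) -->
<connectionStrings>
  <add name="core"   connectionString="user id=sc_user;password=***;Data Source=SQL01;Database=Sitecore_Core" />
  <add name="master" connectionString="user id=sc_user;password=***;Data Source=SQL01;Database=Sitecore_Master" />
  <add name="webcd1" connectionString="user id=sc_user;password=***;Data Source=SQL01;Database=Sitecore_WebCD1" />
  <add name="webcd2" connectionString="user id=sc_user;password=***;Data Source=SQL01;Database=Sitecore_WebCD2" />
</connectionStrings>
```

The new databases can then be registered through a config include patch (the file name is illustrative):

```xml
<!-- App_Config/Include/PublishingTargets.config (illustrative) -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <databases>
      <!-- Copy the complete <database id="web"> node and rename it; only the outline is shown here -->
      <database id="webcd1" singleInstance="true" type="Sitecore.Data.Database, Sitecore.Kernel">
        <param desc="name">$(id)</param>
        <!-- ...remaining child elements identical to the default web database... -->
      </database>
      <database id="webcd2" singleInstance="true" type="Sitecore.Data.Database, Sitecore.Kernel">
        <param desc="name">$(id)</param>
        <!-- ... -->
      </database>
    </databases>
  </sitecore>
</configuration>
```

Finally, a publishing target item is created under /sitecore/system/Publishing targets for each database, with its Target database field set to webcd1 and webcd2 respectively.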

The Content Delivery Setup

Here are the general guidelines for the CD setups:

  1. All CD servers are connected to the core database.
  2. Each CD server is connected to its own web database.
  3. A load balancer is used to distribute the load between the CD servers.
  4. A session state service is used to store the sessions out of the IIS process (e.g. the ASP.NET State Server). You can avoid this part by configuring your load balancer to use sticky sessions; however, this will increase the time needed to finish the deployment process, as explained in the following section. A configuration sketch for points 2 and 4 follows this list.
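As a rough sketch of points 2 and 4 (again, every name and address below is a placeholder rather than a value from the article), CD1 simply keeps the default web connection string but points it at its own database:

```xml
<!-- App_Config/ConnectionStrings.config on CD1 (placeholder values) -->
<connectionStrings>
  <add name="core" connectionString="user id=sc_user;password=***;Data Source=SQL01;Database=Sitecore_Core" />
  <add name="web"  connectionString="user id=sc_user;password=***;Data Source=SQL01;Database=Sitecore_WebCD1" />
  <!-- CD2 is identical except that "web" points at Sitecore_WebCD2 -->
</connectionStrings>
```

Moving the sessions out of the worker process is standard ASP.NET configuration; assuming a State Server running on a host called state01, the web.config on each CD server could contain:

```xml
<!-- web.config on each CD server: keep sessions outside IIS so an app pool
     recycle or a redeployment does not wipe the users' sessions -->
<system.web>
  <sessionState mode="StateServer"
                stateConnectionString="tcpip=state01:42424"
                timeout="20" />
</system.web>
```

This way, no renaming is needed on the CD side: each CD server still sees a single database called web, which happens to be its own copy.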

The Deployment Workflow

Normally, the content editing side of the CMS can have lower availability expectations compared to the content delivery side. Hence, code deployment and content updates can be done straight away on the CM server (master database) without needing to worry too much about the deployment process. However, in order to deploy to the CD servers, you can follow these steps:

1- Connection draining

Configure the load balancer to drain the connections on the server you are about to deploy to (e.g. CD1). The time needed to accomplish this step depends on how the load balancer is configured: if sticky sessions are used, the load balancer will redirect only new requests to CD2, while CD1 keeps serving its existing users until their sessions expire. If sticky sessions are not used, draining the server is as simple as sending all incoming traffic to CD2, without needing to wait for sessions to expire or connections to terminate.

2- Deployment to CD server

Once the CD server is drained and no longer serves site users, a code deployment (DLLs, config files, markup, etc.) can be done safely.

Straight after the deployment is done, the changes in the master database need to be published to the web CD1 database using its publishing target. At this point, the CD1 instance can be checked (or regression tested if needed) while the public-facing site is still serving the old version of the code and content.

3- Bring the CD server back to life

It’s now time to re-enable CD1 on the load balancer so site users can see the updated site. Once CD1 is enabled, repeat the whole workflow on CD2.

 

Summary

This article illustrated how to achieve a zero-downtime Sitecore deployment using a multi-instance environment in which one CM server and two CD servers are used. The proposed solution works fine for simple CMS sites; however, a more complex architecture will be needed to address functions such as search, integration with external systems and custom databases. Part II of this article covers the work needed to minimize the downtime when search is used.

 

11 Comments

  1. Kam Figy said:

    For search, you may be about to talk about this anyway, but SwitchOnRebuildIndex should enable serving read requests during rebuild.

    Nice post.

    February 3, 2015
    • ehabelgindy said:

      Oh .. That’s exactly what I am planning to do 🙂
      Thanks

      February 3, 2015
  2. Nathanael Mann said:

    We do something fairly similar to this, but have thought about using SQL replication. During the upgrade, we could then stop the replication process; when the process is complete and the first bank of servers is running happily, restart the replication (thus upgrading the content for you), and then you just need to upgrade files (for the most part).

    February 3, 2015
    • ehabelgindy said:

      Haven’t thought before about using sql replication to do it! This is an amazing idea though.

      February 3, 2015
      • Nathanael Mann said:

        It was first suggested by a solution architect from a non-sitecore background, but makes a lot of sense for Sitecore.

        February 4, 2015
  3. Derek Dysart said:

    How do you deal with the growing amount of time that publishing takes when scaling this approach? I could imagine an environment with more than 3 CD servers, serving a dozen languages could run into publishing delays, even with parallel publishing enabled or a dedicated publishing instance.

    February 10, 2015
    • ehabelgindy said:

      This may be extreme, but if the performance of incremental or smart publishing – with parallel publishing enabled – is not satisfactory, you may go for multiple publishing servers.

      February 11, 2015
  4. Jonathan Folland said:

    In a web farm environment, we would take authoring offline for a short period while we backed up all databases. We would then point authoring to the backup databases. We would point online web servers to the backup databases. We would deploy any new code to offline servers. We would point offline servers to the going forward database (production). We would deploy packages through offline servers. We used a tool from Hedge Hog development to sync the authoring changes from the backup databases to the going forward databases. When we were ready to deploy, we rolled on the new servers and off the old servers. We used powershell scripting to achieve much of this, so that it could be repeated without error.

    February 11, 2015
    • ehabelgindy said:

      This sounds great, thanks for sharing your solution.

      February 11, 2015
  5. Andrew lansdowne said:

    Very interesting. It reads as though all the load balancer setting change and publishing are done manually: is this the case or are you able to automate the entire process?

    March 13, 2015
    • ehabelgindy said:

      It depends on the load balancing service, I would say this would be a manual process in most cases. However, you can automate this part if the load balancer can be configured using Powershell scripts or some sort of APIs.

      March 16, 2015
