Disaster recovery

Business continuity and disaster recovery planning

To ease support and deployment, SD Elements is deployed on a single virtual appliance (Virtual Machine or VM) as a monolithic service. The VM runs nginx (front-end web proxy), apache (application server), postgres (database), and a few other small services (memcache, rabbitmq, and so on).

SD Elements is not a mission-critical system. As a result, the impact of a service disruption is limited to the efficiency of employees in product development teams. While service availability is not mission critical, the data stored in SD Elements can affect the quality of the software produced by development teams. As a result, data integrity is a more sensitive subject. We recommend the following SLOs:

  • Recovery Point Objective (RPO): 1 hour to 1 day

  • Recovery Time Objective (RTO): 1 hour to 1 business day

For larger deployments, we recommend the lower end of the RPO and RTO ranges.
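
As an illustration of how an RPO target can be monitored, the following minimal sketch checks whether the most recent backup is still within the chosen RPO. The function name and the example timestamps are hypothetical and are not part of SD Elements.

```python
from datetime import datetime, timedelta


def backup_meets_rpo(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True if the newest backup is recent enough to satisfy the RPO."""
    return now - last_backup <= rpo


# Hypothetical example: with a 1-hour RPO, a backup taken 45 minutes ago
# is acceptable, while one taken 3 hours ago is not.
now = datetime(2024, 1, 1, 12, 0)
one_hour = timedelta(hours=1)
print(backup_meets_rpo(now - timedelta(minutes=45), now, one_hour))  # True
print(backup_meets_rpo(now - timedelta(hours=3), now, one_hour))     # False
```

A check like this could run from a monitoring job and raise an alert when it returns False.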

Backups

The system is designed to store all its active data in a database that is easy to back up. A database backup can be loaded back into the appliance as long as the backup and the restore are done using the same version of the software.

In extenuating circumstances, the data can be migrated to a newer version, but never to an older version of the software.
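
The version rule above can be expressed as a small check. This is only a sketch: it assumes version strings are dotted numbers with the same number of components (for example "5.1.0"), which may not match the actual SD Elements version format, and the function names are hypothetical.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted version string such as "5.1.0" into integers (assumed format)."""
    return tuple(int(part) for part in version.split("."))


def restore_policy(backup_version: str, target_version: str) -> str:
    """Classify a restore according to the rule described above."""
    backup = parse_version(backup_version)
    target = parse_version(target_version)
    if backup == target:
        return "supported"        # same version: normal restore
    if backup < target:
        return "migration-only"   # newer target: only in extenuating circumstances
    return "unsupported"          # older target: never


print(restore_policy("5.1.0", "5.1.0"))  # supported
print(restore_policy("5.1.0", "5.2.0"))  # migration-only
print(restore_policy("5.2.0", "5.1.0"))  # unsupported
```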

At minimum, we recommend taking a full virtual appliance backup before any attempt to update the software and again after a successful update. Updates normally happen once a month. If a system-level issue occurs, this allows you to restore the most recent stable version of the virtual appliance and then use the data backup to load the latest stable data onto it.

Full virtual appliance backup

Since SD Elements runs in a virtualized environment, ready-to-use solutions such as the automated snapshots and backups provided by VMware vCenter, or other similar solutions, provide an easy way of backing up the entire deployment.

For cases where a virtualization-layer backup is not available, we offer a replacement method that uses rzbackup (a wrapper around zbackup) to do de-duplicated differential backups of the entire operating system and its data from within the server. The process is straightforward to set up, and we can assist if need be.

Differential backups of the database

Our VMs can be configured to take hourly snapshots of their database (we recommend setting this up and testing it). Our VM administration tool (sde_admin) has a built-in command that does all the work: it dumps the data and creates the rotating deltas. On our hosted SaaS servers, we have it set up to run hourly. These dumps end up in /docs/sde/backup and can be pulled off the VMs using rsync or any other backup agent.
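
As one possible way to pull the dumps off a VM, the sketch below builds and runs an rsync command from Python. The backup directory comes from the text above; the host name, user name, and destination directory are placeholders that you would replace with your own values.

```python
import subprocess

BACKUP_DIR = "/docs/sde/backup"  # dump location on the SD Elements VM


def build_rsync_command(host: str, dest: str, user: str = "backup") -> list[str]:
    """Build an rsync command that copies the dump directory to dest.

    The user and host are assumptions for illustration only.
    """
    return ["rsync", "--archive", "--compress", f"{user}@{host}:{BACKUP_DIR}/", dest]


def pull_backups(host: str, dest: str) -> None:
    """Run rsync and raise CalledProcessError if the copy fails."""
    subprocess.run(build_rsync_command(host, dest), check=True)


# Hypothetical usage (not executed here):
# pull_backups("sde.example.com", "/srv/sde-backups/")
print(build_rsync_command("sde.example.com", "/srv/sde-backups/"))
```

Running a script like this from a scheduler shortly after each hourly snapshot keeps the off-VM copy close to the snapshot interval.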

Full backups of the database

The SD Elements databases are fairly small, so it is also reasonable to maintain full backups of the database. Again, our administration tool has a wrapper around postgres’s built-in functionality to make this straightforward.
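
The interface of the administration tool's wrapper is not shown here. For illustration only, the sketch below calls postgres's own pg_dump utility directly, which may differ from what the wrapper does. The database name "sde" and the output directory are assumptions.

```python
import subprocess
from datetime import datetime


def build_pg_dump_command(database: str, out_dir: str, when: datetime) -> list[str]:
    """Build a pg_dump command that writes a timestamped custom-format dump."""
    stamp = when.strftime("%Y%m%dT%H%M%S")
    out_file = f"{out_dir}/{database}-{stamp}.dump"
    return ["pg_dump", "--format=custom", f"--file={out_file}", database]


def full_backup(database: str, out_dir: str) -> None:
    """Run pg_dump and raise CalledProcessError if the dump fails."""
    subprocess.run(build_pg_dump_command(database, out_dir, datetime.now()), check=True)


# Hypothetical usage (not executed here):
# full_backup("sde", "/var/backups/sde")
print(build_pg_dump_command("sde", "/var/backups/sde", datetime(2024, 1, 1, 2, 0, 0)))
```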

Database replication

Using the backup methods described above, it is possible to replicate the application's database into the database of a warm standby duplicate server on an hourly basis. This can reduce the time to recovery significantly, because the backup system needs no new setup or configuration.
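
A minimal sketch of the import step on the standby server follows. It assumes the dump is a custom-format archive produced by pg_dump and uses postgres's pg_restore utility; the dumps created by the administration tool may need a different import procedure. The file path and database name are placeholders.

```python
import subprocess


def build_pg_restore_command(dump_file: str, database: str) -> list[str]:
    """Build a pg_restore command that replaces existing objects in the standby database."""
    return [
        "pg_restore",
        "--clean",       # drop existing objects before recreating them
        "--if-exists",   # avoid errors when an object is not there yet
        f"--dbname={database}",
        dump_file,
    ]


def load_into_standby(dump_file: str, database: str) -> None:
    """Run pg_restore and raise CalledProcessError if the import fails."""
    subprocess.run(build_pg_restore_command(dump_file, database), check=True)


# Hypothetical usage on the standby server (not executed here):
# load_into_standby("/srv/sde-backups/latest.dump", "sde")
print(build_pg_restore_command("/srv/sde-backups/latest.dump", "sde"))
```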

Summary of suggested deployment

As we generally don’t treat SD Elements as a mission-critical system, it likely isn’t worth the cost, operational overhead, or complexity of trying to deploy the system in a manner that meets classic high availability requirements (for example, five nines of uptime). Our suggested approach for disaster recovery is to run a “warm” spare server.

Beginning with two identical servers, designate one as the live server, and the other as the backup. The database on the live server is replicated to the database on the backup server: this replication can happen using the built-in facilities of postgres, or simply by doing periodic database dumps and imports. Users will use the live server. No one will interact with the backup server.

If the live server goes down, the backup server can be made active by using a load balancer, changing DNS records, and so on. You would need to mirror some of the file system configurations that are unique to each server, such as keyczar keys and certificates.
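
The decision to switch to the backup server is left to the operator. As a sketch of one possible policy, the function below recommends failover only after several consecutive failed health checks, so that a single missed check does not trigger a switch. The threshold of three and the function name are assumptions, not SD Elements defaults.

```python
def should_fail_over(health_checks: list[bool], threshold: int = 3) -> bool:
    """Return True if the last `threshold` health checks of the live server all failed.

    health_checks is ordered oldest to newest; True means the check passed.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    recent = health_checks[-threshold:]
    return len(recent) == threshold and not any(recent)


print(should_fail_over([True, False, False, False]))  # True
print(should_fail_over([False, False, True]))         # False
print(should_fail_over([False, False]))               # False
```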

Business continuity and disaster recovery for SaaS

To ease support and deployment, SD Elements is deployed on a single virtual appliance (Virtual Machine or VM) as a monolithic service. The VM runs nginx (front-end web proxy), apache (application server), postgres (database), and a few other small services (memcache, rabbitmq, and so on).

Service Level Objectives (SLO)

SD Elements is a non-mission critical system. As a result, the impact of disruption in service is limited to employee efficiency in product development teams. While service availability is not mission critical, the data stored in SD Elements can affect the quality of the software produced by development teams. As a result, data integrity is a more sensitive subject. Our SaaS instances have the following SLOs:

  • Recovery Point Objective (RPO): Hourly

  • Recovery Time Objective (RTO): 1 hour to 1 business day

Backups

The system is designed to store all its active data in a database that is easy to back up. A database backup can be loaded back into the appliance as long as the backup and the restore are done using the same version of the software.

In extenuating circumstances, the data can be migrated to a newer version, but never to an older version of the software.

Differential backups of the database

Our VMs are configured to do hourly snapshots of their database (recommended to set up and test). If required, these dumps can be pulled off our VMs through our support team.

Full backups of the database

A recoverable backup is encrypted and sent offsite daily. In the event of an issue, this backup will be used as the dataset should we need to fail over.
