Monday, September 3, 2007

Viability of DNS failover

In spring 2006, my site was plagued by recurrent hardware problems that caused serious downtime. At the time, the site was hosted on a dedicated server and I had no failover strategy whatsoever, so when a hard disk failed, the site could be down for a few days.

At the beginning of the summer, I got fed up and started investigating possible solutions to this problem and, after some experimentation, finally settled on DNS failover. Here are the results of the experimentation, originally posted on WebHosting Talk:

I run a site with about 1,000,000 unique visitors per month, and recent server failures made me decide to get a failover server to minimize downtime. My goal wasn't 99.999% uptime but to get back on track within a "reasonable" amount of time after a failure. After evaluating several solutions, I decided to go with DNS failover. Here's how the setup works:

1) mydomain.com points to main server with a very low TTL (time to live)
2) failover server replicates data from main server
3) when main server goes down, mydomain.com is changed to point to failover server
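
As a rough illustration, the three steps above could be driven by a small monitor like the one below. It is only a sketch under assumptions: the health-check URL, the failure threshold, the polling interval and the `update_dns` callback are all hypothetical, and the actual DNS change depends entirely on what your DNS provider offers.

```python
import time
import urllib.request
from typing import Callable, Optional

MAIN_URL = "http://mydomain.com/"   # hypothetical health-check URL
FAILURES_BEFORE_SWITCH = 3          # assumed threshold, not from the original setup


def http_probe(url: str = MAIN_URL, timeout: float = 5.0) -> bool:
    """Return True if the URL answers successfully before the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:  # URLError, timeouts and connection errors all derive from OSError
        return False


def monitor(probe: Callable[[], bool],
            update_dns: Callable[[], None],
            max_failures: int = FAILURES_BEFORE_SWITCH,
            interval: float = 60.0,
            max_checks: Optional[int] = None) -> bool:
    """Poll the main server; switch DNS after max_failures consecutive failed probes.

    Returns True if the switch was made, False if max_checks ran out first.
    """
    consecutive_failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_failures:
                update_dns()  # step 3: repoint mydomain.com at the failover server
                return True
        time.sleep(interval)
    return False
```

Requiring several consecutive failures before switching is a design choice to avoid flipping the DNS record on a single dropped request. A call might look like `monitor(http_probe, my_update_dns)`, where `my_update_dns` is whatever function talks to your DNS provider.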

The drawback is DNS propagation time: some DNS servers don't honor the TTL, and there is some caching on the user's machine and in the browser. I looked for empirical data to gauge the extent of the problem but couldn't find any, so I decided to set up my own experiment:

I start with mydomain.com pointing to the main server with a TTL of 1800 seconds (half an hour). I then change it to point to the failover server, which simply port forwards to the main server. On the main server, I periodically compute the percentage of requests coming from the failover server, which gives me the percentage of users for whom the DNS change has propagated.
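
The measurement described above can be sketched as follows. This is a minimal sketch, not the script that produced the numbers: the failover server's address, the 5-minute bucket size and the input format (pairs of timestamp and remote IP, as you might extract from an access log) are assumptions. It also assumes the port forward rewrites the source address, so forwarded requests appear to come from the failover server.

```python
from collections import defaultdict
from datetime import datetime

FAILOVER_IP = "192.0.2.10"  # hypothetical address of the failover server
BUCKET_MINUTES = 5          # assumed bucket size


def propagation_by_bucket(requests, failover_ip=FAILOVER_IP, bucket_minutes=BUCKET_MINUTES):
    """Map each time bucket to the percentage of requests relayed by the failover server.

    `requests` is an iterable of (timestamp, remote_ip) pairs as seen by the main server.
    """
    totals = defaultdict(int)
    relayed = defaultdict(int)
    for timestamp, remote_ip in requests:
        # Round the timestamp down to the start of its bucket.
        bucket = timestamp.replace(
            minute=timestamp.minute - timestamp.minute % bucket_minutes,
            second=0,
            microsecond=0,
        )
        totals[bucket] += 1
        if remote_ip == failover_ip:
            relayed[bucket] += 1
    return {bucket: 100.0 * relayed[bucket] / totals[bucket] for bucket in sorted(totals)}
```

Each value in the returned dictionary is the share of requests in that bucket that arrived through the failover server, which is the "propagated" percentage for that period.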

I made the DNS change at exactly 16:04 on 06/21/06, and here is the percentage of propagated users over time:

06/21/06 16:00 0 %
06/21/06 16:05 3 %
06/21/06 16:10 20 %
06/21/06 16:15 37 %
06/21/06 16:20 59 %
06/21/06 16:25 69 %
06/21/06 16:30 76 %
06/21/06 16:35 80 %
06/21/06 16:40 86 %
06/21/06 16:45 90 %
06/21/06 16:50 91 %
06/21/06 16:55 92 %
06/21/06 17:00 93 %
06/21/06 17:05 94 %
06/21/06 17:10 94 %
06/21/06 17:15 95 %
06/21/06 17:35 95 %
06/21/06 17:40 96 %
06/21/06 17:45 97 %
...
06/22/06 10:40 99 %

So even after more than 18 hours, about 1% of users are still going to the old server, so DNS failover is obviously not a 99.999% uptime solution. However, since more than 90% of users have propagated within the first hour, the solution works well enough for me.
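
To make the arithmetic behind these conclusions explicit, here is a small sketch that works from a subset of the measurements above. The helper names are made up for illustration; the change time, the TTL and the sample values are taken from the post.

```python
from datetime import datetime

CHANGE_TIME = datetime(2006, 6, 21, 16, 4)  # when the DNS record was changed
TTL_MINUTES = 1800 / 60                     # TTL of 1800 seconds

# (sample time, percent propagated), copied from the measurements above
SAMPLES = [
    ("06/21/06 16:30", 76), ("06/21/06 16:35", 80), ("06/21/06 16:40", 86),
    ("06/21/06 16:45", 90), ("06/21/06 16:50", 91), ("06/21/06 17:00", 93),
    ("06/22/06 10:40", 99),
]


def minutes_after_change(stamp):
    """Minutes elapsed between the DNS change and a sample timestamp."""
    sample_time = datetime.strptime(stamp, "%m/%d/%y %H:%M")
    return (sample_time - CHANGE_TIME).total_seconds() / 60


def first_sample_at_or_above(percent):
    """Minutes until the first listed sample that reached the given percentage."""
    for stamp, value in SAMPLES:
        if value >= percent:
            return minutes_after_change(stamp)
    return None


def percent_at_first_sample_after(minutes):
    """Percentage at the first listed sample taken at least `minutes` after the change."""
    for stamp, value in SAMPLES:
        if minutes_after_change(stamp) >= minutes:
            return value
    return None
```

With these samples, `first_sample_at_or_above(90)` gives 41 minutes, `percent_at_first_sample_after(TTL_MINUTES)` gives 80 (the 16:35 sample, just after the TTL expired), and `first_sample_at_or_above(99)` gives 1116 minutes, or 18 hours 36 minutes.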
