My 2 cents on the AWS failure and lessons learned from the past.

Single Point of Failure

A lot has already been published about the AWS EC2 failure. I wanted to add my 2 cents because it reminded me of a notorious event that happened to DoubleClick in August 2000.

What AWS and their customers experienced is unfortunate, but it can and will happen to anyone! In IT we deal with complex systems (hardware, software, and people), and things are bound to break at some point. Failure is not limited to IT: human history is full of automobile recalls, bank failures, nuclear disasters, and collapsing bridges. What people should understand is that failure is bound to happen, so be ready for it and learn from it to avoid repeating it.

Let’s be real: very few companies have the money and resources to run a redundant transactional system in parallel as a backup. Most companies just have to fail “nicely”. You should have plans and processes for everything from troubleshooting and recovering from the failure to notifying customers about it. Most importantly, architect your applications and systems so they fail nicely and can recover.

Companies that have websites or web applications must be able to redirect all requests to a “Service is down” webpage. Mobile or desktop applications relying on APIs might need special logic built in for such failures. However, if you are a company delivering services to other websites via tags, like ad serving or widgets, things get a little more complicated. You cannot remove the tags from the webpages unless your clients have built that ability into their pages. You need to be able to deliver from another location, at least enough to ensure your tags do not hurt the web performance and usability of your clients’ websites!
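For an application that calls an API, that failure logic can be as small as trying a primary location, then a backup, then a safe default. Here is a minimal sketch in Python. The endpoint URLs, the `fetch_with_failover` name, and the `flaky_fetch` stand-in are all hypothetical and only illustrate the idea.

```python
from typing import Callable, Optional, Sequence


def fetch_with_failover(
    endpoints: Sequence[str],
    fetch: Callable[[str], str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Try each endpoint in order; return `default` if all of them fail."""
    for url in endpoints:
        try:
            return fetch(url)
        except Exception:
            # This location is down or unreachable; move on to the next one.
            continue
    # Nothing answered: fail "nicely" instead of raising into the caller.
    return default


# Hypothetical endpoints, for illustration only.
ENDPOINTS = ["https://api.primary.example/v1", "https://api.backup.example/v1"]


def flaky_fetch(url: str) -> str:
    """Stand-in for a real HTTP call: here the primary is 'down'."""
    if "primary" in url:
        raise TimeoutError("primary did not answer in time")
    return "ok from " + url


print(fetch_with_failover(ENDPOINTS, flaky_fetch))
```

In real use the `fetch` function would need its own short timeout, otherwise a hanging primary would stall the caller before the backup is ever tried.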

Back at DoubleClick we ran a fairly large infrastructure delivering billions of impressions; the DART tags were present on almost every major website. One day in 2000 we had a really bad outage and our tags “stopped” working because the ad serving system experienced a catastrophic meltdown. Customers were not happy, but they understood that technology fails sometimes, and they had SLAs to protect them. What they were most unhappy about was that the DoubleClick ad tag had such an incredible impact on the performance of their sites. Webpages slowed to a crawl or stopped loading; the user experience was horrible! Our clients couldn’t recover from our failure. Some were able to remove the tags via their Content Management Systems, but others just had to suffer through it.

So we went back to the drawing board and built a complete secondary system capable of handling the billions of ad calls, but one that would deliver only 1×1 pixels or empty JavaScript. In case of a major outage the ads would not work, but at least we would not take down the customer’s entire site and its user experience with us. That “Dot” system was never used in real life, but it was always there in case we needed it.
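To make the idea concrete, here is a rough sketch in Python of what such a fallback responder could look like. It is not how the DoubleClick system was built. The `.js` path convention, the function and class names, and the port are assumptions made for illustration.

```python
import base64
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

# A commonly used 1x1 transparent GIF, base64-encoded.
PIXEL_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def dot_response(path: str) -> Tuple[str, bytes]:
    """Return (content type, body) for an ad call during an outage."""
    # Assumption: script tags request paths ending in ".js";
    # every other call is treated as an image request.
    if path.split("?", 1)[0].endswith(".js"):
        return "application/javascript", b""
    return "image/gif", PIXEL_GIF


class DotHandler(BaseHTTPRequestHandler):
    """Answers every GET with an empty script or a 1x1 pixel."""

    def do_GET(self) -> None:
        content_type, body = dot_response(self.path)
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Cache-Control", "no-store")
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the fallback server until interrupted (not called here)."""
    HTTPServer(("", port), DotHandler).serve_forever()
```

The point of the design is that the response is tiny and needs no database or ad selection logic, so the fallback can answer quickly even when everything behind the real ad server is down.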

The first lesson for companies that provide services to other websites is to not rely on a single vendor for hosting: spare a few hundred dollars and get a backup plan. So next time AWS or anyone else goes down, you will not hurt the user experience of the folks visiting your customers’ sites. And once you have that backup system in place, test it frequently! Make sure the right folks know when to pull the trigger and that the system has the capacity to handle the load.

The second lesson is about diversification: do not put all your eggs in one basket. If you go with vendor A for hosting, choose vendor B for DNS, vendor C for CDN…

Lastly, if you are a website relying on 3rd party vendors, make sure you monitor them. Also learn about their technology and their vendors: who they rely on for hosting, who their DNS provider is, and most importantly what their backup plans are in case that tag slows to a crawl!
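Monitoring a vendor can start very simply: time each call to their tag and flag it when it fails or gets slow. The sketch below is one way to do that in Python. The `check_vendor` name, the result fields, and the threshold are assumptions, and `fetch` stands for whatever call actually loads the vendor's tag.

```python
import time
from typing import Callable, Dict


def check_vendor(fetch: Callable[[], object], max_seconds: float) -> Dict[str, object]:
    """Time one call to a vendor and report whether it failed or was slow."""
    start = time.monotonic()
    try:
        fetch()
        ok = True
    except Exception:
        # Any error counts as a failed check; the time spent is still recorded.
        ok = False
    elapsed = time.monotonic() - start
    return {"ok": ok, "seconds": elapsed, "slow": elapsed > max_seconds}


# Stand-in for a slow third-party tag: it "responds" after 50 ms.
result = check_vendor(lambda: time.sleep(0.05), max_seconds=0.01)
print(result["ok"], result["slow"])
```

Run on a schedule from outside your own hosting, a check like this tells you a vendor is dragging your pages down before your visitors do.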

The cloud is great, and it is the future of IT, but do not drink too much of the kool-aid (or “cloud-aid”). Be ready for outages and failures!

Mehdi - one of the guys who handled those angry customer phone calls in 2000.

For more about the AWS issue: The Big List of Articles on the Amazon Outage
