Channel: Outage – Catchpoint's Blog

The Domino Effect of LinkedIn’s DNS Outage

On Wednesday night, users trying to access the LinkedIn site were redirected to servers owned by Confluence Labs. The outage lasted a couple of hours, and a post mortem on Thursday attributed the issue to human error at Network Solutions, the provider of LinkedIn’s DNS.

The immediate effects of the outage are obvious: upset LinkedIn users, ad revenue loss, alarmist media coverage, possibly some brand damage, and a reminder of last year’s security leak. Sadly, the incident impacted not only the LinkedIn website; it also triggered web performance problems on websites using LinkedIn’s Share Plugin.

Web Performance Impact on Other Websites

During the LinkedIn DNS downtime, our monitoring agents observed failures on other websites such as The Atlantic, The Daily Beast, and The Economist.  The problems lasted about 6.5 hours, from 6 PM PT Wednesday to 2:30 AM PT Thursday.

The impacted sites embedded the LinkedIn Social Plugin in their webpages through an inline script reference to “platform.linkedin.com”. Unfortunately, this type of code blocks the browser from rendering the HTML that follows it until the JavaScript file from LinkedIn has been loaded and executed, or the request has failed. When the LinkedIn DNS problem started, the JavaScript calls to “platform.linkedin.com” resulted in an infinite chain of redirects.

Before the outage, under normal performance, this is the waterfall of the LinkedIn tags on a website:

[Figure: website waterfall under normal conditions]

During the outage, platform.linkedin.com instead responded with multiple 302 redirects to itself.

[Figure: website waterfall during the LinkedIn outage]

The recursive redirects caused serious performance degradation, as the browser could not render the rest of the webpage HTML until it gave up on the redirect chain. In addition, the Confluence servers could not handle the load of requests from all the users who were trying to reach LinkedIn, which resulted in slow response times.
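The redirect behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `redirects` table stands in for the server's 302 responses, and the function name, the result shape, and the default limit of 20 hops are chosen for illustration rather than taken from the incident data.

```typescript
// Hypothetical result shape for this sketch.
type RedirectResult =
  | { outcome: "resolved"; finalUrl: string; hops: number }
  | { outcome: "loop"; repeatedUrl: string; hops: number }
  | { outcome: "too-many-redirects"; hops: number };

// Simulated server behaviour: a URL maps to the Location it redirects to.
// URLs that are absent from the table are treated as final responses.
type RedirectTable = { [url: string]: string | undefined };

function followRedirects(
  startUrl: string,
  redirects: RedirectTable,
  maxHops: number = 20, // assumed limit, for illustration only
): RedirectResult {
  const seen: { [url: string]: true } = {};
  let url = startUrl;
  let hops = 0; // number of redirects followed so far

  while (true) {
    const next = redirects[url];
    if (typeof next !== "string") {
      // No redirect for this URL: the chain ends here.
      return { outcome: "resolved", finalUrl: url, hops };
    }
    seen[url] = true;
    if (seen[next] === true) {
      // The target was already visited, so the chain can never finish.
      return { outcome: "loop", repeatedUrl: next, hops: hops + 1 };
    }
    if (hops >= maxHops) {
      // Following one more redirect would exceed the limit: give up.
      return { outcome: "too-many-redirects", hops };
    }
    url = next;
    hops++;
  }
}
```

A host that redirects to itself, as platform.linkedin.com did during the outage, is reported as a loop after the first hop. A real browser does not keep such a table; it simply keeps requesting until it reaches its own limit, and that waiting is what delayed the pages.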

When looking at the performance of “platform.linkedin.com”, you can see that LinkedIn response times spiked.

[Figure: platform.linkedin.com performance]
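A spike like this is the kind of signal an automated check can catch. The following is a minimal sketch, not a description of how any monitoring product works: the function names, the use of medians, and the threshold factor of 3 are all assumptions made for illustration.

```typescript
// Median of a list of response times in milliseconds.
function median(values: number[]): number {
  if (values.length === 0) {
    throw new Error("cannot take the median of an empty list");
  }
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Flags degradation when the recent median response time exceeds the
// baseline median by more than `factor` (an assumed, tunable threshold).
function isDegraded(
  baselineMs: number[],
  recentMs: number[],
  factor: number = 3,
): boolean {
  return median(recentMs) > factor * median(baselineMs);
}
```

Comparing medians rather than single measurements keeps one slow request from raising an alert, while a sustained slowdown of a third-party host still crosses the threshold.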

In the Web Performance community this kind of event is referred to as a SPoF, a “Single Point of Failure”. The most unfortunate part of a SPoF caused by a vendor tag is that the site’s end users will blame the site itself for the issue. End users have no idea what is happening behind the scenes, or which social or ad-serving tag is causing the slowness.

You can see the impact on the overall page performance below:

[Figure: LinkedIn outage impact on website performance]

The Fragility of the Web and the Security Nightmares

The incident clearly shows how fragile and delicate the web can be.  Third party tags can easily impact performance of a website and they can become gateways to users on thousands of sites.

Imagine for a moment that LinkedIn’s DNS or servers had been hijacked (as the media first assumed): the effect on the web would have been devastating. Not only would visitors to linkedin.com be impacted, but so would any visitor to a webpage that embedded the LinkedIn tag. The hijacker would have the ability to take down a large portion of the web, deliver malicious software to millions of people, or steal who knows what information.

Lessons Learned

So what can websites do to deal with such situations?

  1. Before you put a tag on the webpage make sure you know who you are doing business with, understand how secure and scalable their system is, and most importantly ensure you are protected with an SLA and contract agreement from that vendor.
  2.  Ensure third party tags load asynchronously, so they don’t block the content. When possible, mitigate security risks by relying on iframes served from a different domain than the webpage. There are plenty of articles on this topic, or you could simply rely on a Tag Management System. This way, your team can quickly remove unresponsive tags and react immediately to third party issues.
  3.  Monitor, monitor, monitor.  Always know how your site and providers are performing and have an alerting mechanism set in place to notify upon performance degradation.
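
The second recommendation, loading third party tags asynchronously, can be sketched as follows. This is one possible approach rather than the loader any vendor ships: `loadTagAsync`, `DocumentLike`, `ScriptLike`, and the status names are hypothetical, and the small interfaces exist only so the sketch type-checks without browser typings. In a page, the browser's `document` and the script element it creates would play these roles.

```typescript
// Minimal stand-ins for the parts of the DOM this sketch touches.
interface ScriptLike {
  src: string;
  async: boolean;
  onload: (() => void) | null;
  onerror: (() => void) | null;
}

interface DocumentLike {
  createElement(tagName: "script"): ScriptLike;
  head: { appendChild(node: ScriptLike): void };
}

type TagStatus = "loaded" | "failed" | "timed-out";

// Injects a third party script without blocking the rest of the page and
// reports exactly one outcome: loaded, failed, or timed out.
function loadTagAsync(
  doc: DocumentLike,
  src: string,
  timeoutMs: number,
  onDone: (status: TagStatus) => void,
): void {
  let settled = false;

  // Stop waiting after timeoutMs. The request itself is not cancelled;
  // the page just no longer depends on its result.
  const timer = setTimeout(function () {
    finish("timed-out");
  }, timeoutMs);

  function finish(status: TagStatus): void {
    if (settled) return; // report only the first outcome
    settled = true;
    clearTimeout(timer);
    onDone(status);
  }

  const script = doc.createElement("script");
  // Dynamically inserted scripts do not block HTML parsing;
  // async is set explicitly to make the intent clear.
  script.async = true;
  script.onload = () => finish("loaded");
  script.onerror = () => finish("failed");
  script.src = src;
  doc.head.appendChild(script);
}
```

With a loader of this kind, a vendor host that never answers delays only its own widget. The page content renders regardless, and the callback gives the site a place to hide the widget or log the failure.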

Lastly, kudos to the LinkedIn, Confluence Labs, and Network Solutions teams for their quick reactions and transparency during the resolution and post mortem following – an example every company should follow.

Mehdi – Catchpoint

