Facebook global outage: Impact, causes, and takeaways
On October 4, around 11:40 AM Eastern Time, Facebook and its associated services (Instagram, WhatsApp) started showing signs of an outage. Shortly after, the problem escalated to a global scale and the services came to a grinding halt.
The outage lasted around 7 hours and affected users all over the world. As a technologist responsible for the IT infrastructure of a small startup, I was quite interested to learn what really caused such a major outage for a technology giant like Facebook.
What was the impact?
If you think about the services that went down, you may feel, "Oh, it's just social media". That is true if you are someone like me who goes on Facebook and Instagram to kill time. But let's not forget that these services generate billions in ad revenue, and that influencers and others rely on them as a primary source of income.
Also, during the pandemic, many gatherings have moved online, and it is not a good experience when you have planned a live stream and the service is completely down.
Facebook is also used as a primary social login for many websites, and users rely on it extensively. The outage therefore affected everyone trying to authenticate with Facebook, impacting not just Facebook's own services but anything that depends on them.
Let's not forget WhatsApp. It is the de facto messaging app in many parts of the globe; I use it daily to stay in touch with friends and family. Many small merchants also rely on WhatsApp alone to run their businesses, using business accounts and bots. Imagine all of that order-taking and customer communication cut off for more than 7 hours.
So, even though these services may look like "just social media", their impact today is far-reaching and no longer limited to getting in touch with someone. Many businesses rely on them for daily operations, so the impact of this outage cannot be sidelined.
The outage affected not only Facebook's users but also its employees: the internal network faced issues, and badges stopped working, leaving employees unable to swipe in or out of the buildings 😲.
What went wrong?
As of this writing, we don't have a complete root cause analysis from the Facebook team. But per the initial communication, it has something to do with a misconfiguration on the routers that prevented DNS resolution from reaching the Facebook services.
To quote the Facebook engineering team's post:
Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.
Though this doesn't provide much detail, it is safe to assume that someone manually pushed a "misconfiguration" that propagated through the network and made even the internal systems and data centers inaccessible.
I had posted this news on Reddit, and there was a lot of discussion around it. Many (including me) initially thought it was related to the DNS services. But based on comments from users with better knowledge, I found that the DNS records themselves were intact; resolvers were simply never able to reach the Facebook name servers. It looked like a DNS issue, but behind the scenes, Facebook had unplugged itself from the internet!
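A toy sketch can make this distinction concrete. The class, record data, and IP below are all hypothetical illustrations (not Facebook's actual infrastructure): the point is that an authoritative server with perfectly valid records still produces failed lookups when no network route reaches it, which from the outside looks exactly like "DNS is down".

```python
# Toy authoritative DNS server (hypothetical names and IPs for illustration).
# It distinguishes two very different failure modes:
#   - NXDOMAIN: the record genuinely does not exist (a real DNS-level answer)
#   - TimeoutError: the server is unreachable (e.g., its routes were withdrawn)
class ToyAuthoritativeServer:
    def __init__(self, records):
        self.records = records      # name -> IP address string
        self.reachable = True       # flips to False when routes disappear

    def query(self, name):
        if not self.reachable:
            # No route to the server: the query never arrives, so it times out.
            raise TimeoutError("no route to name server")
        if name not in self.records:
            return "NXDOMAIN"       # record missing: an actual negative answer
        return self.records[name]   # record present: a normal answer

server = ToyAuthoritativeServer({"example.social": "203.0.113.10"})
print(server.query("example.social"))     # normal answer while reachable
server.reachable = False                  # routes withdrawn: server cut off
try:
    server.query("example.social")
except TimeoutError as e:
    print("query failed:", e)             # records intact, server unreachable
```

The records never changed between the two queries; only reachability did. That is why checking DNS alone could not reveal what was actually broken.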
Many of the posts also linked the misconfiguration to BGP (Border Gateway Protocol), which advertises, or announces, the address ranges handled by a router so that packets can reach the right server. It looks like Facebook's DNS resolution was also tied to its own BGP announcements, so the withdrawal affected not just the public Facebook services but the internal networking as well, making access to internal tools and services (even badge swiping for entry/exit) difficult.
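To illustrate the mechanism, here is a minimal sketch of a routing table doing longest-prefix matching, the way a router forwards packets based on announced prefixes. The class, prefix, and next-hop label are hypothetical examples, not Facebook's real configuration; the sketch only shows how withdrawing a prefix makes every address inside it unreachable, healthy servers included.

```python
# Toy longest-prefix-match routing table (illustrative, not a real BGP speaker).
import ipaddress

class ToyRouteTable:
    def __init__(self):
        self.routes = {}  # ip_network -> next-hop label (hypothetical)

    def announce(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        matches = [p for p in self.routes if addr in p]
        if not matches:
            return None  # no covering prefix: packets to this address are dropped
        best = max(matches, key=lambda p: p.prefixlen)  # longest prefix wins
        return self.routes[best]

table = ToyRouteTable()
table.announce("198.51.100.0/24", "example-edge")  # hypothetical prefix holding a DNS server
print(table.lookup("198.51.100.53"))   # reachable while the prefix is announced
table.withdraw("198.51.100.0/24")      # the "misconfiguration": prefix withdrawn
print(table.lookup("198.51.100.53"))   # None: the DNS server just fell off the internet
```

Once the prefix is withdrawn, there is simply no route left for any address it covered, which is why even a fully functional DNS server becomes invisible to the rest of the world.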
Some Reddit users claiming to be from Facebook commented that they were coordinating through Discord in the war room. That alone should show the magnitude of the outage.
For those interested in the full technical and networking causes, you can follow the Twitter thread below from Rob Errata.
We will only know the complete reason for the outage once (or if!) Facebook publishes the RCA (root cause analysis). From the looks of it, as noted above, it mostly comes down to a BGP misconfiguration.
There is no denying that 100% uptime for any service is a myth. Almost every service goes down in part once in a while. But Facebook's outage was particularly interesting because it locked out even internal services, and it seems to have been triggered by a manual error or misconfiguration by one or more administrators.
The key takeaway from this incident, as I understand it, is the need to separate networks and resources. The failure of the internal network clearly delayed the overall resolution, since key tools and monitoring services became inaccessible. The second takeaway concerns Facebook's "move fast, break things" philosophy: it may work for feature development, but not at the networking or core infrastructure level. You need strict validations and multiple levels of review for strategic configuration changes, because even if you automate all the processes, there is always a human factor that can go wrong.
If you are interested in more details and analysis, I would recommend the articles below.