Bug in Skype for Windows client downed P2P network
Skype CIO Lars Rabbe says a cluster of critical support servers became overloaded
Skype's chief information officer, Lars Rabbe, has revealed that the recent Skype outage was caused by the peer-to-peer (P2P) network becoming unstable and suffering a critical failure.
The initial problem began when a series of servers became overloaded.
"On Wednesday, December 22, a cluster of support servers responsible for offline instant messaging became overloaded. As a result of this overload, some Skype clients received delayed responses from the overloaded servers. In a version of the Skype for Windows client [version 5.0.0152], the delayed responses from the overloaded servers were not properly processed, causing Windows clients running the affected version to crash," said Rabbe in his blog post.
When this occurred, users running the latest Skype for Windows (version 5.0.0.156), older versions of Skype for Windows (4.0 versions), Skype for Mac, Skype for iPhone, Skype on your TV, or Skype Connect or Skype Manager for enterprises were not affected.
Unfortunately, almost half of all Skype users globally were running the affected 5.0.0152 version of Skype for Windows. The crashes caused around 40% of those clients to fail, according to Rabbe. The failed clients included 25 to 30% of the publicly available supernodes, which went down with them.
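As a rough back-of-the-envelope check on those figures, the short Python sketch below multiplies the two shares together. It assumes "almost half" can be rounded to 50%, which is an approximation and not a number Rabbe gives exactly.

```python
# Rough share of all Skype clients that crashed, using the figures quoted above.
share_on_affected_version = 0.50   # assumption: "almost half" rounded to 50%
crash_rate_on_that_version = 0.40  # "around 40% of those clients"

crashed_share_of_all_clients = share_on_affected_version * crash_rate_on_that_version
print(f"Roughly {crashed_share_of_all_clients:.0%} of all clients crashed")
```

On those assumptions, about one in five Skype clients worldwide would have gone down in the first wave.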
"Once a supernode has failed, even when restarted, it takes some time to become available as a resource to the P2P network again. As a result, the P2P network was left with 25 to 30% fewer supernodes than normal. This caused a disproportionate load on the remaining available supernodes," said Rabbe.
This massive load, around 100 times normal traffic as users restarted their crashed Windows clients, caused more supernodes to shut down. The cycle repeated until Skype had failed almost completely, a few hours after the initial crisis.
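The feedback loop described above can be illustrated with a toy model. The Python sketch below is not Skype's actual supernode logic: the function name, the fixed total load, the per-node capacity and the rule for how many overloaded nodes drop out each round are all hypothetical, chosen only to show how losing part of the pool can push the remaining nodes past their limit.

```python
def simulate_cascade(total_nodes, capacity_per_node, total_load, initial_failed):
    """Toy model: return the count of live supernodes after each round.

    All parameters are hypothetical. The load is spread evenly over the
    surviving nodes; if they cannot carry it, a share of them fails.
    """
    alive = total_nodes - initial_failed
    history = [alive]
    while alive > 0:
        pool_capacity = alive * capacity_per_node
        if pool_capacity >= total_load:
            break  # survivors can absorb the load, so the network stabilises
        # Assumed rule: the fraction of load that cannot be served decides
        # what fraction of the survivors shuts down (at least one node).
        unserved_fraction = 1 - pool_capacity / total_load
        failing = max(1, int(alive * unserved_fraction))
        alive -= failing
        history.append(alive)
    return history


# Hypothetical network: 1,000 supernodes, each with 30% headroom over its
# normal share of a load of 1,000 units.
stable = simulate_cascade(1000, 1.3, 1000, initial_failed=200)
collapse = simulate_cascade(1000, 1.3, 1000, initial_failed=300)

print("20% lost:", stable)
print("30% lost: ends with", collapse[-1], "nodes after", len(collapse) - 1, "rounds")
```

In this sketch, losing 20% of the pool is survivable, while losing 30% tips the network into a collapse that only stops when no nodes are left. The model leaves out both the hundredfold traffic spike and the delay before a restarted supernode becomes available again, each of which would make a real collapse faster.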
To restore the network, the Skype engineering and operations team introduced hundreds of instances of the Skype software into the P2P network. These acted as dedicated supernodes, providing enough temporary supernode capacity to accelerate the recovery of the peer-to-peer cloud.
The team was able to stabilise the network by Christmas Eve and complete repairs on Christmas Day.
"We are truly grateful to all of our users and humbled by your continued support. We know how much you rely on Skype, and we know that we fell short in both fulfilling your expectations and communicating with you during this incident. Lessons will be learned and we will use this as an opportunity to identify and introduce areas of improvement to our software, further assess and invest in capacity and stability, and develop better processes for outage recovery and communications to our user base. Thank you to everyone," said Rabbe.