Microsoft Teams had an outage yesterday, February 3rd, 2020, that lasted a few hours, apparently due to an expired TLS/SSL certificate. The Internet erupted in delight over this for some reason, with lots of coverage everywhere. Slack and MS Teams usually garner more attention when they experience outages, most likely due to their rivalry. Apparently, because this was an expired certificate, it gave everyone a reason to write “Oops” and other similar quips. Here are a couple of examples:
- Microsoft Teams goes down after Microsoft forgot to renew a certificate
- Microsoft Teams has been down this morning
At Exoprise, we’re always a little fascinated when an outage hits the “tape”, so to speak, and catches the attention of major digital outlets like TechCrunch and The Verge. Cloud outages happen all the time, whether due to the provider, the Internet, peering exchanges or local gateways. As outages go, this Teams outage was pretty detrimental and serves as a good reminder that cloud services, security, certificates and encryption are always difficult. We especially enjoy all the geeks on Hacker News who think it can all be solved with a Let’s Encrypt certificate and a little automation. They really grasp the scale of Office 365 / Microsoft Teams </sarcasm>.
Early Detection With CloudReady
As always, our service detected the outage and made our customers aware of it well before Microsoft reported the problem. At around 8:50 am, our synthetic Teams Messaging and Audio Video sensors started reporting outages and timeouts while trying to sign in, send messages to particular teams and start AV conferences within Teams. Here’s an example alarm email from our system:
We had a number of these alarms fire for different environments so we knew the outage was escalating and spreading across continents. Additionally, we usually rank very high for Microsoft Teams monitoring and outages, so our website (www.exoprise.com) started getting lots of activity, downloads, signups and CloudReady Health Report subscriptions. These are all strong signals that there’s an outage happening.
Microsoft confirmed the outage and posted an Office 365 Service Health bulletin about thirty minutes later, when our customers were already well aware of it. Since Exoprise has the Office 365 Service Communications integrated into its portal, as well as the ability to distribute the notifications via email, everyone gets to see those notices as soon as they are published. Here’s a sample email of what was seen and distributed for this Teams outage.
The first notice didn’t reveal the underlying problem, but it was updated later in the morning to indicate that an authentication certificate needed updating:
| PublishedTime: | Mon, 03 Feb 2020 15:03:33 +0000 |
| MessageText: | Title: Can’t access Microsoft Teams User Impact: Users may be unable to access Microsoft Teams. Current status: We’ve determined that an authentication certificate has expired causing users who have logged out and those that are still logged in to have issue using the service. We’re developing a fix to apply a new authentication certificate to the service which will remediate impact. Scope of impact: This issue may potentially affect any of your users attempting to access Microsoft Teams. Start time: Monday, February 3, 2020, at 1:15 PM UTC Root cause: An authentication certificate has expired causing users who have logged out and those that are still logged in to have issue using the service. Next update by: Monday, February 3, 2020, at 4:30 PM UTC |
You can also see the various Twitter updates of Microsoft’s outage notification and when they started to roll out a fix at 11:20 AM ET. Microsoft suggested that service was restored for most tenants by 12 PM ET and finally confirmed that the fix was successfully deployed at 4:27 PM ET.
Another benefit to our customers is that they know, regardless of the excellent and proactive communications from Microsoft, when fixes are really fully deployed. Often, a fix is deployed but takes time (there’s that scale thing again) to roll out to all tenants and environments. It’s very important for help desks, operators and administrators to know when something is really fixed so they can properly respond and pass the word along. Here’s an example screenshot from one of our Teams sensors where you can see service wasn’t restored until around 4 PM ET.
Managing Teams Outage
Nobody likes an outage, especially one that could have been prevented with robust monitoring software like Exoprise CloudReady, which handles both Microsoft Teams monitoring and SSL certificate expiration checking in one platform. Get started today to know about outages and certificate expirations in real time.
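For readers curious what a basic certificate expiration check involves, here is a minimal sketch in Python using only the standard library. It is not how CloudReady works internally. The hostname, the 30-day warning threshold and all of the function names are assumptions chosen for illustration. It also only inspects the public TLS certificate presented on port 443, which is not necessarily the internal authentication certificate that Microsoft said had expired.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` until a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Feb  3 13:15:00 2020 GMT". A negative result means expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def classify(days_left: float, warn_days: float = 30) -> str:
    """Map remaining days to a status label (threshold is an assumption)."""
    if days_left < 0:
        return "expired"
    if days_left <= warn_days:
        return "expiring"
    return "ok"


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a verified TLS connection and return the peer cert's notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_host(host: str, warn_days: float = 30) -> str:
    """Return 'ok', 'expiring', 'expired' or 'invalid' for a host."""
    try:
        not_after = fetch_not_after(host)
    except ssl.SSLCertVerificationError as err:
        # Verification fails before we can read the certificate, so use the
        # error text to separate an expired cert from other failures.
        return "expired" if "expired" in str(err).lower() else "invalid"
    days_left = days_until_expiry(not_after, datetime.now(timezone.utc))
    return classify(days_left, warn_days)


if __name__ == "__main__":
    # Placeholder hostname; substitute the endpoints you care about.
    print(check_host("teams.microsoft.com"))
```

Run on a schedule against every endpoint you depend on, a check like this gives weeks of warning rather than an outage. The hard part at scale is knowing which certificates exist and where they are deployed, which a single-host probe does not address.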