Microsoft 365, including Office 365, has suffered repeated outages over the past few days. Between Tuesday, November 19 and Thursday, November 21, 2019 (so far), there have been outages, timeouts, and problems with SharePoint, OneDrive, and various parts of Azure AD (AAD). Earlier in the week, those outages extended to Yammer, Microsoft Teams, and more.
Exoprise customers, of course, knew of the Microsoft 365 outages well before the service health updates or notifications from Microsoft arrived. Importantly, they also know when service is actually restored, rather than relying on an automated Microsoft status update that might indicate the problem is resolved only for some tenants and locations. It’s important to know when service is really restored so you can communicate with your end users and business.
These outages appear to be Microsoft’s fault, but that isn’t always the case. Exoprise CloudReady is often used to detect and isolate network outages, ISP performance issues, and other problems along the service delivery chain.
Let’s look at some of the telemetry that Exoprise customers get and compare it with Microsoft’s status feed on Twitter and its Service Communications status messages.
As of Thursday November 21st, 2019
For this latest outage, we’ll start with Microsoft’s Twitter Status which indicates a problem:
We’re investigating an issue under SP196186 where users may be unable to run CSOM and API call scripts. More details are available in the admin center.
— Microsoft 365 Status (@MSFT365Status) November 21, 2019
You can see that Microsoft specifically calls out outages for SharePoint, OneDrive, and the Portal. They indicate a start time of 2019-11-21 14:45 (UTC) for this particular outage.
Here is a real-time screenshot of the CloudReady dashboard for one of our example company tenants, General Sawdust (it’s our Contoso). You can see the errors and slowdowns highlighted, alongside the integrated Office 365 Service Health.
Early Notifications, Integrated Confirmation
Exoprise customers generally get notifications of issues and infrastructure problems before Microsoft status updates arrive, sometimes hours in advance. For Thursday’s OneDrive, SharePoint, and other outages, Microsoft’s integrated communications status messages arrived at about the same time for this particular tenant.
Other Exoprise customers were notified much earlier.
Real-time Outages and Errors Apparent In Various Forms
You can see the main performance dashboard grids for the sensors showing gaps and skips where they generated errors and the services were unavailable.
The 7-day availability charts will ultimately report availability at less than 90% for the two days.
Microsoft Service Health Messages for Outage
For this outage on Thursday, November 21, Microsoft provided the following information, though some of the details changed later on.
Title: Delays or problems loading SharePoint Online sites
User Impact: Users may experience intermittent delays or navigation errors when accessing SharePoint sites.
Current status: We’re reviewing available diagnostic data, as well as traffic prioritization within the potentially affected SharePoint Online infrastructure, to isolate the cause of this impact.
Scope of impact: This issue could potentially affect any user intermittently when attempting to access SharePoint Online sites.
Next update by: Thursday, November 21, 2019, at 6:00 PM UTC
Example notification for the General Sawdust tenant for the SharePoint outage. This and other integrated notifications are part of the CloudReady platform. These messages can be integrated with many other systems using webhooks, email hooks, included text messages, and more.
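If you forward a service health message like the one above into another system, it helps to turn its labeled fields into structured data first. The following is a minimal sketch, assuming the message arrives as plain text with one "Label: value" pair per line, as in the example above. The function name and the field keys are our own illustration, not part of any Microsoft or CloudReady API, and the date parsing assumes English day and month names.

```python
from datetime import datetime, timezone

def parse_health_message(text):
    """Split 'Label: value' lines into a dict keyed by a normalized label."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ": " only, so times such as "6:00 PM" stay intact.
        label, sep, value = line.partition(": ")
        if sep and value.strip():
            key = label.strip().lower().replace(" ", "_")
            fields[key] = value.strip()
    return fields

# Sample text copied from the service health message quoted above.
message = """Title: Delays or problems loading SharePoint Online sites
User Impact: Users may experience intermittent delays or navigation errors when accessing SharePoint sites.
Next update by: Thursday, November 21, 2019, at 6:00 PM UTC"""

fields = parse_health_message(message)

# Assumption: timestamps always follow this exact English-language layout.
next_update = datetime.strptime(
    fields["next_update_by"], "%A, %B %d, %Y, at %I:%M %p UTC"
).replace(tzinfo=timezone.utc)

print(fields["title"])
print(next_update.isoformat())
```

A continuation line that happens to contain ": " would be misread as a new field, so a production parser would need stricter rules than this sketch.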
Outages Resolved, Service Restored
Ultimately, the outage was resolved and service completely restored at 2019-11-22 01:56 (UTC). The outage appears in the service health dashboard as follows:
SP196346 – Delays or problems loading SharePoint Online sites
Final status: We completed deployment and have verified via monitoring that the service is fully restored.
Scope of impact: This issue could have potentially affected any user intermittently when attempting to access SharePoint Online sites.
Start time: Thursday, November 21, 2019, at 2:45 PM UTC
End time: Friday, November 22, 2019, at 1:56 AM UTC
Preliminary root cause: We identified a code misconfiguration within a recent deployment that caused load balancers to process traffic inefficiently.
We’re reviewing the cause of the misconfiguration in the code to detect and prevent similar issues from happening in the future.
We’ll publish a post-incident report within five business days.
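To put those start and end times in perspective, here is a small sketch that computes the outage duration and the resulting availability for a reporting window. It uses only the times Microsoft reported above, not CloudReady sensor data, so real measured availability for a given tenant or location can differ. The function name is our own, and it assumes the outages passed in do not overlap each other.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

def availability_pct(outages, window_start, window_end):
    """Percent of the window not covered by the (start, end) outages.

    Assumes the outages do not overlap one another.
    """
    window = window_end - window_start
    down = timedelta(0)
    for start, end in outages:
        # Count only the part of each outage that falls inside the window.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down += clipped_end - clipped_start
    return 100.0 * (1 - down / window)

# Start and end times as reported by Microsoft for the SharePoint incident.
sp_outage = (
    datetime(2019, 11, 21, 14, 45, tzinfo=UTC),
    datetime(2019, 11, 22, 1, 56, tzinfo=UTC),
)

duration = sp_outage[1] - sp_outage[0]

# Availability for the calendar day of November 21 (UTC).
day_start = datetime(2019, 11, 21, tzinfo=UTC)
nov21 = availability_pct([sp_outage], day_start, day_start + timedelta(days=1))

print(duration)         # 11:11:00
print(round(nov21, 2))  # 61.46
```

Treating the whole reported interval as downtime is a worst-case simplification, since Microsoft described the impact as intermittent.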
Service Restored, Real-time Updated Dashboard
For completeness, here’s what service restored looks like from a synthetic monitoring perspective within CloudReady.
Tuesday, November 19th, 2019 Microsoft 365 Outages
Earlier in the week, Microsoft 365 experienced similar outages that Microsoft also attributed to network changes. Here’s a timeline and brief analysis of the event.
We’re investigating an issue preventing access to Microsoft 365 services. We’ll provide additional details shortly on https://t.co/AEUj8uAGXl.
— Microsoft 365 Status (@MSFT365Status) November 20, 2019
Recent Microsoft 365 Outage
For the Tuesday outage, Microsoft cited a recent networking update as the cause, which affected a broader portfolio of applications including Teams, Exchange Online, SharePoint Online, Yammer, and Skype for Business Online. This outage lasted well over two hours.
We’ve identified and reverted a networking build that caused user traffic from the internet to Microsoft 365 services to intermittently fail, and are seeing early signs of recovery. Refer to https://t.co/AEUj8uAGXl for additional details, or MO196220 in the admin dashboard.
— Microsoft 365 Status (@MSFT365Status) November 20, 2019
The final resolution, as posted in the Microsoft 365 Admin Center, is as follows:
Title: Unable to access Microsoft 365 services
User Impact: Affected users may have been intermittently unable to access one or more Microsoft services.
Final status: Our investigation identified that a portion of infrastructure entered a degraded state. Automated recovery procedures repaired this problem, and we confirmed that service was restored after monitoring the environment.
Scope of impact: Impact was specific to a subset of users who were served through the affected infrastructure.
Start time: Tuesday, November 19, 2019, at 3:30 PM UTC
End time: Tuesday, November 19, 2019, at 4:15 PM UTC
This is the final update for the event.
Service Restored for November 22, 2019
It was a difficult week for Microsoft 365, with outages mostly due to network changes. Not every tenant was affected, and service was restored fairly quickly.
For customers, it’s best to have early detection and communications in place so that your business isn’t left in the dark. Additionally, documenting Service Level Agreement violations is important for receiving service credit refunds.
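As a rough illustration of why that documentation matters, the sketch below turns documented downtime into a monthly uptime percentage and looks up a credit tier. The downtime figures are the two incident durations Microsoft reported above (45 minutes on November 19 and 11 hours 11 minutes on November 21 and 22), counted as full downtime, which is a worst-case assumption. The tier table is a hypothetical placeholder, not quoted from Microsoft’s SLA, so substitute the thresholds from your own agreement.

```python
def monthly_uptime_pct(total_minutes, downtime_minutes):
    """Uptime as a percentage of the minutes in the billing month."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

def credit_pct(uptime_pct, tiers):
    """Return the largest credit whose uptime threshold was missed.

    tiers is a list of (uptime_threshold_pct, credit_pct) pairs.
    """
    credit = 0
    for threshold, tier_credit in tiers:
        if uptime_pct < threshold:
            credit = max(credit, tier_credit)
    return credit

# Hypothetical tiers for illustration only; check your actual SLA.
HYPOTHETICAL_TIERS = [(99.9, 25), (99.0, 50), (95.0, 100)]

# Worst-case downtime from the two Microsoft-reported incidents, in minutes.
downtime = 45 + (11 * 60 + 11)

# November has 30 days.
november_minutes = 30 * 24 * 60

uptime = monthly_uptime_pct(november_minutes, downtime)
print(round(uptime, 2))                        # 98.34
print(credit_pct(uptime, HYPOTHETICAL_TIERS))  # 50
```

Actual SLA calculations are typically based on affected users and measured downtime, which is why independent, timestamped records of each outage are useful when filing a claim.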