
It looks like the SharePoint and OneDrive outages have continued into this month, with further incidents on December 2nd.

Please do sign up for a free trial of our service to see how you can diagnose outages like this, recover service credits, and know when services really recover.

Final Update:
Microsoft published a final update on the week-plus of difficulties that may have affected some SharePoint and OneDrive sites. This final status is from 07 Dec 2019 02:25 (UTC).

We’ve confirmed that a prior configuration change within the SharePoint Online caching infrastructure inadvertently caused increased failure rates and latency for customer request traffic into SharePoint Online and OneDrive for Business. As a result, users hosted on the infrastructure affected by this change would experience latency and intermittent failures accessing these services or adjacent products with dependencies on them.

Microsoft 365, including Office 365, has been suffering repeated outages over the past few days. Between Tuesday, November 19th and Thursday, November 21st, 2019 (so far), there have been outages, timeouts and problems with SharePoint, OneDrive, and various parts of Azure AD (AAD). Earlier in the week those outages extended to Yammer, Microsoft Teams and more.

Cloud and Network Outages
Outages Happen. Sometimes they’re under your control and sometimes they aren’t.

Exoprise customers, of course, knew of the Microsoft 365 outages well before the service health updates or notifications from Microsoft arrived. Importantly, they also know when service is actually restored, rather than relying on an automated Microsoft status update that might indicate the problem is resolved for only some tenants and locations. It's important to know when service is really restored so you can communicate to your end users and business.

These outages appear to be Microsoft’s fault but that isn’t always the case. Often Exoprise CloudReady is used to detect and isolate network outages, ISP performance issues and other issues along the service delivery chain.

Let’s look at some of the telemetry that Exoprise customers get and examine some of the status feeds from Twitter and Service Communications Status feeds from Microsoft.

As of Thursday November 21st, 2019

For this latest outage, we’ll start with Microsoft’s Twitter Status which indicates a problem:

 

Microsoft Office 365 Service Communications Status

You can see that Microsoft specifically calls out outages for SharePoint, OneDrive, and the Portal. They indicate a start time of 2019-11-21 14:45 (UTC) for this particular outage. Click the image for a larger view.

Real-time SharePoint Monitoring Dashboard Detailing SharePoint and OneDrive Outage

Here is a real-time screenshot of the CloudReady dashboard for one of our example company tenants, General Sawdust (it's our Contoso). You can see the errors and slowdowns highlighted, alongside the integrated Office 365 Service Health. Click the image for a larger view.

Start a Free Trial Today. It's Simple To Get Started

Every day customers start and deploy a full suite of sensors in under 5 minutes. Give it a try for network benchmarks, root cause analysis and complete visibility into ALL of Office 365.

Early Notifications, Integrated Confirmation

Exoprise customers generally get notifications of issues and infrastructure problems before Microsoft status updates arrive, sometimes hours in advance. For Thursday’s OneDrive, SharePoint and other outages, Microsoft’s integrated communications status messages arrived at about the same time as the CloudReady notifications for this particular tenant.

Other Exoprise customers were notified much earlier.

Dashboard Grid Showing OneDrive, SharePoint, Azure AD Outages

Real-time Outages and Errors Apparent In Various Forms

You can see the main performance dashboard grids for the sensors showing gaps and skips where they generated errors and the services were unavailable.
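The gaps in those grids are runs of consecutive failed checks. As a rough sketch of how such runs can be turned into outage windows, the following groups failed probes into start and recovery times. The `(timestamp, ok)` sample shape and the 5-minute interval are assumptions for illustration, not the CloudReady data format.

```python
from datetime import datetime, timedelta

def failure_windows(samples):
    """Group consecutive failed probes into (first_failure, first_success_after) pairs.

    `samples` is a time-ordered list of (timestamp, ok) tuples. This shape is
    hypothetical and only used for illustration.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                   # outage begins at the first failed probe
        elif ok and start is not None:
            windows.append((start, ts))  # outage ends at the first successful probe
            start = None
    if start is not None:
        windows.append((start, None))    # still failing at the end of the data
    return windows

# Made-up probe results at 5-minute intervals
t0 = datetime(2019, 11, 21, 14, 30)
results = [True, True, True, False, False, False, True, True]
samples = [(t0 + timedelta(minutes=5 * i), ok) for i, ok in enumerate(results)]

for start, end in failure_windows(samples):
    print(start, "->", end)
```

Recording the first successful probe after a failure, rather than the time of a status message, is what lets you say when service really came back for your own tenant and location.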

Availability Chart Showing High 8's for Availability
No five-nines for these services this week: SharePoint and OneDrive availability drops to the high 80s (percent) for a few days

The 7-day availability charts will ultimately report availability at less than 90% for the two days.
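To put those percentages in perspective, here is a small sketch of the arithmetic behind an availability figure. The 150 minutes of downtime is a made-up number chosen for illustration, not a measurement from this outage.

```python
def availability_percent(window_minutes, downtime_minutes):
    """Share of the window during which the service was up, in percent."""
    return 100.0 * (window_minutes - downtime_minutes) / window_minutes

def allowed_downtime_minutes(window_minutes, target_percent):
    """Downtime budget that still meets the target availability."""
    return window_minutes * (100.0 - target_percent) / 100.0

DAY = 24 * 60  # minutes in one day

# Hypothetical: 150 minutes of failed checks within a single day
print(round(availability_percent(DAY, 150), 2))

# Downtime budget per day at 90% and at five-nines (99.999%)
print(allowed_downtime_minutes(DAY, 90.0))
print(allowed_downtime_minutes(DAY, 99.999))
```

Anything beyond 144 minutes of downtime in a day pushes that day below 90%, while five-nines leaves a budget of under one second per day.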


Microsoft Service Health Messages for Outage

For this outage on Thursday the 21st of November, Microsoft provided the following information, though some of the details changed later on.

Title: Delays or problems loading SharePoint Online sites

User Impact: Users may experience intermittent delays or navigation errors when accessing SharePoint sites.

Current status: We’re reviewing available diagnostic data, as well as traffic prioritization within the potentially affected SharePoint Online infrastructure, to isolate the cause of this impact.

Scope of impact: This issue could potentially affect any user intermittently when attempting to access SharePoint Online sites.

Next update by: Thursday, November 21, 2019, at 6:00 PM UTC

Sample Notification from CloudReady Synthetic

Example notification for the General Sawdust tenant for the SharePoint outage. This and other integrated notifications are part of the CloudReady platform. These messages can be integrated with many other systems using webhooks, email hooks, the included text messages and more.
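For readers unfamiliar with webhooks, the idea is simply an HTTP POST carrying the alert to another system. The sketch below builds such a request with the Python standard library. The payload fields (`sensor`, `status`, `detail`) and the URL are hypothetical placeholders and do not describe the actual CloudReady webhook format.

```python
import json
import urllib.request

def build_alert_request(webhook_url, sensor, status, detail):
    """Build (but do not send) a JSON POST for a generic webhook endpoint.

    The payload field names are hypothetical, chosen only for illustration.
    """
    payload = {"sensor": sensor, "status": status, "detail": detail}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_alert_request(
    "https://example.com/hooks/alerts",  # placeholder URL
    sensor="SharePoint",
    status="error",
    detail="Delays or problems loading SharePoint Online sites",
)

# urllib.request.urlopen(req) would send it; omitted so the sketch has no side effects.
print(req.get_method(), req.full_url)
```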

Outages Resolved, Service Restored

Ultimately the outage was resolved and service completely restored at 2019-11-22 01:56 (UTC). You can see this outage within your service health dashboard as the following:

SP196346 – Delays or problems loading SharePoint Online sites

Final status: We completed deployment and have verified via monitoring that the service is fully restored.

Scope of impact: This issue could have potentially affected any user intermittently when attempting to access SharePoint Online sites.

Start time: Thursday, November 21, 2019, at 2:45 PM UTC

End time: Friday, November 22, 2019, at 1:56 AM UTC

Preliminary root cause: We identified a code misconfiguration within a recent deployment that caused load balancers to process traffic inefficiently.

Next steps:
We’re reviewing the cause of the misconfiguration in the code to detect and prevent similar issues from happening in the future.

We’ll publish a post-incident report within five business days.
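The start and end times in the final status above give the official length of the incident. As a small sketch, the timestamps can be parsed in the format Microsoft uses in these messages and subtracted. The format string is our own reading of that text and assumes an English locale for the day and month names.

```python
from datetime import datetime, timezone

# Matches text such as "Thursday, November 21, 2019, at 2:45 PM UTC"
FMT = "%A, %B %d, %Y, at %I:%M %p UTC"

def parse_utc(text):
    """Parse a service health timestamp and mark it as UTC."""
    return datetime.strptime(text, FMT).replace(tzinfo=timezone.utc)

def outage_minutes(start_text, end_text):
    """Whole minutes between the reported start and end times."""
    delta = parse_utc(end_text) - parse_utc(start_text)
    return int(delta.total_seconds() // 60)

minutes = outage_minutes(
    "Thursday, November 21, 2019, at 2:45 PM UTC",
    "Friday, November 22, 2019, at 1:56 AM UTC",
)
print(f"{minutes // 60}h {minutes % 60}m")  # 11h 11m
```

Keeping the official duration next to your own measured outage windows makes it easier to spot where the two disagree.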

Service Restored, Real-time Updated Dashboard

For completeness, here’s what service restored looks like from a synthetic monitoring perspective within CloudReady.

SharePoint, OneDrive Service Restored

Tuesday, November 19th, 2019: Microsoft 365 Outages

Earlier in the week, Microsoft 365 experienced similar outages that Microsoft also attributed to network changes. Here’s a timeline and brief analysis of the event.

 

Recent Microsoft 365 Outage

For the Tuesday outage, Microsoft cited a recent networking update as the cause. It affected a broader portfolio of applications including Teams, Exchange Online, SharePoint Online, Yammer and Skype for Business Online. This outage lasted well over two hours.

Ultimately, the outage resolution as updated from the Microsoft 365 Admin Center is as follows:

Title: Unable to access Microsoft 365 services

User Impact: Affected users may have been intermittently unable to access one or more Microsoft services.

Final status: Our investigation identified that a portion of infrastructure entered a degraded state. Automated recovery procedures repaired this problem, and we confirmed that service was restored after monitoring the environment.

Scope of impact: Impact was specific to a subset of users who were served through the affected infrastructure.

Start time: Tuesday, November 19, 2019, at 3:30 PM UTC

End time: Tuesday, November 19, 2019, at 4:15 PM UTC

This is the final update for the event.

Service Restored for November 22, 2019

It was a difficult week for Microsoft 365, with outages mostly due to network changes. Not every tenant was affected and service was restored pretty quickly.

For customers, it’s best to have early detection and communications in place so that your business isn’t left in the dark. Additionally, documenting Service Level Agreement violations is important for receiving service credit refunds.

Team Exoprise


Team Exoprise represents multiple people in the engineering, sales and marketing departments here at Exoprise. It takes a village.
