On Monday morning, June 5th, there was a widescale outage for Microsoft 365. Interestingly, Microsoft first reported this one with a barrage of duplicate Exchange Online health status emails (why, we have no idea), but the issue was much more widespread than Exchange alone: it affected most Microsoft 365 services. Ultimately, the incident was tracked as MO571683 and EX571516.

 

The first incident was EX571516: Some users are unable to access Outlook on the web, and may experience issues with other Exchange Online services. Soon after, Microsoft realized that the problem affected other services as well and raised incident MO571683: Some users are experiencing impact to multiple Microsoft 365 services and features.

While Microsoft was quick to raise an investigation alert (actually a baker’s dozen of them, for a still undetermined reason), the incident had already been ongoing for at least an hour, according to Exoprise’s proactive telemetry.

Integrated Twitter Feed from Microsoft

Proactive and Way Early Outage Detection

Exoprise sensors first detected and confirmed the outage at 10:22 AM EDT in various locations, running on Intel NUC boxes that we have deployed with different kinds of sensors. The alarm notification came from a Microsoft Graph OneDrive sensor: most of its synthetic actions were abruptly canceled over two runs, and the sensor alerted us to the issue. See the email notification:

 

Exoprise Sensor Outage Alert Notification 2 Hours Before Microsoft reports outage

You can see the relevant error text about the sensor operations being canceled and the OneDrive service not being available:

Executing 'Get recent files': {"error":{"code":"serviceNotAvailable","message":"A task was canceled.","innerError":{"date":"2023-06-05T14:22:16","request-id":"cf6b5e42-239a-4abd-903b-a94e40c14c5c","client-request-id":"cf6b5e42-239a-4abd-903b-a94e40c14c5c"}}}
Executing 'Get recent files': {"error":{"code":"serviceNotAvailable","message":"A task was canceled.","innerError":{"date":"2023-06-05T14:17:05","request-id":"03d721fd-7a99-4922-8eb1-862ed958bb33","client-request-id":"03d721fd-7a99-4922-8eb1-862ed958bb33"}}}
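Each of those log lines is a Microsoft Graph error body. As a small illustration, here is a Python sketch that parses one and converts its innerError date to Eastern time. It assumes the date is a UTC timestamp (which lines up with the 10:22 AM EDT alert) and hard-codes the UTC-4 daylight offset in effect on June 5. The summarize_graph_error helper is hypothetical, not part of any Exoprise or Microsoft SDK.

```python
import json
from datetime import datetime, timedelta, timezone

# Fixed UTC-4 offset: US Eastern Daylight Time, in effect on June 5, 2023.
EDT = timezone(timedelta(hours=-4), "EDT")

# One of the error bodies logged by the OneDrive 'Get recent files' step.
RAW_ERROR = (
    '{"error":{"code":"serviceNotAvailable","message":"A task was canceled.",'
    '"innerError":{"date":"2023-06-05T14:22:16",'
    '"request-id":"cf6b5e42-239a-4abd-903b-a94e40c14c5c",'
    '"client-request-id":"cf6b5e42-239a-4abd-903b-a94e40c14c5c"}}}'
)


def summarize_graph_error(raw: str) -> dict:
    """Extract code, message, request id and local (EDT) time from a Graph error body."""
    error = json.loads(raw)["error"]
    inner = error.get("innerError", {})
    # Assumption: innerError.date is a naive UTC timestamp.
    utc_time = datetime.fromisoformat(inner["date"]).replace(tzinfo=timezone.utc)
    return {
        "code": error["code"],
        "message": error["message"],
        "request_id": inner.get("request-id"),
        "local_time": utc_time.astimezone(EDT),
    }


if __name__ == "__main__":
    info = summarize_graph_error(RAW_ERROR)
    print(info["code"], info["local_time"].strftime("%H:%M:%S %Z"))
    # prints: serviceNotAvailable 10:22:16 EDT
```

The same conversion applied to the second line gives 10:17 AM EDT, about five minutes before the alert.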

Drilling into the alarm email, you can see that the outage started at around 10:00 AM EDT, but our sensors run further tests before raising the alarm to suppress false positives. Proper alarm management across your Exoprise estate is critical to getting notified early without false positives or excessive alert noise.
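Exoprise doesn’t spell out its alarm policy here, so the following is only a hypothetical Python sketch of the general idea: count consecutive failed runs, raise an alarm once a threshold is reached (two, echoing the two failed runs mentioned above), and stay quiet until a healthy run resets the streak. The class name, threshold and run times are illustrative assumptions, not Exoprise’s implementation.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureAlarm:
    """Raise an alarm only after `threshold` failed runs in a row (hypothetical policy)."""

    threshold: int = 2
    _streak: int = 0
    raised: bool = False

    def record(self, run_failed: bool) -> bool:
        """Record one sensor run; return True only on the run that raises the alarm."""
        if not run_failed:
            # A healthy run resets the streak and clears any raised alarm.
            self._streak = 0
            self.raised = False
            return False
        self._streak += 1
        if self._streak >= self.threshold and not self.raised:
            self.raised = True
            return True
        return False


if __name__ == "__main__":
    alarm = ConsecutiveFailureAlarm(threshold=2)
    # Illustrative run schedule, roughly five minutes apart.
    runs = [("10:12", False), ("10:17", True), ("10:22", True), ("10:27", True)]
    for run_time, failed in runs:
        if alarm.record(failed):
            print(f"ALARM raised at {run_time}")
```

With these sample runs the alarm fires once, at 10:22; a single transient failure would never page anyone, which is the trade-off between early notification and false positives.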

Here is the sensor view, with the outage visible from around 10:00 AM EDT:

Early detection via OneDrive tests, seeing outage up to 2 hours before Microsoft

 

Then the dashboard started lighting up for other sensors and for the integrated Twitter and Microsoft service health feeds. This is what the dashboard looked like early in the outage:

Exoprise dashboard with integrated status, twitter, and proactive sensor tests

Drilling into any of the sensors (SharePoint, OneDrive, Microsoft Teams), Exoprise’s crowdsourced analytics clearly showed that the outage was affecting everyone. Increased load and response times impacted not only our own sensors but also the global crowd.
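Exoprise hasn’t published how its crowd comparison works. As a hypothetical illustration of why a crowd baseline helps, this Python sketch labels a slowdown "widespread" only when both your own sensor and the crowd are well above a normal baseline, and "local" when only your sensor is slow. The function, the shared baseline, the 2x factor and the sample millisecond values are all made up for illustration.

```python
from statistics import median
from typing import Sequence


def classify_slowdown(
    local_ms: Sequence[float],
    crowd_ms: Sequence[float],
    baseline_ms: float,
    factor: float = 2.0,
) -> str:
    """Label a slowdown as widespread, local, or normal (hypothetical heuristic)."""
    local_slow = median(local_ms) > factor * baseline_ms
    crowd_slow = median(crowd_ms) > factor * baseline_ms
    if local_slow and crowd_slow:
        return "widespread"  # everyone is slow: likely a service-side outage
    if local_slow:
        return "local"  # only this site is slow: look at the local network or ISP
    return "normal"


if __name__ == "__main__":
    # Illustrative response times in milliseconds, not measured data.
    print(classify_slowdown([4200, 5100, 6100], [3900, 4800, 7000], baseline_ms=1500))
    print(classify_slowdown([4200, 5100, 6100], [1400, 1500, 1600], baseline_ms=1500))
```

The first call prints "widespread" and the second "local": the same local slowdown means something very different depending on what the crowd sees.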

In the following screenshots, you can see the outage in the common crowd series, across the collected metrics, as increased response times.

Azure AD, often the root cause of Microsoft 365 troubles, had issues as early as 10:20 AM EDT.

This example is for Microsoft Teams conferencing. Look at the elevated crowd metric in the background as well. This outage was felt across a lot of infrastructure.

Early detection of Microsoft Teams Outages

And, finally, here’s an example of an Outlook Web outage (which also affects the new Outlook desktop client):

Outlook, Exchange Outage June 5, 2023

Here, too, you can see that the outage impacted other Exoprise customers, whose response and UI times stretched to minutes.

Here Are the Latest Status Updates on Microsoft 365 Outage Incident MO571683

June 5, 2023 12:21 PM EDT

Title: Some users are experiencing impact to multiple Microsoft 365 services and features

User impact: Users are experiencing impact to multiple Microsoft 365 services and features.

More info:

Impacted services and features include, but are not limited to:

Exchange Online: Users are unable to access Outlook on the web (as communicated under EX571516)

Microsoft Teams:
- Users may have difficulties scheduling meetings and/or live events
- Users may have trouble loading people profile cards
- Users may have issues loading file lists
- Users may not be able to create new teams & channels
- Users may not be able to install apps
- Users may have trouble issuing searches
- Users may see delays in admin policy changes taking effect
- Users might see errors when using messaging extensions
- Users may not see up to date Presence information
- Teams Graph APIs may be impacted
- Assignments tab in Teams may not load
- Teams Admin Center functionalities may not perform as expected

SharePoint Online and OneDrive for Business: Users may be unable to use Search functionality

Current status: We’ve reverted a change which occurred just prior to impact starting and after service replication took place, telemetry is showing significant improvement in availability across the service. Additionally, we’ve received confirmation from previously impacted users that the issue is no longer occurring. We’re continuing to monitor to ensure recovery and are isolating any sections of the service showing residual impact which would require additional recovery actions.

Scope of impact: Impact is specific to users routed through the affected infrastructure.

Next update by: Monday, June 5, 2023, 12:30 PM EDT

Update: June 5, 2023 5:00 PM EDT

(Looks like this thing just went on and on. Our sensors have been reporting errors, hangs, and slowdowns all day long.)

Final status: We’ve completed the restarts to address the residual impact, and after an extended period of monitoring, we’ve confirmed that the issue is resolved. We continue to investigate why the update caused impact and further details will be provided within the Post-Incident Report.

Scope of impact: Impact was primarily to users hosted within infrastructure that received the update prior to the deployment halt. However, any user may have experienced some degree of impact as described in the “More Info” section above and in Service Health Dashboard post MO571683.

Preliminary root cause: A recent update, intended to add additional resiliencies against downstream requests remaining in a hanging state, inadvertently caused impact.

Next steps:
- We’re reviewing dump files gathered from impacted components during the incident to better understand why impact occurred.
- We’re investigating the content of the update to identify the underlying root cause.
- We’ll publish a Post-Incident Report within five business days.
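Microsoft hasn’t described what that update actually changed. As a generic illustration of what resilience against hanging downstream requests usually means, here is a Python asyncio sketch that bounds the wait on a downstream call and returns a service error instead of hanging along with it. The function names and the 0.5-second timeout are hypothetical; this is not Microsoft’s code.

```python
import asyncio


async def call_downstream(delay_s: float) -> str:
    """Stand-in for a downstream dependency that may hang (hypothetical)."""
    await asyncio.sleep(delay_s)
    return "ok"


async def guarded_call(delay_s: float, timeout_s: float = 0.5) -> str:
    """Bound how long we wait on the downstream call instead of hanging with it."""
    try:
        return await asyncio.wait_for(call_downstream(delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the inner task on timeout; report a service error instead.
        return "serviceNotAvailable"


async def main() -> None:
    print(await guarded_call(0.1))  # fast dependency: prints "ok"
    print(await guarded_call(2.0))  # hung dependency, cut off after 0.5 s


if __name__ == "__main__":
    asyncio.run(main())
```

Guards like this are meant to fail fast, but a timeout set too aggressively can turn slow-but-working requests into errors, so changes to them need careful rollout.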

Post-Incident Report Published

Between June 15 and June 19, Microsoft published and updated a post-incident report for the following incidents:

EX571516/MO571683/MO572252: Can’t access Outlook on the web and other Microsoft services and features 

In summary, Microsoft indicates that the outage of June 5th through June 6th was due to a Distributed Denial of Service (DDoS) attack. Here is their root cause:

Root Cause 
Traffic analysis has identified anomalous spikes in HTTP requests related to a Distributed Denial of Service (DDoS) attack issued against a section of Microsoft 365 front-end components, bypassing existing automatic preventative recovery measures. The front-end components started to perform below acceptable thresholds, impacting features which included Outlook on the web, REST and Search functionality. 
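Microsoft didn’t say how its traffic analysis flagged those spikes. One common, simple approach, shown here purely as a hypothetical Python sketch, compares each interval’s request count with the mean and standard deviation of a trailing window and flags large z-scores. The window size, threshold and sample counts are illustrative assumptions, not Microsoft data.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_spikes(counts: Sequence[int], window: int = 5, z_threshold: float = 3.0) -> List[int]:
    """Return indices whose count exceeds the trailing-window mean by more than z_threshold std devs."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            sigma = 1.0  # avoid division by zero on a perfectly flat history
        if (counts[i] - mu) / sigma > z_threshold:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    # Illustrative requests-per-interval counts with one burst at index 7.
    print(find_spikes([100, 104, 98, 101, 97, 103, 99, 950, 102]))  # → [7]
```

A real DDoS defense would look at many more signals (sources, paths, request shapes), but the basic idea of comparing traffic against its own recent baseline is the same.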

Detect outages early, plus know when it’s fixed.

Contact us today!

Team Exoprise represents multiple people in the engineering, sales and marketing department here at Exoprise. It takes a village.
