Early Detection of Exchange Online Mail Delivery Outage

Exchange Online down?

Exoprise CloudReady provides early detection of mission-critical mail outages. On February 3rd, 2021, Microsoft experienced mail delivery delays that led to mail delivery failures and an outage.

CloudReady detected the Exchange Online outage almost 2 hours in advance. Microsoft eventually published the incident:

Users are experiencing delays of approximately five to ten minutes when receiving external email messages

EX237654, Exchange Online, Last updated: February 3, 2021 5:29 PM
Start time: February 3, 2021 5:28 PM

Status
Service degradation

User impact
Users may be experiencing delays of approximately five to ten minutes when receiving external email messages.

Latest message View history
Title: Users are experiencing delays of approximately five to ten minutes when receiving external email messages

User Impact: Users may be experiencing delays of approximately five to ten minutes when receiving email messages.

Current status: We’re reviewing email transport logs to determine the underlying cause of the issue.

Scope of impact: This issue could potentially affect any of your users if they route through the affected system.

Next update by: Wednesday, February 3, 2021, 7:30 PM (2/4/2021, 12:30 AM UTC)

Dropped Emails

The incident notice fails to mention the long delays on the mail transport receivers. As a result, once the queues fill up, messages are dropped, rejected, or bounced back to the sender. Our own sensors don’t wait that long before notifying our customers of the outage. You can read more about it on our Exchange mail flow page.

Early Notice

Here’s an example notice received nearly 2 hours in advance of the outage. At the start of the incident, email outages were occurring every 15 minutes or so.

But as the outage progressed, transport queues started to grow and more mail deliveries were missed. This is typical of mail outages: queues grow and problems multiply.
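The idea of catching delays and missed deliveries early can be sketched in a few lines. This is only an illustration, not how CloudReady is implemented: the `Probe` record, the `classify` and `should_alert` functions, and the 5 and 15 minute thresholds are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Probe:
    """One synthetic test message (hypothetical record for this sketch)."""
    sent_at: datetime
    received_at: Optional[datetime]  # None if the message has not arrived


def classify(probe: Probe, now: datetime,
             warn_after: timedelta = timedelta(minutes=5),
             fail_after: timedelta = timedelta(minutes=15)) -> str:
    """Label a probe as ok, delayed, missed, or still pending."""
    if probe.received_at is not None:
        delay = probe.received_at - probe.sent_at
        return "delayed" if delay > warn_after else "ok"
    waited = now - probe.sent_at
    if waited > fail_after:
        return "missed"
    if waited > warn_after:
        return "delayed"
    return "pending"


def should_alert(probes: List[Probe], now: datetime, min_bad: int = 2) -> bool:
    """Alert when the most recent `min_bad` probes are all delayed or missed."""
    recent = probes[-min_bad:]
    if len(recent) < min_bad:
        return False
    return all(classify(p, now) in ("delayed", "missed") for p in recent)
```

Requiring several consecutive bad probes before alerting is one simple way to avoid paging on a single slow message while still reacting long before queues fill up.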

Early notice of mail delivery transport problems, February 3rd, 2021

Early Detection

Here’s an example of a dashboard covering a good-sized email monitoring deployment. Distributing the email sensors across locations helps customers tell a global outage apart from a local network problem.
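As a rough sketch of why distributed sensors help, the results from several sites can be combined to separate a global outage from a local problem. The function name `outage_scope`, the site names, and the 50% threshold below are assumptions made for illustration only.

```python
from typing import Dict


def outage_scope(results: Dict[str, bool], global_fraction: float = 0.5) -> str:
    """Summarize per-site mail checks.

    `results` maps a sensor site name to True when its latest mail
    delivery check passed. The threshold is an arbitrary example value.
    """
    if not results:
        return "unknown"
    failed = [site for site, ok in results.items() if not ok]
    if not failed:
        return "healthy"
    if len(failed) / len(results) >= global_fraction:
        return "global"
    return "local"


# Example with made-up sites: one failing site out of three looks local.
print(outage_scope({"boston": False, "london": True, "sydney": True}))
```

A single sensor cannot make this distinction, since a failure at one site may just as easily be a local network issue as a service outage.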

Early detection of exchange online mail delivery outage

Updates to EX237654

More info: Some users are receiving a 4.3.2 error indicating that the message was not delivered. This is expected while we work to retry queued messages.

Current status: We’re in the process of re-routing traffic to alternate systems. This will improve service health and provide user relief. In parallel, we’re continuing to review recent changes.

Now, in the incident report, Microsoft is starting to indicate that messages can be dropped with a 4.3.2 error. Sending mail transports will eventually give up and return an error to the sender.
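To illustrate why a 4.3.2 reply leads to retries and, eventually, a bounce: in SMTP enhanced status codes, a leading 4 marks a transient failure and a leading 5 a permanent one. The sketch below classifies a reply and decides what a sending server might do next. The function names and the four-day queue lifetime are assumptions for this example; real mail servers use their own configured retry schedules.

```python
import re
from datetime import timedelta

# Matches an enhanced status code such as "4.3.2" anywhere in the reply.
_STATUS = re.compile(r"\b([245])\.\d{1,3}\.\d{1,3}\b")


def classify_status(reply: str) -> str:
    """Return success, transient, permanent, or unknown for an SMTP reply."""
    match = _STATUS.search(reply)
    if not match:
        return "unknown"
    return {"2": "success", "4": "transient", "5": "permanent"}[match.group(1)]


def next_action(reply: str, queued_for: timedelta,
                max_queue_age: timedelta = timedelta(days=4)) -> str:
    """Decide what a sender does next (max_queue_age is an example value)."""
    kind = classify_status(reply)
    if kind == "success":
        return "delivered"
    if kind == "transient" and queued_for < max_queue_age:
        return "retry"
    # Permanent failure, unrecognized reply, or a transient failure that
    # has outlived the queue lifetime: give up and notify the sender.
    return "bounce"


reply = "4.3.2 Temporary server error. Please try again later"
print(next_action(reply, timedelta(hours=1)))  # still within the queue lifetime
print(next_action(reply, timedelta(days=5)))   # queue lifetime exceeded
```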

More Updates

More info: Some users may receive the error “4.3.2 Temporary server error. Please try again later”. Subsequent attempts may continue to be unsuccessful.

Current status: Re-route traffic to alternate systems. Signs of recovery are visible.

Scope of impact: This issue could potentially affect any of your users if they are routed through the affected systems.

Start time: Wednesday, February 3, 2021, at 9:20 PM UTC

Next update by: Thursday, February 4, 2021, at 2:30 AM UTC

Post Incident Report

On February 7th, Microsoft published a post-incident report. More information can be found in the report here.

Scope of Impact

This issue could have potentially affected any of your users if they were routed through the affected system. Not all email for an impacted user would have likely been impacted (though it is possible.) In general, approximately 10%-15% of email within a given tenant would have experienced excessive delay. However, some tenants that route all their mail through their on-premises environment or another service provider could have higher impact.

Root Cause

Inadvertent restarts of front-end components triggered a clearing of the cache, which increased query rates into the anti-spam database components. This prevented caches from filling up with the required data. The result was email delays and delivery failures.
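The general mechanism described here, a cleared cache pushing every lookup back to the database, can be shown with a toy model. This is not a model of Microsoft's systems: the `CachedLookup` class, the `queries_for` helper, and the sample keys are invented purely to show how backend query counts jump after a restart.

```python
from typing import Callable, Dict, Iterable


class CachedLookup:
    """Toy front end that caches results from a slower backend."""

    def __init__(self, backend: Callable[[str], str]) -> None:
        self.backend = backend
        self.cache: Dict[str, str] = {}
        self.backend_queries = 0

    def get(self, key: str) -> str:
        if key not in self.cache:
            # Cache miss: the request falls through to the backend.
            self.backend_queries += 1
            self.cache[key] = self.backend(key)
        return self.cache[key]

    def restart(self) -> None:
        """Simulate a component restart that clears the cache."""
        self.cache.clear()


def queries_for(lookup: CachedLookup, keys: Iterable[str]) -> int:
    """Count backend queries caused by one batch of lookups."""
    before = lookup.backend_queries
    for key in keys:
        lookup.get(key)
    return lookup.backend_queries - before


lookup = CachedLookup(lambda key: key.upper())
batch = ["a.example", "b.example"] * 50
cold = queries_for(lookup, batch)     # first batch has to fill the cache
warm = queries_for(lookup, batch)     # served entirely from the cache
lookup.restart()
after_restart = queries_for(lookup, batch)  # misses return after the restart
print(cold, warm, after_restart)
```

If restarts keep recurring, the cache never stays warm and the backend sees the cold-start load continuously, which is consistent with the delays described in the report.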

Start a Free 15-Day Trial for Early Detection of Cloud / SaaS Outages

Own Exoprise CloudReady and know about the outage hours in advance. You can also keep users who are waiting on business-critical email informed and, more importantly, know exactly when the mail delivery problem is resolved.
 
Make sure other vendors in this space are actually capable of testing and monitoring:
  • Exchange mail flow
  • Mail queuing
  • Mail hygiene services
  • Microsoft Office 365 EOP
They just blog about the outage from Microsoft’s portal and service health messages, and don’t show how they actually captured the error and outage. Call them out and ask them to show you the evidence.
 
Exoprise always shows how it captures errors before Microsoft reports the problem.

Better yet, start a free 15-day trial today. We’ve got your SaaS covered.

Team Exoprise

Team Exoprise represents multiple people in the engineering, sales, and marketing departments here at Exoprise. It takes a village.