Early Detection of Exchange Online Mail Delivery Outage
Exchange Online down?
Exoprise CloudReady provides early detection of mission-critical mail outages. On February 3rd, 2021, Microsoft experienced mail delivery delays that led to mail delivery failures and an outage.
CloudReady detected the Exchange Online outage almost 2 hours before Microsoft acknowledged it. Microsoft eventually published the incident:
Users are experiencing delays of approximately five to ten minutes when receiving external email messages
EX237654, Exchange Online, Last updated: February 3, 2021 5:29 PM
Start time: February 3, 2021 5:28 PM
Users may be experiencing delays of approximately five to ten minutes when receiving external email messages.
Title: Users are experiencing delays of approximately five to ten minutes when receiving external email messages
User Impact: Users may be experiencing delays of approximately five to ten minutes when receiving email messages.
Current status: We’re reviewing email transport logs to determine the underlying cause of the issue.
Scope of impact: This issue could potentially affect any of your users if they route through the affected system.
Next update by: Wednesday, February 3, 2021, 7:30 PM (2/4/2021, 12:30 AM UTC)
This incident notice fails to mention the long delays on the mail transport receivers. Because of those delays, queues fill up, and once they are full, messages are dropped, rejected, or bounced back to the sender. Our own sensors don't wait that long before notifying our customers of the outage. You can read more about it on our Exchange mail flow page.
Here’s an example notice received nearly 2 hours in advance of the outage. At the start of the incident, email outages were occurring every 15 minutes or so.
But as the outage progressed, transport queues started to grow and more mail deliveries were missed. This is typical of mail outages: queues grow and problems multiply.
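To make the idea of a mail flow sensor concrete, here is a minimal sketch of how missed deliveries could be turned into an alert. It is not CloudReady's implementation. The `Probe` record, the `max_delay` limit, and the `window` and `threshold` values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Probe:
    """One synthetic test message (hypothetical record shape)."""
    sent_at: datetime
    received_at: Optional[datetime]  # None means the message never arrived


def is_missed(probe: Probe, max_delay: timedelta) -> bool:
    """A probe counts as missed if it never arrived or arrived too late."""
    if probe.received_at is None:
        return True
    return probe.received_at - probe.sent_at > max_delay


def should_alert(probes: List[Probe], max_delay: timedelta,
                 window: int = 4, threshold: int = 2) -> bool:
    """Alert when at least `threshold` of the last `window` probes were missed."""
    recent = probes[-window:]
    missed = sum(is_missed(p, max_delay) for p in recent)
    return missed >= threshold


if __name__ == "__main__":
    # Made-up timestamps, not data from the incident.
    start = datetime(2021, 1, 1, 12, 0)
    probes = [
        Probe(start, start + timedelta(seconds=40)),
        Probe(start + timedelta(minutes=5), start + timedelta(minutes=17)),
        Probe(start + timedelta(minutes=10), None),
    ]
    print(should_alert(probes, max_delay=timedelta(minutes=5)))
```

Counting misses over a short window, rather than alerting on a single late message, is one way to avoid noise from an isolated slow delivery while still reacting well before queues fill up.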
Here’s an example of a dashboard covering a good-sized email monitoring deployment. Distributing the email sensors across locations helps customers tell a global outage apart from a local network problem.
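As a rough illustration of why distributed sensors help, the sketch below classifies an outage as local or global from per-site results. The function name, the site names, and the 50% cutoff are assumptions made for this example, not part of any Exoprise product.

```python
from typing import Dict


def outage_scope(site_failing: Dict[str, bool], global_fraction: float = 0.5) -> str:
    """Classify scope from a map of site name -> True if that site's sensor is failing.

    global_fraction is an illustrative cutoff: at or above it, the problem is
    treated as global rather than tied to one site's network.
    """
    if not site_failing:
        return "no data"
    failing = sum(1 for is_failing in site_failing.values() if is_failing)
    if failing == 0:
        return "healthy"
    if failing / len(site_failing) >= global_fraction:
        return "global"
    return "local"


if __name__ == "__main__":
    # Hypothetical sites and results.
    print(outage_scope({"boston": True, "london": True, "sydney": True}))
    print(outage_scope({"boston": True, "london": False, "sydney": False}))
```

If only one site reports failures, the cause is more likely that site's own network path. If most sites fail at once, the service itself is the likelier culprit.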
Updates to EX237654
More info: Some users are receiving a 4.3.2 error indicating that the message was not delivered. This is expected while we work to retry queued messages.
Current status: We’re in the process of re-routing traffic to alternate systems. This will improve service health and provide user relief. In parallel, we’re continuing to review recent changes.
Now, in the incident report, Microsoft is starting to indicate that messages can be rejected with a 4.3.2 error. Sending mail servers keep retrying for a while, but they will eventually give up and return an error to the sender.
More info: Some users may receive the error “4.3.2 Temporary server error. Please try again later”. Subsequent attempts may continue to be unsuccessful.
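The retry-then-give-up behavior can be sketched as follows. Under RFC 3463, the first digit of an enhanced status code gives its class: 2 for success, 4 for a transient failure, 5 for a permanent failure. The function names and the 48-hour limit below are illustrative assumptions; real mail servers make the retry period configurable.

```python
import re

# Enhanced status codes have the form "class.subject.detail" (RFC 3463).
_STATUS = re.compile(r"\b([245])\.(\d{1,3})\.(\d{1,3})\b")


def classify_reply(reply: str) -> str:
    """Return 'delivered', 'retry', 'bounce', or 'unknown' for an SMTP reply line."""
    match = _STATUS.search(reply)
    if match is None:
        return "unknown"
    return {"2": "delivered", "4": "retry", "5": "bounce"}[match.group(1)]


def next_action(reply: str, queued_for_hours: float,
                give_up_after_hours: float = 48.0) -> str:
    """Decide what a sending server might do with a queued message.

    give_up_after_hours is an illustrative value, not a standard.
    """
    outcome = classify_reply(reply)
    if outcome == "retry" and queued_for_hours >= give_up_after_hours:
        return "bounce"  # retries exhausted: report non-delivery to the sender
    return outcome


if __name__ == "__main__":
    reply = "4.3.2 Temporary server error. Please try again later"
    print(next_action(reply, queued_for_hours=1.0))
    print(next_action(reply, queued_for_hours=72.0))
```

This is why a "temporary" 4.3.2 error still matters: if the condition outlasts the sender's retry period, the message is returned as undeliverable.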
Current status: Re-route traffic to alternate systems. Signs of recovery are visible.
Scope of impact: This issue could potentially affect any of your users if they are routed through the affected systems.
Start time: Wednesday, February 3, 2021, at 9:20 PM UTC
Next update by: Thursday, February 4, 2021, at 2:30 AM UTC
Post Incident Report
On February 7th, Microsoft published a post-incident report. More information can be found in the report here.
Scope of Impact
This issue could have potentially affected any of your users if they were routed through the affected system. Not all email for an impacted user would have likely been impacted (though it is possible.) In general, approximately 10%-15% of email within a given tenant would have experienced excessive delay. However, some tenants that route all their mail through their on-premises environment or another service provider could have higher impact.
Inadvertent restarts of front-end components triggered a clearing of the cache, which increased query rates into the anti-spam database components. This prevented caches from filling up with the required data. The result was email delays and delivery failures.
Start a Free 15 Day Trial for Early Detection of Cloud SaaS / Outages
With Exoprise CloudReady, you know about an outage hours in advance. You can also communicate with users waiting on that business-critical email. More importantly, you know exactly when the mail delivery problem is resolved.
Make sure other vendors in this space are actually capable of testing and monitoring:
- Exchange mail flow
- Mail queuing
- Mail hygiene services
- Microsoft Office 365 EOP
Other vendors often just blog about the outage using Microsoft’s portal and service health messages, and don’t show how they actually captured the error and outage. Call them out and ask them to show you the evidence.
Exoprise always shows how it captures errors before Microsoft reports the problem.