
Early Detection of Exchange Online Mail Delivery Outage

Exoprise CloudReady provides early detection of mission-critical mail outages. On February 3rd, 2021, Microsoft experienced mail delivery delays that led to delivery failures and an outage. CloudReady detected the Exchange Online mail delivery error almost 2 hours before Microsoft published an incident to track the outage:

Users are experiencing delays of approximately five to ten minutes when receiving external email messages

EX237654, Exchange Online, Last updated: February 3, 2021 5:29 PM
Start time: February 3, 2021 5:28 PM

Status
Service degradation

User impact
Users may be experiencing delays of approximately five to ten minutes when receiving external email messages.

Latest message
Title: Users are experiencing delays of approximately five to ten minutes when receiving external email messages

User Impact: Users may be experiencing delays of approximately five to ten minutes when receiving external email messages.

Current status: We’re reviewing email transport logs to determine the underlying cause of the issue.

Scope of impact: This issue could potentially affect any of your users if they are routed through the affected infrastructure.

Next update by: Wednesday, February 3, 2021, 7:30 PM (2/4/2021, 12:30 AM UTC)

Dropped Emails

What the incident initially fails to mention to customers and administrators is that, with delays this long on the mail transport receivers, messages can be rejected and bounced back to the sender once the queues fill up. Our own sensors don’t wait that long before notifying our customers of the outage. You can read more about this on our Exchange mail flow page.
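To make the "queues fill up" point concrete, here is a minimal sketch of a bounded mail queue. The function name, rates, and capacity are hypothetical numbers chosen for illustration; they are not measurements from this incident or from Exchange Online.

```python
def simulate_queue(arrivals_per_min, drained_per_min, capacity, minutes):
    """Return a list of (queued, rejected_total) tuples, one per simulated minute.

    All parameters are illustrative assumptions, not real Exchange Online values.
    """
    queued = 0
    rejected = 0
    history = []
    for _ in range(minutes):
        queued += arrivals_per_min
        # Deliver as much as the (possibly degraded) receiver can handle.
        queued -= min(queued, drained_per_min)
        # Anything beyond the queue capacity is rejected back to the sender.
        if queued > capacity:
            rejected += queued - capacity
            queued = capacity
        history.append((queued, rejected))
    return history


healthy = simulate_queue(arrivals_per_min=100, drained_per_min=120, capacity=500, minutes=10)
degraded = simulate_queue(arrivals_per_min=100, drained_per_min=40, capacity=500, minutes=10)

print("healthy :", healthy[-1])
print("degraded:", degraded[-1])
```

With these assumed numbers the healthy queue stays empty, while the degraded queue grows by 60 messages a minute, hits its capacity in the ninth minute, and starts rejecting mail from then on.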

Early Notice

Here’s an example notice received nearly 2 hours before Microsoft published the incident, when Microsoft’s message transport started to fail. At the start of the incident, delivery failures were sporadic, occurring every 15 minutes or so. But as the outage persisted and transport queues started to grow, more deliveries were missed. This is often the case with mail delivery outages: queues grow and the problem compounds.

Early notice of mail delivery transport problems, February 3rd, 2021

Early Detection

Here’s an example of a dashboard covering a good-sized email monitoring deployment. Distributing the email sensors helps customers distinguish global outages from localized network problems.

Early detection of Exchange Online mail delivery outage
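The value of distributed sensors can be sketched in a few lines. This is only an illustration of the idea, not how CloudReady is implemented: the function name, the site names, and the 50% threshold are all assumptions.

```python
def classify_outage(sensor_failures, global_threshold=0.5):
    """Classify an outage from per-site sensor results.

    sensor_failures maps a (hypothetical) site name to True when that
    site's mail-flow sensor is currently failing. The threshold is an
    assumed cut-off, not a product setting.
    """
    if not sensor_failures:
        return ("no data", [])
    failing = sorted(site for site, failed in sensor_failures.items() if failed)
    if not failing:
        return ("healthy", [])
    if len(failing) / len(sensor_failures) >= global_threshold:
        return ("global", failing)
    return ("localized", failing)


# Hypothetical sites: one failing site suggests a local network problem,
# most sites failing suggests a problem on the service side.
print(classify_outage({"boston": True, "london": False, "sydney": False, "tokyo": False}))
print(classify_outage({"boston": True, "london": True, "sydney": True, "tokyo": False}))
```

A single failing site points toward a local network problem, while failures across most sites point toward the service itself.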

Updates to EX237654

More info: Some users are receiving a 4.3.2 error indicating that the message was not delivered. This is expected while we work to retry queued messages.

Current status: We’re in the process of re-routing traffic from the affected database infrastructure to alternate systems to improve service health and provide user relief. In parallel, we’re continuing to review recent changes and database telemetry in order to isolate and address the source of the issue.

At this point in the incident report, Microsoft starts to indicate that messages can be rejected with a 4.3.2 error. As described above, sending mail transports will eventually give up and return an error to the sender when messages remain undelivered.
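For readers unfamiliar with these codes: in SMTP enhanced status codes (RFC 3463), the first digit is the class, where 2 means success, 4 means a transient failure that the sender should retry, and 5 means a permanent failure. The sketch below classifies a reply line on that basis. The function name and return shape are hypothetical, and the sample text is the wording Microsoft used in its later update.

```python
import re

# Matches an enhanced status code such as "4.3.2", optionally preceded
# by a three-digit basic SMTP reply code and a space or hyphen.
ENHANCED_CODE = re.compile(r"^\s*(?:\d{3}[ -])?([245])\.(\d{1,3})\.(\d{1,3})\b")

CLASS_MEANING = {
    "2": "success",
    "4": "transient failure",
    "5": "permanent failure",
}


def classify_reply(reply):
    """Return a small dict describing the enhanced status code, or None."""
    match = ENHANCED_CODE.match(reply)
    if match is None:
        return None
    status_class, subject, detail = match.groups()
    return {
        "code": f"{status_class}.{subject}.{detail}",
        "meaning": CLASS_MEANING[status_class],
        # Only class 4 replies are worth retrying from the sender's side.
        "retry": status_class == "4",
    }


print(classify_reply("4.3.2 Temporary server error. Please try again later"))
```

A class 4 reply is not a bounce by itself. The message only bounces once the sending system exhausts its own retry window.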

More Updates

More info: Some users may receive the error “4.3.2 Temporary server error. Please try again later”, indicating that the message was not received. Subsequent attempts may continue to be unsuccessful. This is expected while we work to retry queued messages.

Additionally, third-party email filtering and management services may be experiencing more significant delays or failures as external senders retry to send us messages they have queued.

Current status: We’re continuing to manually re-route traffic from the affected database infrastructure to alternate systems and are seeing signs of recovery. Additionally, we’re working to determine the source of degradation.

Scope of impact: This issue could potentially affect any of your users if they are routed through the affected infrastructure.

Start time: Wednesday, February 3, 2021, at 9:20 PM UTC

Next update by: Thursday, February 4, 2021, at 2:30 AM UTC

Post Incident Report

On February 7th, Microsoft published a post-incident report that explained what happened. More information can be found in the report here.

Scope of Impact

This issue could have potentially affected any of your users if they were routed through the affected infrastructure. Not all email for an impacted user would have likely been impacted (though it is possible.) In general, approximately 10%-15% of email within a given tenant would have experienced excessive latency. However, some tenants that route all their mail through their on-premises environment or another service provider could have had a much higher impact, depending on the retry settings for the sending system.
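Why the sender's retry settings matter can be shown with a simplified model. It assumes a message is first attempted at the moment the receiver starts rejecting mail and that every attempt fails until the outage ends. The real incident was partial, so this is a worst-case sketch, and the function name and retry schedules are hypothetical.

```python
from itertools import accumulate


def delivery_outcome(retry_intervals_min, outage_minutes):
    """Return ("delivered", minute) or ("bounced", minute_of_last_attempt).

    The first attempt happens at minute 0, then one retry after each
    interval. Attempts made before the outage ends are assumed to fail.
    """
    attempt_times = [0] + list(accumulate(retry_intervals_min))
    for minute in attempt_times:
        if minute >= outage_minutes:
            return ("delivered", minute)
    # The sender ran out of retries while the receiver was still failing.
    return ("bounced", attempt_times[-1])


# Two hypothetical senders facing the same assumed 90-minute outage.
patient_sender = [5, 15, 30, 60, 120]
impatient_sender = [1, 1, 1]

print(delivery_outcome(patient_sender, outage_minutes=90))
print(delivery_outcome(impatient_sender, outage_minutes=90))
```

Under these assumptions the patient sender delivers late, at minute 110, while the impatient sender gives up after three minutes and bounces the message.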

Root Cause

Inadvertent restarts of front-end components triggered a clearing of the cache, which increased query rates into the anti-spam database components. This prevented caches from filling up with the required data, resulting in email delays and delivery failures.
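The first half of that root cause, a cleared cache pushing extra queries onto a backend, is a general pattern that can be sketched generically. This is not Microsoft's implementation: the lookup function, the sender domains, and the traffic mix are all hypothetical.

```python
from functools import lru_cache

backend_queries = 0  # counts trips to the (hypothetical) backing database


@lru_cache(maxsize=None)
def reputation_lookup(sender_domain):
    """Stand-in for an expensive database query, cached per sender domain."""
    global backend_queries
    backend_queries += 1
    return "ok"


def process(messages):
    """Handle a batch and return how many backend queries it caused."""
    before = backend_queries
    for sender_domain in messages:
        reputation_lookup(sender_domain)
    return backend_queries - before


traffic = ["a.example", "b.example", "c.example"] * 100

warm_up = process(traffic)        # cold cache: one query per distinct domain
steady = process(traffic)         # warm cache: no backend queries
reputation_lookup.cache_clear()   # simulates a restart clearing the cache
after_restart = process(traffic)  # the backend is queried again

print(warm_up, steady, after_restart)
```

In a real system, many front ends restarting at once would each repeat that cold-cache burst. That is consistent with the increased query rates described in the report, although the report itself does not go into this level of detail.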

Start a Free 15 Day Trial for Early Detection of Cloud SaaS / Outages

If you had been running Exoprise CloudReady during this incident, you’d have known about the outage almost 2 hours in advance and could have communicated it to users who might be waiting on that business-critical email. Armed with this early information, users could have reached customers through other channels, and that alone would have delivered a high return on investment. More importantly, you would have known exactly when the mail delivery problem was resolved.

If you’re examining other vendors in this space, make sure they’re actually capable of testing and monitoring Exchange mail flow, mail queuing, mail hygiene services, and Microsoft Office 365 EOP. If they’re just blogging about the outage from Microsoft’s portal and service health messages, and not showing you how they actually captured the error and outage, call them out and ask them to show you the evidence. Exoprise always shows how it captures errors in advance of Microsoft reporting the problem.

Better yet, start a free 15-day trial today. We’ve got your SaaS covered.

Team Exoprise

Team Exoprise represents multiple people in the engineering, sales, and marketing departments here at Exoprise. It takes a village.
