How to Monitor Office 365 Mail Queues and Troubleshoot Slow Email Delivery for Exchange and Exchange Online
Despite the recurring predictions of its death, businesses of all sizes still run on email. Yes, messaging and news feeds like Yammer and Slack have a place in corporate collaboration, but the asynchronous nature of email is what makes it adaptable, robust, and scalable. It also makes email exceedingly difficult to monitor continuously. On top of this, most corporate email systems employ additional layers of filtering, including external cloud-based SMTP filters and spam protection. These extra layers add points of failure and latency for both inbound and outbound mail flow. To add to an already complex picture, Office 365 has its own message queueing systems.
When everything was behind the firewall, it was easy enough to detect stuck mail queues and resolve the situation. Now with cloud-based systems like Office 365 and Google Apps, it’s nearly impossible to monitor what’s working and what isn’t — unless you are using CloudReady, of course.
How CloudReady Monitors Mail Queue Performance
CloudReady has invested considerable time and effort in developing our email sensors to solve just this problem. All of the CloudReady email sensors, including the Exchange Online, Hosted Exchange, Outlook Web App, IMAP, Gmail, and On-Premise Email (MAPI) sensors, send synthetic emails between the mail platform and our CloudReady auto-responders. We knew from the beginning that testing email delivery, reliability, and performance was critical to our customers' confidence in the cloud.
How Does It Work?
All of our CloudReady sensors exercise the entire email system by sending a small test message from the sensor to our auto-responders in the cloud. When the sensor is browser-based, Gmail or Outlook Web App (OWA) for example, the sensor will navigate through the product to generate a test message just like a real user would — opening dialogs, hitting compose buttons, etc. Exoprise emulates a real user utilizing the product end-to-end and monitoring the whole experience. For reliable and accurate monitoring, a simple API test won’t get the job done.
Once an email is generated, our sensors confirm that the message was sent by checking the appropriate folders and headers on the outbound/sent side while we start to wait for the reply. Additionally, for many of our sensors, we set up mailbox rules to hide our synthetic emails. We do this so that customers can reuse existing accounts for monitoring email systems and get maximum coverage across their mailbox servers or Database Availability Groups (DAGs).
Next, we wait for the reply from our auto-responders so that we can parse the timestamps, the MIME and extended header data, and report on the success or failure of the transmission. Our sensors use an exponential back-off algorithm and will wait up to approximately 3 minutes for a reply. We figure that if the systems and layers can't get a small email out and back within 3 minutes, then it's going to be too slow to do anything with real content, and we report a failure. We've tested these thresholds with our customers and our crowd-powered data (see below).
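To make the waiting step concrete, here is a minimal sketch of polling with exponential back-off under a fixed time budget. It is an illustration only, not the CloudReady implementation: the function names, the 1-second initial delay, the doubling factor, and the `check_inbox` callback are all assumptions; only the roughly 3-minute (180-second) ceiling comes from the description above.

```python
import time
from typing import Callable, Iterator, Optional


def backoff_delays(initial: float = 1.0, factor: float = 2.0,
                   budget: float = 180.0) -> Iterator[float]:
    """Yield sleep intervals that grow geometrically and sum to exactly `budget`."""
    elapsed = 0.0
    delay = initial
    while elapsed < budget:
        step = min(delay, budget - elapsed)  # clip the final wait to what is left
        yield step
        elapsed += step
        delay *= factor


def wait_for_reply(check_inbox: Callable[[], Optional[str]],
                   sleep: Callable[[float], None] = time.sleep,
                   **backoff: float) -> Optional[str]:
    """Poll `check_inbox` (hypothetical) until it returns a reply or time runs out."""
    reply = check_inbox()
    if reply is not None:
        return reply
    for delay in backoff_delays(**backoff):
        sleep(delay)
        reply = check_inbox()
        if reply is not None:
            return reply
    return None  # the caller would report this as a failed round trip
```

With these assumed defaults the sensor would wait 1, 2, 4, 8, 16, 32, and 64 seconds, then a final 53 seconds, for a total of 180 seconds before giving up.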
If CloudReady is to measure the timings of interactions and the round-trip times of emails, a good source of synchronized time is required. Unfortunately, CloudReady can't guarantee that a customer system has a reliable clock or that its clock is kept synchronized, so a CloudReady installation has to periodically synchronize its own notion of time with our servers and use that time source for its measurements.
We use this time synchronization across many of our sensors, so it's well worth the effort for the improved accuracy.
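One common way to keep a private notion of time without touching the host clock is to estimate the offset between the local clock and a trusted server, then apply that offset to every measurement. The sketch below uses Cristian's algorithm, which assumes the network delay is the same in both directions. It is an assumption about how such synchronization could be done, not a description of CloudReady internals, and `fetch_server_time` is a hypothetical callback.

```python
import time
from typing import Callable


def estimate_offset(fetch_server_time: Callable[[], float],
                    local_clock: Callable[[], float] = time.time) -> float:
    """Estimate (server clock - local clock) in seconds, assuming a symmetric path."""
    t_send = local_clock()
    server_time = fetch_server_time()  # hypothetical request to a trusted time source
    t_recv = local_clock()
    midpoint = (t_send + t_recv) / 2   # local time when the server most likely read its clock
    return server_time - midpoint


def corrected_now(offset: float,
                  local_clock: Callable[[], float] = time.time) -> float:
    """Return the local time shifted onto the server's timeline."""
    return local_clock() + offset
```

Re-estimating the offset periodically limits the effect of local clock drift between measurements.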
Exoprise leverages a Node.js-based SMTP application, Haraka, along with a series of custom plugins, to process its messages. You can read more about Haraka here. Since it's Node.js-based, it's event-driven, incredibly fast, and highly scalable. Our auto-responders process hundreds of thousands of inbound and outbound messages each day, typically in less than 10 milliseconds per message.
Continuous measurements: Inbound, Outbound and Transport Times
CloudReady continuously captures a number of metrics related to Message Queue performance and we are expanding this data-set all the time:
- MTA Outbound Time
The time it takes to send an email to our CloudReady auto-responders. This metric measures outbound queue and transfer performance. MTA stands for message transfer agent, the component responsible for the transmission of email between servers.
- MTA Transport Time
The time it takes for a reply to reach the inbound MTA servers. This metric measures inbound transport performance and is a subset of the MTA Inbound time as reported by processing servers.
- MTA Inbound Time
The total time it takes for a reply to travel from our auto-responders to the inbox. This metric measures inbound queue performance.
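As a rough illustration of how the three metrics relate, the sketch below derives them from five timestamps taken on one synchronized clock. The class and field names are hypothetical, and the exact points at which CloudReady samples each timestamp are an assumption based on the definitions above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RoundTrip:
    # Hypothetical field names; every timestamp is on the same synchronized clock.
    sent_at: datetime                # sensor handed the test message to the mail platform
    responder_received_at: datetime  # auto-responder accepted the test message
    reply_sent_at: datetime          # auto-responder sent the reply
    mta_received_at: datetime        # reply reached the inbound MTA servers
    delivered_at: datetime           # reply landed in the monitored inbox

    @property
    def mta_outbound(self) -> timedelta:
        return self.responder_received_at - self.sent_at

    @property
    def mta_transport(self) -> timedelta:
        return self.mta_received_at - self.reply_sent_at

    @property
    def mta_inbound(self) -> timedelta:
        # Includes the transport time plus any queueing after the inbound MTA.
        return self.delivered_at - self.reply_sent_at
```

Because the transport time is contained in the inbound time, the difference between the two approximates the time spent in the inbound processing queue.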
How to Interpret the Results
If you have a large MTA Outbound Time, it likely points to problems with your outbound SMTP filters and/or your provider's or on-premise Exchange mail queues.
If you have a large MTA Inbound Time but no correspondingly large MTA Transport Time, then you have a stuck inbound mail processing queue. If it's an Exchange or email environment within your control, then you should consider increasing the resources for your inbound queues.
If you have a large MTA Transport Time, then you have a problem with your spam or AV processing. Typically, this is the culprit and the source of slow mail replies. We have seen spam processing services take upwards of 3 or 4 minutes to process a small 900-byte message. When you see large MTA Transport Times, you can expect problems with email delivery in the future, and larger messages with attachments will take even longer to process.
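The three rules above can be expressed as a small decision function. This is a sketch for illustration: the function name and the 60-second threshold for "large" are assumptions, since what counts as large depends on your own baseline and on the crowd-sourced comparison data.

```python
from typing import List


def diagnose(outbound_s: float, transport_s: float, inbound_s: float,
             threshold_s: float = 60.0) -> List[str]:
    """Map the three MTA timings (in seconds) to the likely problem areas."""
    findings = []
    if outbound_s > threshold_s:
        findings.append("outbound: check outbound SMTP filters and "
                        "provider or on-premise Exchange mail queues")
    if transport_s > threshold_s:
        # Transport time is part of inbound time, so a slow transport explains
        # a slow inbound time and no separate inbound finding is added.
        findings.append("transport: check spam or AV processing")
    elif inbound_s > threshold_s:
        findings.append("inbound: inbound mail processing queue looks stuck")
    return findings
```

For example, `diagnose(5, 200, 210)` reports only the transport finding, because the slow inbound time is already accounted for by the slow transport time.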
Crowd-powered Like the Rest of Our Statistics
Just like ALL of the metrics that Exoprise CloudReady collects, we crowd-source our MTA statistics so that customers can understand what is typical for inbound and outbound mail processing. Being able to do this across on-premise Exchange servers and Office 365 environments, and being able to make comparisons, is very helpful when evaluating a migration to different SMTP or spam services, as well as for continuous monitoring.
We have had customers come to us and say “we thought our email was slow, we just didn’t realize how bad it was until we could compare it to other systems”. That is the power of CloudReady crowd-powered monitoring.
Find out if you have a problem today
CloudReady is a simple and powerful way for customers to get insight into and monitor their critical email flow. Don’t wait until after an outage to know whether or not you received that crucial proposal, purchase order, or support request.