How to Monitor Office 365 Mail Queues and Troubleshoot Slow Email Delivery for Exchange and Exchange Online
Having trouble with Exchange mail queues?
Despite the recurring predictions of its death, businesses of all sizes still run on email. Yes, messaging and news feeds like Yammer and Slack have a place in corporate collaboration. But the asynchronous nature of email is what makes it adaptable, robust, and scalable. It also makes it exceedingly difficult to monitor continuously.
On top of this, most corporate email systems employ additional layers of filtering, including external cloud-based SMTP filters and spam protection. These extra layers add points of failure and latency for both inbound and outbound mail flow. To add to an already complex picture, Office 365 has its own message queueing systems.
When everything was behind the firewall, it was easy enough to detect stuck mail queues and resolve the situation. Now, with cloud-based systems like Office 365 and Google Apps, it’s nearly impossible to monitor what’s working and what isn’t, unless you are using CloudReady, of course.
How CloudReady Monitors Mail Queue Performance
CloudReady has invested considerable time and effort in developing our email sensors to solve just this problem. All of the CloudReady email sensors, including the Exchange Online, Hosted Exchange, Outlook Web App, IMAP, Gmail, and On-Premise Email (MAPI) sensors, send synthetic emails between the mail platform and our auto-responders. We knew from the beginning that testing email delivery, reliability, and performance was critical to our customers’ confidence in the cloud.
How Does It Work?
All of our CloudReady sensors exercise the entire email system by sending a small test message from the sensor to our cloud auto-responders. When the sensor is browser-based, Gmail or Outlook Web App (OWA) for example, the sensor will navigate through the product to generate a test message just like a real user would — opening dialogs, hitting compose buttons, etc. Exoprise emulates a real user utilizing the product end-to-end and monitoring the whole experience. For reliable and accurate monitoring, a simple API test won’t get the job done.
Once an email is generated, our sensors confirm that the message was sent by checking the appropriate folders and headers on the outbound/sent side. Then we start to wait for the reply. Additionally, for many of our sensors, we set up mailbox rules to hide our synthetic emails. We do this so that customers can re-use existing accounts for monitoring email systems and get maximum coverage across their mailbox servers or Database Availability Groups (DAGs).
Next, we wait for the reply from our auto-responders so that we can parse the timestamps, the MIME and extended header data, and report on the success or failure of the transmission. Our sensors use an exponential back-off algorithm and will wait up to approximately 3 minutes for a reply. We figure that if the systems and layers can’t get a small email out and back within 3 minutes, then it’s going to be too slow with real content, and we report a failure. We’ve tested these thresholds with our customers and our crowd-powered data (see below).
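To make the waiting step concrete, here is a minimal sketch of polling with exponential back-off. It is not CloudReady’s actual code: the function name `wait_for_reply`, the `check_inbox` callback, the 1-second starting delay, and the doubling factor are all assumptions chosen for illustration. Only the roughly 3-minute ceiling comes from the description above.

```python
import time
from typing import Callable, Optional


def wait_for_reply(
    check_inbox: Callable[[], Optional[str]],   # hypothetical: returns the reply or None
    max_wait: float = 180.0,                    # roughly the 3-minute ceiling
    first_delay: float = 1.0,                   # assumed starting delay
    factor: float = 2.0,                        # assumed growth factor
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll check_inbox with growing delays; return the reply, or None on timeout."""
    deadline = clock() + max_wait
    delay = first_delay
    while True:
        reply = check_inbox()
        if reply is not None:
            return reply
        remaining = deadline - clock()
        if remaining <= 0:
            return None  # no reply in time: the caller reports a failure
        # Never sleep past the deadline, so the total wait stays bounded.
        sleep(min(delay, remaining))
        delay *= factor
```

Passing `sleep` and `clock` as parameters keeps the sketch testable without real waiting. With the assumed defaults, the mailbox is checked after 1, 2, 4, 8 seconds and so on until the ceiling is reached.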
If CloudReady is to measure the timings of interactions and round-trip times of emails, a good source of synchronized time is required. Unfortunately, CloudReady can’t guarantee that a customer system has a reliable clock or that it is kept time-synchronized. Therefore, a CloudReady installation has to periodically synchronize its own concept of time to our servers and use that time source for its measurements.
We use this time synchronization across many of our sensors, so it’s well worth the effort for improved accuracy.
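One common way to synchronize against a remote time source is to estimate the offset between the local clock and the server clock from a single request and reply. The sketch below assumes the network delay is roughly symmetric, as NTP-style estimates do. The post does not say how CloudReady actually does it, and `fetch_server_time` is a hypothetical callback standing in for whatever call returns the server’s timestamp.

```python
import time
from typing import Callable


def estimate_offset(
    fetch_server_time: Callable[[], float],          # hypothetical: server time in epoch seconds
    local_clock: Callable[[], float] = time.time,
) -> float:
    """Return the seconds to add to the local clock to approximate server time."""
    t0 = local_clock()                 # local time just before the request
    server_time = fetch_server_time()  # server's own timestamp
    t1 = local_clock()                 # local time just after the reply
    # Assumption: the server stamped its time halfway through the round trip.
    return server_time - (t0 + t1) / 2.0


def corrected_now(offset: float, local_clock: Callable[[], float] = time.time) -> float:
    """Local time adjusted by a previously estimated offset."""
    return local_clock() + offset
```

Measurements would then be taken with `corrected_now(offset)` rather than the raw system clock, and the offset would be re-estimated periodically to absorb drift.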
Exoprise uses Haraka, a Node.js-based SMTP server, along with a series of custom plugins, to process its messages. You can read more about Haraka here. Because it is Node.js-based, it is event-driven, fast, and highly scalable. Our auto-responders process hundreds of thousands of inbound and outbound messages each day, typically in under 10 milliseconds per message.
Continuous measurements: Inbound, Outbound, and Transport Times
CloudReady continuously captures a number of metrics related to Message Queue performance and we are expanding this data-set all the time:
- MTA Outbound Time
The time it takes to send an email to our CloudReady auto-responders. This metric measures outbound queue and transfer performance. MTA stands for message-transfer-agent and is responsible for the transmission of the email between servers.
- MTA Transport Time
The time it takes for a reply to reach the inbound MTA servers. This metric measures inbound transport performance and is a subset of the MTA Inbound time as reported by processing servers.
- MTA Inbound Time
The total time it takes for a reply from our auto-responders to reach the inbox. This metric measures inbound queue performance.
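The three metrics above can be thought of as differences between timestamps collected along the round trip. The sketch below is an illustration under assumptions, not CloudReady’s implementation: the `RoundTrip` fields and the exact points at which each timestamp is taken are hypothetical, chosen so that Transport Time is a subset of Inbound Time as described.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


@dataclass
class RoundTrip:
    """Hypothetical timestamps for one synthetic message and its reply."""
    sent_at: datetime                # sensor sends the test message
    responder_received_at: datetime  # auto-responder accepts it
    reply_sent_at: datetime          # auto-responder sends the reply
    inbound_mta_at: datetime         # reply reaches the inbound MTA servers
    inbox_at: datetime               # reply lands in the inbox


def mta_metrics(rt: RoundTrip) -> Dict[str, float]:
    """Return the three MTA timings in seconds."""
    return {
        "outbound": (rt.responder_received_at - rt.sent_at).total_seconds(),
        "transport": (rt.inbound_mta_at - rt.reply_sent_at).total_seconds(),
        "inbound": (rt.inbox_at - rt.reply_sent_at).total_seconds(),
    }
```

Because transport and inbound share the same starting point in this sketch, the gap between them is the time the reply spent between the inbound MTA servers and the inbox.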
How to Interpret the Results
Let’s look at each one of them in turn:
- MTA Outbound Time
A large MTA Outbound Time likely points to problems in your outbound SMTP filters or in your provider’s or on-premise Exchange mail queues.
- MTA Inbound Time
When you have a large MTA Inbound Time but no correspondingly large MTA Transport Time, you have a stuck inbound mail processing queue. If it’s an Exchange or email environment within your control, then you should consider increasing the resources for your inbound queues.
- MTA Transport Time
When you have a large MTA Transport Time, you have a problem with your spam or antivirus (AV) processing. Typically, this is the culprit and the source of slow mail replies. When you see large MTA Transport Times, upwards of 3 or 4 minutes to process a small 900-byte message, you can expect problems with email delivery in the future. Larger messages with attachments will take even longer to process.
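The rules of thumb above can be written down as a small decision function. This is only a sketch: the name `diagnose` and the 60-second `slow_after` threshold are assumptions for illustration, not values CloudReady publishes, and a real threshold should come from your own baselines.

```python
from typing import List, Tuple


def diagnose(
    outbound: float,
    transport: float,
    inbound: float,
    slow_after: float = 60.0,  # assumed threshold in seconds
) -> List[Tuple[str, str]]:
    """Map the three MTA timings (seconds) to likely problem areas."""
    findings: List[Tuple[str, str]] = []
    if outbound > slow_after:
        findings.append(("outbound", "check outbound SMTP filters and mail queues"))
    if transport > slow_after:
        # Transport time is part of inbound time, so test it first.
        findings.append(("transport", "check spam/AV processing"))
    elif inbound > slow_after:
        # Inbound is slow but transport is not: the delay is after the MTA.
        findings.append(("inbound", "inbound mail processing queue may be stuck"))
    return findings
```

Checking transport before inbound matters because a slow transport leg always inflates the inbound figure as well, so flagging both would point at the wrong layer.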
Crowd-powered Like the Rest of Our Statistics
Just like ALL of the metrics that Exoprise CloudReady collects, we crowd-source our MTA statistics. That way our customers can understand what is typical for inbound and outbound mail processing. Being able to do this comparison across on-premise Exchange servers and Office 365 environments is very helpful in evaluating a migration to different SMTP or SPAM services.
We have had customers come to us and say “we thought our email was slow, we just didn’t realize how bad it was until we could compare it to other systems”. That is the power of CloudReady crowd-powered monitoring.
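As a rough idea of what such a comparison involves, the sketch below computes where one measurement falls within a set of crowd measurements. The function `percentile_rank` and its inputs are hypothetical. How CloudReady actually aggregates crowd data is not described here.

```python
from bisect import bisect_right
from typing import Sequence


def percentile_rank(value: float, crowd: Sequence[float]) -> float:
    """Percentage of crowd measurements that are less than or equal to value."""
    if not crowd:
        raise ValueError("crowd must contain at least one measurement")
    ordered = sorted(crowd)
    return 100.0 * bisect_right(ordered, value) / len(ordered)
```

A high rank for a timing metric means most other environments deliver faster than yours.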
Find out if you have an Office 365 mail queue problem today
CloudReady is a simple and powerful way for customers to get insight into and monitor their critical email flow. Don’t wait until after an outage to know whether or not you received that crucial proposal, purchase order, or support request.