Microsoft is making it both convenient and cost-effective to use cloud-based Office 365 and Azure services rather than locally hosted Office, SharePoint, and other on-premises services. However, outages can occur for a variety of reasons: network issues, both local and Internet-based, data center problems, malicious activity, and environmental conditions can all cause them.
Fortunately, in most cases, outages are short-lived or localized to a specific geographic region. Performance problems are often transient or caused by local infrastructure or ISPs. However, that’s not always the case. There were several incidents in 2018 that either affected thousands of users in multiple geographies, or lasted for several hours, or both.
Among the significant Office 365 incidents and outages of 2018 were:
April 2018 Email/Exchange Outage
On April 6th, 2018, there was a widespread email outage across most of the European data centers that persisted for a day or two, with many customers in the UK reporting that they were unable to access their mailboxes within Exchange. Customers reported:
Users attempting to access Office 365 email are reportedly being greeted with an ‘AADSTS90033’ error message, alongside the unhelpful warning: “Service is temporarily unavailable. Please retry later.”
The AADSTS90033 error message is typically displayed when a user cannot get a token from Azure for the services they need to access.
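For teams that script their own sign-in checks, the error above can be handled as a retryable condition. The following is a minimal sketch, not Microsoft's or Exoprise's implementation: the function names, the RuntimeError convention, and the set of codes treated as transient are all assumptions made for illustration. Only the 90033 code and its "Please retry later" wording come from the reports quoted above.

```python
import re
import time
from typing import Callable, Optional

# Assumption: only 90033 is treated as transient here, based on the
# "Please retry later" wording quoted in the article.
TRANSIENT_AADSTS_CODES = {90033}

_AADSTS_PATTERN = re.compile(r"AADSTS(\d+)")


def extract_aadsts_code(message: str) -> Optional[int]:
    """Return the numeric AADSTS code found in an error message, if any."""
    match = _AADSTS_PATTERN.search(message)
    return int(match.group(1)) if match else None


def is_transient(message: str) -> bool:
    """True when the message carries a code we have chosen to retry."""
    return extract_aadsts_code(message) in TRANSIENT_AADSTS_CODES


def call_with_backoff(
    request_token: Callable[[], str],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call a hypothetical token-requesting function, retrying transient errors.

    request_token is assumed to raise RuntimeError with the AADSTS text
    in its message when sign-in fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return request_token()
        except RuntimeError as err:
            is_last_attempt = attempt == attempts - 1
            if is_last_attempt or not is_transient(str(err)):
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Retrying only helps with short blips. During an outage that lasts hours, as this one did, the retries will run out and the final error is re-raised to the caller.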
Customers also reported the inability to sign into the Azure Management Portal as well as an inability to access the typically spry Azure Status Page. Many users in these regions were not able to log into their accounts at all.
All of the global, multinational Exoprise customers were given advance notice of the email outage and of any access challenges, and could see that it was mostly localized to access from European regions. They were also able to know when the service was definitively restored, rather than relying on the early indications in the Office 365 service communications status messages.
Azure Heat/Cooling Outage in Europe
While not directly an Office 365 outage, Azure in Northern Europe suffered an 11-hour outage on or around June 19th and 20th of 2018 due to problems with temperature control in its data centers.
“A subset of customers using Virtual Machines, Storage, SQL Database, Key Vault, App Service, Site Recovery, Automation, Service Bus, Event Hubs, Data Factory, Backup, API management, Log Analytics, Application Insights, Azure Batch, Azure Search, Redis Cache, Media Services, IoT Hub, Stream Analytics, Power BI, Azure Monitor, Azure Cosmos DB or Logic Apps in North Europe may experience connection failures when trying to access resources hosted in the region,” Microsoft said in its announcement.
After identifying the problem, Microsoft proactively shut down access to the subsystems to prevent further damage.
Lightning strike causes cooling damage
On September 4th, at about 4AM CDT, the Office 365 installation served out of the San Antonio data center experienced a lightning strike. This strike interrupted the data center cooling systems and brought down a number of Office 365 services, including Exchange, SharePoint, Microsoft Teams, and Azure Active Directory Services.
As a result, multiple Office 365 services became unavailable, both regionally and beyond. It took all morning to get most of the services back online. This had to be a difficult one for Microsoft to swallow, because it involved a single point of failure. It speaks to the need to consider resiliency as a core application requirement in the cloud.
Exoprise reported on this outage extensively shortly after it happened.
Unknown error affects users in the US and Europe
Microsoft restored service after an Office 365 outage that affected thousands of users in the US, the UK, and Europe. On October 30th, numerous users reported email login issues. These issues involved repeated credential prompts and users being unable to log in using the Outlook client.
Intermittent issues continued through the week, with the last of them cleared up three days later.
Global Multi-factor Authentication Outages
Office 365 and Azure users were locked out after a global multi-factor authentication outage. On November 19th, Microsoft’s multi-factor authentication services went down across the globe, blocking users who are required to sign in with a second layer of authentication.
The official Office 365 status page confirmed that an authentication issue prevented users from accessing the service. Authentication services started to come back later in the day, but some users still had issues for the entire day.
As an increasing number of enterprises count on Office 365 to power their business, they need a better understanding of its availability and performance. Significant incidents and outages with Office 365 can drastically affect your business results. That mattered less when everyone had a local copy of Office, but when the entire enterprise is using Office 365 from the cloud, it is very important. Your enterprise is either functioning normally or down entirely.
Monitoring your Office 365 installation is a critical first step in getting the information you need on your enterprise applications in real time. You can’t effectively manage a vitally important part of your application infrastructure unless you know how it’s performing. And early insights on availability will help you prepare for outages.
Monitoring can give you an important first notification of an issue and tell you which services are affected, rather than leaving you waiting hours to see a tweet from Microsoft. You can then be proactive, notifying your users and invoking contingency plans, rather than being inundated with user tickets.
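To make the idea of a first notification concrete, here is a small sketch of one common alerting approach: raise an alert only after several consecutive failed checks for a service, and announce recovery once a check succeeds again. This is an illustration under stated assumptions, not a description of how any particular monitoring product works. The class name, the threshold, and the message formats are hypothetical, and the checks themselves (for example, a scripted sign-in or mailbox access test) are left to the caller.

```python
from typing import Dict, List, Optional, Set


class OutageDetector:
    """Tracks per-service check results and emits alert messages.

    Hypothetical helper: an outage is declared after `failure_threshold`
    consecutive failed checks, which filters out one-off transient blips.
    """

    def __init__(self, failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self._streaks: Dict[str, int] = {}
        self._down: Set[str] = set()

    def record(self, service: str, ok: bool) -> Optional[str]:
        """Record one check result; return a message only on a state change."""
        if ok:
            self._streaks[service] = 0
            if service in self._down:
                self._down.remove(service)
                return f"RECOVERED: {service}"
            return None

        self._streaks[service] = self._streaks.get(service, 0) + 1
        streak = self._streaks[service]
        if streak >= self.failure_threshold and service not in self._down:
            self._down.add(service)
            return f"OUTAGE: {service} failed {streak} checks in a row"
        return None

    def affected_services(self) -> List[str]:
        """Services currently considered down, sorted by name."""
        return sorted(self._down)
```

A scheduler would call `record("Exchange", ok)` after each check and forward any returned message to email, chat, or a ticketing system. A lower threshold alerts sooner but produces more false alarms from transient network problems.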
Exoprise CloudReady provides monitoring tools and alarm notifications for many different parts of Office 365, including SharePoint, Exchange, and Skype. You can understand both your Office 365 availability and performance in real time.