A 99.9% service provider SLA doesn’t mean you are off the hook for SaaS availability
One of the biggest mistakes IT teams make as they move to the cloud is assuming that the application provider now owns end-to-end service availability. When they see the 99.9% Service Level Agreement (SLA) guarantees offered by Microsoft and other cloud service providers, they start believing that they are off the hook. The app provider is guaranteeing the availability, and if anything happens the app provider will be held to account. Right?
Not so fast
There’s a lot more to it than that. First, let’s look at what that app provider SLA does and does not mean. 99.9% uptime corresponds to roughly 9 hours of downtime per year (0.1% of the 8,760 hours in a year). That’s pretty good, and you may be thinking that it is better than you were getting when you were running the app internally. But it’s important to realize that this guarantee covers only the availability of the service provider’s own infrastructure. The provider makes no promises about the infrastructure between your users and its datacenters (a.k.a. the service delivery chain).
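As a quick check on that arithmetic, here is a minimal sketch that converts an availability percentage into the annual downtime it allows. The function name `annual_downtime_hours` is an illustrative choice, and the calculation assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours per year a service can be down and still meet the given availability."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


for pct in (99.9, 99.95, 99.99):
    print(f"{pct}% -> {annual_downtime_hours(pct):.1f} hours/year")
```

For 99.9% this gives about 8.8 hours per year, which is the "roughly 9 hours" quoted above.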
The actual availability you can expect at your user locations is a function of the availability of the various nodes in that service delivery chain connecting your users to the cloud. To keep things simple, we can look at it like this.
First, let’s estimate the availability of each of these nodes.
| Tier | Component | Estimated Availability | Annual Downtime Risk (hours) | Notes |
| --- | --- | --- | --- | --- |
| App | Office 365, Salesforce.com, Box, etc. SLA | 99.90% | 8.8 | Service level commitment offered by Microsoft and several other cloud service providers. For Microsoft, 99.9% is the level below which a 25% credit is issued. A 100% credit only happens when availability drops below 95%. |
| Service Delivery Chain | Internet Backbone/IXPs | 99.99% | 0.9 | The backbone internet fabric is highly redundant. True outages are rare and often short. |
| | ISP | 99.95% | 4.4 | Being a bit generous here. It’s not uncommon for ISPs to have outages. Time Warner Cable, for example, had an extended outage in 2014. |
| | ADFS/SSO | 99.80% | 17.5 | Most enterprises use ADFS or an SSO solution like Okta. Let’s assume for this exercise that the solution operates at a service level close to that of the app itself. |
| | | 99.80% | 17.5 | How reliable is your internet access? If you have a single fiber optic link to a single ISP, you are likely to experience some amount of downtime. |
| | DNS, Proxies, Firewalls, Filters etc. | 99.80% | 17.5 | Your apps depend on a number of network services. For simplicity I group these and assign an overall service availability estimate. |
| | LAN | 99.90% | 8.8 | How robust is your own network infrastructure? Did your network have more than 9 hours of downtime last year? |
| User | Real App Availability | 99.14% | 75.1 | Total annual service downtime risk (hours) |
For serial systems like this, where every component must be up for the service to work, you calculate predicted availability by multiplying the availabilities of the individual components:
As = A1 × A2 × A3 × … × An
Using the availability numbers above for the components, we would predict that the real availability of the app for your users would be 99.14%. That may sound good until you realize that it translates into a whopping 75 hours of downtime per year! Think about your environment. How highly available is your ISP or ADFS implementation? How many hours of downtime did they have last year? The edgeblog and Wikipedia both have excellent articles to help you estimate availability numbers for your environment.
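The calculation can be reproduced with a short sketch using the estimates from the table. The names `COMPONENTS` and `serial_availability` are illustrative, the "Internet access" label is an assumption for the row the table leaves unnamed, and the model assumes the components fail independently of one another.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

# (component, estimated availability in percent), taken from the table above.
# The table does not name the fifth component; "Internet access" is a label
# assumed here from that row's note.
COMPONENTS = [
    ("App SLA", 99.90),
    ("Internet backbone/IXPs", 99.99),
    ("ISP", 99.95),
    ("ADFS/SSO", 99.80),
    ("Internet access", 99.80),
    ("DNS, proxies, firewalls, filters", 99.80),
    ("LAN", 99.90),
]


def serial_availability(percentages):
    """Availability of a chain in which every component must be up at once."""
    return math.prod(p / 100 for p in percentages)


end_to_end = serial_availability(pct for _, pct in COMPONENTS)
downtime_hours = (1 - end_to_end) * HOURS_PER_YEAR

print(f"End-to-end availability: {end_to_end:.2%}")           # 99.14%
print(f"Annual downtime risk: {downtime_hours:.1f} hours")    # 75.1 hours
```

Swapping in your own estimates shows quickly which link in the chain contributes the most downtime risk.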
Take control of SaaS availability for your users
Every year organizations lose hundreds of thousands of dollars to application downtime, and millions of dollars have been spent trying to build highly available datacenters. That’s been one of the draws of SaaS – the responsibility and cost of maintaining highly available datacenters moves to the provider. And yet the service availability risks remain, and IT teams are still on the hook to manage them.
The good news is that even though much of the service delivery chain is outside your direct control, you can do something that will have a very meaningful effect on this number – actively monitor as much of it as you can.
Early detection is the key to early resolution. You don’t want to wait until users start trying to access the apps in the morning to start remediation. You want to be constantly monitoring the cloud service, from your user points of access, 24x7x365, so you can find and fix problems before they impact users. CloudReady’s synthetic transaction monitoring lets you do this, testing service availability and performance round the clock at intervals ranging from 1 to 20 minutes.
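To illustrate the general idea of interval-based synthetic checks, here is a minimal, generic sketch. It is not CloudReady’s implementation: `http_check` and `run_probes` are hypothetical helpers, and the demo uses canned results instead of a real endpoint so it runs anywhere.

```python
import time
import urllib.request
from typing import Callable


def http_check(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except OSError:  # URLError, HTTPError and socket timeouts derive from OSError
        return False


def run_probes(check: Callable[[], bool], interval_s: float, count: int,
               sleep: Callable[[float], None] = time.sleep) -> float:
    """Run `count` checks spaced `interval_s` seconds apart.

    Returns the fraction of checks that passed.
    """
    passed = 0
    for i in range(count):
        if check():
            passed += 1
        else:
            print(f"probe {i + 1} failed at {time.strftime('%H:%M:%S')}")
        if i < count - 1:
            sleep(interval_s)
    return passed / count


# Demo with canned results: the third of four probes fails.
canned = iter([True, True, False, True])
measured = run_probes(lambda: next(canned), interval_s=0, count=4)
print(f"measured availability: {measured:.0%}")
```

Against a real service you would pass something like `lambda: http_check(url)` with an interval of 60 to 1200 seconds, and run it from each user location rather than from a single central site.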
It doesn’t stop with early detection. A key challenge in managing SaaS availability is being able to identify the root cause quickly so you can focus your remediation efforts in the right place. Yes, you may have to contact a vendor support group, but pick the wrong one and you just added hours to the time to resolution. That’s where CloudReady’s crowd data aggregation and detailed network path diagnostics help you pinpoint problems inside or outside your network. Plus, with CloudReady’s data sharing capabilities, you can easily give your cloud and ISP vendors problem details as well as a view of the service delivery chain from inside your network.
So don’t assume your cloud provider SLAs have you covered. Manage your end-to-end service levels with CloudReady.