Are we putting too many eggs in digital baskets?

On Monday 20th October, AWS US-EAST-1 suffered issues that had a significant impact on a large number of online services.
AWS stands for Amazon Web Services, a cloud hosting platform owned and operated by Amazon, Jeff Bezos's massive online business empire. AWS sells cloud computing services to anyone who wants to buy them, and many businesses use them to host and run their operations. That is why so many business websites, portals and online services were disrupted when US-EAST-1 (AWS's data centre region in Northern Virginia) suffered an outage.
This is a worry, because these services are sold to businesses as having multilayer resilience, meaning they should not have any single point of failure: if one component fails, it shouldn't knock out the entire service.
What happened with AWS and why did it have such a wide impact?
Monday's outage impacted more than a thousand businesses and millions of service users around the world. Most frustrating of all, it came down to something simple that should NOT have happened: a DNS (Domain Name System) error. This is not the first time this has happened, and it won't be the last.
DNS (Domain Name System) is how computers work out where to find each other and other services online. It allows a human to enter something easy to understand, like a website address (for example our website, www.thesilvercloudbusiness.com), and the DNS server then tells your computer how to find the website or service by translating the name into an IP address. Your computer can then work out the route across the internet to connect to the site.
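For the technically curious, the name-to-address translation described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library's `socket.getaddrinfo`; the `resolve` helper and its `lookup` parameter are our own names for this example, not part of any product.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses a hostname resolves to.

    `lookup` defaults to the operating system's resolver and can be swapped
    for a stand-in function (an assumption made here so the example can be
    tried without a network connection).
    """
    results = lookup(hostname, None)
    # Each result is (family, type, proto, canonname, sockaddr);
    # the IP address is the first element of sockaddr.
    return sorted({info[4][0] for info in results})


if __name__ == "__main__":
    try:
        print(resolve("www.thesilvercloudbusiness.com"))
    except OSError as error:
        # This is roughly what an application sees when DNS is broken:
        # the name simply cannot be turned into an address.
        print(f"Lookup failed: {error}")
```

When DNS is unavailable or returns the wrong answer, every application that relies on this step fails in the same way, even though the website or service itself may be running perfectly well.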
DNS issues are usually down to one of two things: a service failure or human error. Either can have catastrophic consequences, and unfortunately these issues are difficult to mitigate.
Now you may ask, and it's a VERY sensible question:
"If it is something as simple as the DNS address being wrong, why can't they fix it quickly? Why did the outage last so long?"
Well, it is not that simple. We are still waiting to find out exactly what happened with AWS, but consider what happened when Microsoft suffered a DNS misconfiguration in one of their data centres: someone accidentally entered the wrong address in a DNS record on the server, and this locked them out of their own system, because every time they tried to reconnect to it, DNS pointed them to the wrong address!
The other issue is that they were fighting the way DNS works. DNS replicates its records to other DNS servers, so once the incorrect address was in the system, it was copied to those servers too. Not only were they locked out of their own system because it was reporting the wrong address, but this wrong information also replicated around the world, impacting everyone trying to connect.
To fix the issue, they first needed to work out what it was (remember, they couldn't get into the system because of the wrong address, which made diagnosis harder). Then they had to get someone who could physically access the DNS server to correct the record. Finally, they had to wait whilst the update replicated around the world to all the other DNS servers before things started to return to normal.
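The waiting step can be illustrated with a small simulation. This is a deliberately simplified model, not a description of how Microsoft's or AWS's systems are built: it assumes each downstream DNS server keeps a copy of a record for a fixed "time to live" (TTL) before asking for a fresh one, and the class name, hostnames, addresses and the 300-second TTL are all made up for the example.

```python
class CachingResolver:
    """A toy downstream DNS server that remembers answers until they expire."""

    def __init__(self, source_records, ttl_seconds):
        self.source_records = source_records  # the "master" records (name -> address)
        self.ttl_seconds = ttl_seconds        # how long a copied answer is trusted
        self.cache = {}                       # name -> (address, expiry time)

    def lookup(self, name, now):
        cached = self.cache.get(name)
        if cached is not None and now < cached[1]:
            # The copy has not expired yet, so it is served even if it is wrong.
            return cached[0]
        # Copy expired or missing: fetch a fresh answer from the source.
        address = self.source_records[name]
        self.cache[name] = (address, now + self.ttl_seconds)
        return address


if __name__ == "__main__":
    # Someone enters the wrong address in the master record.
    records = {"service.example": "203.0.113.99"}
    downstream = CachingResolver(records, ttl_seconds=300)

    print(downstream.lookup("service.example", now=0))    # wrong address is copied

    # The mistake is corrected at the source one minute later...
    records["service.example"] = "203.0.113.10"
    print(downstream.lookup("service.example", now=60))   # ...but the old copy is still served
    print(downstream.lookup("service.example", now=300))  # copy expired, correct address returned
```

In this model the correction takes effect only once each server's copy has expired, which is one reason recovery lags behind the actual fix.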
Hopefully the above example explains why a simple issue can take a while to resolve.
The next issue we face is that, unfortunately, the global cloud market is dominated by just three providers:
- Amazon Web Services
- Microsoft Azure
- Google Cloud
All three are US companies, and all three have experienced significant service outages that have impacted millions. What this outage has highlighted once again is that there are a lot of digital eggs in one of three baskets, which makes incidents of this kind harder and harder to mitigate.
So what can be done?
The only real choice consumers and smaller businesses have is to mitigate this risk as much as possible by making sure that your critical services are not all provided by businesses that run on the same cloud hosting provider.
For example, when there was an outage with Microsoft 365 a couple of years ago (again down to a DNS misconfiguration), a lot of businesses couldn't access their email. However, businesses that were using a mail security service such as Barracuda could still see inbound email sitting in Barracuda's service, waiting to be delivered to Microsoft, because that service did not run on Microsoft Azure. These businesses could therefore read inbound messages and react to them before Microsoft fixed the issue.
Another way to mitigate service outages is to use an online backup that is hosted away from your primary cloud provider. It is easy enough to ask your service providers which cloud provider they use, so you can map out where your data resides. That way you again avoid putting all your digital eggs in one basket, and if an outage is prolonged you can restore a copy of your data elsewhere.
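If you keep even a simple list of your services and the cloud provider behind each one, spotting the overloaded baskets becomes straightforward. The sketch below is one possible way to do it; the service names and the provider assigned to each are invented for illustration, and the function names are our own.

```python
from collections import defaultdict


def group_by_provider(services):
    """Group service names by the cloud provider that hosts them."""
    grouped = defaultdict(list)
    for service, provider in services.items():
        grouped[provider].append(service)
    return dict(grouped)


def concentration_risks(services, critical):
    """Return the providers that host more than one critical service."""
    critical_only = {name: provider for name, provider in services.items() if name in critical}
    grouped = group_by_provider(critical_only)
    return {provider: sorted(names) for provider, names in grouped.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical inventory: service -> underlying cloud provider.
    inventory = {
        "email": "Microsoft Azure",
        "online backup": "Amazon Web Services",
        "CRM": "Amazon Web Services",
        "website": "Google Cloud",
    }
    critical_services = {"email", "online backup", "CRM"}

    for provider, names in concentration_risks(inventory, critical_services).items():
        print(f"{provider} hosts several critical services: {', '.join(names)}")
```

In this made-up inventory, the backup and the CRM system both sit with the same provider, so an outage there would take out the live system and the backup at the same time.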
The long and the short of it is that we live in an online world, but this world is fragile, so it pays to spread the services you consume across different providers where possible. If you would like help identifying where your services are located and who provides them, right through to the back-end cloud service provider, call us on 01722 411 999 and we can help you navigate the service layers that make up your IT.