Are we putting too many eggs in a digital basket?

On Monday 20th October, the AWS US-EAST-1 region suffered an outage that had a significant impact on a large number of online services.

AWS stands for Amazon Web Services, a cloud hosting platform owned and operated by Amazon, Jeff Bezos's massive online business empire. AWS sells cloud compute services to anyone who wants to buy them, and many businesses use it to host and run their operations. That is why so many business websites, portals and online services were disrupted when AWS US-EAST-1 suffered an outage.

This is a worry because these services are sold to businesses as having multilayer resilience, meaning they should not have any single point of failure: if one component fails, it shouldn't knock out the entire service.

What happened with AWS and why did it have such a wide impact?

Monday's outage impacted more than a thousand businesses and millions of service users around the world. The most frustrating thing of all is that it was down to something simple that should NOT have happened: a DNS (Domain Name System) error. This is not the first time this has happened and it won't be the last either.

DNS (Domain Name System) is how computers work out how to reach each other and other services online. It lets a human enter something easy to understand, such as a website address (like our website, www.thesilvercloudbusiness.com). The DNS server then translates that name into an IP address, which your computer uses to work out the route across the internet to the site.
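To make the name-to-address step concrete, here is a minimal Python sketch of a DNS lookup. It uses the standard library's socket.getaddrinfo; the resolve function and its lookup parameter are names made up for this illustration, and the addresses returned will depend on your network and on the DNS records at the time you run it.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses DNS gives for hostname.

    `lookup` is a hypothetical hook added for this sketch so the function
    can be tried without a network connection.
    """
    # getaddrinfo returns tuples of (family, type, proto, canonname, sockaddr);
    # the IP address is the first element of sockaddr.
    results = lookup(hostname, None)
    return sorted({info[4][0] for info in results})


if __name__ == "__main__":
    try:
        print(resolve("www.thesilvercloudbusiness.com"))
    except OSError as err:
        # A failed lookup is what users experience during a DNS outage.
        print(f"DNS lookup failed: {err}")
```

If the lookup fails, the computer never learns the address, so the site appears to be down even when the servers behind it are running normally.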

DNS issues usually come down to one of two things: a service failure or human error. Either can have catastrophic consequences, and unfortunately both are difficult to mitigate.

Now you may ask, and it's a VERY sensible question:

"If it is something as simple as the DNS address being wrong, why can't they fix it quickly? Why was the outage so long?"

Well, it is not that simple. We are still waiting to find out exactly what happened with AWS, but consider what happened when Microsoft suffered a DNS misconfiguration in one of its data centres in January 2023. Someone accidentally entered the wrong address in a DNS record on the server, which meant the engineers locked themselves out of the system: when they tried to reconnect, DNS pointed them to the wrong address!

The other issue is that they were fighting the way DNS works. DNS servers replicate their records to other DNS servers, so once the incorrect address was in the system, it spread. Not only were the engineers locked out of their own system because it was reporting the wrong address, but the bad information also replicated around the world, impacting everyone trying to connect.

To fix the issue, they first needed to work out what it was (remember, they couldn't get into the system because of the wrong address, which made diagnosis harder). Then they had to get someone who could physically access the DNS server to correct the record. Finally, they had to wait whilst the update replicated around the world to all the other DNS servers before things started to return to normal.

The above example hopefully explains why it takes a while for a simple issue to get resolved.

The next issue we are faced with is that, unfortunately, the global cloud market is dominated by just three providers:

Amazon Web Services
Microsoft Azure
Google Cloud

All of these are US companies, and all of them have experienced significant service outages that have impacted millions. What this outage has highlighted once again is that a lot of digital eggs sit in one of three baskets, making it harder and harder to mitigate incidents of this kind.

So what can be done?

For larger organisations that went offline because of the AWS outage, it is unforgivable really. Not every service provider or company using AWS went down: those that had built in their own resilience by spreading their services across multiple sites stayed online. Whilst this increases costs, it also increases resilience, providing a better service to customers.

For businesses and organisations that consume cloud-based services provided by others, such as cloud accounting, cloud-based HR or stock order processing, the only real choice is to mitigate the risk as much as possible. Make sure that your critical services are not all provided by businesses working from the same cloud hosting provider with single-site exposure.

For example, a lot of businesses couldn't access their email after the Microsoft outage of January 2023. However, businesses using a mail security service such as Barracuda could still see inbound email sitting in Barracuda's service, waiting to be delivered to Microsoft, because that service did not run on Microsoft Azure. These businesses could read inbound messages and react to them before Microsoft fixed the issue.

Another way to mitigate prolonged online service outages is to utilise data backup that is hosted away from your primary cloud provider and recover your data to a temporary work store so that your business or organisation can keep working.

It is easy enough to ask your service providers which cloud provider they use, so you can map out where your data resides. Ask them about their service resilience too, and make sure it is spread across multiple regions. That way you are not putting all your digital eggs in one basket, and if an outage is prolonged you can access a copy of your data by restoring it elsewhere.
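Once you have those answers, the mapping exercise can be as simple as a small script. The Python sketch below is only an illustration: every service name, provider and region in it is made up, and the function names are our own. It groups services that depend entirely on a single provider region and flags the regions that several of them share.

```python
from collections import defaultdict

# Hypothetical inventory: every service, provider and region here is
# invented for illustration; fill it in from your suppliers' answers.
SERVICES = [
    {"name": "Accounting", "provider": "AWS", "regions": ["us-east-1"]},
    {"name": "HR portal", "provider": "AWS", "regions": ["us-east-1", "eu-west-2"]},
    {"name": "Email", "provider": "Microsoft Azure", "regions": ["uksouth"]},
    {"name": "Stock orders", "provider": "AWS", "regions": ["us-east-1"]},
]


def single_region_exposure(services):
    """Group services that depend entirely on one provider region."""
    exposure = defaultdict(list)
    for service in services:
        regions = service["regions"]
        if len(regions) == 1:
            exposure[(service["provider"], regions[0])].append(service["name"])
    return dict(exposure)


def shared_baskets(services):
    """Keep only the regions whose failure would take out two or more services."""
    exposure = single_region_exposure(services)
    return {key: names for key, names in exposure.items() if len(names) > 1}


if __name__ == "__main__":
    for (provider, region), names in shared_baskets(SERVICES).items():
        print(f"{provider} {region}: {', '.join(names)}")
```

With the made-up inventory above, the script flags the accounting and stock order services as sharing one AWS region, which is exactly the kind of basket you would want to discuss with those suppliers.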

The long and the short of it is that we live in an online world, but this world is fragile, so it pays to ensure the services you subscribe to have geographic resilience and to spread the services you consume across different providers where possible.

If you would like help identifying where your cloud-based services are located, call us on 01722 411 999. We can help you navigate the service layers that make up your IT and work out where they reside and how resilient they are.

Publish Date: Oct 22, 2025