Seeing all these horror stories, I can't help but wonder when people are going to switch back to on prem, or at least some hybrid. It's pretty apparent we can't trust cloud providers...
I'm not so sure. Aren't we showing a bit of negativity bias here? We mostly hear about the negative events, while there are probably countless successful integrations with cloud providers.
Look at all the banks shifting their infra to the cloud. Haven't heard anything bad happen from that. And their uptime and business requirements are as stringent as anyone else's, and are further scrutinized by regulators and auditors.
Yes, if you'd compare all horror stories of people having huge downtimes because their on-prem system collapsed vs. people having an issue with their cloud provider then we'd very quickly realise again why the cloud has taken us by storm.
The pendulum will swing over cost, not over outages like these, because the cost challenges are real, while these outages are rare.
It won't go all the way back to on prem because cloud providers will slash their margins once they struggle to maintain growth, and in doing so they'll diminish the advantages of on-prem/hybrid setups for a lot of people, and so we'll find some equilibrium or other.
I'd expect more emphasis on hybrid between multiple clouds rather than cloud vs. on prem to happen first. E.g. layers to treat more parts of cloud services as commodities.
Full on prem is pretty bad for a large company because your company now depends on a single monopolistic service provider that is run as a cost center (i.e. the department managing your on prem stuff). Eventually that leads to a very bad engineer experience which leads to a massive opportunity cost for any tech company. It works for a while and then over time becomes more and more dysfunctional. Hybrid at least includes competition with cloud providers which makes things less dysfunctional.
Most places de facto depend on a single provider that becomes a cost center when using cloud providers too. While you can shop around, in effect it tends to get entrenched and get more and more dysfunctional as people try to exploit characteristics of your approvals process for using resources from your chosen provider. For a big company, you need a team to manage these resources - I've lost count of the number of recruiters who have contacted me just this year for roles involving management or architecture of cloud strategy etc. because they're building whole supporting organisations around a specific paradigm for hosting rather than building an organisation responsible for understanding and providing the best possible substrate for their applications.
I'm all for hybrid solutions, but mostly because it means you usually end up being able to cut the cost of on prem, colocation or managed servers even more relative to cloud setups because you can increase the utilisation rate (e.g. run your base load on cheap Hetzner servers while being able to spin up EC2 instances on short notice to take spikes or handle failures up to and including all of Hetzner falling off the face of the earth). Most larger managed providers today also offer cloud services and many also offer colo services, so you can often easily mix and match as long as you're conscious of egress pricing, which is often the biggest barrier to such setups today (especially if the big cloud providers are in the mix, though it's better than it was).
That said, I prefer not putting my eggs in one basket, and instead going the direction of putting in place layers to treat a multi-provider setup as one. In the past I've e.g. done zero downtime migrations between AWS->GCP->Hetzner that way, and also had systems split between actual on prem (as in a data room at our office), multiple colos, dedicated servers at Hetzner, and a couple of VM providers where everything was transparent to our engineers (they didn't need to know on what continent a service ran unless doing performance optimisation or reliability engineering). We prepped to tie in AWS resources too, but it never became cost effective for that company to do so vs. the prices of the other providers used - it would have been if we had sudden extreme load spikes, but it was a business where traffic was linked closely to physical capacity at restaurants and nightclubs, and so traffic was very predictable.
Which is funny because reduced cost was one of the major selling points of "cloud" way back when.
Most companies aren't actually at the scale where they need cloud. Cloud comes with the benefit of minimal to no need for ops/devops, high uptime (when nothing goes wrong lol), integrated DDoS protections, APIs your junior devs can string together, and the illusion of infinite scaling.
In reality, it turns out you do need someone to effectively do ops, you don't have 99.999% uptime, you don't need most or any of those APIs, and you didn't need that promise of scaling after all.
With DDoS protection, on-prem hosting or independent dedicated hosting might be more economical and practical.
On demand pricing is also a key benefit of cloud. Instead of needing to make a big initial spend on hosting infra you can pay only for what you need.
The other big benefit is auto scaling. In 2012 lots of companies would launch and be unable to meet traffic from a big push from reddit or HN or others. Today it’s rare to see a new product’s site go down due to a “hug of death.”
That's an argument for using cloud for highly spiky traffic, or for not doing actual on prem primarily.
If you do in-colo hosting with leased servers you've already cut the big initial spend drastically (e.g. deploying to a new colo for a past employer involved buying a pair of switches and some cables), though you're still usually tied in to a lengthy contract. Doing managed servers usually comes with little to no upfront cost (sometimes a low setup fee) and often month-by-month contracts.
With respect to the on demand pricing benefit vs. monthly, my experience is that the price differential is so steep vs. the cheaper managed hosting providers that it's only usually viable for loads that run less than ~6 hours a day. Most sites do not have a variable enough load to justify the engineering cost to use cloud to handle their normal day/night cycle that way - usually the variations are too small. Some do.
But note that this only pays for itself if your base load is not on a cloud setup, as the base load tends to dominate and the base load cost of most cloud setups is high enough not to be outweighed by the increased flexibility vs. a pure managed/colo/on prem setup.
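To put a rough number on that break-even point, here's a minimal sketch. The two prices are made up purely so the result lands on the ~6 hours mentioned above; they are not quotes from any real provider, and `break_even_hours_per_day` is just an illustrative helper name.

```python
def break_even_hours_per_day(monthly_fixed_price: float,
                             on_demand_hourly_price: float,
                             days_per_month: float = 30.0) -> float:
    """Hours per day at which an on-demand instance costs the same as a
    flat-rate monthly server. Below this, on-demand is the cheaper option."""
    if on_demand_hourly_price <= 0 or days_per_month <= 0:
        raise ValueError("prices and days must be positive")
    return monthly_fixed_price / (on_demand_hourly_price * days_per_month)


# Hypothetical prices, for illustration only.
MANAGED_SERVER_MONTHLY = 72.0   # flat monthly price of a managed server
CLOUD_INSTANCE_HOURLY = 0.40    # on-demand price of a comparable instance

hours = break_even_hours_per_day(MANAGED_SERVER_MONTHLY, CLOUD_INSTANCE_HOURLY)
print(f"on-demand only wins below ~{hours:.1f} hours/day")
```

Plug in your own prices; the steeper the differential, the fewer hours a day the on-demand instance can run before the monthly server wins.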
It gets worse for cloud: If looking for the most cost effective, they're not competing against a pure managed/colo/on prem setup, but against a hybrid setup which can auto-scale into the cloud but usually won't. If you e.g. deploy containers, all you need is to tie cloud instances automatically into an overlay network and feed data from your monitoring system into a tool that adjusts min/max instances for an autoscaling group in AWS when load exceeds a threshold, for example.
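A tool like that can be tiny. Here's a rough sketch of the decision part, with the thresholds, step size and all names (`BurstPolicy`, `desired_cloud_instances`, `apply_to_asg`) invented for illustration. The `autoscaling_client` argument is assumed to be a boto3 Auto Scaling client, i.e. `boto3.client("autoscaling")`; it's passed in rather than created here so the logic can be tried without AWS credentials. The overlay network and the monitoring feed are out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurstPolicy:
    # All values are hypothetical defaults, not recommendations.
    scale_up_at: float = 0.90       # fixed-capacity utilisation that triggers a burst
    scale_down_at: float = 0.70     # utilisation below which cloud instances are released
    step: int = 1                   # instances added or removed per decision
    max_cloud_instances: int = 10   # hard cap on the cloud side


def desired_cloud_instances(utilisation: float, current: int,
                            policy: BurstPolicy) -> int:
    """Decide how many cloud instances to run, given the utilisation of the
    fixed (on prem / managed) capacity as reported by monitoring."""
    if utilisation > policy.scale_up_at:
        return min(current + policy.step, policy.max_cloud_instances)
    if utilisation < policy.scale_down_at:
        return max(current - policy.step, 0)
    return current  # inside the dead band: leave things alone


def apply_to_asg(autoscaling_client, group_name: str, desired: int,
                 policy: BurstPolicy) -> None:
    """Push the decision to an AWS Auto Scaling group."""
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=desired,
        MaxSize=policy.max_cloud_instances,
        DesiredCapacity=desired,
    )


policy = BurstPolicy()
print(desired_cloud_instances(0.95, current=0, policy=policy))  # over threshold: add one
print(desired_cloud_instances(0.50, current=1, policy=policy))  # quiet again: release it
```

The gap between the two thresholds is there to stop the group flapping when load hovers around a single cut-off.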
For a typical "pure" on prem setup you'd usually aim for the daily peaks to rarely exceed 50% of on prem capacity. Maybe a bit more if you have a very predictable business, or even less if your traffic is very unpredictable.
But if you add the capability to that setup to spin up and tie in cloud instances to handle peaks, then you can push that to 90%+, or even above 100% - basically whichever number turns out most cost effective. Usually if you optimise that for cost, this means you'll end up with a setup which almost never spins up cloud instances, but which is now vastly cheaper relative to a pure cloud setup because it's gained the benefit of auto-scaling at a far higher load factor than would be safe in a pure on prem/managed setup.
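Here's a toy model of that comparison. Everything in it is an assumption made up for illustration: the load profile, the prices, the server counts, and the simplification that one cloud instance handles the same load as one fixed server. It only shows the shape of the argument, not real-world numbers.

```python
import math


def monthly_cost(hourly_load, servers, server_monthly, cloud_hourly):
    """Cost of `servers` fixed machines plus cloud instances for any overflow.

    hourly_load: load per hour over the month, in server-equivalents.
    Overflow above the fixed capacity is rounded up to whole cloud instances.
    """
    cloud_instance_hours = sum(
        math.ceil(max(0.0, load - servers)) for load in hourly_load
    )
    return servers * server_monthly + cloud_instance_hours * cloud_hourly


# Hypothetical month: 29 predictable days plus one day with a spike.
normal_day = [4] * 8 + [10] * 8 + [6] * 8   # daily peak of 10 server-equivalents
spike_day = [4] * 8 + [14] * 8 + [6] * 8    # unexpected peak of 14
month = normal_day * 29 + spike_day

SERVER_MONTHLY = 72.0   # hypothetical flat price per fixed server
CLOUD_HOURLY = 0.40     # hypothetical on-demand price per instance-hour

setups = [
    ("pure fixed, daily peak at 50%", 20),
    ("hybrid, daily peak at ~91%", 11),
    ("pure cloud, perfectly autoscaled", 0),
]
for label, servers in setups:
    cost = monthly_cost(month, servers, SERVER_MONTHLY, CLOUD_HOURLY)
    print(f"{label}: {cost:.2f}")
```

With these made-up inputs the hybrid setup only touches the cloud on the spike day, yet it gets away with roughly half the fixed servers, which is where the saving comes from.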
I'd guess that a large percentage of SMBs are running on-prem, so you don't hear about switching back because they avoided the cloud in the first place.
No need for on prem (rolling your own cloud just means it might be you that fucks up, or your ISP anyway). If you use standard legos like VMs, simple functions, containers, k8s, postgresql etc. then you can have a replica of your site on another provider. Scale it down and keep the data in cold storage. Flick the on switch for an hour a month to check it all works.
A number of orgs I've worked with in the last few years have gone from on prem to full cloud to hybrid, with plans to bring more stuff on-prem in the future.
People won't switch to on-prem because if Google and some other providers screw up, there are still many other cloud providers. For K8s they could have gone with anyone from DigitalOcean to Linode. Actually, they could have already replicated their cluster to more than one provider to use as failover in such a case (assuming that they were using K8s).
It doesn't take most to have this problem. It takes a reasonable proportion to fear they could have this problem.
That said, I think cost is more likely to drive change here than fear over billing - most people pick lower cost over risk mitigation far more often than they'd like to admit.