> it takes between 4 and 5 elapsed days to hand a site back to a customer. Atlas...

dxf · on April 13, 2022

> Are SLAs even real?

SLI: Some metric you use to measure a thing (e.g. uptime, latency, etc.)

SLO: Some objective you try to hit, as measured by the SLI (e.g. "99.99% of requests are processed within 3 seconds")

SLA: A promise to a customer that they will meet some SLO, and consequences if they don't. If there aren't consequences for not meeting the SLO, then measuring and tracking the metrics is a pointless exercise.

The SLA is "real" to the extent Atlassian is adhering to any listed consequences.
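A minimal sketch of how the three terms stack, assuming made-up request latencies and a hypothetical credit figure. The names `latency_sli`, `meets_slo`, and `sla_credit` are illustrative only and don't come from any real SLA:

```python
from typing import Sequence


def latency_sli(latencies_s: Sequence[float], threshold_s: float) -> float:
    """SLI: the measured fraction of requests served within threshold_s seconds."""
    if not latencies_s:
        raise ValueError("need at least one request to measure")
    fast = sum(1 for t in latencies_s if t <= threshold_s)
    return fast / len(latencies_s)


def meets_slo(sli: float, target: float) -> bool:
    """SLO: the objective the measured SLI is compared against."""
    return sli >= target


def sla_credit(sli: float, target: float, credit_fraction: float) -> float:
    """SLA: the consequence owed to the customer when the SLO is missed.

    credit_fraction is a hypothetical contractual figure, not a real one.
    """
    return 0.0 if meets_slo(sli, target) else credit_fraction


# Made-up measurements: 9,998 fast requests and 2 slow ones.
latencies = [0.2] * 9998 + [5.0] * 2
sli = latency_sli(latencies, threshold_s=3.0)

print(f"SLI: {sli:.4%} of requests within 3 seconds")
print(f"SLO of 99.99% met: {meets_slo(sli, 0.9999)}")
print(f"Credit owed under the SLA: {sla_credit(sli, 0.9999, 0.5):.0%}")
```

Without the last function, the first two are just dashboards.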

bombcar · on April 13, 2022

Most SLAs say "if we miss this, you get time for free" which means that these companies will hopefully get a refund ... for the time they can't use the service.

SLAs are mostly aspirational.

towelrod · on April 13, 2022

The linked article talks about this directly: at this level of downtime, customers are promised a 50% discount. That's what the SLA means, effectively.

hinkley · on April 13, 2022

Car warranties are also aspirational/virtue signaling, to a point.

If the maintenance costs exceed the margins on the cars you lose money. Do that on too many product lines too often and you’re looking at bankruptcy. But some makers clearly are more risk averse than others, so a 6 year warranty from maker X does not translate to a 7 year warranty from maker Y.

mh- · on April 13, 2022

But Atlassian's (published*) SLA offers a credit of at most 50% of the month. That's not really the same as a manufacturer warranty on a car, where the cost of servicing could easily exceed the price paid for the car.

* - their larger customers will have negotiated SLAs.

edit: to be clear, I expect Atlassian will offer concessions beyond their SLA obligations. I'm only responding to the comparison.

profmonocle · on April 13, 2022

> and consequences if they don't.

And these consequences usually just amount to getting some percentage of your service fees back. I'm sure the affected customers will get their entire monthly Atlassian Cloud fees back. Since this is so severe maybe Atlassian will even give them credits for some # of free months.

But there's no way the amount they'll get from Atlassian is going to come close to what they're losing in productivity by not having access to Jira & Confluence. At my company, getting an entire free year of Jira wouldn't be worth Jira being inaccessible for a week.

bee_rider · on April 13, 2022

Does that indicate it would be preferable to pay more for a more reliable solution, if such a thing were to exist? Although, it definitely would be hard to quantify 'more reliable' there.

aunty_helen · on April 13, 2022

You could use it as a material breach of the contract and possibly get out of any arrangement you have with Atlassian.

inopinatus · on April 13, 2022

A typical SLA precludes that by specifying the remedy for noncompliance with the performance measure. Only if they fail to apply the remedy is there a material breach. For a month-to-month SLA, this limits liability to one month's subscription, as agreed in black-and-white.

Customers that demand service level agreements often fail to recognise that they cut both ways.

bluedino · on April 13, 2022

> Are SLAs even real?

Tommy: Here's the way I see it, Ted. Guy puts a fancy guarantee on a box 'cause he wants you to feel all warm and toasty inside.

Ted Nelson: Yeah, makes a man feel good.

Ted Nelson: But why do they put a guarantee on the box?

Tommy: Because they know all they sold ya was a guaranteed piece of shit. That's all it is, isn't it? Hey, if you want me to take a dump in a box and mark it guaranteed, I will.

rglover · on April 13, 2022

Haha I needed this, thank you.

0xbadcafebee · on April 13, 2022

The typical SLA has no teeth because even if the customer gets their money back, the real harm to the customer may be orders of magnitude greater than what they paid for the service. Some services are contractual or tightly embedded and you know you're not gonna lose the customer if your service goes down frequently. If the service provider doesn't lose money or face, they aren't motivated to prevent the downtime.

One alternative I thought of is the Charity SLA. The service provider pledges to give $5,000 to charity for every minute of downtime. Now everyone within the company knows "if we're down, we're losing thousands of dollars a minute!" and thus will be motivated to ensure the services stay up. But even if the services go down, the company's making tax-free donations, which isn't really bad for anybody. The company could even have a specific downtime goal every year, to make sure their monitoring/alerting/runbooks actually work, and to ensure they donate every year.

nh2 · on April 13, 2022

> it's just a threshold after which I think you're entitled to some money back for that month

That is exactly what SLAs are.

There are just a lot of people applying the wishful thinking that SLAs are a goal or metric of uptime.

Consider the AWS S3 page on the topic: https://aws.amazon.com/s3/sla/

"Reasonable efforts"; if not met, you get some fraction of the money back.

S3 has worse uptime than my desktop PC over the last years, but affected users got some fraction of their spending back.

iso1631 · on April 13, 2022

> S3 has worse uptime than my desktop PC over the last years

That's sacrilege on HN

miketria · on April 13, 2022

Hi, this is Mike from Atlassian Engineering. For the customers impacted by this incident covered by an SLA, we will adhere to our contractual terms. However, given the long duration of this outage, we are planning to go above and beyond for our impacted customers. We are currently focused on restoring service, but after that will be discussing how we can make it right for each impacted customer.

encryptluks2 · on April 13, 2022

It looks like you are focused on Hacker News comments.

bborud · on April 13, 2022

Think of SLAs as "this is how hard we'll scramble when shit hits the fan".

Except...I don't even believe that.

chrsig · on April 13, 2022

It's more "this is our contractual obligation, if we're down more than this, then we might not charge you"

dylan604 · on April 13, 2022

Lawyers are involved, so I'd assume some text about "excluding acts of god, sabotage, etc." to weasel their way out of things. They might even be able to get away with "acts of incompetence", however a lawyer might phrase that to allow their client to weasel.

TheCoelacanth · on April 13, 2022

SLA credits are a thing that actually happen in the industry. I wouldn't automatically assume that they will be able to weasel out of it.

They are typically limited to the amount that you actually paid, though, so basically they don't charge you for the time when you couldn't use the product. You usually won't get more than that.

mywittyname · on April 13, 2022

That's a good way to get executive approval to replace a system. Google or Apple can get away with this kind of behavior, I doubt Atlassian can.

This outage alone has spurred conversations in Slack about how terrible JIRA is and why we should replace it. If this kind of shit was pulled, I can guarantee we'd be on Shortcut, Linear, or something else in short order.

MajorBee · on April 13, 2022

> Google or Apple can get away with this kind of behavior, I doubt Atlassian can

Atlassian absolutely can in enterprise settings. In my company (a large cloud company), if JIRA goes down, large swathes of the business will also stall, including code deployment (deployments are tracked through change management JIRA tickets). We also use the DC version of Atlassian products, so presumably we aren't at the mercy of Atlassian cloud engineers.

echelon · on April 13, 2022

In some industries, three nines isn't exactly stellar. Every service I've worked on recently has demanded five nines of uptime and tons of reporting on latency and even seconds-long outages.

I've been on-call during a total infrastructure outage whose root cause was a service my team owned [1]. Our CEO was aware of it. Customers and business partners were aware of it. Other CEOs were aware of it. The media, you name it.

Some outages can be "business ending" or "business damaging". That's why we made a practice and process of performing regular disaster recovery exercises, had exceptionally well documented runbooks, had monitoring attached to everything, and engineered for resilience.

Though I'm not familiar with how Atlassian runs, I think this is an "engineering culture" thing or can be mitigated with a proper approach.

[1] The company has only had a few of these in total, and no member of our team was culpable for the complicated failure.

mmcgaha · on April 13, 2022

I think of SLAs as a question of how we design this thing. Ask for a system without an SLA and I will give you a system that is well designed and almost never goes down. As soon as you ask for an SLA, I will give you an over-engineered system that costs more, takes longer to implement, and is slower to iterate on, but it will almost never go down either.

krinchan · on April 13, 2022

Per the article, if you experience < 95% uptime in any 30 day window you qualify for a 50% discount. On a month, or your next year, or ...? It doesn't say.
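A rough sketch of that threshold check, using only the figures quoted in this thread (30 day window, 95% uptime, 50% discount, and the 4 to 5 day restore time). Any other credit tiers the real SLA may have are not modeled here, and `uptime_fraction` and `credit_for` are made-up names:

```python
WINDOW_DAYS = 30          # window length as quoted from the article
CREDIT_THRESHOLD = 0.95   # uptime below this qualifies, as quoted
CREDIT_FRACTION = 0.50    # discount as quoted; what it applies to is unstated


def uptime_fraction(downtime_days: float, window_days: float = WINDOW_DAYS) -> float:
    """Fraction of the window during which the service was up."""
    if not 0 <= downtime_days <= window_days:
        raise ValueError("downtime must fit inside the window")
    return 1.0 - downtime_days / window_days


def credit_for(downtime_days: float) -> float:
    """Discount fraction under the single tier quoted above, else 0."""
    if uptime_fraction(downtime_days) < CREDIT_THRESHOLD:
        return CREDIT_FRACTION
    return 0.0


for days in (1, 4, 5):
    up = uptime_fraction(days)
    print(f"{days} day(s) down: {up:.1%} uptime, credit {credit_for(days):.0%}")
```

Four or five days down lands well under 95% for the window, so the outage clears this tier either way.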

leeoniya · on April 13, 2022

> Atlassian's SLA page says, Premium Cloud Products 99.9%

> That's 43 minutes of downtime per month.

We need a better default way to communicate SLOs than "number of 9s", one that is more human-readable. How the status quo has stayed this way can only be attributed to intentional dark patterns, imho.
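For what it's worth, the conversion from nines to a time budget is short enough to sketch. This assumes a 30 day month and treats availability as a plain percentage of wall-clock time; `downtime_budget` is a made-up helper name:

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, window: timedelta) -> timedelta:
    """Downtime permitted by an availability percentage over a given window."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return window * (1 - availability_pct / 100)


month = timedelta(days=30)  # assumption: a "month" is 30 days
for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_budget(pct, month).total_seconds() / 60
    print(f"{pct}% over 30 days allows {minutes:.2f} minutes of downtime")
```

At 99.9% over 30 days this comes to about 43.2 minutes, which matches the figure quoted above. Note that the window is a required argument: the same percentage gives a very different budget per day than per year.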

deathanatos · on April 13, 2022

… honestly, even the "number of 9s" concept is a struggle for some companies. I've seen a number of SLAs that fail to correctly state a unit: it's %/<unit of time>, and I see the "unit of time" get dropped every now and then, and the resulting thing is meaningless absurdity.

colechristensen · on April 13, 2022

SLAs aren't real unless there's a contractual consequence for not meeting them.

And a couple of percent discount on services for the extra downtime isn't really a meaningful consequence.

imglorp · on April 13, 2022

I was just thinking that there's a hysteresis function here: the service is worth much more to your team after you've wired your whole process into it than before you joined.

Offering you a free month or whatever doesn't acknowledge all the person-hours lost.

colechristensen · on April 13, 2022

There are certainly circumstances where you might have grounds to sue for damages if an SLA is breached. I'm not sure how often this happens, but the losses from something like Jira being down could be quite a lot more than anybody pays for it. It's quite likely that defenses against exactly this are written into the contracts you agree to when signing up for the service, though.

mc4ndr3 · on April 13, 2022

I've yet to work at an office that paid sufficient attention to regular backup & restore validation, to scalable design, to proper unit testing, or to basic security updates. Upper management is repeatedly incentivized to produce vaporware, not reliable service.

Suits think a crummy Flash quiz on PII is enough to stop leaks. The automotive industry couldn't stop airbags from acting as claymores. It's even harder to get good code approved in tech.

closeparen · on April 14, 2022

"Shit happens" is a universal when it comes to computing. SLAs describe what is a normal background level of shit happening vs. what demands immediate attention and action from the team.

hinkley · on April 13, 2022

Basically, not counting lost sales, their income for this year went down 2%, which is not as big a deal to them as it is to their customers.