	by nitinagg 1529 days ago
	Selectively restoring data only for certain rows is super hard. But the communication from Atlassian has been the worst I have ever seen in the industry.

10 comments

profmonocle 1529 days ago

I actually got an email from our Atlassian contact just the other day encouraging us to switch to their cloud service. Crazy that no one thought to pause those. (I assume it must have been scheduled.)

HeyLaughingBoy 1529 days ago

This article on HN is the only time I've even heard that Atlassian was having a problem. I suspect that 99% of the tech "community" has absolutely no idea this is happening.

We use Jira, but it's self-hosted for my team. Maybe other teams that have transitioned to the cloud version are aware that there's a problem, but I haven't heard about it.

mcintyre1994 1529 days ago

It’s only 400 teams affected, but from this article it sounds like they’re all really big ones.

LadyCailin 1529 days ago

Apparently the self hosted version goes out of support in 2024, so there will only be cloud hosting. Dumb dumb dumb.

chousuke 1529 days ago

If the database schema for Jira on the cloud is anything like the Datacenter version, I'm not surprised they're having a hard time restoring data. I once tried to figure out how to find duplicate / redundant project schemas by querying the database (the required APIs are cloud-only) and could not even find which tables stored half the data, never mind how they referred to each other.

duxup 1529 days ago

As this continues, I suspect this might be one of the few times where a lack of transparency / good communication really ... might not make things better or worse, because the situation is so bad that transparency would be just as horrible.

Granted that's how all lies start / what sometimes people assume and they're wrong but ... maybe this is that time?

Maybe it is in fact so bad that honesty would be a push or worse?

adamc 1529 days ago

If so, that itself would be a huge red flag for dealing with Atlassian.

duxup 1529 days ago

I think it is…either way.

williamscales 1528 days ago

> Maybe it is in fact so bad that honesty would be a push or worse?

In my opinion, such a scenario does not exist. Transparency always in all things.

miketria 1529 days ago

Hi, this is Mike from Atlassian Engineering. You are right, the communications from us have not lived up to our standard. We will focus on this specifically once we restore service and get the post-incident review out there. More details here: https://www.atlassian.com/engineering/april-2022-outage-upda...

lallysingh 1529 days ago

Spamming HN isn't helping your cause man.

Mysterise 1529 days ago

There is irony in complaining about over-communication when it's in response to criticisms of under-communication.

lallysingh 1528 days ago

Key word "spamming." It wasn't communication but another dry and information-free blob of text. Communication requires something to say.

dhzhzjsbevs 1528 days ago

It's worse than that, they're saying communication was not up to their standards without actually communicating anything we didn't already know.

At least explain why there was such a total communication blackout company-wide. Even support staff weren't allowed to discuss it. Why?

2muchcoffeeman 1529 days ago

Well why are they writing a blog and posting the link on HN? We’re not directly your customers. Did you apologise individually to the customers you ignored? You don’t have to apologise to anyone here.

jacquesm 1529 days ago

It is, but between 'hard' and 'impossible' there is the nagging question of whether you actually really still have that data.

seanwilson 1529 days ago

> Selectively restoring data only for certain rows is super hard.

What's the right way to structure your data that would make restoring more straightforward here? Is this backup/restore scenario niche, or should they have designed for it?

inopinatus 1529 days ago

in theory, shard your customer databases 1:1, job done. alas, in practice, many SaaS compromise this two ways:

a) overwhelmed by creeping featuritis, each customer's data has relationships to global tables, and

b) they back up their entire database cluster in one snapshot

and there may be other gotchas for restoration, like relying on denormalized views and caches that have to be rebuilt. they may also have erroneously assumed that data protection's main value driver is whole-of-system disaster recovery, which can lead to pathologies such as "we don't have a single-customer restoration tool".

this is not a niche scenario
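
A minimal sketch of the 1:1 sharding idea, assuming one SQLite file per customer so that a single-customer restore is just a file copy. The tenant names, the `issues` table, and all helper functions are hypothetical and purely illustrative; nothing here reflects how Atlassian actually stores data.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def tenant_db(root: Path, tenant: str) -> Path:
    """Each tenant gets its own database file (hypothetical layout)."""
    return root / f"{tenant}.sqlite"


def write_issue(root: Path, tenant: str, title: str) -> None:
    con = sqlite3.connect(str(tenant_db(root, tenant)))
    try:
        with con:  # commit on success, roll back on error
            con.execute("CREATE TABLE IF NOT EXISTS issues (title TEXT)")
            con.execute("INSERT INTO issues VALUES (?)", (title,))
    finally:
        con.close()


def list_issues(root: Path, tenant: str) -> list:
    con = sqlite3.connect(str(tenant_db(root, tenant)))
    try:
        rows = con.execute("SELECT title FROM issues ORDER BY rowid")
        return [row[0] for row in rows]
    finally:
        con.close()


def backup_tenant(root: Path, backups: Path, tenant: str) -> None:
    # Plain file copy; only safe here because no connection is open.
    shutil.copy2(tenant_db(root, tenant), backups / f"{tenant}.sqlite")


def restore_tenant(root: Path, backups: Path, tenant: str) -> None:
    # Restoring one tenant never touches any other tenant's file.
    shutil.copy2(backups / f"{tenant}.sqlite", tenant_db(root, tenant))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        live, backups = Path(tmp) / "live", Path(tmp) / "backups"
        live.mkdir()
        backups.mkdir()
        write_issue(live, "acme", "ACME-1")
        write_issue(live, "globex", "GLOBEX-1")
        backup_tenant(live, backups, "acme")
        tenant_db(live, "acme").unlink()         # simulate accidental deletion
        write_issue(live, "globex", "GLOBEX-2")  # other tenants keep working
        restore_tenant(live, backups, "acme")
        return list_issues(live, "acme"), list_issues(live, "globex")
```

In this toy layout the restore cannot clobber data that other tenants wrote after the backup was taken. That property is lost once tenant rows reference global tables or the only backup is a cluster-wide snapshot.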

bpicolo 1529 days ago

Heck, it's worse now - if your data deletion tooling did a good job, there are dozens or hundreds of microservice databases to restore.

seanwilson 1529 days ago

> shard your customer databases 1:1

What are the downsides to this?

inopinatus 1529 days ago

* makes it much harder to distribute your tables by any other factor, for whatever reason (usually performance, sometimes archival)

* disaggregates data that the SaaS might be interested in querying/updating as an aggregate

* not all ORM frameworks handle this case well, if at all

* dumps are more than a single trivial command

basically all your data operations gain an additional dimension of complexity, and you may not perceive the benefits until much later

deckard1 1529 days ago

> not all ORM frameworks handle this case well, if at all

typically this is probably for internal reporting/metrics. But yeah, a custom script with direct SQL is in order. Personally, my opinion is to avoid ORMs at all costs. Never seen a benefit that wasn't trivially done in SQL, and the downsides are incredibly painful.

The big downside of sharding out, per customer, is that's a lot of databases to migrate on upgrades. Or rollback if shit hits the fan.

The upside? You can have customers on different versions of your app if you really wanted to do such a thing.

In any case, proper tooling goes a long way to making it the difference between wonderfully manageable and torturous nightmare. Think idempotent backup scripts that are capable of failing at any time and resuming where they died, etc.
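
A rough sketch of what "capable of failing at any time and resuming" could look like for per-customer backups, assuming a small JSON checkpoint file that records which tenants are already done. The function names, the checkpoint format, and the `backup_one` callback are all hypothetical; a real tool would also need locking and verification of each backup.

```python
import json
import os
import tempfile
from pathlib import Path


def load_done(checkpoint: Path) -> set:
    """Return the set of tenants already backed up (empty on first run)."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()


def save_done(checkpoint: Path, done: set) -> None:
    # Write to a temp file, then rename, so a crash never leaves a
    # half-written checkpoint behind.
    tmp = checkpoint.with_suffix(".tmp")
    tmp.write_text(json.dumps(sorted(done)))
    os.replace(tmp, checkpoint)


def backup_all(tenants, backup_one, checkpoint: Path) -> list:
    """Back up every tenant not yet recorded in the checkpoint.

    backup_one must itself be safe to repeat: if the process dies after
    backup_one returns but before the checkpoint is saved, that tenant
    is backed up again on the next run.
    """
    done = load_done(checkpoint)
    processed = []
    for tenant in tenants:
        if tenant in done:
            continue  # finished in an earlier run
        backup_one(tenant)
        done.add(tenant)
        save_done(checkpoint, done)
        processed.append(tenant)
    return processed


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = Path(tmp) / "backup_state.json"
        copied = []
        fail_once = {"b"}

        def backup_one(tenant):
            if tenant in fail_once:
                fail_once.discard(tenant)
                raise RuntimeError(f"simulated crash on {tenant}")
            copied.append(tenant)

        tenants = ["a", "b", "c"]
        try:
            backup_all(tenants, backup_one, checkpoint)
        except RuntimeError:
            pass  # first run dies partway through
        after_crash = sorted(load_done(checkpoint))
        resumed = backup_all(tenants, backup_one, checkpoint)
        return after_crash, resumed, copied
```

The second call to `backup_all` skips tenant `a` and picks up at `b`, which is the resuming behaviour described above.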

darkwater 1529 days ago

All of your points (minus maybe the first one) should be "easily" solved/implemented in a company the size of Atlassian, and maybe there are newer customers sharded like this already. IMO what happened in this case is basically tech debt that is now being paid with a loooot of interest.

seanwilson 1529 days ago

Would it be fair to estimate that the majority of SaaS companies aren't sharding like this then? Seems like a lot of downsides that impact everything often except for backups, which you'd restore rarely.

mypalmike 1529 days ago

Per-customer is a common sharding strategy for noSQL databases, so it may not be entirely uncommon.

Migrations suck too.

Work out a relationship graph and automate the export/import
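
One way to sketch the relationship-graph idea is a topological sort over foreign-key dependencies, here using Python's standard `graphlib` module (Python 3.9+). The table names and the `FOREIGN_KEYS` mapping are made up for illustration; in practice the graph would be read from the database's own schema metadata.

```python
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the set of tables it references.
FOREIGN_KEYS = {
    "projects": set(),
    "users": set(),
    "issues": {"projects", "users"},
    "comments": {"issues", "users"},
}


def import_order(foreign_keys: dict) -> list:
    """Order tables so every referenced table is imported before its dependents.

    Raises graphlib.CycleError if the foreign keys form a cycle, in which
    case a simple ordered import is not possible without deferring checks.
    """
    return list(TopologicalSorter(foreign_keys).static_order())


def delete_order(foreign_keys: dict) -> list:
    """Deletes run in the opposite direction: dependents first."""
    return list(reversed(import_order(foreign_keys)))
```

An export/import tool could walk `import_order(FOREIGN_KEYS)` to copy one tenant's rows table by table without ever inserting a row whose parent is missing.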

anarazel 1529 days ago

ISTM the fairly obvious approach would be to bring up a complete copy of the affected database(s) and move the affected tenants to that "copy", while eventually deleting non-affected tenants. Can't imagine they don't have the ability to move tenants to different shards; they must need that to deal with quickly growing customers etc.

ollien 1529 days ago

As someone who has never had to perform this kind of recovery: why is it so hard?

jacquesm 1529 days ago

Because it is very difficult to maintain relational integrity during a restore like that.

ollien 1529 days ago

Gotcha. I guess you could be heavy-handed and disable foreign key checks, but who knows what other bugs that would bring into the mix.

teling2 1529 days ago

The other difficulty is if you don't restore the entire state in a single transaction. Imagine you have partial data restored in Table A but haven't updated Table B correspondingly. Now some other program that consumes Table A and Table B and doesn't have error handling will crash (or worse, mutate state in other weird ways).
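
To illustrate the point, here is a small sketch using Python's built-in `sqlite3` module, assuming a toy two-table schema (`projects` and `issues`, both hypothetical). The restore runs inside one transaction with foreign-key checks left on, so a bad batch is rolled back as a whole instead of leaving one table updated and the other not.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    con.execute(
        "CREATE TABLE projects (id INTEGER PRIMARY KEY, tenant TEXT NOT NULL)"
    )
    con.execute(
        "CREATE TABLE issues ("
        " id INTEGER PRIMARY KEY,"
        " project_id INTEGER NOT NULL REFERENCES projects(id),"
        " title TEXT NOT NULL)"
    )
    return con


def restore_tenant(con: sqlite3.Connection, projects, issues) -> None:
    """Restore both tables atomically: all rows land, or none do."""
    with con:  # commits on success, rolls back if any statement raises
        con.executemany("INSERT INTO projects (id, tenant) VALUES (?, ?)", projects)
        con.executemany(
            "INSERT INTO issues (id, project_id, title) VALUES (?, ?, ?)", issues
        )


def counts(con: sqlite3.Connection):
    n_projects = con.execute("SELECT COUNT(*) FROM projects").fetchone()[0]
    n_issues = con.execute("SELECT COUNT(*) FROM issues").fetchone()[0]
    return n_projects, n_issues


def demo():
    con = make_db()
    # A consistent batch restores cleanly.
    restore_tenant(con, [(1, "acme")], [(10, 1, "first"), (11, 1, "second")])
    # A batch whose issue points at a missing project is rejected, and the
    # project row inserted earlier in the same transaction is rolled back too.
    try:
        restore_tenant(con, [(2, "globex")], [(20, 99, "orphan")])
    except sqlite3.IntegrityError:
        pass
    return counts(con)
```

Other readers of the two tables only ever see the state before or after a whole batch, never the in-between state described above.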

jacquesm 1529 days ago

That is relational integrity.

raincom 1529 days ago

So, it must be a bad idea to shove the data of multiple customers in a single table controlled by some column name ('tenant').

tmpz22 1529 days ago

It’s super hard no doubt but I wonder how much of the data was hot vs cold.
