

by duijf 2492 days ago

Re availability: We had a hard time keeping the Spark-based system available. There were days when the cluster would freak out multiple times. The 'fix' would be to restart a bunch of Spark workers. We spent a lot of time debugging this (some parts are documented in [1]) but couldn't work out what the problem was. (EDIT: Assuming there even was a single problem.)

In this particular case, I'd take the single point of failure over the previous situation.

That being said, we have successfully used PostgreSQL's failover multiple times. In my experience, it works quite well.

[1]: https://tech.channable.com/posts/2018-04-10-debugging-a-long...

1 comment

Darkstryder 2492 days ago

Yeah, I agree. It was more of a general comment, because you seem to have one Postgres instance for every client, which is already a big step away from a SPOF.

At $previous_job we had a "one service" = "one MySQL instance" policy. Every time a MySQL server went down, all clients would lose access to that service at the same time. It was stressful and much less robust than your setup.
