by wmsiler 2668 days ago
From the post, FaunaDB initially had several issues, which they've generally resolved. Jepsen is open source, so I'm curious why a database company wouldn't run Jepsen internally, work out as many problems as they can, and then engage aphyr in order to get the official thumbs up. Given how important data integrity is, I would assume that any database company would be running Jepsen (or something equivalent) regularly in-house. If they are doing that, then how is it that aphyr finds so many previously unknown issues? And if they aren't running Jepsen in-house, why not?

This is a very good question, and to a substantial degree, this is what we did. We have internal QA systems that overlap with Jepsen and catch most issues. We also ran our own Jepsen tests on the core product properties last year, fixed some issues and identified others, and reported the results on our blog.

However, correctness testing is fundamentally adversarial, like security penetration testing. Building a database is not easy, and testing a database is not easy either. It is a separate skill set, as the anomalies that lingered for decades in other databases show. The engagement with the Jepsen team is explicitly designed to explore the entire product surface area for faults, not to apply Jepsen as it currently stands. Thus, a lot of custom work ensued on both sides to make sure that the database was both properly testable and properly tested. The result of that work is what you see in the report.

The typical Jepsen report implicates not just implementation bugs, but the architecture of the system itself. Jepsen usually identifies anomalies that could not be prevented even with a perfect implementation. That did not happen here.

Some vendors restrict their engagement with the Jepsen team to only what they have tested themselves already, although those tests are not always valid. This was not our mindset: we wanted to improve our database by taking advantage of Kyle’s expertise, not present a superficially perfect report that failed to actually exercise the potential faults of the system.

To follow up on this a little bit--many of my clients do their own Jepsen testing, or have analogous tests using their own testing toolkit. When they engage me, the early part of my work is reviewing their existing tests, looking for problems, and then expanding and elaborating on those tests to find new issues in the DB.

Companies are finding bugs using Jepsen internally, which is great! But when they hire me, I'm usually able to find new behaviors. Some of that is exploring the concurrency and scheduling state space, some of it is reviewing code and looking for places where tests would fail to identify anomalies, some of it is designing new workloads or failure modes, and some is reading the histories and graphs, and using intuition to guide my search. I've been at this for five years now (gosh) and have built up a good deal of expertise. Coming at a system with an outsider's perspective and a different "box of tools" helps me explore it in a different way.

I do work with my clients to determine what they'd like to focus on, and how much time I can give, but by and large, my clients let me guide the process, and I think the Jepsen analyses I've published are reasonably independent. If there's something I think would be useful to test, and we don't have time or the client isn't interested in exploring it, I note it in the future work section of the writeup.

It's not like clients are saying "please stick ONLY to these tests, we want a positive result." One of the things I love about my job is how much the vendors I work with care about fixing bugs and doing right by their users, and I love that I get to help them with that process. :)

Thanks for the response. That all makes sense. I assume the FaunaDB devs would have already tested and fixed all the scenarios they could come up with, so it's reasonable you'd want an outside party to come up with even more scenarios to examine.