by LAC-Tech 1027 days ago
I've been working hard to upskill on the consistency and distributed systems side of things. General recommendations:

- Designing Data-Intensive Applications. Great overview of... basically everything, and every chapter has dozens of references. Can't recommend it enough.

- Read papers. I've had lots of a-ha moments going to Wikipedia and looking up the oldest paper on a topic (wtf was in the water in Massachusetts in the 70s..). Yes they're challenging, no they're not impossible if you have a compsci undergrad equivalent level of knowledge.

- Try and build toy systems. Built out some small and trivial implementations of CRDTs here https://lewiscampbell.tech/sync.html, mainly by reading the papers. They're subtle but they're not rocket science - mere mortals can do this if they apply themselves!

- Follow cool people in the field. Tigerbeetle stands out to me despite sitting at the opposite end of the consistency/availability corner where I've made my nest. They really are poring over applied dist sys papers and implementing them. I joke that Joran is a dangerous man to listen to because his talks can send you down rabbit-holes and you begin to think maybe he isn't insane for writing his own storage layer..

- Did I mention read papers? Seriously, the research of the smartest people on planet earth is on the internet, available for your consumption, for free. Take a moment to reflect on how incredible that is. Anyone anywhere on planet earth can git gud if they apply themselves.
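
To give a feel for the kind of toy CRDT mentioned in the list above, here is a minimal sketch of a grow-only counter (G-Counter). It is not taken from the linked page; the class name `GCounter` and its methods are assumptions made up for illustration. Each replica only ever bumps its own slot, and merging takes the per-replica maximum, so merges can arrive in any order, any number of times.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: one non-decreasing slot per replica (hypothetical toy class)."""

    replica_id: str
    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # A replica only ever increases its own slot.
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        # The counter's value is the sum over all replicas' slots.
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative and idempotent,
        # which is what lets replicas converge without coordination.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # merging the same state again changes nothing
print(a.value(), b.value())  # prints: 5 5
```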

6 comments

>- Did I mention read papers? Seriously, the research of the smartest people on planet earth is on the internet, available for your consumption, for free. Take a moment to reflect on how incredible that is. Anyone anywhere on planet earth can git gud if they apply themselves.

There is a flood of papers out there with unrepeatable processes. Where can you find quality papers to read?

This is a great point. It's true that there's a wealth of good information out there. But there's so much bad information that we now struggle with a signal vs noise problem. If you don't have enough context and knowledge yet to make the distinction, it's very easy to go down a wild goose chase. Having access to an expert in the field who can mentor and direct you is invaluable.
If a comp sci paper is unrepeatable, all of this has been a freaking waste of everyone's time.
Idempotency, testing, readability, and longevity are all too often ignored in CS and tech in the name of speed and simplicity.
Can you elaborate on idempotency in the context of CS papers?

I’m familiar with the concept wrt mathematics (in particular in the context of ulter-a-filters, as my favorite professor would say it), but I don’t see the necessity in most CS research.

Idempotency (as I understand it, what maths people might call an 'idempotent function') is a very core idea in distributed systems - networks are unreliable, stuff may get lost, or you might not get an acknowledgment back, so the ability to send the same thing 1 or a million times and end up with the same state is useful.

Or have I completely missed the point of your question..?
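
A rough sketch of the "send it once or a million times" idea from the comment above, assuming the sender attaches a unique message ID to each request. The `Account` class and the `message_id` parameter are made-up names for illustration, not any particular system's API.

```python
class Account:
    """Applies each deposit at most once, keyed by a caller-supplied message ID."""

    def __init__(self) -> None:
        self.balance = 0
        self._seen: set[str] = set()  # IDs of messages already applied

    def deposit(self, message_id: str, amount: int) -> int:
        # A retry of an already-applied message is a no-op, so the sender
        # can safely resend whenever an acknowledgment goes missing.
        if message_id not in self._seen:
            self._seen.add(message_id)
            self.balance += amount
        return self.balance


acct = Account()
for _ in range(1000):  # simulate a sender that keeps retrying the same message
    acct.deposit("msg-1", 50)
acct.deposit("msg-2", 25)
print(acct.balance)  # prints: 75
```

A real system would also have to persist and eventually expire the set of seen IDs, which this sketch ignores.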

It is also super useful in ye olde batch process / ETL process. Designing an ingest-analyze-report process to checkpoint its work and recover gracefully even when started at an unexpected time or place means you can retry safely rather than have to manually clear out the detritus of a partial job run.
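
One way the checkpoint-and-retry pattern described above might look, as a minimal sketch. The function names (`load_checkpoint`, `save_checkpoint`, `run_job`), the JSON checkpoint format and the `fail_at` crash simulation are all assumptions for illustration.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    # No checkpoint file means the job has never made progress: start at 0.
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path, next_index):
    # Write to a temp file and rename, so a crash never leaves a
    # half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_job(records, checkpoint_path, process, fail_at=None):
    # Resume from wherever the last run got to.
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(records)):
        if i == fail_at:
            raise RuntimeError("simulated crash")
        # If we crash after process() but before the checkpoint is saved,
        # this record is processed again on retry, so process() itself
        # should be idempotent.
        process(records[i])
        save_checkpoint(checkpoint_path, i + 1)


with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "job.ckpt")
    out = []
    records = ["a", "b", "c", "d"]
    try:
        run_job(records, ckpt, out.append, fail_at=2)  # dies partway through
    except RuntimeError:
        pass
    run_job(records, ckpt, out.append)  # retry resumes at index 2
    print(out)  # prints: ['a', 'b', 'c', 'd']
```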
I think autotune was saying that software often does shortcut hacks violating some of those principles as an optimization. This can be a topic for research into the tradeoffs (e.g., CAP theorem), but may be more common in non-research-based implementations (e.g., NoSQL databases because ACID is slow/constraining/etc.)
It is very rare that algorithms from theoretical comp sci papers actually get implemented and executed.
Can you explain what you mean by an un-repeatable process? This isn't a physical science, you don't need your own reactor or chemical lab or anything to repeat what they've done.
"But, you are not Google" - https://medium.com/humans-of-devops/you-dont-need-sre-what-y.... Just an example of what I'm talking about. Practices that work at one org may not easily transfer over to another, be that taking on Kubernetes vs standard EC2 instances, on-prem vs cloud, etc. Maybe un-repeatable isn't the best word so much as non-portable?
Ah I think I see what you're getting at.

Most of the stuff I read isn't "your SaaS app should have atomic clocks" and more "here's some maths on why it works, here's some explanation of what we were going for, here's some pseudocode".

Designing Data-Intensive Applications is frequently recommended, but it is 6 years old now. I know that doesn’t sound like a lot, but the state of distributed compute is a lot different than it was 6 years ago. Do you feel like it holds up well?

A year or two ago I read Ralph Kimball’s seminal The Data Warehouse Toolkit. While I could see why it’s still often recommended, it was showing its age in many ways (though a fair bit older than DDIA). It felt like a mix of best practices and dated advice, but it was hard to tell for certain what was what.

A lot of what DDIA covers is pretty fundamental stuff. I expect it will age fairly well.

It's not really a book about 'best practices', despite the name. It's more like an encyclopaedia, covering every approach out there, putting them in context, linking to copious reference papers, and talking about their properties on a very conceptual and practical level. It's not really like 'hey use this database vendor!!!'.

Concur. In large parts, the book feels like it's making the forest of fundamental research papers, published across decades, accessible by putting them into context, ordering them and "dumbing them down" for us mere mortals.

Most of that research is decades old. I specifically remember Lamport timestamps. Not only has that held up, it's unlikely to go anywhere anytime soon. Most topics covered are as fundamental as the two generals problem; almost philosophical.
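
For anyone who hasn't met Lamport timestamps, the core rule fits in a few lines. This is a minimal sketch, not code from the book or the paper; the `LamportClock` class and its method names are assumptions for illustration.

```python
class LamportClock:
    """Logical clock: a counter that respects the happened-before order."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # Every local event advances the clock.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending is itself an event; the timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Jump past the sender's timestamp so the receive is ordered after the send.
        self.time = max(self.time, msg_time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
p.tick()                # p does some local work: p.time == 1
sent = p.send()         # p sends a message stamped 2
q.tick()                # q does unrelated local work: q.time == 1
received = q.receive(sent)  # q moves to max(1, 2) + 1 == 3
print(sent, received)   # prints: 2 3
```

The guarantee only runs one way: if one event happened before another, its timestamp is smaller, but a smaller timestamp alone does not prove causality.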

No database vendor can solve the issues around two concurrently existing write masters. Sync will be necessary, conflicts will occur. A concrete vendor could only hope to make that less painful (CRDTs for automatic conflict resolution, ...). That's kind of the level that book operates at.

DDIA partly goes into implementation details (for example, transaction implementations in popular databases), but the reasoning behind it is always rather fundamental. I think it's going to age well unless some breakthrough happens – but even then only that particular section would be affected.
> but is 6 years old now

Hmm. The techniques used today were invented in the 1970s-80s.

I was gifted Designing Data-Intensive Applications in high school and it changed the course of my CS education. Beautiful, beautiful book.
Thanks for the story, I gifted a copy to an intern at work this year and I hope they enjoy it.
Other than DDIA, do you have a heuristic for finding the older papers? I often look for the seminal works but it’s hard when you don’t know who the key people are in a new field. Do you just look for the highest number of citations on Google Scholar or something else?

For example in the world of engines, Heywood is basically god: (https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=John...) has 29k citations!

Yeah just go to Wikipedia and find out who coined the term :) I love the original version vector paper, it's from 1983 and kind of funny in parts:

"Network partitioning can completely destroy mutual consistency in the worst case, and this fact has led to a certain amount of restrictiveness, vagueness, and even nervousness in past discussions, of how it may be handled"

https://pages.cs.wisc.edu/~remzi/Classes/739/Fall2015/Papers...

But as a general starting point, all roads seem to lead to Lamport 78 (Time, Clocks). If you have a specific area of interest I or others might be able to point you in the right direction.
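
Since version vectors came up above, here is a rough sketch of the general idea: each replica keeps a per-replica update count, and comparing two vectors tells you whether one copy supersedes the other or whether they conflict. This is not the 1983 paper's exact algorithm; the `compare` and `merge` functions and their return values are assumptions for illustration.

```python
def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Classify two version vectors; missing entries count as 0."""
    keys = a.keys() | b.keys()
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # b has seen everything a has, and more
    if b_le_a:
        return "after"       # a has seen everything b has, and more
    return "concurrent"      # each side has updates the other lacks: a conflict


def merge(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    # After reconciling, the merged copy has seen every update from both sides.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


v1 = {"x": 2, "y": 1}   # replica x wrote twice, saw one write from y
v2 = {"x": 1, "y": 2}   # replica y wrote twice, saw one write from x
print(compare({"x": 1}, v1))  # prints: before
print(compare(v1, v2))        # prints: concurrent
print(compare(merge(v1, v2), v1))  # prints: after
```

Detecting the conflict is the easy part; deciding what the merged value should be is application-specific and not shown here.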

> - Follow cool people in the field.

Any advice on how to approach this?

A few cool accounts I follow:

https://twitter.com/DominikTornow

https://twitter.com/jorandirkgreef

https://twitter.com/JungleSilicon

You can also follow me. Not saying I'm cool but I do re-tweet cool people:

https://twitter.com/LewisCTech

Any paper recommendations?