Understanding different failure modes and consistency guarantees
In distributed systems, things go wrong. Machines crash. Networks fail. Messages get lost. To design resilient systems, and to do well in interviews, we need to understand two things: how nodes fail (failure models) and how replicated data behaves (consistency models).
This lesson is about failure models and consistency models: the foundation for reasoning about trade-offs in real-world architecture.
A crash failure is when a node (a server or a process) suddenly stops working. It doesn't lie or misbehave. It just goes silent.
Imagine a cashier who faints mid-shift. They don't cheat or confuse customers; they just stop responding.
A server loses power. A process crashes due to a bug. These are crash failures.
We can detect crash failures using timeouts and health checks. We recover using failover, retries, or replication.
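To make this concrete, here is a minimal failure-detector sketch in Python. The class name and the silence threshold are illustrative, not from any particular library; real detectors tune these values and account for network jitter.

```python
import time

# Illustrative threshold -- real systems tune this and tolerate network jitter.
FAILURE_THRESHOLD = 3.0  # seconds of silence before a node is presumed crashed

class CrashFailureDetector:
    """Marks a node as crashed when its heartbeats stop arriving."""

    def __init__(self):
        self.last_seen = {}  # node_id -> timestamp of most recent heartbeat

    def record_heartbeat(self, node_id):
        self.last_seen[node_id] = time.monotonic()

    def suspected_crashed(self):
        now = time.monotonic()
        return [node for node, ts in self.last_seen.items()
                if now - ts > FAILURE_THRESHOLD]

# Usage: nodes call record_heartbeat() periodically; a monitor polls
# suspected_crashed() and triggers failover or replica promotion.
```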
A Byzantine failure happens when a computer (or node) in a system behaves in an unpredictable or dishonest way.
It's not just "crashing". A Byzantine node might:
- send different messages to different nodes
- send corrupted or wrong data
- go silent sometimes and respond at others

In short: a Byzantine node is one you can't trust. It might lie, cheat, or act erratically.
Imagine 4 friends trying to agree on where to meet for dinner by text message.
One friend (say Alex) is acting strangely:
- Alex texts Beth "let's do pizza"
- Alex texts Chris "let's do sushi"
- Alex never replies to Dana at all

Now Beth, Chris, and Dana all have different information. If they don't have a way to compare messages and agree, they'll end up in different places: no consensus!
To handle this, systems use Byzantine Fault Tolerant (BFT) protocols: special rules that let honest nodes still agree even if some are lying.
Example: PBFT (Practical Byzantine Fault Tolerance)
To survive f Byzantine nodes, the system needs at least 3f + 1 total nodes; for f = 1 that's 4 nodes, exactly our four friends.
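Here is a quick sketch of the arithmetic in Python (function names are mine): `min_cluster_size` applies the 3f + 1 rule, and `quorum_size` shows the 2f + 1 agreement quorum PBFT uses in each phase. Any two quorums of that size must overlap in at least one honest node, which is what blocks conflicting decisions.

```python
def min_cluster_size(f):
    """Smallest cluster that tolerates f Byzantine nodes: 3f + 1."""
    return 3 * f + 1

def quorum_size(f):
    """Votes needed per PBFT phase: 2f + 1. Any two quorums of this size
    in a 3f + 1 cluster overlap in >= f + 1 nodes, so at least one honest
    node sits in both -- two conflicting decisions can't both gather a quorum."""
    return 2 * f + 1

for f in (1, 2, 3):
    print(f"f={f}: cluster of {min_cluster_size(f)}, quorum of {quorum_size(f)}")
# f=1: cluster of 4, quorum of 3
# f=2: cluster of 7, quorum of 5
# f=3: cluster of 10, quorum of 7
```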
Blockchain systems
Blockchains (Ethereum, Cosmos, Hyperledger, and others) are built on BFT-like protocols. Some participants may be hacked or try to cheat; BFT ensures that the honest majority still agrees on the same ledger, the same "truth."
For example, Tendermint, used in Cosmos, is a BFT protocol. Even if a few validators lie or go offline, the system can still safely agree on the next block.
| Failure Type | Behavior | Detectable? | Example |
|---|---|---|---|
| Crash | Node stops responding | Yes (timeouts, health checks) | Power loss, server crash |
| Byzantine | Node lies or sends wrong data | Not reliably | Malicious or buggy server |
When multiple replicas store or serve data, consistency models define how and when updates become visible across them.
Strong consistency: every read returns the latest write, no matter which replica you query.
Like a shared Google Doc: type a word, everyone sees it instantly.
Requires coordination: slower, less scalable.
Used when correctness > speed (e.g., money transfers)
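To see why strong consistency costs coordination, here is a toy Python sketch (not a real database client; names are invented): a write returns only after every replica applies it, so a read from any replica is current, but the write path is as slow as the slowest replica.

```python
class StronglyConsistentStore:
    """Toy synchronous replication: a write succeeds only after every
    replica applies it, so a read from any replica returns the latest value."""

    def __init__(self, n_replicas):
        self.replicas = [dict() for _ in range(n_replicas)]

    def write(self, key, value):
        # Coordination step: block until every replica acknowledges.
        # In a real system each assignment is a network round trip.
        for replica in self.replicas:
            replica[key] = value
        return "committed"

    def read(self, key, replica_index):
        # Safe to read anywhere: all replicas updated before write() returned.
        return self.replicas[replica_index].get(key)

store = StronglyConsistentStore(n_replicas=3)
store.write("balance", 100)
assert store.read("balance", replica_index=2) == 100  # never stale
```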
Eventual consistency: all replicas will eventually reflect the latest write, but not immediately.
You update your address at one bank branch; it takes a few hours for all branches to catch up.
Faster and more scalable, but users may see stale data temporarily.
Used when availability > strict correctness (e.g., likes, feeds)
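Here is the same idea as a toy sketch (class and method names are made up for illustration): writes acknowledge after touching one replica, and a background anti-entropy pass brings the others up to date later.

```python
from collections import deque

class EventuallyConsistentStore:
    """Toy async replication: a write lands on one replica and returns;
    the rest catch up later, so reads may be stale in between."""

    def __init__(self, n_replicas):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.pending = deque()  # replication log not yet applied everywhere

    def write(self, key, value):
        self.replicas[0][key] = value      # fast: one replica acknowledges
        self.pending.append((key, value))  # others sync in the background

    def anti_entropy(self):
        """Background sync; until it runs, the other replicas are stale."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas[1:]:
                replica[key] = value

    def read(self, key, replica_index):
        return self.replicas[replica_index].get(key)

store = EventuallyConsistentStore(n_replicas=3)
store.write("likes", 42)
print(store.read("likes", replica_index=1))  # None -- stale read
store.anti_entropy()
print(store.read("likes", replica_index=1))  # 42 -- replicas converged
```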
Read-your-own-writes consistency: after you write something, you'll always see your own update, even if others don't yet.
You post a photo on Instagram; you see it immediately, even if your friends' feeds are still catching up.
Improves user experience with minimal coordination.
Used for personalized views or user sessions
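One common way to get this guarantee, sketched below with invented names: pin a user's reads to the replica that took their last write until replication catches up. Alternatives include tracking a per-session write timestamp and only reading from replicas at least that fresh.

```python
class SessionRouter:
    """Toy read-your-own-writes: pin each user's reads to the replica
    that handled their write, so they always see their own updates."""

    def __init__(self, replicas):
        self.replicas = replicas   # list of dicts standing in for replicas
        self.user_pin = {}         # user_id -> index of replica with their writes

    def write(self, user_id, key, value, replica_index=0):
        self.replicas[replica_index][key] = value
        self.user_pin[user_id] = replica_index  # remember where the write landed

    def read(self, user_id, key):
        # The writer is routed to their pinned (fresh) replica; other users
        # default to a different replica and may still see stale data.
        idx = self.user_pin.get(user_id, len(self.replicas) - 1)
        return self.replicas[idx].get(key)

router = SessionRouter([{}, {}])
router.write("alice", "photo", "sunset.jpg")
print(router.read("alice", "photo"))  # 'sunset.jpg' -- Alice sees her write
print(router.read("bob", "photo"))    # None -- Bob's replica hasn't synced yet
```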
Causal consistency: if one operation depends on another, the system preserves their order.
If you comment on a post, everyone should see the post before your comment.
Balances performance and correctness. Stronger than eventual, weaker than strong.
Used for user-facing interactive systems
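A minimal sketch of causal delivery (names are illustrative): each message lists the messages it depends on, and a buffer holds it back until those dependencies are visible. Real systems typically track the same information with vector clocks.

```python
class CausalBuffer:
    """Toy causal delivery: hold a message back until everything it
    depends on (e.g., the post a comment replies to) is visible."""

    def __init__(self):
        self.applied = set()   # ids of messages already visible
        self.waiting = []      # (msg_id, depends_on, payload) not yet deliverable

    def deliver(self, msg_id, depends_on, payload):
        if all(dep in self.applied for dep in depends_on):
            self._apply(msg_id, payload)
            self._drain()  # buffered messages may be deliverable now
        else:
            self.waiting.append((msg_id, depends_on, payload))

    def _apply(self, msg_id, payload):
        print(f"visible: {payload}")
        self.applied.add(msg_id)

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for item in list(self.waiting):
                msg_id, deps, payload = item
                if all(d in self.applied for d in deps):
                    self.waiting.remove(item)
                    self._apply(msg_id, payload)
                    progress = True

# The comment arrives first but stays hidden until its post is applied:
buf = CausalBuffer()
buf.deliver("c1", ["p1"], "comment: Nice photo!")  # buffered, nothing printed
buf.deliver("p1", [], "post: vacation photo")      # prints post, then comment
```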
| Model | Guarantee | Example Systems | Trade-Off |
|---|---|---|---|
| Strong | Always latest value | PostgreSQL, Spanner | Slower, coordination heavy |
| Eventual | Eventually same value | DynamoDB, DNS | Fast, but temporary inconsistency |
| Read-Your-Own-Writes | User always sees own changes | Instagram, dashboards | Consistent only for the writer's own view |
| Causal | Preserves cause-effect order | Chat, collaborative apps | Good balance of speed & order |
Understanding these models gives you the vocabulary to reason about distributed systems and make informed architectural decisions in interviews and real-world scenarios.