Understanding different failure modes and consistency guarantees
In distributed systems, things go wrong. Machines crash. Networks fail. Messages get lost. To design resilient systems, and to do well in interviews, we need to understand two things: how nodes fail (failure models) and how replicated data behaves (consistency models).
This lesson is about failure models and consistency models: the foundation for reasoning about trade-offs in real-world architecture.
A crash failure is when a node (a server or a process) suddenly stops working. It doesn't lie or misbehave. It just goes silent.
Imagine a cashier who faints mid-shift. They don't cheat or confuse customers; they just stop responding.
A server loses power. A process crashes due to a bug. These are crash failures.
We can detect crash failures using timeouts and health checks. We recover using failover, retries, or replication.
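To make this concrete, here is a minimal failure-detector sketch in Python. The class name and the silence threshold are illustrative, not from any particular library; real detectors tune these values and account for network jitter.

```python
import time

# Illustrative threshold -- real systems tune this and tolerate network jitter.
FAILURE_THRESHOLD = 3.0  # seconds of silence before a node is presumed crashed

class CrashFailureDetector:
    """Marks a node as crashed when its heartbeats stop arriving."""

    def __init__(self):
        self.last_seen = {}  # node_id -> timestamp of most recent heartbeat

    def record_heartbeat(self, node_id):
        self.last_seen[node_id] = time.monotonic()

    def suspected_crashed(self):
        now = time.monotonic()
        return [node for node, ts in self.last_seen.items()
                if now - ts > FAILURE_THRESHOLD]

# Usage: nodes call record_heartbeat() periodically; a monitor polls
# suspected_crashed() and triggers failover or replica promotion.
```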
A Byzantine failure happens when a computer (or node) in a system behaves in an unpredictable or dishonest way.
It's not just "crashing". A Byzantine node might:
- send different messages to different nodes
- send corrupted or wrong data
- go silent sometimes and respond at others

In short: a Byzantine node is one you can't trust. It might lie, cheat, or act erratically.
Imagine 4 friends trying to agree on where to meet for dinner by text message.
One friend (say Alex) is acting strangely:
- Alex texts Beth "let's do pizza"
- Alex texts Chris "let's do sushi"
- Alex never replies to Dana at all

Now Beth, Chris, and Dana all have different information. If they don't have a way to compare messages and agree, they'll end up in different places: no consensus!
To handle this, systems use Byzantine Fault Tolerant (BFT) protocols: special rules that let honest nodes still agree even if some are lying.
Example: PBFT (Practical Byzantine Fault Tolerance)
To survive f Byzantine nodes, the system needs at least 3f + 1 total nodes; for f = 1 that's 4 nodes, exactly our four friends.
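Here is a quick sketch of the arithmetic in Python (function names are mine): `min_cluster_size` applies the 3f + 1 rule, and `quorum_size` shows the 2f + 1 agreement quorum PBFT uses in each phase. Any two quorums of that size must overlap in at least one honest node, which is what blocks conflicting decisions.

```python
def min_cluster_size(f):
    """Smallest cluster that tolerates f Byzantine nodes: 3f + 1."""
    return 3 * f + 1

def quorum_size(f):
    """Votes needed per PBFT phase: 2f + 1. Any two quorums of this size
    in a 3f + 1 cluster overlap in >= f + 1 nodes, so at least one honest
    node sits in both -- two conflicting decisions can't both gather a quorum."""
    return 2 * f + 1

for f in (1, 2, 3):
    print(f"f={f}: cluster of {min_cluster_size(f)}, quorum of {quorum_size(f)}")
# f=1: cluster of 4, quorum of 3
# f=2: cluster of 7, quorum of 5
# f=3: cluster of 10, quorum of 7
```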
Blockchain systems
Blockchains (Ethereum, Cosmos, Hyperledger, and others) are built on BFT-like protocols. Some participants may be hacked or try to cheat; BFT ensures that the honest majority still agrees on the same ledger, the same "truth."
For example, Tendermint, used in Cosmos, is a BFT protocol. Even if a few validators lie or go offline, the system can still safely agree on the next block.
| Failure Type | Behavior | Detectable? | Example |
|---|---|---|---|
| Crash | Node stops responding | Yes (timeouts, health checks) | Power loss, server crash |
| Byzantine | Node lies or sends wrong data | Not reliably | Malicious or buggy server |
When multiple replicas store or serve data, consistency models define how and when updates become visible across them.
Strong consistency: every read returns the latest write, no matter which replica you query.
Like a shared Google Doc: type a word, everyone sees it instantly.
Requires coordination: slower, less scalable.
Used when correctness > speed (e.g., money transfers)
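To see why strong consistency costs coordination, here is a toy Python sketch (not a real database client; names are invented): a write returns only after every replica applies it, so a read from any replica is current, but the write path is as slow as the slowest replica.

```python
class StronglyConsistentStore:
    """Toy synchronous replication: a write succeeds only after every
    replica applies it, so a read from any replica returns the latest value."""

    def __init__(self, n_replicas):
        self.replicas = [dict() for _ in range(n_replicas)]

    def write(self, key, value):
        # Coordination step: block until every replica acknowledges.
        # In a real system each assignment is a network round trip.
        for replica in self.replicas:
            replica[key] = value
        return "committed"

    def read(self, key, replica_index):
        # Safe to read anywhere: all replicas updated before write() returned.
        return self.replicas[replica_index].get(key)

store = StronglyConsistentStore(n_replicas=3)
store.write("balance", 100)
assert store.read("balance", replica_index=2) == 100  # never stale
```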
Eventual consistency: all replicas will eventually reflect the latest write, but not immediately.
You update your address at one bank branch; it takes a few hours for all branches to catch up.
Faster and more scalable, but users may see stale data temporarily.
Used when availability > strict correctness (e.g., likes, feeds)
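Here is the same idea as a toy sketch (class and method names are made up for illustration): writes acknowledge after touching one replica, and a background anti-entropy pass brings the others up to date later.

```python
from collections import deque

class EventuallyConsistentStore:
    """Toy async replication: a write lands on one replica and returns;
    the rest catch up later, so reads may be stale in between."""

    def __init__(self, n_replicas):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.pending = deque()  # replication log not yet applied everywhere

    def write(self, key, value):
        self.replicas[0][key] = value      # fast: one replica acknowledges
        self.pending.append((key, value))  # others sync in the background

    def anti_entropy(self):
        """Background sync; until it runs, the other replicas are stale."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas[1:]:
                replica[key] = value

    def read(self, key, replica_index):
        return self.replicas[replica_index].get(key)

store = EventuallyConsistentStore(n_replicas=3)
store.write("likes", 42)
print(store.read("likes", replica_index=1))  # None -- stale read
store.anti_entropy()
print(store.read("likes", replica_index=1))  # 42 -- replicas converged
```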
Read-your-own-writes consistency: after you write something, you'll always see your own update, even if others don't yet.
You post a photo on Instagram; you see it immediately, even if your friends' feeds are still catching up.
Improves user experience with minimal coordination.
Used for personalized views or user sessions
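One common way to get this guarantee, sketched below with invented names: pin a user's reads to the replica that took their last write until replication catches up. Alternatives include tracking a per-session write timestamp and only reading from replicas at least that fresh.

```python
class SessionRouter:
    """Toy read-your-own-writes: pin each user's reads to the replica
    that handled their write, so they always see their own updates."""

    def __init__(self, replicas):
        self.replicas = replicas   # list of dicts standing in for replicas
        self.user_pin = {}         # user_id -> index of replica with their writes

    def write(self, user_id, key, value, replica_index=0):
        self.replicas[replica_index][key] = value
        self.user_pin[user_id] = replica_index  # remember where the write landed

    def read(self, user_id, key):
        # The writer is routed to their pinned (fresh) replica; other users
        # default to a different replica and may still see stale data.
        idx = self.user_pin.get(user_id, len(self.replicas) - 1)
        return self.replicas[idx].get(key)

router = SessionRouter([{}, {}])
router.write("alice", "photo", "sunset.jpg")
print(router.read("alice", "photo"))  # 'sunset.jpg' -- Alice sees her write
print(router.read("bob", "photo"))    # None -- Bob's replica hasn't synced yet
```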
Causal consistency: if one operation depends on another, the system preserves their order.
If you comment on a post, everyone should see the post before your comment.
Balances performance and correctness. Stronger than eventual, weaker than strong.
Used for user-facing interactive systems
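A minimal sketch of causal delivery (names are illustrative): each message lists the messages it depends on, and a buffer holds it back until those dependencies are visible. Real systems typically track the same information with vector clocks.

```python
class CausalBuffer:
    """Toy causal delivery: hold a message back until everything it
    depends on (e.g., the post a comment replies to) is visible."""

    def __init__(self):
        self.applied = set()   # ids of messages already visible
        self.waiting = []      # (msg_id, depends_on, payload) not yet deliverable

    def deliver(self, msg_id, depends_on, payload):
        if all(dep in self.applied for dep in depends_on):
            self._apply(msg_id, payload)
            self._drain()  # buffered messages may be deliverable now
        else:
            self.waiting.append((msg_id, depends_on, payload))

    def _apply(self, msg_id, payload):
        print(f"visible: {payload}")
        self.applied.add(msg_id)

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for item in list(self.waiting):
                msg_id, deps, payload = item
                if all(d in self.applied for d in deps):
                    self.waiting.remove(item)
                    self._apply(msg_id, payload)
                    progress = True

# The comment arrives first but stays hidden until its post is applied:
buf = CausalBuffer()
buf.deliver("c1", ["p1"], "comment: Nice photo!")  # buffered, nothing printed
buf.deliver("p1", [], "post: vacation photo")      # prints post, then comment
```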
| Model | Guarantee | Example Systems | Trade-Off |
|---|---|---|---|
| Strong | Always latest value | PostgreSQL, Spanner | Slower, coordination heavy |
| Eventual | Eventually same value | DynamoDB, DNS | Fast, but temporary inconsistency |
| Read-Your-Own-Writes | User always sees own changes | Instagram, dashboards | Consistent only for the writer's own view |
| Causal | Preserves cause-effect order | Chat, collaborative apps | Good balance of speed & order |
Understanding these models gives you the vocabulary to reason about distributed systems and make informed architectural decisions in interviews and real-world scenarios.