Failure & Consistency Models

Understanding different failure modes and consistency guarantees

Why This Matters

In distributed systems, things go wrong. Machines crash. Networks fail. Messages get lost. To design resilient systems, and to do well in interviews, we need to understand two things:

How systems fail
How they stay consistent despite those failures

This lesson is about failure models and consistency models, the foundation for reasoning about trade-offs in real-world architecture.

Part 1: Failure Models

Crash Failures

A crash failure is when a node, such as a server or a process, suddenly stops working. It doesn't lie or misbehave. It just goes silent.

Analogy:

Imagine a cashier who faints mid-shift. They don't cheat or confuse customers. They just stop responding.

Real Example:

A server loses power. A process crashes due to a bug. These are crash failures.

Design Implication:

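We can detect crash failures using timeouts and health checks, with the caveat that a timeout only tells us a node is silent, so a slow node can look the same as a crashed one. We recover using failover, retries, or replication.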
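The timeout idea can be sketched as a small heartbeat monitor. This is a minimal illustration, not a production detector: the class name `HeartbeatMonitor`, the node names, and the 3-second timeout are all assumptions made for the example, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Suspects a node has crashed if no heartbeat arrived within `timeout` seconds."""

    timeout: float
    last_seen: dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        # Remember the most recent time we heard from this node.
        self.last_seen[node] = now

    def suspected(self, now: float) -> list[str]:
        # A node is only *suspected*: it may be slow rather than dead.
        return sorted(
            node for node, seen in self.last_seen.items() if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)  # hypothetical timeout value
monitor.record_heartbeat("node-a", now=0.0)
monitor.record_heartbeat("node-b", now=0.0)
monitor.record_heartbeat("node-a", now=2.5)  # node-b stays silent
print(monitor.suspected(now=4.0))  # ['node-b']
```

A shorter timeout detects crashes sooner but raises the chance of wrongly suspecting a slow node, which is the usual trade-off when tuning health checks.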

Byzantine Failures

A Byzantine failure happens when a computer (or node) in a system behaves in an unpredictable or dishonest way.

It's not just "crashing". It might:

Send wrong data to some nodes
Send different data to different nodes
Or even pretend to be another node

In short: a Byzantine node is one you can't trust. It might lie, cheat, or act erratically.

Simple Analogy

Imagine 4 friends trying to agree on where to meet for dinner by text message.

One friend (say Alex) is acting strangely:

• He tells Beth: "Let's meet at McDonald's."
• He tells Chris: "Let's meet at Subway."
• He tells Dana: "Let's not meet at all."

Now, Beth, Chris, and Dana all have different information. If they don't have a way to compare messages and agree, they'll end up in different places, with no consensus.
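The "compare messages" step can be sketched as follows. This is a toy check that assumes Beth, Chris, and Dana are honest and truthfully share what Alex told them; the function name `detect_equivocation` is made up for the example. Real protocols also need signatures or extra voting rounds so that a liar cannot forge what others said.

```python
def detect_equivocation(received: dict[str, str]) -> bool:
    """Return True if one sender told different recipients different things."""
    return len(set(received.values())) > 1


# What each honest friend reports having received from Alex.
alex_said = {
    "Beth": "McDonald's",
    "Chris": "Subway",
    "Dana": "no meeting",
}

print(detect_equivocation(alex_said))  # True: Alex sent conflicting messages
```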

The Solution: Byzantine Fault Tolerance (BFT)

To handle this, systems use Byzantine Fault Tolerant protocols: special rules that let honest nodes still agree even if some are lying.

Example: PBFT (Practical Byzantine Fault Tolerance)

To survive f bad nodes, the system needs at least 3f + 1 total nodes.

1 bad → need 4 total
2 bad → need 7 total
3 bad → need 10 total
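The 3f + 1 rule is simple arithmetic, and it can be read in both directions. The helper names `min_nodes` and `max_faulty` below are illustrative, not part of any PBFT library.

```python
def min_nodes(f: int) -> int:
    """Smallest cluster size that tolerates f Byzantine nodes (n = 3f + 1)."""
    return 3 * f + 1


def max_faulty(n: int) -> int:
    """Largest number of Byzantine nodes a cluster of n nodes can tolerate."""
    return (n - 1) // 3


for f in (1, 2, 3):
    print(f"{f} bad -> need {min_nodes(f)} total")
# 1 bad -> need 4 total
# 2 bad -> need 7 total
# 3 bad -> need 10 total

print(max_faulty(6))  # 1: growing from 4 to 6 nodes does not buy extra tolerance
```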

Real Example (Modern Context)

Blockchain systems

Many blockchains (like Ethereum, Cosmos, Hyperledger, etc.) rely on BFT-style protocols. Some participants may be hacked or try to cheat. As long as fewer than one-third of participants are faulty, BFT ensures that the honest nodes still agree on the same ledger, the same "truth."

For example: Tendermint, used in Cosmos, is a BFT protocol. Even if some validators lie or go offline (fewer than one-third of the voting power), the system can still agree on the next block safely.

Crash vs. Byzantine Failures

| Failure Type | Behavior | Detectable? | Example |
| --- | --- | --- | --- |
| Crash | Node stops responding | Yes | Power loss, server crash |
| Byzantine | Node lies or sends wrong data | No | Malicious or buggy server |

Part 2: Consistency Models

When multiple replicas store or serve data, consistency rules define how updates appear across them.

Strong Consistency

Definition:

Every read returns the latest write, no matter which replica you query.

Analogy:

Like a shared Google Doc: type a word, and everyone sees it instantly.

Real-World Example:

SQL databases (PostgreSQL, Spanner)
Distributed locking systems

Trade-Off:

Requires coordination → slower, less scalable.

Used when correctness > speed (e.g., money transfers)
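One way to picture the coordination cost is a toy store that acknowledges a write only after every replica has applied it. This is a sketch under simplifying assumptions (in-memory dictionaries, no failures, a made-up `SyncReplicatedStore` class); real systems typically use consensus or quorum protocols instead of writing to all replicas in a loop.

```python
class SyncReplicatedStore:
    """Toy store: a write returns only after all replicas have applied it."""

    def __init__(self, replica_count: int) -> None:
        self.replicas: list[dict[str, int]] = [{} for _ in range(replica_count)]

    def write(self, key: str, value: int) -> None:
        # Coordination step: every replica is updated before we acknowledge.
        # The write is therefore as slow as the slowest replica.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key: str, replica_index: int):
        return self.replicas[replica_index].get(key)


store = SyncReplicatedStore(replica_count=3)
store.write("balance", 100)
print([store.read("balance", i) for i in range(3)])  # [100, 100, 100]
```

Because the write touches every replica before returning, any replica can serve the latest value, which is the guarantee described above.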

Eventual Consistency

Definition:

If no new writes arrive, all replicas will eventually converge on the latest write, but not immediately.

Analogy:

You update your address at one bank branch, and it takes a few hours for all branches to catch up.

Real-World Examples:

Amazon DynamoDB (eventually consistent reads are the default)
DNS systems
Social media timelines

Trade-Off:

Faster, more scalable → but users may see stale data temporarily.

Used when availability > strict correctness (e.g., likes, feeds)
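The stale-read window can be illustrated with a toy store that acknowledges a write after updating only one replica and delivers it to the others later. The class `AsyncReplicatedStore` and its `propagate` method are assumptions for this sketch; real systems replicate in the background and must also handle conflicts and failures.

```python
from collections import deque


class AsyncReplicatedStore:
    """Toy store: writes land on replica 0 first and reach the others later."""

    def __init__(self, replica_count: int) -> None:
        self.replicas: list[dict[str, int]] = [{} for _ in range(replica_count)]
        # Updates that have been acknowledged but not yet delivered everywhere.
        self.pending: deque[tuple[int, str, int]] = deque()

    def write(self, key: str, value: int) -> None:
        self.replicas[0][key] = value  # acknowledged immediately
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key: str, replica_index: int):
        return self.replicas[replica_index].get(key)

    def propagate(self) -> None:
        # Simulates background replication catching up, in write order.
        while self.pending:
            index, key, value = self.pending.popleft()
            self.replicas[index][key] = value


store = AsyncReplicatedStore(replica_count=3)
store.write("likes", 1)
print(store.read("likes", 0))  # 1
print(store.read("likes", 2))  # None: this replica is still stale
store.propagate()
print(store.read("likes", 2))  # 1: replicas have converged
```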

Read-Your-Own-Writes

Definition:

After you write something, you'll always see your own update, even if others don't yet.

Analogy:

You post a photo on Instagram and see it immediately, even if your friends' feeds are still catching up.

Real-World Examples:

Session-based web apps
Personal dashboards

Trade-Off:

Improves user experience with minimal coordination.

Used for personalized views or user sessions
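One common way to get this guarantee is for each session to remember the version of its own last write and to read from a lagging replica only if that replica has caught up. The sketch below assumes a single primary, a single follower, and a simple write counter as the version; the `Replica` and `Session` classes are hypothetical.

```python
class Replica:
    def __init__(self) -> None:
        self.data: dict[str, str] = {}
        self.version = 0  # number of writes applied so far


class Session:
    """Tracks one user's last write so that user never reads older data."""

    def __init__(self, primary: Replica, follower: Replica) -> None:
        self.primary = primary
        self.follower = follower
        self.last_write_version = 0

    def write(self, key: str, value: str) -> None:
        self.primary.version += 1
        self.primary.data[key] = value
        self.last_write_version = self.primary.version

    def read(self, key: str):
        # Read from the follower only if it already includes this session's writes.
        if self.follower.version >= self.last_write_version:
            return self.follower.data.get(key)
        return self.primary.data.get(key)


primary, follower = Replica(), Replica()
alice = Session(primary, follower)
bob = Session(primary, follower)

alice.write("photo", "beach.jpg")
print(alice.read("photo"))  # beach.jpg: served by the primary
print(bob.read("photo"))    # None: the follower has not caught up yet
```

Only the writer pays the cost of being routed to the primary, which is why this model needs little coordination.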

Causal Consistency

Definition:

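If one operation depends on another, every replica applies them in that order. Operations that are unrelated to each other may still appear in different orders on different replicas.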

Analogy:

If you comment on a post, everyone should see the post before your comment.

Real-World Examples:

Chat apps
Collaborative editors
Versioned data stores

Trade-Off:

Balances performance and correctness. Stronger than eventual, weaker than strong.

Used for user-facing interactive systems

Summary: Consistency Trade-Offs

| Model | Guarantee | Example Systems | Trade-Off |
| --- | --- | --- | --- |
| Strong | Always latest value | PostgreSQL, Spanner | Slower, coordination heavy |
| Eventual | Eventually same value | DynamoDB, DNS | Fast, but temporary inconsistency |
| Read-Your-Own-Writes | User always sees own changes | Instagram, dashboards | Personalized consistency |
| Causal | Preserves cause-effect order | Chat, collaborative apps | Good balance of speed & order |

Key Takeaways

Failure models help you design for the right threats
Consistency models help you balance correctness and performance
Trade-offs are inevitable; choose based on your requirements

Understanding these models gives you the vocabulary to reason about distributed systems and make informed architectural decisions in interviews and real-world scenarios.