🧭 Why Trade-Offs Matter in System Design

In system design, there's no perfect solution. Every decision is a trade-off.

Want high speed? You might sacrifice consistency.

Want high availability? You might accept stale data.

To make smart trade-offs — especially in interviews — you need to understand the qualities we're balancing. These qualities are called Non-Functional Requirements (NFRs).

They don't describe what the system does. They describe how well it behaves.

✅ 1. Availability

What it means:

Is the system up and reachable when users need it?

How we measure it:

Availability is often expressed in "nines":

99.9% = ~8.8 hours of downtime per year

99.999% = ~5 minutes of downtime per year
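
The arithmetic behind the "nines" is simple enough to check yourself. Here is a minimal sketch; the function name is made up for illustration and the year is assumed to be 365 days:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% -> {minutes / 60:.2f} hours ({minutes:.1f} minutes) per year")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why every additional nine tends to cost far more than the previous one.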

Real-world example:

Google Search is available almost all the time. If one data center fails, traffic is rerouted to others, so most users never notice.

Techniques to improve availability:

Redundancy (multiple servers or replicas)

Load balancing (distribute traffic across healthy servers)

Health checks and automatic failover

Multi-region deployment
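
To make load balancing and failover concrete, here is a rough sketch of a round-robin balancer that skips servers marked unhealthy. The class and method names are hypothetical, and a real load balancer would run health checks itself rather than being told which servers are down:

```python
class RoundRobinBalancer:
    """Hands out servers in rotation, skipping any that are marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._index = 0  # position of the next server to try

    def mark_down(self, server):
        # In practice a failed health check would trigger this.
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Try each server at most once per call, starting where we left off.
        for _ in range(len(self.servers)):
            server = self.servers[self._index]
            self._index = (self._index + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")
```

With three servers and one marked down, requests simply alternate between the remaining two. That is failover in its simplest form: redundancy only helps if traffic stops going to the broken replica.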

Why it matters: If your system isn't available, users can't use it — no matter how good your features are.

✅ 2. Reliability

What it means:

Does the system behave correctly over time?

Analogy:

A calculator that's always on but sometimes gives wrong answers is unreliable.

Real-world example:

When you order something from Amazon, you expect the order to be placed once, charged once, and shipped once, with no duplicates or silent failures.

Techniques to improve reliability:

Data validation

Durable storage (e.g., write-ahead logs, replication)

Monitoring and alerting

Idempotency (safe retries without duplicate effects)
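
Idempotency is the easiest of these to show in code. The sketch below assumes the client sends a unique idempotency key with each logical request. `PaymentService` and its fields are invented for illustration, and a production system would keep the keys in durable storage, not in memory:

```python
class PaymentService:
    """Toy payment service that makes retries safe via idempotency keys."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first attempt
        self.charges = []   # charges that were actually performed

    def charge(self, idempotency_key, customer, amount_cents):
        # A retry with the same key returns the stored result
        # instead of charging the customer a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        self.charges.append((customer, amount_cents))
        result = {
            "status": "charged",
            "customer": customer,
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = result
        return result
```

If the network drops the first response and the client retries with the same key, the customer is still charged exactly once.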

Why it matters: A system that's unreliable breaks trust — and trust is everything in software.

✅ 3. Scalability

What it means:

Can the system handle growth — more users, more data, more traffic?

Two types of scaling:

Vertical scaling:

Add more power to one machine (like upgrading your laptop)

Horizontal scaling:

Add more machines and split the work (like hiring more chefs in a busy kitchen)

Real-world example:

Instagram handles billions of photo uploads. It scales horizontally across servers and regions.

Techniques to improve scalability:

Sharding (split data across partitions)

Load balancing

Caching

Queues (to smooth out traffic spikes)
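
As a small illustration of sharding, here is one common approach: hash the key and take the result modulo the number of shards. The function name and the key format are assumptions made for this example:

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number in the range [0, num_shards)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Use a stable hash: Python's built-in hash() for strings is randomized
    # per process, so it would send the same key to different shards
    # after a restart.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for user_id in ("user:1", "user:2", "user:3"):
        print(user_id, "-> shard", shard_for(user_id, num_shards=4))
```

One caveat: with plain modulo sharding, changing the number of shards moves most keys to a different shard. Consistent hashing is a common way to reduce that reshuffling.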

Why it matters: Scalability ensures your system doesn't collapse when demand grows.

✅ 4. Throughput

What it means:

How much work can your system complete per unit of time? It's about volume, not the speed of any single request (that's latency).

Analogy:

A toll plaza that processes 100 cars per minute has high throughput. If it handles only 5, traffic backs up.

Real-world example:

Stripe processes thousands of payments every second. YouTube streams enormous volumes of video around the clock.

Techniques to improve throughput:

Parallelism (multiple workers handle tasks)

Batching (group tasks to reduce overhead)

Asynchronous processing (don't block on slow tasks)

Efficient protocols (like gRPC or QUIC)
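
Batching is a good example of trading a little latency for a lot of throughput. The sketch below groups rows so that each round trip carries several of them. `FakeDatabase` is a stand-in invented for this example, not a real client library:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items; the last batch may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


class FakeDatabase:
    """Hypothetical database that counts round trips."""

    def __init__(self):
        self.round_trips = 0
        self.rows = []

    def insert_many(self, rows):
        self.round_trips += 1  # one network call, however many rows it carries
        self.rows.extend(rows)


def insert_all(db, rows, batch_size):
    """Insert rows in batches so per-call overhead is paid once per batch."""
    for batch in batched(rows, batch_size):
        db.insert_many(batch)
```

Inserting 10 rows with a batch size of 4 takes 3 round trips instead of 10. The per-call overhead (network, parsing, commit) is paid per batch rather than per row.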

Why it matters: High throughput means your system can handle heavy traffic without choking.

✅ 5. Fault Tolerance

What it means:

Can the system survive failures and keep running?

Analogy:

A bridge that still works even if one pillar collapses — that's fault tolerance.

Real-world example:

Netflix deliberately shuts down production servers at random (using a tool called Chaos Monkey) to test fault tolerance. Their system keeps working even when parts fail.

Techniques to improve fault tolerance:

Replication (multiple copies of data)

Circuit breakers (stop cascading failures)

Graceful degradation (partial service instead of total failure)

Retry with backoff (handle temporary errors without overwhelming the system)
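
Retry with backoff is worth seeing in code, because the details matter. This is a minimal sketch using exponential backoff with random jitter. The function name, default values, and the choice of which exceptions count as temporary are all assumptions for illustration:

```python
import random
import time


def retry_with_backoff(
    operation,
    max_attempts=5,
    base_delay=0.1,
    max_delay=5.0,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Call operation(), retrying temporary errors with growing, jittered delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # The delay cap doubles each attempt: 0.1s, 0.2s, 0.4s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter spreads retries out so many clients
            # don't all hit the recovering service at the same moment.
            sleep(random.uniform(0, cap))
```

The `sleep` parameter exists only so the waiting can be swapped out in tests. Note that retries like this are only safe when the operation is idempotent.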

Why it matters: Failures are inevitable. Fault tolerance makes sure they don't take down your whole system.

✅ 6. Maintainability

What it means:

How easy is it to fix, change, or improve the system?

Analogy:

A car with modular parts is easier to repair than one that needs a specialist for every fix.

Real-world example:

A codebase split into well-defined modules or services lets teams update features independently. Enterprise CRMs, which customers extend heavily, must be easy to customize and debug.

Techniques to improve maintainability:

Clean abstractions (hide complexity)

Modular design (isolate components)

Documentation

Logging and observability

Separation of concerns
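
Here is a small sketch of what a clean abstraction and separation of concerns can look like. All names are hypothetical. The point is that `greet` depends only on an interface, so the storage behind it can change without touching the business logic:

```python
from typing import Dict, Optional, Protocol


class UserStore(Protocol):
    """The only thing business logic needs to know about storage."""

    def get(self, user_id: str) -> Optional[str]: ...

    def put(self, user_id: str, name: str) -> None: ...


class InMemoryUserStore:
    """One implementation; a database-backed one could replace it later."""

    def __init__(self) -> None:
        self._names: Dict[str, str] = {}

    def get(self, user_id: str) -> Optional[str]:
        return self._names.get(user_id)

    def put(self, user_id: str, name: str) -> None:
        self._names[user_id] = name


def greet(store: UserStore, user_id: str) -> str:
    """Business logic: knows nothing about how users are stored."""
    name = store.get(user_id)
    return f"Hello, {name}!" if name is not None else "Hello, guest!"
```

Swapping the in-memory store for a real database later would not require any change to `greet`, and tests can keep using the in-memory version.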

Why it matters: Maintainable systems evolve faster, break less, and are easier to debug.

🔗 How These NFRs Are Interlinked

NFR	Related To	Why They're Linked
Availability	Fault Tolerance	Faults must be handled to stay available
Reliability	Fault Tolerance	Surviving failures keeps data safe
Scalability	Throughput	More machines = more capacity
Maintainability	Reliability	Easier fixes = fewer bugs

🧠 Final Thought

Non-functional requirements are the compass of system design. They help us build systems that aren't just functional — they're fast, reliable, scalable, and resilient. As we move forward in this course, we'll explore the techniques that help us achieve these goals — one trade-off at a time.