Last Updated: April 15, 2026 at 12:30

Low Latency Microservices: Why Systems Are Slow and How to Stop Waiting

Stop asking "how do we make our code faster?" Start asking "how many times does our system wait?" Latency is not the time your system spends computing. It is the time your system spends waiting.

Systems are slow not because they do too much work, but because they wait too many times — on networks, locks, downstream services, cold data, and coordination. You can optimise your code to run in 1ms, but if it waits on five network calls each taking 50ms, your latency is 251ms. This article reframes latency as waiting, not computing. You will learn the five sources of latency, why tail latency explodes when you fan out to many services, why systems fall off a cliff rather than slowing gradually, and why the fastest network call is the one you never make. One insight: A system is fast not when it computes quickly, but when it waits rarely. Every millisecond in your system is a place where something waited. Remove the wait, remove the millisecond.


Defining Latency

Latency is the total elapsed time from when a request is initiated to when a response is received by the caller. It is end-to-end waiting time, not processing time.

Two things latency is not:

  1. It is not computation time. The time your CPU spends executing code is typically a small fraction of total latency.
  2. It is not throughput. Throughput measures how many requests a system handles per second. A system can have high throughput and poor latency simultaneously.

The implication is important: you can make your code ten times faster and see no meaningful improvement in user-perceived latency — because computation was never the bottleneck. Waiting was.

Why Latency Matters

For many systems, latency is a quality-of-life concern. Pages load a little slower; users are mildly frustrated. For others, it is a hard business constraint with direct financial consequences.

Algorithmic trading systems operate on microsecond windows. A latency disadvantage measured in microseconds translates directly into missed trades. Real-time bidding platforms must respond to ad auction requests within 100ms — miss the deadline and the bid is void. Payment authorisation systems are expected to respond in under 300ms; slowdowns affect conversion rates at scale.

Understanding whether latency is a soft quality concern or a hard constraint determines how aggressively — and where — you should optimise. This distinction matters because optimising latency has real costs, which we will address near the end.

The Core Reframe: Every Millisecond Is a Wait

Here is an assumption almost every team makes: "If our system is slow, we need to make our code faster."

In practice, this is often not where the problem lies.

You can optimise your code to run in 1ms. But if your service calls five downstream services, each taking 50ms, your total latency is 251ms. The code is not the problem. The waiting is. This waiting is often invisible in traditional performance analysis, as it appears not as slow execution, but as time between operations.

This reframe has a practical consequence: the question to ask of a slow system is not "what is running slowly?" It is "where is this request waiting, and why?"

A system is fast not when it computes quickly, but when it waits rarely.

Three Principles That Underpin Everything

Before we go into sources and solutions, three mental models unify what follows. These are not abstract concepts — they are the lens through which every later section should be read.

Principle 1: The Critical Path

Not all work in a system contributes equally to latency. The critical path is the sequence of steps that must complete sequentially before the user receives a response. Everything on the critical path adds to latency directly. Everything off it can be deferred, parallelised, or made asynchronous without affecting response time.

Before optimising anything, identify the critical path. Work that is not on the critical path does not need to be fast — it needs to not block the path.

Principle 2: The Weakest Link

System latency is determined by the slowest component on the critical path.

If you have ten services on the critical path and nine respond in 10ms while one responds in 200ms, your latency is 200ms. The nine fast services are irrelevant until you fix the slow one. This means uniform improvements across all services are usually wasteful. Identify the bottleneck. Fix it. Repeat.

Principle 3: Latency Compounds Under Fan-out

In a distributed system, latency does not simply add — it compounds. When a service fans out to multiple downstream dependencies, each additional call increases the probability that at least one takes longer than expected, widens variance in end-to-end response times, and raises the worst-case ceiling.

This is why optimising individual services in isolation often fails to improve system-level latency. The system's behaviour under fan-out is what matters, not the average performance of any single component.

Together, these three principles form a hierarchy: first, find the critical path. Then, find the weakest link on it. Then, understand how fan-out is compounding that weakness.

A Real Failure: The 300ms Product Page

Before we get to solutions, let us look at how waiting kills performance in practice. This is a story that plays out at nearly every e-commerce company.

A product page needs to load five things: product details, inventory status, pricing with promotions, reviews, and recommendations. The engineering team optimises each service. Each one now runs its business logic in 10ms — genuinely fast code.

But the page still feels slow. Here is why.

The product page calls the catalog service (10ms compute + 50ms network). Then inventory (10ms + 50ms). Then pricing (10ms + 50ms). Then reviews (10ms + 50ms). Then recommendations (10ms + 50ms). These calls happen one after another, each waiting for the previous to finish.

Total latency: 250ms of network waiting + 50ms of compute = 300ms.

Notice what this is: a critical path problem and a fan-out problem simultaneously. Every call is on the critical path, and every call is sequential. The 300ms is not the sum of computation. It is the sum of waiting.

The team adds caching to the catalog. That call now takes 20ms total. But the other four still wait. Latency drops to 260ms. Still slow. Users still complain.

The root cause was never the code. The system was slow because it waited on five sequential network calls. The team had optimised compute. They had not reduced waiting.

What they actually needed: parallel calls, batch APIs, precomputed aggregates, or moving data closer to the service that needs it.

The fastest network call is the one you never make. The second fastest is the one you make in parallel.
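The product-page story can be sketched directly, using asyncio.sleep to stand in for the 50ms network round-trips. The service names and delays are illustrative, not a real API:

```python
import asyncio
import time

SERVICES = ["catalog", "inventory", "pricing", "reviews", "recommendations"]

async def fetch(service: str) -> str:
    await asyncio.sleep(0.05)                 # stands in for a 50ms network wait
    return f"{service}-data"

async def sequential() -> list:
    # Each call waits for the previous one: latency is the SUM of the waits.
    return [await fetch(s) for s in SERVICES]

async def parallel() -> list:
    # All five calls in flight at once: latency is the SLOWEST wait, not the sum.
    return await asyncio.gather(*(fetch(s) for s in SERVICES))

def timed(coro_fn):
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start

seq_result, seq_elapsed = timed(sequential)   # roughly five stacked waits
par_result, par_elapsed = timed(parallel)     # roughly one overlapping wait
```

Both versions return the same data; only the waiting differs. Sequential completes in roughly the sum of the round-trips, parallel in roughly the slowest single one.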

The Sources of Latency: A Layered Model

Latency is not a single problem — it is a stack of waiting that a request passes through on its way from the user to a response. Understanding it as a layered model rather than a flat list makes it easier to reason about where to intervene.

The layers are: hardware, infrastructure, security, network and coordination, data access, and contention. Most requests experience all of them. The ones that dominate change depending on the system.

Layer 1: Hardware

Every layer in the memory hierarchy has a different latency cost. An L1 cache access takes roughly 1 nanosecond. Main memory takes around 100 nanoseconds. An NVMe SSD takes around 100 microseconds. A network call within a datacenter takes 1–5 milliseconds. A cross-region network call takes 50–200 milliseconds.

Network calls do not dominate latency because networks are slow in absolute terms. They dominate because they are five to eight orders of magnitude slower than in-process computation. Moving data across a network is a fundamentally different class of operation from reading it from memory. This is the physical foundation for everything that follows.

Layer 2: Infrastructure

Several layers of infrastructure add latency that application code cannot control directly.

Kubernetes scheduling delays can add tens to hundreds of milliseconds when pods are scheduled or rescheduled.

Service mesh sidecars (such as Envoy in Istio) add 1–5ms per request from proxying, telemetry, and policy enforcement.

Load balancer behaviour — health check intervals, connection draining, routing algorithms — introduces latency variability.

And connection overhead is significant enough to deserve its own attention: every new TCP connection requires a round-trip handshake, and TLS requires two more. On a 50ms cross-region link, that is 150ms of overhead before a single byte of your request is sent. Connection pooling and HTTP/2 multiplexing amortise this cost by reusing connections.
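A minimal sketch of why pooling matters: the pool pays the handshake once per connection rather than once per request. Real pools also need liveness checks, timeouts, and sizing; this is illustrative only:

```python
import queue

class ConnectionPool:
    """Pay the connection handshake once per pooled connection, not per request."""

    def __init__(self, size: int, connect):
        self._connect = connect
        self.handshakes = 0              # counts how many handshakes we actually paid
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(self._new_conn())

    def _new_conn(self):
        self.handshakes += 1             # cross-region, each of these costs ~150ms
        return self._connect()

    def acquire(self):
        return self._pool.get()          # reuse an already-established connection

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=2, connect=lambda: object())
for _ in range(100):                     # 100 requests...
    conn = pool.acquire()
    pool.release(conn)
# ...but only 2 handshakes were ever paid
```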

Cold starts belong here too. When your service autoscales, the first request to a new instance waits for container startup, runtime initialisation, dependency injection, cache warming, and connection pool establishment. This can cost seconds. Pre-warming instances, keeping a minimum instance count, and using readiness checks before traffic hits will protect against it.

Layer 3: Security

Security mechanisms on the request path have real latency costs. TLS handshakes add 1–2 round trips per new connection (TLS 1.3 reduces this to one). Token validation, JWT verification, or external auth calls add latency on every request — token caching is often the correct mitigation. API gateway and WAF inspection can add 5–20ms before your service logic runs. These costs are frequently overlooked on internal services where security feels invisible.

Layer 4: Communication Latency

This layer covers the two sources that dominate most distributed system latency.

Network latency accumulates from cross-service calls, cross-region calls, chatty APIs, and serialisation overhead. Even within the same datacenter, a network call typically costs 1–5ms. Cross-region, it costs 50–200ms. Most systems are slow simply because they talk too much.

Coordination latency comes from waiting for locks, distributed transactions, or multiple services to agree on state. The more coordination a request requires, the more it waits. Coordination paths do not just limit scale — they dominate latency.

Layer 5: Data Access Latency

Waiting for databases, disk I/O, cold caches, or remote storage. Even a fast database query takes 10–50ms. A slow one takes 1000ms. Data that is not in memory, not in cache, or is physically far away is data that makes your system wait.

The N+1 query problem belongs here. Service A fetches a list of 100 items in one query (10ms). Then it fetches details for each item individually — 100 more queries at 10ms each = 1000ms. The code is not slow. The pattern is slow. The fix is to batch: joins, prefetching, or data loaders. Turn N+1 into 1+1.
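The N+1 pattern can be sketched with a counter standing in for database round-trips. The in-memory "database" and query helpers are hypothetical:

```python
# Hypothetical data store; each helper increments a round-trip counter.
DB = {i: {"id": i, "name": f"item-{i}"} for i in range(100)}
queries = 0

def fetch_ids():
    global queries
    queries += 1
    return list(DB)

def fetch_one(item_id):
    global queries
    queries += 1                          # one round-trip PER item
    return DB[item_id]

def fetch_many(ids):
    global queries
    queries += 1                          # one batched round-trip (e.g. WHERE id IN (...))
    return [DB[i] for i in ids]

# N+1: one query for the list, then one per item
queries = 0
items = [fetch_one(i) for i in fetch_ids()]
n_plus_1 = queries                        # 101 round-trips

# 1+1: one query for the list, one batched query for the details
queries = 0
items = fetch_many(fetch_ids())
one_plus_one = queries                    # 2 round-trips
```

At 10ms per round-trip, that is the difference between roughly a second of waiting and roughly 20ms.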

Processing latency — CPU-heavy work, serialisation, large payload transformations — also falls in this layer. It is real, but it is usually the smallest component of total latency. Teams that focus here first are often solving the wrong problem.

Layer 6: Contention Latency

As load increases, requests begin to compete for shared resources: threads, connection pools, message queues, database locks. This competition produces queueing latency, and it is the source of the most dangerous latency behaviour in distributed systems.

Queue wait time is not proportional to load — it grows without bound as a system approaches saturation. This is what creates the cliff. A small increase in load near 90% capacity does not cause a small increase in latency. It causes a catastrophic spike. Systems do not get gradually slower. They collapse.

Every request experiences latency across multiple stacked layers: hardware, infrastructure, security, network, data, and contention. The total waiting time is the sum of whatever each layer charges. Reducing latency means identifying which layer is the current bottleneck — and in many systems, multiple layers contribute simultaneously.

In most real systems, the weakest link emerges from only one or two of these layers — typically network communication, data access, or contention under load. Identifying which layer dominates your critical path is the key step in reducing end-to-end latency.

External dependencies deserve special attention.

Not all network calls are equal. Calls to systems you do not control — third-party APIs, external payment gateways, SaaS services, or even internal services owned by other teams — introduce a different class of latency risk.

Their latency characteristics are not governed by your infrastructure, your scaling policies, or your optimisation efforts. They may be slower, more variable, and less predictable. They may degrade under load in ways you cannot see. And when they do, they become part of your critical path.

This creates an important design constraint: you cannot optimise what you do not control. You can only design around it.

The implications are architectural. External calls should be treated as high-latency, high-variance boundaries. They require strict timeouts, circuit breakers, fallbacks, and, where possible, isolation from the critical path. In many systems, the difference between acceptable latency and systemic failure is not how fast internal services are — but how external dependencies are contained.

Tail Latency: Where the Weakest Link Becomes Probabilistic

When a system depends on many services, it only takes one of them to be slow for the entire request to be slow.

Imagine you call ten services in parallel to assemble a response. Each service is usually fast, but occasionally slow — for example, about 1 out of every 100 requests takes much longer than the rest.

What happens when you combine ten such services?

It is not enough that each one is usually fast. Each service has a small chance of being slow on any given request. When you call ten of them, those chances add up.

The probability that all ten respond quickly at the same time is (0.99)^10, which is approximately 90.4%. That means roughly 9.6% of your requests — nearly 1 in 10 — will encounter at least one slow service. And because your response depends on all of them, the entire request becomes slow.

This is what “tail latency” means in practice. Even if each service is slow only 1% of the time (often referred to as its p99 latency), the system as a whole experiences slow responses far more frequently.

You can make each service fast individually and still have a slow system if you call enough of them. Variability in individual services compounds into significant worst-case latency at the system level.
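The fan-out arithmetic generalises to any width. A few lines make the compounding visible:

```python
def p_at_least_one_slow(n_services: int, p_slow: float = 0.01) -> float:
    """Chance that a request fanning out to n services hits at least one slow call."""
    return 1.0 - (1.0 - p_slow) ** n_services

one = p_at_least_one_slow(1)        # 0.01: 1 in 100 requests
ten = p_at_least_one_slow(10)       # ~0.096: nearly 1 in 10
hundred = p_at_least_one_slow(100)  # ~0.63: most requests hit a slow call
```

With 100 dependencies at p99, the "rare" slow case becomes the common case.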

Techniques that help:

Reducing fan-out depth means calling fewer services to build a response. Every additional service you depend on increases the chance that one of them will be slow. If you can combine data ahead of time, or redesign your API so fewer calls are needed, you reduce this risk directly.

Using a Backend for Frontend (BFF) is one way to do this. Instead of the client calling many services, a single backend service gathers all the required data and returns it in one response. This keeps complexity and latency under control in one place.

Hedged requests are a way to deal with unpredictable slowness. If a request to a service is taking longer than expected, you send the same request to another instance of that service and use whichever response comes back first. This reduces the impact of occasional slow instances, at the cost of a small increase in load.

Timeouts define how long you are willing to wait for a dependency. If a service does not respond within that time, you stop waiting and move on — either by returning a fallback response or failing fast. Without timeouts, a slow dependency can block your entire system indefinitely. Circuit breakers build on this idea. If a service is consistently slow or failing, the circuit breaker temporarily stops sending requests to it altogether, allowing the system to recover instead of repeatedly waiting and failing.
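A hedged request can be sketched with asyncio: if the primary instance has not answered within the hedge delay, fire a duplicate and take whichever finishes first. The instance names and delays are simulated, not a real client API:

```python
import asyncio

async def call_instance(name: str, delay_s: float) -> str:
    await asyncio.sleep(delay_s)          # simulated service response time
    return name

async def hedged(primary, backup, hedge_delay_s: float) -> str:
    first = asyncio.ensure_future(primary)
    try:
        # Take the primary if it answers before the hedge delay expires.
        return await asyncio.wait_for(asyncio.shield(first), timeout=hedge_delay_s)
    except asyncio.TimeoutError:
        # Primary is slow: fire the backup and take whichever finishes first.
        second = asyncio.ensure_future(backup)
        done, pending = await asyncio.wait(
            {first, second}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()
        return done.pop().result()

# Primary is unusually slow (300ms); the hedge fires after 50ms and the
# backup (20ms) wins.
winner = asyncio.run(
    hedged(call_instance("primary", 0.3), call_instance("backup", 0.02),
           hedge_delay_s=0.05)
)
```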

Latency Is Not Just Speed — It Is Predictability

A system with consistent 120ms latency is often better than one fluctuating between 50ms and 500ms. Jitter propagates across services. Variance is what breaks SLAs and user experience.

In real-world operation, unpredictability is more damaging than moderate slowness. A predictable 150ms response allows client-side timeouts, UI design, and user expectations to be set accordingly. A system that is sometimes 50ms and sometimes 500ms creates timeouts, retries, and cascading failures.

When you design for low latency, design for predictable latency. This means eliminating sources of variance: garbage collection pauses, lock contention, cold caches, and noisy neighbours. It means using bounded queues, deterministic algorithms, and stable resource allocation. Predictability is the foundation of operational trust.

The Load vs Latency Cliff

At first glance, it seems reasonable to expect latency to scale linearly with load. As utilisation increases, latency should rise in step. Distributed systems do not behave this way.

The actual curve is non-linear. Below 70% utilisation, latency is roughly stable. From 70–80%, it begins to climb. At 90%, it spikes sharply — five to ten times baseline. At 95%, queueing dominates and latency can reach fifty to a hundred times baseline. At 100%, the system collapses.

The reason is the contention layer described above. As load increases, queues form. As queues grow, wait time grows — but not proportionally. Queueing theory tells us that wait time approaches infinity as utilisation approaches 100%. The cliff is not a metaphor. It is the mathematical behaviour of saturated queues.
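The shape of the cliff falls out of elementary queueing theory. A sketch of the single-server (M/M/1) mean-wait formula, where queueing time scales with utilisation / (1 − utilisation); the 10ms service time is illustrative:

```python
def mm1_wait_ms(utilisation: float, service_ms: float = 10.0) -> float:
    """Mean time a request spends queueing for a single M/M/1 server."""
    return service_ms * utilisation / (1.0 - utilisation)

at_50 = mm1_wait_ms(0.50)   # 10ms of queueing
at_90 = mm1_wait_ms(0.90)   # 90ms
at_99 = mm1_wait_ms(0.99)   # 990ms: the cliff
```

Doubling utilisation from 50% to 99% does not double the wait; it multiplies it a hundredfold.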

This is also where retry storms become dangerous. Under high load, requests time out and clients retry. Those retries add more load to an already saturated system, causing more timeouts, causing more retries. A system at 90% load can be pushed into collapse by its own retry logic. Retries should use exponential backoff with jitter — this means increasing the delay between retries each time (for example 100ms, 200ms, 400ms), instead of retrying immediately.
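Backoff with "full jitter" can be sketched in a few lines: each retry waits a random amount up to an exponentially growing cap, so synchronised clients spread out instead of retrying in lockstep. The base, cap, and seed are illustrative; the seed exists only to make the sketch reproducible:

```python
import random

def backoff_delays(attempts: int, base_ms: int = 100, cap_ms: int = 10_000, seed: int = 0):
    """Full jitter: each retry sleeps a random time below an exponentially growing cap."""
    rng = random.Random(seed)        # seeded only so the sketch is reproducible
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_ms, base_ms * 2 ** attempt)   # 100, 200, 400, 800, ...
        delays.append(rng.uniform(0.0, ceiling))        # jitter de-synchronises clients
    return delays

delays = backoff_delays(5)
```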

Stay below 70–80% of your capacity ceiling. Autoscale before you approach the cliff, not after you have fallen off it.

Protecting Latency Under Pressure: Control Mechanisms

Reducing latency under normal load is one problem. Protecting it under abnormal load is another. These mechanisms form a second layer of defence.

Backpressure is the ability of a system to signal to its callers that it cannot accept more work at its current rate. Without it, overloaded services queue requests indefinitely, latency explodes, and the system collapses. Backpressure mechanisms include admission control (rejecting requests above a defined threshold before they enter the system), rate limiting (returning HTTP 429 when per-client budgets are exceeded), load shedding (dropping lower-priority requests under saturation to protect capacity for high-priority ones), and load-aware routing (directing traffic toward instances with available capacity).
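Admission control is often implemented as a token bucket: requests drain tokens, tokens refill at the sustained rate, and a request that finds the bucket empty is rejected rather than queued. A minimal single-threaded sketch with illustrative parameters:

```python
import time

class TokenBucket:
    """Admit requests at a sustained rate with a bounded burst; reject the rest."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                 # caller should shed the request (e.g. HTTP 429)

bucket = TokenBucket(rate_per_s=10, burst=5)
admitted = sum(bucket.allow() for _ in range(100))   # a burst of 100 arrives at once
# only the burst allowance is admitted; the rest are rejected instead of queueing
```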

Timeouts as first-class design. A system without timeouts is not a low-latency system — it is a system that waits indefinitely. Timeouts define how long you are willing to wait. Without them, queues grow, threads block, and cascading failure happens. Every downstream call must have a timeout. Every timeout must have a fallback or failure path. This is not optional.
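A timeout with a fallback path can be sketched with asyncio.wait_for. The slow dependency and the fallback value are illustrative stand-ins:

```python
import asyncio

async def slow_dependency() -> str:
    await asyncio.sleep(1.0)        # simulates a hung downstream service
    return "fresh-data"

async def call_with_timeout(timeout_s: float) -> str:
    try:
        return await asyncio.wait_for(slow_dependency(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "fallback-data"      # fail fast with a degraded response

result = asyncio.run(call_with_timeout(0.05))   # gives up after 50ms
```

The caller's latency is now bounded at 50ms regardless of how long the dependency hangs.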

SLO design turns latency from an aspiration into a contract. Define latency budgets per endpoint and per user journey, not globally — a search endpoint and a checkout endpoint have different requirements. Express SLOs at the tail: "p99 (99th percentile) < 200ms" is a meaningful contract. "Average < 50ms" is not. Allocate the latency budget across the call chain. Alert on SLO breaches, not on raw latency spikes — this keeps alerting focused on user impact rather than noise.

How to Design a Low Latency System (End-to-End Approach)

If you are designing a system today, follow this sequence. Think of it as moving from definition → structure → optimisation → protection → validation.

1. Define the latency goal (SLO first)

Start by defining what “fast enough” means for the user.

For example:

p99 < 300ms for checkout

This becomes your system contract — everything else is designed around it.

2. Identify the critical path

Map the exact steps a request must go through before it returns a response.

Then remove everything that does not need to block the user.

If something is not on the critical path, move it to async processing.

3. Budget latency across services

Break your total SLO into per-service limits.

Example for a 300ms budget:

  1. API Gateway: 20ms
  2. Service A: 80ms
  3. Service B: 100ms
  4. Database: 80ms
  5. Buffer for network + variance: 20ms

This turns latency from “system behaviour” into a contract between services.

4. Reduce fan-out (fewer moving parts)

Every additional service increases the chance of slowdowns.

Where possible:

  1. merge calls (BFF pattern)
  2. aggregate data earlier
  3. reduce cross-service dependencies

Fewer services on the path = fewer opportunities for waiting.

5. Remove unnecessary calls entirely

Before optimising anything, ask:

Do we even need this call?

Often the best latency improvement is:

  1. precompute results
  2. denormalise data
  3. redesign the API
  4. use CQRS for read-heavy flows

The fastest call is the one you do not make.

6. Use caching where it actually reduces waiting

Use caching in layers:

  1. in-process (fastest)
  2. distributed cache (Redis)
  3. CDN (closest to user)

Use patterns like stale-while-revalidate to avoid users waiting on cache misses.

7. Use parallelism instead of sequential waiting

If multiple calls are unavoidable, run them in parallel.

Latency becomes:

the slowest call, not the sum of all calls

8. Protect the system under load

This is what prevents collapse.

  1. Backpressure → stop accepting more work than you can handle
  2. Rate limiting → protect services from overload
  3. Exponential backoff with jitter → retries slow down and randomize to avoid traffic spikes
  4. Timeouts → stop waiting after a fixed limit

Without timeouts, latency becomes infinite under failure.

9. Validate with real measurements

Before production:

  1. simulate load
  2. measure p99 latency
  3. trace slow requests

Focus only on the slowest few percent — that is where real problems live.

One-line memory model

Design flows through five ideas:

Contract → Path → Budget → Protection → Measurement

Six Levers for Reducing Latency

These are the practical techniques. Each one maps to a layer in the latency model above.

Lever 1: Eliminate Calls

Do not ask "how do we make this call faster?" Ask "do we need this call at all?" Precomputation, data denormalisation, CQRS (Command Query Responsibility Segregation), and BFF aggregation can all remove round trips from the critical path entirely.

Lever 2: Collapse Round Trips

When you cannot eliminate a call, reduce sequential waiting. Call five services concurrently instead of sequentially — your latency becomes the slowest of them, not the sum. Use batch APIs to fetch multiple records in one request rather than one per item. Parallelism is free latency reduction. Use it everywhere the Weakest Link Principle allows.

Lever 3: Move Data Closer to Compute

Most latency comes from one simple fact: data is far away from the code that needs it.

If every request has to travel across the network to fetch data, it will always be slow. The solution is to bring frequently used data closer to your service.

This is what caching does.

There are three common levels:

  1. In-memory (inside your service) → fastest, but not shared across instances
  2. Distributed cache (like Redis) → slightly slower, but shared across services
  3. CDN / edge cache → data is served from locations close to the user

The closer the data is, the less waiting is involved.

Two practical patterns make caching more effective:

  1. Stale-while-revalidate: return cached data immediately, then update it in the background. The user does not wait, even if the data is slightly outdated.
  2. Avoid cache stampede: when many requests hit an expired cache at once and overload the database. Prevent this by refreshing caches early or in the background instead of all at once.

The goal is simple: reduce how often your system needs to go to slower layers like databases or remote services.
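One way to sketch stale-while-revalidate for a single-process cache: serve what is cached immediately and refresh in the background once it goes stale. The non-blocking lock doubles as minimal stampede protection, since at most one refresh runs at a time. This is illustrative, not a production cache:

```python
import threading
import time

class SwrCache:
    """Serve cached data immediately; refresh in the background once stale."""

    def __init__(self, loader, ttl_s: float):
        self.loader = loader
        self.ttl = ttl_s
        self.value = None
        self.fetched_at = 0.0
        self._refreshing = threading.Lock()   # allows at most one background refresh

    def get(self):
        now = time.monotonic()
        if self.value is None:                # cold miss: the only synchronous wait
            self.value, self.fetched_at = self.loader(), now
        elif now - self.fetched_at > self.ttl and self._refreshing.acquire(blocking=False):
            threading.Thread(target=self._refresh, daemon=True).start()
        return self.value                     # possibly stale, never a wait

    def _refresh(self):
        try:
            self.value, self.fetched_at = self.loader(), time.monotonic()
        finally:
            self._refreshing.release()

loads = []
cache = SwrCache(loader=lambda: loads.append(1) or len(loads), ttl_s=0.05)
first = cache.get()     # cold: loads synchronously
second = cache.get()    # fresh: served from cache, no load
```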

Lever 4: Make Non-Critical Work Async

If the user does not need to wait for something, do not make them wait. Fire-and-forget patterns, event-driven flows, and background processing keep the critical path short. The important distinction: async reduces latency for the caller. The work still happens — it just happens after you have already responded.

Lever 5: Reduce Payload Size

Latency grows with bytes as well as with operations. Compression, field filtering, and pagination reduce the amount of data moving across your network. For algorithmic choices, a database index is a precomputed access path — building the right indexes turns O(n) reads into O(log n) or better. In some contexts, an approximate answer delivered quickly is more valuable than an exact answer delivered slowly.

Lever 6: Optimise for Tail Latency, Not Averages

Your average latency might look fine while your p99 is miserable. Measure p99 and p999 for every service. Use hedged requests, circuit breakers, and strict timeouts. Your system is only as fast as its slowest 1 percent — and your users are the ones who find out.

The Latency vs Consistency Trade-off

Low latency and strong consistency often pull in opposite directions.

To guarantee strong consistency, systems need coordination. For example, after a write, the system may need to confirm that all replicas have updated before responding. Distributed transactions require multiple services to agree before proceeding. Each of these steps adds waiting to the critical path.

That waiting increases latency.

In practice, this means you often have to choose: do you want the fastest response, or the most up-to-date and perfectly consistent data?

Low latency systems reduce waiting by relaxing consistency where it is acceptable:

  1. Payments require strong consistency — correctness matters more than speed
  2. Product reviews can be slightly delayed — a few seconds of inconsistency is acceptable
  3. Shopping carts can update immediately and confirm in the background

The key idea is simple: the less coordination you require, the less your system has to wait.

Low latency is not free. It is achieved by carefully choosing where strict consistency is necessary — and where it is not.

User-Perceived Latency: What Actually Matters

Backend response time is only part of the story. What users experience is different, and often more forgiving if you approach it correctly.

User-perceived latency includes time to first byte, time to interactive, and time to visual completion. A page that loads data in 400ms but shows the user something useful in 100ms feels faster than a page that waits 300ms to show anything at all. Perception is part of the design.

Optimistic UI updates change the interface immediately and confirm with the server in the background. For most actions — liking a post, adding an item to a cart — this is perfectly acceptable. Skeleton screens show an empty layout the instant the page loads, so the user sees structure immediately even while data is still arriving. Progressive loading prioritises the critical content first — product title and price before reviews. Streaming responses send data as it becomes available rather than buffering everything before sending anything.

The user does not care when your database returns. They care when the screen updates.

Low Latency Is a Trade-off, Not a Default Goal

Achieving low latency in distributed systems is not free. It requires duplicating data, increasing infrastructure cost, relaxing consistency guarantees, and accepting greater system complexity. Every latency optimisation adds surface area to the system — parallel execution requires coordination logic, eventual consistency requires compensating logic for edge cases, caching requires invalidation strategies.

Not every part of every system should optimise for the lowest possible latency. Only the parts where latency directly affects user experience or business value. Optimise the critical path. Accept latency everywhere else in exchange for simplicity, consistency, and lower cost.

The appropriate level of optimisation depends entirely on the system's purpose.

Algorithmic trading and real-time bidding: latency is a primary product requirement.

Payment authorisation: latency directly affects conversion, and strong consistency is non-negotiable — optimise within those constraints.

Customer-facing web applications: optimise the critical path; tolerate latency elsewhere.

Analytics pipelines and background processing: throughput and correctness matter more than latency. Batch processing and eventual consistency are often the right defaults.

Anti-Patterns to Eliminate

These are the patterns that most commonly cause waiting in production systems.

The Chatty Service Trap. Service A calls Service B for every item in a list. 100 items, 100 network calls. Fix: batch APIs.

The Synchronous Waterfall. Service A calls B, which calls C, which calls D. Latency is the sum of the chain. Fix: parallelise where the critical path allows; replace synchronous chains with events.

The Cold Cache Problem. Every request misses the cache, hammering the database. Fix: pre-warm caches on startup, use aggressive TTLs, consider cache-aside patterns with fallback.

The Single Source of Truth Trap. Every read — even high-frequency, low-priority reads — goes to the primary database. Fix: read replicas, CQRS, denormalised read models for query-heavy paths.

Tail Latency Ignorance. Optimising average latency while ignoring p99. Fix: measure the tail; set SLOs on p99.

The Cross-Region Penalty. A service in us-east calling a service in eu-west synchronously on every request. Fix: partition workloads by region; keep data and compute together.

The N+1 Query Trap. One query for a list, then N queries for details. Fix: joins, batching, prefetching.

Connection Churn. Every request opens a new TCP connection, paying the handshake cost each time. Fix: connection pooling, HTTP/2 multiplexing.

The Cold Start Trap. The first request after a scale-up event waits for full initialisation. Fix: pre-warm instances, set a minimum instance count, use readiness checks before traffic is sent.

Observability: You Cannot Fix What You Cannot Measure

Everything we discussed so far depends on one thing: you must be able to see where your system is waiting.

If you cannot see it, you cannot fix it.

Start with latency metrics — but not averages. Averages hide problems.

Instead, measure how your system behaves across all requests, not just the average.

  1. p50 (median) → half of your requests are faster than this, and half are slower. This is what a typical user experiences.
  2. p95 / p99 → these show slower requests. For example, p99 means 99 out of 100 requests are faster than this value — but 1 request is slower. This is where problems usually start to appear.
  3. p999 → the rare worst cases. These are the requests that feel very slow to users and often trigger retries or failures.

Why this matters: Your average latency might look fine, but users don’t experience the average. They experience the slow moments.

A system with an average of 100ms but a p99 of 800ms will feel slow and unreliable, even though the average looks good.

These numbers tell you how your system behaves when things are not perfect — and that is what users actually experience.
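The gap between the average and the tail is easy to reproduce with the nearest-rank percentile method. The sample latencies are made up:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 100 requests: most fast, a few slow — the average hides the tail.
latencies_ms = [100] * 97 + [800, 900, 1000]

avg = sum(latencies_ms) / len(latencies_ms)   # 124ms: looks fine
p50 = percentile(latencies_ms, 50)            # 100ms: the typical request
p99 = percentile(latencies_ms, 99)            # 900ms: what unlucky users see
```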

Break latency down across services.

When one service calls another, measure how long each step takes. This helps you see where time is being spent.

Some simple signals help you understand why your system is slow:

  1. Cache hit rate → if low, your system is waiting on databases
  2. Queue depth / wait time → if high, your system is overloaded and requests are waiting in line
  3. Connection pool usage → if exhausted, requests are waiting just to make a call
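These signals are just ratios, so they are cheap to compute from counters you likely already have. A minimal sketch, with hypothetical counter values chosen only for illustration:

```python
def cache_hit_rate(hits, misses):
    """Fraction of reads served from cache; the rest fell through to the database."""
    total = hits + misses
    return hits / total if total else 0.0

def pool_utilisation(in_use, pool_size):
    """Fraction of pooled connections currently busy."""
    return in_use / pool_size

# A 30% hit rate means 7 of every 10 reads wait on the database.
assert cache_hit_rate(hits=300, misses=700) == 0.3

# A pool at 100% means new requests queue before they can even make a call.
assert pool_utilisation(in_use=50, pool_size=50) == 1.0
```

Each ratio answers the same underlying question: is this resource absorbing waits, or creating them?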

Set alerts based on real user impact.

Instead of alerting on random spikes, define an SLO like: “99% of requests should complete within 300ms”

Alert when this is violated — not when a single request is slow.

The most powerful tool here is distributed tracing. A trace shows the full journey of a request: which services it touched, how long each took, and where it waited.

Without tracing, you are guessing. With tracing, you can see exactly where the delay is.
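Real deployments use tracing systems like Jaeger or Zipkin, but the core idea fits in a few lines. This toy sketch records named spans with durations; the service names and sleep times are invented, and a real tracer would also propagate context across process boundaries.

```python
import contextlib
import time

SPANS = []  # (name, duration_ms) pairs -- a stand-in for a real trace backend

@contextlib.contextmanager
def span(name):
    """Record how long the enclosed step took, like a tracing span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append((name, (time.perf_counter() - start) * 1000.0))

with span("handle_request"):
    with span("auth_service"):
        time.sleep(0.01)  # simulated fast downstream call
    with span("db_query"):
        time.sleep(0.03)  # simulated slow query -- the wait we want to find

# The child span with the largest duration is where the request waited most.
children = [s for s in SPANS if s[0] != "handle_request"]
slowest = max(children, key=lambda s: s[1])
```

Even this toy version answers the key question: of all the steps inside `handle_request`, which one held the request the longest?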

The goal of observability is not to monitor everything.

It is to answer one question: Where is the request waiting the most right now?

That is where you focus.

Tools you will often see: Prometheus (metrics), Jaeger or Zipkin (tracing), Grafana or Datadog (dashboards).

Measure the waiting. Then remove it.

A Decision Framework for Latency Bottlenecks

When a system is slow, work through these questions in order. The sequence matters — it follows the principle hierarchy established at the start.

First, identify the critical path. Trace a slow request end to end. Find the gaps — the time between service calls, the time spent in queues, the time waiting on a database.

Second, apply the Weakest Link Principle. Do not optimise broadly. Ignore everything except the slowest component on the critical path. That is where to focus. Fix that first, then repeat.

Third, look for fan-out amplification. Are you calling many services in parallel? How is tail latency compounding across them? Can you reduce fan-out depth or aggregate calls server-side?

Fourth, work through the six levers in order: can we eliminate this call entirely? If not, can we parallelise it? Can we move the data closer? Can we make this work async? Can we reduce the payload? Are we measuring and optimising the tail?

Fifth, check for contention. What is the queue depth? What is connection pool utilisation? Where are we on the load vs latency curve — and do we have backpressure in place if we approach the cliff?

Sixth, consider the trade-offs. What consistency requirement does this path have? Is the latency optimisation worth the added complexity and cost? Is this on the critical path at all?
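The "parallelise" lever in step four is worth seeing in numbers. A minimal asyncio sketch, where `call_service` is a hypothetical stand-in for a downstream network call with 50ms latency:

```python
import asyncio
import time

async def call_service(name, delay_s):
    # Stand-in for a downstream network call with fixed latency.
    await asyncio.sleep(delay_s)
    return name

async def sequential():
    # Three 50ms calls in a row: total latency is the SUM of the waits.
    return [await call_service(n, 0.05) for n in ("profile", "orders", "recs")]

async def parallel():
    # The same calls fanned out together: total latency is the MAX of the waits.
    return await asyncio.gather(
        *(call_service(n, 0.05) for n in ("profile", "orders", "recs"))
    )

t0 = time.perf_counter()
asyncio.run(sequential())
seq_s = time.perf_counter() - t0

t0 = time.perf_counter()
asyncio.run(parallel())
par_s = time.perf_counter() - t0
```

Sequential execution takes roughly the sum of the three waits (~150ms); parallel execution takes roughly the longest single wait (~50ms). The work done is identical; only the waiting pattern changed, which is the whole lesson of this article in miniature.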

Final Thought

You can write the fastest code in the world. It will not matter if your system waits on a network call, a database query, a lock, a queue, a connection handshake, a cold start, or an N+1 query pattern.

Latency is not a property of services. It is a property of how waiting flows through a system — layer by layer, hop by hop, queue by queue. The systems that are fast are not the ones that compute quickly. They are the ones that have identified where waiting lives in their stack, eliminated the waits that do not belong on the critical path, and accepted the waits that are the price of correctness.

They apply the Weakest Link Principle before optimising anything. They use fan-out deliberately, knowing it compounds tail risk. They treat the cliff not as a failure mode but as a design constraint to build away from. They relax consistency where the business allows. They make non-critical work async. They measure p99. They stay below the cliff. And they accept latency everywhere else in exchange for simplicity, correctness, and lower cost.

Low latency is not achieved by optimisation. It is achieved by design. A system is fast not when it computes quickly, but when it waits rarely. Every millisecond in your system is a place where something waited. Remove the wait, remove the millisecond.


About N Sharma

Lead Architect at StackAndSystem

N Sharma is a technologist with over 28 years of experience in software engineering, system architecture, and technology consulting. He holds a Bachelor’s degree in Engineering, a DBF, and an MBA. His work focuses on research-driven technology education—explaining software architecture, system design, and development practices through structured tutorials designed to help engineers build reliable, scalable systems.

Disclaimer

This article is for educational purposes only. Assistance from AI-powered generative tools was taken to format and improve language flow. While we strive for accuracy, this content may contain errors or omissions and should be independently verified.
