The Silent Saboteur Inside Nginx | Mike CK - Electrical Engineer and Developer

Unmasking Nginx’s DNS Caching Pitfall in Dynamic Environments

As engineers, we rely on Nginx as a steadfast, high-performance pillar of our infrastructure. We configure it to proxy traffic, terminate SSL, and serve static content, trusting its legendary stability. But a subtle, default behavior in its DNS resolution can transform this trusted ally into a silent saboteur, causing mysterious 502 Bad Gateway errors that persist until Nginx is restarted or reloaded.

This is the story of that saboteur: a common configuration mistake that lies dormant in stable environments but brings down applications in the dynamic, ephemeral world of cloud computing. We will not only diagnose the problem but embark on a journey to reproduce it, uncovering a deceptive “false positive” in Docker before building a definitive simulation that proves the danger and validates the solution.

An Innocent-Looking Configuration

Consider a typical setup: Nginx running as a reverse proxy in front of an AWS Elastic Load Balancer (ELB) or a Kubernetes service. The configuration seems straightforward and correct:

# A seemingly harmless configuration
location /api/ {
    proxy_pass http://my-api-service.us-east-1.elb.amazonaws.com/;
}

This configuration works perfectly upon deployment. But hours or days later, your monitoring alerts scream with 502 errors. You restart the Nginx process, and the problem vanishes, only to return unpredictably.

The root cause is not a bug in Nginx, but a feature working as designed. When Nginx starts or reloads, it parses this configuration. Seeing a static hostname, it performs a one-time DNS lookup to resolve my-api-service... to its current IP addresses. These IPs are then “baked into” the in-memory configuration.

In a dynamic environment like AWS or Kubernetes, the IPs behind a load balancer or service name are not permanent. They change due to scaling events, deployments, or host failures. When they do, your Nginx instance has no idea. It continues to proxy requests to the old, stale IP addresses, which are now dead endpoints, resulting in a stream of 502 errors.
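
To make the mechanism concrete, here is a minimal Python sketch of the behavior described above. It is a toy model, not Nginx code: the class names, the hostname key, and the 10.0.0.x addresses are all made up for illustration.

```python
class FakeDNS:
    """Stand-in for a DNS server: maps a hostname to its current IP."""

    def __init__(self, records):
        self.records = dict(records)

    def resolve(self, hostname):
        return self.records[hostname]


class StartupResolvingProxy:
    """Toy proxy that looks the upstream up once, when its config is loaded."""

    def __init__(self, dns, hostname):
        self.dns = dns
        self.hostname = hostname
        self.upstream_ip = dns.resolve(hostname)  # the one-time lookup

    def reload(self):
        # A reload re-parses the config, so the lookup happens again.
        self.upstream_ip = self.dns.resolve(self.hostname)

    def forward(self, live_ips):
        # 200 if the remembered IP still has a backend behind it, else 502.
        return 200 if self.upstream_ip in live_ips else 502


dns = FakeDNS({"my-api-service": "10.0.0.5"})  # illustrative address
proxy = StartupResolvingProxy(dns, "my-api-service")
status_before = proxy.forward(live_ips={"10.0.0.5"})

# The service moves to a new IP. DNS is updated, the proxy is not.
dns.records["my-api-service"] = "10.0.0.9"
status_after_change = proxy.forward(live_ips={"10.0.0.9"})

# Only a reload makes the proxy look the name up again.
proxy.reload()
status_after_reload = proxy.forward(live_ips={"10.0.0.9"})

print(status_before, status_after_change, status_after_reload)  # 200 502 200
```

The middle value is the interesting one: DNS already has the right answer, but nothing in the request path ever asks for it.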

The Quest for Reproduction

To truly understand this failure mode, we must reproduce it locally. Our tool of choice is Docker Compose, allowing us to simulate the Nginx proxy and a backend service whose IP can be changed on demand.

Our plan:

1. Define an Nginx proxy and two backend services, backend_v1 and backend_v2.
2. Give both backends the same network alias, my-backend-service.
3. Start Nginx proxying to backend_v1.
4. Stop backend_v1 and start backend_v2, simulating an IP change.
5. Observe Nginx fail to connect to the new service.

First Attempt

Our initial, simple docker-compose.yml looked something like this:

services:
  nginx_proxy:
    # ...
  backend_v1:
    # ...
    networks:
      my_app_net:
        aliases:
          - my-backend-service
  backend_v2:
    # ...
    networks:
      my_app_net:
        aliases:
          - my-backend-service
# ...

We ran the test:

docker compose up -d nginx_proxy backend_v1
curl http://localhost:8080 -> Success, got response from V1.
docker compose stop backend_v1
docker compose up -d backend_v2
curl http://localhost:8080 -> Success! We got a response from V2.

The test failed to fail. Why? We had uncovered a crucial secondary lesson: Docker’s IP address reuse. When backend_v1 was stopped, its IP was released back into the network’s pool. When backend_v2 was started moments later, Docker’s IPAM simply assigned it the first available IP, which was the exact same one backend_v1 had used. Nginx’s stale cache was accidentally correct. This lucky coincidence masks the underlying issue in simple tests but offers no protection in production, where a replacement backend rarely lands on the same address.
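
The reuse effect can be sketched with a toy address pool in Python. This assumes the "first available address" policy described above; the real allocator may behave differently across Docker versions, and the class name and the 172.18.0.0/24 subnet are hypothetical.

```python
import ipaddress


class LowestFreeAllocator:
    """Toy address pool that always hands out the lowest unused host address."""

    def __init__(self, subnet, reserved=1):
        hosts = list(ipaddress.ip_network(subnet).hosts())
        self.pool = hosts[reserved:]  # skip the first address (assumed gateway)
        self.in_use = {}  # container name -> address

    def allocate(self, name):
        taken = set(self.in_use.values())
        for address in self.pool:
            if address not in taken:
                self.in_use[name] = address
                return str(address)
        raise RuntimeError("address pool exhausted")

    def release(self, name):
        del self.in_use[name]


net = LowestFreeAllocator("172.18.0.0/24")  # hypothetical auto-assigned subnet
nginx_ip = net.allocate("nginx_proxy")
v1_ip = net.allocate("backend_v1")
net.release("backend_v1")           # models: docker compose stop backend_v1
v2_ip = net.allocate("backend_v2")  # models: docker compose up -d backend_v2

print(v1_ip, v2_ip)  # 172.18.0.3 172.18.0.3
```

Because backend_v2 inherits the address backend_v1 just gave up, a proxy holding the old IP keeps working by accident.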

A Deterministic Failure

To create a reliable test, we must guarantee the IP address changes. The solution is to use static IPs within our Docker Compose setup. By defining a subnet and assigning a different, predictable IP to each backend version, we remove all ambiguity.

Here is our final, robust docker-compose.yml:

# ngix/docker-compose.yml
services:
  nginx_proxy:
    image: nginx:1.23-alpine
    container_name: nginx_proxy
    ports:
      - "8080:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    networks:
      - my_app_net

  backend_v1:
    image: python:3.9-alpine
    container_name: backend_v1
    command: >
      sh -c "echo '<h1>Response from Backend V1 at 172.20.0.10</h1>' > index.html && python -m http.server 8000"
    networks:
      my_app_net:
        aliases:
          - my-backend-service
        ipv4_address: 172.20.0.10

  backend_v2:
    image: python:3.9-alpine
    container_name: backend_v2
    command: >
      sh -c "echo '<h1>Response from Backend V2 at 172.20.0.11</h1>' > index.html && python -m http.server 8000"
    networks:
      my_app_net:
        aliases:
          - my-backend-service
        ipv4_address: 172.20.0.11

networks:
  my_app_net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/24

With this in place, we use our problematic nginx.conf:

# ngix/nginx/nginx.conf
events { worker_connections 1024; }
http {
    server {
        listen 80;
        location / {
            proxy_pass http://my-backend-service:8000;
            proxy_set_header Host $host;
        }
    }
}

Now, we run the simulation:

# Start Nginx and V1. Nginx resolves my-backend-service to 172.20.0.10.
$ docker compose up -d nginx_proxy backend_v1

# Test the connection. It works.
$ curl http://localhost:8080
<h1>Response from Backend V1 at 172.20.0.10</h1>

# Simulate the IP change. The service is now at 172.20.0.11.
$ docker compose stop backend_v1
$ docker compose up -d backend_v2

# Test again. Failure is now guaranteed.
$ curl http://localhost:8080
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.23.4</center>
</body>
</html>

Success! We have definitively reproduced the failure. Nginx is stuck sending traffic to the now-defunct .10 address, completely unaware the service lives on at .11.

The Solution: Forcing Runtime Resolution

Fixing this requires telling Nginx two things: how to resolve DNS at runtime and when to do it.

1. The resolver directive: This directive tells Nginx which DNS server to use for runtime queries. Crucially, Nginx uses its own highly performant, non-blocking resolver, so these queries won’t stall the event loop. For Docker, the internal resolver at 127.0.0.11 is perfect. We also add the valid parameter to control the cache TTL, overriding whatever the DNS server provides.
2. Using a variable: The trigger for runtime resolution is using a variable in the proxy_pass directive. When Nginx sees a variable, it knows the value could change and defers the resolution until a request is processed.
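
The effect of a fixed cache lifetime such as valid can be sketched as a small TTL cache in Python. This is a simplified model, not how Nginx implements its resolver; TTLResolver and the injected clock are illustrative names.

```python
class TTLResolver:
    """Caches each lookup for a fixed number of seconds, then asks again."""

    def __init__(self, lookup, valid_seconds, clock):
        self.lookup = lookup        # function: hostname -> IP
        self.valid = valid_seconds  # plays the role of valid=5s
        self.clock = clock          # injected so the example is deterministic
        self.cache = {}             # hostname -> (ip, expires_at)

    def resolve(self, hostname):
        now = self.clock()
        cached = self.cache.get(hostname)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh: no DNS query
        ip = self.lookup(hostname)
        self.cache[hostname] = (ip, now + self.valid)
        return ip


records = {"my-backend-service": "172.20.0.10"}
now = [0.0]  # fake clock, in seconds
resolver = TTLResolver(records.__getitem__, 5, lambda: now[0])

at_start = resolver.resolve("my-backend-service")

records["my-backend-service"] = "172.20.0.11"  # the backend moves
now[0] = 3.0
within_ttl = resolver.resolve("my-backend-service")  # served from cache

now[0] = 6.0
after_ttl = resolver.resolve("my-backend-service")   # cache expired, re-resolved

print(at_start, within_ttl, after_ttl)
# 172.20.0.10 172.20.0.10 172.20.0.11
```

The stale window is bounded by the cache lifetime, which is why a short value suits environments where addresses change often.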

Here is the corrected nginx.good.conf:

# ngix/nginx/nginx.good.conf
events { worker_connections 1024; }
http {
    server {
        listen 80;

        # 1. Define the resolver and a short cache lifetime.
        resolver 127.0.0.11 valid=5s;
        
        # 2. Store the upstream in a variable.
        set $backend "my-backend-service:8000";

        location / {
            # 3. Use the variable to force runtime resolution.
            proxy_pass http://$backend;
            proxy_set_header Host $host;
        }
    }
}

With this configuration mounted in place of nginx.conf and Nginx reloaded (docker compose exec nginx_proxy nginx -s reload), we run our test one last time, starting again from backend_v1. When we switch from backend_v1 to backend_v2 and wait for the valid=5s cache to expire, our next curl request succeeds. Nginx automatically re-resolves the hostname, discovers the new IP (172.20.0.11), and seamlessly directs traffic to the correct backend.
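
To watch the switchover rather than guess when the cache has expired, a small polling helper can stand in for repeated curl calls. This is a sketch using only the Python standard library; probe and watch are hypothetical helper names, and the URL in the usage note assumes the port mapping from the compose file.

```python
import time
import urllib.error
import urllib.request

# Talk to the target directly, ignoring any proxy environment variables.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=2.0):
    """Return (status_code, first line of body) for one GET request."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            text = resp.read().decode("utf-8", errors="replace").strip()
            first_line = text.splitlines()[0] if text else ""
            return resp.status, first_line
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (such as 502) are raised as exceptions.
        return err.code, ""
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, etc.: there is no HTTP status at all.
        return None, ""


def watch(url, attempts=10, interval=1.0):
    """Probe the URL repeatedly, printing and collecting each status code."""
    statuses = []
    for _ in range(attempts):
        status, line = probe(url)
        print(status, line)
        statuses.append(status)
        time.sleep(interval)
    return statuses
```

Calling watch("http://localhost:8080", attempts=15) in one terminal while switching backends in another should show the moment the responses change, assuming the stack from the compose file is running.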

The silent saboteur has been neutralized.

Stay Safe

The default behavior of Nginx is optimized for static, predictable environments. In the modern cloud, where infrastructure is fluid, this default becomes a liability. The key takeaway is simple but absolute: if your proxy_pass directive points to a hostname that can change its IP address, you must use the resolver directive in combination with a variable.
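
One way to apply this takeaway across existing configs is a rough scan for proxy_pass directives that name a literal hostname. The Python sketch below is a line-based heuristic, not an Nginx config parser: it only sees directives that start their own line, it skips names declared in upstream blocks, and it will also flag names such as localhost. The function name is hypothetical.

```python
import ipaddress
import re

PROXY_PASS = re.compile(r"^\s*proxy_pass\s+https?://([^;\s]+)\s*;", re.MULTILINE)
UPSTREAM = re.compile(r"^\s*upstream\s+([^\s{]+)", re.MULTILINE)


def _host_of(target):
    """Extract the host part from 'host[:port][/path]' or '[v6]:port'."""
    authority = target.split("/", 1)[0]
    if authority.startswith("["):
        return authority[1:].split("]", 1)[0]
    return authority.split(":", 1)[0]


def _is_ip(host):
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def static_hostname_targets(config_text):
    """Return proxy_pass hosts that are literal hostnames (no variable, not an IP)."""
    upstream_names = set(UPSTREAM.findall(config_text))
    flagged = []
    for target in PROXY_PASS.findall(config_text):
        if "$" in target or target.startswith("unix:"):
            continue  # variables resolve per request; sockets need no DNS
        host = _host_of(target)
        if _is_ip(host) or host in upstream_names:
            continue
        flagged.append(host)
    return flagged


BAD = """
location / {
    proxy_pass http://my-backend-service:8000;
}
"""
GOOD = """
set $backend "my-backend-service:8000";
location / {
    proxy_pass http://$backend;
}
"""
print(static_hostname_targets(BAD))   # ['my-backend-service']
print(static_hostname_targets(GOOD))  # []
```

A flagged name is a prompt to check whether its IP can change, not proof of a problem.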

By understanding the mechanics and learning how to reproduce the failure deterministically, we can build resilient systems and turn a potential production outage into a solved problem.

You might also like this old but golden resource: the nginx proxy pitfalls repository on GitHub.