Unmasking Nginx’s DNS Caching Pitfall in Dynamic Environments
As engineers, we rely on Nginx as a steadfast, high-performance pillar of our infrastructure. We configure it to proxy traffic, terminate SSL, and serve static content, trusting its legendary stability. But a subtle, default behavior in its DNS resolution can transform this trusted ally into a silent saboteur, causing mysterious 502 Bad Gateway errors that only a full restart can fix.
This is the story of that saboteur: a common configuration mistake that lies dormant in stable environments but brings down applications in the dynamic, ephemeral world of cloud computing. We will not only diagnose the problem but embark on a journey to reproduce it, uncovering a deceptive “false positive” in Docker before building a definitive simulation that proves the danger and validates the solution.
An Innocent-Looking Configuration
Consider a typical setup: Nginx running as a reverse proxy in front of an AWS Network Load Balancer (NLB) or a Kubernetes service. The configuration seems straightforward and correct:
# A seemingly harmless configuration
location /api/ {
    proxy_pass http://my-api-service.us-east-1.elb.amazonaws.com/;
}
This configuration works perfectly upon deployment. But hours or days later, your monitoring alerts scream with 502 errors. You restart the Nginx process, and the problem vanishes, only to return unpredictably.
The root cause is not a bug in Nginx, but a feature working as designed. When Nginx starts or reloads, it parses this configuration. Seeing a static hostname, it performs a one-time DNS lookup to resolve my-api-service... to its current IP addresses. These IPs are then “baked into” the in-memory configuration.
In a dynamic environment like AWS or Kubernetes, the IPs behind a load balancer or service name are not permanent. They change due to scaling events, deployments, or host failures. When they do, your Nginx instance has no idea. It continues to proxy requests to the old, stale IP addresses, which are now dead endpoints, resulting in a stream of 502 errors.
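To make the mechanism concrete, here is a minimal toy model of resolve-once behavior. It is a sketch, not Nginx code: the dict standing in for DNS, the StartupResolvingProxy class, the hostname key, and the 10.0.x.x addresses are all hypothetical names chosen for illustration.

```python
# Toy model only: a dict stands in for DNS, and StartupResolvingProxy is a
# hypothetical class, not part of Nginx.

class StartupResolvingProxy:
    """Resolves its upstream hostname once, when the config is (re)loaded."""

    def __init__(self, dns, hostname):
        self._dns = dns
        self._hostname = hostname
        self.upstream_ip = None
        self.reload()

    def reload(self):
        # The only place a lookup ever happens, mirroring start/reload.
        self.upstream_ip = self._dns[self._hostname]

    def handle_request(self, live_ips):
        # A request succeeds only if the baked-in IP is still serving.
        return 200 if self.upstream_ip in live_ips else 502


dns = {"my-api-service": "10.0.1.15"}
proxy = StartupResolvingProxy(dns, "my-api-service")
before = proxy.handle_request(live_ips={"10.0.1.15"})

# A scaling event replaces the backend: DNS is updated, the old IP goes dark.
dns["my-api-service"] = "10.0.2.40"
after_change = proxy.handle_request(live_ips={"10.0.2.40"})

# Only a reload (or restart) triggers a fresh lookup.
proxy.reload()
after_reload = proxy.handle_request(live_ips={"10.0.2.40"})

print(before, after_change, after_reload)  # 200 502 200
```

The DNS record is correct the whole time; the proxy simply never asks again until it is reloaded, which is why a restart appears to "fix" the outage.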
The Quest for Reproduction
To truly understand this failure mode, we must reproduce it locally. Our tool of choice is Docker Compose, allowing us to simulate the Nginx proxy and a backend service whose IP can be changed on demand.
Our plan:
- Define an Nginx proxy and two backend services, backend_v1 and backend_v2.
- Give both backends the same network alias, my-backend-service.
- Start Nginx proxying to backend_v1.
- Stop backend_v1 and start backend_v2, simulating an IP change.
- Observe Nginx fail to connect to the new service.
First Attempt
Our initial, simple docker-compose.yml looked something like this:
services:
  nginx_proxy:
    # ...
  backend_v1:
    # ...
    networks:
      my_app_net:
        aliases:
          - my-backend-service
  backend_v2:
    # ...
    networks:
      my_app_net:
        aliases:
          - my-backend-service
# ...
We ran the test:
- docker compose up -d nginx_proxy backend_v1
- curl http://localhost:8080 -> Success, got response from V1.
- docker compose stop backend_v1
- docker compose up -d backend_v2
- curl http://localhost:8080 -> Success! We got a response from V2.
The test failed to fail. Why? We had uncovered a crucial secondary lesson: Docker’s IP address reuse. When backend_v1 was stopped, its IP was released back into the network’s pool. When backend_v2 was started moments later, Docker’s IPAM simply assigned it the first available IP, which was the exact same one backend_v1 had used. Nginx’s stale cache was accidentally correct. This lucky coincidence masks the underlying issue in simple tests but offers no protection in the wild.
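The reuse is easy to see in a toy allocator. This is a simplified sketch of a "lowest free address" policy matching the behavior we observed, not Docker's actual IPAM code. The FirstFreeIpPool class and the 172.18.0.0/24 subnet are hypothetical, and we assume the first host address is taken by the gateway.

```python
import ipaddress


class FirstFreeIpPool:
    """Hands out the lowest unused host address in a subnet (toy model)."""

    def __init__(self, cidr, reserved=1):
        hosts = list(ipaddress.ip_network(cidr).hosts())
        # Skip the first `reserved` addresses (assumed to be the gateway).
        self._candidates = hosts[reserved:]
        self._in_use = set()

    def allocate(self):
        for ip in self._candidates:
            if ip not in self._in_use:
                self._in_use.add(ip)
                return str(ip)
        raise RuntimeError("address pool exhausted")

    def release(self, ip):
        self._in_use.discard(ipaddress.ip_address(ip))


pool = FirstFreeIpPool("172.18.0.0/24")
nginx_ip = pool.allocate()   # nginx_proxy starts first
v1_ip = pool.allocate()      # backend_v1 starts
pool.release(v1_ip)          # docker compose stop backend_v1
v2_ip = pool.allocate()      # docker compose up -d backend_v2

print(nginx_ip, v1_ip, v2_ip)  # 172.18.0.2 172.18.0.3 172.18.0.3
```

With nothing else claiming an address in between, the replacement container inherits the address that was just freed, so a stale lookup still happens to point at a live backend.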
A Deterministic Failure
To create a reliable test, we must guarantee the IP address changes. The solution is to use static IPs within our Docker Compose setup. By defining a subnet and assigning a different, predictable IP to each backend version, we remove all ambiguity.
Here is our final, robust docker-compose.yml:
# ngix/docker-compose.yml
services:
  nginx_proxy:
    image: nginx:1.23-alpine
    container_name: nginx_proxy
    ports:
      - "8080:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    networks:
      - my_app_net
  backend_v1:
    image: python:3.9-alpine
    container_name: backend_v1
    command: >
      sh -c "echo '<h1>Response from Backend V1 at 172.20.0.10</h1>' > index.html && python -m http.server 8000"
    networks:
      my_app_net:
        aliases:
          - my-backend-service
        ipv4_address: 172.20.0.10
  backend_v2:
    image: python:3.9-alpine
    container_name: backend_v2
    command: >
      sh -c "echo '<h1>Response from Backend V2 at 172.20.0.11</h1>' > index.html && python -m http.server 8000"
    networks:
      my_app_net:
        aliases:
          - my-backend-service
        ipv4_address: 172.20.0.11
networks:
  my_app_net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/24
With this in place, we use our problematic nginx.conf:
# ngix/nginx/nginx.conf
events { worker_connections 1024; }
http {
    server {
        listen 80;
        location / {
            proxy_pass http://my-backend-service:8000;
            proxy_set_header Host $host;
        }
    }
}
Now, we run the simulation:
# Start Nginx and V1. Nginx resolves my-backend-service to 172.20.0.10.
$ docker compose up -d nginx_proxy backend_v1
# Test the connection. It works.
$ curl http://localhost:8080
<h1>Response from Backend V1 at 172.20.0.10</h1>
# Simulate the IP change. The service is now at 172.20.0.11.
$ docker compose stop backend_v1
$ docker compose up -d backend_v2
# Test again. Failure is now guaranteed.
$ curl http://localhost:8080
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.23.4</center>
</body>
</html>
Success! We have definitively reproduced the failure. Nginx is stuck sending traffic to the now-defunct .10 address, completely unaware the service lives on at .11.
The Solution: Forcing Runtime Resolution
Fixing this requires telling Nginx two things: how to resolve DNS at runtime and when to do it.
- The resolver Directive: This directive tells Nginx which DNS server to use for runtime queries. Crucially, Nginx uses its own highly performant, non-blocking resolver, so these queries won’t stall the event loop. For Docker, the internal resolver at 127.0.0.11 is perfect. We also add the valid parameter to control the cache TTL, overriding whatever the DNS server provides.
- Using a Variable: The trigger for runtime resolution is using a variable in the proxy_pass directive. When Nginx sees a variable, it knows the value could change and defers the resolution until a request is processed.
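The effect of a bounded cache lifetime can be sketched with a small TTL cache. This is a toy model under stated assumptions, not Nginx's resolver: TtlResolver is a hypothetical class, a dict stands in for the DNS server, and a fake clock replaces real time so the example runs instantly.

```python
class TtlResolver:
    """Caches each lookup for `valid` seconds, then asks DNS again."""

    def __init__(self, lookup, valid, clock):
        self._lookup = lookup   # callable: hostname -> IP
        self._valid = valid     # cache lifetime in seconds
        self._clock = clock     # callable returning the current time
        self._cache = {}        # hostname -> (ip, expires_at)

    def resolve(self, hostname):
        now = self._clock()
        cached = self._cache.get(hostname)
        if cached and now < cached[1]:
            return cached[0]
        # Entry missing or expired: perform a fresh lookup and cache it.
        ip = self._lookup(hostname)
        self._cache[hostname] = (ip, now + self._valid)
        return ip


dns = {"my-backend-service": "172.20.0.10"}
now = [0.0]  # fake clock, advanced by hand below
resolver = TtlResolver(dns.__getitem__, valid=5, clock=lambda: now[0])

first = resolver.resolve("my-backend-service")         # lookup at t=0
dns["my-backend-service"] = "172.20.0.11"              # backend_v2 takes over
now[0] = 3.0
still_cached = resolver.resolve("my-backend-service")  # inside the 5s window
now[0] = 6.0
refreshed = resolver.resolve("my-backend-service")     # window expired

print(first, still_cached, refreshed)
# 172.20.0.10 172.20.0.10 172.20.0.11
```

The stale answer survives only until the cache entry expires, so the worst-case window of bad routing is bounded by the chosen lifetime rather than by the next reload.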
Here is the corrected nginx.good.conf:
# ngix/nginx/nginx.good.conf
events { worker_connections 1024; }
http {
    server {
        listen 80;
        # 1. Define the resolver and a short cache lifetime.
        resolver 127.0.0.11 valid=5s;
        # 2. Store the upstream in a variable.
        set $backend "my-backend-service:8000";
        location / {
            # 3. Use the variable to force runtime resolution.
            proxy_pass http://$backend;
            proxy_set_header Host $host;
        }
    }
}
After swapping this configuration in for the mounted nginx.conf and reloading Nginx (docker compose exec nginx_proxy nginx -s reload), we run our test one last time. When we switch from backend_v1 to backend_v2 and wait for the valid=5s cache to expire, our next curl request succeeds. Nginx automatically re-resolves the hostname, discovers the new IP (172.20.0.11), and seamlessly directs traffic to the correct backend.
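To watch the recovery instead of timing curl by hand, a small polling helper can record each status until the proxy answers 200 again. This is a minimal sketch: fetch_status and wait_for_recovery are hypothetical helper names, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # e.g. 502 returned by the proxy
    except (urllib.error.URLError, OSError):
        return None       # connection refused, timeout, and similar


def wait_for_recovery(url, attempts=10, delay=1.0,
                      fetch=fetch_status, sleep=time.sleep):
    """Poll url until it returns 200; return the list of observed statuses."""
    seen = []
    for _ in range(attempts):
        status = fetch(url)
        seen.append(status)
        if status == 200:
            break
        sleep(delay)
    return seen
```

Calling wait_for_recovery("http://localhost:8080") right after starting backend_v2 returns the sequence of statuses seen. With the fixed configuration we would expect any 502 entries to stop once the cached lookup expires; with the original configuration the list should never reach 200.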
The silent saboteur has been neutralized.
Stay Safe
The default behavior of Nginx is optimized for static, predictable environments. In the modern cloud, where infrastructure is fluid, this default becomes a liability. The key takeaway is simple but absolute: if your proxy_pass directive points to a hostname that can change its IP address, you must use the resolver directive in combination with a variable.
By understanding the mechanics and learning how to reproduce the failure deterministically, we can build resilient systems and turn a potential production outage into a solved problem.
You might also like this old but golden resource: the nginx proxy pitfalls repository on GitHub.

