
Overengineering Local DNS: High-Performance DNS Chain

Stanislav Cherkasov

DNS is one of those services you only notice when it breaks. In a serious homelab, I want more than “it works”:

  • Network-wide filtering (ads/trackers/malware) without touching every device
  • Split-horizon / authoritative zones for internal services
  • Fast resolution under load (low latency + high QPS)
  • Autonomy when upstreams or the WAN get flaky
  • Security controls (encrypted upstreams + DNSSEC validation)
  • Repeatability: the whole thing is deployed, validated, and re-deployed via Ansible

So I built a DNS Chain. Overengineered on purpose.


What this post matches

This post reflects my current Ansible role and host layout:

  • dnsdist listens on :53 and load-balances into a backend pool
  • pihole runs as N containers on 127.0.0.1:9991–999N
    • in my lab: N = (CPU cores − 1), which currently equals 7
  • bind9 listens on 127.0.0.1:1053
  • unbound listens on 127.0.0.1:2054
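
Because each layer listens on its own port, every hop can be queried directly when something misbehaves. These are just example queries against the layout above:

# Walk the chain hop by hop
dig @127.0.0.1 -p 53   example.com +short   # dnsdist (front door)
dig @127.0.0.1 -p 9991 example.com +short   # first Pi-hole instance
dig @127.0.0.1 -p 1053 example.com +short   # Bind9
dig @127.0.0.1 -p 2054 example.com +short   # Unbound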

Architecture

High-level flow

flowchart LR
  C["Clients (LAN/VPN)"] -->|"UDP/TCP 53"| D["dnsdist :53
LB + health checks + packet cache"] D --> P["Pi-hole pool xN
127.0.0.1:9991-999N
blocking + local cache"] P --> B["Bind9
127.0.0.1:1053
authoritative zones / split-horizon"] B --> U["Unbound
127.0.0.1:2054
recursive cache + DoT + DNSSEC validation*"] U -->|"TLS 853"| Up[("Cloudflare / Quad9 / Google")]

* DNSSEC validation is intended to happen in Unbound (more below). I also include a concrete test so I can prove it’s actually enabled.


Request flow (the big “back-and-forth” diagram)

This is the single diagram I use when I’m debugging. If you can mentally simulate this flow, you can usually pinpoint where things went wrong in under a minute.

sequenceDiagram
  participant C as Client
  participant D as dnsdist :53
  participant P as Pi-hole (pool)
  participant B as Bind9 :1053
  participant U as Unbound :2054
  participant O as Upstream DoT :853

  C->>D: Query A/AAAA
  alt dnsdist packet-cache HIT
    D-->>C: Answer (cache)
  else MISS
    D->>P: forward (LB + health check)
    alt Blocked by Pi-hole policy
      P-->>D: Blocking answer (NXDOMAIN/0.0.0.0)
      D-->>C: blocked
    else Allowed
      P->>B: forward
      alt Internal zone hit
        B-->>P: authoritative answer
      else External domain
        B->>U: forward
        U->>O: DoT + (DNSSEC validate)
        O-->>U: response
        U-->>B: response
        B-->>P: response
      end
      P-->>D: response
      D-->>C: response
    end
  end

Why this exact order

1) dnsdist: a “front door” that stays fast under load

dnsdist earns its place by doing three things well:

  • Load balancing across multiple Pi-hole backends
  • Health checks (unhealthy backends are automatically avoided)
  • Packet cache for the hottest queries (answer without touching downstream layers)

It also keeps client configuration simple: clients always use one DNS IP (this host) on port 53, no matter what the backend pool is doing.
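
A minimal dnsdist.conf sketch of this front door could look like the following; the networks, pool size, and cache numbers are illustrative rather than my exact production values:

-- LAN/VPN-only ACL, Pi-hole backend pool, packet cache
setACL({"127.0.0.0/8", "10.0.0.0/8", "192.168.0.0/16"})
setLocal("0.0.0.0:53")

-- one backend per Pi-hole instance on 127.0.0.1:9991..999N
for i = 1, 7 do
  newServer({address = "127.0.0.1:" .. (9990 + i), checkName = "example.com."})
end

-- answer the hottest repeat queries straight from the packet cache
pc = newPacketCache(100000, {maxTTL = 86400})
getPool(""):setCache(pc)

-- prefer the backend with the fewest outstanding queries
setServerPolicy(leastOutstanding)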


2) Pi-hole xN: filtering at the edge, scaled horizontally

Pi-hole is a convenient “policy layer” for the whole network. I run multiple instances because a pool provides:

  • Isolation: one container restart does not nuke the whole service
  • Throughput headroom: load spreads across instances
  • Operational flexibility: different lists/behavior can be tested on a subset (if desired)

Implementation detail: containers bind to distinct loopback ports (127.0.0.1:9991–999N), and dnsdist distributes traffic.
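
A sketch of that binding pattern as an Ansible task is below; pihole_instances and the image tag are placeholders, and the volume/upstream wiring that points each instance at Bind9 is omitted:

- name: run pi-hole instances on distinct loopback ports
  community.docker.docker_container:
    name: "pihole-{{ item }}"
    image: "pihole/pihole:latest"
    state: started
    restart_policy: unless-stopped
    published_ports:
      - "127.0.0.1:999{{ item }}:53/udp"
      - "127.0.0.1:999{{ item }}:53/tcp"
  loop: "{{ range(1, pihole_instances + 1) | list }}"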


3) Bind9: authoritative split-horizon zones

Bind9 is where my internal universe lives:

  • authoritative zones (e.g., lab.internal)
  • internal records for services (git.lab.internal, wiki.lab.internal, etc.)
  • optional split-horizon logic (internal view vs external)

If Bind9 can answer from an authoritative zone, it replies immediately. Otherwise, it forwards queries for “the internet” further down the chain.
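
The shape of that in named.conf is roughly the following (zone names and paths are examples):

// named.conf excerpt (sketch)
options {
    listen-on port 1053 { 127.0.0.1; };
    recursion yes;
    forward only;
    forwarders { 127.0.0.1 port 2054; };   // non-authoritative queries go to Unbound
    dnssec-validation no;                  // validation is Unbound's job in this chain
};

zone "lab.internal" IN {
    type master;
    file "/etc/bind/zones/db.lab.internal";
};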


4) Unbound: recursive engine, cache, and upstream security

Unbound is my “last hop” for external domains:

  • large recursive cache (and aggressive performance tuning)
  • DoT (DNS-over-TLS) to upstream providers
  • resilience features like serve-expired (use cached records during upstream turbulence)
  • a sensible place to enforce DNSSEC validation in one component
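
An unbound.conf excerpt in that spirit; thread counts, cache sizes, and upstream picks are placeholders rather than my exact tuning:

server:
    interface: 127.0.0.1
    port: 2054
    num-threads: 4                      # scaled to CPU count
    msg-cache-size: 256m
    rrset-cache-size: 512m
    prefetch: yes
    serve-expired: yes
    auto-trust-anchor-file: "/var/lib/unbound/root.key"       # DNSSEC validation
    tls-cert-bundle: "/etc/ssl/certs/ca-certificates.crt"     # verify DoT upstreams

forward-zone:
    name: "."
    forward-tls-upstream: yes
    forward-addr: 1.1.1.1@853#cloudflare-dns.com
    forward-addr: 9.9.9.9@853#dns.quad9.net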

Caching strategy: layered on purpose

Yes, there is caching at multiple layers. That is intentional.

  • dnsdist packet cache: fastest possible responses for repeat queries
  • Pi-hole cache: local caching close to the policy decision (block/allow)
  • Bind9: instant answers for internal authoritative zones + cache for forwarded lookups
  • Unbound: the heavy recursive cache + prefetch + serve-expired

The practical result: most “normal browsing” queries become very low latency once the caches are warm, and the system stays stable under bursts.
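
An easy way to watch the layering pay off: run the same query twice against dnsdist and compare the reported query time.

dig @127.0.0.1 example.com | grep "Query time"   # cold: upstream latency
dig @127.0.0.1 example.com | grep "Query time"   # warm: typically 0-1 ms from a cache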


Autonomy: when upstreams fail

Homelabs are where networking experiments happen: firewall restarts, VPN changes, routing updates, ISP hiccups.

Unbound can be tuned to keep things usable via serve-expired and prefetching. The goal isn’t perfection; it’s graceful degradation: internal services keep resolving, and external browsing is less likely to collapse immediately.
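
Concretely, that means Unbound's prefetch and serve-expired family; the values below are illustrative and depend on how much staleness you tolerate:

server:
    prefetch: yes                        # refresh popular records before they expire
    serve-expired: yes
    serve-expired-ttl: 86400             # serve stale data for up to a day if upstreams are down
    serve-expired-client-timeout: 1800   # ms to wait for a fresh answer before falling back to stale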


DNSSEC: security goal, concrete test

If I say “DNSSEC”, I want it to be verifiable.

Where it should happen: in Unbound (single enforcement point).

How I prove it: I test a known-bad DNSSEC domain. If validation is on, the resolver should return SERVFAIL.

# Should resolve (often signed)
dig @127.0.0.1 -p 2054 cloudflare.com +dnssec

# Should SERVFAIL when DNSSEC validation is actually enabled
dig @127.0.0.1 -p 2054 dnssec-failed.org +dnssec

If this does not SERVFAIL, DNSSEC validation is not actually being enforced (or you are not testing the right resolver/port).


Security warning: do not become an open resolver

Two rules I consider non-negotiable:

  1. Restrict who can query you. Enforce LAN/VPN-only access with firewall rules and/or dnsdist ACLs.
  2. Never expose this to the public internet. A publicly reachable recursive resolver will be abused.

I treat dnsdist ACLs and host firewall policy as part of “the design”, not an afterthought.
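
As an illustration of rule 1, the host firewall can limit port 53 to known networks; a ufw-flavored sketch with example ranges (the dnsdist side is the setACL() line shown earlier):

# allow DNS only from LAN and VPN ranges, refuse everything else
ufw allow from 192.168.1.0/24 to any port 53
ufw allow from 10.8.0.0/24 to any port 53
ufw deny 53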


Performance tuning (kernel + service knobs)

I tune the host because high-QPS DNS is mostly “fast UDP + lots of sockets”, and defaults are designed for general-purpose servers.

Example sysctl groups from my role:

  • socket buffers (UDP/TCP)
  • backlog limits
  • TCP reuse/timeouts (important for TLS upstreams)

The Ansible task that applies them:

- name: tune sysctl for dns
  ansible.posix.sysctl:
    sysctl_file: /etc/sysctl.d/9999-ansible-dns.conf
    name:       "{{ item.name }}"
    value:      "{{ item.value }}"
    sysctl_set: yes
  loop:
    - { name: "net.core.rmem_max", value: "4194304" }
    - { name: "net.core.wmem_max", value: "4194304" }
    - { name: "net.core.somaxconn", value: "65535" }
    - { name: "net.ipv4.tcp_tw_reuse", value: "1" }

On the service side:

  • Bind9 runs with multiple worker threads
  • Unbound scales num-threads by CPU
  • dnsdist is configured for caching and backend distribution
  • Pi-hole instances are isolated and can be pinned with cpusets

Ansible as a contract: deploy and verify

I do not trust a DNS deploy that does not validate itself.

My role runs sanity checks for the configs and then performs live resolution tests with retries. If resolution fails, the role fails immediately.
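
For the config side, each daemon ships its own validator; a minimal sketch of how those checks can look:

- name: validate resolver configs before reloading anything
  ansible.builtin.command: "{{ item }}"
  changed_when: false
  loop:
    - named-checkconf
    - unbound-checkconf
    - dnsdist --check-config

The live resolution test is then a simple retry loop: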

- name: Attempt DNS resolution
  command: "dig @{{ resolution_host }} -p {{ resolution_port }} google.com +short"
  register: result
  until: result.rc == 0
  retries: 5
  delay: 10

This turns “I think I deployed DNS” into “I can prove it works end-to-end”.


Why I keep it this way

This chain gives me:

  • Network-wide ad/tracker blocking across every device (including IoT)
  • Internal naming that feels like a real environment (authoritative zones, split-horizon)
  • Fast hot-path DNS with multiple caching layers
  • Resilience when upstreams or WAN connectivity are not perfect
  • A safe lab platform for experiments with resolvers, upstream providers, and policy
  • Automation and reproducibility: rebuildable from scratch using Ansible

Yes, it’s overengineering. But it’s the kind that buys me what I actually care about: autonomy, security, and speed in a homelab environment.

