
DNS Strategies: external-dns, NodeLocal, CoreDNS Tuning

"Around 30% of production incidents are DNS. The 'It's always DNS' meme is realistic: wrong TTLs, slow resolution, cached NXDOMAINs. Saying DNS 'works' really means nobody is monitoring it."

This guide covers concrete ways to run DNS in a K8s environment (external-dns, CoreDNS, NodeLocal DNSCache, split-horizon) at production grade.


🎯 K8s DNS Architecture

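[Pod] → [/etc/resolv.conf]
        nameserver: <KUBE_DNS_SERVICE_IP>   (ClusterIP of the kube-dns Service)
        search <NS>.svc.cluster.local svc.cluster.local cluster.local
        options ndots:5
            ↓
        [CoreDNS (kube-system)]
            ├── cluster.local domains → in-cluster (kubernetes plugin)
            └── External domains → upstream (cloud / public DNS)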
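Pods with the default ClusterFirst DNS policy get a resolv.conf whose search list holds the namespace, svc and cluster domains plus options ndots:5, which is why a single lookup of an external name can fan out into several queries. The sketch below is a simplified, glibc-style model of the order in which a stub resolver tries candidate names; candidate_queries and the namespace payments are hypothetical, and real resolvers (glibc, musl) differ in edge cases.

```python
"""Simplified model of stub-resolver search-list expansion (glibc-style).

Assumption: a pod in the hypothetical namespace "payments" with the defaults
  search payments.svc.cluster.local svc.cluster.local cluster.local
  options ndots:5
"""


def candidate_queries(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the fully qualified names in the order a resolver would try them."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: the search list is skipped.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Fewer dots than ndots: walk the search list first, the name as-is last.
    return expanded + [absolute]


if __name__ == "__main__":
    search = ["payments.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    for q in candidate_queries("api.example.com", search):
        print(q)
```

For api.example.com (two dots, fewer than five) the three search-list candidates come first, so each record type costs three NXDOMAIN round-trips before the real answer. Writing the name with a trailing dot, or lowering ndots via the pod's dnsConfig.options, avoids that.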

🔧 1️⃣ CoreDNS Tuning

Default Corefile (as shipped by kubeadm; kube-system/coredns ConfigMap)

.:53 {
    errors
    health {
       lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
       ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
       max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}

Tuning notes

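cache 30                   → caps cached answers at 30s, for both positive and negative (NXDOMAIN/NODATA) responses
prefer_udp                 → (forward) query upstreams over UDP even if the client came in over TCP; retry over TCP when the answer is truncated
forward . 1.1.1.1 8.8.8.8  → multiple upstream resolvers (default selection policy: random)
ttl 5                      → (kubernetes plugin) short TTL on cluster records → changes propagate faster (5s is the plugin default; kubeadm ships 30)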
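prefer_udp hinges on the TC (truncated) flag in the DNS header: a UDP answer too large for the datagram comes back with TC set, and the forwarder re-asks over TCP. A minimal sketch of that check using only the fixed 12-byte header layout from RFC 1035; parse_header and needs_tcp_retry are hypothetical helpers, not CoreDNS code.

```python
import struct

# Common response codes (RFC 1035 section 4.1.1).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def parse_header(msg: bytes) -> dict:
    """Decode the 12-byte DNS header: ID, flags and the four section counts."""
    if len(msg) < 12:
        raise ValueError("DNS message shorter than its 12-byte header")
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", msg[:12])
    rcode = flags & 0x000F
    return {
        "id": ident,
        "qr": bool(flags & 0x8000),   # set on responses
        "tc": bool(flags & 0x0200),   # truncated: the answer did not fit
        "rcode": RCODES.get(rcode, str(rcode)),
        "answers": ancount,
    }


def needs_tcp_retry(udp_response: bytes) -> bool:
    """A truncated UDP response has to be re-asked over TCP."""
    return parse_header(udp_response)["tc"]


if __name__ == "__main__":
    # Hand-built header: QR=1, TC=1, RD=1, RA=1, RCODE=0, one question, no answers.
    truncated = struct.pack("!6H", 0x1234, 0x8380, 1, 0, 0, 0)
    print(parse_header(truncated), needs_tcp_retry(truncated))
```

The same RCODE field is what CoreDNS exposes as the rcode label on its response metrics.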

Production recommendations

.:53 {
    errors
    health {
       lameduck 10s
    }
    ready
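    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
       ttl 5     # short TTL: Service/endpoint IP changes (e.g. headless Service pod IPs) propagate faster
    }
    prometheus :9153
    # Public upstreams bypass the node/VPC resolver, so private zones
    # (e.g. Route53 private hosted zones) will no longer resolve.
    forward . 1.1.1.1 8.8.8.8 {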
       max_concurrent 1000
       prefer_udp
       expire 10s
    }
    cache 30 {
       success 9984
       denial 9984
       prefetch 10 60s 10%
    }
    loop
    reload
    loadbalance
}
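The success/denial split and prefetch 10 60s 10% are easier to reason about with a toy model. ToyDNSCache below is a hypothetical, simplified sketch of the documented semantics as read here: each answer's TTL is capped at a per-type maximum, and an entry counts as popular after N hits with no gap of the given duration between them, which triggers a refresh once less than the given percentage of its TTL remains. It is not CoreDNS's implementation; capacity limits and eviction are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    positive: bool      # True = answer found, False = NXDOMAIN/NODATA
    ttl: float          # TTL granted when stored (already capped)
    stored_at: float
    hits: int = 0
    last_hit: float = 0.0


class ToyDNSCache:
    """Simplified model of cache { success / denial / prefetch }; not CoreDNS code."""

    def __init__(self, success_max: float = 30.0, denial_max: float = 30.0,
                 prefetch_amount: int = 10, prefetch_gap: float = 60.0,
                 prefetch_pct: float = 10.0):
        self.success_max = success_max
        self.denial_max = denial_max
        self.prefetch_amount = prefetch_amount
        self.prefetch_gap = prefetch_gap
        self.prefetch_pct = prefetch_pct
        self.entries: dict[str, Entry] = {}

    def store(self, name: str, positive: bool, record_ttl: float, now: float) -> float:
        """Cache an answer; its TTL is capped by the positive or negative maximum."""
        cap = self.success_max if positive else self.denial_max
        ttl = min(record_ttl, cap)
        self.entries[name] = Entry(positive, ttl, now)
        return ttl

    def lookup(self, name: str, now: float) -> Optional[tuple[bool, float, bool]]:
        """Return (positive, remaining_ttl, should_prefetch), or None on a miss."""
        entry = self.entries.get(name)
        if entry is None:
            return None
        remaining = entry.ttl - (now - entry.stored_at)
        if remaining <= 0:
            del self.entries[name]          # expired: the caller must go upstream
            return None
        # "Popular" = prefetch_amount hits with no gap >= prefetch_gap between them.
        if entry.hits and now - entry.last_hit >= self.prefetch_gap:
            entry.hits = 0
        entry.hits += 1
        entry.last_hit = now
        nearly_expired = remaining <= entry.ttl * self.prefetch_pct / 100
        return entry.positive, remaining, entry.hits >= self.prefetch_amount and nearly_expired
```

With these defaults, a record published with TTL 300 is cached for at most 30s, and a name queried once per second is flagged for prefetch during its last 3 seconds, so the refresh lands before the entry expires.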

🚀 2️⃣ NodeLocal DNSCache

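Problem: every pod sends every query to CoreDNS (through the kube-dns Service ClusterIP). As the cluster grows, CoreDNS gets DDoSed by its own workloads.

Solution: NodeLocal DNSCache, a DNS cache running on every node (as a DaemonSet).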

Architecture

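[Pod] → [Node-local DNSCache: 169.254.20.10]
              ├── Cache hit → fast response to the pod
              ├── Cache miss, cluster.local / reverse zones → [CoreDNS] (over TCP)
              └── Cache miss, external names → upstream from the node's /etc/resolv.conf (default config)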

Install

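# Vanilla (kube-proxy iptables mode; IPVS mode needs different substitutions): fill in the __PILLAR__ placeholders first
curl -sLO https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath={.spec.clusterIP})
domain=cluster.local; localdns=169.254.20.10
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
kubectl create -f nodelocaldns.yaml

# Helm (community charts; not part of the upstream addon)
helm install nodelocaldns ...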

Advantages

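  • Latency: cache hits are answered on the node itself, with no network hop (e.g. a few ms down to sub-ms; actual figures depend on the cluster)
  • CoreDNS load: drops roughly in proportion to the node-level cache hit ratio (an 80% hit ratio removes up to ~80% of queries)
  • Connection storm: stops pods from DDoSing kube-dns
  • conntrack: NodeLocal's traffic skips connection tracking (NOTRACK rules) and it talks to CoreDNS over TCP, which reduces UDP conntrack pressure and the conntrack races behind intermittent 5s DNS timeouts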
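How much load NodeLocal takes off CoreDNS depends almost entirely on the node-level hit ratio. A back-of-the-envelope sketch; coredns_qps and every number in it are hypothetical inputs, not measurements.

```python
def coredns_qps(pods: int, qps_per_pod: float, node_hit_ratio: float) -> float:
    """Queries per second that still reach CoreDNS behind NodeLocal DNSCache.

    Simplification: every node-level miss is counted as a CoreDNS query, which
    overestimates, since the default config sends external names straight to
    the node's upstream resolver.
    """
    if not 0.0 <= node_hit_ratio <= 1.0:
        raise ValueError("hit ratio must be within [0, 1]")
    return pods * qps_per_pod * (1.0 - node_hit_ratio)


if __name__ == "__main__":
    # Hypothetical cluster: 2,000 pods doing 5 lookups/s each.
    before = coredns_qps(2000, 5, 0.0)
    after = coredns_qps(2000, 5, 0.8)
    print(f"before: {before:.0f} qps, after: {after:.0f} qps")
```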

🔑 In 2026 this is effectively mandatory for large clusters.


🌐 3️⃣ external-dns

Problem: a K8s Ingress / Service is created → someone adds the DNS record to Route53 / Cloudflare by hand.

Solution: the external-dns controller watches K8s resources such as Ingress and Service and creates DNS records automatically.

Install (Helm)

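helm repo add external-dns https://kubernetes-sigs.github.io/external-dns/
# policy=sync also deletes records this instance owns; the chart default (upsert-only) never deletes
helm install external-dns external-dns/external-dns \
  -n external-dns --create-namespace \
  --set provider.name=aws \
  --set 'env[0].name=AWS_DEFAULT_REGION' --set 'env[0].value=eu-west-1' \
  --set 'domainFilters[0]=<DOMAIN>' \
  --set policy=sync \
  --set 'txtOwnerId=<CLUSTER_ID>' \
  --set 'serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:aws:iam::<ACCT>:role/external-dns'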

Usage on an Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payments
  annotations:
    external-dns.alpha.kubernetes.io/hostname: payments.<DOMAIN>
    external-dns.alpha.kubernetes.io/ttl: "60"
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "true"   # only honored by the Cloudflare provider
spec:
  rules:
    - host: payments.<DOMAIN>
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: {name: payments, port: {number: 80}}

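→ external-dns creates the record in Route53 / Cloudflare (an A record, or a CNAME / Route53 ALIAS when the load balancer exposes a hostname) plus a TXT ownership record.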

Multi-cluster support

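  • txtOwnerId (--txt-owner-id): each cluster uses its own ID (e.g. cluster-1) and only modifies records whose TXT ownership marker carries that ID
  • domainFilters (--domain-filter): split zones per environment, e.g. dev.<DOMAIN> vs prod.<DOMAIN>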
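policy=sync and the owner ID interact: external-dns only touches records whose ownership marker matches its own ID, and only sync deletes. The plan function below is a hypothetical, simplified planner that illustrates those two rules; it is not external-dns's actual planner, which also handles conflicts, record types and TXT naming.

```python
def plan(desired: dict[str, str], current: dict[str, tuple[str, str]],
         owner_id: str, policy: str = "upsert-only") -> dict[str, list[str]]:
    """Compute record changes for one cluster.

    desired: hostname -> target this cluster wants (from Ingress/Service)
    current: hostname -> (target, owner) as found in the DNS zone
    """
    if policy not in ("sync", "upsert-only"):
        raise ValueError(f"unsupported policy: {policy}")
    changes = {"create": [], "update": [], "delete": [], "skipped_foreign": []}
    for host, target in sorted(desired.items()):
        if host not in current:
            changes["create"].append(host)
            continue
        cur_target, cur_owner = current[host]
        if cur_owner != owner_id:
            # Owned by another cluster or created by hand: never touch it.
            changes["skipped_foreign"].append(host)
        elif cur_target != target:
            changes["update"].append(host)
    if policy == "sync":
        # Only sync removes records that are owned by us but no longer desired.
        for host, (_, cur_owner) in sorted(current.items()):
            if host not in desired and cur_owner == owner_id:
                changes["delete"].append(host)
    return changes
```

With a zone holding a stale record owned by cluster-1 and a name owned by cluster-2, cluster-1 running sync updates its own drifted record, deletes its orphan and leaves cluster-2's record alone; under upsert-only the orphan simply stays.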

🌍 4️⃣ Split-Horizon DNS

Internal vs. external DNS: the same name resolves to different IPs depending on where the query comes from.

api.example.com:
  - Internet → 203.0.113.10  (public LB)
  - Internal VPC → 10.0.5.10  (private LB)

Use case

  • B2B customers reach the service via the public IP
  • Internal microservices use the private IP (lower latency, no public LB / egress cost)

Implementation

  • Route53: Private hosted zone (VPC-bound) + public hosted zone
  • CoreDNS: rewrite plugin
    rewrite name api.example.com api.internal.svc.cluster.local
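The essence of split-horizon is that the answer depends on the source of the query. A minimal sketch with Python's standard ipaddress module; resolve, the VPC CIDR 10.0.0.0/16 and the record tables are hypothetical and mirror the api.example.com example above. In Route53 the choice is made by private vs. public hosted zone association, not by code you write.

```python
import ipaddress
from typing import Optional

# Hypothetical views mirroring the api.example.com example.
INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/16")]   # assumed VPC CIDR
VIEWS = {
    "internal": {"api.example.com": "10.0.5.10"},       # private LB
    "external": {"api.example.com": "203.0.113.10"},    # public LB
}


def resolve(name: str, client_ip: str) -> Optional[str]:
    """Pick the view from the client's source address, then look the name up."""
    addr = ipaddress.ip_address(client_ip)
    view = "internal" if any(addr in net for net in INTERNAL_NETS) else "external"
    return VIEWS[view].get(name)
```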
    

🛡️ 5️⃣ DNS Security

DNSSEC

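  • Sign: enable DNSSEC signing at your authoritative DNS provider (e.g. Route53, Cloudflare) and publish the DS record at the registrar
  • Validate: CoreDNS is not a validating resolver (its dnssec plugin signs zones CoreDNS itself serves), so forward to a validating upstream such as 1.1.1.1 or 8.8.8.8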

DNS-over-TLS (DoT) / DNS-over-HTTPS (DoH)

forward . tls://1.1.1.1 tls://1.0.0.1 {
    tls_servername cloudflare-dns.com   # applies to every upstream in the block, so don't mix providers
    health_check 5s
}

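→ Queries between CoreDNS and the upstream are encrypted and the server certificate is verified, which blocks on-path snooping and tampering (man-in-the-middle).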

NXDOMAIN attack protection

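  • Aggressive negative caching (denial, 30s+) absorbs repeated lookups of the same nonexistent name; it does not help against random-subdomain floods, where every name is new
  • Per-pod rate limiting: the default CoreDNS build has no rate-limit plugin, so enforce limits upstream (or via an external plugin)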
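Mechanically, per-pod rate limiting usually means a token bucket keyed by source IP. PerClientTokenBucket below is a hypothetical sketch (class name, rate and burst values are assumptions); in practice this logic lives in an upstream resolver, a firewall or an external plugin, not in your own code.

```python
from collections import defaultdict


class PerClientTokenBucket:
    """One token bucket per source IP: `rate` queries/s sustained, up to `burst` at once."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = defaultdict(lambda: burst)   # every client starts with a full bucket
        self.last: dict[str, float] = {}

    def allow(self, client_ip: str, now: float) -> bool:
        """True if this query may be answered, False if it should be dropped/REFUSED."""
        elapsed = now - self.last.get(client_ip, now)
        self.last[client_ip] = now
        # Refill in proportion to elapsed time, never beyond the burst size.
        self.tokens[client_ip] = min(self.burst, self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1:
            self.tokens[client_ip] -= 1
            return True
        return False
```

With rate=10 and burst=20, a pod firing 25 queries at once gets 20 answers; half a second later it has earned 5 more, while other pods keep their own full buckets.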

🔍 6️⃣ DNS Troubleshooting

Quick debugging inside a pod

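# Open a shell in the pod
kubectl exec -it <POD> -- sh
# If the image lacks dig/nslookup, start a throwaway debug pod instead
kubectl run dnsutils -it --rm --restart=Never --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 -- sh

# Check what the pod actually uses (nameserver, search, ndots)
cat /etc/resolv.conf

# Resolution test
nslookup payments.<DOMAIN>
dig payments.<DOMAIN>

# Query CoreDNS directly
dig @<COREDNS_IP> kubernetes.default.svc.cluster.local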
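If the image ships Python but neither dig nor nslookup, a small probe built on the standard library's socket.getaddrinfo can time lookups the way an application sees them (through the system resolver: /etc/hosts, then DNS per /etc/resolv.conf, including search-list expansion). probe is a hypothetical helper; which names to test is up to you, e.g. probe("kubernetes.default.svc.cluster.local").

```python
import socket
import time


def probe(name: str, port: int = 443) -> tuple[bool, float, list[str]]:
    """Resolve `name` like an application would; return (ok, seconds, addresses)."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Resolution failed (NXDOMAIN, SERVFAIL, timeout...): report how long it took.
        return False, time.monotonic() - start, []
    addrs = sorted({info[4][0] for info in infos})
    return True, time.monotonic() - start, addrs
```

A lookup that only fails or crawls for short names, but works with a trailing dot, points at the search path rather than at CoreDNS itself.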

CoreDNS log

# Add log to the Corefile (verbose: logs every query)
.:53 {
    log
    errors
    ...
}
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

Common problems

| Symptom | Cause | Fix |
|---|---|---|
| nslookup times out | NetworkPolicy denies DNS (port 53, UDP and TCP) | Add an allow-dns NetworkPolicy |
| External domains don't resolve | Wrong upstream DNS | Check the forward . config |
| service.namespace.svc.cluster.local doesn't resolve | CoreDNS is down | kubectl get pods -n kube-system -l k8s-app=kube-dns |
| Slow resolution | No NodeLocal cache | Install NodeLocal DNSCache |
| NXDOMAIN "sticks" after a record is created | Negative-cache TTL too high | Lower the denial TTL in the cache block (e.g. denial 9984 5) |
| Pods hammer CoreDNS constantly | DDoS pattern | NodeLocal + rate limiting |
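The allow-dns NetworkPolicy from the table is not spelled out above. allow_dns_policy below is a hypothetical generator for one; the policy name, the target namespace and the optional NodeLocal ipBlock are assumptions. It selects CoreDNS pods via the k8s-app: kube-dns label and the kubernetes.io/metadata.name namespace label that Kubernetes sets automatically, and emits JSON, which kubectl apply -f - accepts. It only makes sense next to a default-deny egress policy: on its own, selecting every pod with policyTypes Egress restricts their egress to DNS only.

```python
import json
from typing import Optional


def allow_dns_policy(namespace: str, node_local_ip: Optional[str] = None) -> dict:
    """Build a NetworkPolicy that lets every pod in `namespace` reach cluster DNS."""
    dns_ports = [{"protocol": "UDP", "port": 53}, {"protocol": "TCP", "port": 53}]
    destinations = [{
        # CoreDNS pods in kube-system, matched by namespace label AND pod label.
        "namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": "kube-system"}},
        "podSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
    }]
    if node_local_ip:
        # Assumption: with NodeLocal DNSCache pods talk to a node-local IP
        # (e.g. 169.254.20.10) that no pod/namespace selector can match.
        destinations.append({"ipBlock": {"cidr": f"{node_local_ip}/32"}})
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-dns", "namespace": namespace},
        "spec": {
            "podSelector": {},               # every pod in the namespace
            "policyTypes": ["Egress"],
            "egress": [{"to": destinations, "ports": dns_ports}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(allow_dns_policy("payments", "169.254.20.10"), indent=2))
```

Usage: save it as allow_dns.py and run python3 allow_dns.py | kubectl apply -f - once per namespace (changing the namespace argument), in line with the checklist item below.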

📊 DNS Monitoring

Prometheus metrics (exported by CoreDNS)

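# Query rate
sum(rate(coredns_dns_requests_total[5m]))

# SERVFAIL ratio (NXDOMAIN is normal noise from search-path expansion, so don't count every non-NOERROR code)
sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
/
sum(rate(coredns_dns_responses_total[5m]))

# p99 latency (aggregate by le before taking the quantile)
histogram_quantile(0.99, sum by (le) (rate(coredns_dns_request_duration_seconds_bucket[5m])))

# Cache hit ratio (sum both sides so the label sets match)
sum(rate(coredns_cache_hits_total[5m]))
/
sum(rate(coredns_dns_requests_total[5m]))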
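histogram_quantile estimates a quantile from cumulative le buckets by linear interpolation inside the bucket where the target rank falls, which is why the per-le aggregation matters. The function below is a simplified re-implementation for intuition only: it assumes positive bounds (latencies) and cumulative counts, and skips Prometheus's remaining edge cases; the bucket counts used with it are hypothetical.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified take on PromQL's interpolation: counts must be cumulative,
    the last bound must be +Inf, and bounds are assumed positive.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("need a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total                      # observations at or below the answer
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if bound == math.inf:
                # Rank lands in the open-ended bucket: report the last finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            # Linear interpolation inside [prev_bound, bound].
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan
```

With buckets 1ms: 500, 5ms: 900, 10ms: 980, 50ms: 998, +Inf: 1000, the p99 estimate lands at about 32 ms, interpolated between the 10ms and 50ms bounds; coarse buckets therefore make the p99 alert threshold only as precise as the bucket layout.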

Key alerts

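- alert: CoreDNSHighErrorRate
  expr: |
    sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
      / sum(rate(coredns_dns_responses_total[5m])) > 0.05
  for: 10m

- alert: CoreDNSHighLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(coredns_dns_request_duration_seconds_bucket[5m]))) > 1
  for: 10m

- alert: CoreDNSDown
  expr: up{job="kube-dns"} == 0   # the job label depends on your scrape config
  for: 5m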

🚫 Anti-Pattern Table

| Anti-pattern | Why it hurts | Do instead |
|---|---|---|
| No NodeLocal DNSCache | CoreDNS gets DDoSed in large clusters | Install NodeLocal |
| TTL 0 (every query goes upstream) | Latency + load | Cache 30s+ |
| TTL 24h (slow failover) | Stale answers for up to 24h after an IP change | cache 30s, kubernetes plugin ttl 5s |
| No external-dns | Manual DNS updates | external-dns + annotations |
| No TTL annotation | Provider default TTL applies (300s, i.e. 5 min, on Route53) | Per-record TTL |
| DNSSEC disabled | Spoofing risk | Sign + validate |
| Plain-DNS upstream | MITM | DoT or DoH |
| Single CoreDNS replica | SPOF | At least 2-3 replicas + autoscaling |
| No negative cache | NXDOMAIN spam | Enable cache denial |
| No CoreDNS resource requests/limits | OOM kill / eviction | Set requests/limits |
| .local lookups leaking outside cluster.local | .local is reserved for mDNS; conflicts | Optimize the search path |
| No NetworkPolicy allowing DNS | Pods cannot resolve anything | allow-dns NetworkPolicy |
| DNS logs not in the SIEM | No forensics | Ship logs (e.g. to Loki) |

📋 DNS Production Checklist

[ ] CoreDNS HA: 3+ replicas + autoscaling
[ ] CoreDNS resources: requests/limits set
[ ] NodeLocal DNSCache installed (large clusters)
[ ] external-dns: annotation-driven DNS records
[ ] TTL: kubernetes plugin ttl 5s, cache 30s, external records 60s
[ ] Cache: success + denial + prefetch
[ ] Forward: multiple upstreams (1.1.1.1 + 8.8.8.8)
[ ] DNSSEC: zone signed (provider + DS at registrar), validating upstream resolver
[ ] DoT/DoH upstream
[ ] DNS rate limiting (anti-DDoS)
[ ] Prometheus metrics + alerts
[ ] Logs → SIEM
[ ] NetworkPolicy: allow-dns rule in every namespace
[ ] Multi-cluster: per-cluster external-dns txtOwnerId
[ ] Split-horizon: internal + external zones (if needed)
[ ] Quarterly: DNS performance review


"DNS is not 'background tooling'; it is the backbone of production. When cache, TTL and DNSSEC are not monitored, it tells you 'everything works'; thirty minutes into an incident investigation, you find out: it's always DNS."