Network Troubleshooting — tcpdump, ss, dig, conntrack#
"Connection timeout. Sebep: A) NetworkPolicy, B) DNS, C) firewall, D) sidecar, E) MTU. Bilemiyorsan sırayla elemine et — production'da %30 incident network, flowchart ile çözülür."
This guide covers the commands, tools, and decision-tree method for systematically debugging network problems in production, with concrete commands throughout.
🌳 Network Troubleshooting Flowchart#
[Connection problem]
               │
               ▼
┌─────────────────────────────┐
│ 1. What is the symptom?     │
│    - timeout                │
│    - connection refused     │
│    - 503 / 504              │
│    - intermittent           │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ 2. Check layer by layer     │
│    L3 (IP) → L4 (TCP/UDP)   │
│    → L7 (HTTP)              │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ 3. Pod-pod, pod-svc, ext?   │
│    - Same node              │
│    - Cross-node             │
│    - Cross-cluster          │
│    - Internet               │
└──────────────┬──────────────┘
               │
               ▼
          ROOT CAUSE
🔍 Step 1: DNS Resolution Inside the Pod#
# Open a shell in the pod
kubectl exec -it <POD> -n <NS> -- sh
# DNS test
nslookup payments.<NS>.svc.cluster.local
# Expected: returns the ClusterIP
# External DNS
nslookup google.com
# Expected: a real public IP
# Detailed DNS query (straight at CoreDNS)
dig @10.96.0.10 payments.<NS>.svc.cluster.local
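The 10.96.0.10 above is only the common default; rather than assuming it, look up the real cluster DNS ClusterIP:
# CoreDNS is fronted by the kube-dns Service in most distributions
kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'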
Problems#
- ❌ ;; connection timed out → CoreDNS down or a NetworkPolicy denying the DNS ports (see the health check below)
- ❌ NXDOMAIN → wrong service name
- ❌ SERVFAIL → upstream DNS issue
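If you suspect the CoreDNS branch, a quick health check looks like this (a minimal sketch; the k8s-app=kube-dns label is the common default):
# CoreDNS pods up?
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Anything alarming in the logs?
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20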
🔌 Step 2: TCP Connection Test#
# If nc (netcat) exists in the pod
nc -zv payments.<NS>.svc.cluster.local 8080
# Via /dev/tcp (bash built-in); exits 0 only if the connect succeeds
timeout 3 bash -c '</dev/tcp/payments.<NS>.svc.cluster.local/8080' && echo open
# curl
curl -v http://payments.<NS>.svc.cluster.local:8080/healthz \
  --connect-timeout 5
Problems#
- ❌ Connection timed out → NetworkPolicy block / firewall (probe each backend, see below)
- ❌ Connection refused → nothing listening on the port / pod down
- ❌ Connection reset → app crash / TLS error
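When a Service times out, probe each backend pod IP directly to see whether one replica or the whole set is unreachable. A rough sketch (the payments Service and port 8080 follow the example above):
# List the endpoint IPs behind the Service, then TCP-test each one
for ip in $(kubectl get endpoints payments -n <NS> \
    -o jsonpath='{.subsets[*].addresses[*].ip}'); do
  nc -zv -w 2 "$ip" 8080
done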
🛡️ Step 3: NetworkPolicy Check#
# List the NetworkPolicies in play
kubectl get networkpolicies -n <NS>
kubectl describe networkpolicy <POL> -n <NS>
# Nicer view if Cilium is the CNI (run inside the cilium agent pod)
cilium policy get
cilium policy trace --src-k8s-pod <NS>:<SRC_POD> --dst-k8s-pod <NS>:<DST_POD> --dport 8080
Test: NetPol bypass#
# Annotating a NetworkPolicy does NOT disable it. To rule a policy out,
# back it up, delete it, retest, then restore.
kubectl get networkpolicy <POL> -n <NS> -o yaml > /tmp/<POL>.yaml
kubectl delete networkpolicy <POL> -n <NS>
# Test
# Then restore
kubectl apply -f /tmp/<POL>.yaml
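A less destructive variant is to overlay a temporary allow-all policy. NetworkPolicies are additive, so if traffic starts flowing once this is applied, a policy was the blocker. A minimal sketch (the name debug-allow-all is made up):
kubectl apply -n <NS> -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: debug-allow-all
spec:
  podSelector: {}          # every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
  ingress:
  - {}                     # allow all ingress
  egress:
  - {}                     # allow all egress
EOF
# Test, then remove the override
kubectl delete networkpolicy debug-allow-all -n <NS>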
📊 Step 4: tcpdump (Inside the Pod)#
# Bring tcpdump along (initContainer or ephemeral container)
kubectl debug -it <POD> --image=nicolaka/netshoot -n <NS>
# Inside the debug container
tcpdump -i any -n port 8080
tcpdump -i any -n host <DEST_IP>
tcpdump -i any -n -A 'port 80'   # ASCII payload
# Filter on a specific pod IP
tcpdump -i any -n 'host <POD_IP> and port 8080'
netshoot image#
nicolaka/netshoot ships with every network tool you need: tcpdump, curl, dig, nslookup, mtr, iperf3, netcat, ss, ip, conntrack.
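For offline analysis in Wireshark, write the capture to a file and copy it out of the debug container. A sketch; kubectl debug prints the generated ephemeral-container name, and debugger-abc12 here is a placeholder for it:
# Inside the debug container: write a pcap instead of printing to stdout
tcpdump -i any -n -w /tmp/capture.pcap port 8080
# From your workstation: pull the file out of the named container
kubectl cp <NS>/<POD>:/tmp/capture.pcap ./capture.pcap -c debugger-abc12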
🌊 Step 5: ss / netstat (Connection Listing)#
# Active TCP/UDP connections (numeric, with owning process)
ss -tunap
# Listening ports
ss -tlnp
# Filter by TCP state
ss -tan state established
ss -tan state time-wait
# Specific port
ss -tan src :8080
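A state histogram is the fastest way to spot TIME-WAIT or CLOSE-WAIT buildup:
# Count connections per TCP state
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn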
🔥 Step 6: conntrack (NAT Table)#
# Current connections
conntrack -L | head
# Specific source
conntrack -L --src <POD_IP>
# Saturation check
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# When count approaches max, the table is full and new connections get dropped
# Raise the limit if needed
sysctl -w net.netfilter.nf_conntrack_max=<NEW_MAX>
🔑 A full conntrack table means silent connection drops. On high-traffic nodes, monitoring it is mandatory.
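A one-liner for the node runbook that prints saturation as a percentage:
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
echo "conntrack: ${count}/${max} ($((100 * count / max))% full)"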
🌐 Step 7: kube-proxy / Cilium Service Routing#
kube-proxy (iptables mode)#
# Service IP → list of backend pods
iptables -t nat -L KUBE-SERVICES -n
iptables -t nat -L KUBE-SVC-<HASH> -n
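To follow one Service through the NAT table: its ClusterIP matches a rule in KUBE-SERVICES that jumps to a KUBE-SVC chain, which load-balances across KUBE-SEP chains, each DNAT-ing to one backend pod. A sketch:
# Which KUBE-SVC chain handles this ClusterIP?
iptables -t nat -L KUBE-SERVICES -n | grep <CLUSTER_IP>
# The SVC chain fans out to KUBE-SEP (endpoint) chains
iptables -t nat -L KUBE-SVC-<HASH> -n
# Each SEP chain DNATs to a single pod IP:port
iptables -t nat -L KUBE-SEP-<HASH> -n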
Cilium (eBPF mode)#
# Cilium service map
cilium service list
# Specific service
cilium service get <ID>
# Endpoint identity
cilium endpoint list
🚦 Step 8: Service Mesh (Sidecar Issues)#
Istio#
# Sidecar status
istioctl proxy-status
# Sidecar config dump
istioctl proxy-config cluster <POD> -n <NS>
istioctl proxy-config route <POD> -n <NS>
istioctl proxy-config listener <POD> -n <NS>
# Sidecar log
kubectl logs <POD> -c istio-proxy -n <NS> --tail=50
Linkerd#
linkerd viz tap deploy/<DEPLOY> -n <NS>
linkerd viz top deploy/<DEPLOY> -n <NS>
linkerd viz edges -n <NS>
Cilium (mesh)#
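With Cilium, the mesh-level view comes from Hubble; dropped flows show up with their verdict and the policy decision behind them. A minimal sketch using the hubble CLI:
# Dropped flows in the namespace
hubble observe -n <NS> --verdict DROPPED
# Follow live flows for one pod
hubble observe --pod <NS>/<POD> -f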
🌍 Step 9: External Connectivity#
# From the pod out to the internet
curl -v https://api.stripe.com --connect-timeout 5
# Egress NetworkPolicy?
kubectl describe networkpolicy -n <NS> | grep -A 10 Egress
# DNS for external?
nslookup api.stripe.com
# mtr (traceroute + ping in one tool)
mtr -rwc 10 api.stripe.com
Cloud SG / NACL#
- AWS: check Security Groups + NACLs (see the egress-IP check below)
- GCP: Firewall rules
- Azure: NSG
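When those rules allowlist by source IP, first confirm which egress IP the pod traffic actually leaves with. A quick check against a public echo service (ifconfig.me is one example; any equivalent works):
# The address the outside world sees; this is what SG/NACL rules must allow
curl -s https://ifconfig.me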
📐 Common Scenarios: Quick Fixes#
Pod-Pod Same Node Timeout#
- Is a default-deny NetworkPolicy active?
- Is the CNI healthy?
- Is the pod CIDR correct?
Pod-Pod Cross-Node Timeout#
- In addition to the above:
- Node-to-node connectivity (VPC routing)
- Is the overlay network intact (VXLAN, BGP)?
Pod → External 503#
- Egress NetworkPolicy?
- NAT gateway saturation?
- Did the external service rate-limit you?
Intermittent 503#
- Pods killed during HPA scale-down?
- Readiness probe misconfigured?
- LB stale endpoint? (see the check below)
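To check for stale endpoints, compare what the Service actually routes to against the live pod IPs; a mismatch means endpoint propagation is lagging. A sketch (app=<APP> stands in for your pod selector):
# Addresses registered behind the Service
kubectl get endpointslices -n <NS> -l kubernetes.io/service-name=<SVC> \
  -o jsonpath='{.items[*].endpoints[*].addresses[*]}'
# IPs of the pods that should be serving
kubectl get pods -n <NS> -l app=<APP> -o wide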
MTU Issue (large packets dropped)#
# Pod MTU
ip link show eth0
# Test (Don't Fragment flag + large packet)
ping -M do -s 1472 <DEST>
# 1472 = 1500 - 28 (20-byte IP + 8-byte ICMP header): the largest unfragmented ping payload at MTU 1500
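If the 1472-byte ping fails, bisect downward; the largest size that passes, plus the 28 header bytes, is the effective path MTU. A sketch:
for size in 1472 1452 1422 1400 1372; do
  ping -c 1 -W 1 -M do -s "$size" <DEST> > /dev/null 2>&1 \
    && echo "$size: OK" \
    || echo "$size: dropped"
done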
TLS Errors#
# Certificate details
openssl s_client -connect payments.<NS>.svc.cluster.local:443 \
-servername payments.<NS>.svc.cluster.local
# Cert expiry
echo | openssl s_client -connect <HOST>:443 2>/dev/null | \
openssl x509 -noout -dates
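A common in-cluster TLS failure is a SAN mismatch: the cert is valid but not for the name you dialed. List the names explicitly (the -ext flag needs OpenSSL 1.1.1+; older builds can grep the -text output instead):
# Subject Alternative Names on the served cert
echo | openssl s_client -connect <HOST>:443 -servername <HOST> 2>/dev/null | \
  openssl x509 -noout -ext subjectAltName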
🛠️ Tool Catalog#
| Tool | Use |
|---|---|
| dig, nslookup | DNS resolution |
| nc, telnet | TCP port test |
| curl -v | HTTP request + headers |
| tcpdump | Packet capture |
| ss, netstat | Connection listing |
| conntrack | NAT table |
| iptables -nL | Firewall rules |
| mtr | Path tracing |
| iperf3 | Bandwidth test |
| netshoot | Image that bundles all of the above |
| cilium-cli, hubble | eBPF data flow |
| istioctl | Istio sidecar |
| wireshark | Capture analysis (offline) |
🚫 Anti-Pattern Table#
| Anti-pattern | Why it's bad | Instead |
|---|---|---|
| apt-get install net-tools inside the pod | Image bloat, runtime mutation | netshoot ephemeral container |
| iptables -nL on every node | eBPF datapaths bypass iptables | Cilium / kube-proxy tooling |
| Unfiltered tcpdump on the host instead of in the pod | Pod traffic gets lost in the noise | Filter on the pod CIDR |
| Pinging a Service IP | Services are not required to answer ICMP | curl / nc TCP test |
| Pod restart "fixes" the error, cause unknown | Same problem recurs | Capture + analyze |
| Manual DNS records because external-dns "is broken" | Drift | Check the annotations |
| Cert expiry as a surprise | Outage | cert-manager + alerts |
| Default MTU 1500 on GCP/AWS overlays | Fragmentation | 1450 for VXLAN |
| conntrack table full, no alarm | Silent drops | Saturation alert |
| "It's an app problem" when a sidecar is present | Sidecar config / mTLS is often the cause | Sidecar log + config dump |
📋 Network Troubleshooting Checklist#
[ ] netshoot image ready in the cluster (ephemeral container)
[ ] CoreDNS HA + NodeLocal DNSCache
[ ] NetworkPolicy + Cilium policy trace
[ ] Team fluent with cilium-cli + hubble observe
[ ] istioctl / linkerd viz tap (if a mesh is in place)
[ ] Runbook for running tcpdump in a pod
[ ] conntrack max + saturation alert
[ ] MTU tuned (1450 for VXLAN)
[ ] Cert expiry alerts (cert-manager)
[ ] Egress NetworkPolicy + FQDN allowlist
[ ] DNS metrics + alerts
[ ] LB health checks + readiness probes correct
[ ] On-call: network troubleshooting runbook
[ ] Quarterly: retrospective on recurring network incidents
📚 Referanslar#
- netshoot — github.com/nicolaka/netshoot
- tcpdump cheat sheet — packetlife.net
- Cilium debug — docs.cilium.io
- Istio troubleshooting — istio.io/docs/ops/diagnostic-tools/
- Linkerd debugging — linkerd.io/2/tasks/debugging-your-service/
Related: Service-Mesh-Comparison.md · Cilium-eBPF-Intro.md · DNS-Strategies.md · Ingress-NGINX-Patterns.md · 08-Security/Zero-Trust-Networking.md · 11-SRE/Runbook-Template.md
"Network troubleshooting 'sezgi' değil — flowchart. Layer-by-layer + tool-by-tool sistemli geçiş, 30 dakikada root cause. 'Acaba network mı' demek yerine kanıtla."