DevOps/SRE System Design: Cheatsheet

Unlike traditional "Design Twitter" questions, DevOps/SRE interviews focus on infrastructure, deployment, and observability.

🎯 Typical question categories

Infrastructure design: "Design a cluster for X"
Deployment design: "How do you deploy with zero downtime?"
Observability design: "Monitoring for 100 services"
Disaster recovery: "Multi-region failover"
Cost optimization: "How would you cut the budget in half?"

📐 General framework (in order)

1. Clarify requirements (5 min)

Functional: What is the system expected to do?
Non-functional: Scale (RPS, users, data volume), SLO (availability, latency), cost budget
Constraints: Cloud vendor, region, compliance (GDPR/HIPAA), team size

2. High-level architecture (10 min)

Draw it: user → CDN → LB → API → DB → cache → ...

3. Component deep-dive (15 min)

Go deep on one or two components, ideally the ones most likely to become a bottleneck.

4. Trade-offs (10 min)

For every decision, explain why this option and not the other; show the alternatives.

5. Scale & evolution (5 min)

1K RPS today, 100K tomorrow: what changes?
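
A back-of-envelope calculation makes the "what changes?" answer concrete. The sketch below estimates replica counts for both load levels; the per-replica throughput (200 RPS) and the 25% headroom are assumed illustrative numbers, not measurements.

```python
import math


def replicas_needed(peak_rps: float, rps_per_replica: float, headroom: float = 0.25) -> int:
    """Replicas required to serve peak_rps while keeping `headroom` spare capacity."""
    if rps_per_replica <= 0 or not 0 <= headroom < 1:
        raise ValueError("rps_per_replica must be > 0 and headroom must be in [0, 1)")
    usable_rps = rps_per_replica * (1 - headroom)  # capacity we plan to actually use
    return math.ceil(peak_rps / usable_rps)


# Hypothetical service that sustains 200 RPS per replica.
today = replicas_needed(1_000, rps_per_replica=200)
tomorrow = replicas_needed(100_000, rps_per_replica=200)
print(today, tomorrow)
```

Going from single-digit to hundreds of replicas is usually the cue to talk about sharding, connection pooling, and database limits rather than just "add pods".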

🏗️ Sample 1: "Design a production K8s cluster: multi-tenant SaaS, 10K users, 99.9% SLO"

Requirements

10K users, ~5K RPS peak
Multi-tenant (each tenant isolated)
99.9% availability → at most ~43 min of downtime per 30-day month
p99 latency < 500 ms
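
The downtime figure follows directly from the SLO. A minimal sketch of the arithmetic, assuming a 30-day window (the function name is illustrative):

```python
def downtime_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {downtime_budget_minutes(slo):.1f} min per 30 days")
```

Each extra nine cuts the budget by a factor of ten, which is worth saying out loud before agreeing to a target.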

High-level

                          ┌──────────────┐
                          │   CDN/WAF    │
                          │  Cloudflare  │
                          └──────┬───────┘
                                 │
                          ┌──────▼───────┐
                          │  API Gateway │
                          │  (Envoy)     │
                          └──────┬───────┘
                                 │
                  ┌──────────────┴──────────────┐
                  │                             │
            ┌─────▼─────┐                ┌──────▼─────┐
            │ EKS Prod  │                │ EKS Prod   │
            │ Region A  │                │ Region B   │
            │ (active)  │                │ (warm DR)  │
            └─────┬─────┘                └────────────┘
                  │
       ┌──────────┼──────────┐
       │          │          │
   ┌───▼───┐  ┌───▼───┐  ┌───▼──────┐
   │ Tenant│  │ Tenant│  │ Platform │
   │  NS   │  │  NS   │  │  NS      │
   │ A,B,C │  │ D,E,F │  │ (shared) │
   └───────┘  └───────┘  └──────────┘

Multi-tenancy

Soft: namespace-per-tenant + ResourceQuota + NetworkPolicy
Hard: vCluster per tenant, or a separate cluster (if compliance requires it)
Soft isolation is enough for 10K users; use hard isolation for selected premium tenants
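
The soft-isolation bundle can be generated per tenant. The sketch below builds the three objects as plain Python dicts; the `tenant-` prefix, object names, and quota values are assumptions for illustration, and the NetworkPolicy shown only admits ingress from pods in the same namespace.

```python
import json


def tenant_manifests(tenant: str, cpu: str = "4", memory: str = "8Gi") -> list:
    """Return Namespace, ResourceQuota and NetworkPolicy manifests for one tenant."""
    ns = f"tenant-{tenant}"  # hypothetical naming convention
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": ns, "labels": {"tenant": tenant}},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": ns},
        "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory, "pods": "50"}},
    }
    policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "same-namespace-only", "namespace": ns},
        "spec": {
            "podSelector": {},  # applies to every pod in the namespace
            "policyTypes": ["Ingress"],
            # An empty podSelector under "from" matches pods in this namespace only.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }
    return [namespace, quota, policy]


print(json.dumps(tenant_manifests("acme"), indent=2))
```

In a GitOps setup these would more likely live in a Helm chart or Kustomize base; the point here is only which objects make up "soft" tenancy.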

Cluster topology

Managed control plane (EKS / GKE / AKS); the provider runs it highly available across zones
3 zones, one node group per zone
Karpenter with a spot + on-demand mix
Min 6 workers, max 60 (nodes added by Karpenter as HPA scales pods)

Key components

Ingress: NGINX Ingress / Gateway API
Service Mesh: Linkerd (simple, fast); if Istio is required, use Ambient mode
Observability: kube-prometheus-stack + Grafana + Loki + Tempo
GitOps: ArgoCD ApplicationSet
Secret: External Secrets Operator + AWS Secrets Manager
Backup: Velero
Policy: Kyverno

DR

Region B is "warm": same k8s-config repo, kept in sync by ArgoCD
DB: Aurora Global (cross-region replication)
DNS: Route 53 health checks + failover routing (latency-based routing would send live traffic to the warm region)
RTO: 15 min, RPO: 1 min

Cost

Spot 70%, on-demand 30% as the baseline
1-year Savings Plan (SP) for the compute baseline
EKS control plane is a flat per-cluster fee with no reservation discount; keep the cluster count low
Mandatory tagging, per-tenant cost via Kubecost
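
To show why the spot/on-demand split matters, the sketch below computes a blended hourly cost. The 65% spot discount, the 30% Savings Plan discount, and the $1.00 node price are placeholder assumptions, not AWS list prices.

```python
def blended_hourly_cost(
    nodes: int,
    on_demand_price: float,
    spot_share: float = 0.7,
    spot_discount: float = 0.65,  # assumed average spot discount
    sp_discount: float = 0.30,    # assumed 1-year Savings Plan discount
) -> float:
    """Hourly cost when spot_share of nodes run on spot and the rest are SP-covered."""
    spot_cost = nodes * spot_share * on_demand_price * (1 - spot_discount)
    baseline_cost = nodes * (1 - spot_share) * on_demand_price * (1 - sp_discount)
    return spot_cost + baseline_cost


all_on_demand = 10 * 1.00
mixed = blended_hourly_cost(nodes=10, on_demand_price=1.00)
print(f"on-demand only: ${all_on_demand:.2f}/h, mixed: ${mixed:.2f}/h")
```

Plugging in real prices for the chosen instance family gives a number you can defend when the interviewer asks what the design costs.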

🚀 Sample 2: "Zero-downtime deploy strategy for 100 microservices"

Requirements

Speed: 50+ deploys per day
Risk: no customer-impacting outages
Rollback: within 5 min

Strategy

Trunk-based development + feature flags
GitOps (ArgoCD)
Argo Rollouts canary:
  5% → pause 5 min → analyze → 25% → pause 10 min → analyze → 100%
Analysis template:
  error_rate < 0.1%
  p99_latency < threshold
  SLO burn rate < 1x
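
The canary progression with its analysis gate can be sketched as a small state machine. This is a simulation of the decision logic only, not Argo Rollouts' API; `Metrics`, `observe`, and the 500 ms latency threshold are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Metrics:
    error_rate: float      # fraction of failed requests; 0.001 == 0.1%
    p99_latency_ms: float
    burn_rate: float       # 1.0 == error budget consumed exactly on schedule


def analysis_passes(m: Metrics, latency_threshold_ms: float = 500.0) -> bool:
    """All three conditions from the analysis template must hold."""
    return m.error_rate < 0.001 and m.p99_latency_ms < latency_threshold_ms and m.burn_rate < 1.0


def run_canary(observe: Callable[[int], Metrics]) -> Tuple[int, str]:
    """Walk 5% -> 25% -> 100%; roll back to 0% on the first failed analysis."""
    for weight in (5, 25):
        # A real rollout would pause here (5 or 10 min) before reading metrics.
        if not analysis_passes(observe(weight)):
            return 0, "rolled_back"
    return 100, "promoted"


healthy = lambda weight: Metrics(0.0002, 310.0, 0.4)
degrades_at_25 = lambda weight: Metrics(0.0002 if weight < 25 else 0.004, 310.0, 0.4)
print(run_canary(healthy), run_canary(degrades_at_25))
```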

Pre-deploy checks (CI)

Lint + test + SAST + SCA
Image build + Trivy scan + cosign sign
E2E smoke (preview env)

Post-deploy

Auto-rollback if the analysis fails
Slack notification (deploy + status)
Datadog deployment marker

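Database migrations

Expand/contract pattern
Migration job as a hook that runs before the new version rolls out (e.g. Helm pre-upgrade or ArgoCD PreSync); a Helm post-install hook runs only on first install, after the app is already up
Backward compatibility is mandatory
Never DROP COLUMN in the middle of a deploy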
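
The expand/contract pattern is easiest to see on a toy schema. The sketch below uses an in-memory SQLite database; the `users` table and its columns are made up for illustration. The expand step is purely additive, so code from the previous release keeps working, and the contract step is deferred to a later release.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO users (first_name, last_name) VALUES ('Ada', 'Lovelace')")

# Expand: add a nullable column. Old code that never mentions it is unaffected.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

# Backfill existing rows; new code writes both old and new columns during the transition.
conn.execute(
    "UPDATE users SET full_name = first_name || ' ' || last_name WHERE full_name IS NULL"
)

# The previous release's query and the new release's query both succeed.
old_row = conn.execute("SELECT first_name, last_name FROM users WHERE id = 1").fetchone()
new_row = conn.execute("SELECT full_name FROM users WHERE id = 1").fetchone()
print(old_row, new_row)

# Contract: only in a later release, once no running version reads the old columns:
#   ALTER TABLE users DROP COLUMN first_name;
#   ALTER TABLE users DROP COLUMN last_name;
```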

🔭 Sample 3: "Observability stack for 100 services"

Three pillars + 1

Metrics: Prometheus (Mimir backend, multi-tenant)
Logs: Loki (cheap, K8s-native)
Traces: Tempo (TraceQL, sampling at collector)
Profiles: Pyroscope (continuous profiling)

Instrumentation

OpenTelemetry SDK in every service (vendor-neutral)
OTel Collector at cluster level (sampling, enrichment, routing)

Sampling strategy

Head-based sampling: 10% baseline
Tail-based: keep 100% of traces with an error or a slow span
Trade-off: cost vs. detail
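
The sampling policy boils down to one keep/drop decision per completed trace. The sketch below is a standalone illustration of that decision, not OTel Collector configuration; the function name and the 500 ms "slow" threshold are assumptions. Hashing the trace ID keeps the baseline decision deterministic, so every collector replica agrees on the same trace.

```python
import hashlib


def keep_trace(
    trace_id: str,
    has_error: bool,
    duration_ms: float,
    slow_threshold_ms: float = 500.0,
    baseline_ratio: float = 0.10,
) -> bool:
    """Keep every errored or slow trace; keep ~baseline_ratio of the rest."""
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    # Map the trace ID to a stable bucket in [0, 1).
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_ratio


kept = sum(keep_trace(f"trace-{i}", False, 40.0) for i in range(10_000))
print(f"healthy traces kept: {kept / 10_000:.1%}")
```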

Alerting

Alertmanager + PagerDuty
SLO-based (multi-burn-rate)
Severity tier: page (SEV-1) / ticket (SEV-2) / log (SEV-3)
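
Multi-burn-rate alerting compares the error ratio over a long and a short window against a burn-rate threshold. The sketch below uses the window pairs and thresholds recommended in the Site Reliability Workbook for a 30-day SLO (14.4 over 1h/5m, 6 over 6h/30m, 1 over 3d/6h); the function and the input dictionary shape are illustrative assumptions.

```python
from typing import Dict, Optional

# (long window, short window, burn-rate threshold, severity)
RULES = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("3d", "6h", 1.0, "ticket"),
]


def evaluate(slo: float, error_ratio_by_window: Dict[str, float]) -> Optional[str]:
    """Return the severity of the first rule whose two windows both exceed the threshold."""
    budget = 1 - slo  # allowed error ratio, e.g. 0.001 for a 99.9% SLO
    for long_w, short_w, threshold, severity in RULES:
        long_burn = error_ratio_by_window[long_w] / budget
        short_burn = error_ratio_by_window[short_w] / budget
        if long_burn > threshold and short_burn > threshold:
            return severity
    return None


fast_burn = {"5m": 0.03, "30m": 0.02, "1h": 0.02, "6h": 0.005, "3d": 0.001}
print(evaluate(0.999, fast_burn))
```

Requiring the short window to agree stops the alert from firing long after the incident has ended.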

Storage

Metrics: Mimir, 30 days hot, 1 year cold (S3)
Logs: Loki, 7 days hot, 90 days cold
Traces: Tempo, 14 days
Cost: control cardinality + enforce retention policies

Visualization

Grafana, multi-tenant
Per-team dashboard ownership
Auto-generated dashboards (per-service template)

💰 Sample 4: "Cut the AWS bill by 50%"

Step 1: Visibility

Enforce a tagging policy (every resource tagged)
Cost Explorer + per-team dashboard
Anomaly detection

Step 2: Quick wins

Idle resource cleanup ($)
gp2 → gp3 (up to 20% cheaper per GB)
Snapshot lifecycle policies
Unused EIPs, ELBs, NAT GWs
Right-sizing (Compute Optimizer)

Step 3: Compute strategy

Karpenter + 70% spot
1-year SP for the baseline
Graviton (ARM): roughly 20% cheaper per instance, up to 40% better price-performance by AWS's figures; benchmark your own workload

Step 4: Storage

S3 Intelligent-Tiering
Lifecycle: 30 d → IA, 90 d → Glacier IR, 365 d → delete
Aggressive log retention
Deregister old AMIs
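
The 30/90/365-day lifecycle can be written down as a rule document. The sketch below builds a dict in the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`; the rule ID and prefix are made-up values, and the boto3 call itself is left as a comment so the example runs without AWS credentials.

```python
def lifecycle_configuration(prefix: str = "") -> dict:
    """S3 lifecycle rule: Standard-IA at 30 days, Glacier IR at 90, delete at 365."""
    return {
        "Rules": [
            {
                "ID": "tiering-30-90-365",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    }


config = lifecycle_configuration(prefix="logs/")
print(config["Rules"][0]["Transitions"])

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=config)
```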

Step 5: Egress

VPC Endpoints (S3, DynamoDB)
CDN in front of S3
Minimize cross-region traffic
Cloudflare R2 as an alternative

Step 6: K8s specific

Kubecost per-namespace cost
HPA min replica audit (would 2 do instead of 3?)
Idle PVC cleanup
Spot for stateless workloads

Expected: month 1 = 20% from quick wins, month 6 = 35% total, month 12 = 50%

🌍 Sample 5: "Multi-region active-active setup"

State the trade-off (important)

"Is active-active really required? Active-passive (warm DR) is usually enough, about 50% cheaper, and much simpler. Is the customer requirement clear?"

If it really is required:

Data layer

Stateless services: easy, a deployment in every region
Stateful (DB): hard
  Aurora Global (multi-region read, single-writer)
  Spanner / CockroachDB / YugabyteDB (multi-write)
  DynamoDB Global Tables
  CAP theorem: consistency vs. availability trade-off
Cache (Redis): independent per region (eventual consistency is OK)

Network

DNS-based: Route 53 latency-based + health check
Anycast: Cloudflare / AWS Global Accelerator

Application layer

Idempotency (every request is retry-safe)
Distributed transaction → saga pattern
Conflict resolution (LWW, CRDT)
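
Two of these application-layer ideas fit in a few lines. The sketch below shows an idempotency-key wrapper that makes retries safe and a last-writer-wins (LWW) merge for conflicting writes. The in-memory dict stands in for a shared store, and all names are illustrative assumptions.

```python
from typing import Any, Callable, Dict


class IdempotentHandler:
    """Run the wrapped handler at most once per idempotency key."""

    def __init__(self, handler: Callable[[Any], Any]) -> None:
        self._handler = handler
        self._results: Dict[str, Any] = {}  # stand-in for a durable, shared store

    def handle(self, idempotency_key: str, payload: Any) -> Any:
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # retry: replay the stored result
        result = self._handler(payload)
        self._results[idempotency_key] = result
        return result


def lww_merge(a: dict, b: dict) -> dict:
    """Last-writer-wins: highest timestamp wins, region name breaks ties."""
    return max(a, b, key=lambda write: (write["ts"], write["region"]))


charges = []
handler = IdempotentHandler(lambda amount: charges.append(amount) or len(charges))
handler.handle("req-42", 100)
handler.handle("req-42", 100)  # client retry after a timeout
print(charges)

winner = lww_merge(
    {"ts": 1700000005, "region": "eu-west-1", "value": "B"},
    {"ts": 1700000001, "region": "us-east-1", "value": "A"},
)
print(winner["value"])
```

LWW silently discards the losing write, which is why it suits caches and profile fields better than balances or inventories.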

Failover

Automated DNS failover
Test it monthly (game day)
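
Automated failover usually waits for several consecutive failed health checks so that a single blip does not move traffic. A minimal sketch of that decision follows; the threshold of 3 is an assumed value, and the function is an illustration rather than any DNS provider's API.

```python
from typing import Sequence


def should_fail_over(health_checks: Sequence[bool], failure_threshold: int = 3) -> bool:
    """True when the most recent `failure_threshold` checks all failed."""
    if failure_threshold <= 0:
        raise ValueError("failure_threshold must be positive")
    if len(health_checks) < failure_threshold:
        return False  # not enough evidence yet
    return not any(health_checks[-failure_threshold:])


print(should_fail_over([True, True, False, True, False]))    # flapping, stay put
print(should_fail_over([True, True, False, False, False]))   # sustained failure
```

A game day is the place to check that the threshold, the check interval, and the DNS TTL together still fit inside the stated RTO.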

Cost

2x compute, 2x DB, cross-region transfer
Typically an 80-100% cost increase

🎓 Approach tips

Ask "why?": don't jump straight to complexity such as multi-region or microservices
Show the trade-off: "I would do X because Y, but it comes with trade-off Z"
Numbers: not "high scale" but "10K RPS, 50M users"
Think about failure modes, not just the happy path
Operability: "how do we deploy, observe, and run on-call?"
Cost: a real engineer thinks about cost
Evolutionary: this is the starting point today; what happens at tomorrow's scale?

🚫 Anti-patterns (don't)

❌ Spraying every buzzword ("Kubernetes + service mesh + Kafka + Cassandra + ML pipeline")
❌ Acting as if there is a single right answer
❌ Choosing without stating the trade-off
❌ Saying "high availability" without numbers
❌ Ignoring vendor lock-in
❌ Skipping failure modes
❌ Skipping operations / on-call
❌ Skipping cost

📚 Further reading

[System Design Interview — Alex Xu]
[Designing Data-Intensive Applications — Martin Kleppmann]
[The Site Reliability Workbook — Google]
05-Kubernetes/Production-Checklist.md
11-SRE/SLI-SLO-Error-Budget.md