Spot Instance Strategy: 70% Savings

"Spot is about 70% cheaper. A team that avoids it 'because it can be interrupted' wastes $5K-50K a month. The right workloads plus the right tooling gets you to 30%+ spot adoption."

This guide covers workload suitability, graceful interruption handling, and a mixed fleet strategy with Karpenter for AWS Spot, GCP Preemptible, and Azure Spot.


🎯 What Is Spot?

Spot/Preemptible: the cloud provider's idle capacity. It is typically 70-90% cheaper, but the provider can reclaim it on short notice: 2 minutes on AWS, 30 seconds on GCP and Azure.

| Cloud | Spot offering | Discount |
|-------|---------------|----------|
| AWS | Spot Instance | 70-90% |
| GCP | Preemptible / Spot VM | 60-91% |
| Azure | Spot VM | 50-90% |

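✅ Suitable Workloads

| Workload | Spot suitability |
|----------|------------------|
| Stateless API (HTTP) | ✅ with more than 1 replica behind a load balancer |
| CI runner | ✅ ephemeral |
| ML training (batch) | ✅✅ if checkpointing is in place |
| Data ETL | ✅ if idempotent |
| Background queue worker | ✅✅ if retries are in place |
| Render farm | |
| Dev / staging | ✅✅ |
| Cache (Redis primary) | ❌ holds state |
| Database primary | ❌ risk of data loss |
| Stateful Kafka | ⚠️ OK if replicated |
| Real-time critical | ❌ interruptions are felt immediately |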

🔑 Rule of thumb: stateless + more than 1 replica + retries means spot is OK.
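The rule of thumb can be turned into a small check to run over a service inventory. This is a minimal sketch: the `Workload` fields and the `spot_ok` helper are hypothetical names chosen for illustration, and real classification would also weigh the per-workload notes in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    stateless: bool   # keeps no data that is lost when the node disappears
    replicas: int     # number of running copies
    retries: bool     # interrupted work is retried or resumed elsewhere


def spot_ok(w: Workload) -> bool:
    """Rule of thumb: stateless + more than one replica + retries."""
    return w.stateless and w.replicas > 1 and w.retries
```

A stateless API with three replicas and client retries passes; a single-replica service or a database primary does not.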


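🚀 AWS Spot Best Practices

EC2 Spot

# One-time Spot request for a single instance type. run-instances takes one type per call;
# to diversify across types (lower interruption risk), use an Auto Scaling group or EC2 Fleet.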
aws ec2 run-instances \
  --instance-market-options 'MarketType=spot,SpotOptions={SpotInstanceType=one-time}' \
  --image-id ami-xxx \
  --instance-type m5.large

EKS Spot Node Group

eksctl create nodegroup --config-file=- <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod
  region: us-east-1   # assumption: eksctl needs a region here; set it to your cluster's region
managedNodeGroups:
  - name: spot
    instanceTypes: [m5.large, m5a.large, m5n.large, m4.large]   # diversify
    spot: true
    minSize: 2
    maxSize: 20
    labels: {capacity-type: spot}
    taints:
      - key: capacity-type
        value: spot
        effect: NoSchedule
EOF

Karpenter (recommended)

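apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-pool
spec:
  template:
    spec:
      nodeClassRef:                 # required in v1; assumes an EC2NodeClass named "default" exists
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h             # 30-day max node lifetime (under template.spec in the v1 API)
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: [m5, m5a, m5n, m6i, m7i]   # diversify
        - key: kubernetes.io/arch
          operator: In
          values: [amd64]           # the families above are all x86; add Graviton families before allowing arm64
  limits:
    cpu: "1000"
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 1m            # illustrative value; tune for your cluster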

🔑 With an interruption queue configured, Karpenter handles spot interruption notices itself and replaces the node automatically.


🛡️ Graceful Interruption Handling

Interruption notice

  • AWS: notice 2 minutes ahead (via instance metadata)
  • GCP: 30 seconds
  • Azure: 30 seconds
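On AWS, the notice in the first bullet can be read from the instance metadata service. The sketch below assumes IMDSv2 and the `spot/instance-action` metadata path, which returns HTTP 404 until an interruption is scheduled. The function names `fetch_instance_action` and `seconds_remaining` are hypothetical, and the fetch only works when run on an EC2 instance.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw JSON notice, or None when no interruption is scheduled."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice yet
            return None
        raise


def seconds_remaining(notice_json: str, now: datetime) -> float:
    """Seconds between `now` (timezone-aware) and the action time in the notice."""
    notice = json.loads(notice_json)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    return (when.replace(tzinfo=timezone.utc) - now).total_seconds()
```

A process would typically poll `fetch_instance_action()` every few seconds and start draining once it returns a notice. In a Kubernetes cluster the termination handler or Karpenter does this for you.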

On the pod side

# PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments
spec:
  minAvailable: 2
  selector:
    matchLabels: {app: payments}

Termination handler

# AWS Node Termination Handler (chart published by AWS as an OCI artifact)
helm install aws-node-termination-handler \
  oci://public.ecr.aws/aws-ec2/helm/aws-node-termination-handler \
  -n kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableScheduledEventDraining=true

→ Interruption notice → cordon + drain → pods are evicted and rescheduled onto healthy nodes.
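The drain only helps if the application exits cleanly when its pod is evicted. On eviction the kubelet sends SIGTERM to the container, so a queue worker can finish its current job and stop taking new ones. A minimal sketch, with `GracefulWorker` as a hypothetical class name and an in-memory job list standing in for a real queue:

```python
import signal
import threading
from typing import Callable, Iterable, List


class GracefulWorker:
    """Processes jobs until a stop is requested, never abandoning a job midway."""

    def __init__(self) -> None:
        self._stop = threading.Event()

    def install_handler(self) -> None:
        # Call from the main thread; SIGTERM is what the kubelet sends on eviction.
        signal.signal(signal.SIGTERM, lambda signum, frame: self.request_stop())

    def request_stop(self) -> None:
        self._stop.set()

    def run(self, jobs: Iterable, handle: Callable) -> List:
        finished = []
        for job in jobs:
            if self._stop.is_set():
                break  # leave remaining jobs for another replica to pick up
            handle(job)
            finished.append(job)
        return finished
```

The pod's `terminationGracePeriodSeconds` should be long enough for one job to finish, and short enough to fit inside the provider's notice window.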


🎯 Mixed Fleet Strategy

[Workload Type]               [Capacity Type]
─────────────────────────────────────────────
Critical / DB                  → on-demand (100%)
Stateless API replica > 3       → on-demand (1) + spot (rest)
Background queue                → spot (100%)
CI runner                       → spot (100%)
Dev/staging                     → spot (100%)
ML training                     → spot (checkpointing)
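The "on-demand (1) + spot (rest)" row can be expressed as a small helper for planning replica counts. This is a sketch under stated assumptions: `split_replicas` and its `on_demand_base` parameter are hypothetical names, and the function only does the arithmetic. It does not place pods.

```python
from typing import Dict


def split_replicas(replicas: int, on_demand_base: int = 1) -> Dict[str, int]:
    """Keep `on_demand_base` replicas on on-demand capacity, the rest on spot."""
    if replicas < 0 or on_demand_base < 0:
        raise ValueError("replicas and on_demand_base must be non-negative")
    on_demand = min(replicas, on_demand_base)
    return {"on-demand": on_demand, "spot": replicas - on_demand}
```

For a four-replica stateless API with the default base of one, this yields one on-demand replica and three spot replicas. Setting `on_demand_base=0` models the 100% spot rows.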

K8s tolerations + node affinity

spec:
  tolerations:
    - key: capacity-type
      operator: Equal
      value: spot
      effect: NoSchedule
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: capacity-type
                operator: In
                values: [spot]

📊 Cost Savings Calculation

m5.large baseline (illustrative prices; spot prices vary by region and over time):
  On-demand: $0.096/hour × 720 hours = $69/month
  Spot:      $0.025/hour × 720 hours = $18/month
  Savings:   $51/month (74%)

10-node cluster:
  All on-demand: $690/month
  All spot:      $180/month
  Mixed (3 OD + 7 spot): $333/month   ← 52% savings at an acceptable risk
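The same arithmetic can be scripted to try other fleet mixes. The sketch below uses the hourly prices from the example above as defaults. The function names are hypothetical, and the prices are the document's illustrative figures, not live quotes.

```python
HOURS_PER_MONTH = 720


def monthly_cost(on_demand_nodes: int, spot_nodes: int,
                 on_demand_hourly: float = 0.096,
                 spot_hourly: float = 0.025) -> float:
    """Monthly cost in dollars for a fleet of identical nodes."""
    hourly = on_demand_nodes * on_demand_hourly + spot_nodes * spot_hourly
    return HOURS_PER_MONTH * hourly


def savings_pct(on_demand_nodes: int, spot_nodes: int) -> float:
    """Savings versus running the same node count entirely on-demand."""
    total = on_demand_nodes + spot_nodes
    if total == 0:
        raise ValueError("fleet must contain at least one node")
    baseline = monthly_cost(total, 0)
    return 100 * (1 - monthly_cost(on_demand_nodes, spot_nodes) / baseline)
```

`monthly_cost(3, 7)` and `savings_pct(3, 7)` reproduce the mixed-fleet row. Results can differ from the table by a dollar or so because the table rounds the per-node cost to $69 before multiplying.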

🚫 Anti-Pattern Table

| Anti-pattern | Why it hurts | Do instead |
|--------------|--------------|------------|
| Not using spot | Overpaying by up to 70% | Mixed fleet |
| Spot on a single instance type | High interruption risk | Diversify (5+ instance types) |
| DB primary on spot | Data loss | On-demand |
| No PDB | Pods lost during an interruption | Set minAvailable |
| No termination handler | Hard kill, requests fail | Drain handler |
| 100% spot for prod-critical workloads | An interruption becomes a disaster | 30-50% spot |
| ML training without checkpoints | Interruption = start from scratch | Checkpoint per epoch |
| Ignoring spot price fluctuation | No price or allocation strategy | Let Karpenter choose instances automatically |
| Workload spot suitability overlooked | Real-time workloads end up on spot | Classify workloads |
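The checkpoint-per-epoch row is the easiest one to illustrate. The sketch below records only the next epoch number in a JSON file, written atomically so an interruption mid-write cannot corrupt it. A real training job would save model and optimizer state to durable storage as well. All names here (`load_epoch`, `save_epoch`, `train`) are hypothetical.

```python
import json
import os
import tempfile
from typing import Callable


def load_epoch(path: str) -> int:
    """Return the next epoch to run (0 when no checkpoint exists yet)."""
    try:
        with open(path) as f:
            return json.load(f)["next_epoch"]
    except FileNotFoundError:
        return 0


def save_epoch(path: str, next_epoch: int) -> None:
    """Write to a temp file, then rename over the checkpoint in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_epoch": next_epoch}, f)
    os.replace(tmp, path)


def train(path: str, total_epochs: int, run_epoch: Callable[[int], None]) -> None:
    """Resume from the last checkpoint and checkpoint after every epoch."""
    for epoch in range(load_epoch(path), total_epochs):
        run_epoch(epoch)
        save_epoch(path, epoch + 1)
```

If the node is reclaimed during epoch 2, the replacement pod calls `train` again and restarts at epoch 2 instead of epoch 0, so at most one epoch of work is lost.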

📋 Spot Adoption Checklist

[ ] Workload classification: list of spot-suitable workloads
[ ] Mixed fleet with Karpenter or an equivalent
[ ] Diversify: 5+ instance types
[ ] PDB on every workload
[ ] Termination handler (aws-node-termination-handler)
[ ] Stateless services: more than 1 replica + LB
[ ] Tolerations + node affinity
[ ] ML workloads: checkpoint per epoch
[ ] Cost dashboard: spot adoption %
[ ] Quarterly: spot interruption rate review
[ ] DR: fallback to on-demand when all spot capacity is unavailable
[ ] Annual savings target: 30%+

"Spot is about 70% cheaper, and there is no excuse for leaving it unused. Workload classification + Karpenter + a termination handler = $5K-50K in monthly savings for about 2 weeks of implementation."