AI-Augmented Operations — DevOps Work with LLMs#
"Treating the LLM as a 'chatbot' was 2024. In 2026 the LLM is the DevOps engineer's co-pilot: log analysis, runbook generation, postmortem drafts, alarm triage. Not 'AI replaces SRE' but 'AI accelerates SRE'."
This guide covers the practical uses of LLMs in the day-to-day DevOps workflow and common agent patterns, and answers the question of what should be automated and what should stay with a human.
🎯 Use Case Matrix#
| Use case | Automation level | Risk |
|---|---|---|
| Log analysis / pattern detection | Fully automatic | Low |
| Runbook generation (draft) | Human-in-the-loop | Low |
| Postmortem draft | Human review | Low |
| Alarm triage (severity, owner) | Fully automatic | Low |
| Code review (advisory) | Pre-screen, human approval | Low |
| Incident summary (executive) | Fully automatic | Low |
| K8s manifest generation | Human review | Medium |
| Auto-remediation | Behind an approval gate | High |
| Production query | Read-only OK, writes stay human | High |
| Direct production deploy | Don't | Very high |
🔑 Rule: if the LLM's action is destructive, keep a human in the loop. Read-only / advisory actions can be fully automatic.
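The rule above can be sketched as a tiny policy gate. All names here (`ActionClass`, `requires_approval`) are illustrative, not from any specific framework:

```python
# Hypothetical policy gate: destructive actions always require a human,
# everything else may run automatically.
from enum import Enum

class ActionClass(Enum):
    READ_ONLY = "read_only"      # queries, summaries, drafts
    ADVISORY = "advisory"        # suggestions a human applies
    DESTRUCTIVE = "destructive"  # restarts, deletes, deploys

def requires_approval(action: ActionClass) -> bool:
    """Only destructive actions go through the approval gate."""
    return action is ActionClass.DESTRUCTIVE
```

The point of making this an explicit function is that every agent path has to pass through one auditable gate instead of ad-hoc checks.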
🛠️ 1. Log Analysis (Automatic)#
Pattern detection#
```python
import json

import anthropic

client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def analyze_logs(log_text: str) -> dict:
    response = await client.messages.create(
        model="claude-opus-4-7",
        system=LOG_ANALYZER_PROMPT,
        messages=[{"role": "user", "content": log_text}],
        max_tokens=1024,
    )
    # The system prompt asks for JSON, so parse the first content block
    return json.loads(response.content[0].text)
```
LOG_ANALYZER_PROMPT:
```
You are an SRE assistant. For the given log:
1. Find anomaly / error patterns
2. Determine severity
3. Suggest a likely root cause
4. Recommend a runbook step
Return JSON output.
```
Pipeline integration#
```
# Falco alert → Lambda → LLM → enriched alert
SourceLog → LLM analyze → {
    severity: "high",
    root_cause_hypothesis: "DB connection pool exhaustion",
    recommended_action: "Pool size increase or restart"
} → Slack alert (enriched)
```
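The enriched payload can then be rendered into an alert message with a small formatter. This is a sketch: the function name is ours, the field names mirror the LLM output shown above, and the actual Slack posting is out of scope:

```python
# Hypothetical formatter for the enriched alert; field names follow the
# LLM analysis output ({severity, root_cause_hypothesis, recommended_action}).
def format_enriched_alert(analysis: dict) -> str:
    return (
        f"[{analysis['severity'].upper()}] "
        f"Hypothesis: {analysis['root_cause_hypothesis']} | "
        f"Action: {analysis['recommended_action']}"
    )
```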
🛠️ 2. Runbook Generation#
```python
def generate_runbook(alert_definition: str) -> str:
    return llm.complete(f"""
    Write a runbook for the following Prometheus alert:
    {alert_definition}
    Format:
    - TL;DR
    - 1. Verify
    - 2. Quick Mitigation (5-10 min)
    - 3. Investigation
    - 4. Common Causes (table)
    - 5. Escalation
    - 6. After the fix
    """)
```
🔑 The LLM produces a draft; an engineer reviews it and applies it to production.
🛠️ 3. Alarm Triage Bot#
```python
async def triage_alert(alert: dict) -> dict:
    # fetch_metrics / fetch_traces / fetch_logs query the observability stack
    response = await llm.complete(f"""
    Alert details: {alert}
    Recent metrics (Prometheus): {fetch_metrics(alert)}
    Recent traces (Tempo): {fetch_traces(alert)}
    Recent logs (Loki): {fetch_logs(alert)}
    Determine:
    1. Severity (page/warn/info)
    2. Owner team
    3. Likely cause (3 hypotheses)
    4. Recommended runbook URL
    """)
    return parse(response)

# Alertmanager webhook → triage → enriched route
```
→ A generic alarm becomes a specific, enriched alarm with a recommended action.
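The webhook side of this flow can be sketched as follows. The payload keys follow the Alertmanager webhook format; `triage_input` is a hypothetical helper that assembles what the triage prompt needs:

```python
# Extract triage-relevant fields from an Alertmanager webhook payload.
# Only firing alerts are triaged; resolved ones are skipped.
def triage_input(payload: dict) -> list[dict]:
    return [
        {
            "name": a["labels"].get("alertname", "unknown"),
            "labels": a["labels"],
            "summary": a.get("annotations", {}).get("summary", ""),
        }
        for a in payload.get("alerts", [])
        if a.get("status") == "firing"
    ]
```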
🛠️ 4. Postmortem Draft#
```python
def draft_postmortem(timeline: list, metrics: dict) -> str:
    return llm.complete(f"""
    Timeline: {timeline}
    Metrics during incident: {metrics}
    Write a blameless postmortem:
    - Executive summary (3 sentences)
    - Impact (customer + revenue)
    - Timeline (UTC)
    - Root cause (5 Whys)
    - What went well
    - What went wrong
    - Where we got lucky
    - Action items (owner + due date placeholder)
    """)
```
→ A ~70% complete draft → 30 minutes of engineer review + edits → final.
🤖 Agent Patterns#
Read-only Agent (auto-OK)#
```
[Prometheus alert] → [LLM Agent: Loki query, Tempo trace, Grafana check]
        ↓
[Slack: enriched alert + suggested action]
```
Action Agent (human-in-the-loop)#
```
[Issue] → [Agent proposes action]
        ↓
[Slack: "Restart pod X? [Approve / Deny]"]
        ↓ Approve
[kubectl rollout restart deploy/X]
```
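The approve/deny step can be sketched as a small handler. `execute` is a stand-in for the real kubectl call, passed in so the gate itself stays testable; the names are ours, not from any framework:

```python
# Human-in-the-loop gate: the agent's proposal runs only on an explicit
# "approve" decision coming back from Slack.
def handle_decision(proposal: dict, decision: str, execute) -> str:
    if decision != "approve":
        return "denied"           # nothing runs without explicit approval
    execute(proposal["command"])  # e.g. kubectl rollout restart deploy/X
    return "executed"
```

Note that anything other than the literal "approve" (deny, timeout, typo) falls through to "denied" — fail closed, not open.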
Auto-Remediation Agent (high-confidence)#
```
[Specific known issue: HighDiskUsage]
        ↓
[Agent: cat /etc/logrotate.conf check + execute]
        ↓ (confidence > 0.95)
[logrotate -f]
        ↓
[Slack: "Auto-remediated, free disk: 80%"]
```
⚠️ Auto-remediation is for whitelisted actions only. Unknown issue → human.
🛡️ Production Concerns#
1. Hallucination#
- LLMs rarely say "I don't know" → require a confidence flag
- Citations mandatory (RAG)
- See Safety-and-Guardrails.md
2. Cost#
- Semantic cache: keep answers to frequent queries
- Token limit per request
- Be selective: detailed analysis only for SEV1+
3. PII#
- If logs contain PII → mask + hash
- Audit log: what the agent did and which data it saw
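The mask + hash idea can be sketched as below. The regexes are illustrative (emails and IPv4 only), not an exhaustive PII detector; the hash prefix lets the same entity still be correlated across log lines:

```python
import hashlib
import re

# Illustrative patterns only; a production filter needs a proper PII detector.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),      # email addresses
    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),  # IPv4 addresses
]

def mask_pii(text: str) -> str:
    """Replace each PII match with a short stable hash token."""
    for pattern in PII_PATTERNS:
        text = pattern.sub(
            lambda m: f"<pii:{hashlib.sha256(m.group().encode()).hexdigest()[:8]}>",
            text,
        )
    return text
```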
4. Latency#
- An LLM call takes 1-5 seconds
- Keep it off the critical path (async alarm enrichment is OK)
5. Audit trail#
```python
audit_log.write({
    "agent": "alarm-triage",
    "alert_id": alert.id,
    "input_hash": hash(alert),
    "llm_model": "claude-opus-4-7",
    "decision": triage_result,
    "action_taken": action,
    "human_approved": approval_id,
})
```
🚧 Vendor Stack — 2026#
| Tool | Niche |
|---|---|
| Anthropic Claude | Best for complex reasoning |
| OpenAI GPT-4/5 | Broad ecosystem |
| Google Gemini | Multi-modal, large context window |
| Self-hosted Llama | Privacy + cost (for large orgs) |
| LangChain / LangGraph | Agent framework |
| AutoGen (MS) | Multi-agent orchestration |
| CrewAI | Role-based agents |
🚫 Anti-Pattern Table#
| Anti-pattern | Why it's bad | Do instead |
|---|---|---|
| Direct production actions via LLM | Hallucination + destructive | Human-in-the-loop |
| Using the LLM "for everything" | Cost + latency | Specific use cases |
| No hallucination control | Fake reports | Citation + confidence |
| No PII filter | Compliance violation | Mask + hash |
| No cost monitoring | Budget surprises | Token + cost dashboard |
| No audit log | EU AI Act violation | Per-call audit |
| No prompt versioning | Regressions go unseen | Versioned in Git + evals |
| No tool allowlist | Agent can do anything | Whitelist |
| No test set | "I think it works" | Eval set + metrics |
| No LLM cache | Repeated inference for the same queries | Semantic cache |
📋 AI-Augmented Ops Checklist#
[ ] Use case prioritization (read-only first)
[ ] LLM provider selection (Claude / GPT / Gemini)
[ ] Prompts versioned in Git
[ ] PII filter (input + output)
[ ] Citation / confidence (hallucination control)
[ ] Audit log (every LLM call)
[ ] Token + cost dashboard
[ ] Semantic cache (frequent queries)
[ ] Action allowlist (for destructive ops)
[ ] Human-in-the-loop for critical actions
[ ] Eval set (regression testing)
[ ] Quarterly agent effectiveness review
[ ] EU AI Act compliance (standard obligations unless classified high-risk)
[ ] Onboarding: agent usage rules for devs
📚 References#
- Anthropic Claude API — docs.anthropic.com
- LangChain — python.langchain.com
- AutoGen — microsoft.github.io/autogen
- CrewAI — crewai.com
- OpenTelemetry GenAI Semantic Conventions — opentelemetry.io
Related:
- LLM-in-Production.md
- RAG-Architecture.md
- Prompt-Engineering-for-Ops.md
- Safety-and-Guardrails.md
- Self-Hosted-LLM.md
- Model-Cost-Optimization.md
- 19-Compliance/EU-AI-Act.md
"AI Augmented Ops 'SRE'i replace etmek' değil — SRE'in productivity'sini katlamak. Log triage 30 dk → 3 dk; postmortem draft 4 saat → 30 dk. Mühendis mantık + onay'a odaklanır."