OpenTelemetry Adoption: Vendor-Neutral Observability
"The Datadog SDK, the Prometheus client, and the Loki driver all write the same information over and over. If I wanted to drop Datadog, I would have to change code everywhere."
The OpenTelemetry (OTel) answer: one SDK, one protocol (OTLP), vendor-neutral.
🎯 Why OTel?
| Old model | With OTel |
|---|---|
| Datadog SDK + Prometheus + Loki client | One SDK |
| Switching vendors = code changes | Collector config changes, code stays the same |
| No trace ID in metrics/logs | Auto-correlation built in |
| No standard, everyone uses different tags | Semantic conventions standard |
| Polyglot stack (Go/Python/Node) needs separate tools | Same API and SDK design in every language |
🏛️ Architecture
┌─────────────────────────────────────────┐
│ Application                             │
│   ┌───────────────────┐                 │
│   │ OTel SDK          │  metrics/logs/  │
│   │ (auto-instrument) │  traces         │
│   └─────────┬─────────┘                 │
└─────────────┼───────────────────────────┘
              │ OTLP (gRPC or HTTP)
              ▼
┌─────────────────────────────────────────┐
│ OpenTelemetry Collector                 │
│                                         │
│ Receivers → Processors → Exporters      │
│                                         │
│ Receivers:                              │
│ - OTLP receiver (from apps)             │
│ - Prometheus receiver (scrape)          │
│ - Filelog receiver (file logs)          │
│                                         │
│ Processors:                             │
│ - Batch (efficient send)                │
│ - Memory limiter                        │
│ - Tail sampling (smart trace selection) │
│ - Resource detection (cloud metadata)   │
│ - Attributes (PII redact, transform)    │
│                                         │
│ Exporters:                              │
│ - Prometheus / Mimir / VictoriaMetrics  │
│ - Loki / Datadog / Splunk               │
│ - Tempo / Jaeger / Honeycomb            │
└─────────────────────────────────────────┘
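The Receivers → Processors → Exporters flow in the diagram can be pictured as a chain of functions. The following is a toy Python model of that idea, not the Collector's actual implementation (the real Collector is a Go binary driven by YAML config). All names here (`run_pipeline`, `drop_debug`, `add_env`) are hypothetical and exist only for illustration.

```python
from typing import Callable, Iterable, Optional

Record = dict  # one telemetry item (span, metric point, or log line)
Processor = Callable[[Record], Optional[Record]]  # return None to drop
Exporter = Callable[[list], None]


def run_pipeline(received: Iterable[Record],
                 processors: list[Processor],
                 exporters: list[Exporter]) -> None:
    """Pass each received record through every processor, then fan out."""
    out = []
    for record in received:
        for process in processors:
            record = process(record)
            if record is None:  # a processor dropped the record
                break
        else:
            out.append(record)  # record survived all processors
    for export in exporters:  # every exporter gets the same processed batch
        export(out)


def drop_debug(record: Record) -> Optional[Record]:
    """Hypothetical filter processor: discard DEBUG records."""
    return None if record.get("severity") == "DEBUG" else record


def add_env(record: Record) -> Record:
    """Hypothetical resource processor: stamp an environment attribute."""
    return {**record, "deployment.environment": "staging"}


backend_a, backend_b = [], []
run_pipeline(
    received=[{"msg": "ok", "severity": "INFO"},
              {"msg": "noise", "severity": "DEBUG"}],
    processors=[drop_debug, add_env],
    exporters=[backend_a.extend, backend_b.extend],
)
```

Swapping a backend in this model means changing only the `exporters` list, which mirrors the "Collector config changes, code stays the same" row in the table.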
🚀 1. SDK setup (examples)
Python (FastAPI)
# pip install opentelemetry-distro opentelemetry-exporter-otlp \
#   opentelemetry-instrumentation-fastapi \
#   opentelemetry-instrumentation-requests \
#   opentelemetry-instrumentation-psycopg2
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)   # incoming HTTP requests
RequestsInstrumentor().instrument()       # outgoing calls made with requests
Psycopg2Instrumentor().instrument()       # PostgreSQL queries via psycopg2
Zero-code auto-instrumentation with opentelemetry-instrument:
opentelemetry-instrument \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --logs_exporter otlp \
  --service_name my-api \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  uvicorn main:app
Node.js
// npm i @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
//   @opentelemetry/exporter-trace-otlp-grpc
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  serviceName: 'my-app',
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4317' }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
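Go
Auto-instrumentation in Go is limited, so set it up manually:
import (
    "context"
    "log"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

// Inside main() or your setup function:
ctx := context.Background()
// The endpoint is read from OTEL_EXPORTER_OTLP_ENDPOINT.
exporter, err := otlptracegrpc.New(ctx)
if err != nil {
    log.Fatal(err)
}
tp := trace.NewTracerProvider(
    trace.WithBatcher(exporter),
    trace.WithResource(resource.NewWithAttributes(
        semconv.SchemaURL, // NewWithAttributes takes the schema URL first
        semconv.ServiceName("my-api"),
    )),
)
otel.SetTracerProvider(tp)

// HTTP middleware (myHandler is your existing http.Handler):
handler := otelhttp.NewHandler(myHandler, "operation")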
Adding manual spans
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("compute_tax")
def compute_tax(amount, region):
    span = trace.get_current_span()
    span.set_attribute("amount", float(amount))
    span.set_attribute("region", region)
    # ...
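⚙️ 2. Collector setup (Kubernetes)
Helm install
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install otel-collector \
  open-telemetry/opentelemetry-collector \
  -n observability --create-namespace \
  -f values.yaml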
values.yaml
mode: deployment  # or: daemonset, statefulset
# tail_sampling and the loki exporter ship only in the contrib distribution
image:
  repository: otel/opentelemetry-collector-contrib
config:
  receivers:
    otlp:
      protocols:
        grpc: { endpoint: 0.0.0.0:4317 }
        http: { endpoint: 0.0.0.0:4318 }
    prometheus:
      config:
        scrape_configs:
          - job_name: 'kubernetes-pods'
            kubernetes_sd_configs:
              - role: pod
  processors:
    batch:
      timeout: 10s
      send_batch_size: 1024
    memory_limiter:
      check_interval: 1s
      limit_mib: 1500
      spike_limit_mib: 512
    resource:
      attributes:
        - key: deployment.environment
          value: "<ENV>"
          action: upsert
    # Tail sampling: keep 100% of traces with errors or slow spans, 5% of the rest
    tail_sampling:
      decision_wait: 10s
      policies:
        - name: errors
          type: status_code
          status_code: { status_codes: [ERROR] }
        - name: slow
          type: latency
          latency: { threshold_ms: 1000 }
        - name: random
          type: probabilistic
          probabilistic: { sampling_percentage: 5 }
    # PII redaction
    attributes:
      actions:
        - key: http.user_agent
          action: hash
        - key: user.email
          action: delete
  exporters:
    otlphttp/mimir:
      endpoint: http://mimir:9009/otlp
    # The loki exporter is deprecated in recent contrib releases; newer Loki
    # versions can ingest OTLP through an otlphttp exporter instead.
    loki:
      endpoint: http://loki:3100/loki/api/v1/push
    otlp/tempo:
      endpoint: tempo:4317
      tls: { insecure: true }
  service:
    pipelines:
      # Order matters: memory_limiter first, batch last (after any sampling).
      metrics:
        receivers: [otlp, prometheus]
        processors: [memory_limiter, resource, batch]
        exporters: [otlphttp/mimir]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, resource, attributes, batch]
        exporters: [loki]
      traces:
        receivers: [otlp]
        processors: [memory_limiter, resource, tail_sampling, attributes, batch]
        exporters: [otlp/tempo]
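To make the `hash`, `delete`, and `upsert` actions from the config more concrete, here is a minimal Python sketch of what such actions do to a span's attributes. It is an illustration only, not the Collector's code: the function name `apply_actions` is hypothetical, and the choice of SHA-256 for `hash` is an assumption of this sketch (check the attributes processor documentation for the algorithm it actually uses).

```python
import hashlib


def apply_actions(attributes: dict, actions: list[dict]) -> dict:
    """Apply hash/delete/upsert actions in order and return a new dict."""
    result = dict(attributes)  # never mutate the caller's attributes
    for action in actions:
        key, kind = action["key"], action["action"]
        if kind == "delete":
            result.pop(key, None)
        elif kind == "hash" and key in result:
            # SHA-256 is an assumption made for this sketch.
            result[key] = hashlib.sha256(str(result[key]).encode()).hexdigest()
        elif kind == "upsert":
            result[key] = action["value"]
    return result


span_attrs = {
    "http.user_agent": "curl/8.5.0",
    "user.email": "alice@example.com",
    "http.route": "/orders",
}
redacted = apply_actions(span_attrs, [
    {"key": "http.user_agent", "action": "hash"},
    {"key": "user.email", "action": "delete"},
    {"key": "deployment.environment", "action": "upsert", "value": "staging"},
])
```

After this runs, `redacted` no longer contains `user.email`, the user agent is replaced by a digest, and untouched keys such as `http.route` pass through unchanged.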
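🏷️ 3. Semantic Conventions
Standard attribute names are the foundation of OTel's vendor neutrality.
| Correct | Incorrect |
|---|---|
| http.method | httpMethod, method |
| http.status_code | statusCode |
| http.route | endpoint |
| http.target | path |
| db.system | database |
| db.statement | sql |
| service.name | app |
| service.version | version |
| deployment.environment | env |
Newer semantic-convention releases rename some of these keys (for example, http.method became http.request.method and http.status_code became http.response.status_code), so check which version your SDK emits.
Full list: https://opentelemetry.io/docs/specs/semconv/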
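During a migration it can help to rename ad-hoc attribute keys in one place. The sketch below turns the table above into a lookup; `SEMCONV_ALIASES` and `normalize_attributes` are hypothetical names for this example, not part of any OTel package.

```python
# Mapping taken from the table above: ad-hoc key -> semantic-convention key.
SEMCONV_ALIASES = {
    "httpMethod": "http.method",
    "method": "http.method",
    "statusCode": "http.status_code",
    "endpoint": "http.route",
    "path": "http.target",
    "database": "db.system",
    "sql": "db.statement",
    "app": "service.name",
    "version": "service.version",
    "env": "deployment.environment",
}


def normalize_attributes(attributes: dict) -> dict:
    """Rename known ad-hoc keys; leave everything else untouched."""
    return {SEMCONV_ALIASES.get(key, key): value
            for key, value in attributes.items()}


legacy = {
    "httpMethod": "GET",
    "statusCode": 200,
    "endpoint": "/orders/{id}",
    "team": "payments",  # not in the table, so it is kept as-is
}
normalized = normalize_attributes(legacy)
```

Keeping the mapping as data rather than scattered renames means a later change of convention touches a single dictionary.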
🎯 4. Trace ID propagation
When two services talk to each other, the trace ID is forwarded automatically:
Service A (handler) ─── traceparent: 00-abc-... ───▶ Service B (handler)
                                                         │
                                                         ▼
                                              DB query span is added
Service B's span is attached to Service A's trace.
This is the W3C traceparent header standard. Most instrumented HTTP clients inject it automatically.
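The header itself is a small fixed-format string: `version-traceid-parentid-flags`, all lowercase hex. The following stdlib-only sketch parses a version `00` header and builds the header for an outgoing call. In practice the SDK's propagator does this for you; `TraceParent`, `parse_traceparent`, and `child_traceparent` are hypothetical helper names for this example.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "00": 32 hex trace-id, 16 hex parent-id, 2 hex flags, all lowercase
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(incoming: TraceParent) -> str:
    """Build the header for an outgoing call: same trace, new span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    flags = "01" if incoming.sampled else "00"
    return f"00-{incoming.trace_id}-{new_span_id}-{flags}"


incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
outgoing = child_traceparent(incoming)
```

The trace ID is carried over unchanged while the span ID is regenerated, which is what lets Service B's spans attach to Service A's trace.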
Logger correlation
# Attach the trace_id to every log record
import logging
from opentelemetry.trace import get_current_span

class TraceIDFilter(logging.Filter):
    def filter(self, record):
        span = get_current_span()
        ctx = span.get_span_context() if span else None
        record.trace_id = format(ctx.trace_id, "032x") if ctx and ctx.is_valid else None
        return True

# JSON log: { "trace_id": "abc123...", "msg": "DB error" }
# In Loki: {service="api"} | json | trace_id="abc123..." → all related logs
In Grafana you can jump from a trace to its logs, and from a log to its trace, with a single click.
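A filter only attaches `trace_id` to the record; something still has to write it out. Below is a stdlib-only sketch of a JSON formatter that emits the field. To keep it runnable without the OTel packages, it uses a stand-in filter with a fixed ID (`StaticTraceIDFilter`, hypothetical); in a real service you would attach a filter like `TraceIDFilter` instead.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including trace_id if set."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            # trace_id is attached by a filter; None when no filter ran
            "trace_id": getattr(record, "trace_id", None),
        })


class StaticTraceIDFilter(logging.Filter):
    """Stand-in for a real trace-ID filter: attaches a fixed ID."""

    def __init__(self, trace_id: str):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        return True


stream = io.StringIO()  # capture output in memory for the example
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
handler.addFilter(StaticTraceIDFilter("4bf92f3577b34da6a3ce929d0e0e4736"))

logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output via the root logger
logger.error("DB error")

line = json.loads(stream.getvalue())
```

Because the output is one JSON object per line, a `| json` stage in Loki can extract `trace_id` without extra parsing rules.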
📊 5. Metrics
The OTel SDK can export in Prometheus format, but write new metrics with the OTel API:
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter(__name__)

# Counter
order_counter = meter.create_counter(
    "orders.total",
    unit="1",
    description="Number of orders created",
)
order_counter.add(1, {"status": "success", "tier": user.tier})

# Histogram (for latency)
processing_time = meter.create_histogram(
    "order.processing_time",
    unit="ms",
    description="Order processing latency",
)
processing_time.record(elapsed_ms, {"endpoint": "create"})

# Gauge / observable: in Python the callback yields Observation objects
def cb(options: CallbackOptions):
    yield Observation(get_queue_depth(), {"queue": "orders"})

queue_gauge = meter.create_observable_gauge("queue.depth", callbacks=[cb])
🔄 6. Migration: moving from your existing stack to OTel
Phase 1: Traces (least invasive)
- PR that adds the OTel SDK and auto-instrumentation
- Deploy the Collector
- Tempo as the backend (or send OTLP to your existing Datadog)
- No changes to the existing metric/log stack
Phase 2: Metrics
- OTel SDK metrics API
- Prometheus → OTLP exporter
- Gradually convert old Prometheus client code to the OTel API
Phase 3: Logs
- OTel logging bridge
- Old log forwarder (fluent-bit) → OTel Collector
- Keep Loki/ELK as they are, with the Collector in between
Phase 4: Vendor-neutral
- If you decide to leave a paid SaaS such as Datadog or New Relic, change only the Collector exporter. The code stays the same.
⚠️ Common pitfalls
Cardinality explosion
- If you add a user_id attribute to every span, you end up with millions of unique trace dimensions
- Fix: do not add high-cardinality attributes to metrics; they are already available on the trace
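A rough way to see why this matters: the worst-case number of time series for a metric is the product of the number of distinct values of each attribute. The sketch below does that arithmetic; the attribute counts are made-up illustration values and `series_count` is a hypothetical helper.

```python
from math import prod


def series_count(unique_values_per_attribute: dict) -> int:
    """Worst-case number of time series for one metric."""
    return prod(unique_values_per_attribute.values())


# Hypothetical attribute sets for an order counter.
low = series_count({"status": 3, "tier": 4, "endpoint": 20})
high = series_count({"status": 3, "tier": 4, "endpoint": 20, "user_id": 50_000})
```

With these assumed numbers, `low` is 240 series while adding `user_id` pushes `high` to 12,000,000, which is why per-user detail belongs on spans rather than metric attributes.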
Sampling strategy
- Head-based: 5% of everything (simple, but misses errors)
- Tail-based: 100% of errors and slow traces, 5% of the rest (smarter, done in the Collector)
- Recommended: tail-based for production
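The difference between the two strategies comes down to when the decision is made. The following simplified sketch contrasts them; it is not the Collector's tail sampling processor, the span dictionaries are a made-up shape, and `head_sample` / `tail_sample` are hypothetical names. It assumes every trace has at least one span.

```python
import random


def head_sample(rate: float, rng: random.Random) -> bool:
    """Head-based: decide at trace start, before the outcome is known."""
    return rng.random() < rate


def tail_sample(spans: list[dict], rate: float, rng: random.Random,
                slow_ms: float = 1000.0) -> bool:
    """Tail-based: decide after all spans of the trace have arrived."""
    if any(span["status"] == "ERROR" for span in spans):
        return True  # keep every trace that contains an error
    if max(span["duration_ms"] for span in spans) > slow_ms:
        return True  # keep every trace with a slow span
    return rng.random() < rate  # otherwise keep a random share


rng = random.Random(42)  # seeded so the example is reproducible
failed_trace = [
    {"status": "OK", "duration_ms": 12},
    {"status": "ERROR", "duration_ms": 30},
]
slow_trace = [{"status": "OK", "duration_ms": 1800}]

kept_failed = tail_sample(failed_trace, rate=0.05, rng=rng)
kept_slow = tail_sample(slow_trace, rate=0.05, rng=rng)
# Head sampling never looks at the trace, so about 95% of the error
# traces would be dropped along with everything else.
head_kept = sum(head_sample(0.05, rng) for _ in range(10_000))
```

The cost of the tail-based approach is that spans must be buffered until the trace is complete, which is why it runs in the Collector rather than in each application.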
Auto-instrumentation noise
- Auto-instrumentation sometimes creates spans you do not want (every HTTP redirect, DB ping)
- Silence them with a span filter or sampler
Performance overhead
- SDK + Collector overhead is typically under 1%, but the batch processor must be tuned correctly
- The memory limiter is mandatory (prevents OOM)
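To show what "tuning the batch processor" trades off, here is a toy batcher that flushes either when the batch is full or when the oldest item has waited too long. It only models the idea behind `send_batch_size` and `timeout`; the `Batcher` class is hypothetical, and the clock is injectable purely so the example runs instantly.

```python
import time
from typing import Callable


class Batcher:
    """Flush when the batch is full or when the oldest item is too old."""

    def __init__(self, send: Callable[[list], None],
                 send_batch_size: int = 1024, timeout_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._send = send
        self._size = send_batch_size
        self._timeout = timeout_s
        self._clock = clock
        self._items: list = []
        self._first_at = 0.0

    def add(self, item) -> None:
        if not self._items:
            self._first_at = self._clock()  # start the timeout window
        self._items.append(item)
        if len(self._items) >= self._size:
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes a partial batch after the timeout."""
        if self._items and self._clock() - self._first_at >= self._timeout:
            self.flush()

    def flush(self) -> None:
        if self._items:
            batch, self._items = self._items, []
            self._send(batch)


sent = []
now = [0.0]  # fake clock so the example does not sleep
batcher = Batcher(sent.append, send_batch_size=3, timeout_s=10.0,
                  clock=lambda: now[0])
for i in range(4):
    batcher.add(i)   # the first three items fill a batch and are sent
now[0] = 11.0
batcher.tick()       # the fourth item goes out because the timeout expired
```

A larger batch size means fewer, bigger network calls but more memory held in the buffer; a shorter timeout lowers latency to the backend at the cost of smaller batches.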
📚 Further reading
- opentelemetry.io
- OTel Demo App: a complete example microservice stack
- OTel Collector Receivers/Processors/Exporters list
- 07-Observability/Prometheus-Best-Practices.md (coming soon)