Chapter 54

AI Monitoring

Monitoring AI Systems
📡 Deploying the model is not the end
In production, data drift, accuracy decay, latency spikes, and hallucinations all happen silently. Without monitoring, you will never know when the model started failing.

Four Types of Metrics

  • System: CPU, GPU, RAM, QPS, latency p50/p95/p99.
  • Model: accuracy, F1, RMSE (once ground truth arrives).
  • Data: input distribution, missing values, drift score.
  • Business: conversion, click-through, revenue per request.
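
To make the latency percentiles in the System row concrete, here is a minimal sketch that computes p50/p95/p99 from raw latency samples using only the standard library. The function name `latency_percentiles` and the sample values are hypothetical; in a real deployment these numbers usually come from the metrics backend rather than from hand-written code.

```python
from statistics import quantiles

def latency_percentiles(samples_ms):
    """Return p50/p95/p99 for a list of latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical samples: one request at each latency from 0 to 100 ms
stats = latency_percentiles(list(range(101)))
```

Tail percentiles (p95, p99) matter more than the average because a small share of very slow requests can violate an SLA while the mean still looks healthy.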

Data Drift Detection

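  • Covariate drift: P(X) has changed (the input distribution shifts).
  • Concept drift: P(Y|X) has changed (the relationship between inputs and labels shifts).
  • Tests: KS test, PSI (Population Stability Index), Wasserstein distance.

# Drift report with Evidently (this uses the `evidently.report` API; newer releases may differ)
# train_df and prod_df are assumed to be pandas DataFrames with the same columns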
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=prod_df)
report.save_html("drift.html")
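
To show what a drift score such as PSI actually measures, the following is a minimal pure-Python sketch for a single numeric feature. The function name `psi`, the choice of 10 equal-width bins, and the `eps` floor for empty bins are assumptions made for illustration; libraries such as Evidently may bin and smooth differently.

```python
import math

def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from causing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    # PSI = sum over bins of (current - reference) * ln(current / reference)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

# Hypothetical data: the production feature is shifted by +3
train_values = [i / 100 for i in range(1000)]
prod_values = [v + 3 for v in train_values]
score_same = psi(train_values, train_values)
score_shifted = psi(train_values, prod_values)
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as moderate shift, and above 0.25 as significant shift, but the right threshold depends on the feature and should be tuned.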

Tooling Stack

  • Metrics: Prometheus + Grafana.
  • Logs: Loki / ELK / CloudWatch.
  • Traces: OpenTelemetry + Jaeger.
  • ML-specific: Evidently, Arize, WhyLabs, Fiddler.
  • LLM Observability: LangSmith, Langfuse, Helicone.
  • Errors: Sentry.

FastAPI + Prometheus

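from fastapi import FastAPI
from prometheus_client import Histogram
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

# Instrument all routes and expose the metrics endpoint (/metrics by default)
Instrumentator().instrument(app).expose(app)

# Custom metric: histogram of model inference latency in seconds
INFER_TIME = Histogram("inference_seconds", "Model inference latency")

@INFER_TIME.time()
def run(x):
    # `model` is assumed to be loaded elsewhere in the application
    return model(x)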

Alerts You Should Always Have

  • p99 latency > SLA.
  • Error rate > 1%.
  • GPU OOM / restart loop.
  • Drift score crosses its threshold.
  • Token/cost spike (LLM).
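
The threshold-style alerts above can be sketched as a simple check over a metrics snapshot. In practice these rules usually live in the alerting system (for example Prometheus alert rules), so treat this only as an illustration of the logic. The `Snapshot` class, the `fired_alerts` function, the 500 ms SLA, and the 0.25 drift threshold are hypothetical; only the 1% error rate comes from the list.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    p99_latency_ms: float
    error_rate: float   # fraction of failed requests, 0.0 to 1.0
    drift_score: float

def fired_alerts(s, sla_ms=500.0, max_error_rate=0.01, drift_threshold=0.25):
    """Return the names of all alert conditions that the snapshot violates."""
    alerts = []
    if s.p99_latency_ms > sla_ms:
        alerts.append("p99_latency")
    if s.error_rate > max_error_rate:
        alerts.append("error_rate")
    if s.drift_score > drift_threshold:
        alerts.append("drift")
    return alerts

# Hypothetical snapshot: slow tail and drifting inputs, but few errors
alerts = fired_alerts(Snapshot(p99_latency_ms=620.0, error_rate=0.004, drift_score=0.31))
```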

Feedback Loop

Store every production prediction together with its actual outcome, then feed the stored pairs into a periodic retraining pipeline (Airflow/Prefect). This is the foundation of continuous learning.
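
The core of the feedback loop is joining logged predictions with outcomes that arrive later. A minimal sketch, assuming both are keyed by a request id; the function names and the dictionary layout are hypothetical, and a real pipeline would read from a database or feature store instead of in-memory dicts.

```python
def build_training_rows(predictions, outcomes):
    """Join logged predictions with later-arriving outcomes by request id."""
    rows = []
    for request_id, pred in predictions.items():
        if request_id in outcomes:  # ground truth may not have arrived yet
            rows.append({
                "request_id": request_id,
                "features": pred["features"],
                "predicted": pred["predicted"],
                "actual": outcomes[request_id],
            })
    return rows

def accuracy(rows):
    """Share of joined rows where the prediction matched the outcome."""
    if not rows:
        return None
    return sum(r["predicted"] == r["actual"] for r in rows) / len(rows)

# Hypothetical logs: outcome for request "c" has not arrived yet
predictions = {
    "a": {"features": [0.2, 1.0], "predicted": 1},
    "b": {"features": [0.9, 0.0], "predicted": 0},
    "c": {"features": [0.5, 0.5], "predicted": 1},
}
outcomes = {"a": 1, "b": 1}
rows = build_training_rows(predictions, outcomes)
live_accuracy = accuracy(rows)
```

The joined rows serve two purposes: they give the live accuracy needed for the Model metrics, and they become fresh labeled data for the next retraining run.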

LLM-specific Monitoring

  • Hallucination rate (LLM-as-judge / RAGAS).
  • Token usage, cost per request.
  • Prompt + response logging (with PII redacted).
  • User feedback (thumbs up/down).
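
Two of these items, PII redaction before logging and cost per request, are easy to sketch. The regular expressions below are deliberately simple and will miss many real-world PII formats, and the per-1K-token prices are hypothetical placeholders, not any provider's actual pricing.

```python
import re

# Simplistic patterns for illustration only
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def redact(text):
    """Mask e-mail addresses and phone-like numbers before logging."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def request_cost(prompt_tokens, completion_tokens,
                 usd_per_1k_prompt, usd_per_1k_completion):
    """Cost of one LLM call given token counts and per-1K-token prices."""
    return (prompt_tokens / 1000 * usd_per_1k_prompt
            + completion_tokens / 1000 * usd_per_1k_completion)

safe_prompt = redact("Mail a.b@example.com or call +880 1712-345678")
cost = request_cost(1000, 500, usd_per_1k_prompt=0.5, usd_per_1k_completion=1.5)
```

Redaction should run before the text reaches the log store, so that raw PII never has to be cleaned up after the fact.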
💡 Shadow Deployment
The new model receives a copy of production traffic, but its responses are never sent to users. Compare the old and new models' outputs before rolling out safely.
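
A shadow deployment can be sketched as a request handler that always returns the live model's answer and records the shadow model's answer for comparison. The names `handle_request`, `live_model`, and `shadow_model` are hypothetical, and the shadow call is made inline for simplicity; a real system would typically run it asynchronously so it adds no user-facing latency.

```python
import logging

log = logging.getLogger("shadow")

def handle_request(x, live_model, shadow_model, comparisons):
    """Serve the live model's answer; run the shadow model on the same input."""
    live_out = live_model(x)
    try:
        shadow_out = shadow_model(x)
        comparisons.append({
            "input": x,
            "live": live_out,
            "shadow": shadow_out,
            "agree": live_out == shadow_out,
        })
    except Exception:
        # A failing shadow model must never affect the user-facing response
        log.exception("shadow model failed")
    return live_out

# Hypothetical models: the candidate disagrees for inputs >= 3
live = lambda x: x * 2
candidate = lambda x: x * 2 if x < 3 else 0
comparisons = []
responses = [handle_request(i, live, candidate, comparisons) for i in range(5)]
agreement = sum(c["agree"] for c in comparisons) / len(comparisons)
```

The agreement rate, together with the shadow model's latency and error count, gives evidence for or against promoting the new model.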

Summary

✨ What we learned in this chapter
  • System + Model + Data + Business: the four metric types.
  • Drift detection (PSI, KS test, Evidently).
  • Prometheus, Grafana, LangSmith: the modern stack.
  • Alerts, feedback loop, shadow deployment.