Chapter 58

AI Infrastructure
🏗️ The full stack behind running AI
AI infrastructure = compute + storage + network + orchestration + MLOps tooling. Without solid infrastructure, even the best model fails in production.

Compute Layer

  • GPU: NVIDIA A100/H100/B200 for training; L4/T4/L40 for inference.
  • TPU: Google; works very well with JAX and TensorFlow.
  • Inferentia/Trainium: AWS custom silicon (Inferentia for inference, Trainium for training).
  • CPU: light models and embeddings; ARM-based Graviton is cost-effective.
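As a rough guide to matching a model to one of the accelerators above, the memory needed just to hold the weights is the parameter count times the bytes per parameter. The sketch below assumes a hypothetical 7-billion-parameter model and counts weights only; KV cache, activations, and framework overhead are ignored, so real requirements are higher.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Memory in GB (1 GB = 1e9 bytes) needed to hold the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

def fits(num_params: int, dtype: str, gpu_memory_gb: float) -> bool:
    """True if the weights alone fit; leaves no headroom for KV cache."""
    return weight_memory_gb(num_params, dtype) <= gpu_memory_gb

# Hypothetical 7B-parameter model (illustrative number, not from the text).
params_7b = 7_000_000_000
print(weight_memory_gb(params_7b, "fp16"))  # 14.0
print(fits(params_7b, "fp32", 24))          # False: 28 GB of weights on a 24 GB card
```

In FP16 the weights alone fill most of a 16 GB T4, which is why a 24 GB L4 or quantization to INT8 is often considered for a model of this size.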

Storage Layer

  • Object store: S3, GCS for datasets and model artifacts.
  • Parallel FS: Lustre, WekaIO for high throughput during training.
  • Local NVMe: hot cache for the dataloader.
  • Vector DB: Pinecone, Weaviate, Qdrant, pgvector.
  • Feature Store: Feast, Tecton.
  • Warehouse: BigQuery, Snowflake, Databricks.
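To make the vector DB entry concrete: at its core, a vector database returns the stored vectors most similar to a query vector. The following is a minimal brute-force sketch in plain Python with made-up document IDs and 3-dimensional vectors. Production systems such as those listed above typically use approximate nearest-neighbor indexes rather than a linear scan.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k stored vectors most similar to the query."""
    ranked = sorted(index, key=lambda doc_id: cosine(query, index[doc_id]), reverse=True)
    return ranked[:k]

# Hypothetical embeddings, for illustration only.
index = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], index))  # ['doc-a', 'doc-b']
```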

Network

  • NVLink/NVSwitch: intra-node GPU-to-GPU links.
  • InfiniBand HDR/NDR: inter-node, 200–400 Gbps (HDR 200, NDR 400).
  • CDN: model artifacts and static assets.
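The link speeds above translate directly into how long it takes to move data between nodes. As a back-of-the-envelope sketch, the ideal transfer time is the payload in bits divided by the link bandwidth. The 14 GB payload is an illustrative assumption, and the result ignores protocol overhead, latency, and the collective algorithm in use.

```python
def transfer_seconds(payload_gb: float, link_gbps: float) -> float:
    """Ideal time to move payload_gb gigabytes over a link_gbps link."""
    payload_gigabits = payload_gb * 8  # 1 byte = 8 bits
    return payload_gigabits / link_gbps

# Nominal per-port rates: InfiniBand HDR = 200 Gbps, NDR = 400 Gbps.
for name, gbps in [("HDR", 200), ("NDR", 400)]:
    print(name, round(transfer_seconds(14, gbps), 2), "s")
# HDR 0.56 s
# NDR 0.28 s
```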

Orchestration: Kubernetes

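# Inference pod (GPU)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm                        # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels: { app: llm }      # must match the pod template labels below
  template:
    metadata:
      labels: { app: llm }
    spec:
      containers:
      - name: llm
        image: ghcr.io/org/llm:1.4
        ports:
        - containerPort: 8000
        resources:
          limits: { nvidia.com/gpu: 1 }   # needs the NVIDIA device plugin on the node
        readinessProbe:                   # pod receives traffic only after this passes
          httpGet: { path: /health, port: 8000 }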
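
The readinessProbe above expects the container to answer HTTP GET /health on port 8000 with a success status. A minimal sketch of such an endpoint, using only the Python standard library, follows. The handler name and JSON body are illustrative assumptions, and a real service would normally report ready only after the model has finished loading.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):  # hypothetical handler name
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)  # an httpGet probe passes on any 2xx/3xx status
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # keep frequent probe requests out of stderr

def make_server(port: int = 8000) -> HTTPServer:
    """Build a server bound to all interfaces on the probe port."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)

# To run inside the container: make_server().serve_forever()
```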

Key Add-ons

  • KServe / Seldon: model serving on K8s.
  • Kubeflow: ML workflows, training operator.
  • Argo Workflows: pipeline DAGs.
  • Karpenter: node autoscaler that provisions right-sized nodes for pending pods.
  • NVIDIA GPU Operator: manages drivers, MIG partitioning, and DCGM monitoring.

MLOps Tooling

  • Experiment tracking: MLflow, Weights & Biases, Neptune.
  • Data versioning: DVC, LakeFS.
  • Pipeline: Airflow, Prefect, Dagster.
  • CI/CD: GitHub Actions, GitLab CI, Buildkite.
  • Secrets: Vault, AWS Secrets Manager.

Build vs Buy

  • Buy (Managed): SageMaker, Vertex AI, Modal, Replicate, Anyscale; fast to start, but costly.
  • Build (Self-host): K8s + GPU nodes + open-source stack; cheap at scale, but ops-heavy.
  • Hybrid: self-host training (cheap), use managed serving (uptime).
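
The trade-off above can be framed as a break-even calculation: self-hosting carries a fixed monthly ops cost but a lower hourly GPU rate, so it wins once usage passes a threshold. All prices below are hypothetical placeholders, not real vendor quotes.

```python
def monthly_cost_managed(gpu_hours: float, rate_per_hour: float) -> float:
    """Managed platform: pay per GPU-hour, no fixed ops cost."""
    return gpu_hours * rate_per_hour

def monthly_cost_self_hosted(gpu_hours: float, rate_per_hour: float,
                             fixed_ops_cost: float) -> float:
    """Self-hosted: cheaper GPU-hours plus a fixed monthly ops cost."""
    return fixed_ops_cost + gpu_hours * rate_per_hour

def break_even_hours(managed_rate: float, self_rate: float,
                     fixed_ops_cost: float) -> float:
    """GPU-hours per month above which self-hosting is cheaper."""
    if managed_rate <= self_rate:
        raise ValueError("self-hosting never breaks even at these rates")
    return fixed_ops_cost / (managed_rate - self_rate)

# Hypothetical: managed $4/GPU-hour, self-hosted $2/GPU-hour, $10,000/month ops.
print(break_even_hours(4.0, 2.0, 10_000))  # 5000.0
```

Below the break-even point the managed option is cheaper despite its higher hourly rate, which is one reason small teams often start with Buy and move to Build later.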

Security & Governance

  • VPC isolation, private endpoints.
  • IAM least-privilege.
  • PII redaction in logs.
  • Model cards + data lineage.
  • EU AI Act, SOC 2, HIPAA compliance.
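
PII redaction in logs can be enforced at the logging layer so that individual call sites cannot forget it. The sketch below uses a standard `logging.Filter` to mask email addresses. The filter name, logger name, and regex are illustrative assumptions, and the deliberately simple pattern would need extending for phone numbers, IDs, and other PII types.

```python
import logging
import re

# Simplistic email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class RedactPII(logging.Filter):  # hypothetical filter name
    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first so PII passed via args is also caught.
        record.msg = EMAIL_RE.sub("[REDACTED_EMAIL]", record.getMessage())
        record.args = None  # message is already formatted
        return True  # keep the record, just with masked content

logger = logging.getLogger("inference")  # hypothetical logger name
logger.addFilter(RedactPII())

# logger.warning("request from %s", "alice@example.com")
# is emitted as: request from [REDACTED_EMAIL]
```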
💡 Start with a small stack
Day-1 stack: Postgres + S3 + 1 GPU VM + FastAPI + Docker + Grafana. As you scale, add Kafka, a vector DB, K8s, and Kubeflow.

Summary

✨ What we learned in this chapter
  • Compute, storage, network, orchestration: the four pillars.
  • K8s + GPU Operator + KServe: modern serving infrastructure.
  • MLOps tools: from experiment to production.
  • Build vs Buy: a context-dependent trade-off.