Chapter 55

Scalable AI Systems

📈 From 10 users to 10 million

A scalable AI system is more than a "bigger server": it is a design that combines load balancing, caching, async queues, autoscaling, and multi-region distribution.

Three Dimensions of Scaling

Vertical: a bigger GPU/CPU per node (A100 → H100).
Horizontal: many replicas behind a load balancer.
Functional: microservices, with preprocessing, inference, and postprocessing deployed and scaled separately.
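
As a minimal sketch of horizontal scaling, the load balancer's job can be reduced to picking a replica for each request. The class name `RoundRobinBalancer` and the pod names are hypothetical; production balancers also track health and in-flight load, which this sketch omits.

```python
import itertools


class RoundRobinBalancer:
    """Cycles through replicas so each one receives an equal share of requests."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        self._cycle = itertools.cycle(list(replicas))

    def pick(self):
        # Return the next replica in rotation.
        return next(self._cycle)


# Hypothetical replica names, for illustration only.
balancer = RoundRobinBalancer(["pod-0", "pod-1", "pod-2"])
assignments = [balancer.pick() for _ in range(6)]
```

Adding a replica to the list raises capacity without changing any single machine, which is the core of the horizontal approach.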

Reference Architecture

Client
  │
  ▼
[CDN / Edge Cache]
  │
  ▼
[API Gateway] ── Auth, Rate Limit
  │
  ▼
[Load Balancer]
  │
  ├─► [Inference Pod × N]  ──► [GPU Pool]
  ▼
[Redis Cache] [Postgres] [Vector DB] [Object Storage]
  │
  ▼
[Async Queue (Kafka/SQS)] ──► [Worker × M] (heavy/long jobs)

Caching Strategy

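Response cache: the same input returns the same output (Redis, with a TTL).
Embedding cache: in RAG, compute each document embedding only once.
Semantic cache: a similar prompt returns a cached response (GPTCache).
KV-cache: the LLM reuses the attention keys and values of tokens it has already processed instead of recomputing them.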
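
The response cache above can be sketched with an in-memory dictionary standing in for Redis. The names `TTLCache` and `cached_infer` are hypothetical, and `model_fn` is whatever callable runs inference; this is an illustration of the pattern, not a Redis client.

```python
import time


class TTLCache:
    """Response cache: exact-match lookup with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._store = {}            # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]    # expired: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)


def cached_infer(prompt, cache, model_fn):
    """Return a cached response when available, else call the model once."""
    hit = cache.get(prompt)
    if hit is not None:
        return hit
    result = model_fn(prompt)
    cache.set(prompt, result)
    return result
```

An exact-match key only helps when inputs repeat verbatim; a semantic cache would replace the dictionary lookup with a nearest-neighbour search over prompt embeddings.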

Async vs Sync

Sync: for a latency target under 2 s, such as chat and search.
Async: for video generation and batch embedding, using a queue plus webhook or polling.
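
The async path can be sketched with `asyncio.Queue` standing in for Kafka or SQS. The `jobs` dictionary, `submit`, and `worker` are hypothetical names, and string reversal is a placeholder for the heavy job; a real system would persist job state and notify the client via webhook.

```python
import asyncio
import uuid

jobs = {}  # job_id -> {"status": ..., "result": ...}; stand-in for a job store


async def submit(queue, payload):
    """Enqueue a long-running job and return an id the client can poll."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "result": None}
    await queue.put((job_id, payload))
    return job_id


async def worker(queue):
    """Pull jobs off the queue and record their results."""
    while True:
        job_id, payload = await queue.get()
        jobs[job_id]["status"] = "running"
        await asyncio.sleep(0.01)  # placeholder for heavy work
        jobs[job_id].update(status="done", result=payload[::-1])
        queue.task_done()


async def main():
    queue = asyncio.Queue()
    task = asyncio.create_task(worker(queue))
    job_id = await submit(queue, "hello")  # returns immediately
    await queue.join()                     # here: wait; a client would poll
    task.cancel()
    return jobs[job_id]


final = asyncio.run(main())
```

The request handler returns as soon as `submit` finishes, so slow jobs never hold a synchronous connection open.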

Autoscaling

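# Kubernetes HPA, scaling on GPU utilization.
# A Pods metric needs a custom metrics adapter (e.g. Prometheus Adapter)
# that exposes nvidia_gpu_duty_cycle. The names below are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa          # placeholder name
spec:
  scaleTargetRef:              # workload to scale (placeholder)
    apiVersion: apps/v1
    kind: Deployment
    name: inference
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric: { name: nvidia_gpu_duty_cycle }
        target: { type: AverageValue, averageValue: "70" }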
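
To see what this configuration does, the core HPA scaling rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min and max. The function below is a simplified sketch of that rule; it deliberately omits the controller's tolerance band and stabilization window.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=2, max_replicas=20):
    """Simplified HPA rule: scale in proportion to metric / target, then clamp."""
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


# 4 pods averaging 90% GPU duty cycle against a target of 70 -> scale out.
scale_out = desired_replicas(4, 90, 70)
# 4 pods averaging 30% -> scale in, but never below minReplicas.
scale_in = desired_replicas(4, 30, 70)
# A large spike is capped at maxReplicas.
capped = desired_replicas(10, 200, 70)
```

With the values above, the three calls yield 6, 2, and 20 replicas respectively.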

Cost Control

Spot/preemptible GPUs (typically 60–80% cheaper).
Scale to zero during idle hours (Modal, Replicate).
Try a smaller model first, then fall back to a large one (model cascade).
Quantization and distillation.
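
A model cascade can be sketched as follows, assuming each model returns an answer together with a confidence score in [0, 1]. The function `cascade`, the 0.8 threshold, and both stand-in models are hypothetical; how confidence is obtained depends on the model.

```python
def cascade(prompt, small_model, large_model, threshold=0.8):
    """Try the cheap model first; escalate only when it is not confident."""
    answer, confidence = small_model(prompt)
    if confidence >= threshold:
        return answer, "small"
    answer, _ = large_model(prompt)
    return answer, "large"


# Stand-in models for illustration: the small one is only sure of short prompts.
def small_model(prompt):
    return f"small:{prompt}", 0.9 if len(prompt) < 10 else 0.4


def large_model(prompt):
    return f"large:{prompt}", 0.99


easy = cascade("hi", small_model, large_model)
hard = cascade("explain quantum tunnelling", small_model, large_model)
```

The saving comes from the fraction of traffic the small model can answer on its own, so the threshold is worth tuning against real traffic.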

Reliability Patterns

Circuit breaker (stops cascading failures).
Retry with exponential backoff + jitter.
Bulkhead: isolate resource pools so one overloaded pool cannot starve the others.
Graceful degradation: if the LLM fails, serve a cached or simpler response.
Multi-region failover.
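
Two of these patterns can be sketched in a few lines. `retry_with_backoff` and `CircuitBreaker` are hypothetical names and the defaults are illustrative; the `fallback` argument is where graceful degradation (a cached or simpler response) would plug in.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call fn, retrying on exception with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: random delay in [0, cap)


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast with a degraded response
            # Cooldown elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

Jitter matters because many clients retrying on the same schedule would hit a recovering service in synchronized waves.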

💡 There is always one bottleneck

Before scaling, run a load test (Locust, k6) to find the real bottleneck. It could be any one of the GPU, the network, the database, or the queue.
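
Once a load test has produced a sustainable throughput figure for each stage, the bottleneck is simply the stage with the lowest capacity, since end-to-end throughput cannot exceed it. The numbers below are made up for illustration, and `find_bottleneck` is a hypothetical helper.

```python
def find_bottleneck(stage_capacity_rps):
    """Return the stage with the lowest measured capacity and that capacity."""
    stage = min(stage_capacity_rps, key=stage_capacity_rps.get)
    return stage, stage_capacity_rps[stage]


# Hypothetical per-stage capacities (requests per second) from a load test.
measured = {"gpu": 120.0, "network": 900.0, "database": 75.0, "queue": 400.0}
stage, ceiling = find_bottleneck(measured)
```

In this made-up example, adding GPUs would not help until the database limit of 75 requests per second is raised.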

Summary

✨ What we learned in this chapter

Vertical, horizontal, and functional scaling.
Caching + async queues: the secret to throughput.
Autoscaling, spot GPUs, model cascades: cost optimization.
Scaling without reliability patterns just means bigger outages.
