Chapter 50

FastAPI for AI

⚡ Why FastAPI for AI

FastAPI = async + Pydantic validation + auto OpenAPI docs. For serving ML models it is typically faster than Flask under concurrent load, type-safe, and production-ready.

Minimal Inference API

# main.py
from fastapi import FastAPI
from pydantic import BaseModel, Field
import joblib

app = FastAPI(title="Iris Classifier")
# Loaded once at import time; assumes a scikit-learn style estimator saved with joblib
model = joblib.load("model.pkl")

class Features(BaseModel):
    # Out-of-range values are rejected with a 422 response before the model runs
    sepal_length: float = Field(gt=0, lt=10)
    sepal_width:  float = Field(gt=0, lt=10)
    petal_length: float = Field(gt=0, lt=10)
    petal_width:  float = Field(gt=0, lt=10)

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict")
def predict(x: Features):
    # Plain def: FastAPI runs it in a threadpool, so predict() does not block the event loop
    pred = model.predict([[x.sepal_length, x.sepal_width,
                           x.petal_length, x.petal_width]])
    return {"class": int(pred[0])}

uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
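To see what the Field(gt=0, lt=10) constraints on Features do on their own, here is a minimal sketch that runs the same validation without a server. The helper name validate is an assumption for illustration; in a running app FastAPI performs this step itself and turns the error into a 422 response.

```python
from pydantic import BaseModel, Field, ValidationError


class Features(BaseModel):
    sepal_length: float = Field(gt=0, lt=10)
    sepal_width: float = Field(gt=0, lt=10)
    petal_length: float = Field(gt=0, lt=10)
    petal_width: float = Field(gt=0, lt=10)


def validate(payload: dict):
    """Hypothetical helper: return (features, None) or (None, names of bad fields)."""
    try:
        return Features(**payload), None
    except ValidationError as exc:
        # Each error carries the location of the offending field
        return None, [err["loc"][0] for err in exc.errors()]


good = {"sepal_length": 5.1, "sepal_width": 3.5,
        "petal_length": 1.4, "petal_width": 0.2}
bad = dict(good, sepal_length=-1.0)  # violates gt=0

features, errors = validate(good)
rejected, bad_fields = validate(bad)
```

Because the check lives in the schema, the model code never has to guard against negative or absurdly large measurements.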

Lifespan: load the model only once

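from contextlib import asynccontextmanager

import torch
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: runs once per worker process, before the first request.
    # weights_only=False is needed to unpickle a full nn.Module on recent
    # PyTorch versions; only use it with files you trust.
    app.state.model = torch.load("model.pt", map_location="cuda", weights_only=False)
    app.state.model.eval()
    yield
    # Shutdown: release the model and free cached GPU memory
    del app.state.model
    torch.cuda.empty_cache()

app = FastAPI(lifespan=lifespan)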

Async + Batching

Batch requests that arrive at the same time into a single forward pass to increase GPU utilization.

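# micro-batching with asyncio.Queue
# run_model is a placeholder: it takes a list of inputs and returns a list of outputs
import asyncio

MAX_BATCH = 32
queue = asyncio.Queue()  # items are (input, future) pairs

async def worker():
    while True:
        batch = [await queue.get()]        # block until the first request arrives
        try:
            while len(batch) < MAX_BATCH:  # then drain whatever is already queued
                batch.append(queue.get_nowait())
        except asyncio.QueueEmpty:
            pass
        inputs = [x for x, _ in batch]
        # Run the model off the event loop so new requests can keep queueing
        outputs = await asyncio.to_thread(run_model, inputs)
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)            # resolve each caller's future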
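The queue-and-worker idea can be exercised end to end without a web server. In this sketch, run_model is a hypothetical stand-in for a batched forward pass, and infer is an assumed helper that a request handler would await. Each caller gets back only its own result even though the model sees the whole batch.

```python
import asyncio
import contextlib

MAX_BATCH = 32


def run_model(inputs):
    # Hypothetical stand-in for a batched forward pass: doubles each input
    return [x * 2 for x in inputs]


async def worker(queue: asyncio.Queue):
    while True:
        batch = [await queue.get()]          # wait for the first request
        while len(batch) < MAX_BATCH:        # then take what is already waiting
            try:
                batch.append(queue.get_nowait())
            except asyncio.QueueEmpty:
                break
        inputs = [x for x, _ in batch]
        try:
            outputs = await asyncio.to_thread(run_model, inputs)
        except Exception as exc:
            # Propagate the failure to every caller in this batch
            for _, fut in batch:
                fut.set_exception(exc)
            continue
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)


async def infer(queue: asyncio.Queue, x):
    # What a request handler would call: enqueue, then wait for the result
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut


async def main():
    queue = asyncio.Queue()
    task = asyncio.create_task(worker(queue))
    results = await asyncio.gather(*(infer(queue, i) for i in range(5)))
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return results
```

In a FastAPI app the worker task would typically be started inside lifespan and cancelled after the yield, and an async endpoint would simply return await infer(queue, x).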

Streaming Response (LLM)

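from fastapi.responses import StreamingResponse
from pydantic import BaseModel

class ChatRequest(BaseModel):
    prompt: str  # sent in the JSON body, not as a query parameter

@app.post("/chat")
def chat(req: ChatRequest):
    # llm is a placeholder for your LLM client; llm.stream() is assumed to yield text tokens
    def gen():
        for token in llm.stream(req.prompt):
            yield token
    return StreamingResponse(gen(), media_type="text/plain")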

Production Essentials

CORS middleware.
Auth (API key / JWT).
Rate limiting (slowapi).
Request ID + structured logging.
Prometheus metrics (prometheus-fastapi-instrumentator).
Gunicorn + uvicorn workers.

💡 Workers with a GPU

For a GPU model, keep --workers 1: each worker process loads its own copy of the model into VRAM. To raise throughput, use batching rather than a higher worker count.

Summary

✨ What we learned in this chapter

Fast, type-safe ML APIs with FastAPI.
Lifespan, async batching, streaming responses.
Production checklist: auth, metrics, rate limiting.