Chapter 39

Large Language Models

🎬 Language Models at Massive Scale

A Large Language Model (LLM) is a decoder Transformer with billions of parameters, pretrained on trillions of tokens. GPT-4, Claude, Gemini, LLaMA, Mistral, and Qwen all belong to this family.

The Lifecycle of an LLM

Pretraining: next-token prediction on a massive text corpus (months of compute, tens of millions of dollars).
Supervised Fine-Tuning (SFT): teaching the model to follow instructions.
RLHF / DPO: aligning the model using human preferences.
Safety / Red-teaming: finding and filtering out harmful outputs.
Serving: quantization, KV-cache, batching.
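
The pretraining objective in the first step can be made concrete with a toy calculation. The sketch below is only an illustration, not how any production model is trained: the function names, the 3-token vocabulary, and the hand-written logits are all assumptions made for the example. It computes the average cross-entropy of predicting each next token from the scores a model assigns.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] holds the scores over the vocabulary after
    reading token_ids[: t + 1]; the target is token_ids[t + 1].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical 3-token vocabulary and the sequence [0, 2, 1].
tokens = [0, 2, 1]
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # no preference at all
confident = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]  # favours the correct token

print(next_token_loss(uniform, tokens))    # ln(3) for a uniform guess
print(next_token_loss(confident, tokens))  # much smaller
```

Pretraining pushes this number down across the whole corpus; SFT typically uses the same loss, restricted to curated instruction/response pairs.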

Emergent Capabilities

Past a certain scale, abilities such as in-context learning, chain-of-thought reasoning, code generation, and translation appear to "emerge" on their own, even though none of them was taught directly.
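
In-context learning is easiest to see as a prompt format: a few solved examples are placed before the real question, and the model continues the pattern without any weight update. The helper below is a minimal sketch; the function name `build_few_shot_prompt` and the "Input:/Output:" layout are assumptions for illustration, not a required format.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (input, output) pairs followed by an unanswered query."""
    parts = [instruction] if instruction else []
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    # The final block is left open so the model fills in the answer.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

examples = [("cat", "cats"), ("box", "boxes")]
prompt = build_few_shot_prompt(examples, "bus", "Give the plural.")
print(prompt)
```

The resulting string can be sent as the user message to any of the models discussed in this chapter.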

Context Window

| Model | Context | Notes |
|---|---|---|
| GPT-3.5 | 16K | - |
| GPT-4o | 128K | multimodal |
| Claude 3.5 | 200K | strong on long documents |
| Gemini 1.5 | 1M–2M | video/audio support |
| LLaMA 3.1 | 128K | open-weights |
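
A quick way to use these limits in practice is to check whether a document fits before sending it. The sketch below is an approximation under stated assumptions: the limits are the rounded figures from the table (the lower bound for Gemini 1.5), and the 4-characters-per-token ratio is a crude heuristic for English text. Real tokenizers differ, and Bengali text usually needs noticeably more tokens per character.

```python
# Rounded context limits taken from the table above.
CONTEXT_TOKENS = {
    "GPT-3.5": 16_000,
    "GPT-4o": 128_000,
    "Claude 3.5": 200_000,
    "Gemini 1.5": 1_000_000,
    "LLaMA 3.1": 128_000,
}

CHARS_PER_TOKEN = 4  # assumption: rough average for English text

def estimate_tokens(text):
    """Approximate token count using ceiling division."""
    return -(-len(text) // CHARS_PER_TOKEN)

def fits(text, model, reserved_for_output=1_000):
    """True if the prompt plus room for the reply fits in the window."""
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_TOKENS[model]

def split_to_fit(text, model, reserved_for_output=1_000):
    """Cut the text into character chunks that each fit the window."""
    budget_chars = (CONTEXT_TOKENS[model] - reserved_for_output) * CHARS_PER_TOKEN
    return [text[i:i + budget_chars] for i in range(0, len(text), budget_chars)]

document = "x" * 100_000  # stand-in for roughly 25K tokens
print(fits(document, "GPT-3.5"))               # False
print(fits(document, "GPT-4o"))                # True
print(len(split_to_fit(document, "GPT-3.5")))  # 2
```

Splitting on raw character offsets can cut a sentence in half; a real pipeline would more likely split on paragraph boundaries and count tokens with the model's own tokenizer.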

Open vs. Closed

| Closed (API) | Open (weights) |
|---|---|
| GPT-4, Claude, Gemini | LLaMA, Mistral, Qwen, Phi |
| most capable, paid | self-host, fine-tune, free |
| easy integration | privacy, control, cost |

Quick Start with an API

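# OpenAI (hosted API); the client reads the OPENAI_API_KEY environment variable
from openai import OpenAI

client = OpenAI()
r = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write one haiku in Bengali"}],
)
print(r.choices[0].message.content)

# Hugging Face, running locally (the Llama weights are gated: accept the
# license on the Hub and log in before the first download)
from transformers import pipeline

chat = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
# A plain string is continued as raw text, so generated_text includes the prompt
print(chat("Explain attention in 2 lines.", max_new_tokens=80)[0]["generated_text"])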

Running Locally: Ollama / llama.cpp

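# CLI: download the model, then run a one-off prompt
ollama pull llama3.2:3b
ollama run llama3.2:3b "summarize attention in 3 lines"

# Python: call the local Ollama REST API (default port 11434)
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    # "stream": False returns one JSON object instead of a stream of chunks
    json={"model": "llama3.2:3b", "prompt": "hello", "stream": False},
)
print(r.json()["response"])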
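
When streaming is left on, Ollama's `/api/generate` endpoint sends newline-delimited JSON chunks, each carrying a `response` fragment and a `done` flag. The parser below is a small sketch of how those chunks could be consumed; the helper name `iter_stream_text` is an assumption, and the sample lines are hand-written stand-ins rather than captured server output, so the parser can be tried without a running server.

```python
import json

def iter_stream_text(lines):
    """Yield text fragments from newline-delimited JSON chunks."""
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final chunk marks the end of generation

# Hand-written sample in the shape of a streamed reply.
sample = [
    b'{"model":"llama3.2:3b","response":"Hel","done":false}',
    b'{"model":"llama3.2:3b","response":"lo","done":false}',
    b'{"model":"llama3.2:3b","response":"","done":true}',
]
print("".join(iter_stream_text(sample)))  # Hello

# Against a live server (assumes Ollama is listening on localhost:11434):
# import requests
# with requests.post("http://localhost:11434/api/generate",
#                    json={"model": "llama3.2:3b", "prompt": "hello"},
#                    stream=True) as r:
#     for piece in iter_stream_text(r.iter_lines()):
#         print(piece, end="", flush=True)
```

Printing each fragment as it arrives is what makes a local model feel responsive even when the full answer takes several seconds.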

Quantization: Making Models Cheap Enough to Run

FP16 / BF16: the default training precision.
INT8: half the memory of FP16, with a moderate accuracy loss.
4-bit (GPTQ, AWQ, GGUF Q4_K_M): fits 7B–13B models on a consumer GPU.
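
The memory savings follow from simple arithmetic: parameters times bits per weight, divided by 8. The sketch below works that out for a 7B model and then shows a symmetric absmax INT8 round trip. Both helpers are illustrative assumptions: absmax is a deliberately simple scheme, not what GPTQ, AWQ, or GGUF actually do (those use group-wise scales and calibration), and the estimate covers weights only, ignoring activations and the KV-cache.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

for label, bits in [("FP16/BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"7B model, {label}: {weight_memory_gb(7e9, bits):.1f} GB")

def quantize_absmax_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # 1.0 guards all-zero input
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [x * scale for x in q]

weights = [0.02, -0.5, 0.31, 1.27]  # hypothetical weight values
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
print(q)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # rounding error
```

The rounding error is bounded by half the scale, which is why a single large outlier weight hurts: it stretches the scale and coarsens every other value. Group-wise schemes exist largely to limit that effect.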

Cost and Speed

KV-cache → less repeated computation.
Batch inference → higher throughput.
Speculative decoding → roughly 2–3x faster decoding.
Streaming → lower user-perceived latency.
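
The KV-cache trade-off can be put in numbers: it removes repeated work at the price of memory that grows with sequence length. The sketch below is a back-of-the-envelope estimate. The configuration (32 layers, 8 key/value heads, head dimension 128, 2 bytes per value) is an assumption chosen to resemble an 8B LLaMA-style model with grouped-query attention, and both function names are made up for this example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """Bytes needed to store keys and values for every layer and position."""
    # The leading 2 accounts for one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

def positions_processed(prompt_len, n_new, use_cache):
    """Token positions pushed through the model while generating n_new tokens."""
    if use_cache:
        # The prompt is processed once, then one new position per later step.
        return prompt_len + (n_new - 1)
    # Without a cache, every step re-reads the entire sequence so far.
    return sum(prompt_len + i for i in range(n_new))

# Assumed configuration, for illustration only.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **cfg)
full = kv_cache_bytes(seq_len=128 * 1024, **cfg)
print(per_token / 1024, "KiB per token")     # 128.0
print(full / 1024**3, "GiB at 128K tokens")  # 16.0

print(positions_processed(100, 50, use_cache=False))  # 6225
print(positions_processed(100, 50, use_cache=True))   # 149
```

Under these assumptions a single full-length 128K sequence needs about as much memory for its cache as the FP16 weights of the model itself, which is one reason serving stacks manage cache memory so carefully.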

🔑 Which model, and when?

Quick prototype → API (GPT-4o-mini, Claude Haiku). Low-latency or privacy-sensitive production → self-hosted LLaMA/Mistral served with vLLM/TGI.

Exercises

1. Run the same prompt on OpenAI and on Ollama (LLaMA) and compare the outputs.

2. Write a token count / cost estimator.

3. Run a 4-bit quantized model on a GPU and measure its memory use.

Summary

✨ What we learned in this chapter

LLM = scaled decoder Transformer + alignment.
Closed = capable, Open = flexible.
Quantization and the KV-cache make serving cheap.
