Chapter 30

LSTM

🎬 Long-Term Memory

Long Short-Term Memory (LSTM) was designed to address the vanishing gradient problem of RNNs. It combines a cell state with three gates that decide what to remember and what to forget.

LSTM Cell Structure

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)     # Forget gate
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)     # Input gate
C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)  # Candidate cell
C_t = f_t * C_{t-1} + i_t * C̃_t          # New cell state
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)     # Output gate
h_t = o_t * tanh(C_t)                    # New hidden state
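The six equations above can be traced in a few lines of NumPy. This is only a minimal sketch for illustration: the function name `lstm_step`, the stacking of the four weight blocks into one matrix `W`, and the order forget/input/candidate/output are assumptions made here, not the layout of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps the concatenation [h_prev, x_t] to the four pre-activations,
    stacked in the (assumed) order: forget, input, candidate, output.
    Shapes: W is (4*H, H+D), b is (4*H,), x_t is (D,), h_prev and c_prev are (H,).
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])               # forget gate
    i = sigmoid(z[H:2 * H])           # input gate
    c_tilde = np.tanh(z[2 * H:3 * H]) # candidate cell
    o = sigmoid(z[3 * H:4 * H])       # output gate
    c = f * c_prev + i * c_tilde      # new cell state (elementwise products)
    h = o * np.tanh(c)                # new hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4                           # toy input and hidden sizes
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):   # a toy sequence of 5 steps
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)               # (4,) (4,)
```

Note that `h` always stays inside (-1, 1) because it is a sigmoid output times a tanh, while `c` is not bounded in the same way.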

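What the three gates do

Forget: what should be dropped from the old cell state?
Input: which new information should be added?
Output: which part of the cell state should be exposed as the hidden state and passed on to the prediction?

🔑 The cell state is the key

The cell state is a "highway": apart from the gates' elementwise operations, it is carried forward linearly, with no weight matrix or squashing nonlinearity applied along the way. Along this path the gradient is scaled only by the forget gate f_t at each step, so as long as the forget gates stay close to 1 it fades far more slowly over long spans than in a plain RNN.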
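A rough numeric illustration of the highway idea, under simplifying assumptions: only the direct cell-state path is considered (so dC_T/dC_0 is just the product of the forget-gate values), the forget gates are fixed at 0.99, and the comparison is a scalar tanh RNN with a hand-picked weight. None of these numbers come from a trained model.

```python
import math

T = 100  # number of time steps

# Cell-state path: dC_T/dC_0 is the product of the forget-gate values
# (ignoring the indirect paths that go through h_t and the gates).
forget_gates = [0.99] * T          # assumed: gates stay close to 1
lstm_factor = math.prod(forget_gates)

# Scalar vanilla RNN h_t = tanh(w * h_{t-1}): each step multiplies the
# gradient by w * (1 - h_t**2), which is at most |w|.
w, h = 0.9, 0.5                    # assumed toy values
rnn_factor = 1.0
for _ in range(T):
    h = math.tanh(w * h)
    rnn_factor *= w * (1.0 - h * h)

print(f"LSTM cell path: {lstm_factor:.3f}")
print(f"vanilla RNN:    {rnn_factor:.3e}")
```

After 100 steps the cell-state factor is still about 0.37, while the RNN factor is bounded above by 0.9^100, several orders of magnitude smaller. If the forget gates were pushed toward 0, the LSTM path would shrink just as quickly, which is why the gates have to learn when to stay open.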

Keras LSTM

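from tensorflow.keras import layers, models

# Binary sentiment classifier: embedding -> two stacked BiLSTM layers -> sigmoid
m = models.Sequential([
    layers.Embedding(20000, 128, mask_zero=True),   # id 0 is padding and gets masked
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),  # full sequence for the next LSTM
    layers.Bidirectional(layers.LSTM(64)),          # one vector per sequence
    layers.Dense(1, activation="sigmoid"),
])
m.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])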

PyTorch LSTM

import torch
import torch.nn as nn

class LSTMSent(nn.Module):
    def __init__(self, vocab, emb=128, hid=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb, padding_idx=0)   # id 0 is the pad token
        self.lstm = nn.LSTM(emb, hid, num_layers=2, batch_first=True,
                            bidirectional=True, dropout=0.3)
        self.fc = nn.Linear(hid*2, 1)

    def forward(self, x):
        x = self.emb(x)                            # (batch, seq, emb)
        out, (h, c) = self.lstm(x)                 # h: (num_layers*2, batch, hid)
        last = torch.cat([h[-2], h[-1]], dim=1)    # last layer: forward + backward final states
        return self.fc(last).squeeze(-1)           # raw logits, one per sequence

Stacked & Bidirectional

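Stacking another LSTM on top lets the model learn hierarchical patterns. A bidirectional LSTM reads the sequence in both directions, so each position gets context from both sides; it is a standard choice for sequence labeling, NER, and sentiment analysis.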
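To make "context from both directions" concrete, here is a small NumPy sketch. A toy tanh recurrence (`run_rnn`, a hypothetical stand-in for a real LSTM layer) is run once left to right and once right to left, and the two state sequences are concatenated per time step. The helper names and sizes are assumptions for illustration only.

```python
import numpy as np

def run_rnn(xs, W_x, W_h):
    """Toy tanh recurrence standing in for an LSTM layer; returns states of shape (T, H)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)

def bidirectional(xs, fwd_params, bwd_params):
    fwd = run_rnn(xs, *fwd_params)               # state at t has seen x_0 .. x_t
    bwd = run_rnn(xs[::-1], *bwd_params)[::-1]   # state at t has seen x_t .. x_{T-1}
    return np.concatenate([fwd, bwd], axis=1)    # (T, 2H)

def make_params(rng, D, H):
    return rng.normal(size=(H, D)), 0.5 * rng.normal(size=(H, H))

rng = np.random.default_rng(0)
T, D, H = 6, 3, 4
fwd_params, bwd_params = make_params(rng, D, H), make_params(rng, D, H)
xs = rng.normal(size=(T, D))
out = bidirectional(xs, fwd_params, bwd_params)
print(out.shape)   # (6, 8)
```

The output width doubles from H to 2H, which matches the `hid*2` input size of the final linear layer in the PyTorch model. Because the backward pass needs the whole sequence up front, this setup suits classification and tagging rather than step-by-step generation.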

Sequence Generation

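import numpy as np

# Char-level text generation.
# Assumes model, encode, idx2char, seq_len and the starting string seed are defined elsewhere.
for _ in range(200):
    x = np.array(encode(seed)[-seq_len:])     # ids of the last seq_len characters
    probs = model.predict(x[None])[0, -1]     # distribution over the next character
    probs = probs.astype("float64")
    probs /= probs.sum()                      # renormalize away float32 rounding error
    next_id = np.random.choice(len(probs), p=probs)
    seed += idx2char[next_id]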

Common Uses

Sentiment analysis, NER, POS tagging.
Time-series forecasting (stock, weather).
Speech recognition (CTC head).
Music generation, char-level text.
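For the time-series case in the list above, the series usually has to be cut into fixed-length input windows with a target value for each. The helper below is one possible sketch; `make_windows`, `window`, and `horizon` are names chosen here for illustration, and the trailing feature axis assumes a univariate series fed to an LSTM expecting (samples, timesteps, features).

```python
import numpy as np

def make_windows(series, window, horizon=1):
    """Return X of shape (N, window, 1) and y of shape (N,).

    X[i] holds series[i : i + window]; y[i] is the value `horizon`
    steps after the end of that window.
    """
    series = np.asarray(series, dtype=np.float32)
    n = len(series) - window - horizon + 1
    if n <= 0:
        raise ValueError("series is too short for this window and horizon")
    X = np.stack([series[i:i + window] for i in range(n)])
    start = window + horizon - 1
    y = series[start:start + n]
    return X[..., None], y

X, y = make_windows(np.arange(10), window=4)
print(X.shape, y.shape)   # (6, 4, 1) (6,)
print(X[0, :, 0], y[0])   # [0. 1. 2. 3.] 4.0
```

When splitting such windows into train and validation sets, splitting by time rather than shuffling avoids leaking future values into training.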

Common Mistakes

⚠️ Avoid these

Not supplying a padding mask: pad tokens feed noise into the model.
Very long sequences: truncate or use windowing.
Forgetting gradient clipping.
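Two of the pitfalls above, padding masks and gradient clipping, can be shown framework-free in NumPy. The helpers below (`pad_and_mask`, `masked_mean`, `clip_by_global_norm`) are hypothetical names written for this sketch. In practice the frameworks provide equivalents, for example `mask_zero=True` in Keras as used earlier, or `torch.nn.utils.rnn.pack_padded_sequence` and `torch.nn.utils.clip_grad_norm_` in PyTorch.

```python
import numpy as np

def pad_and_mask(seqs, pad_id=0):
    """Right-pad id sequences to a common length; mask is True on real tokens."""
    max_len = max(len(s) for s in seqs)
    ids = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for r, s in enumerate(seqs):
        ids[r, :len(s)] = s
        mask[r, :len(s)] = True
    return ids, mask

def masked_mean(states, mask):
    """Average per-step states of shape (B, T, H) over real tokens only."""
    m = mask[..., None].astype(states.dtype)
    return (states * m).sum(axis=1) / np.maximum(m.sum(axis=1), 1.0)

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

ids, mask = pad_and_mask([[5, 8, 2], [7]])
print(ids.tolist())    # [[5, 8, 2], [7, 0, 0]]
print(mask.tolist())   # [[True, True, True], [True, False, False]]

clipped, norm = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
print(norm)            # 5.0
```

Clipping by the global norm keeps the direction of the update and only shrinks its length, so a single exploding step cannot throw the weights far off.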

Exercises

1. Compare the accuracy of a BiLSTM and a SimpleRNN on IMDB.

2. Generate Shakespeare-style text with a char-level LSTM.

3. Observe the effect of dropout 0.0 vs 0.3 in a stacked LSTM.

Summary

✨ What we learned in this chapter

LSTM = cell state + 3 gates → long-range memory.
A BiLSTM draws context from both directions.
LSTMs were the backbone of NLP in the pre-Transformer era.