Neural Network Encyclopedia

A comprehensive mathematical reference covering every major neural network architecture — from classical perceptrons to modern diffusion models.

01

Perceptron

The simplest neural network — a single linear classifier.

Supervised Classification 1958 — Rosenblatt

Forward Pass

Given input vector $\mathbf{x} \in \mathbb{R}^n$, weight vector $\mathbf{w} \in \mathbb{R}^n$, and bias $b$:

$$z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b$$ $$\hat{y} = \sigma(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$

Learning Rule

For a training sample $(\mathbf{x}, y)$ with learning rate $\eta$:

$$\mathbf{w} \leftarrow \mathbf{w} + \eta \,(y - \hat{y})\,\mathbf{x}$$ $$b \leftarrow b + \eta \,(y - \hat{y})$$

Convergence Theorem

If the training data is linearly separable with margin $\gamma = \min_i \frac{y_i(\mathbf{w}^{*\top}\mathbf{x}_i)}{\|\mathbf{w}^*\|}$ (with labels taken as $y_i \in \{-1, +1\}$), the perceptron converges in at most $\left(\frac{R}{\gamma}\right)^2$ updates, where $R = \max_i \|\mathbf{x}_i\|$.
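The learning rule above can be sketched in a few lines of NumPy. The AND-gate data below is an illustrative linearly separable problem, so the convergence theorem guarantees the loop terminates:

```python
import numpy as np

def perceptron_train(X, y, eta=0.1, epochs=100):
    """Perceptron learning rule: w += eta * (y - y_hat) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1.0 if xi @ w + b >= 0 else 0.0  # step activation
            w += eta * (yi - y_hat) * xi
            b += eta * (yi - y_hat)
            errors += int(y_hat != yi)
        if errors == 0:  # converged (guaranteed when data is separable)
            break
    return w, b

# AND gate: linearly separable toy data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
w, b = perceptron_train(X, y)
preds = (X @ w + b >= 0).astype(float)
```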

x₁ ──w₁──╮
x₂ ──w₂──┤→ Σ + b → step(·) → ŷ
x₃ ──w₃──╯
02

Multi-Layer Perceptron (MLP)

Feedforward network with one or more hidden layers — a universal function approximator.

Supervised Classification / Regression Universal Approximation

Architecture

An MLP with $L$ layers maps input $\mathbf{x}$ through a series of affine transformations and nonlinearities:

$$\mathbf{h}^{(0)} = \mathbf{x}$$ $$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \quad l = 1, \dots, L$$ $$\mathbf{h}^{(l)} = f\!\left(\mathbf{z}^{(l)}\right), \quad l = 1, \dots, L-1$$ $$\hat{\mathbf{y}} = g\!\left(\mathbf{z}^{(L)}\right)$$

Where $f$ is a hidden activation (e.g. ReLU) and $g$ is the output activation (e.g. softmax for classification, identity for regression).
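A minimal NumPy sketch of the forward pass, with ReLU hidden activations and a softmax output as in the notation above (layer sizes are arbitrary illustrations):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # stabilized
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, weights, biases):
    """h^(0) = x; z^(l) = W^(l) h^(l-1) + b^(l); f = ReLU, g = softmax."""
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b
        h = softmax(z) if l == len(weights) - 1 else relu(z)
    return h

rng = np.random.default_rng(0)
# illustrative 4 -> 8 -> 3 network
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]
bs = [np.zeros(8), np.zeros(3)]
y_hat = mlp_forward(rng.normal(size=4), Ws, bs)
```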

Universal Approximation Theorem

A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$, given a non-polynomial activation function.

$$\forall\, \varepsilon > 0,\;\exists\, N,\; \mathbf{W}, \mathbf{b}:\quad \sup_{\mathbf{x} \in K} \left| f(\mathbf{x}) - \sum_{i=1}^{N} v_i \,\sigma\!\left(\mathbf{w}_i^\top \mathbf{x} + b_i\right) \right| < \varepsilon$$

Loss Functions

Mean Squared Error (Regression)

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2$$

Cross-Entropy (Classification)

$$\mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log\hat{y}_{i,c}$$
Input     Hidden 1   Hidden 2   Output

○─────╲
○──────●──────●──────●
○──────●──────●──────●──────○ ŷ
○──────●──────●──────●
○─────╱
03

Activation Functions

Nonlinearities that give neural networks their expressive power.

| Name | Formula $f(z)$ | Derivative $f'(z)$ |
| --- | --- | --- |
| Sigmoid | $\frac{1}{1+e^{-z}}$ | $f(z)(1-f(z))$ |
| Tanh | $\frac{e^z - e^{-z}}{e^z + e^{-z}}$ | $1 - f(z)^2$ |
| ReLU | $\max(0, z)$ | $\begin{cases}1 & z>0\\0 & z\leq 0\end{cases}$ |
| Leaky ReLU | $\max(\alpha z, z)$ | $\begin{cases}1 & z>0\\\alpha & z\leq 0\end{cases}$ |
| ELU | $\begin{cases}z & z>0\\\alpha(e^z-1) & z\leq 0\end{cases}$ | $\begin{cases}1 & z>0\\f(z)+\alpha & z\leq 0\end{cases}$ |
| GELU | $z \cdot \Phi(z)$ | $\Phi(z) + z\,\phi(z)$ |
| Swish / SiLU | $z \cdot \sigma(z)$ | $f(z) + \sigma(z)(1 - f(z))$ |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $f_i(\delta_{ij} - f_j)$ |
| Mish | $z \cdot \tanh(\ln(1+e^z))$ | See chain rule expansion |

GELU (Gaussian Error Linear Unit)

$$\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left[1 + \text{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right]$$ $$\approx 0.5\,z\left(1 + \tanh\!\left[\sqrt{\frac{2}{\pi}}\left(z + 0.044715\,z^3\right)\right]\right)$$
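A quick check in plain Python that the tanh approximation tracks the exact erf form (the tolerance is a loose illustrative bound):

```python
import math

def gelu_exact(z):
    """GELU(z) = z * Phi(z), with Phi expressed via the error function."""
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Common tanh approximation of GELU."""
    return 0.5 * z * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

# the two forms agree closely over the typical activation range
for z in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    assert abs(gelu_exact(z) - gelu_tanh(z)) < 1e-2
```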
04

Backpropagation

The chain rule applied layer-by-layer to compute gradients efficiently.

Core Algorithm 1986 — Rumelhart, Hinton, Williams

Chain Rule (Vector Form)

For loss $\mathcal{L}$ with respect to parameters in layer $l$:

$$\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{z}^{(L)}}\mathcal{L} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}}$$ $$\boldsymbol{\delta}^{(l)} = \left(\mathbf{W}^{(l+1)\top}\boldsymbol{\delta}^{(l+1)}\right) \odot f'\!\left(\mathbf{z}^{(l)}\right)$$

Parameter Gradients

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \mathbf{h}^{(l-1)\top}$$ $$\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$

Computational Complexity

For a network with $L$ layers and $n$ neurons per layer, backpropagation has $O(Ln^2)$ time complexity — the same as the forward pass — making it highly efficient compared to numerical differentiation.
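The four equations above can be exercised on a toy 2-layer ReLU network with MSE loss; the finite-difference check at the end is a standard way to validate a hand-written backward pass (shapes and tolerances are illustrative choices):

```python
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    """Manual backprop for a 2-layer ReLU MLP with MSE loss (sketch)."""
    # forward
    z1 = W1 @ x + b1
    h1 = np.maximum(0.0, z1)
    y_hat = W2 @ h1 + b2                 # identity output (regression)
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # backward: delta^(L), then propagate via W^(l+1)^T and f'(z)
    d2 = y_hat - y                       # dL/dz2
    dW2 = np.outer(d2, h1)
    db2 = d2
    d1 = (W2.T @ d2) * (z1 > 0)          # elementwise ReLU'(z1)
    dW1 = np.outer(d1, x)
    db1 = d1
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(1)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
loss, grads = forward_backward(x, y, W1, b1, W2, b2)

# finite-difference check on a single weight
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
num = (forward_backward(x, y, W1p, b1, W2, b2)[0] - loss) / eps
```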

05

Optimization Algorithms

Methods for traversing the loss landscape to find good minima.

Stochastic Gradient Descent (SGD)

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)$$

SGD with Momentum

$$\mathbf{v}_{t+1} = \mu\, \mathbf{v}_t - \eta\, \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)$$ $$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \mathbf{v}_{t+1}$$

Nesterov Accelerated Gradient

$$\mathbf{v}_{t+1} = \mu\, \mathbf{v}_t - \eta\, \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t + \mu\,\mathbf{v}_t)$$ $$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \mathbf{v}_{t+1}$$

AdaGrad

$$\mathbf{G}_{t} = \mathbf{G}_{t-1} + \mathbf{g}_t^2$$ $$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{G}_t + \epsilon}}\, \mathbf{g}_t$$

RMSProp

$$\mathbf{v}_t = \rho\,\mathbf{v}_{t-1} + (1-\rho)\,\mathbf{g}_t^2$$ $$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{v}_t + \epsilon}}\, \mathbf{g}_t$$

Adam

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t$$ $$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2$$ $$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}$$ $$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}\,\hat{\mathbf{m}}_t$$

AdamW (Decoupled Weight Decay)

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\left(\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} + \lambda\,\boldsymbol{\theta}_t\right)$$
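A minimal sketch of one AdamW step, applied here to minimizing $\|\boldsymbol{\theta}\|^2$; the hyperparameters are the commonly cited defaults, used illustratively:

```python
import numpy as np

def adamw_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=1e-2):
    """One AdamW update: Adam moments plus decoupled weight decay."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

theta = np.ones(3)
m, v = np.zeros(3), np.zeros(3)
for t in range(1, 101):
    g = 2 * theta                        # gradient of ||theta||^2
    theta, m, v = adamw_step(theta, g, m, v, t)
```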
06

Regularization Techniques

Methods to prevent overfitting and improve generalization.

L1 Regularization (Lasso)

$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda \sum_{l}\|\mathbf{W}^{(l)}\|_1 = \mathcal{L} + \lambda \sum_{l}\sum_{i,j}|W^{(l)}_{ij}|$$

L2 Regularization (Ridge / Weight Decay)

$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \frac{\lambda}{2}\sum_{l}\|\mathbf{W}^{(l)}\|_F^2 = \mathcal{L} + \frac{\lambda}{2}\sum_{l}\sum_{i,j}(W^{(l)}_{ij})^2$$

Dropout

During training, each neuron is independently set to zero with probability $p$:

$$\mathbf{m} \sim \text{Bernoulli}(1-p)$$ $$\tilde{\mathbf{h}}^{(l)} = \mathbf{m} \odot \mathbf{h}^{(l)}$$ $$\text{At test time:}\quad \mathbf{h}^{(l)}_{\text{test}} = (1-p)\,\mathbf{h}^{(l)}$$

Batch Normalization

$$\mu_B = \frac{1}{m}\sum_{i=1}^m z_i, \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^m(z_i - \mu_B)^2$$ $$\hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$ $$y_i = \gamma\,\hat{z}_i + \beta$$

Layer Normalization

$$\mu = \frac{1}{H}\sum_{i=1}^{H}h_i, \quad \sigma^2 = \frac{1}{H}\sum_{i=1}^{H}(h_i - \mu)^2$$ $$\hat{h}_i = \frac{h_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y_i = \gamma\,\hat{h}_i + \beta$$

RMSNorm

$$\text{RMS}(\mathbf{h}) = \sqrt{\frac{1}{H}\sum_{i=1}^H h_i^2}$$ $$\hat{h}_i = \frac{h_i}{\text{RMS}(\mathbf{h})}\,\gamma_i$$
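LayerNorm and RMSNorm differ only in whether the mean is subtracted. A NumPy sketch (the small `eps` added inside RMSNorm is a numerical-safety choice, not part of the formula above):

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-5):
    """Normalize across features, then scale and shift."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

def rms_norm(h, gamma, eps=1e-8):
    """RMSNorm: rescale by root-mean-square only, no mean subtraction."""
    rms = np.sqrt(np.mean(h ** 2, axis=-1, keepdims=True) + eps)
    return h / rms * gamma

h = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm(h, gamma=np.ones(4), beta=np.zeros(4))
r = rms_norm(h, gamma=np.ones(4))
```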
07

Convolutional Neural Network (CNN)

Networks exploiting spatial structure through shared local filters.

Supervised Computer Vision 1989 — LeCun

2D Convolution

For input $\mathbf{X} \in \mathbb{R}^{C_{in} \times H \times W}$ and filter $\mathbf{K} \in \mathbb{R}^{C_{in} \times k \times k}$:

$$(\mathbf{X} * \mathbf{K})[i,j] = \sum_{c=1}^{C_{in}}\sum_{m=0}^{k-1}\sum_{n=0}^{k-1} X[c,\, i+m,\, j+n] \cdot K[c,\, m,\, n]$$

Output Dimensions

$$H_{\text{out}} = \left\lfloor\frac{H + 2p - k}{s}\right\rfloor + 1, \quad W_{\text{out}} = \left\lfloor\frac{W + 2p - k}{s}\right\rfloor + 1$$

Where $p$ is padding, $s$ is stride, $k$ is kernel size.
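The output-dimension formula as a helper function; the 7×7/stride-2/padding-3 example is just an illustrative ResNet-style stem configuration:

```python
def conv_out(size, k, p=0, s=1):
    """Output spatial size: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * p - k) // s + 1

# 224x224 input, 7x7 kernel, stride 2, padding 3 -> 112x112
assert conv_out(224, k=7, p=3, s=2) == 112
# "same" output for odd k at stride 1: choose p = k // 2
assert conv_out(32, k=3, p=1, s=1) == 32
```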

Depthwise Separable Convolution

Factorizes a standard convolution into a depthwise and pointwise step:

$$\text{Standard cost:}\quad C_{in} \cdot k^2 \cdot C_{out} \cdot H' \cdot W'$$ $$\text{Separable cost:}\quad C_{in} \cdot k^2 \cdot H' \cdot W' + C_{in} \cdot C_{out} \cdot H' \cdot W'$$ $$\text{Reduction ratio:}\quad \frac{1}{C_{out}} + \frac{1}{k^2}$$

Dilated (Atrous) Convolution

$$(\mathbf{X} *_d \mathbf{K})[i,j] = \sum_{m}\sum_{n} X[i + d \cdot m,\; j + d \cdot n] \cdot K[m, n]$$

Effective receptive field: $k + (k-1)(d-1)$, where $d$ is the dilation rate.

Pooling Operations

$$\text{Max Pool:}\quad y_{ij} = \max_{(m,n) \in \mathcal{R}_{ij}} x_{mn}$$ $$\text{Avg Pool:}\quad y_{ij} = \frac{1}{|\mathcal{R}_{ij}|}\sum_{(m,n) \in \mathcal{R}_{ij}} x_{mn}$$

Transposed Convolution

Used for upsampling. Equivalent to a convolution with fractional stride, implemented by inserting zeros between (and around) the input elements:

$$H_{\text{out}} = (H_{\text{in}} - 1) \cdot s - 2p + k + p_{\text{out}}$$
08

U-Net

Encoder-decoder with skip connections for dense prediction tasks. Backbone of all modern diffusion models.

Supervised Segmentation / Diffusion 2015 — Ronneberger et al.

Architecture

Symmetric encoder (contracting) and decoder (expanding) path with skip connections concatenating encoder features to decoder features at each resolution:

$$\text{Encoder:}\quad \mathbf{e}^{(l)} = \text{MaxPool}\!\left(\text{ConvBlock}(\mathbf{e}^{(l-1)})\right)$$ $$\text{Decoder:}\quad \mathbf{d}^{(l)} = \text{ConvBlock}\!\left([\text{UpConv}(\mathbf{d}^{(l+1)});\; \mathbf{e}^{(l)}]\right)$$

Where $[\cdot;\cdot]$ denotes channel-wise concatenation (skip connection).

ConvBlock

$$\text{ConvBlock}(\mathbf{x}) = \text{ReLU}(\text{BN}(\text{Conv}_{3\times3}(\text{ReLU}(\text{BN}(\text{Conv}_{3\times3}(\mathbf{x}))))))$$

U-Net in Diffusion Models

In DDPM / Stable Diffusion, the U-Net is conditioned on timestep $t$ and optional conditioning $c$:

$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) = \text{U-Net}(\mathbf{x}_t,\; \text{SinEmb}(t),\; \text{CrossAttn}(c))$$

Time embeddings are injected via addition/FiLM layers; text conditioning via cross-attention at each resolution level.

[Diagram] U-Net: encoder path x → Conv(64) → Pool → Conv(128) → Pool → Conv(256) → Pool → bottleneck Conv(512); symmetric decoder path of UpConv + Conv back up to ŷ, with skip connections joining encoder and decoder features at each resolution (256, 128, 64).
09

Recurrent Neural Network (Vanilla RNN)

Networks with temporal memory via recurrent connections.

Supervised Sequence Modeling Temporal

Hidden State Dynamics

$$\mathbf{h}_t = \tanh\!\left(\mathbf{W}_{hh}\,\mathbf{h}_{t-1} + \mathbf{W}_{xh}\,\mathbf{x}_t + \mathbf{b}_h\right)$$ $$\mathbf{y}_t = \mathbf{W}_{hy}\,\mathbf{h}_t + \mathbf{b}_y$$

Backpropagation Through Time (BPTT)

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T}\sum_{k=1}^{t} \frac{\partial \mathcal{L}_t}{\partial \mathbf{h}_t}\left(\prod_{j=k+1}^{t}\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}\right)\frac{\partial \mathbf{h}_k}{\partial \mathbf{W}_{hh}}$$

Vanishing/Exploding Gradient Problem

The product of Jacobians $\prod_j \frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}$ can shrink or grow exponentially:

$$\left\|\prod_{j=k+1}^{t}\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}\right\| \leq \left(\|\mathbf{W}_{hh}\| \cdot \gamma\right)^{t-k}$$

Where $\gamma = \max |f'(z)|$. If $\|\mathbf{W}_{hh}\| \cdot \gamma < 1$, gradients vanish; if $> 1$, they explode.
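A small numerical illustration of this bound: scaling a random recurrent matrix below or above the critical regime makes the Jacobian-product norm collapse or blow up over 50 steps (dimensions and scales are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def jacobian_product_norm(scale, steps=50):
    """Norm of prod_j diag(tanh'(z_j)) W_hh for a random recurrent matrix."""
    W = scale * rng.normal(size=(d, d)) / np.sqrt(d)
    J = np.eye(d)
    for _ in range(steps):
        z = rng.normal(size=d)                     # random pre-activations
        J = (np.diag(1 - np.tanh(z) ** 2) @ W) @ J  # one BPTT Jacobian step
    return np.linalg.norm(J)

small = jacobian_product_norm(scale=0.5)   # contractive: gradients vanish
large = jacobian_product_norm(scale=4.0)   # expansive: gradients explode
```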

10

Long Short-Term Memory (LSTM)

Gated RNN architecture solving the vanishing gradient problem.

Supervised Sequence Modeling 1997 — Hochreiter & Schmidhuber

Gate Equations

$$\mathbf{f}_t = \sigma\!\left(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f\right) \quad \text{(forget gate)}$$ $$\mathbf{i}_t = \sigma\!\left(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i\right) \quad \text{(input gate)}$$ $$\tilde{\mathbf{c}}_t = \tanh\!\left(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c\right) \quad \text{(candidate)}$$ $$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \quad \text{(cell state)}$$ $$\mathbf{o}_t = \sigma\!\left(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o\right) \quad \text{(output gate)}$$ $$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$

Gradient Flow through Cell State

The cell state provides a highway for gradients:

$$\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \text{diag}(\mathbf{f}_t)$$ $$\frac{\partial \mathbf{c}_T}{\partial \mathbf{c}_k} = \prod_{j=k+1}^{T}\text{diag}(\mathbf{f}_j)$$

When $\mathbf{f}_t \approx 1$, gradients flow unattenuated over many timesteps.

Parameter Count

$$\text{Params} = 4\left[(d_h + d_x)\cdot d_h + d_h\right]$$
11

Gated Recurrent Unit (GRU)

A simplified gating mechanism merging cell and hidden state.

Supervised Sequence Modeling 2014 — Cho et al.

Gate Equations

$$\mathbf{r}_t = \sigma\!\left(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r\right) \quad \text{(reset gate)}$$ $$\mathbf{z}_t = \sigma\!\left(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z\right) \quad \text{(update gate)}$$ $$\tilde{\mathbf{h}}_t = \tanh\!\left(\mathbf{W}_h[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_h\right)$$ $$\mathbf{h}_t = (1 - \mathbf{z}_t)\odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$

Parameter Count

$$\text{Params} = 3\left[(d_h + d_x)\cdot d_h + d_h\right]$$

GRU uses 25% fewer parameters than an LSTM of the same dimensions (three gated weight blocks versus four).
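Both counts as one-line helpers (the 0.75 ratio is exact, since the per-block shapes are identical):

```python
def lstm_params(d_x, d_h):
    """LSTM: 4 blocks, each (d_h + d_x) x d_h weights plus d_h biases."""
    return 4 * ((d_h + d_x) * d_h + d_h)

def gru_params(d_x, d_h):
    """GRU: 3 blocks of the same shape."""
    return 3 * ((d_h + d_x) * d_h + d_h)

# illustrative dimensions
assert lstm_params(d_x=128, d_h=256) == 394240
assert gru_params(128, 256) / lstm_params(128, 256) == 0.75  # 25% fewer
```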

12

Extended LSTM (xLSTM)

Modernized LSTM with exponential gating and matrix memory for LLM-scale performance.

Supervised Sequence Modeling 2024 — Beck et al.

sLSTM (Scalar Memory)

Extends LSTM with exponential gating and a normalizer state for numerical stability:

$$\mathbf{f}_t = \exp\!\left(\mathbf{w}_f^\top \mathbf{x}_t + b_f\right) \quad \text{(exponential forget gate)}$$ $$\mathbf{i}_t = \exp\!\left(\mathbf{w}_i^\top \mathbf{x}_t + b_i\right) \quad \text{(exponential input gate)}$$ $$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$ $$\mathbf{n}_t = \mathbf{f}_t \odot \mathbf{n}_{t-1} + \mathbf{i}_t \quad \text{(normalizer state)}$$ $$\mathbf{h}_t = \mathbf{o}_t \odot \frac{\mathbf{c}_t}{\mathbf{n}_t}$$

mLSTM (Matrix Memory)

Replaces the scalar cell state with a matrix $\mathbf{C}_t \in \mathbb{R}^{d \times d}$, enabling key-value storage:

$$\mathbf{k}_t = \mathbf{W}_k\mathbf{x}_t, \quad \mathbf{v}_t = \mathbf{W}_v\mathbf{x}_t, \quad \mathbf{q}_t = \mathbf{W}_q\mathbf{x}_t$$ $$\mathbf{C}_t = f_t\,\mathbf{C}_{t-1} + i_t\,\mathbf{v}_t\mathbf{k}_t^\top$$ $$\mathbf{n}_t = f_t\,\mathbf{n}_{t-1} + i_t\,\mathbf{k}_t$$ $$\mathbf{h}_t = \mathbf{o}_t \odot \frac{\mathbf{C}_t\,\mathbf{q}_t}{\max(|\mathbf{n}_t^\top\mathbf{q}_t|, 1)}$$

mLSTM is fully parallelizable (no hidden-to-hidden recurrence) and can be viewed as a linearized self-attention with a decay factor.

13

Bidirectional RNN

Processing sequences in both forward and backward directions.

Sequence Modeling 1997 — Schuster & Paliwal

Architecture

$$\overrightarrow{\mathbf{h}}_t = f\!\left(\mathbf{W}_{\overrightarrow{h}}\,\overrightarrow{\mathbf{h}}_{t-1} + \mathbf{W}_{x\overrightarrow{h}}\,\mathbf{x}_t + \mathbf{b}_{\overrightarrow{h}}\right)$$ $$\overleftarrow{\mathbf{h}}_t = f\!\left(\mathbf{W}_{\overleftarrow{h}}\,\overleftarrow{\mathbf{h}}_{t+1} + \mathbf{W}_{x\overleftarrow{h}}\,\mathbf{x}_t + \mathbf{b}_{\overleftarrow{h}}\right)$$ $$\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t;\, \overleftarrow{\mathbf{h}}_t] \in \mathbb{R}^{2d_h}$$ $$\mathbf{y}_t = \mathbf{W}_y\,\mathbf{h}_t + \mathbf{b}_y$$
14

Attention Mechanism

Learning to focus on relevant parts of the input.

Core Mechanism 2014 — Bahdanau et al.

Additive (Bahdanau) Attention

$$e_{ij} = \mathbf{v}^\top \tanh\!\left(\mathbf{W}_1\mathbf{h}_i + \mathbf{W}_2\mathbf{s}_j\right)$$ $$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k}\exp(e_{kj})}$$ $$\mathbf{c}_j = \sum_i \alpha_{ij}\,\mathbf{h}_i$$

Multiplicative (Luong) Attention

$$e_{ij} = \mathbf{s}_j^\top \mathbf{W} \mathbf{h}_i \quad\text{(general)}$$ $$e_{ij} = \mathbf{s}_j^\top \mathbf{h}_i \quad\text{(dot)}$$

Scaled Dot-Product Attention

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$$

The $\sqrt{d_k}$ scaling prevents softmax saturation when dot products grow large.
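A NumPy sketch of scaled dot-product attention, with an optional causal mask folded in (the max-subtraction is a standard softmax stabilization, not part of the formula):

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # future positions (j > i) get -inf before the softmax
        scores = np.where(np.tril(np.ones_like(scores)) == 1,
                          scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out, w = attention(Q, K, V, causal=True)
```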

15

Transformer

Attention-only architecture that revolutionized NLP and beyond.

Supervised / Self-Supervised NLP / Vision / Multimodal 2017 — Vaswani et al.

Multi-Head Attention

$$\mathbf{Q}_i = \mathbf{X}\mathbf{W}_i^Q,\quad \mathbf{K}_i = \mathbf{X}\mathbf{W}_i^K,\quad \mathbf{V}_i = \mathbf{X}\mathbf{W}_i^V$$ $$\text{head}_i = \text{Attention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i)$$ $$\text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,\mathbf{W}^O$$

Where $\mathbf{W}_i^Q, \mathbf{W}_i^K \in \mathbb{R}^{d \times d_k}$, $\mathbf{W}_i^V \in \mathbb{R}^{d \times d_v}$, $\mathbf{W}^O \in \mathbb{R}^{hd_v \times d}$.

Sinusoidal Positional Encoding

$$\text{PE}_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right)$$ $$\text{PE}_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
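The sinusoidal encoding as a NumPy function (assumes an even model dimension $d$):

```python
import numpy as np

def sinusoidal_pe(n_pos, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(angles)   # even indices
    pe[:, 1::2] = np.cos(angles)   # odd indices
    return pe

pe = sinusoidal_pe(50, 16)
```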

Rotary Positional Embedding (RoPE)

$$\mathbf{R}_\theta^{(m)} = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 \\ \sin m\theta_1 & \cos m\theta_1 \\ & & \cos m\theta_2 & -\sin m\theta_2 \\ & & \sin m\theta_2 & \cos m\theta_2 \\ & & & & \ddots \end{pmatrix}$$ $$\mathbf{q}_m^\top \mathbf{k}_n = (\mathbf{R}_\theta^{(m)}\mathbf{W}_q \mathbf{x}_m)^\top (\mathbf{R}_\theta^{(n)}\mathbf{W}_k \mathbf{x}_n)$$

Feed-Forward Network (per position)

$$\text{FFN}(\mathbf{x}) = \mathbf{W}_2\,\text{GELU}(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$

Encoder Block

$$\mathbf{x}' = \text{LayerNorm}(\mathbf{x} + \text{MultiHead}(\mathbf{x}))$$ $$\mathbf{x}'' = \text{LayerNorm}(\mathbf{x}' + \text{FFN}(\mathbf{x}'))$$

Decoder Block (with causal mask)

The causal mask $\mathbf{M}$ sets future positions to $-\infty$ before softmax:

$$M_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{if } i < j \end{cases}$$ $$\text{CausalAttn}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}} + \mathbf{M}\right)\mathbf{V}$$

Grouped-Query Attention (GQA)

Shares key-value heads across groups of query heads to reduce memory:

$$\text{KV heads} = \frac{h}{G}, \quad \text{Each KV head serves } G \text{ query heads}$$

Computational Complexity

$$\text{Self-Attention:}\quad O(n^2 \cdot d)$$ $$\text{FFN:}\quad O(n \cdot d^2)$$ $$\text{Total per layer:}\quad O(n^2 d + n d^2)$$
16

BERT (Encoder-Only Transformer)

Bidirectional encoder pre-trained with masked language modeling — the foundation for NLU tasks.

Self-Supervised → Supervised NLU / Classification / NER 2018 — Devlin et al.

Masked Language Modeling (MLM)

Randomly mask 15% of input tokens and predict the originals:

$$\mathcal{L}_{\text{MLM}} = -\mathbb{E}\!\left[\sum_{i \in \mathcal{M}} \log P_\theta(x_i | \mathbf{x}_{\backslash\mathcal{M}})\right]$$

Of the 15% selected: 80% are replaced with [MASK], 10% with a random token, 10% unchanged.
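A sketch of the 80/10/10 corruption in NumPy; `mask_id=103` matches BERT's WordPiece [MASK] id but is otherwise a placeholder, and the vocabulary size is illustrative:

```python
import numpy as np

def mlm_corrupt(tokens, vocab_size, mask_id, rng, p_select=0.15):
    """Select ~15% of positions; 80% -> [MASK], 10% -> random, 10% kept."""
    tokens = tokens.copy()
    selected = rng.random(len(tokens)) < p_select
    r = rng.random(len(tokens))
    mask_pos = selected & (r < 0.8)
    rand_pos = selected & (r >= 0.8) & (r < 0.9)
    tokens[mask_pos] = mask_id
    tokens[rand_pos] = rng.integers(0, vocab_size, size=rand_pos.sum())
    return tokens, selected        # loss is computed at `selected` only

rng = np.random.default_rng(0)
tokens = rng.integers(0, 30000, size=512)
corrupted, labels_at = mlm_corrupt(tokens, vocab_size=30000,
                                   mask_id=103, rng=rng)
```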

Next Sentence Prediction (NSP)

$$P(\text{IsNext} | [\text{CLS}]\, A\, [\text{SEP}]\, B) = \sigma(\mathbf{w}^\top \mathbf{h}_{[\text{CLS}]} + b)$$

Input Representation

$$\mathbf{h}_0 = \mathbf{E}_{\text{tok}}(\mathbf{x}) + \mathbf{E}_{\text{pos}} + \mathbf{E}_{\text{seg}}$$

Segment embeddings distinguish sentence A from B. The [CLS] token representation is used for classification tasks.

Fine-Tuning

$$\text{Classification:}\quad \hat{y} = \text{softmax}(\mathbf{W}\,\mathbf{h}_{[\text{CLS}]} + \mathbf{b})$$ $$\text{Token-level (NER):}\quad \hat{y}_i = \text{softmax}(\mathbf{W}\,\mathbf{h}_i + \mathbf{b})$$
| Model | Layers | Hidden | Heads | Params |
| --- | --- | --- | --- | --- |
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
| RoBERTa | 24 | 1024 | 16 | 355M |
17

Sequence-to-Sequence (Encoder-Decoder)

Mapping variable-length input sequences to variable-length output sequences.

Supervised Translation / Summarization 2014 — Sutskever et al.

RNN-Based Seq2Seq

$$\text{Encoder:}\quad \mathbf{h}_t^{\text{enc}} = f_{\text{enc}}(\mathbf{x}_t, \mathbf{h}_{t-1}^{\text{enc}})$$ $$\mathbf{c} = \mathbf{h}_T^{\text{enc}} \quad \text{(context vector = final encoder state)}$$ $$\text{Decoder:}\quad \mathbf{h}_t^{\text{dec}} = f_{\text{dec}}(y_{t-1}, \mathbf{h}_{t-1}^{\text{dec}}, \mathbf{c})$$ $$P(y_t \mid y_{<t}, \mathbf{x}) = \text{softmax}\!\left(\mathbf{W}_o\,\mathbf{h}_t^{\text{dec}} + \mathbf{b}_o\right)$$

Seq2Seq with Attention

$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_j\exp(e_{tj})}, \quad e_{ti} = \text{score}(\mathbf{h}_t^{\text{dec}}, \mathbf{h}_i^{\text{enc}})$$ $$\mathbf{c}_t = \sum_i \alpha_{ti}\,\mathbf{h}_i^{\text{enc}}$$ $$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_c[\mathbf{c}_t;\,\mathbf{h}_t^{\text{dec}}])$$

Transformer Encoder-Decoder (T5 / BART)

$$\text{Encoder:}\quad \mathbf{H}^{\text{enc}} = \text{TransformerEncoder}(\mathbf{x})$$ $$\text{Decoder layer:}\quad \mathbf{h}' = \text{CausalSelfAttn}(\mathbf{h}) + \mathbf{h}$$ $$\mathbf{h}'' = \text{CrossAttn}(\mathbf{h}', \mathbf{H}^{\text{enc}}) + \mathbf{h}'$$ $$\mathbf{h}''' = \text{FFN}(\mathbf{h}'') + \mathbf{h}''$$

Teacher Forcing

$$\text{Training:}\quad \hat{y}_t = f(y_{t-1}^{\text{gold}}, \mathbf{h}_{t-1}) \quad \text{(use ground truth as input)}$$ $$\text{Inference:}\quad \hat{y}_t = f(\hat{y}_{t-1}, \mathbf{h}_{t-1}) \quad \text{(use model's own prediction)}$$
18

Vision Transformer (ViT)

Applying the transformer architecture directly to image patches.

Supervised / Self-Supervised Computer Vision 2020 — Dosovitskiy et al.

Patch Embedding

An image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ is split into $N$ patches of size $P \times P$:

$$N = \frac{H \cdot W}{P^2}$$ $$\mathbf{x}_p^{(i)} \in \mathbb{R}^{P^2 \cdot C} \quad \text{(flattened patch } i\text{)}$$ $$\mathbf{z}_0^{(i)} = \mathbf{x}_p^{(i)}\,\mathbf{E} + \mathbf{e}_{\text{pos}}^{(i)}, \quad \mathbf{E} \in \mathbb{R}^{(P^2 C) \times d}$$

CLS Token

$$\mathbf{z}_0 = [\mathbf{x}_{\text{cls}};\; \mathbf{z}_0^{(1)};\; \mathbf{z}_0^{(2)};\; \dots;\; \mathbf{z}_0^{(N)}] + \mathbf{E}_{\text{pos}}$$ $$\hat{y} = \text{MLP}(\text{LayerNorm}(\mathbf{z}_L^{(0)}))$$

Full Forward Pass

$$\mathbf{z}'_l = \text{MSA}(\text{LN}(\mathbf{z}_{l-1})) + \mathbf{z}_{l-1}$$ $$\mathbf{z}_l = \text{FFN}(\text{LN}(\mathbf{z}'_l)) + \mathbf{z}'_l$$

Variants

| Model | Patch Size | Layers | Hidden | Heads | Params |
| --- | --- | --- | --- | --- | --- |
| ViT-B/16 | 16 | 12 | 768 | 12 | 86M |
| ViT-L/16 | 16 | 24 | 1024 | 16 | 307M |
| ViT-H/14 | 14 | 32 | 1280 | 16 | 632M |
19

RWKV (Receptance Weighted Key Value)

Linear-complexity RNN that matches transformer quality — trainable like a transformer, runs like an RNN.

Self-Supervised Language Modeling 2023 — Peng et al.

Time Mixing (Attention Replacement)

$$\mathbf{r}_t = \mathbf{W}_r(\mu_r \odot \mathbf{x}_t + (1-\mu_r)\odot\mathbf{x}_{t-1})$$ $$\mathbf{k}_t = \mathbf{W}_k(\mu_k \odot \mathbf{x}_t + (1-\mu_k)\odot\mathbf{x}_{t-1})$$ $$\mathbf{v}_t = \mathbf{W}_v(\mu_v \odot \mathbf{x}_t + (1-\mu_v)\odot\mathbf{x}_{t-1})$$

WKV Mechanism (Linear Attention)

$$\text{wkv}_t = \frac{\sum_{i=1}^{t-1}e^{-(t-1-i)w+k_i}\mathbf{v}_i + e^{u+k_t}\mathbf{v}_t}{\sum_{i=1}^{t-1}e^{-(t-1-i)w+k_i} + e^{u+k_t}}$$ $$\mathbf{o}_t = \sigma(\mathbf{r}_t) \odot \text{wkv}_t$$

Where $w$ is a learned decay vector and $u$ is a learned bonus for the current token. This can be computed recurrently in $O(1)$ per step.

Channel Mixing (FFN Replacement)

$$\mathbf{r}_t' = \sigma(\mathbf{W}_{r'}(\mu_{r'}\odot\mathbf{x}_t + (1-\mu_{r'})\odot\mathbf{x}_{t-1}))$$ $$\mathbf{k}_t' = \mathbf{W}_{k'}(\mu_{k'}\odot\mathbf{x}_t + (1-\mu_{k'})\odot\mathbf{x}_{t-1})$$ $$\mathbf{o}_t = \mathbf{r}_t' \odot (\mathbf{W}_v'\,\max(\mathbf{k}_t', 0)^2)$$

Complexity

$$\text{Training:}\quad O(Td) \quad\text{(parallelizable like transformer)}$$ $$\text{Inference:}\quad O(d) \text{ per token} \quad\text{(constant, like RNN)}$$
20

Autoencoder

Learning compressed representations via reconstruction.

Unsupervised Representation Learning

Architecture

$$\text{Encoder:}\quad \mathbf{z} = f_\phi(\mathbf{x}) = \sigma(\mathbf{W}_e\mathbf{x} + \mathbf{b}_e)$$ $$\text{Decoder:}\quad \hat{\mathbf{x}} = g_\theta(\mathbf{z}) = \sigma(\mathbf{W}_d\mathbf{z} + \mathbf{b}_d)$$ $$\mathcal{L} = \|\mathbf{x} - \hat{\mathbf{x}}\|^2$$

Denoising Autoencoder

$$\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2\mathbf{I})$$ $$\mathcal{L}_{\text{DAE}} = \|\mathbf{x} - g_\theta(f_\phi(\tilde{\mathbf{x}}))\|^2$$

Sparse Autoencoder

$$\mathcal{L}_{\text{sparse}} = \|\mathbf{x} - \hat{\mathbf{x}}\|^2 + \lambda \sum_j \text{KL}(\rho \,\|\, \hat{\rho}_j)$$ $$\text{KL}(\rho\,\|\,\hat{\rho}_j) = \rho\log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}$$
21

Variational Autoencoder (VAE)

Probabilistic generative model with a learned latent space.

Generative Latent Variable Model 2013 — Kingma & Welling

Generative Model

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z}, \quad p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Evidence Lower Bound (ELBO)

$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \text{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) = \text{ELBO}$$

Reparameterization Trick

$$q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}\!\left(\boldsymbol{\mu}_\phi(\mathbf{x}),\, \text{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x}))\right)$$ $$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

KL Divergence (Closed Form for Gaussians)

$$\text{KL}(q\,\|\,p) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
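The reparameterization trick and the closed-form KL as NumPy helpers; the example sets $q$ equal to the prior, so the KL is exactly zero:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps keeps sampling differentiable in (mu, sigma)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) per sample."""
    return -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu, log_var = np.zeros(8), np.zeros(8)   # q already equals the prior
z = reparameterize(mu, log_var, rng)
```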

Loss Function

$$\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{q_\phi}\!\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] + \text{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z})\right)$$
22

Generative Adversarial Network (GAN)

Two networks competing in a minimax game to generate realistic data.

Generative Image Synthesis 2014 — Goodfellow et al.

Minimax Objective

$$\min_G \max_D\; V(D, G) = \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\!\left[\log D(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z}\sim p_z}\!\left[\log(1 - D(G(\mathbf{z})))\right]$$

Optimal Discriminator

$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}$$

Global Optimum

At the Nash equilibrium, $p_g = p_{\text{data}}$ and $D^*(\mathbf{x}) = \frac{1}{2}$:

$$V(D^*, G^*) = -\log 4$$ $$C(G) = -\log 4 + 2 \cdot \text{JSD}(p_{\text{data}} \,\|\, p_g)$$

Wasserstein GAN (WGAN)

$$W(p_{\text{data}}, p_g) = \sup_{\|f\|_L \leq 1}\; \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[f(\mathbf{x})] - \mathbb{E}_{\mathbf{x}\sim p_g}[f(\mathbf{x})]$$

The critic $f$ (replacing the discriminator) is enforced to be 1-Lipschitz via gradient penalty:

$$\mathcal{L}_{\text{GP}} = \lambda\,\mathbb{E}_{\hat{\mathbf{x}}}\!\left[\left(\|\nabla_{\hat{\mathbf{x}}} f(\hat{\mathbf{x}})\|_2 - 1\right)^2\right]$$ $$\hat{\mathbf{x}} = \alpha\,\mathbf{x}_{\text{real}} + (1-\alpha)\,\mathbf{x}_{\text{fake}},\quad \alpha\sim U[0,1]$$
23

Diffusion Models (DDPM)

Generating data by learning to reverse a gradual noising process.

Generative Image / Audio / Video 2020 — Ho, Jain, Abbeel

Forward Process (Diffusion)

$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t\mathbf{I}\right)$$ $$q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1-\bar{\alpha}_t)\mathbf{I}\right)$$

Where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$.
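Sampling $\mathbf{x}_t \sim q(\mathbf{x}_t \mid \mathbf{x}_0)$ directly via $\bar{\alpha}_t$, using DDPM's linear $\beta$ schedule (1000 steps, $10^{-4}$ to $0.02$); the data here is random noise standing in for an image:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)[t]             # abar_t = prod_s alpha_s
    eps = rng.normal(size=x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # DDPM linear schedule
x0 = rng.normal(size=64)
x_T, eps = forward_diffuse(x0, t=999, betas=betas, rng=rng)
```

At $t = 999$, $\bar{\alpha}_t$ is nearly zero, so $\mathbf{x}_T$ is essentially pure noise.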

Reverse Process

$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \sigma_t^2\mathbf{I}\right)$$ $$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right)$$

Training Objective (Simplified)

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\!\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t\right)\right\|^2\right]$$

Score-Based Formulation

$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) = -\sqrt{1-\bar{\alpha}_t}\,\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t) = -\sqrt{1-\bar{\alpha}_t}\,\mathbf{s}_\theta(\mathbf{x}_t, t)$$

Classifier-Free Guidance

$$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, c) = (1+w)\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) - w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)$$
24

Normalizing Flows

Exact likelihood models using invertible transformations.

Generative Exact Likelihood

Change of Variables

$$\mathbf{x} = f(\mathbf{z}), \quad \mathbf{z} = f^{-1}(\mathbf{x})$$ $$\log p_\mathbf{x}(\mathbf{x}) = \log p_\mathbf{z}(f^{-1}(\mathbf{x})) + \log\left|\det\frac{\partial f^{-1}}{\partial \mathbf{x}}\right|$$

Composition of Flows

$$\mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z})$$ $$\log p(\mathbf{x}) = \log p(\mathbf{z}) - \sum_{k=1}^{K}\log\left|\det\frac{\partial f_k}{\partial \mathbf{h}_{k-1}}\right|$$

Coupling Layer (RealNVP)

$$\mathbf{x}_{1:d} = \mathbf{z}_{1:d}$$ $$\mathbf{x}_{d+1:D} = \mathbf{z}_{d+1:D} \odot \exp\!\left(s(\mathbf{z}_{1:d})\right) + t(\mathbf{z}_{1:d})$$

The Jacobian is triangular, so $\det = \prod \exp(s_i) = \exp(\sum s_i)$, computed in $O(D)$.
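An affine coupling layer in NumPy, with toy `s` and `t` functions standing in for the small networks RealNVP would learn; the inverse and log-determinant fall out exactly as the formulas state:

```python
import numpy as np

def coupling_forward(z, s, t):
    """x_{1:d} = z_{1:d}; x_{d+1:D} = z_{d+1:D} * exp(s(z_{1:d})) + t(z_{1:d})."""
    d = len(z) // 2
    z1, z2 = z[:d], z[d:]
    x2 = z2 * np.exp(s(z1)) + t(z1)
    log_det = np.sum(s(z1))          # triangular Jacobian: sum of s_i
    return np.concatenate([z1, x2]), log_det

def coupling_inverse(x, s, t):
    """Exact inverse: subtract t, divide by exp(s)."""
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    z2 = (x2 - t(x1)) * np.exp(-s(x1))
    return np.concatenate([x1, z2])

# toy s, t (stand-ins for learned MLPs)
s = lambda h: np.tanh(h)
t = lambda h: 0.5 * h
z = np.array([0.3, -1.2, 0.7, 2.0])
x, log_det = coupling_forward(z, s, t)
```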

25

Energy-Based Models

Defining probability distributions via scalar energy functions.

Generative Unnormalized

Energy Function

$$p_\theta(\mathbf{x}) = \frac{\exp(-E_\theta(\mathbf{x}))}{Z_\theta}, \quad Z_\theta = \int \exp(-E_\theta(\mathbf{x}))\,d\mathbf{x}$$

Score Matching

$$\mathcal{L}_{\text{SM}} = \mathbb{E}_{p_{\text{data}}}\!\left[\frac{1}{2}\|\nabla_\mathbf{x} \log p_\theta(\mathbf{x})\|^2 + \text{tr}(\nabla^2_\mathbf{x} \log p_\theta(\mathbf{x}))\right]$$

Contrastive Divergence

$$\nabla_\theta \log p_\theta(\mathbf{x}) = -\nabla_\theta E_\theta(\mathbf{x}) + \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(\mathbf{x})]$$ $$\approx -\nabla_\theta E_\theta(\mathbf{x}_{\text{data}}) + \nabla_\theta E_\theta(\tilde{\mathbf{x}})$$

Where $\tilde{\mathbf{x}}$ is obtained from a few steps of MCMC starting from data.

26

Siamese Networks & Contrastive Learning

Learning representations by comparing pairs or groups of inputs — foundation of CLIP, SimCLR, and modern self-supervised vision.

Self-Supervised Representation Learning 1993 — Bromley et al. / 2020 — Chen et al.

Siamese Network

Two identical networks sharing weights process two inputs and compare their embeddings:

$$\mathbf{z}_1 = f_\theta(\mathbf{x}_1), \quad \mathbf{z}_2 = f_\theta(\mathbf{x}_2)$$ $$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{z}_1 - \mathbf{z}_2\|_2$$

Contrastive Loss

$$\mathcal{L}_{\text{contrastive}} = (1-y)\frac{1}{2}d^2 + y\frac{1}{2}\max(0, m - d)^2$$

Where $y=0$ for similar pairs, $y=1$ for dissimilar, and $m$ is the margin.

Triplet Loss

$$\mathcal{L}_{\text{triplet}} = \max\!\left(0,\; \|f(\mathbf{x}_a) - f(\mathbf{x}_p)\|^2 - \|f(\mathbf{x}_a) - f(\mathbf{x}_n)\|^2 + \alpha\right)$$

NT-Xent Loss (SimCLR)

Normalized temperature-scaled cross-entropy over a batch of $2N$ augmented pairs:

$$\text{sim}(\mathbf{z}_i, \mathbf{z}_j) = \frac{\mathbf{z}_i^\top\mathbf{z}_j}{\|\mathbf{z}_i\|\,\|\mathbf{z}_j\|}$$ $$\ell_{i,j} = -\log\frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j)/\tau)}{\sum_{k=1}^{2N}\mathbf{1}_{[k\neq i]}\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_k)/\tau)}$$

CLIP (Contrastive Language-Image Pre-training)

Aligns image and text embeddings using a symmetric contrastive loss over a batch of $N$ image-text pairs:

$$\mathbf{z}_I = f_{\text{image}}(\mathbf{x}_I), \quad \mathbf{z}_T = f_{\text{text}}(\mathbf{x}_T)$$ $$\text{logits} = \mathbf{Z}_I\,\mathbf{Z}_T^\top \cdot e^\tau$$ $$\mathcal{L}_{\text{CLIP}} = \frac{1}{2}\left(\text{CE}(\text{logits},\, \mathbf{I}_N) + \text{CE}(\text{logits}^\top,\, \mathbf{I}_N)\right)$$

BYOL / SimSiam (No Negatives)

$$\mathcal{L}_{\text{BYOL}} = 2 - 2\cdot\frac{\langle q_\theta(\mathbf{z}_1),\, \text{sg}(\mathbf{z}_2')\rangle}{\|q_\theta(\mathbf{z}_1)\|\,\|\mathbf{z}_2'\|}$$

Where $\text{sg}(\cdot)$ is stop-gradient, $\mathbf{z}_2'$ comes from an EMA target encoder, and $q_\theta$ is a predictor MLP.

27

JEPA (Joint Embedding Predictive Architecture)

Yann LeCun's proposed path to human-level AI — predicting in latent space rather than pixel space.

Self-Supervised Representation Learning / World Models 2022 — LeCun / 2023 — Assran et al. (I-JEPA)

Core Principle

Unlike generative models (which predict pixels) or contrastive models (which compare positive/negative pairs), JEPA predicts the representation of a target from a context — entirely in embedding space:

$$\text{Generative:}\quad \text{predict } \mathbf{x} \;\text{(pixel space)}$$ $$\text{Contrastive:}\quad \text{maximize } \text{sim}(f(\mathbf{x}), f(\mathbf{x}^+)) \text{ vs } f(\mathbf{x}^-)$$ $$\text{JEPA:}\quad \text{predict } \bar{f}(\mathbf{y}) \text{ from } f(\mathbf{x}) \;\text{(latent space)}$$

Architecture

$$\mathbf{s}_x = f_\theta(\mathbf{x}) \quad \text{(context encoder)}$$ $$\bar{\mathbf{s}}_y = \bar{f}_{\bar{\theta}}(\mathbf{y}) \quad \text{(target encoder — EMA of } f_\theta\text{)}$$ $$\hat{\mathbf{s}}_y = g_\phi(\mathbf{s}_x, \mathbf{m}) \quad \text{(predictor, conditioned on mask } \mathbf{m}\text{)}$$

Loss Function

$$\mathcal{L}_{\text{JEPA}} = \|\hat{\mathbf{s}}_y - \text{sg}(\bar{\mathbf{s}}_y)\|^2$$

Where $\text{sg}(\cdot)$ is stop-gradient. The target encoder $\bar{f}$ is updated via exponential moving average (EMA):

$$\bar{\theta} \leftarrow \alpha\,\bar{\theta} + (1-\alpha)\,\theta, \quad \alpha \in [0.996, 1)$$
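The EMA update is a one-liner in plain Python; starting the target at zero, it approaches the online weights at the rate $1 - \alpha^n$ (parameters here are flat lists for illustration):

```python
def ema_update(target, online, alpha=0.996):
    """EMA target-encoder update: the target trails the online weights,
    giving a slowly moving, stop-gradient prediction target."""
    return [alpha * t + (1 - alpha) * o for t, o in zip(target, online)]

target = [0.0, 1.0]
online = [1.0, 1.0]
for _ in range(100):
    target = ema_update(target, online)
```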

I-JEPA (Image JEPA)

The context encoder sees a partial view of the image (with masked patches), and the predictor must predict target block representations in latent space:

$$\mathbf{x}_{\text{context}} = \text{ViT}_\theta(\text{visible patches})$$ $$\hat{\mathbf{s}}_{y_m} = g_\phi(\mathbf{x}_{\text{context}}, \text{pos}(y_m)) \quad \forall\, m \in \text{target blocks}$$ $$\mathcal{L}_{\text{I-JEPA}} = \frac{1}{M}\sum_{m=1}^M \|\hat{\mathbf{s}}_{y_m} - \text{sg}(\bar{\mathbf{s}}_{y_m})\|^2$$

V-JEPA (Video JEPA)

Extends to video by masking spacetime tubes and predicting their latent representations:

$$\mathbf{x} \in \mathbb{R}^{T \times H \times W \times C} \rightarrow \text{mask spacetime tubes} \rightarrow \text{predict in latent space}$$

JEPA vs Other Paradigms

| Method | Prediction Space | Negatives? | Collapse Prevention |
|---|---|---|---|
| Autoencoder | Pixel / Input | No | Bottleneck |
| Contrastive (SimCLR) | Latent (similarity) | Yes | Negative pairs |
| BYOL / SimSiam | Latent | No | EMA + stop-gradient |
| JEPA | Latent (prediction) | No | EMA + stop-gradient + masking |
28

Graph Neural Networks

Neural networks operating on graph-structured data.

Supervised / Semi-Supervised Graphs & Networks

Message Passing Framework

$$\mathbf{m}_v^{(l)} = \bigoplus_{u \in \mathcal{N}(v)} M^{(l)}\!\left(\mathbf{h}_v^{(l)}, \mathbf{h}_u^{(l)}, \mathbf{e}_{vu}\right)$$ $$\mathbf{h}_v^{(l+1)} = U^{(l)}\!\left(\mathbf{h}_v^{(l)}, \mathbf{m}_v^{(l)}\right)$$

Graph Convolutional Network (GCN)

$$\mathbf{H}^{(l+1)} = \sigma\!\left(\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}\mathbf{H}^{(l)}\mathbf{W}^{(l)}\right)$$

Where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ (adjacency with self-loops), $\tilde{\mathbf{D}}_{ii} = \sum_j \tilde{A}_{ij}$.
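A single GCN propagation step can be sketched in NumPy on a tiny 3-node path graph (identity weight matrix for clarity):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: symmetric-normalized adjacency with self-loops, then ReLU."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)
    return np.maximum(0, D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)

# Path graph 0—1—2, two-dimensional node features, identity weights:
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.eye(3, 2)
W = np.eye(2)
H1 = gcn_layer(A, H, W)
```

Each output row mixes a node's own features with its neighbors', weighted by the degree normalization.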

Graph Attention Network (GAT)

$$e_{ij} = \text{LeakyReLU}\!\left(\mathbf{a}^\top [\mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_j]\right)$$ $$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}(i)}\exp(e_{ik})}$$ $$\mathbf{h}_i' = \sigma\!\left(\sum_{j\in\mathcal{N}(i)}\alpha_{ij}\,\mathbf{W}\mathbf{h}_j\right)$$

GraphSAGE

$$\mathbf{h}_{\mathcal{N}(v)}^{(l)} = \text{AGGREGATE}\!\left(\left\{\mathbf{h}_u^{(l)}: u \in \mathcal{N}(v)\right\}\right)$$ $$\mathbf{h}_v^{(l+1)} = \sigma\!\left(\mathbf{W}^{(l)}\cdot[\mathbf{h}_v^{(l)} \,\|\, \mathbf{h}_{\mathcal{N}(v)}^{(l)}]\right)$$

Graph Readout

$$\mathbf{h}_G = \text{READOUT}\!\left(\left\{\mathbf{h}_v^{(L)} : v \in G\right\}\right) = \frac{1}{|V|}\sum_{v\in V}\mathbf{h}_v^{(L)}$$
29

Capsule Networks

Encoding part-whole relationships with vector-valued capsules.

Supervised Computer Vision 2017 — Sabour, Frosst, Hinton

Squash Function

$$\text{squash}(\mathbf{s}_j) = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2}\cdot\frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}$$
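In NumPy (assuming a nonzero input, since the squash is undefined at the origin):

```python
import numpy as np

def squash(s):
    """Scale s to length in [0, 1): long vectors -> ~unit, short -> ~0."""
    n2 = float(np.dot(s, s))
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2)

short = squash(np.array([0.1, 0.0]))     # length shrinks toward 0
long_ = squash(np.array([100.0, 0.0]))   # length saturates just below 1
```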

Dynamic Routing

$$\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\,\mathbf{u}_i \quad \text{(prediction vectors)}$$ $$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \quad \text{(coupling coefficients)}$$ $$\mathbf{s}_j = \sum_i c_{ij}\,\hat{\mathbf{u}}_{j|i}, \quad \mathbf{v}_j = \text{squash}(\mathbf{s}_j)$$ $$b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j \quad \text{(routing update)}$$

Margin Loss

$$\mathcal{L}_k = T_k \max(0, m^+ - \|\mathbf{v}_k\|)^2 + \lambda(1-T_k)\max(0, \|\mathbf{v}_k\| - m^-)^2$$
30

Hopfield Network

Associative memory via energy minimization in a fully connected network.

Unsupervised Associative Memory 1982 — Hopfield

Energy Function

$$E = -\frac{1}{2}\sum_{i\neq j} w_{ij}\,s_i\,s_j - \sum_i \theta_i\,s_i = -\frac{1}{2}\mathbf{s}^\top\mathbf{W}\mathbf{s} - \boldsymbol{\theta}^\top\mathbf{s}$$

Hebbian Learning (Storage)

$$w_{ij} = \frac{1}{N}\sum_{\mu=1}^{P}\xi_i^\mu\,\xi_j^\mu, \quad w_{ii}=0$$ $$\mathbf{W} = \frac{1}{N}\sum_{\mu=1}^{P}\boldsymbol{\xi}^\mu (\boldsymbol{\xi}^\mu)^\top - \frac{P}{N}\mathbf{I}$$

Update Rule (Asynchronous)

$$s_i \leftarrow \text{sgn}\!\left(\sum_j w_{ij}\,s_j + \theta_i\right)$$

Storage Capacity

$$P_{\max} \approx \frac{N}{2\ln N}$$
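A small NumPy demo of Hebbian storage and asynchronous recall — a corrupted pattern falls back into its attractor (network size, pattern count, and corruption level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 64, 3
patterns = rng.choice([-1, 1], size=(P, N))   # stored +/-1 memories

# Hebbian storage with zeroed diagonal:
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0)

# Corrupt a stored pattern, then run asynchronous sign updates:
s = patterns[0].copy()
s[:8] *= -1                       # flip 8 of 64 bits
for _ in range(5):                # a few full sweeps
    for i in range(N):
        s[i] = 1 if W[i] @ s >= 0 else -1

recovered = s
```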

Modern Hopfield Network (2020)

$$E = -\text{lse}(\beta\,\boldsymbol{\Xi}^\top\boldsymbol{\xi}) + \frac{1}{2}\boldsymbol{\xi}^\top\boldsymbol{\xi} + \text{const}$$ $$\text{Update:}\quad \boldsymbol{\xi}_{\text{new}} = \boldsymbol{\Xi}\,\text{softmax}(\beta\,\boldsymbol{\Xi}^\top\boldsymbol{\xi})$$

This update rule is equivalent to the attention mechanism in transformers.

31

Boltzmann Machine

Stochastic neural network based on statistical mechanics.

Generative Stochastic 1985 — Hinton & Sejnowski

Energy Function

$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{v}^\top\mathbf{W}\mathbf{h} - \mathbf{b}^\top\mathbf{v} - \mathbf{c}^\top\mathbf{h} - \frac{1}{2}\mathbf{v}^\top\mathbf{L}\mathbf{v} - \frac{1}{2}\mathbf{h}^\top\mathbf{J}\mathbf{h}$$

Probability Distribution

$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\exp(-E(\mathbf{v}, \mathbf{h})), \quad Z = \sum_{\mathbf{v},\mathbf{h}}\exp(-E(\mathbf{v},\mathbf{h}))$$

Stochastic Update

$$p(s_i = 1 | \mathbf{s}_{-i}) = \sigma\!\left(\sum_j w_{ij} s_j + b_i\right)$$
32

Restricted Boltzmann Machine (RBM)

A bipartite Boltzmann machine enabling efficient training via Gibbs sampling.

Generative 2006 — Hinton

Energy

$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{v}^\top\mathbf{W}\mathbf{h} - \mathbf{b}^\top\mathbf{v} - \mathbf{c}^\top\mathbf{h}$$

Conditional Distributions

$$p(h_j = 1|\mathbf{v}) = \sigma\!\left(\mathbf{W}_{:,j}^\top\mathbf{v} + c_j\right)$$ $$p(v_i = 1|\mathbf{h}) = \sigma\!\left(\mathbf{W}_{i,:}\mathbf{h} + b_i\right)$$

Contrastive Divergence (CD-k)

$$\Delta \mathbf{W} = \eta\left(\langle\mathbf{v}\mathbf{h}^\top\rangle_{\text{data}} - \langle\mathbf{v}\mathbf{h}^\top\rangle_{\text{recon}}\right)$$
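A CD-1 training step for a tiny binary RBM, sketched in NumPy (dimensions and learning rate are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cd1_step(v0, W, b, c, rng, eta=0.1):
    """One CD-1 update: positive phase on the data, negative phase
    after a single Gibbs step v0 -> h0 -> v1 -> h1."""
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    W += eta * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += eta * (v0 - v1)
    c += eta * (ph0 - ph1)
    return W, b, c

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((4, 2))
b, c = np.zeros(4), np.zeros(2)
v = np.array([1.0, 1.0, 0.0, 0.0])     # the single "training" pattern
for _ in range(100):
    W, b, c = cd1_step(v, W, b, c, rng)
```

After training, the free energy of the training pattern should sit below that of unseen configurations.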

Free Energy

$$F(\mathbf{v}) = -\mathbf{b}^\top\mathbf{v} - \sum_j \log\!\left(1 + \exp(\mathbf{W}_{:,j}^\top\mathbf{v} + c_j)\right)$$
33

Radial Basis Function Network

Using radial basis functions as activation in a single hidden layer.

Supervised Function Approximation

Architecture

$$\phi_j(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right)$$ $$f(\mathbf{x}) = \sum_{j=1}^{K} w_j\,\phi_j(\mathbf{x}) + b = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}) + b$$

Training

Typically a two-phase process: (1) find centers $\boldsymbol{\mu}_j$ via k-means clustering; (2) solve for weights $\mathbf{w}$ via least squares:

$$\mathbf{w}^* = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{y}$$

Where $\Phi_{ij} = \phi_j(\mathbf{x}_i)$ is the interpolation matrix.
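The two-phase fit can be sketched in NumPy; here the centers are simply evenly spaced rather than fit by k-means, and `sigma` is hand-picked:

```python
import numpy as np

def rbf_design(X, centers, sigma):
    """Interpolation matrix: Phi[i, j] = phi_j(x_i)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

# Fit y = sin(x) with 10 fixed centers (k-means skipped for brevity):
X = np.linspace(0, 2 * np.pi, 50)[:, None]
y = np.sin(X[:, 0])
centers = np.linspace(0, 2 * np.pi, 10)[:, None]
Phi = rbf_design(X, centers, sigma=0.7)
w = np.linalg.lstsq(Phi, y, rcond=None)[0]   # least-squares readout
pred = Phi @ w
```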

34

Self-Organizing Map (SOM)

Unsupervised learning that maps high-dimensional data to a low-dimensional grid preserving topology.

Unsupervised Dimensionality Reduction 1982 — Kohonen

Best Matching Unit (BMU)

$$c = \arg\min_j \|\mathbf{x} - \mathbf{w}_j\|$$

Weight Update

$$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) + \eta(t)\,h_{cj}(t)\,\left(\mathbf{x}(t) - \mathbf{w}_j(t)\right)$$

Neighborhood Function

$$h_{cj}(t) = \exp\!\left(-\frac{\|r_c - r_j\|^2}{2\sigma(t)^2}\right)$$

Both $\eta(t)$ and $\sigma(t)$ decrease monotonically over training.
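One training step, sketched in NumPy for a 1-D map with linearly decaying $\eta$ and $\sigma$ (the schedules are illustrative):

```python
import numpy as np

def som_step(weights, grid, x, eta, sigma):
    """One SOM update: find the BMU, then pull it and its grid
    neighbors toward x with a Gaussian neighborhood."""
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
    h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma**2))
    return weights + eta * h[:, None] * (x - weights)

rng = np.random.default_rng(1)
grid = np.array([[i] for i in range(10)], dtype=float)   # 1-D map positions
weights = rng.random((10, 2))
for t in range(500):
    x = rng.random(2)                       # data uniform on [0,1]^2
    eta = 0.5 * (1 - t / 500)               # decaying learning rate
    sigma = 3.0 * (1 - t / 500) + 0.1       # shrinking neighborhood
    weights = som_step(weights, grid, x, eta, sigma)
```

Because each update is a convex pull toward a data point, the prototypes stay inside the data's bounding box.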

35

Residual Networks (ResNet)

Skip connections enabling training of very deep networks.

Supervised Computer Vision 2015 — He et al.

Residual Block

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$$

The network learns the residual $\mathcal{F}(\mathbf{x}) = \mathbf{y} - \mathbf{x}$ rather than the full mapping.

Bottleneck Block

$$\mathcal{F}(\mathbf{x}) = \mathbf{W}_3\,\text{ReLU}\!\left(\text{BN}\!\left(\mathbf{W}_2\,\text{ReLU}\!\left(\text{BN}(\mathbf{W}_1\mathbf{x})\right)\right)\right)$$

$\mathbf{W}_1$ reduces channels (1×1), $\mathbf{W}_2$ is 3×3 conv, $\mathbf{W}_3$ expands channels (1×1).

Gradient Flow

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L}\left(1 + \frac{\partial}{\partial \mathbf{x}_l}\sum_{i=l}^{L-1}\mathcal{F}(\mathbf{x}_i)\right)$$

The "1 +" term ensures gradients can flow directly to any layer without attenuation.

Pre-Activation ResNet

$$\mathbf{y} = \mathbf{x} + \mathbf{W}_2\,\text{ReLU}\!\left(\text{BN}\!\left(\mathbf{W}_1\,\text{ReLU}(\text{BN}(\mathbf{x}))\right)\right)$$
36

Neural Ordinary Differential Equations

Continuous-depth networks defined by differential equations.

Architecture 2018 — Chen et al.

Continuous Dynamics

$$\frac{d\mathbf{h}(t)}{dt} = f_\theta(\mathbf{h}(t), t)$$ $$\mathbf{h}(T) = \mathbf{h}(0) + \int_0^T f_\theta(\mathbf{h}(t), t)\,dt$$

Adjoint Method (Memory-Efficient Backprop)

$$\mathbf{a}(t) = -\frac{\partial \mathcal{L}}{\partial \mathbf{h}(t)}$$ $$\frac{d\mathbf{a}}{dt} = -\mathbf{a}(t)^\top \frac{\partial f_\theta}{\partial \mathbf{h}}$$ $$\frac{d\mathcal{L}}{d\theta} = -\int_T^0 \mathbf{a}(t)^\top \frac{\partial f_\theta(\mathbf{h}(t), t)}{\partial \theta}\,dt$$

Memory cost is $O(1)$ regardless of depth, since states are recomputed during the backward ODE solve.
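A fixed-step Euler solve makes the ResNet analogy concrete — each step is a residual update $\mathbf{h} \leftarrow \mathbf{h} + \Delta t\, f(\mathbf{h}, t)$. Here it is checked against the exact solution of $dh/dt = -h$:

```python
import numpy as np

def odeint_euler(f, h0, t0, t1, steps=100):
    """Fixed-step Euler solve of dh/dt = f(h, t) — the continuous
    analogue of stacking `steps` residual blocks."""
    h, t = np.asarray(h0, dtype=float), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(h, t)    # one "residual block"
        t += dt
    return h

# Linear dynamics dh/dt = -h has exact solution h(T) = h(0) * exp(-T):
f = lambda h, t: -h
hT = odeint_euler(f, [1.0], 0.0, 1.0, steps=1000)
```

Production implementations use adaptive solvers (e.g. Dormand–Prince) rather than fixed-step Euler.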

Connection to ResNets

$$\text{ResNet:}\quad \mathbf{h}_{t+1} = \mathbf{h}_t + f_\theta(\mathbf{h}_t) \quad\longleftrightarrow\quad \text{Neural ODE:}\quad \frac{d\mathbf{h}}{dt} = f_\theta(\mathbf{h}, t)$$
37

Echo State Network (Reservoir Computing)

A fixed random recurrent reservoir with only output weights trained.

Supervised Time Series 2001 — Jaeger

Reservoir Dynamics

$$\mathbf{h}(t) = (1-\alpha)\mathbf{h}(t-1) + \alpha\,\tanh\!\left(\mathbf{W}_{\text{res}}\mathbf{h}(t-1) + \mathbf{W}_{\text{in}}\mathbf{x}(t) + \mathbf{b}\right)$$

$\mathbf{W}_{\text{res}}$ and $\mathbf{W}_{\text{in}}$ are random and fixed. $\alpha$ is the leaking rate.

Output (Readout)

$$\mathbf{y}(t) = \mathbf{W}_{\text{out}}\,[\mathbf{h}(t);\, \mathbf{x}(t)]$$ $$\mathbf{W}_{\text{out}} = \mathbf{Y}\mathbf{H}^\top(\mathbf{H}\mathbf{H}^\top + \lambda\mathbf{I})^{-1}$$

Echo State Property

The reservoir must satisfy the echo state property: the effect of initial conditions fades over time. In practice this is achieved by rescaling $\mathbf{W}_{\text{res}}$ so that its spectral radius $\rho(\mathbf{W}_{\text{res}}) < 1$ — a standard heuristic, though not a strict guarantee for driven nonlinear reservoirs.
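A minimal ESN sketch in NumPy — fixed random reservoir, ridge-regression readout — on next-step prediction of a sine wave (reservoir size and regularization are illustrative; the leak term and input concatenation in the readout are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
N, washout, T = 100, 50, 500

# Fixed random reservoir, rescaled to spectral radius 0.9:
W_res = rng.standard_normal((N, N))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))
W_in = rng.standard_normal((N, 1))

# Task: predict x(t+1) from x(t) for a sine wave.
x = np.sin(0.2 * np.arange(T + 1))
H = np.zeros((N, T))
h = np.zeros(N)
for t in range(T):
    h = np.tanh(W_res @ h + W_in[:, 0] * x[t])
    H[:, t] = h

# Ridge-regression readout — the only trained part — after a washout:
Hs, ys = H[:, washout:], x[washout + 1:]
W_out = ys @ Hs.T @ np.linalg.inv(Hs @ Hs.T + 1e-6 * np.eye(N))
pred = W_out @ Hs
```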

38

Spiking Neural Network

Biologically plausible networks where neurons communicate via discrete spikes.

Neuromorphic Event-Driven

Leaky Integrate-and-Fire (LIF) Model

$$\tau_m \frac{dV(t)}{dt} = -[V(t) - V_{\text{rest}}] + R\,I(t)$$ $$\text{If } V(t) \geq V_{\text{th}}:\quad \text{emit spike, } V(t) \leftarrow V_{\text{reset}}$$

Discrete LIF

$$V[t] = \beta\,V[t-1] + \sum_j w_j\,S_j[t] - V_{\text{th}}\,S_{\text{out}}[t-1]$$ $$S_{\text{out}}[t] = \Theta(V[t] - V_{\text{th}})$$

Where $\beta = \exp(-\Delta t / \tau_m)$ is the decay factor and $\Theta$ is the Heaviside step function.
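The discrete LIF update in plain Python, using the subtractive reset from the equation above (input weight and threshold are illustrative):

```python
def lif_run(spikes_in, w, beta=0.9, v_th=1.0):
    """Discrete LIF with subtractive reset:
    V[t] = beta*V[t-1] + w*S_in[t] - v_th*S_out[t-1];
    the neuron spikes whenever V crosses the threshold."""
    v, out = 0.0, []
    for s in spikes_in:
        v = beta * v + w * s - v_th * (out[-1] if out else 0)
        out.append(1 if v >= v_th else 0)
    return out

# Constant drive: the membrane charges, fires, resets, and repeats.
spikes = lif_run([1] * 20, w=0.3)
```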

Surrogate Gradient

Since $\Theta'(x) = \delta(x)$ is not useful for backprop, replace with a smooth surrogate:

$$\tilde{\Theta}'(x) = \frac{1}{\pi}\cdot\frac{1}{1 + (\pi x)^2} \quad\text{(fast sigmoid surrogate)}$$

Spike-Timing-Dependent Plasticity (STDP)

$$\Delta w = \begin{cases} A_+ \exp\!\left(-\frac{\Delta t}{\tau_+}\right) & \text{if } \Delta t > 0 \text{ (pre before post)} \\ -A_- \exp\!\left(\frac{\Delta t}{\tau_-}\right) & \text{if } \Delta t < 0 \text{ (post before pre)} \end{cases}$$
39

Kolmogorov-Arnold Network (KAN)

Learnable activation functions on edges, based on the Kolmogorov-Arnold representation theorem.

Architecture 2024 — Liu et al.

Kolmogorov-Arnold Representation Theorem

$$f(\mathbf{x}) = f(x_1, \dots, x_n) = \sum_{q=0}^{2n}\Phi_q\!\left(\sum_{p=1}^n \phi_{q,p}(x_p)\right)$$

KAN Layer

Each edge $(i, j)$ has a learnable univariate function $\phi_{ij}$, parameterized by B-splines:

$$\phi_{ij}(x) = w_b\,\text{SiLU}(x) + w_s\,\text{Spline}(x)$$ $$\text{Spline}(x) = \sum_k c_k\,B_k(x)$$

Layer Computation

$$x_j^{(l+1)} = \sum_{i=1}^{n_l} \phi_{ij}^{(l)}(x_i^{(l)})$$

Compared to MLPs, which have fixed activations on nodes and learnable linear weights on edges, KANs place learnable nonlinear functions on edges and simple summation on nodes.

40

State Space Models (S4 / Mamba)

Sequence models based on continuous-time state space representations with efficient linear-time computation.

Sequence Modeling 2021 — Gu et al.

Continuous State Space

$$\frac{d\mathbf{h}(t)}{dt} = \mathbf{A}\,\mathbf{h}(t) + \mathbf{B}\,x(t)$$ $$y(t) = \mathbf{C}\,\mathbf{h}(t) + D\,x(t)$$

Discretization (Zero-Order Hold)

$$\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}) \approx (\mathbf{I} - \Delta\mathbf{A}/2)^{-1}(\mathbf{I} + \Delta\mathbf{A}/2)$$ $$\bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}(\bar{\mathbf{A}} - \mathbf{I})\cdot\Delta\mathbf{B}$$

Discrete Recurrence

$$\mathbf{h}_k = \bar{\mathbf{A}}\,\mathbf{h}_{k-1} + \bar{\mathbf{B}}\,x_k$$ $$y_k = \mathbf{C}\,\mathbf{h}_k + D\,x_k$$

Convolution Form

$$\bar{\mathbf{K}} = (\mathbf{C}\bar{\mathbf{B}},\; \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\; \dots,\; \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}})$$ $$\mathbf{y} = \bar{\mathbf{K}} * \mathbf{x}$$

Computed in $O(L \log L)$ via FFT during training.
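The recurrence and convolution views produce identical outputs, which a short NumPy check confirms (random stable diagonal $\bar{\mathbf{A}}$, $D = 0$ for simplicity, and a naive direct convolution in place of the FFT):

```python
import numpy as np

rng = np.random.default_rng(0)
A = 0.5 * np.diag(rng.random(4))        # |eigenvalues| < 1 => stable
B = rng.standard_normal((4, 1))
C = rng.standard_normal((1, 4))
x = rng.standard_normal(16)

# Sequential recurrence h_k = A h_{k-1} + B x_k, y_k = C h_k:
h = np.zeros((4, 1))
y_rec = []
for xk in x:
    h = A @ h + B * xk
    y_rec.append((C @ h).item())

# Equivalent convolution with kernel K = (CB, CAB, CA^2B, ...):
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item()
              for k in range(16)])
y_conv = [K[:k + 1][::-1] @ x[:k + 1] for k in range(16)]
```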

HiPPO Initialization

$$A_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & \text{if } n > k \\ n+1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases}$$

Selective SSM (Mamba)

Makes parameters input-dependent for content-aware reasoning:

$$\mathbf{B}_k = s_B(\mathbf{x}_k), \quad \mathbf{C}_k = s_C(\mathbf{x}_k), \quad \Delta_k = \text{softplus}(s_\Delta(\mathbf{x}_k))$$
41

Hypernetworks

Networks that generate the weights of another network.

Meta-Learning 2016 — Ha, Dai, Le

Formulation

$$\boldsymbol{\theta} = h_\psi(\mathbf{z})$$ $$\hat{\mathbf{y}} = f_{\boldsymbol{\theta}}(\mathbf{x}) = f_{h_\psi(\mathbf{z})}(\mathbf{x})$$

The hypernetwork $h_\psi$ maps an embedding $\mathbf{z}$ (which can be task-specific, layer-specific, or input-dependent) to the parameters of the main network $f$.

Training

$$\mathcal{L}(\psi) = \mathbb{E}\!\left[\ell\!\left(f_{h_\psi(\mathbf{z})}(\mathbf{x}),\, \mathbf{y}\right)\right]$$ $$\nabla_\psi\mathcal{L} = \nabla_\theta\ell \cdot \frac{\partial h_\psi(\mathbf{z})}{\partial \psi}$$
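A minimal sketch: a linear hypernetwork generating the weight matrix of a one-layer main network from an embedding $\mathbf{z}$ (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_in, d_out = 3, 4, 2

# Hypernetwork: a linear map from embedding z to the main net's weights.
H = rng.standard_normal((d_in * d_out, d_z))

def main_net(x, z):
    theta = H @ z                    # generate weights from the embedding
    W = theta.reshape(d_out, d_in)
    return W @ x

x = rng.standard_normal(d_in)
y1 = main_net(x, np.array([1.0, 0.0, 0.0]))   # "task 1" weights
y2 = main_net(x, np.array([0.0, 1.0, 0.0]))   # "task 2" weights
```

Different embeddings yield different effective networks while only `H` holds trainable parameters.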
42

Neural Cellular Automata

Learned local update rules that produce global emergent behavior.

Self-Organizing Morphogenesis 2020 — Mordvintsev et al.

Cell State Update

$$\text{Perception:}\quad \mathbf{p}_i = [\text{Sobel}_x * \mathbf{s}_i;\; \text{Sobel}_y * \mathbf{s}_i;\; \mathbf{s}_i]$$ $$\text{Update:}\quad \Delta\mathbf{s}_i = f_\theta(\mathbf{p}_i)$$ $$\text{Stochastic mask:}\quad m_i \sim \text{Bernoulli}(p)$$ $$\mathbf{s}_i^{t+1} = \mathbf{s}_i^t + m_i \cdot \Delta\mathbf{s}_i$$

All cells share the same neural network $f_\theta$, and the stochastic update mask enforces asynchrony for robustness.

Training via Differentiable Simulation

$$\mathcal{L} = \mathbb{E}_{t\sim[t_{\min}, t_{\max}]}\!\left[\|\mathbf{S}^{(t)} - \mathbf{S}_{\text{target}}\|^2\right]$$

Gradients are backpropagated through time across the simulation steps.

43

Neural Turing Machine / Differentiable Neural Computer

Neural networks augmented with external differentiable memory — capable of learning algorithms.

Supervised Algorithmic Reasoning 2014 — Graves et al.

Architecture

A controller network (LSTM or MLP) interacts with an external memory matrix $\mathbf{M} \in \mathbb{R}^{N \times M}$ via differentiable read/write heads.

Addressing — Content-Based

$$w_t^c(i) = \frac{\exp(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(i)])}{\sum_j \exp(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(j)])}$$ $$K[\mathbf{u}, \mathbf{v}] = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|} \quad \text{(cosine similarity)}$$

Addressing — Location-Based

$$\mathbf{w}_t^g = g_t\,\mathbf{w}_t^c + (1-g_t)\,\mathbf{w}_{t-1} \quad \text{(interpolation)}$$ $$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j) \quad \text{(convolutional shift)}$$ $$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}} \quad \text{(sharpening)}$$

Read & Write

$$\text{Read:}\quad \mathbf{r}_t = \sum_i w_t^r(i)\,\mathbf{M}_t(i)$$ $$\text{Write:}\quad \mathbf{M}_t = \mathbf{M}_{t-1}\odot(\mathbf{1} - \mathbf{w}_t^w\,\mathbf{e}_t^\top) + \mathbf{w}_t^w\,\mathbf{a}_t^\top$$

$\mathbf{e}_t$ is the erase vector and $\mathbf{a}_t$ is the add vector.

44

Bayesian Neural Network

Placing probability distributions over weights for principled uncertainty quantification.

Probabilistic Uncertainty Estimation

Bayesian Inference over Weights

$$p(\boldsymbol{\theta}|\mathcal{D}) = \frac{p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{p(\mathcal{D})} = \frac{p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}}$$

Predictive Distribution

$$p(\mathbf{y}^*|\mathbf{x}^*, \mathcal{D}) = \int p(\mathbf{y}^*|\mathbf{x}^*, \boldsymbol{\theta})\,p(\boldsymbol{\theta}|\mathcal{D})\,d\boldsymbol{\theta}$$ $$\approx \frac{1}{S}\sum_{s=1}^{S} p(\mathbf{y}^*|\mathbf{x}^*, \boldsymbol{\theta}^{(s)}), \quad \boldsymbol{\theta}^{(s)} \sim p(\boldsymbol{\theta}|\mathcal{D})$$

Variational Inference (Bayes by Backprop)

Approximate the intractable posterior with $q_\phi(\boldsymbol{\theta})$:

$$\mathcal{L}_{\text{VI}} = \text{KL}(q_\phi(\boldsymbol{\theta})\,\|\,p(\boldsymbol{\theta})) - \mathbb{E}_{q_\phi}[\log p(\mathcal{D}|\boldsymbol{\theta})]$$

With the reparameterization trick: $\theta_i = \mu_i + \sigma_i\,\epsilon$, $\epsilon\sim\mathcal{N}(0,1)$.

MC Dropout as Approximate BNN

$$\text{Var}[\mathbf{y}^*] \approx \frac{1}{T}\sum_{t=1}^{T}\hat{\mathbf{y}}_t^2 - \left(\frac{1}{T}\sum_{t=1}^T \hat{\mathbf{y}}_t\right)^2$$

Running $T$ forward passes with dropout enabled at test time provides uncertainty estimates.

45

Liquid Neural Network

Continuous-time neural networks with input-dependent dynamics — inspired by C. elegans neuroscience.

Neuromorphic Time Series / Robotics 2021 — Hasani et al. (MIT)

Liquid Time-Constant (LTC) Neuron

$$\frac{d\mathbf{h}(t)}{dt} = -\left[\frac{1}{\tau} + f_\theta(\mathbf{h}(t), \mathbf{x}(t))\right]\odot\mathbf{h}(t) + f_\theta(\mathbf{h}(t), \mathbf{x}(t))\odot A$$

The key insight: the time constant $\tau$ is modulated by the input, making dynamics input-dependent.

Neural Circuit Policy

$$f_\theta(\mathbf{h}, \mathbf{x}) = \sigma\!\left(\mathbf{W}\,[\mathbf{h};\,\mathbf{x}] + \mathbf{b}\right)$$ $$\tau_{\text{eff}}(t) = \frac{\tau}{1 + \tau\,f_\theta(\mathbf{h}(t), \mathbf{x}(t))}$$

Closed-Form Continuous-Depth (CfC)

An analytical solution avoiding ODE solvers:

$$\mathbf{h}(t) = \left(\mathbf{h}_0 - f_\infty\right)\odot\exp\!\left(-\frac{t}{\tau_{\text{eff}}}\right) + f_\infty$$

Where $f_\infty = A\,\sigma(\mathbf{W}[\mathbf{h}_0;\,\mathbf{x}] + \mathbf{b})$ is the steady-state.

Properties

Liquid networks are remarkably compact (19 neurons can drive a car) and inherently interpretable due to their neuroscience-inspired wiring.

46

Mixture Density Network

Predicting full conditional probability distributions using a mixture of Gaussians.

Supervised Multi-Modal Regression 1994 — Bishop

Output Parameterization

A neural network outputs the parameters of a Gaussian mixture model:

$$p(\mathbf{y}|\mathbf{x}) = \sum_{k=1}^{K}\pi_k(\mathbf{x})\,\mathcal{N}\!\left(\mathbf{y};\, \boldsymbol{\mu}_k(\mathbf{x}),\, \sigma_k^2(\mathbf{x})\mathbf{I}\right)$$

Network Outputs

$$\boldsymbol{\pi}(\mathbf{x}) = \text{softmax}(\mathbf{z}_\pi), \quad \sum_k\pi_k = 1$$ $$\boldsymbol{\mu}_k(\mathbf{x}) = \mathbf{z}_{\mu_k} \quad \text{(unconstrained)}$$ $$\sigma_k(\mathbf{x}) = \exp(\mathbf{z}_{\sigma_k}) \quad \text{(positive)}$$

Loss (Negative Log-Likelihood)

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \log\sum_{k=1}^K \pi_k(\mathbf{x}_i)\,\mathcal{N}(\mathbf{y}_i;\, \boldsymbol{\mu}_k(\mathbf{x}_i), \sigma_k^2(\mathbf{x}_i))$$

MDNs can model one-to-many mappings (e.g., inverse kinematics, handwriting generation) where a single input maps to multiple valid outputs.

47

WaveNet

Autoregressive generative model with dilated causal convolutions for raw audio synthesis.

Generative Audio / Speech 2016 — van den Oord et al. (DeepMind)

Autoregressive Formulation

$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t | x_1, \dots, x_{t-1})$$

Dilated Causal Convolutions

Stack convolutions with exponentially increasing dilation rates to grow the receptive field efficiently:

$$(f *_d x)_t = \sum_{k=0}^{K-1} f_k \cdot x_{t - d \cdot k}$$ $$\text{Dilations:}\quad d = 1, 2, 4, 8, \dots, 512 \quad \text{(repeated)}$$ $$\text{Receptive field} = \text{blocks} \times \sum_{l=0}^{L-1} 2^l \times (K-1) + 1$$
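The dilated causal convolution itself is a few lines (the filter taps here are illustrative; with $f = (1, -1)$ and $d = 2$ it computes $x_t - x_{t-2}$):

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """y_t = sum_k f_k * x_{t - d*k}, zero-padded for t - d*k < 0,
    so no output ever depends on future samples."""
    y = np.zeros_like(x)
    for t in range(len(x)):
        y[t] = sum(fk * (x[t - d * k] if t - d * k >= 0 else 0.0)
                   for k, fk in enumerate(f))
    return y

x = np.arange(8, dtype=float)
y = dilated_causal_conv(x, f=[1.0, -1.0], d=2)   # x_t - x_{t-2}
```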

Gated Activation

$$\mathbf{z} = \tanh(\mathbf{W}_{f,k} * \mathbf{x}) \odot \sigma(\mathbf{W}_{g,k} * \mathbf{x})$$

Conditional WaveNet

$$\mathbf{z} = \tanh(\mathbf{W}_f * \mathbf{x} + \mathbf{V}_f * \mathbf{c}) \odot \sigma(\mathbf{W}_g * \mathbf{x} + \mathbf{V}_g * \mathbf{c})$$

Where $\mathbf{c}$ is a conditioning signal (e.g., mel spectrogram, speaker ID, linguistic features).

μ-Law Quantization

$$f(x_t) = \text{sign}(x_t)\frac{\ln(1 + \mu|x_t|)}{\ln(1+\mu)}, \quad \mu = 255$$

Compresses the 16-bit audio range into 256 values for categorical output via softmax.
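A plain-Python sketch of the companding round trip (the encoder maps $[-1, 1]$ onto $\{0, \dots, 255\}$; the linear rescaling convention is one common choice):

```python
import math

def mu_law_encode(x, mu=255):
    """Compand x in [-1, 1], then quantize to an integer in [0, mu]."""
    f = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((f + 1) / 2 * mu))

def mu_law_decode(q, mu=255):
    """Invert the quantization and the companding curve."""
    f = 2 * q / mu - 1
    return math.copysign(math.expm1(abs(f) * math.log1p(mu)) / mu, f)

q = mu_law_encode(0.1)
x_hat = mu_law_decode(q)   # close to 0.1, up to quantization error
```

The logarithmic curve spends most of the 256 levels near zero, where audio amplitudes concentrate.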

48

Large Language Model (LLM) Architecture LLM DEEP DIVE

How modern LLMs like GPT-4, Claude, LLaMA, Gemini, and Mistral combine the neural network building blocks documented above into a single coherent system.

LLM Core Self-Supervised + RLHF Language / Multimodal 2017–present

Component Map — What LLMs Use

Every building block below is documented in detail in the sections above. An LLM is fundamentally a decoder-only Transformer composed of these pieces:

Transformer (Decoder-Only)

The core architecture. Stacked blocks with causal self-attention, preventing tokens from attending to future positions.

→ Section 13: Transformer

Scaled Dot-Product Attention

The fundamental operation: $\text{softmax}(\mathbf{QK}^\top/\sqrt{d_k})\mathbf{V}$. Every token attends to all previous tokens.

→ Section 12: Attention Mechanism

Multi-Head / GQA

Parallel attention heads capture different relationship types. GQA shares KV heads to reduce memory 4–8×.

→ Section 13: Multi-Head Attention

RoPE Positional Encoding

Rotary embeddings encode relative position directly into Q/K vectors. Used by LLaMA, Mistral, Claude, Gemma.

→ Section 13: RoPE

Feed-Forward Network (MLP)

Two-layer MLP at each position: project up 4×, apply SiLU/GELU, project back down. The "memory" of the model.

→ Section 2: MLP, Section 13: FFN

SiLU / GELU Activation

Smooth activations used inside transformer FFNs. SiLU (Swish) in LLaMA/Mistral; GELU in GPT/BERT.

→ Section 3: Activation Functions

RMSNorm / LayerNorm

Normalizes activations for training stability. Modern LLMs prefer RMSNorm (no mean subtraction, faster).

→ Section 6: Regularization

Residual Connections

Skip connections around every attention and FFN sublayer. Essential for training 100+ layer models.

→ Section 35: Residual Networks

AdamW Optimizer

Adam with decoupled weight decay. The standard optimizer for all LLM pretraining.

→ Section 5: Optimizers

Backpropagation

Gradient computation through billions of parameters. Combined with gradient checkpointing for memory efficiency.

→ Section 4: Backpropagation

Dropout (Optional)

Used in GPT-2/3 training. Many modern LLMs (LLaMA, PaLM) omit dropout entirely, relying on data scale.

→ Section 6: Regularization

SSM / Mamba (Hybrid)

Some architectures (Jamba, Zamba) combine transformer blocks with Mamba SSM layers for linear-time long sequences.

→ Section 40: State Space Models

Tokenization

LLMs do not operate on raw characters. Text is first split into subword tokens using algorithms like BPE (Byte Pair Encoding), which iteratively merges the most frequent byte pairs:

$$\text{BPE: merge}(a, b) = ab \quad\text{where}\quad (a,b) = \arg\max_{(x,y)} \text{count}(xy)$$ $$|\mathcal{V}| \;\text{typically}\; 32{,}000 \;\text{to}\; 128{,}000 \;\text{tokens}$$

Each token is mapped to an integer ID, which is then looked up in the embedding table.
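A toy BPE trainer on a three-word corpus — repeatedly merge the most frequent adjacent pair (real tokenizers work on bytes and pre-tokenized word counts, which this sketch skips):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    words = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == (a, b):
                    w[i:i + 2] = [a + b]   # stay at i to allow chained merges
                else:
                    i += 1
    return merges, words

merges, segmented = bpe_train(["hug", "hugs", "bug"], num_merges=2)
```

Here `('u','g')` is merged first, then `('h','ug')`, so "hugs" segments into `['hug', 's']`.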

Token & Positional Embeddings

$$\mathbf{h}_0 = \mathbf{E}_{\text{tok}}[x_1, x_2, \dots, x_n] + \mathbf{E}_{\text{pos}}$$ $$\mathbf{E}_{\text{tok}} \in \mathbb{R}^{|\mathcal{V}| \times d}, \quad d \;\text{typically}\; 4096\;\text{to}\;16384$$

Modern LLMs typically use RoPE instead of learned positional embeddings, applied directly to Q and K vectors inside each attention layer rather than added to the input.

Weight Tying

Many LLMs share the token embedding matrix with the output projection (language model head):

$$\text{logits} = \mathbf{h}_L \cdot \mathbf{E}_{\text{tok}}^\top \in \mathbb{R}^{n \times |\mathcal{V}|}$$

The LLM Transformer Block (Full Equations)

A modern LLM (e.g. LLaMA-style) stacks $L$ identical blocks. Each block performs:

Step 1: Pre-Norm + Causal Multi-Head Attention + Residual

$$\mathbf{x}' = \text{RMSNorm}(\mathbf{h}^{(l)})$$ $$\mathbf{Q} = \mathbf{x}'\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{x}'\mathbf{W}_K, \quad \mathbf{V} = \mathbf{x}'\mathbf{W}_V$$ $$\mathbf{Q} = \text{RoPE}(\mathbf{Q}), \quad \mathbf{K} = \text{RoPE}(\mathbf{K})$$ $$\text{Attn} = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}} + \mathbf{M}_{\text{causal}}\right)\mathbf{V}$$ $$\mathbf{h}^{(l)}_{\text{mid}} = \mathbf{h}^{(l)} + \text{MultiHead}(\text{Attn})$$

Step 2: Pre-Norm + SwiGLU FFN + Residual

$$\mathbf{x}'' = \text{RMSNorm}(\mathbf{h}^{(l)}_{\text{mid}})$$ $$\text{SwiGLU}(\mathbf{x}'') = (\mathbf{x}''\mathbf{W}_1 \odot \text{SiLU}(\mathbf{x}''\mathbf{W}_{\text{gate}}))\,\mathbf{W}_2$$ $$\mathbf{h}^{(l+1)} = \mathbf{h}^{(l)}_{\text{mid}} + \text{SwiGLU}(\mathbf{x}'')$$

Where $\mathbf{W}_1, \mathbf{W}_{\text{gate}} \in \mathbb{R}^{d \times d_{\text{ff}}}$ and $\mathbf{W}_2 \in \mathbb{R}^{d_{\text{ff}} \times d}$, with $d_{\text{ff}} \approx \frac{8}{3}d$ (SwiGLU adjustment).

Final Output

$$\mathbf{h}_{\text{final}} = \text{RMSNorm}(\mathbf{h}^{(L)})$$ $$P(x_{n+1} | x_1, \dots, x_n) = \text{softmax}(\mathbf{h}_{\text{final}}[n]\,\mathbf{W}_{\text{head}})$$
FULL LLM FORWARD PASS

Tokens [x₁, x₂, …, xₙ]
      │
      ▼
Embedding        h₀ = E_tok[tokens]
      │
      ▼  ×L layers
╔═══════════════════════════════════╗
║  RMSNorm                          ║
║     ▼                             ║
║  Causal Multi-Head Attention      ║
║  + RoPE (with GQA) ◄─ KV Cache    ║
║     │ + residual                  ║
║     ▼                             ║
║  RMSNorm                          ║
║     ▼                             ║
║  SwiGLU FFN                       ║
║  (W₁ ⊙ SiLU(W_gate)) × W₂         ║
║     │ + residual                  ║
╚═════╪═════════════════════════════╝
      ▼
RMSNorm (final)
      ▼
LM Head          logits = h · W_head
      ▼
Softmax  →  P(next token)

KV Cache (Inference Optimization)

During autoregressive generation, previously computed key and value vectors are cached to avoid redundant computation:

$$\text{At step } t: \quad \mathbf{K}_{\text{cache}} = [\mathbf{k}_1, \mathbf{k}_2, \dots, \mathbf{k}_t], \quad \mathbf{V}_{\text{cache}} = [\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_t]$$ $$\text{Only compute:}\quad \mathbf{q}_t = \mathbf{x}_t\mathbf{W}_Q, \quad \mathbf{k}_t = \mathbf{x}_t\mathbf{W}_K, \quad \mathbf{v}_t = \mathbf{x}_t\mathbf{W}_V$$ $$\text{Attend:}\quad \mathbf{o}_t = \text{softmax}\!\left(\frac{\mathbf{q}_t \mathbf{K}_{\text{cache}}^\top}{\sqrt{d_k}}\right)\mathbf{V}_{\text{cache}}$$
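A sketch of cached decoding in NumPy — each step projects only the newest token and attends over the growing cache (single head, no RoPE, for clarity):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(x_t, Wq, Wk, Wv, cache):
    """One generation step: project only the newest token, append its
    K/V to the cache, and attend over the full cached history."""
    q = x_t @ Wq
    cache["K"].append(x_t @ Wk)
    cache["V"].append(x_t @ Wv)
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    attn = softmax(q @ K.T / np.sqrt(q.shape[-1]))
    return attn @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
cache = {"K": [], "V": []}
outs = [decode_step(rng.standard_normal(d), Wq, Wk, Wv, cache)
        for _ in range(5)]
```

Per step this costs $O(t \cdot d)$ instead of recomputing all $t$ keys and values from scratch.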

KV Cache Memory

$$\text{Memory} = 2 \times L \times n_{\text{kv\_heads}} \times d_k \times n_{\text{seq}} \times \text{bytes per param}$$

For a 70B model with 8K context in FP16: ~2–4 GB of KV cache per sequence.

PagedAttention (vLLM)

Manages KV cache as virtual memory pages to eliminate fragmentation and enable efficient batching of variable-length sequences.

LLM Training Pipeline

Phase 1: Pre-Training (Next Token Prediction)

$$\mathcal{L}_{\text{pretrain}} = -\sum_{t=1}^{T} \log P_\theta(x_t | x_1, \dots, x_{t-1})$$ $$= -\sum_{t=1}^{T}\log\frac{\exp(\mathbf{h}_t^\top \mathbf{e}_{x_t})}{\sum_{v\in\mathcal{V}}\exp(\mathbf{h}_t^\top \mathbf{e}_v)}$$

Trained on trillions of tokens from web text, books, code, etc.

Phase 2: Supervised Fine-Tuning (SFT)

$$\mathcal{L}_{\text{SFT}} = -\sum_{t \in \text{response}} \log P_\theta(x_t | \text{prompt}, x_1, \dots, x_{t-1})$$

Only the response tokens contribute to the loss; prompt tokens are masked.

Phase 3: RLHF (Reinforcement Learning from Human Feedback)

Step 3a: Train a reward model $r_\phi$ on human preference pairs $(y_w \succ y_l)$:

$$\mathcal{L}_{\text{reward}} = -\mathbb{E}_{(x, y_w, y_l)}\!\left[\log\sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

Step 3b: Optimize the policy with PPO, constrained by a KL penalty from the reference model $\pi_{\text{ref}}$:

$$\max_{\pi_\theta}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(y|x)}\!\left[r_\phi(x,y)\right] - \beta\,\text{KL}\!\left(\pi_\theta(y|x)\,\|\,\pi_{\text{ref}}(y|x)\right)$$

DPO (Direct Preference Optimization)

Bypasses the reward model entirely by reparameterizing the RLHF objective:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$
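The per-pair DPO loss in plain Python, with made-up log-probabilities showing that agreeing with the preference lowers the loss:

```python
import math

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """DPO on one preference pair: reward the policy for raising the
    chosen response's log-prob (relative to the reference) more than
    the rejected one's."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1 / (1 + math.exp(-margin)))   # -log sigmoid(margin)

# Policy already prefers the chosen answer -> smaller loss:
good = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_w=-6.0, ref_l=-6.0)
# Policy prefers the rejected answer -> larger loss:
bad = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_w=-6.0, ref_l=-6.0)
```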

Scaling Laws

Kaplan Scaling (OpenAI, 2020)

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

Loss follows power laws in model parameters $N$, dataset size $D$, and compute $C$.

Chinchilla Optimal (Hoffmann et al., 2022)

$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$ $$\text{Rule of thumb:}\quad D \approx 20 \times N$$

For a given compute budget, model size and data should be scaled equally — a 10B model needs ~200B tokens.
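Combining the $D \approx 20N$ rule with the common approximation $C \approx 6ND$ FLOPs (an assumption not stated above) gives a closed-form allocation:

```python
import math

def chinchilla_allocation(compute_flops):
    """Split a FLOP budget using C ~= 6*N*D together with D ~= 20*N,
    so C ~= 120*N^2. Both relations are rules of thumb, not exact laws."""
    n_params = math.sqrt(compute_flops / 120.0)
    n_tokens = 20.0 * n_params
    return n_params, n_tokens

# A ~1e23 FLOP training budget:
n, d = chinchilla_allocation(1e23)
print(f"{n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")  # 28.9B params, 577B tokens
```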

LLM Parameter Count

$$N \approx 12\,L\,d^2 \quad\text{(for standard transformer with } d_{\text{ff}} = 4d\text{)}$$
Model        | Layers $L$ | Dim $d$ | Heads $h$ | Params
GPT-2        | 48         | 1600    | 25        | 1.5B
LLaMA-2 7B   | 32         | 4096    | 32        | 6.7B
LLaMA-2 70B  | 80         | 8192    | 64        | 70B
GPT-4 (est.) | 120        | ~12288  | 96        | ~1.8T (MoE)
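The $12Ld^2$ estimate decomposes into $4d^2$ for the attention projections (Q, K, V, O) plus $8d^2$ for the FFN with $d_{\text{ff}} = 4d$, ignoring embeddings, biases, and norms:

```python
def approx_params(n_layers, d_model):
    """N ~= 12 * L * d^2: 4*d^2 attention (Q, K, V, O) + 8*d^2 FFN with d_ff = 4d.
    Embeddings, biases, and layer norms are neglected."""
    return 12 * n_layers * d_model**2

# LLaMA-2 7B configuration from the table above:
print(approx_params(32, 4096) / 1e9)  # ~6.4B, close to the actual 6.7B
```

The gap to the actual count comes mostly from the token embeddings and LLaMA's SwiGLU FFN, whose width differs slightly from $4d$.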

Mixture of Experts (MoE)

Replace the dense FFN with a sparse set of expert FFNs, routing each token to the top-$k$ experts:

$$\mathbf{g}(\mathbf{x}) = \text{softmax}(\mathbf{W}_g\,\mathbf{x}) \in \mathbb{R}^{E}$$ $$\text{TopK}(\mathbf{g}, k): \quad \mathcal{S} = \{i : g_i \text{ is in top-}k\}$$ $$\text{MoE}(\mathbf{x}) = \sum_{i \in \mathcal{S}} \frac{g_i(\mathbf{x})}{\sum_{j\in\mathcal{S}} g_j(\mathbf{x})}\cdot \text{FFN}_i(\mathbf{x})$$

Load Balancing Loss

$$\mathcal{L}_{\text{balance}} = \alpha\,E \sum_{i=1}^{E} f_i \cdot P_i$$ $$f_i = \frac{\text{tokens routed to expert } i}{\text{total tokens}}, \quad P_i = \frac{1}{T}\sum_{t=1}^T g_i(\mathbf{x}_t)$$

Encourages equal load across experts. Mixtral 8×7B uses $E=8$ experts with $k=2$, giving 47B total params but only ~13B active per token.
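A minimal top-$k$ routing sketch; the expert count, dimensions, and random linear "experts" are toy assumptions:

```python
import numpy as np

def moe_layer(x, W_g, experts, k=2):
    """Route x to the top-k experts and mix their outputs with renormalized gates."""
    logits = W_g @ x
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                          # softmax over E experts
    top_k = np.argsort(gates)[-k:]                # indices of the k largest gates
    weights = gates[top_k] / gates[top_k].sum()   # renormalize over the selected set
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
E, d = 8, 4  # Mixtral-style expert count; toy dimension
# Toy experts: random linear maps standing in for the expert FFNs.
experts = [lambda x, W=rng.standard_normal((d, d)): W @ x for _ in range(E)]
y = moe_layer(rng.standard_normal(d), rng.standard_normal((E, d)), experts, k=2)
print(y.shape)  # (4,)
```

Only $k$ of the $E$ expert FFNs run per token, which is why Mixtral's active parameter count is far below its total.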

Sampling & Decoding Strategies

Temperature Scaling

$$P(x_t = v) = \frac{\exp(z_v / \tau)}{\sum_{v'}\exp(z_{v'} / \tau)}$$

$\tau \to 0$: greedy (argmax). $\tau = 1$: standard softmax. $\tau > 1$: more random.

Top-$k$ Sampling

$$P'(v) = \begin{cases} P(v) / \sum_{v' \in V_k} P(v') & \text{if } v \in V_k \\ 0 & \text{otherwise} \end{cases}$$

Top-$p$ (Nucleus) Sampling

$$V_p = \min\left\{V' \subseteq \mathcal{V} : \sum_{v \in V'} P(v) \geq p\right\}$$

Min-$p$ Sampling

$$V_{\min p} = \{v : P(v) \geq p_{\min} \cdot \max_{v'}P(v')\}$$
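The truncation rules above compose into one filtering step: apply temperature, then zero out excluded tokens and renormalize. A sketch with toy logits:

```python
import numpy as np

def filter_logits(logits, temperature=1.0, top_k=None, top_p=None, min_p=None):
    """Temperature-scale, then keep only tokens passing top-k / top-p / min-p."""
    z = np.asarray(logits, dtype=float) / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    keep = np.ones_like(p, dtype=bool)
    if top_k is not None:  # keep only the k most probable tokens
        keep &= p >= np.sort(p)[-top_k]
    if top_p is not None:  # smallest set with cumulative mass >= p
        order = np.argsort(p)[::-1]
        cutoff = np.searchsorted(np.cumsum(p[order]), top_p) + 1
        mask = np.zeros_like(keep)
        mask[order[:cutoff]] = True
        keep &= mask
    if min_p is not None:  # threshold relative to the most probable token
        keep &= p >= min_p * p.max()
    p = np.where(keep, p, 0.0)
    return p / p.sum()

p = filter_logits([2.0, 1.0, 0.5, -1.0], top_k=2)
print(np.count_nonzero(p))  # 2 tokens survive top-k filtering
```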

Beam Search

$$\text{score}(\mathbf{y}_{1:t}) = \frac{1}{t^\alpha}\sum_{i=1}^t \log P(y_i | y_1, \dots, y_{i-1})$$

Maintains top-$B$ candidates at each step, with length normalization exponent $\alpha$.

Repetition Penalty

$$z'_v = \begin{cases} z_v / \theta & \text{if } v \in \text{generated tokens and } z_v > 0 \\ z_v \cdot \theta & \text{if } v \in \text{generated tokens and } z_v \leq 0 \end{cases}$$
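The case split can be applied directly to the logit vector; the values below are illustrative:

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, theta=1.2):
    """Shrink positive logits and amplify negative logits of already-seen tokens."""
    z = np.asarray(logits, dtype=float).copy()
    for v in set(generated_ids):
        z[v] = z[v] / theta if z[v] > 0 else z[v] * theta
    return z

# Tokens 0 and 1 were already generated; token 2 is untouched.
z = apply_repetition_penalty(np.array([2.4, -1.0, 0.5]), generated_ids=[0, 1])
print(z)  # token 0 shrinks toward 0, token 1 moves further negative
```

Dividing a positive logit and multiplying a negative one both lower the token's probability, so repeated tokens are penalized regardless of sign.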

LoRA & Parameter-Efficient Fine-Tuning

LoRA (Low-Rank Adaptation)

Freeze the pretrained weights and inject trainable low-rank decompositions:

$$\mathbf{W}' = \mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}$$ $$\mathbf{B} \in \mathbb{R}^{d \times r}, \quad \mathbf{A} \in \mathbb{R}^{r \times d}, \quad r \ll d$$ $$h = \mathbf{W}_0\mathbf{x} + \frac{\alpha}{r}\mathbf{B}\mathbf{A}\mathbf{x}$$

Typical $r = 8\text{–}64$, reducing trainable parameters by 1000× (e.g., 70B model → ~100M trainable params).
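The forward pass can be sketched in a few lines; the dimensions are toy and the zero-initialization of $\mathbf{B}$ follows the standard LoRA setup, so the adapter starts as a no-op:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16

W0 = rng.standard_normal((d, d))       # frozen pretrained weight (never updated)
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                   # trainable up-projection, initialized to zero

def lora_forward(x):
    """h = W0 x + (alpha/r) * B A x — only A and B would receive gradients."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
# With B = 0 the adapted model exactly matches the frozen base model:
print(np.allclose(lora_forward(x), W0 @ x))  # True
```

Here the adapter adds $2dr = 1024$ trainable parameters against $d^2 = 4096$ frozen ones; at LLM scale the same ratio yields the ~1000× reduction quoted above.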

QLoRA

Combines LoRA with 4-bit quantized base weights using NormalFloat4 (NF4) data type:

$$\mathbf{W}_{\text{NF4}} = \text{quantize}_{4\text{bit}}(\mathbf{W}_0)$$ $$h = \text{dequant}(\mathbf{W}_{\text{NF4}})\,\mathbf{x} + \frac{\alpha}{r}\mathbf{B}\mathbf{A}\mathbf{x}$$

Enables fine-tuning a 70B model on a single 48 GB GPU.

Other PEFT Methods

Method        | Approach                                                        | Trainable Params
Prefix Tuning | Learnable "virtual tokens" prepended to keys/values             | ~0.1%
Prompt Tuning | Learnable soft prompt embeddings                                | ~0.01%
Adapters      | Small bottleneck layers inserted between transformer sublayers  | ~1–3%
IA³           | Learned vectors that rescale keys, values, and FFN activations  | ~0.01%

Long Context Techniques

RoPE Frequency Scaling

$$\theta_i' = \theta_i \cdot s^{-1} = \frac{10000^{-2i/d}}{s} \quad\text{(linear scaling, factor } s \text{)}$$

Extending a 4K model to 32K context uses $s = 8$.
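A sketch of the scaled frequency schedule; the head dimension is an illustrative choice:

```python
import numpy as np

def rope_freqs(d, base=10000.0, scale=1.0):
    """theta_i' = base^(-2i/d) / s: dividing every frequency by s stretches each
    rotation's wavelength s-fold, so new positions map into the trained range."""
    i = np.arange(d // 2)
    return base ** (-2.0 * i / d) / scale

theta = rope_freqs(128)                    # original schedule
theta_scaled = rope_freqs(128, scale=8.0)  # s = 8: 4K -> 32K extension
# Rotation angles at the new max position equal the original angles at the old max:
print(np.allclose(32768 * theta_scaled, 4096 * theta))  # True
```

This is why linear scaling is also called position interpolation: position $32768$ under the scaled schedule looks to the model like position $4096$ did during training.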

YaRN (Yet another RoPE extensioN)

$$\theta_i' = \begin{cases} \theta_i & \text{if } \lambda_i < \lambda_{\min} \;\text{(high freq, no change)} \\ \theta_i / s & \text{if } \lambda_i > \lambda_{\max} \;\text{(low freq, full scale)} \\ (1-\gamma)\theta_i + \gamma\,\theta_i/s & \text{otherwise (interpolate)} \end{cases}$$

Flash Attention

IO-aware exact attention that avoids materializing the $n \times n$ attention matrix:

$$\text{Standard:}\quad O(n^2) \text{ memory}, \quad \text{Flash:}\quad O(n) \text{ memory}$$ $$\text{Compute:}\quad O(n^2 d) \;\text{(same)} \quad\text{but}\;\sim 2\text{–}4\times \text{faster via HBM reduction}$$

Uses online softmax and tiling to keep intermediate results in SRAM, avoiding slow HBM reads/writes.

Ring Attention

Distributes sequence across devices in a ring topology, overlapping communication with computation for near-infinite context:

$$\text{Effective context} = n_{\text{devices}} \times n_{\text{per\_device}}$$

Sliding Window Attention

$$\text{Attn}(i, j) = \begin{cases} \text{softmax}(\mathbf{q}_i\mathbf{k}_j^\top/\sqrt{d_k}) & \text{if } |i-j| \leq w \\ 0 & \text{otherwise} \end{cases}$$

Used in Mistral. With $L$ layers and window $w$, effective receptive field is $L \times w$ tokens.
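The window condition above is easy to materialize as a boolean mask; note that a causal decoder like Mistral additionally requires $j \leq i$:

```python
import numpy as np

def sliding_window_mask(n, w):
    """True where |i - j| <= w, matching the formula above.
    A causal decoder would additionally mask out j > i."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

mask = sliding_window_mask(6, w=2)
print(mask.sum(axis=1))  # [3 4 5 5 4 3]: each position sees at most 2w + 1 tokens
```

Although each layer only looks $w$ tokens back, information propagates one window per layer, giving the $L \times w$ effective receptive field.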

Neural Network Encyclopedia — 48 architectures — Generated March 2026

Covers: Perceptron, MLP, CNN, U-Net, RNN, LSTM, GRU, xLSTM, Bidirectional RNN, Attention, Transformer, BERT, Seq2Seq, ViT, RWKV, Autoencoder, VAE, GAN, Diffusion, Normalizing Flows, Energy-Based, Siamese/Contrastive (SimCLR, CLIP), JEPA, GNN, Capsule, Hopfield, Boltzmann, RBM, RBF, SOM, ResNet, Neural ODE, Echo State, Spiking NN, KAN, SSM/Mamba, Hypernetworks, Neural Cellular Automata, Neural Turing Machine, Bayesian NN, Liquid NN, Mixture Density, WaveNet + LLM Architecture Deep Dive