Perceptron
The simplest neural network — a single linear classifier.
Forward Pass
Given input vector $\mathbf{x} \in \mathbb{R}^n$, weight vector $\mathbf{w} \in \mathbb{R}^n$, and bias $b$:
Learning Rule
For a training sample $(\mathbf{x}, y)$ with learning rate $\eta$:
Convergence Theorem
If the training data is linearly separable with margin $\gamma = \min_i \frac{y_i(\mathbf{w}^{*\top}\mathbf{x}_i)}{\|\mathbf{w}^*\|}$, the perceptron converges in at most $\left(\frac{R}{\gamma}\right)^2$ updates, where $R = \max_i \|\mathbf{x}_i\|$.
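The learning rule and convergence behavior can be sketched in a few lines of plain Python. This is an illustrative toy (the AND function with ±1 labels), not code from the text above:

```python
# Perceptron sketch: update w <- w + eta * y * x whenever a sample is
# misclassified; stop once an epoch passes with no mistakes.

def perceptron_train(samples, eta=1.0, epochs=100):
    """samples: list of (x, y) with x a tuple of floats and y in {-1, +1}."""
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:          # misclassified (or on boundary)
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                mistakes += 1
        if mistakes == 0:                    # converged on separable data
            break
    return w, b

# Linearly separable toy data: logical AND with +-1 labels.
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]
w, b = perceptron_train(data)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
               for x, _ in data]
```

On separable data like this, the convergence theorem guarantees the loop terminates after finitely many updates.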
Multi-Layer Perceptron (MLP)
Feedforward network with one or more hidden layers — a universal function approximator.
Architecture
An MLP with $L$ layers maps input $\mathbf{x}$ through a series of affine transformations and nonlinearities:
Where $f$ is a hidden activation (e.g. ReLU) and $g$ is the output activation (e.g. softmax for classification, identity for regression).
Universal Approximation Theorem
A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$, given a non-polynomial activation function.
Loss Functions
Mean Squared Error (Regression)
Cross-Entropy (Classification)
Activation Functions
Nonlinearities that give neural networks their expressive power.
| Name | Formula $f(z)$ | Derivative $f'(z)$ |
|---|---|---|
| Sigmoid | $\frac{1}{1+e^{-z}}$ | $f(z)(1-f(z))$ |
| Tanh | $\frac{e^z - e^{-z}}{e^z + e^{-z}}$ | $1 - f(z)^2$ |
| ReLU | $\max(0, z)$ | $\begin{cases}1 & z>0\\0 & z\leq 0\end{cases}$ |
| Leaky ReLU | $\max(\alpha z, z)$ | $\begin{cases}1 & z>0\\\alpha & z\leq 0\end{cases}$ |
| ELU | $\begin{cases}z & z>0\\\alpha(e^z-1) & z\leq 0\end{cases}$ | $\begin{cases}1 & z>0\\f(z)+\alpha & z\leq 0\end{cases}$ |
| GELU | $z \cdot \Phi(z)$ | $\Phi(z) + z\,\phi(z)$ |
| Swish / SiLU | $z \cdot \sigma(z)$ | $f(z) + \sigma(z)(1 - f(z))$ |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $f_i(\delta_{ij} - f_j)$ |
| Mish | $z \cdot \tanh(\ln(1+e^z))$ | See chain rule expansion |
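A few rows of the table above can be sketched directly; the exact GELU is expressed via the Gaussian CDF using `math.erf`, and softmax subtracts the max for numerical stability (a standard trick, not stated in the table):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def softmax(zs):
    m = max(zs)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]
```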
GELU (Gaussian Error Linear Unit)
Backpropagation
The chain rule applied layer-by-layer to compute gradients efficiently.
Chain Rule (Vector Form)
For loss $\mathcal{L}$ with respect to parameters in layer $l$:
Parameter Gradients
Computational Complexity
For a network with $L$ layers and $n$ neurons per layer, backpropagation has $O(Ln^2)$ time complexity — the same as the forward pass — making it highly efficient compared to numerical differentiation.
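The chain rule computation can be checked against finite differences on a toy one-hidden-unit network (an illustrative example, not the general layer-wise algorithm):

```python
import math

# y_hat = w2 * tanh(w1 * x), loss = (y_hat - y)^2 / 2.
# Analytic chain-rule gradients vs. central finite differences.

def loss(w1, w2, x, y):
    return 0.5 * (w2 * math.tanh(w1 * x) - y) ** 2

def grads(w1, w2, x, y):
    h = math.tanh(w1 * x)
    err = w2 * h - y                  # dL/dy_hat
    dw2 = err * h
    dw1 = err * w2 * (1 - h * h) * x  # chain rule through tanh
    return dw1, dw2

w1, w2, x, y = 0.3, -0.7, 1.5, 0.2
dw1, dw2 = grads(w1, w2, x, y)
eps = 1e-6
num_dw1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
num_dw2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
```

This also illustrates why numerical differentiation is impractical at scale: it needs two forward passes per parameter, versus one backward pass for all of them.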
Optimization Algorithms
Methods for traversing the loss landscape to find good minima.
Stochastic Gradient Descent (SGD)
SGD with Momentum
Nesterov Accelerated Gradient
AdaGrad
RMSProp
Adam
AdamW (Decoupled Weight Decay)
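The Adam update (bias-corrected first and second moments) can be sketched on a scalar objective; hyperparameters here are the common defaults, and the quadratic objective is made up for illustration:

```python
import math

# Adam sketch: m tracks the gradient mean (momentum), v the squared
# gradients (RMS); both are bias-corrected before the step.

def adam_minimize(grad, x0, lr=0.05, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
```

AdamW differs only in applying weight decay directly to the parameters instead of folding it into the gradient.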
Regularization Techniques
Methods to prevent overfitting and improve generalization.
L1 Regularization (Lasso)
L2 Regularization (Ridge / Weight Decay)
Dropout
During training, each neuron is independently set to zero with probability $p$:
Batch Normalization
Layer Normalization
RMSNorm
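RMSNorm divides by the root-mean-square of the activations with no mean subtraction. A minimal sketch (learnable gain fixed to 1 for simplicity):

```python
import math

def rms_norm(x, eps=1e-8):
    # Scale so the output has unit RMS; no centering, unlike LayerNorm.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

out = rms_norm([1.0, 2.0, 3.0, 4.0])
```

After normalization the sum of squares equals the vector length (unit RMS), and relative proportions between components are preserved.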
Convolutional Neural Network (CNN)
Networks exploiting spatial structure through shared local filters.
2D Convolution
For input $\mathbf{X} \in \mathbb{R}^{C_{in} \times H \times W}$ and filter $\mathbf{K} \in \mathbb{R}^{C_{in} \times k \times k}$:
Output Dimensions
Where $p$ is padding, $s$ is stride, $k$ is kernel size.
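Assuming the standard output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$ per spatial dimension, a small helper makes the bookkeeping concrete (the ResNet-stem example is illustrative):

```python
def conv_out(n, k, p=0, s=1):
    # floor((n + 2p - k) / s) + 1, the assumed standard conv output size
    return (n + 2 * p - k) // s + 1

# e.g. a 224x224 input, 7x7 kernel, stride 2, padding 3 (ResNet-style stem)
h = conv_out(224, k=7, p=3, s=2)
```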
Depthwise Separable Convolution
Factorizes a standard convolution into a depthwise and pointwise step:
Dilated (Atrous) Convolution
Effective receptive field: $k + (k-1)(d-1)$, where $d$ is the dilation rate.
Pooling Operations
Transposed Convolution
Used for upsampling. Equivalent to convolving with fractional strides or padding the input:
U-Net
Encoder-decoder with skip connections for dense prediction tasks. Long the standard backbone of diffusion models such as DDPM and Stable Diffusion.

Architecture
Symmetric encoder (contracting) and decoder (expanding) path with skip connections concatenating encoder features to decoder features at each resolution:
Where $[\cdot;\cdot]$ denotes channel-wise concatenation (skip connection).
ConvBlock
U-Net in Diffusion Models
In DDPM / Stable Diffusion, the U-Net is conditioned on timestep $t$ and optional conditioning $c$:
Time embeddings are injected via addition/FiLM layers; text conditioning via cross-attention at each resolution level.
Recurrent Neural Network (Vanilla RNN)
Networks with temporal memory via recurrent connections.
Hidden State Dynamics
Backpropagation Through Time (BPTT)
Vanishing/Exploding Gradient Problem
The product of Jacobians $\prod_j \frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}$ can shrink or grow exponentially:
Where $\gamma = \max |f'(z)|$. If $\|\mathbf{W}_{hh}\| \cdot \gamma < 1$, gradients vanish; if $> 1$, they explode.
Long Short-Term Memory (LSTM)
Gated RNN architecture solving the vanishing gradient problem.
Gate Equations
Gradient Flow through Cell State
The cell state provides a highway for gradients:
When $\mathbf{f}_t \approx 1$, gradients flow unattenuated over many timesteps.
Parameter Count
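Assuming the standard LSTM layout (four gate blocks, each with an input-to-hidden matrix, a hidden-to-hidden matrix, and a bias), the parameter count is $4\,(d_{in} d_h + d_h^2 + d_h)$; a quick helper:

```python
def lstm_params(d_in, d_h):
    # 4 gate blocks: input weights + recurrent weights + bias each
    return 4 * (d_in * d_h + d_h * d_h + d_h)

p = lstm_params(d_in=256, d_h=512)
```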
Gated Recurrent Unit (GRU)
A simplified gating mechanism merging cell and hidden state.
Gate Equations
Parameter Count
GRU uses roughly 25% fewer recurrent parameters than an LSTM of the same hidden size (three weight blocks: reset gate, update gate, and candidate state, versus the LSTM's four).
Extended LSTM (xLSTM)
Modernized LSTM with exponential gating and matrix memory for LLM-scale performance.
sLSTM (Scalar Memory)
Extends LSTM with exponential gating and a normalizer state for numerical stability:
mLSTM (Matrix Memory)
Replaces the scalar cell state with a matrix $\mathbf{C}_t \in \mathbb{R}^{d \times d}$, enabling key-value storage:
mLSTM is fully parallelizable (no hidden-to-hidden recurrence) and can be viewed as a linearized self-attention with a decay factor.
Bidirectional RNN
Processing sequences in both forward and backward directions.
Architecture
Attention Mechanism
Learning to focus on relevant parts of the input.
Additive (Bahdanau) Attention
Multiplicative (Luong) Attention
Scaled Dot-Product Attention
The $\sqrt{d_k}$ scaling prevents softmax saturation when dot products grow large.
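A batchless, single-head sketch of scaled dot-product attention (using NumPy; shapes and seed are made up for illustration):

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
```

Each row of `w` is a probability distribution over keys; the output is the corresponding convex combination of value vectors.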
Transformer
Attention-only architecture that revolutionized NLP and beyond.
Multi-Head Attention
Where $\mathbf{W}_i^Q, \mathbf{W}_i^K \in \mathbb{R}^{d \times d_k}$, $\mathbf{W}_i^V \in \mathbb{R}^{d \times d_v}$, $\mathbf{W}^O \in \mathbb{R}^{hd_v \times d}$.
Sinusoidal Positional Encoding
Rotary Positional Embedding (RoPE)
Feed-Forward Network (per position)
Encoder Block
Decoder Block (with causal mask)
The causal mask $\mathbf{M}$ sets future positions to $-\infty$ before softmax:
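The masking step can be sketched directly: positions $j > i$ get $-\infty$ before the softmax, so their weights become exactly zero (NumPy sketch, uniform scores chosen to make the result easy to read):

```python
import numpy as np

def causal_mask(n):
    # Upper-triangular (strictly above diagonal) set to -inf
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf
    return mask

def masked_softmax(scores):
    scores = scores + causal_mask(scores.shape[0])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                # exp(-inf) = 0: future positions zeroed
    return w / w.sum(axis=-1, keepdims=True)

w = masked_softmax(np.ones((3, 3)))
```

Token 0 attends only to itself, token 1 splits mass over positions 0–1, and so on.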
Grouped-Query Attention (GQA)
Shares key-value heads across groups of query heads to reduce memory:
Computational Complexity
BERT (Encoder-Only Transformer)
Bidirectional encoder pre-trained with masked language modeling — the foundation for NLU tasks.
Masked Language Modeling (MLM)
Randomly mask 15% of input tokens and predict the originals:
Of the 15% selected: 80% are replaced with [MASK], 10% with a random token, 10% unchanged.
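The 80/10/10 corruption rule for a selected token can be sketched as follows (the token ids, [MASK] id, and vocabulary size here are made up for illustration):

```python
import random

MASK_ID = 103          # hypothetical [MASK] token id
VOCAB_SIZE = 30522     # hypothetical vocabulary size

def corrupt(token_id, rng):
    r = rng.random()
    if r < 0.8:
        return MASK_ID                       # 80%: replace with [MASK]
    elif r < 0.9:
        return rng.randrange(VOCAB_SIZE)     # 10%: random token
    return token_id                          # 10%: keep unchanged

rng = random.Random(0)
outs = [corrupt(42, rng) for _ in range(10000)]
frac_mask = sum(o == MASK_ID for o in outs) / len(outs)
```

Keeping 10% unchanged forces the model to produce useful representations for every input token, not just [MASK] positions.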
Next Sentence Prediction (NSP)
Input Representation
Segment embeddings distinguish sentence A from B. The [CLS] token representation is used for classification tasks.
Fine-Tuning
| Model | Layers | Hidden | Heads | Params |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
| RoBERTa | 24 | 1024 | 16 | 355M |
Sequence-to-Sequence (Encoder-Decoder)
Mapping variable-length input sequences to variable-length output sequences.
RNN-Based Seq2Seq
Seq2Seq with Attention
Transformer Encoder-Decoder (T5 / BART)
Teacher Forcing
Vision Transformer (ViT)
Applying the transformer architecture directly to image patches.
Patch Embedding
An image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ is split into $N$ patches of size $P \times P$:
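The patch count follows directly: $N = (H/P)\,(W/P)$. A quick check against the variants table below:

```python
def num_patches(H, W, P):
    # Image must tile evenly into P x P patches
    assert H % P == 0 and W % P == 0
    return (H // P) * (W // P)

n = num_patches(224, 224, 16)   # ViT-B/16 on a 224x224 image
```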
CLS Token
Full Forward Pass
Variants
| Model | Patch Size | Layers | Hidden | Heads | Params |
|---|---|---|---|---|---|
| ViT-B/16 | 16 | 12 | 768 | 12 | 86M |
| ViT-L/16 | 16 | 24 | 1024 | 16 | 307M |
| ViT-H/14 | 14 | 32 | 1280 | 16 | 632M |
RWKV (Receptance Weighted Key Value)
Linear-complexity RNN that matches transformer quality — trainable like a transformer, runs like an RNN.
Time Mixing (Attention Replacement)
WKV Mechanism (Linear Attention)
Where $w$ is a learned decay vector and $u$ is a learned bonus for the current token. This can be computed recurrently in $O(1)$ per step.
Channel Mixing (FFN Replacement)
Complexity
Autoencoder
Learning compressed representations via reconstruction.
Architecture
Denoising Autoencoder
Sparse Autoencoder
Variational Autoencoder (VAE)
Probabilistic generative model with a learned latent space.
Generative Model
Evidence Lower Bound (ELBO)
Reparameterization Trick
KL Divergence (Closed Form for Gaussians)
Loss Function
Generative Adversarial Network (GAN)
Two networks competing in a minimax game to generate realistic data.
Minimax Objective
Optimal Discriminator
Global Optimum
At the Nash equilibrium, $p_g = p_{\text{data}}$ and $D^*(\mathbf{x}) = \frac{1}{2}$:
Wasserstein GAN (WGAN)
The critic $f$ (replacing the discriminator) is enforced to be 1-Lipschitz via gradient penalty:
Diffusion Models (DDPM)
Generating data by learning to reverse a gradual noising process.
Forward Process (Diffusion)
Where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$.
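These definitions give the closed-form marginal $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\,(1-\bar{\alpha}_t)\mathbf{I})$, which lets us jump to any timestep in one shot. A sketch with a linear $\beta$ schedule (the schedule endpoints are the common DDPM defaults):

```python
import math

def linear_betas(T, beta_1=1e-4, beta_T=0.02):
    return [beta_1 + (beta_T - beta_1) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    # Running product of alpha_t = 1 - beta_t
    abars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abars.append(prod)
    return abars

betas = linear_betas(1000)
abars = alpha_bars(betas)

def q_sample(x0, t, noise):
    # One-shot sample from q(x_t | x_0), scalar version
    return math.sqrt(abars[t]) * x0 + math.sqrt(1.0 - abars[t]) * noise
```

By the final timestep $\bar{\alpha}_T$ is nearly zero, so $\mathbf{x}_T$ is essentially pure noise.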
Reverse Process
Training Objective (Simplified)
Score-Based Formulation
Classifier-Free Guidance
Normalizing Flows
Exact likelihood models using invertible transformations.
Change of Variables
Composition of Flows
Coupling Layer (RealNVP)
The Jacobian is triangular, so $\det = \prod \exp(s_i) = \exp(\sum s_i)$, computed in $O(D)$.
Energy-Based Models
Defining probability distributions via scalar energy functions.
Energy Function
Score Matching
Contrastive Divergence
Where $\tilde{\mathbf{x}}$ is obtained from a few steps of MCMC starting from data.
Siamese Networks & Contrastive Learning
Learning representations by comparing pairs or groups of inputs — foundation of CLIP, SimCLR, and modern self-supervised vision.
Siamese Network
Two identical networks sharing weights process two inputs and compare their embeddings:
Contrastive Loss
Where $y=0$ labels similar pairs and $y=1$ dissimilar pairs (one common convention; some references swap the labels), and $m$ is the margin.
Triplet Loss
NT-Xent Loss (SimCLR)
Normalized temperature-scaled cross-entropy over a batch of $2N$ augmented pairs:
CLIP (Contrastive Language-Image Pre-training)
Aligns image and text embeddings using a symmetric contrastive loss over a batch of $N$ image-text pairs:
BYOL / SimSiam (No Negatives)
Where $\text{sg}(\cdot)$ is stop-gradient, $\mathbf{z}_2'$ comes from an EMA target encoder, and $q_\theta$ is a predictor MLP.
JEPA (Joint Embedding Predictive Architecture)
Yann LeCun's proposed path to human-level AI — predicting in latent space rather than pixel space.
Core Principle
Unlike generative models (which predict pixels) or contrastive models (which compare positive/negative pairs), JEPA predicts the representation of a target from a context — entirely in embedding space:
Architecture
Loss Function
Where $\text{sg}(\cdot)$ is stop-gradient. The target encoder $\bar{f}$ is updated via exponential moving average (EMA):
I-JEPA (Image JEPA)
The context encoder sees a partial view of the image (with masked patches), and the predictor must predict target block representations in latent space:
V-JEPA (Video JEPA)
Extends to video by masking spacetime tubes and predicting their latent representations:
JEPA vs Other Paradigms
| Method | Prediction Space | Negatives? | Collapse Prevention |
|---|---|---|---|
| Autoencoder | Pixel / Input | No | Bottleneck |
| Contrastive (SimCLR) | Latent (similarity) | Yes | Negative pairs |
| BYOL / SimSiam | Latent | No | EMA + stop-gradient |
| JEPA | Latent (prediction) | No | EMA + stop-gradient + masking |
Graph Neural Networks
Neural networks operating on graph-structured data.
Message Passing Framework
Graph Convolutional Network (GCN)
Where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ (adjacency with self-loops), $\tilde{\mathbf{D}}_{ii} = \sum_j \tilde{A}_{ij}$.
Graph Attention Network (GAT)
GraphSAGE
Graph Readout
Capsule Networks
Encoding part-whole relationships with vector-valued capsules.
Squash Function
Dynamic Routing
Margin Loss
Hopfield Network
Associative memory via energy minimization in a fully connected network.
Energy Function
Hebbian Learning (Storage)
Update Rule (Asynchronous)
Storage Capacity
Modern Hopfield Network (2020)
This update rule is equivalent to the attention mechanism in transformers.
Boltzmann Machine
Stochastic neural network based on statistical mechanics.
Energy Function
Probability Distribution
Stochastic Update
Restricted Boltzmann Machine (RBM)
A bipartite Boltzmann machine enabling efficient training via Gibbs sampling.
Energy
Conditional Distributions
Contrastive Divergence (CD-k)
Free Energy
Radial Basis Function Network
Using radial basis functions as activation in a single hidden layer.
Architecture
Training
Typically a two-phase process: (1) find centers $\boldsymbol{\mu}_j$ via k-means clustering; (2) solve for weights $\mathbf{w}$ via least squares:
Where $\Phi_{ij} = \phi_j(\mathbf{x}_i)$ is the interpolation matrix.
Self-Organizing Map (SOM)
Unsupervised learning that maps high-dimensional data to a low-dimensional grid preserving topology.
Best Matching Unit (BMU)
Weight Update
Neighborhood Function
Both $\eta(t)$ and $\sigma(t)$ decrease monotonically over training.
Residual Networks (ResNet)
Skip connections enabling training of very deep networks.
Residual Block
The network learns the residual $\mathcal{F}(\mathbf{x}) = \mathbf{y} - \mathbf{x}$ rather than the full mapping.
Bottleneck Block
$\mathbf{W}_1$ reduces channels (1×1), $\mathbf{W}_2$ is 3×3 conv, $\mathbf{W}_3$ expands channels (1×1).
Gradient Flow
The "1 +" term ensures gradients can flow directly to any layer without attenuation.
Pre-Activation ResNet
Neural Ordinary Differential Equations
Continuous-depth networks defined by differential equations.
Continuous Dynamics
Adjoint Method (Memory-Efficient Backprop)
Memory cost is $O(1)$ regardless of depth, since states are recomputed during the backward ODE solve.
Connection to ResNets
Echo State Network (Reservoir Computing)
A fixed random recurrent reservoir with only output weights trained.
Reservoir Dynamics
$\mathbf{W}_{\text{res}}$ and $\mathbf{W}_{\text{in}}$ are random and fixed. $\alpha$ is the leaking rate.
Output (Readout)
Echo State Property
The reservoir must satisfy the echo state property: the effect of initial conditions fades over time. The common heuristic is to scale the reservoir so the spectral radius $\rho(\mathbf{W}_{\text{res}}) < 1$ (strictly speaking this is neither necessary nor sufficient in general, but it works well in practice).
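The usual recipe is to draw a random reservoir and rescale it to a target spectral radius just below 1 (NumPy sketch; size and seed are arbitrary):

```python
import numpy as np

def make_reservoir(n, target_rho=0.9, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, n))
    rho = max(abs(np.linalg.eigvals(W)))   # current spectral radius
    return W * (target_rho / rho)          # eigenvalues scale linearly

W = make_reservoir(100)
rho = max(abs(np.linalg.eigvals(W)))
```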
Spiking Neural Network
Biologically plausible networks where neurons communicate via discrete spikes.
Leaky Integrate-and-Fire (LIF) Model
Discrete LIF
Where $\beta = \exp(-\Delta t / \tau_m)$ is the decay factor and $\Theta$ is the Heaviside step function.
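A scalar sketch of the discrete LIF update: the membrane potential decays by $\beta$, integrates the input current, and resets to zero after a spike (constant input and threshold chosen for illustration):

```python
def lif_run(inputs, beta=0.9, threshold=1.0):
    v, spikes = 0.0, []
    for i in inputs:
        v = beta * v + i                 # leak + integrate
        if v >= threshold:               # Heaviside: fire, then reset
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return spikes

spikes = lif_run([0.5] * 10)
```

With this constant drive the neuron settles into a regular firing pattern: it takes three steps to accumulate past threshold, fires, and repeats.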
Surrogate Gradient
Since $\Theta'(x) = \delta(x)$ is not useful for backprop, replace with a smooth surrogate:
Spike-Timing-Dependent Plasticity (STDP)
Kolmogorov-Arnold Network (KAN)
Learnable activation functions on edges, based on the Kolmogorov-Arnold representation theorem.
Kolmogorov-Arnold Representation Theorem
KAN Layer
Each edge $(i, j)$ has a learnable univariate function $\phi_{ij}$, parameterized by B-splines:
Layer Computation
Compared to MLPs which have fixed activations on nodes and learnable linear weights on edges, KANs have learnable nonlinear functions on edges and summation on nodes.
State Space Models (S4 / Mamba)
Sequence models based on continuous-time state space representations with efficient linear-time computation.
Continuous State Space
Discretization (Zero-Order Hold)
Discrete Recurrence
Convolution Form
Computed in $O(L \log L)$ via FFT during training.
HiPPO Initialization
Selective SSM (Mamba)
Makes parameters input-dependent for content-aware reasoning:
Hypernetworks
Networks that generate the weights of another network.
Formulation
The hypernetwork $h_\psi$ maps an embedding $\mathbf{z}$ (which can be task-specific, layer-specific, or input-dependent) to the parameters of the main network $f$.
Training
Neural Cellular Automata
Learned local update rules that produce global emergent behavior.
Cell State Update
All cells share the same neural network $f_\theta$, and the stochastic update mask enforces asynchrony for robustness.
Training via Differentiable Simulation
Gradients are backpropagated through time across the simulation steps.
Neural Turing Machine / Differentiable Neural Computer
Neural networks augmented with external differentiable memory — capable of learning algorithms.
Architecture
A controller network (LSTM or MLP) interacts with an external memory matrix $\mathbf{M} \in \mathbb{R}^{N \times M}$ via differentiable read/write heads:
Addressing — Content-Based
Addressing — Location-Based
Read & Write
$\mathbf{e}_t$ is the erase vector and $\mathbf{a}_t$ is the add vector.
Bayesian Neural Network
Placing probability distributions over weights for principled uncertainty quantification.
Bayesian Inference over Weights
Predictive Distribution
Variational Inference (Bayes by Backprop)
Approximate the intractable posterior with $q_\phi(\boldsymbol{\theta})$:
With the reparameterization trick: $\theta_i = \mu_i + \sigma_i\,\epsilon$, $\epsilon\sim\mathcal{N}(0,1)$.
MC Dropout as Approximate BNN
Running $T$ forward passes with dropout enabled at test time provides uncertainty estimates.
Liquid Neural Network
Continuous-time neural networks with input-dependent dynamics — inspired by C. elegans neuroscience.
Liquid Time-Constant (LTC) Neuron
The key insight: the time constant $\tau$ is modulated by the input, making dynamics input-dependent.
Neural Circuit Policy
Closed-Form Continuous-Depth (CfC)
An analytical solution avoiding ODE solvers:
Where $f_\infty = A\,\sigma(\mathbf{W}[\mathbf{h}_0;\,\mathbf{x}] + \mathbf{b})$ is the steady-state.
Properties
Liquid networks are remarkably compact (an NCP with 19 control neurons has steered a car in lane-keeping experiments) and comparatively interpretable thanks to their neuroscience-inspired wiring.
Mixture Density Network
Predicting full conditional probability distributions using a mixture of Gaussians.
Output Parameterization
A neural network outputs the parameters of a Gaussian mixture model:
Network Outputs
Loss (Negative Log-Likelihood)
MDNs can model one-to-many mappings (e.g., inverse kinematics, handwriting generation) where a single input maps to multiple valid outputs.
WaveNet
Autoregressive generative model with dilated causal convolutions for raw audio synthesis.
Autoregressive Formulation
Dilated Causal Convolutions
Stack convolutions with exponentially increasing dilation rates to grow the receptive field efficiently:
Gated Activation
Conditional WaveNet
Where $\mathbf{c}$ is a conditioning signal (e.g., mel spectrogram, speaker ID, linguistic features).
μ-Law Quantization
Compresses the 16-bit audio range into 256 values for categorical output via softmax.
Large Language Model (LLM) Architecture: Deep Dive
How modern LLMs like GPT-4, Claude, LLaMA, Gemini, and Mistral combine the neural network building blocks documented above into a single coherent system.
Component Map — What LLMs Use
Every building block below is documented in detail in the sections above. An LLM is fundamentally a decoder-only Transformer composed of these pieces:
Transformer (Decoder-Only)
The core architecture. Stacked blocks with causal self-attention, preventing tokens from attending to future positions.
Scaled Dot-Product Attention
The fundamental operation: $\text{softmax}(\mathbf{QK}^\top/\sqrt{d_k})\mathbf{V}$. Every token attends to all previous tokens.
Multi-Head / GQA
Parallel attention heads capture different relationship types. GQA shares KV heads to reduce memory 4–8×.
RoPE Positional Encoding
Rotary embeddings encode relative position directly into Q/K vectors. Used by LLaMA, Mistral, Claude, Gemma.
Feed-Forward Network (MLP)
Two-layer MLP at each position: project up ~4×, apply SiLU/GELU, project back down. Often interpreted as storing the model's learned factual associations.
SiLU / GELU Activation
Smooth activations used inside transformer FFNs. SiLU (Swish) in LLaMA/Mistral; GELU in GPT/BERT.
RMSNorm / LayerNorm
Normalizes activations for training stability. Modern LLMs prefer RMSNorm (no mean subtraction, faster).
Residual Connections
Skip connections around every attention and FFN sublayer. Essential for training 100+ layer models.
AdamW Optimizer
Adam with decoupled weight decay. The standard optimizer for all LLM pretraining.
Backpropagation
Gradient computation through billions of parameters. Combined with gradient checkpointing for memory efficiency.
Tokenization
LLMs do not operate on raw characters. Text is first split into subword tokens using algorithms like BPE (Byte Pair Encoding), which iteratively merges the most frequent byte pairs:
Each token is mapped to an integer ID, which is then looked up in the embedding table.
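One BPE merge step can be sketched on a toy corpus (the word list is made up; real tokenizers repeat this until a target vocabulary size is reached):

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    # Replace every occurrence of the pair with its concatenation
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

words = {tuple("lower"): 2, tuple("lowest"): 3, tuple("low"): 5}
pair = most_frequent_pair(words)
words = merge(words, pair)
```

Here ('l', 'o') and ('o', 'w') tie at frequency 10; `Counter.most_common` breaks the tie by insertion order, so 'l'+'o' is merged first.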
Token & Positional Embeddings
Modern LLMs typically use RoPE instead of learned positional embeddings, applied directly to Q and K vectors inside each attention layer rather than added to the input.
Weight Tying
Many LLMs share the token embedding matrix with the output projection (language model head):
The LLM Transformer Block (Full Equations)
A modern LLM (e.g. LLaMA-style) stacks $L$ identical blocks. Each block performs:
Step 1: Pre-Norm + Causal Multi-Head Attention + Residual
Step 2: Pre-Norm + SwiGLU FFN + Residual
Where $\mathbf{W}_1, \mathbf{W}_{\text{gate}} \in \mathbb{R}^{d \times d_{\text{ff}}}$ and $\mathbf{W}_2 \in \mathbb{R}^{d_{\text{ff}} \times d}$, with $d_{\text{ff}} \approx \frac{8}{3}d$ (SwiGLU adjustment).
Final Output
KV Cache (Inference Optimization)
During autoregressive generation, previously computed key and value vectors are cached to avoid redundant computation:
KV Cache Memory
For a 70B model with 8K context in FP16: ~2–4 GB of KV cache per sequence.
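A back-of-the-envelope check of that figure, assuming FP16 (2 bytes), one K and one V tensor per layer, and a LLaMA-2-70B-like GQA config (80 layers, 8 KV heads of dimension 128; these numbers are assumptions for illustration):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per=2):
    # Factor of 2 for K and V; bytes_per=2 assumes FP16
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per

gb = kv_cache_bytes(80, 8, 128, 8192) / 2**30
```

This lands at 2.5 GiB per sequence, consistent with the ~2–4 GB range above; without GQA (64 KV heads instead of 8) it would be 8× larger.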
PagedAttention (vLLM)
Manages KV cache as virtual memory pages to eliminate fragmentation and enable efficient batching of variable-length sequences.
LLM Training Pipeline
Phase 1: Pre-Training (Next Token Prediction)
Trained on trillions of tokens from web text, books, code, etc.
Phase 2: Supervised Fine-Tuning (SFT)
Only the response tokens contribute to the loss; prompt tokens are masked.
Phase 3: RLHF (Reinforcement Learning from Human Feedback)
Step 3a: Train a reward model $r_\phi$ on human preference pairs $(y_w \succ y_l)$:
Step 3b: Optimize the policy with PPO, constrained by a KL penalty from the reference model $\pi_{\text{ref}}$:
DPO (Direct Preference Optimization)
Bypasses the reward model entirely by reparameterizing the RLHF objective:
Scaling Laws
Kaplan Scaling (OpenAI, 2020)
Loss follows power laws in model parameters $N$, dataset size $D$, and compute $C$.
Chinchilla Optimal (Hoffmann et al., 2022)
For a given compute budget, model size and data should be scaled equally — a 10B model needs ~200B tokens.
LLM Parameter Count
| Model | Layers $L$ | Dim $d$ | Heads $h$ | Params |
|---|---|---|---|---|
| GPT-2 | 48 | 1600 | 25 | 1.5B |
| LLaMA-2 7B | 32 | 4096 | 32 | 6.7B |
| LLaMA-2 70B | 80 | 8192 | 64 | 70B |
| GPT-4 (est.) | 120 | ~12288 | 96 | ~1.8T (MoE) |
Mixture of Experts (MoE)
Replace the dense FFN with a sparse set of expert FFNs, routing each token to the top-$k$ experts:
Load Balancing Loss
Encourages equal load across experts. Mixtral 8×7B uses $E=8$ experts with $k=2$, giving 47B total params but only ~13B active per token.
Sampling & Decoding Strategies
Temperature Scaling
$\tau \to 0$: greedy (argmax). $\tau = 1$: standard softmax. $\tau > 1$: more random.
Top-$k$ Sampling
Top-$p$ (Nucleus) Sampling
Min-$p$ Sampling
Beam Search
Maintains top-$B$ candidates at each step, with length normalization exponent $\alpha$.
Repetition Penalty
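The sampling strategies above compose: divide logits by the temperature, take the softmax, then keep the smallest set of tokens whose cumulative mass reaches $p$ and renormalize. A sketch over a made-up logit vector:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_p_filter(logits, p=0.9, temperature=1.0):
    probs = softmax([l / temperature for l in logits])
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:                # smallest set with cumulative mass >= p
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}   # renormalized distribution

dist = top_p_filter([3.0, 2.0, 1.0, -2.0], p=0.8)
```

With these logits the nucleus contains only the top two tokens; the final token id would be sampled from `dist`.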
LoRA & Parameter-Efficient Fine-Tuning
LoRA (Low-Rank Adaptation)
Freeze the pretrained weights and inject trainable low-rank decompositions:
Typical $r = 8\text{–}64$, reducing trainable parameters by 1000× (e.g., 70B model → ~100M trainable params).
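The LoRA forward pass can be sketched in NumPy: the frozen weight is bypassed by a low-rank correction $\mathbf{AB}$ scaled by $\alpha/r$, with $\mathbf{B}$ initialized to zero so the adapter starts as a no-op (dimensions and seed here are arbitrary):

```python
import numpy as np

d, d_out, r, alpha = 64, 64, 8, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d_out))       # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01    # trainable, small random init
B = np.zeros((r, d_out))              # trainable, zero init => no-op at start

def lora_forward(x):
    # y = x W + x A B * (alpha / r); only A and B receive gradients
    return x @ W + (x @ A @ B) * (alpha / r)

x = rng.normal(size=(1, d))
y0 = lora_forward(x)
same = np.allclose(y0, x @ W)         # adapter contributes nothing initially
```

Here the adapter has 1,024 trainable parameters versus 4,096 in the frozen weight; at LLM scale the same ratio yields the 1000× reductions quoted above.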
QLoRA
Combines LoRA with 4-bit quantized base weights using NormalFloat4 (NF4) data type:
Fine-tune a 70B model on a single 48GB GPU.
Other PEFT Methods
| Method | Approach | Trainable Params |
|---|---|---|
| Prefix Tuning | Learnable "virtual tokens" prepended to keys/values | ~0.1% |
| Prompt Tuning | Learnable soft prompt embeddings | ~0.01% |
| Adapters | Small bottleneck layers inserted between transformer sublayers | ~1–3% |
| IA³ | Learned vectors that rescale keys, values, and FFN activations | ~0.01% |
Long Context Techniques
RoPE Frequency Scaling
Extending a 4K model to 32K context uses $s = 8$.
YaRN (Yet another RoPE extensioN)
Flash Attention
IO-aware exact attention that avoids materializing the $n \times n$ attention matrix:
Uses online softmax and tiling to keep intermediate results in SRAM, avoiding slow HBM reads/writes.
Ring Attention
Distributes sequence across devices in a ring topology, overlapping communication with computation for near-infinite context:
Sliding Window Attention
Used in Mistral. With $L$ layers and window $w$, effective receptive field is $L \times w$ tokens.
Neural Network Encyclopedia — 48 architectures — Generated March 2026
Covers: Perceptron, MLP, CNN, U-Net, RNN, LSTM, GRU, xLSTM, Bidirectional RNN, Attention, Transformer, BERT, Seq2Seq, ViT, RWKV, Autoencoder, VAE, GAN, Diffusion, Normalizing Flows, Energy-Based, Siamese/Contrastive (SimCLR, CLIP), JEPA, GNN, Capsule, Hopfield, Boltzmann, RBM, RBF, SOM, ResNet, Neural ODE, Echo State, Spiking NN, KAN, SSM/Mamba, Hypernetworks, Neural Cellular Automata, Neural Turing Machine, Bayesian NN, Liquid NN, Mixture Density, WaveNet + LLM Architecture Deep Dive