1. Overview & Motivation
Joint Embedding Predictive Architecture (JEPA) is a family of self-supervised learning architectures proposed by Yann LeCun (Meta Chief AI Scientist) in his 2022 position paper "A Path Towards Autonomous Machine Intelligence." JEPA represents a fundamental rethinking of how AI systems should learn to understand the world.
Why JEPA?
Traditional self-supervised approaches have fundamental limitations:
- Generative models (e.g., MAE, GPT) try to reconstruct every pixel/token — wasting capacity on irrelevant details like carpet textures or leaf positions that are inherently unpredictable.
- Contrastive methods (e.g., SimCLR, DINO) rely on hand-crafted data augmentations (crops, color jitter, flips), creating inductive biases that don't generalize across domains.
- JEPA sidesteps both problems by operating entirely in a learned latent space, making predictions about what matters rather than what every detail looks like.
Key Properties
- Non-generative: Does not reconstruct input pixels/tokens — no decoder required
- Augmentation-free: No hand-crafted data transforms needed (unlike contrastive methods)
- Latent prediction: All predictions happen in abstract representation space
- Compute-efficient: Often 2–5× fewer GPU hours than comparable methods
- Semantically rich: Captures high-level meaning without being distracted by low-level detail
2. Theoretical Foundations
Energy-Based Model Perspective
JEPA can be understood through the lens of Energy-Based Models (EBMs). The system learns an energy function E(x, y) that assigns low energy to compatible (x, y) pairs and high energy to incompatible ones:

E(x, y) = min_z D(s_θ(y), Pred(enc_θ(x), z))

where x is the context input, y is the target, enc_θ is the context encoder, s_θ is the target encoder, z is a latent variable representing the information needed to predict y from x, and D is a distance measured in representation space.
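Omitting the latent variable z for brevity, the energy view can be sketched in a few lines of PyTorch. The linear encoders and all names below are toy stand-ins, not any released API:

```python
import torch
import torch.nn as nn

# Toy stand-ins for enc_theta, s_theta, and Pred (illustrative only)
enc = nn.Linear(32, 16)    # context encoder enc_theta
s_tgt = nn.Linear(32, 16)  # target encoder s_theta
pred = nn.Linear(16, 16)   # predictor

def energy(x, y):
    """E(x, y): distance between predicted and actual target representations."""
    with torch.no_grad():
        target = s_tgt(y)  # s_theta(y), no gradient through the target branch
    return ((pred(enc(x)) - target) ** 2).mean()

e = energy(torch.randn(4, 32), torch.randn(4, 32))
print(e.item())  # low energy = compatible (x, y) pair
```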
The Collapse Problem
A major challenge in joint-embedding architectures is representation collapse — the model learns to map everything to the same constant representation, trivially minimizing the prediction error. JEPA addresses this through several mechanisms:
| Strategy | How It Works | Used In |
|---|---|---|
| EMA (Exponential Moving Average) | Target encoder is a slow-moving average of the context encoder — asymmetric update prevents collapse | I-JEPA, V-JEPA, V-JEPA 2 |
| VICReg | Variance-Invariance-Covariance regularization: enforces variance in representations, decorrelates dimensions, and matches embeddings | EB-JEPA |
| Predictor Architecture | A narrow predictor network (bottleneck) prevents information from being passed through trivially | All JEPA variants |
| Stop Gradient | Gradient is stopped on the target encoder branch to prevent both encoders from converging to the same solution | All JEPA variants |
Information-Theoretic View
JEPA's objective can be seen as maximizing mutual information I(Zx; Zy) between the representations of context and target, while minimizing the entropy of the representation (discarding unpredictable information). The model learns representations that are:
- Maximally informative: Captures essential features for prediction
- Maximally predictable: Discards stochastic, unpredictable detail
- Semantically grounded: Focuses on structure and meaning
3. Core Architecture
The JEPA framework has three fundamental components that appear in all variants:
Context Encoder
The context encoder processes the visible/available portion of the input and produces a set of representation tokens. Typically a Vision Transformer (ViT) in vision tasks. It only sees the unmasked context, making it computationally efficient.
Target Encoder
Architecturally identical to the context encoder, but its parameters are updated as an Exponential Moving Average (EMA) of the context encoder weights:

θ_target ← τ · θ_target + (1 − τ) · θ_context

where the momentum τ is typically close to 1 (e.g., 0.996, annealed toward 1.0 over training). The target encoder processes the full input; target-region representations are extracted from its output. A stop-gradient is applied so no gradients flow back through the target encoder during training.
Predictor
A narrow Transformer (fewer layers, smaller hidden dimension) that takes context encoder outputs plus positional tokens for the target locations, and predicts the target representations. The bottleneck design prevents trivial information copying.
Training Objective
L = (1/M) Σᵢ ‖ Pred(enc_θ(x), mᵢ) − sg(s_θ(yᵢ)) ‖²

where sg denotes stop-gradient and mᵢ is the positional token for the i-th target block. The loss is the mean squared error between predicted and actual target representations, averaged over all M target blocks.
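The objective reduces to a few lines of PyTorch. This is a minimal sketch; the shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def jepa_loss(predicted, target):
    """MSE between predicted and actual target representations.

    predicted: (B, M, D) predictor outputs for M target tokens
    target:    (B, M, D) target-encoder outputs; detach() implements sg
    """
    return F.mse_loss(predicted, target.detach())

pred = torch.randn(2, 8, 64, requires_grad=True)
tgt = torch.randn(2, 8, 64, requires_grad=True)
loss = jepa_loss(pred, tgt)
loss.backward()
assert pred.grad is not None  # gradient flows to the predictor branch
assert tgt.grad is None       # no gradient flows to the target branch
```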
4. JEPA vs Other Paradigms
| Property | JEPA | Contrastive (SimCLR, DINO) | Generative (MAE, BEiT) | Autoregressive (GPT) |
|---|---|---|---|---|
| Prediction space | Latent representations | Latent similarity | Pixel / token space | Token space |
| Augmentations needed | None | Heavy (crops, jitter, blur) | Masking only | None (causal mask) |
| Decoder | None (narrow predictor) | None (projection head) | Full decoder required | Decoder-only (the model itself is the decoder) |
| Handles uncertainty | Yes — ignores unpredictable details | Partially — via invariance | No — must reconstruct all details | No — must predict exact tokens |
| Compute efficiency | High (2–5× fewer GPU hours) | Medium (augmentation overhead) | Low (decoder + reconstruction) | Low (sequential generation) |
| Collapse prevention | EMA + stop-gradient + narrow predictor | Negative samples / momentum | N/A (reconstructive loss) | N/A (next-token loss) |
| Semantic quality | Very high (abstract features) | High (invariant features) | Mixed (pixel-level + some semantic) | High for language |
5. I-JEPA — Image-Based JEPA
I-JEPA is the first practical implementation of JEPA, applied to self-supervised learning from images. Published at CVPR 2023, it learns semantic visual representations without relying on hand-crafted data augmentations.
Architecture Details
Multi-Block Masking Strategy
The masking strategy is critical to I-JEPA's success. It forces the model to learn semantic representations rather than local textures:
- Target blocks: 4 large blocks sampled with scale 15–20% of image area and aspect ratios from 0.75–1.5
- Context block: A single large block covering ~85–100% of the image, with target regions removed
- Key design: Masking is applied to the output of the target encoder, not its input; the target encoder always sees the full image, so target representations carry full-context semantics
Model Configurations
| Model | Encoder | Params | ImageNet Linear | ImageNet 1% Semi-sup |
|---|---|---|---|---|
| I-JEPA | ViT-L/16 | 307M | 74.4% | 70.2% |
| I-JEPA | ViT-H/14 | 632M | 76.5% | 72.4% |
| I-JEPA | ViT-H/16 @ 448 | 632M | 77.3% | 73.3% |
Code Example: I-JEPA Conceptual Implementation
```python
import torch
import torch.nn as nn

# NOTE: VisionTransformer and PredictorTransformer stand in for standard ViT
# implementations (e.g., adapted from timm); they are not defined in this sketch.

class IJEPA(nn.Module):
    """Simplified I-JEPA architecture."""

    def __init__(self, img_size=224, patch_size=16, embed_dim=1024,
                 enc_depth=24, pred_depth=6, pred_dim=384, ema_decay=0.996):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2

        # Context encoder (ViT)
        self.context_encoder = VisionTransformer(
            img_size=img_size, patch_size=patch_size,
            embed_dim=embed_dim, depth=enc_depth, num_heads=16
        )
        # Target encoder (EMA copy, no gradient)
        self.target_encoder = VisionTransformer(
            img_size=img_size, patch_size=patch_size,
            embed_dim=embed_dim, depth=enc_depth, num_heads=16
        )
        # Initialize target encoder from context encoder
        self.target_encoder.load_state_dict(self.context_encoder.state_dict())
        for p in self.target_encoder.parameters():
            p.requires_grad = False

        # Predictor (narrow ViT)
        self.predictor = PredictorTransformer(
            embed_dim=embed_dim, pred_dim=pred_dim,
            depth=pred_depth, num_heads=12
        )
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_target_encoder(self):
        """EMA update: θ_target = τ * θ_target + (1 - τ) * θ_context"""
        for param_t, param_c in zip(
            self.target_encoder.parameters(),
            self.context_encoder.parameters()
        ):
            param_t.data.mul_(self.ema_decay).add_(
                param_c.data, alpha=1.0 - self.ema_decay
            )

    def forward(self, images, context_mask, target_masks):
        """
        Args:
            images: (B, 3, H, W) input images
            context_mask: (B, N) boolean mask for context patches
            target_masks: list of (B, N) boolean masks for target blocks
        """
        # Context encoder: encode visible patches only
        context_tokens = self.context_encoder(images, mask=context_mask)

        # Target encoder: encode the full image, then extract target tokens
        with torch.no_grad():
            target_tokens_full = self.target_encoder(images)

        # Predict target representations from context
        total_loss = 0.0
        for target_mask in target_masks:
            # Extract target representations (stop-gradient), flattened across the batch
            target_repr = target_tokens_full[target_mask].detach()
            # Predict using context + positional info for the target locations
            predicted = self.predictor(context_tokens, target_positions=target_mask)
            # L2 loss in representation space
            loss = nn.functional.mse_loss(predicted, target_repr)
            total_loss += loss
        return total_loss / len(target_masks)


def sample_target_blocks(num_patches_h, num_patches_w, num_targets=4,
                         scale_range=(0.15, 0.20), aspect_range=(0.75, 1.5)):
    """Sample multi-block target masks for I-JEPA."""
    masks = []
    for _ in range(num_targets):
        # Sample scale and aspect ratio
        scale = torch.empty(1).uniform_(*scale_range).item()
        aspect = torch.empty(1).uniform_(*aspect_range).item()
        # Compute block dimensions (aspect = h / w)
        num_patches = num_patches_h * num_patches_w
        target_area = int(num_patches * scale)
        h = int(round((target_area * aspect) ** 0.5))
        w = int(round((target_area / aspect) ** 0.5))
        h = min(h, num_patches_h)
        w = min(w, num_patches_w)
        # Random position
        top = torch.randint(0, num_patches_h - h + 1, (1,)).item()
        left = torch.randint(0, num_patches_w - w + 1, (1,)).item()
        # Create mask
        mask = torch.zeros(num_patches_h, num_patches_w, dtype=torch.bool)
        mask[top:top+h, left:left+w] = True
        masks.append(mask.flatten())
    return masks
```
GitHub: facebookresearch/ijepa
6. V-JEPA — Video JEPA
V-JEPA extends JEPA to video understanding, learning by predicting missing spatiotemporal regions in representation space. It is pre-trained entirely with unlabeled video data (VideoMix2M dataset) — no labels, no augmentations.
Spatiotemporal Masking
Unlike I-JEPA's spatial-only masking, V-JEPA masks portions of video in both space and time:
- Input video is tokenized into 3D patches (space × space × time)
- Large spatiotemporal blocks are masked (~90% of tokens)
- Context encoder only processes the visible ~10% of tokens (very efficient)
- The predictor must predict representations for the masked 90%
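The steps above can be sketched as tube-style masking: one large spatial block is masked at every time step. The helper below and its exact block-sampling rule are my own simplifications, not the released V-JEPA code:

```python
import torch

def spatiotemporal_mask(t, h, w, mask_ratio=0.9):
    """Mask a contiguous spatial block across all time steps ('tube' masking).

    Returns a (t*h*w,) boolean tensor, True = masked token.
    The block area is chosen so roughly `mask_ratio` of tokens are masked.
    """
    area = int(h * w * mask_ratio)
    bh = min(h, int(round(area ** 0.5)))
    bw = min(w, (area + bh - 1) // bh)  # ceil(area / bh)
    top = torch.randint(0, h - bh + 1, (1,)).item()
    left = torch.randint(0, w - bw + 1, (1,)).item()
    mask = torch.zeros(t, h, w, dtype=torch.bool)
    mask[:, top:top+bh, left:left+bw] = True  # same block at every time step
    return mask.flatten()

m = spatiotemporal_mask(8, 14, 14)
print(m.float().mean())  # close to the 0.9 mask ratio
```

The context encoder then processes only the tokens where the mask is False, which is where the efficiency gain comes from.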
Key Results
- Learns motion and appearance representations from video alone
- Frozen features (no fine-tuning) transfer well to image and video tasks
- Outperforms pixel-reconstruction methods on action recognition
- Pre-trained on 2M video clips without any labels
GitHub: facebookresearch/jepa
7. V-JEPA 2 & World Models
V-JEPA 2 is a significant evolution — a 1.2 billion parameter world model trained primarily on video. It demonstrates that self-supervised video pre-training can produce representations suitable for understanding, prediction, and physical planning.
Architecture
- Backbone: ViT-eH (1.2B parameters) — a scaled-up Vision Transformer
- Pre-training data: 1M+ hours of internet video (unlabeled)
- Training: Self-supervised JEPA objective (masking + latent prediction)
- Post-training: Attentive probing / task-specific adapters (frozen backbone)
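Attentive probing pools frozen patch features with a learned query via cross-attention, then classifies the pooled vector. A minimal sketch; the module name and sizes are assumptions, not the released code:

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Cross-attention pooling over frozen features, then a linear classifier."""
    def __init__(self, dim, num_classes, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats):                   # feats: (B, N, D), frozen backbone output
        q = self.query.expand(feats.size(0), -1, -1)
        pooled, _ = self.attn(q, feats, feats)  # (B, 1, D)
        return self.head(pooled.squeeze(1))     # (B, num_classes)

probe = AttentiveProbe(dim=64, num_classes=10)
logits = probe(torch.randn(2, 5, 64))
print(logits.shape)  # torch.Size([2, 10])
```

Only the probe is trained; the backbone stays frozen, which is what makes this a cheap post-training step.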
V-JEPA 2-AC: Action-Conditioned World Model
V-JEPA 2-AC extends the base model with action conditioning for robotics: the predictor receives robot actions alongside the current latent state, so the model can roll out predicted future states and select actions by planning in representation space.
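The action-conditioned rollout idea can be sketched as follows. All names, shapes, and the goal-scoring rule are illustrative assumptions, not the V-JEPA 2-AC API:

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Predicts the next latent state from the current latent state and an action."""
    def __init__(self, state_dim=64, action_dim=7, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.GELU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):  # state: (B, D), action: (B, A)
        return self.net(torch.cat([state, action], dim=-1))

def score_plan(pred, state, actions, goal):
    """Roll out a candidate action sequence and score closeness to a goal embedding."""
    for a in actions:                           # actions: list of (B, A) tensors
        state = pred(state, a)
    return -((state - goal) ** 2).mean(dim=-1)  # higher = closer to goal

pred = ActionConditionedPredictor()
s, g = torch.randn(1, 64), torch.randn(1, 64)
plan = [torch.randn(1, 7) for _ in range(3)]
print(score_plan(pred, s, plan, g).shape)  # torch.Size([1])
```

Planning then amounts to sampling many candidate action sequences and executing the best-scoring one (e.g., with the cross-entropy method).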
Performance Highlights
| Benchmark | V-JEPA 2 Score | Notes |
|---|---|---|
| Something-Something v2 | 77.3 top-1 | Strong motion understanding |
| Epic-Kitchens-100 (anticipation) | 39.7 recall@5 | State-of-the-art |
| Robot manipulation (zero-shot) | Pick-and-place success | Trained with <62h unlabeled robot video |
GitHub: facebookresearch/vjepa2 | Paper: arXiv:2506.09985
8. VL-JEPA — Vision-Language JEPA
VL-JEPA applies JEPA principles to vision-language modeling. Instead of autoregressively generating text tokens (like standard VLMs), VL-JEPA predicts continuous embeddings of target text — a fundamentally different approach to multimodal AI.
Architecture
- X-Encoder: V-JEPA 2 backbone — outputs visual token representations
- Predictor: Initialized from Llama 3 Transformer layers — predicts text embeddings from visual + text context
- Loss: L2 loss in continuous embedding space (not cross-entropy over vocabulary)
Key Advantages
- 50% fewer trainable parameters compared to standard VLM training
- Comparable performance to InstructBLIP and QwenVL on VQA benchmarks
- Only 1.6B total parameters — significantly smaller than typical VLMs
- Inherits V-JEPA 2's strong video understanding for video QA
Paper: arXiv:2512.10942
9. LLM-JEPA — JEPA for Large Language Models
LLM-JEPA adapts the JEPA framework for large language models, applicable to both pre-training and fine-tuning. Instead of predicting the next token, LLM-JEPA predicts representations of masked text spans in embedding space.
How It Works
- Text is split into context and target spans
- Context encoder processes visible tokens
- Predictor predicts latent representations of masked spans
- Target encoder (EMA) provides ground-truth embeddings
- L2 loss in representation space — no vocabulary softmax needed
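The steps above can be sketched with toy encoders. This is illustrative only; the real model uses Transformer encoders and an EMA target encoder:

```python
import torch
import torch.nn as nn

vocab, dim = 100, 32
context_enc = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, dim))
target_enc = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, dim))  # EMA copy in practice
predictor = nn.Linear(dim, dim)

tokens = torch.randint(0, vocab, (2, 16))
mask = torch.zeros(2, 16, dtype=torch.bool)
mask[:, 8:12] = True                            # hide one span per sequence

ctx = context_enc(tokens.masked_fill(mask, 0))  # 0 = stand-in mask token id
with torch.no_grad():
    tgt = target_enc(tokens)                    # ground-truth span embeddings

# L2 loss on the masked positions only; no vocabulary softmax
loss = ((predictor(ctx)[mask] - tgt[mask]) ** 2).mean()
print(loss.item())
```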
Results Across Model Families
| Model Family | Task | Result |
|---|---|---|
| Llama 3 | NL-RX | LLM-JEPA outperforms standard training |
| OpenELM | GSM8K | LLM-JEPA outperforms standard training |
| Gemma 2 | Spider | LLM-JEPA outperforms standard training |
| OLMo | RottenTomatoes | LLM-JEPA outperforms standard training |
Paper: arXiv:2509.14252
10. D-JEPA — Denoising JEPA
D-JEPA pioneers the integration of JEPA with generative modeling. It reinterprets JEPA's masked prediction as a generalized next-token prediction strategy and incorporates diffusion loss to model per-token probability distributions, enabling data generation in continuous space.
Key Innovation
While standard JEPA predicts a single point estimate for each masked token's representation, D-JEPA models the full probability distribution over possible representations using a diffusion process. This allows:
- High-fidelity image generation (not just representation learning)
- Conditional generation on ImageNet classes
- Bridging the gap between discriminative JEPA and generative modeling
Results
D-JEPA's base, large, and huge variants outperform previous generative models of comparable scale on ImageNet conditional-generation benchmarks.
11. Domain-Specific JEPA Variants
The JEPA framework has been adapted across numerous modalities and domains:
A-JEPA — Audio JEPA
Extends JEPA to audio spectrograms. Audio is converted to a spectrogram, tokenized into patches, and masked for prediction. Uses a curriculum masking strategy that starts with easy predictions and progressively increases difficulty. Random masking is preferred over block masking for audio due to the different spatial structure of spectrograms.
Paper: arXiv:2311.15830
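The curriculum idea can be sketched as a simple masking-ratio schedule. The linear ramp and the start/end ratios below are my assumptions, not values from the paper:

```python
def curriculum_mask_ratio(epoch, total_epochs, start=0.3, end=0.8):
    """Linearly ramp the masking ratio from easy (few masked patches) to hard."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return start + (end - start) * frac

ratios = [curriculum_mask_ratio(e, 100) for e in (0, 50, 99)]
print(ratios)  # ramps from the start ratio toward the end ratio
```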
MC-JEPA — Motion and Content JEPA
Jointly learns optical flow estimation and content features within a shared encoder. Achieves endpoint error (EPE) of ~2.81 on Sintel Clean and ~3.51 on Sintel Final — comparable to specialized unsupervised optical flow estimators.
T-JEPA — Trajectory JEPA
Samples and predicts trajectory segments in representation space. Robust to noise, irregular sampling, and spatial distortion. Bypasses the need for handcrafted augmentations that are difficult to design for trajectory data.
TS-JEPA — Time Series JEPA
Divides time series into patch tokens and uses high masking ratios (>70%). Outperforms masked input-reconstruction and contrastive baselines on both classification and long-term forecasting tasks.
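Patch tokenization with a high masking ratio, as described above, can be sketched as follows (sizes are illustrative):

```python
import torch

def patchify_series(x, patch_len=16):
    """Split a (B, T) series into non-overlapping (B, T // patch_len, patch_len) patches."""
    b, t = x.shape
    n = t // patch_len
    return x[:, :n * patch_len].reshape(b, n, patch_len)

def random_mask(num_patches, ratio=0.75):
    """Random boolean mask with a high masking ratio (>70%), per the description above."""
    return torch.rand(num_patches) < ratio

x = torch.randn(4, 256)
patches = patchify_series(x)
print(patches.shape)  # torch.Size([4, 16, 16])
```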
Graph-JEPA
Applies JEPA to graph-structured data. Much more memory-efficient than contrastive graph learning since it requires no augmentations or negative samples.
TI-JEPA / JEPA-T — Text-Image JEPA
Cross-modal variants with cross-attention modules for multimodal fusion. Competitive on multimodal sentiment analysis, open-vocabulary image generation, and multimodal retrieval.
Drive-JEPA
Applies Video JEPA to end-to-end autonomous driving with multimodal trajectory distillation. Combines JEPA's self-supervised video understanding with driving-specific objectives.
12. EB-JEPA — Energy-Based JEPA Library
EB-JEPA is Meta's lightweight, open-source library for exploring energy-based joint-embedding predictive architectures. Designed for accessibility — every example trains on a single GPU in a few hours.
What's Included
- Image representation learning: I-JEPA style training on images
- Video representation learning: V-JEPA style training on video
- Action-conditioned video: World model training with action inputs
- Planning: Goal-conditioned planning using JEPA-based world models
Collapse Prevention: VICReg
EB-JEPA uses VICReg (Variance-Invariance-Covariance Regularization) to prevent representation collapse:
```python
# VICReg loss components
import torch
import torch.nn.functional as F

def off_diagonal(x):
    """Return a flattened view of the off-diagonal elements of a square matrix."""
    n, m = x.shape
    assert n == m
    return x.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()

def vicreg_loss(z1, z2, lam=25.0, mu=25.0, nu=1.0):
    """
    z1, z2: (B, D) representations to align
    """
    # 1. Invariance: match representations
    inv_loss = F.mse_loss(z1, z2)
    # 2. Variance: prevent collapse to a constant representation
    std_z1 = torch.sqrt(z1.var(dim=0) + 1e-4)
    std_z2 = torch.sqrt(z2.var(dim=0) + 1e-4)
    var_loss = (torch.mean(F.relu(1 - std_z1)) +
                torch.mean(F.relu(1 - std_z2)))
    # 3. Covariance: decorrelate embedding dimensions
    z1_centered = z1 - z1.mean(dim=0)
    z2_centered = z2 - z2.mean(dim=0)
    cov_z1 = (z1_centered.T @ z1_centered) / (z1.shape[0] - 1)
    cov_z2 = (z2_centered.T @ z2_centered) / (z2.shape[0] - 1)
    # Off-diagonal elements should be pushed toward zero
    cov_loss = (off_diagonal(cov_z1).pow(2).sum() / z1.shape[1] +
                off_diagonal(cov_z2).pow(2).sum() / z2.shape[1])
    return lam * inv_loss + mu * var_loss + nu * cov_loss

# Example: align two batches of embeddings
# loss = vicreg_loss(torch.randn(8, 16), torch.randn(8, 16))
```
GitHub: facebookresearch/eb_jepa | Paper: arXiv:2602.03604
13. H-JEPA — Hierarchical JEPA
Hierarchical JEPA (H-JEPA) is the theoretical capstone of LeCun's vision for autonomous machine intelligence. It proposes a multi-level hierarchy of JEPA modules operating at different timescales and abstraction levels — analogous to how human cognition operates from reflexive motor control to abstract planning.
Hierarchical World Model
Key Properties of H-JEPA
- Multi-level abstraction: Higher levels represent more abstract, longer-term predictions
- Top-down influence: Higher-level goals constrain lower-level predictions and actions
- Variable time scales: Each level operates at a different temporal granularity
- Learned, not handcrafted: The hierarchy and abstraction levels emerge from self-supervised training
Connection to LeCun's Architecture for Intelligence
H-JEPA is one module in LeCun's proposed cognitive architecture, alongside:
| Module | Role |
|---|---|
| Configurator | Sets parameters and attention for other modules based on current task |
| Perception | Estimates world state from sensory input |
| World Model (H-JEPA) | Predicts future world states; the core of the system |
| Cost Module | Computes energy/cost of predicted states (drives behavior) |
| Short-term Memory | Stores recent world states and predictions |
| Actor | Computes action sequences to minimize predicted cost |
14. Timeline & Evolution
15. Code & Implementations
Official Meta Repositories
| Repository | Model | Framework | License |
|---|---|---|---|
| facebookresearch/ijepa | I-JEPA | PyTorch | CC-BY-NC 4.0 |
| facebookresearch/jepa | V-JEPA | PyTorch | CC-BY-NC 4.0 |
| facebookresearch/vjepa2 | V-JEPA 2 + AC | PyTorch | Apache 2.0 (commercial) |
| facebookresearch/eb_jepa | EB-JEPA | PyTorch | MIT |
Quick Start with EB-JEPA
```shell
# Clone and setup
git clone https://github.com/facebookresearch/eb_jepa.git
cd eb_jepa
pip install -e .

# Train I-JEPA on CIFAR-10 (single GPU, ~1 hour)
python examples/image_ijepa/train.py \
    --dataset cifar10 \
    --batch-size 256 \
    --epochs 100 \
    --encoder vit_small \
    --pred-depth 4

# Train V-JEPA on video (single GPU, a few hours)
python examples/video_vjepa/train.py \
    --dataset kinetics400_subset \
    --batch-size 32 \
    --epochs 50
```
Using V-JEPA 2 Pre-trained Features
```python
import torch
from vjepa2 import load_model

# Load pre-trained V-JEPA 2 encoder
model = load_model("vjepa2-viteh", pretrained=True)
model.eval()

# Extract features from video
# video: (B, T, C, H, W) tensor
video = torch.randn(1, 16, 3, 224, 224)
with torch.no_grad():
    features = model.encode(video)
# features: (B, N, D) where N = num_patches, D = embed_dim

# Global average pooling for classification
global_features = features.mean(dim=1)  # (B, D)

# Per-patch features for dense prediction
spatial_features = features  # (B, N, D)
```
Community Implementations
| Package | Source | Description |
|---|---|---|
| jepa (PyPI) | Community | General JEPA framework built with PyTorch + Transformers |
| HuggingFace Course | HuggingFace | I-JEPA tutorial in the Computer Vision Course |
16. Resources & Links
Key Papers
| Paper | Year | Link |
|---|---|---|
| A Path Towards Autonomous Machine Intelligence (LeCun) | 2022 | |
| I-JEPA: Self-Supervised Learning from Images | 2023 | arXiv |
| V-JEPA: Video Joint-Embedding Predictive Architecture | 2024 | Meta Blog |
| V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | 2025 | arXiv |
| VL-JEPA: Vision-Language JEPA | 2025 | arXiv |
| LLM-JEPA: LLMs Meet JEPA | 2025 | arXiv |
| A-JEPA: JEPA Can Listen | 2023 | arXiv |
| EB-JEPA: Lightweight Library for Energy-Based JEPA | 2026 | arXiv |
Meta AI Blog Posts
- I-JEPA: The first AI model based on Yann LeCun's vision for more human-like AI
- V-JEPA: The next step toward advanced machine intelligence
- Introducing V-JEPA 2 world model and new benchmarks