1. Overview & Motivation
Joint Embedding Predictive Architecture (JEPA) is a family of self-supervised learning architectures proposed by Yann LeCun (Meta Chief AI Scientist) in his 2022 position paper "A Path Towards Autonomous Machine Intelligence." JEPA represents a fundamental rethinking of how AI systems should learn to understand the world.
Why JEPA?
Traditional self-supervised approaches have fundamental limitations:
- Generative models (e.g., MAE, GPT) try to reconstruct every pixel/token — wasting capacity on irrelevant details like carpet textures or leaf positions that are inherently unpredictable.
- Contrastive methods (e.g., SimCLR, DINO) rely on hand-crafted data augmentations (crops, color jitter, flips), creating inductive biases that don't generalize across domains.
- JEPA sidesteps both problems by operating entirely in a learned latent space, making predictions about what matters rather than what every detail looks like.
Key Properties
- Non-generative: Does not reconstruct input pixels/tokens — no decoder required
- Augmentation-free: No hand-crafted data transforms needed (unlike contrastive methods)
- Latent prediction: All predictions happen in abstract representation space
- Compute-efficient: Often 2–5× fewer GPU hours than comparable methods
- Semantically rich: Captures high-level meaning without being distracted by low-level detail
2. Theoretical Foundations
Energy-Based Model Perspective
JEPA can be understood through the lens of Energy-Based Models (EBMs). The system learns an energy function E(x, y) that assigns low energy to compatible (x, y) pairs and high energy to incompatible ones:

E(x, y) = min_z D(s_θ(y), Pred(enc_θ(x), z))

where x is the context input, y is the target, enc_θ is the context encoder, s_θ is the target encoder, z is a latent variable representing the information needed to predict y from x, and D is a distance measured in representation space.
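Omitting the latent variable z for brevity, the energy view can be sketched in a few lines of PyTorch. The linear encoders and all names below are toy stand-ins, not any released API:

```python
import torch
import torch.nn as nn

# Toy stand-ins for enc_theta, s_theta, and Pred (illustrative only)
enc = nn.Linear(32, 16)    # context encoder enc_theta
s_tgt = nn.Linear(32, 16)  # target encoder s_theta
pred = nn.Linear(16, 16)   # predictor

def energy(x, y):
    """E(x, y): distance between predicted and actual target representations."""
    with torch.no_grad():
        target = s_tgt(y)  # s_theta(y), no gradient through the target branch
    return ((pred(enc(x)) - target) ** 2).mean()

e = energy(torch.randn(4, 32), torch.randn(4, 32))
print(e.item())  # low energy = compatible (x, y) pair
```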
The Collapse Problem
A major challenge in joint-embedding architectures is representation collapse — the model learns to map everything to the same constant representation, trivially minimizing the prediction error. JEPA addresses this through several mechanisms:
| Strategy | How It Works | Used In |
|---|---|---|
| EMA (Exponential Moving Average) | Target encoder is a slow-moving average of the context encoder — asymmetric update prevents collapse | I-JEPA, V-JEPA, V-JEPA 2 |
| VICReg | Variance-Invariance-Covariance regularization: enforces variance in representations, decorrelates dimensions, and matches embeddings | EB-JEPA |
| Predictor Architecture | A narrow predictor network (bottleneck) prevents information from being passed through trivially | All JEPA variants |
| Stop Gradient | Gradient is stopped on the target encoder branch to prevent both encoders from converging to the same solution | All JEPA variants |
Information-Theoretic View
JEPA's objective can be seen as maximizing mutual information I(Zx; Zy) between the representations of context and target, while minimizing the entropy of the representation (discarding unpredictable information). The model learns representations that are:
- Maximally informative: Captures essential features for prediction
- Maximally predictable: Discards stochastic, unpredictable detail
- Semantically grounded: Focuses on structure and meaning
3. Core Architecture
The JEPA framework has three fundamental components that appear in all variants:
Context Encoder
The context encoder processes the visible/available portion of the input and produces a set of representation tokens. Typically a Vision Transformer (ViT) in vision tasks. It only sees the unmasked context, making it computationally efficient.
Target Encoder
Architecturally identical to the context encoder, but its parameters are updated as an Exponential Moving Average (EMA) of the context encoder weights:

θ_target ← τ · θ_target + (1 − τ) · θ_context

where the momentum τ is typically close to 1 (e.g., 0.996, annealed toward 1.0 over training). The target encoder processes the full input; target-region representations are extracted from its output. A stop-gradient is applied so no gradients flow back through the target encoder during training.
Predictor
A narrow Transformer (fewer layers, smaller hidden dimension) that takes context encoder outputs plus positional tokens for the target locations, and predicts the target representations. The bottleneck design prevents trivial information copying.
Training Objective
L = (1/M) Σᵢ ‖ Pred(enc_θ(x), mᵢ) − sg(s_θ(yᵢ)) ‖²

where sg denotes stop-gradient and mᵢ is the positional token for the i-th target block. The loss is the mean squared error between predicted and actual target representations, averaged over all M target blocks.
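The objective reduces to a few lines of PyTorch. This is a minimal sketch; the shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def jepa_loss(predicted, target):
    """MSE between predicted and actual target representations.

    predicted: (B, M, D) predictor outputs for M target tokens
    target:    (B, M, D) target-encoder outputs; detach() implements sg
    """
    return F.mse_loss(predicted, target.detach())

pred = torch.randn(2, 8, 64, requires_grad=True)
tgt = torch.randn(2, 8, 64, requires_grad=True)
loss = jepa_loss(pred, tgt)
loss.backward()
assert pred.grad is not None  # gradient flows to the predictor branch
assert tgt.grad is None       # no gradient flows to the target branch
```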
4. JEPA vs Other Paradigms
| Property | JEPA | Contrastive (SimCLR, DINO) | Generative (MAE, BEiT) | Autoregressive (GPT) |
|---|---|---|---|---|
| Prediction space | Latent representations | Latent similarity | Pixel / token space | Token space |
| Augmentations needed | None | Heavy (crops, jitter, blur) | Masking only | None (causal mask) |
| Decoder | None (narrow predictor) | None (projection head) | Full decoder required | Decoder-only (the model itself is the decoder) |
| Handles uncertainty | Yes — ignores unpredictable details | Partially — via invariance | No — must reconstruct all details | No — must predict exact tokens |
| Compute efficiency | High (2–5× fewer GPU hours) | Medium (augmentation overhead) | Low (decoder + reconstruction) | Low (sequential generation) |
| Collapse prevention | EMA + stop-gradient + narrow predictor | Negative samples / momentum | N/A (reconstructive loss) | N/A (next-token loss) |
| Semantic quality | Very high (abstract features) | High (invariant features) | Mixed (pixel-level + some semantic) | High for language |
5. I-JEPA — Image-Based JEPA
I-JEPA is the first practical implementation of JEPA, applied to self-supervised learning from images. Published at CVPR 2023, it learns semantic visual representations without relying on hand-crafted data augmentations.
Architecture Details
Multi-Block Masking Strategy
The masking strategy is critical to I-JEPA's success. It forces the model to learn semantic representations rather than local textures:
- Target blocks: 4 large blocks sampled with scale 15–20% of image area and aspect ratios from 0.75–1.5
- Context block: A single large block covering ~85–100% of the image, with target regions removed
- Key design: Masking is applied to the output of the target encoder, not its input; the target encoder always sees the full image, so target representations carry full-context semantics
Model Configurations
| Model | Encoder | Params | ImageNet Linear | ImageNet 1% Semi-sup |
|---|---|---|---|---|
| I-JEPA | ViT-L/16 | 307M | 74.4% | 70.2% |
| I-JEPA | ViT-H/14 | 632M | 76.5% | 72.4% |
| I-JEPA | ViT-H/16 @ 448 | 632M | 77.3% | 73.3% |
Code Example: I-JEPA Conceptual Implementation
```python
import torch
import torch.nn as nn

# NOTE: VisionTransformer and PredictorTransformer stand in for standard ViT
# implementations (e.g., adapted from timm); they are not defined in this sketch.

class IJEPA(nn.Module):
    """Simplified I-JEPA architecture."""

    def __init__(self, img_size=224, patch_size=16, embed_dim=1024,
                 enc_depth=24, pred_depth=6, pred_dim=384, ema_decay=0.996):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2

        # Context encoder (ViT)
        self.context_encoder = VisionTransformer(
            img_size=img_size, patch_size=patch_size,
            embed_dim=embed_dim, depth=enc_depth, num_heads=16
        )
        # Target encoder (EMA copy, no gradient)
        self.target_encoder = VisionTransformer(
            img_size=img_size, patch_size=patch_size,
            embed_dim=embed_dim, depth=enc_depth, num_heads=16
        )
        # Initialize target encoder from context encoder
        self.target_encoder.load_state_dict(self.context_encoder.state_dict())
        for p in self.target_encoder.parameters():
            p.requires_grad = False

        # Predictor (narrow ViT)
        self.predictor = PredictorTransformer(
            embed_dim=embed_dim, pred_dim=pred_dim,
            depth=pred_depth, num_heads=12
        )
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_target_encoder(self):
        """EMA update: θ_target = τ * θ_target + (1 - τ) * θ_context"""
        for param_t, param_c in zip(
            self.target_encoder.parameters(),
            self.context_encoder.parameters()
        ):
            param_t.data.mul_(self.ema_decay).add_(
                param_c.data, alpha=1.0 - self.ema_decay
            )

    def forward(self, images, context_mask, target_masks):
        """
        Args:
            images: (B, 3, H, W) input images
            context_mask: (B, N) boolean mask for context patches
            target_masks: list of (B, N) boolean masks for target blocks
        """
        # Context encoder: encode visible patches only
        context_tokens = self.context_encoder(images, mask=context_mask)

        # Target encoder: encode the full image, then extract target tokens
        with torch.no_grad():
            target_tokens_full = self.target_encoder(images)

        # Predict target representations from context
        total_loss = 0.0
        for target_mask in target_masks:
            # Extract target representations (stop-gradient), flattened across the batch
            target_repr = target_tokens_full[target_mask].detach()
            # Predict using context + positional info for the target locations
            predicted = self.predictor(context_tokens, target_positions=target_mask)
            # L2 loss in representation space
            loss = nn.functional.mse_loss(predicted, target_repr)
            total_loss += loss
        return total_loss / len(target_masks)


def sample_target_blocks(num_patches_h, num_patches_w, num_targets=4,
                         scale_range=(0.15, 0.20), aspect_range=(0.75, 1.5)):
    """Sample multi-block target masks for I-JEPA."""
    masks = []
    for _ in range(num_targets):
        # Sample scale and aspect ratio
        scale = torch.empty(1).uniform_(*scale_range).item()
        aspect = torch.empty(1).uniform_(*aspect_range).item()
        # Compute block dimensions (aspect = h / w)
        num_patches = num_patches_h * num_patches_w
        target_area = int(num_patches * scale)
        h = int(round((target_area * aspect) ** 0.5))
        w = int(round((target_area / aspect) ** 0.5))
        h = min(h, num_patches_h)
        w = min(w, num_patches_w)
        # Random position
        top = torch.randint(0, num_patches_h - h + 1, (1,)).item()
        left = torch.randint(0, num_patches_w - w + 1, (1,)).item()
        # Create mask
        mask = torch.zeros(num_patches_h, num_patches_w, dtype=torch.bool)
        mask[top:top+h, left:left+w] = True
        masks.append(mask.flatten())
    return masks
```
GitHub: facebookresearch/ijepa
6. V-JEPA — Video JEPA
V-JEPA extends JEPA to video understanding, learning by predicting missing spatiotemporal regions in representation space. It is pre-trained entirely with unlabeled video data (VideoMix2M dataset) — no labels, no augmentations.
Spatiotemporal Masking
Unlike I-JEPA's spatial-only masking, V-JEPA masks portions of video in both space and time:
- Input video is tokenized into 3D patches (space × space × time)
- Large spatiotemporal blocks are masked (~90% of tokens)
- Context encoder only processes the visible ~10% of tokens (very efficient)
- The predictor must predict representations for the masked 90%
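The steps above can be sketched as tube-style masking: one large spatial block is masked at every time step. The helper below and its exact block-sampling rule are my own simplifications, not the released V-JEPA code:

```python
import torch

def spatiotemporal_mask(t, h, w, mask_ratio=0.9):
    """Mask a contiguous spatial block across all time steps ('tube' masking).

    Returns a (t*h*w,) boolean tensor, True = masked token.
    The block area is chosen so roughly `mask_ratio` of tokens are masked.
    """
    area = int(h * w * mask_ratio)
    bh = min(h, int(round(area ** 0.5)))
    bw = min(w, (area + bh - 1) // bh)  # ceil(area / bh)
    top = torch.randint(0, h - bh + 1, (1,)).item()
    left = torch.randint(0, w - bw + 1, (1,)).item()
    mask = torch.zeros(t, h, w, dtype=torch.bool)
    mask[:, top:top+bh, left:left+bw] = True  # same block at every time step
    return mask.flatten()

m = spatiotemporal_mask(8, 14, 14)
print(m.float().mean())  # close to the 0.9 mask ratio
```

The context encoder then processes only the tokens where the mask is False, which is where the efficiency gain comes from.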
Key Results
- Learns motion and appearance representations from video alone
- Frozen features (no fine-tuning) transfer well to image and video tasks
- Outperforms pixel-reconstruction methods on action recognition
- Pre-trained on 2M video clips without any labels
GitHub: facebookresearch/jepa
7. V-JEPA 2 & World Models
V-JEPA 2 is a significant evolution — a 1.2 billion parameter world model trained primarily on video. It demonstrates that self-supervised video pre-training can produce representations suitable for understanding, prediction, and physical planning.
Architecture
- Backbone: ViT-eH (1.2B parameters) — a scaled-up Vision Transformer
- Pre-training data: 1M+ hours of internet video (unlabeled)
- Training: Self-supervised JEPA objective (masking + latent prediction)
- Post-training: Attentive probing / task-specific adapters (frozen backbone)
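Attentive probing pools frozen patch features with a learned query via cross-attention, then classifies the pooled vector. A minimal sketch; the module name and sizes are assumptions, not the released code:

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Cross-attention pooling over frozen features, then a linear classifier."""
    def __init__(self, dim, num_classes, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats):                   # feats: (B, N, D), frozen backbone output
        q = self.query.expand(feats.size(0), -1, -1)
        pooled, _ = self.attn(q, feats, feats)  # (B, 1, D)
        return self.head(pooled.squeeze(1))     # (B, num_classes)

probe = AttentiveProbe(dim=64, num_classes=10)
logits = probe(torch.randn(2, 5, 64))
print(logits.shape)  # torch.Size([2, 10])
```

Only the probe is trained; the backbone stays frozen, which is what makes this a cheap post-training step.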
V-JEPA 2-AC: Action-Conditioned World Model
V-JEPA 2-AC extends the base model with action conditioning for robotics: the predictor receives robot actions alongside the current latent state, so the model can roll out predicted future states and select actions by planning in representation space.
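The action-conditioned rollout idea can be sketched as follows. All names, shapes, and the goal-scoring rule are illustrative assumptions, not the V-JEPA 2-AC API:

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Predicts the next latent state from the current latent state and an action."""
    def __init__(self, state_dim=64, action_dim=7, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.GELU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):  # state: (B, D), action: (B, A)
        return self.net(torch.cat([state, action], dim=-1))

def score_plan(pred, state, actions, goal):
    """Roll out a candidate action sequence and score closeness to a goal embedding."""
    for a in actions:                           # actions: list of (B, A) tensors
        state = pred(state, a)
    return -((state - goal) ** 2).mean(dim=-1)  # higher = closer to goal

pred = ActionConditionedPredictor()
s, g = torch.randn(1, 64), torch.randn(1, 64)
plan = [torch.randn(1, 7) for _ in range(3)]
print(score_plan(pred, s, plan, g).shape)  # torch.Size([1])
```

Planning then amounts to sampling many candidate action sequences and executing the best-scoring one (e.g., with the cross-entropy method).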
Performance Highlights
| Benchmark | V-JEPA 2 Score | Notes |
|---|---|---|
| Something-Something v2 | 77.3 top-1 | Strong motion understanding |
| Epic-Kitchens-100 (anticipation) | 39.7 recall@5 | State-of-the-art |
| Robot manipulation (zero-shot) | Pick-and-place success | Trained with <62h unlabeled robot video |
GitHub: facebookresearch/vjepa2 | Paper: arXiv:2506.09985
8. VL-JEPA — Vision-Language JEPA
VL-JEPA applies JEPA principles to vision-language modeling. Instead of autoregressively generating text tokens (like standard VLMs), VL-JEPA predicts continuous embeddings of target text — a fundamentally different approach to multimodal AI.
Architecture
- X-Encoder: V-JEPA 2 backbone — outputs visual token representations
- Predictor: Initialized from Llama 3 Transformer layers — predicts text embeddings from visual + text context
- Loss: L2 loss in continuous embedding space (not cross-entropy over vocabulary)
Key Advantages
- 50% fewer trainable parameters compared to standard VLM training
- Comparable performance to InstructBLIP and QwenVL on VQA benchmarks
- Only 1.6B total parameters — significantly smaller than typical VLMs
- Inherits V-JEPA 2's strong video understanding for video QA
Paper: arXiv:2512.10942
9. LLM-JEPA — JEPA for Large Language Models
LLM-JEPA adapts the JEPA framework for large language models, applicable to both pre-training and fine-tuning. Instead of predicting the next token, LLM-JEPA predicts representations of masked text spans in embedding space.
How It Works
- Text is split into context and target spans
- Context encoder processes visible tokens
- Predictor predicts latent representations of masked spans
- Target encoder (EMA) provides ground-truth embeddings
- L2 loss in representation space — no vocabulary softmax needed
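The steps above can be sketched with toy encoders. This is illustrative only; the real model uses Transformer encoders and an EMA target encoder:

```python
import torch
import torch.nn as nn

vocab, dim = 100, 32
context_enc = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, dim))
target_enc = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, dim))  # EMA copy in practice
predictor = nn.Linear(dim, dim)

tokens = torch.randint(0, vocab, (2, 16))
mask = torch.zeros(2, 16, dtype=torch.bool)
mask[:, 8:12] = True                            # hide one span per sequence

ctx = context_enc(tokens.masked_fill(mask, 0))  # 0 = stand-in mask token id
with torch.no_grad():
    tgt = target_enc(tokens)                    # ground-truth span embeddings

# L2 loss on the masked positions only; no vocabulary softmax
loss = ((predictor(ctx)[mask] - tgt[mask]) ** 2).mean()
print(loss.item())
```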
Results Across Model Families
| Model Family | Task | Result |
|---|---|---|
| Llama 3 | NL-RX | LLM-JEPA outperforms standard training |
| OpenELM | GSM8K | LLM-JEPA outperforms standard training |
| Gemma 2 | Spider | LLM-JEPA outperforms standard training |
| OLMo | RottenTomatoes | LLM-JEPA outperforms standard training |
Paper: arXiv:2509.14252
10. D-JEPA — Denoising JEPA
D-JEPA pioneers the integration of JEPA with generative modeling. It reinterprets JEPA's masked prediction as a generalized next-token prediction strategy and incorporates diffusion loss to model per-token probability distributions, enabling data generation in continuous space.
Key Innovation
While standard JEPA predicts a single point estimate for each masked token's representation, D-JEPA models the full probability distribution over possible representations using a diffusion process. This allows:
- High-fidelity image generation (not just representation learning)
- Conditional generation on ImageNet classes
- Bridging the gap between discriminative JEPA and generative modeling
Results
D-JEPA's base, large, and huge variants outperform previous generative models of comparable scale on ImageNet conditional-generation benchmarks.
11. Domain-Specific JEPA Variants
The JEPA framework has been adapted across numerous modalities and domains:
A-JEPA — Audio JEPA
Extends JEPA to audio spectrograms. Audio is converted to a spectrogram, tokenized into patches, and masked for prediction. Uses a curriculum masking strategy that starts with easy predictions and progressively increases difficulty. Random masking is preferred over block masking for audio due to the different spatial structure of spectrograms.
Paper: arXiv:2311.15830
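The curriculum idea can be sketched as a simple masking-ratio schedule. The linear ramp and the start/end ratios below are my assumptions, not values from the paper:

```python
def curriculum_mask_ratio(epoch, total_epochs, start=0.3, end=0.8):
    """Linearly ramp the masking ratio from easy (few masked patches) to hard."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return start + (end - start) * frac

ratios = [curriculum_mask_ratio(e, 100) for e in (0, 50, 99)]
print(ratios)  # ramps from the start ratio toward the end ratio
```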
MC-JEPA — Motion and Content JEPA
Jointly learns optical flow estimation and content features within a shared encoder. Achieves endpoint error (EPE) of ~2.81 on Sintel Clean and ~3.51 on Sintel Final — comparable to specialized unsupervised optical flow estimators.
T-JEPA — Trajectory JEPA
Samples and predicts trajectory segments in representation space. Robust to noise, irregular sampling, and spatial distortion. Bypasses the need for handcrafted augmentations that are difficult to design for trajectory data.
TS-JEPA — Time Series JEPA
Divides time series into patch tokens and uses high masking ratios (>70%). Outperforms masked input-reconstruction and contrastive baselines on both classification and long-term forecasting tasks.
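Patch tokenization with a high masking ratio, as described above, can be sketched as follows (sizes are illustrative):

```python
import torch

def patchify_series(x, patch_len=16):
    """Split a (B, T) series into non-overlapping (B, T // patch_len, patch_len) patches."""
    b, t = x.shape
    n = t // patch_len
    return x[:, :n * patch_len].reshape(b, n, patch_len)

def random_mask(num_patches, ratio=0.75):
    """Random boolean mask with a high masking ratio (>70%), per the description above."""
    return torch.rand(num_patches) < ratio

x = torch.randn(4, 256)
patches = patchify_series(x)
print(patches.shape)  # torch.Size([4, 16, 16])
```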
Graph-JEPA
Applies JEPA to graph-structured data. Much more memory-efficient than contrastive graph learning since it requires no augmentations or negative samples.
TI-JEPA / JEPA-T — Text-Image JEPA
Cross-modal variants with cross-attention modules for multimodal fusion. Competitive on multimodal sentiment analysis, open-vocabulary image generation, and multimodal retrieval.
Drive-JEPA
Applies Video JEPA to end-to-end autonomous driving with multimodal trajectory distillation. Combines JEPA's self-supervised video understanding with driving-specific objectives.
12. EB-JEPA — Energy-Based JEPA Library
EB-JEPA is Meta's lightweight, open-source library for exploring energy-based joint-embedding predictive architectures. Designed for accessibility — every example trains on a single GPU in a few hours.
What's Included
- Image representation learning: I-JEPA style training on images
- Video representation learning: V-JEPA style training on video
- Action-conditioned video: World model training with action inputs
- Planning: Goal-conditioned planning using JEPA-based world models
Collapse Prevention: VICReg
EB-JEPA uses VICReg (Variance-Invariance-Covariance Regularization) to prevent representation collapse:
```python
# VICReg loss components
import torch
import torch.nn.functional as F

def off_diagonal(x):
    """Return a flattened view of the off-diagonal elements of a square matrix."""
    n, m = x.shape
    assert n == m
    return x.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()

def vicreg_loss(z1, z2, lam=25.0, mu=25.0, nu=1.0):
    """
    z1, z2: (B, D) representations to align
    """
    # 1. Invariance: match representations
    inv_loss = F.mse_loss(z1, z2)
    # 2. Variance: prevent collapse to a constant representation
    std_z1 = torch.sqrt(z1.var(dim=0) + 1e-4)
    std_z2 = torch.sqrt(z2.var(dim=0) + 1e-4)
    var_loss = (torch.mean(F.relu(1 - std_z1)) +
                torch.mean(F.relu(1 - std_z2)))
    # 3. Covariance: decorrelate embedding dimensions
    z1_centered = z1 - z1.mean(dim=0)
    z2_centered = z2 - z2.mean(dim=0)
    cov_z1 = (z1_centered.T @ z1_centered) / (z1.shape[0] - 1)
    cov_z2 = (z2_centered.T @ z2_centered) / (z2.shape[0] - 1)
    # Off-diagonal elements should be pushed toward zero
    cov_loss = (off_diagonal(cov_z1).pow(2).sum() / z1.shape[1] +
                off_diagonal(cov_z2).pow(2).sum() / z2.shape[1])
    return lam * inv_loss + mu * var_loss + nu * cov_loss

# Example: align two batches of embeddings
# loss = vicreg_loss(torch.randn(8, 16), torch.randn(8, 16))
```
GitHub: facebookresearch/eb_jepa | Paper: arXiv:2602.03604
13. H-JEPA — Hierarchical JEPA
Hierarchical JEPA (H-JEPA) is the theoretical capstone of LeCun's vision for autonomous machine intelligence. It proposes a multi-level hierarchy of JEPA modules operating at different timescales and abstraction levels — analogous to how human cognition operates from reflexive motor control to abstract planning.
Hierarchical World Model
Key Properties of H-JEPA
- Multi-level abstraction: Higher levels represent more abstract, longer-term predictions
- Top-down influence: Higher-level goals constrain lower-level predictions and actions
- Variable time scales: Each level operates at a different temporal granularity
- Learned, not handcrafted: The hierarchy and abstraction levels emerge from self-supervised training
Connection to LeCun's Architecture for Intelligence
H-JEPA is one module in LeCun's proposed cognitive architecture, alongside:
| Module | Role |
|---|---|
| Configurator | Sets parameters and attention for other modules based on current task |
| Perception | Estimates world state from sensory input |
| World Model (H-JEPA) | Predicts future world states; the core of the system |
| Cost Module | Computes energy/cost of predicted states (drives behavior) |
| Short-term Memory | Stores recent world states and predictions |
| Actor | Computes action sequences to minimize predicted cost |
14. Timeline & Evolution
15. Code & Implementations
Official Meta Repositories
| Repository | Model | Framework | License |
|---|---|---|---|
| facebookresearch/ijepa | I-JEPA | PyTorch | CC-BY-NC 4.0 |
| facebookresearch/jepa | V-JEPA | PyTorch | CC-BY-NC 4.0 |
| facebookresearch/vjepa2 | V-JEPA 2 + AC | PyTorch | Apache 2.0 (commercial) |
| facebookresearch/eb_jepa | EB-JEPA | PyTorch | MIT |
Quick Start with EB-JEPA
```shell
# Clone and setup
git clone https://github.com/facebookresearch/eb_jepa.git
cd eb_jepa
pip install -e .

# Train I-JEPA on CIFAR-10 (single GPU, ~1 hour)
python examples/image_ijepa/train.py \
    --dataset cifar10 \
    --batch-size 256 \
    --epochs 100 \
    --encoder vit_small \
    --pred-depth 4

# Train V-JEPA on video (single GPU, a few hours)
python examples/video_vjepa/train.py \
    --dataset kinetics400_subset \
    --batch-size 32 \
    --epochs 50
```
Using V-JEPA 2 Pre-trained Features
```python
import torch
from vjepa2 import load_model

# Load pre-trained V-JEPA 2 encoder
model = load_model("vjepa2-viteh", pretrained=True)
model.eval()

# Extract features from video
# video: (B, T, C, H, W) tensor
video = torch.randn(1, 16, 3, 224, 224)
with torch.no_grad():
    features = model.encode(video)
# features: (B, N, D) where N = num_patches, D = embed_dim

# Global average pooling for classification
global_features = features.mean(dim=1)  # (B, D)

# Per-patch features for dense prediction
spatial_features = features  # (B, N, D)
```
Community Implementations
| Package | Source | Description |
|---|---|---|
| jepa (PyPI) | Community | General JEPA framework built with PyTorch + Transformers |
| HuggingFace Course | HuggingFace | I-JEPA tutorial in the Computer Vision Course |
16. Resources & Links
Key Papers
| Paper | Year | Link |
|---|---|---|
| A Path Towards Autonomous Machine Intelligence (LeCun) | 2022 | |
| I-JEPA: Self-Supervised Learning from Images | 2023 | arXiv |
| V-JEPA: Video Joint-Embedding Predictive Architecture | 2024 | Meta Blog |
| V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | 2025 | arXiv |
| VL-JEPA: Vision-Language JEPA | 2025 | arXiv |
| LLM-JEPA: LLMs Meet JEPA | 2025 | arXiv |
| A-JEPA: JEPA Can Listen | 2023 | arXiv |
| EB-JEPA: Lightweight Library for Energy-Based JEPA | 2026 | arXiv |
Meta AI Blog Posts
- I-JEPA: The first AI model based on Yann LeCun's vision for more human-like AI
- V-JEPA: The next step toward advanced machine intelligence
- Introducing V-JEPA 2 world model and new benchmarks