JEPA — Joint Embedding Predictive Architecture

From Yann LeCun's vision for autonomous machine intelligence to practical implementations

1. Overview & Motivation

Joint Embedding Predictive Architecture (JEPA) is a family of self-supervised learning architectures proposed by Yann LeCun (Meta Chief AI Scientist) in his 2022 position paper "A Path Towards Autonomous Machine Intelligence." JEPA represents a fundamental rethinking of how AI systems should learn to understand the world.

Core Insight: Instead of predicting raw pixels, tokens, or waveforms, JEPA predicts abstract representations of missing or future content. This allows the model to focus on high-level semantic information and ignore unpredictable low-level details — much like how humans reason about the world.

Why JEPA?

Traditional self-supervised approaches have fundamental limitations:

  • Generative models (e.g., MAE, GPT) try to reconstruct every pixel/token — wasting capacity on irrelevant details like carpet textures or leaf positions that are inherently unpredictable.
  • Contrastive methods (e.g., SimCLR, DINO) rely on hand-crafted data augmentations (crops, color jitter, flips), creating inductive biases that don't generalize across domains.
  • JEPA sidesteps both problems by operating entirely in a learned latent space, making predictions about what matters rather than what every detail looks like.

Key Properties

  • Non-generative: Does not reconstruct input pixels/tokens — no decoder required
  • Augmentation-free: No hand-crafted data transforms needed (unlike contrastive methods)
  • Latent prediction: All predictions happen in abstract representation space
  • Compute-efficient: Often 2–5× fewer GPU hours than comparable methods
  • Semantically rich: Captures high-level meaning without being distracted by low-level detail

2. Theoretical Foundations

Energy-Based Model Perspective

JEPA can be understood through the lens of Energy-Based Models (EBMs). The system learns an energy function E(x, y) that assigns low energy to compatible (x, y) pairs and high energy to incompatible ones.

E(x, y) = ‖ s_θ(y) − predictor_φ( enc_θ(x), z ) ‖²

Where x is the context input, y is the target, enc_θ is the context encoder, s_θ is the target encoder (an EMA copy of enc_θ), and z is a latent variable capturing the information needed to predict y from x.
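
As a toy illustration of this energy, the sketch below computes ‖ s(y) − predictor(enc(x), z) ‖² with random linear maps standing in for the encoders and predictor (all module names here are placeholders, not the real networks):

```python
import torch

# Toy energy E(x, y) = || s(y) - predictor(enc(x), z) ||^2.
# enc, s, and predictor are random linear stand-ins, not the real networks.
torch.manual_seed(0)
enc = torch.nn.Linear(8, 4)             # context encoder enc_θ (stand-in)
s = torch.nn.Linear(8, 4)               # target encoder s_θ (stand-in)
predictor = torch.nn.Linear(4 + 2, 4)   # takes enc(x) concatenated with z

def energy(x, y, z):
    """One non-negative scalar energy per (x, y) pair."""
    pred = predictor(torch.cat([enc(x), z], dim=-1))
    return ((s(y) - pred) ** 2).sum(dim=-1)

x, y = torch.randn(3, 8), torch.randn(3, 8)
z = torch.zeros(3, 2)                   # latent variable (here: fixed)
E = energy(x, y, z)
print(tuple(E.shape))                   # (3,)
```

Compatible (x, y) pairs would receive low energy after training; here the energy is just evaluated, never minimized.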

The Collapse Problem

A major challenge in joint-embedding architectures is representation collapse — the model learns to map everything to the same constant representation, trivially minimizing the prediction error. JEPA addresses this through several mechanisms:

Strategy | How It Works | Used In
EMA (Exponential Moving Average) | Target encoder is a slow-moving average of the context encoder; the asymmetric update prevents collapse | I-JEPA, V-JEPA, V-JEPA 2
VICReg | Variance-Invariance-Covariance regularization: enforces variance in representations, decorrelates dimensions, and matches embeddings | EB-JEPA
Predictor architecture | A narrow predictor network (bottleneck) prevents information from being passed through trivially | All JEPA variants
Stop gradient | Gradients are stopped on the target encoder branch so both encoders cannot converge to the same trivial solution | All JEPA variants

Information-Theoretic View

JEPA's objective can be seen as maximizing mutual information I(Z_x; Z_y) between the representations of context and target, while minimizing the entropy of the representation (discarding unpredictable information). The model learns representations that are:

  • Maximally informative: Captures essential features for prediction
  • Maximally predictable: Discards stochastic, unpredictable detail
  • Semantically grounded: Focuses on structure and meaning

3. Core Architecture

The JEPA framework has three fundamental components that appear in all variants:

JEPA Core Architecture:

  Input x ──► Context Encoder (θ) ──► Predictor (φ), fed positional tokens / condition ──► Predicted repr ŷ
  Input y ──► Target Encoder (EMA of θ) ──[stop grad]──► Target representation
  Loss: ‖ ŷ − target representation ‖²

Context Encoder

The context encoder processes the visible/available portion of the input and produces a set of representation tokens; in vision tasks it is typically a Vision Transformer (ViT). It only sees the unmasked context, making it computationally efficient.

Target Encoder

Architecturally identical to the context encoder, but its parameters are updated as an Exponential Moving Average (EMA) of the context encoder weights:

θ_target ← τ · θ_target + (1 − τ) · θ_context ,  τ ∈ [0.996, 0.9999]

The target encoder processes the full or masked target regions. A stop-gradient is applied so no gradients flow back through the target encoder during training.

Predictor

A narrow Transformer (fewer layers, smaller hidden dimension) that takes context encoder outputs plus positional tokens for the target locations, and predicts the target representations. The bottleneck design prevents trivial information copying.

Training Objective

L = (1/N) Σ_i ‖ predictor( enc_θ(x), pos_i ) − sg( target_enc(y_i) ) ‖²

Where sg denotes stop-gradient. The loss is the mean squared error between predicted and actual target representations, averaged over all target positions.
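
A minimal sketch of this objective, with toy linear layers standing in for the encoders and predictor (all modules here are placeholders, and the predictor is applied to the whole context rather than per-position):

```python
import torch
import torch.nn.functional as F

# Toy JEPA training objective: MSE between predicted and stop-gradient
# target representations, averaged over N target regions.
torch.manual_seed(0)
D = 16
context_enc = torch.nn.Linear(D, D)   # stand-in for enc_θ
target_enc = torch.nn.Linear(D, D)    # stand-in for the EMA target encoder
predictor = torch.nn.Linear(D, D)     # a narrow bottleneck in real JEPA

x = torch.randn(4, D)                             # context input
targets = [torch.randn(4, D) for _ in range(3)]   # N = 3 target regions

ctx = context_enc(x)
loss = 0.0
for y_i in targets:
    with torch.no_grad():             # sg(): stop-gradient on the target branch
        t = target_enc(y_i)
    loss = loss + F.mse_loss(predictor(ctx), t)
loss = loss / len(targets)
loss.backward()
print(loss.item() >= 0)
```

After `backward()`, gradients reach the context encoder and predictor but not the target encoder, mirroring the stop-gradient in the formula.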

4. JEPA vs Other Paradigms

Property | JEPA | Contrastive (SimCLR, DINO) | Generative (MAE, BEiT) | Autoregressive (GPT)
Prediction space | Latent representations | Latent similarity | Pixel / token space | Token space
Augmentations needed | None | Heavy (crops, jitter, blur) | Masking only | None (causal mask)
Decoder | None (narrow predictor) | None (projection head) | Full decoder required | None (shared weights)
Handles uncertainty | Yes (ignores unpredictable details) | Partially (via invariance) | No (must reconstruct all details) | No (must predict exact tokens)
Compute efficiency | High (2–5× fewer GPU hours) | Medium (augmentation overhead) | Low (decoder + reconstruction) | Low (sequential generation)
Collapse prevention | EMA + stop-gradient + narrow predictor | Negative samples / momentum | N/A (reconstructive loss) | N/A (next-token loss)
Semantic quality | Very high (abstract features) | High (invariant features) | Mixed (pixel-level + some semantic) | High for language
Benchmark Highlight: I-JEPA ViT-H/14 achieves 73.3% top-1 accuracy on ImageNet linear probing in ~2,500 GPU-hours. MAE with the same encoder reaches 71.5% but requires over 10,000 GPU-hours — a 4× compute advantage for JEPA.

5. I-JEPA — Image-Based JEPA

CVPR 2023 · Open Source · Meta AI · ViT Backbone

I-JEPA is the first practical implementation of JEPA, applied to self-supervised learning from images. Published at CVPR 2023, it learns semantic visual representations without relying on hand-crafted data augmentations.

Architecture Details

I-JEPA training (schematic):

  Original image → sample 4 target blocks (each ~15–20% of image area); the remaining visible patches form the context
  Context patches ──► Context Encoder (ViT) ──► context tokens ──► Predictor (narrow ViT) ──► predicted representations
  Full image ──► Target Encoder (EMA ViT) ──[stop gradient]──► target representations
  Loss: ‖ predicted − target ‖²

Multi-Block Masking Strategy

The masking strategy is critical to I-JEPA's success. It forces the model to learn semantic representations rather than local textures:

  • Target blocks: 4 large blocks sampled with scale 15–20% of image area and aspect ratios from 0.75–1.5
  • Context block: A single large block covering ~85–100% of the image, with target regions removed
  • Key design: Masking is applied to the output of the target encoder, not the input; the target encoder always sees the full image

Why Large Target Blocks? Small masked patches (like MAE's 75% random masking) encourage pixel-level interpolation. Large, semantic-scale target blocks force the model to reason about object-level structure, yielding representations that transfer better to downstream tasks.

Model Configurations

Model | Encoder | Params | ImageNet Linear | ImageNet 1% Semi-sup
I-JEPA | ViT-L/16 | 307M | 74.4% | 70.2%
I-JEPA | ViT-H/14 | 632M | 76.5% | 72.4%
I-JEPA | ViT-H/16 @ 448 | 632M | 77.3% | 73.3%

Code Example: I-JEPA Conceptual Implementation

import torch
import torch.nn as nn

# NOTE: VisionTransformer and PredictorTransformer are assumed to be defined
# elsewhere (e.g. a timm-style ViT encoder and a narrow transformer predictor);
# this is a conceptual sketch rather than a drop-in implementation.

class IJEPA(nn.Module):
    """Simplified I-JEPA architecture."""

    def __init__(self, img_size=224, patch_size=16, embed_dim=1024,
                 enc_depth=24, pred_depth=6, pred_dim=384, ema_decay=0.996):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2

        # Context encoder (ViT)
        self.context_encoder = VisionTransformer(
            img_size=img_size, patch_size=patch_size,
            embed_dim=embed_dim, depth=enc_depth, num_heads=16
        )

        # Target encoder (EMA copy — no gradient)
        self.target_encoder = VisionTransformer(
            img_size=img_size, patch_size=patch_size,
            embed_dim=embed_dim, depth=enc_depth, num_heads=16
        )
        # Initialize target encoder from context encoder
        self.target_encoder.load_state_dict(self.context_encoder.state_dict())
        for p in self.target_encoder.parameters():
            p.requires_grad = False

        # Predictor (narrow ViT)
        self.predictor = PredictorTransformer(
            embed_dim=embed_dim, pred_dim=pred_dim,
            depth=pred_depth, num_heads=12
        )

        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_target_encoder(self):
        """EMA update: θ_target = τ * θ_target + (1 - τ) * θ_context"""
        for param_t, param_c in zip(
            self.target_encoder.parameters(),
            self.context_encoder.parameters()
        ):
            param_t.data.mul_(self.ema_decay).add_(
                param_c.data, alpha=1.0 - self.ema_decay
            )

    def forward(self, images, context_mask, target_masks):
        """
        Args:
            images: (B, 3, H, W) input images
            context_mask: (B, N) boolean mask for context patches
            target_masks: list of (B, N) boolean masks for target blocks
        """
        # Context encoder: encode visible patches only
        context_tokens = self.context_encoder(images, mask=context_mask)

        # Target encoder: encode full image, then extract target tokens
        with torch.no_grad():
            target_tokens_full = self.target_encoder(images)

        # Predict target representations from context
        total_loss = 0.0
        for target_mask in target_masks:
            # Extract target representations (stop-gradient)
            target_repr = target_tokens_full[target_mask].detach()

            # Predict using context + positional info for target locations
            predicted = self.predictor(context_tokens, target_positions=target_mask)

            # L2 loss in representation space
            loss = nn.functional.mse_loss(predicted, target_repr)
            total_loss += loss

        return total_loss / len(target_masks)


def sample_target_blocks(num_patches_h, num_patches_w, num_targets=4,
                          scale_range=(0.15, 0.20), aspect_range=(0.75, 1.5)):
    """Sample multi-block target masks for I-JEPA."""
    masks = []
    for _ in range(num_targets):
        # Sample scale and aspect ratio
        scale = torch.empty(1).uniform_(*scale_range).item()
        aspect = torch.empty(1).uniform_(*aspect_range).item()

        # Compute block dimensions
        num_patches = num_patches_h * num_patches_w
        target_area = int(num_patches * scale)
        h = int(round((target_area * aspect) ** 0.5))
        w = int(round((target_area / aspect) ** 0.5))
        h = min(h, num_patches_h)
        w = min(w, num_patches_w)

        # Random position
        top = torch.randint(0, num_patches_h - h + 1, (1,)).item()
        left = torch.randint(0, num_patches_w - w + 1, (1,)).item()

        # Create mask
        mask = torch.zeros(num_patches_h, num_patches_w, dtype=torch.bool)
        mask[top:top+h, left:left+w] = True
        masks.append(mask.flatten())

    return masks

GitHub: facebookresearch/ijepa

6. V-JEPA — Video JEPA

Feb 2024 · Open Source · Meta AI · Video SSL

V-JEPA extends JEPA to video understanding, learning by predicting missing spatiotemporal regions in representation space. It is pre-trained entirely with unlabeled video data (VideoMix2M dataset) — no labels, no augmentations.

Spatiotemporal Masking

Unlike I-JEPA's spatial-only masking, V-JEPA masks portions of video in both space and time:

  • Input video is tokenized into 3D patches (space × space × time)
  • Large spatiotemporal blocks are masked (~90% of tokens)
  • Context encoder only processes the visible ~10% of tokens (very efficient)
  • The predictor must predict representations for the masked 90%
Spatiotemporal mask (schematic): across the 8-frame token grid, large space-time blocks are masked so that only ~10% of tokens remain visible (context) and ~90% are masked (targets), making training very efficient.
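
The masking scheme above can be sketched as follows; the block shape and the sampling loop are assumptions for illustration, and only the ~90% masked ratio comes from the text:

```python
import torch

# Sketch of V-JEPA-style spatiotemporal masking on a (T, H, W) token grid:
# keep dropping large space-time blocks until ~90% of tokens are masked.
def sample_tube_mask(t=8, h=14, w=14, target_ratio=0.9, block=(8, 6, 6)):
    masked = torch.zeros(t, h, w, dtype=torch.bool)
    bt, bh, bw = block                      # block spans the full time axis
    while masked.float().mean() < target_ratio:
        t0 = torch.randint(0, t - bt + 1, (1,)).item()
        y0 = torch.randint(0, h - bh + 1, (1,)).item()
        x0 = torch.randint(0, w - bw + 1, (1,)).item()
        masked[t0:t0 + bt, y0:y0 + bh, x0:x0 + bw] = True
    return masked  # True = masked (target), False = visible (context)

mask = sample_tube_mask()
visible = (~mask).float().mean().item()
print(f"visible fraction: {visible:.2f}")
```

Only the visible ~10% of tokens would be fed to the context encoder, which is where the efficiency gain comes from.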

Key Results

  • Learns motion and appearance representations from video alone
  • Frozen features (no fine-tuning) transfer well to image and video tasks
  • Outperforms pixel-reconstruction methods on action recognition
  • Pre-trained on 2M video clips without any labels

GitHub: facebookresearch/jepa

7. V-JEPA 2 & World Models

June 2025 · Open Source (Commercial) · Meta AI · World Model · 1.2B Params

V-JEPA 2 is a significant evolution — a 1.2 billion parameter world model trained primarily on video. It demonstrates that self-supervised video pre-training can produce representations suitable for understanding, prediction, and physical planning.

Architecture

  • Backbone: ViT-eH (1.2B parameters) — a scaled-up Vision Transformer
  • Pre-training data: 1M+ hours of internet video (unlabeled)
  • Training: Self-supervised JEPA objective (masking + latent prediction)
  • Post-training: Attentive probing / task-specific adapters (frozen backbone)

V-JEPA 2-AC: Action-Conditioned World Model

V-JEPA 2-AC extends the base model with action conditioning for robotics:

V-JEPA 2-AC Pipeline:

  Visual observation ──► V-JEPA 2 encoder (1.2B, frozen)
  Robot action ──► Action encoder (learned)
  Both ──► Action-conditioned predictor (300M params, block-causal attention)
       ──► Predicted future state (in representation space)
       ──► Planning module (goal-conditioned action search)

Performance Highlights

Benchmark | V-JEPA 2 Score | Notes
Something-Something v2 | 77.3 top-1 | Strong motion understanding
Epic-Kitchens-100 (anticipation) | 39.7 recall@5 | State-of-the-art
Robot manipulation (zero-shot) | Pick-and-place success | Trained with <62h unlabeled robot video
Zero-Shot Robot Planning: V-JEPA 2-AC can be post-trained on fewer than 62 hours of unlabeled robot videos, then deployed zero-shot on Franka robot arms for pick-and-place tasks using goal-conditioned planning in latent space.
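
Goal-conditioned planning in latent space can be illustrated with a toy random-shooting planner; the linear dynamics model (`predict_next`), dimensions, and horizon are stand-ins, not V-JEPA 2-AC's actual 300M predictor:

```python
import torch

# Toy latent-space planner: roll candidate action sequences through a
# stand-in dynamics model and pick the sequence whose predicted final
# state lands closest to the goal embedding.
torch.manual_seed(0)
D, A, H = 8, 2, 5                       # state dim, action dim, horizon
W_s = torch.randn(D, D) * 0.1
W_a = torch.randn(A, D) * 0.5

def predict_next(state, action):
    """Hypothetical linear dynamics in representation space."""
    return state + action @ W_a + state @ W_s

def plan(state, goal, num_candidates=256):
    """Random-shooting planner: minimize ||final state - goal||^2."""
    actions = torch.randn(num_candidates, H, A)
    s = state.expand(num_candidates, D)
    for t in range(H):
        s = predict_next(s, actions[:, t])
    costs = ((s - goal) ** 2).sum(dim=-1)
    best = costs.argmin()
    return actions[best], costs[best].item()

s0, goal = torch.zeros(D), torch.ones(D)
best_actions, cost = plan(s0, goal)
print(best_actions.shape)
```

Real systems typically refine this with iterative methods such as the cross-entropy method, but the principle (search over actions, score in representation space) is the same.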

GitHub: facebookresearch/vjepa2  |  Paper: arXiv:2506.09985

8. VL-JEPA — Vision-Language JEPA

Dec 2025 · Meta AI · Multimodal · 1.6B Params

VL-JEPA applies JEPA principles to vision-language modeling. Instead of autoregressively generating text tokens (like standard VLMs), VL-JEPA predicts continuous embeddings of target text — a fundamentally different approach to multimodal AI.

Architecture

  • X-Encoder: V-JEPA 2 backbone — outputs visual token representations
  • Predictor: Initialized from Llama 3 Transformer layers — predicts text embeddings from visual + text context
  • Loss: L2 loss in continuous embedding space (not cross-entropy over vocabulary)
VL-JEPA architecture:

  Image/video ──► V-JEPA 2 encoder ──► visual tokens
  Text (partial) ──► text encoder ──► text tokens
  Visual + text tokens ──► Predictor (Llama 3 init) ──► predicted text embeddings
  Loss: L2 vs target text embeddings (not cross-entropy over the vocabulary)
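
The difference between the two training objectives can be seen in a few lines; all tensors and dimensions below are toy stand-ins:

```python
import torch
import torch.nn.functional as F

# Standard VLM objective vs VL-JEPA-style objective, side by side.
torch.manual_seed(0)
B, V, D = 4, 1000, 64                 # batch, vocab size, embedding dim

# Standard VLM: predict a categorical distribution over V vocabulary tokens
logits = torch.randn(B, V, requires_grad=True)
token_ids = torch.randint(0, V, (B,))
ce_loss = F.cross_entropy(logits, token_ids)

# VL-JEPA: regress the target text *embedding* directly in continuous space
predicted_emb = torch.randn(B, D, requires_grad=True)
target_emb = torch.randn(B, D)        # from a frozen text encoder (assumed)
l2_loss = F.mse_loss(predicted_emb, target_emb)

print(ce_loss.item() > 0, l2_loss.item() > 0)
```

The embedding-space loss removes the vocabulary softmax entirely, which is one reason the approach needs fewer trainable parameters.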

Key Advantages

  • 50% fewer trainable parameters compared to standard VLM training
  • Comparable performance to InstructBLIP and QwenVL on VQA benchmarks
  • Only 1.6B total parameters — significantly smaller than typical VLMs
  • Inherits V-JEPA 2's strong video understanding for video QA

Paper: arXiv:2512.10942

9. LLM-JEPA — JEPA for Large Language Models

Sep 2025 · Language · Pre-training & Fine-tuning

LLM-JEPA adapts the JEPA framework for large language models, applicable to both pre-training and fine-tuning. Instead of predicting the next token, LLM-JEPA predicts representations of masked text spans in embedding space.

How It Works

  • Text is split into context and target spans
  • Context encoder processes visible tokens
  • Predictor predicts latent representations of masked spans
  • Target encoder (EMA) provides ground-truth embeddings
  • L2 loss in representation space — no vocabulary softmax needed
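
The steps above can be sketched with toy embedding tables standing in for real LLM encoders (all module names and the pooling choice are stand-ins):

```python
import torch
import torch.nn.functional as F

# Sketch of LLM-JEPA-style span prediction: split a token sequence into
# context and a masked target span, then regress the span's latent
# representation. Encoders are toy embeddings, not a real LLM.
torch.manual_seed(0)
T, D = 12, 32
tokens = torch.randint(0, 100, (T,))
span = slice(5, 9)                         # masked target span

embed = torch.nn.Embedding(100, D)         # context encoder (stand-in)
target_embed = torch.nn.Embedding(100, D)  # EMA target encoder (stand-in)
predictor = torch.nn.Linear(D, D)

context_ids = torch.cat([tokens[:span.start], tokens[span.stop:]])
ctx = embed(context_ids).mean(dim=0)       # pooled context representation

with torch.no_grad():                      # stop-gradient on the target branch
    target = target_embed(tokens[span]).mean(dim=0)

loss = F.mse_loss(predictor(ctx), target)  # L2 in representation space
loss.backward()
print(loss.item() >= 0)
```

As in the vision variants, gradients flow only through the context branch; the target branch is frozen within each step and updated via EMA.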

Results Across Model Families

Model Family | Task | Standard Training | LLM-JEPA | Improvement
Llama 3 | NL-RX | Baseline | +significant | Outperforms
OpenELM | GSM8K | Baseline | +significant | Outperforms
Gemma 2 | Spider | Baseline | +significant | Outperforms
OLMo | RottenTomatoes | Baseline | +significant | Outperforms
Robustness to Overfitting: A notable advantage of LLM-JEPA is improved robustness to overfitting during fine-tuning, making it particularly valuable for domain adaptation with limited data.

Paper: arXiv:2509.14252

10. D-JEPA — Denoising JEPA

ICLR 2025 · Generative · Diffusion · Open Source

D-JEPA pioneers the integration of JEPA with generative modeling. It reinterprets JEPA's masked prediction as a generalized next-token prediction strategy and incorporates diffusion loss to model per-token probability distributions, enabling data generation in continuous space.

Key Innovation

While standard JEPA predicts a single point estimate for each masked token's representation, D-JEPA models the full probability distribution over possible representations using a diffusion process. This allows:

  • High-fidelity image generation (not just representation learning)
  • Conditional generation on ImageNet classes
  • Bridging the gap between discriminative JEPA and generative modeling
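
A toy version of a per-token denoising loss in representation space; the linear-interpolation noise schedule and the stand-in MLP are illustrative assumptions, not D-JEPA's actual diffusion formulation:

```python
import torch
import torch.nn.functional as F

# Instead of regressing one point estimate per masked token, train a
# denoiser to recover the noise added to a target token representation,
# so the model captures a distribution over plausible representations.
torch.manual_seed(0)
N, D = 6, 32                                # tokens, representation dim
target_repr = torch.randn(N, D)             # from the target encoder (assumed)

denoiser = torch.nn.Sequential(
    torch.nn.Linear(D + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, D)
)

t = torch.rand(N, 1)                        # diffusion time in [0, 1)
noise = torch.randn(N, D)
noisy = (1 - t) * target_repr + t * noise   # simple interpolation path

pred_noise = denoiser(torch.cat([noisy, t], dim=-1))
diffusion_loss = F.mse_loss(pred_noise, noise)   # per-token denoising loss
diffusion_loss.backward()
print(diffusion_loss.item() > 0)
```

Sampling from the learned per-token distribution is what turns the discriminative JEPA objective into a generative one.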

Results

D-JEPA base, large, and huge model variants outperform all previous generative models at comparable scales on ImageNet conditional generation benchmarks.

Scale | D-JEPA Performance | vs Prior SOTA
Base | State-of-the-art | Outperforms all previous generative models
Large | State-of-the-art | Outperforms all previous generative models
Huge | State-of-the-art | Outperforms all previous generative models

11. Domain-Specific JEPA Variants

The JEPA framework has been adapted across numerous modalities and domains:

A-JEPA — Audio JEPA

Audio · Nov 2023

Extends JEPA to audio spectrograms. Audio is converted to a spectrogram, tokenized into patches, and masked for prediction. Uses a curriculum masking strategy that starts with easy predictions and progressively increases difficulty. Random masking is preferred over block masking for audio due to the different spatial structure of spectrograms.

Paper: arXiv:2311.15830

MC-JEPA — Motion and Content JEPA

Optical Flow · Video

Jointly learns optical flow estimation and content features within a shared encoder. Achieves endpoint error (EPE) of ~2.81 on Sintel Clean and ~3.51 on Sintel Final — comparable to specialized unsupervised optical flow estimators.

T-JEPA — Trajectory JEPA

Trajectories · Spatiotemporal

Samples and predicts trajectory segments in representation space. Robust to noise, irregular sampling, and spatial distortion. Bypasses the need for handcrafted augmentations that are difficult to design for trajectory data.

TS-JEPA — Time Series JEPA

Time Series · Forecasting

Divides time series into patch tokens and uses high masking ratios (>70%). Outperforms masked input-reconstruction and contrastive baselines on both classification and long-term forecasting tasks.
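
The patch-and-mask preprocessing can be sketched as follows; the patch length of 16 and the 75% ratio are illustrative (the text only specifies a masking ratio above 70%):

```python
import torch

# TS-JEPA-style preprocessing sketch: split a univariate series into patch
# tokens, then mask most of them as latent prediction targets.
def patchify(series, patch_len=16):
    """Truncate to a multiple of patch_len, reshape to (num_patches, patch_len)."""
    T = series.shape[0] - series.shape[0] % patch_len
    return series[:T].reshape(-1, patch_len)

torch.manual_seed(0)
series = torch.randn(500)
patches = patchify(series)
num = patches.shape[0]

mask = torch.rand(num) < 0.75                  # ~75% of patches masked
context = patches[~mask]                       # visible patches → encoder
targets = patches[mask]                        # latent prediction targets
print(patches.shape)
```

The encoder and predictor then operate on these patch tokens exactly as in the image and video variants.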

Graph-JEPA

Graphs · GNN

Applies JEPA to graph-structured data. Much more memory-efficient than contrastive graph learning since it requires no augmentations or negative samples.

TI-JEPA / JEPA-T — Text-Image JEPA

Multimodal · Cross-modal

Cross-modal variants with cross-attention modules for multimodal fusion. Competitive on multimodal sentiment analysis, open-vocabulary image generation, and multimodal retrieval.

Drive-JEPA

Autonomous Driving · 2025

Applies Video JEPA to end-to-end autonomous driving with multimodal trajectory distillation. Combines JEPA's self-supervised video understanding with driving-specific objectives.

12. EB-JEPA — Energy-Based JEPA Library

Feb 2026 · Open Source · Meta AI · Educational

EB-JEPA is Meta's lightweight, open-source library for exploring energy-based joint-embedding predictive architectures. Designed for accessibility — every example trains on a single GPU in a few hours.

What's Included

  • Image representation learning: I-JEPA style training on images
  • Video representation learning: V-JEPA style training on video
  • Action-conditioned video: World model training with action inputs
  • Planning: Goal-conditioned planning using JEPA-based world models

Collapse Prevention: VICReg

EB-JEPA uses VICReg (Variance-Invariance-Covariance Regularization) to prevent representation collapse:

import torch
import torch.nn.functional as F

def off_diagonal(m):
    """Flattened view of the off-diagonal elements of a square matrix."""
    n = m.shape[0]
    return m.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()

# VICReg Loss Components
def vicreg_loss(z1, z2, lam=25.0, mu=25.0, nu=1.0):
    """
    z1, z2: (B, D) representations to align
    """
    # 1. Invariance: match representations
    inv_loss = F.mse_loss(z1, z2)

    # 2. Variance: prevent collapse to constant
    std_z1 = torch.sqrt(z1.var(dim=0) + 1e-4)
    std_z2 = torch.sqrt(z2.var(dim=0) + 1e-4)
    var_loss = (torch.mean(F.relu(1 - std_z1)) +
                torch.mean(F.relu(1 - std_z2)))

    # 3. Covariance: decorrelate dimensions
    z1_centered = z1 - z1.mean(dim=0)
    z2_centered = z2 - z2.mean(dim=0)
    cov_z1 = (z1_centered.T @ z1_centered) / (z1.shape[0] - 1)
    cov_z2 = (z2_centered.T @ z2_centered) / (z2.shape[0] - 1)
    # Off-diagonal elements should be zero
    cov_loss = (off_diagonal(cov_z1).pow(2).sum() / z1.shape[1] +
                off_diagonal(cov_z2).pow(2).sum() / z2.shape[1])

    return lam * inv_loss + mu * var_loss + nu * cov_loss

GitHub: facebookresearch/eb_jepa  |  Paper: arXiv:2602.03604

13. H-JEPA — Hierarchical JEPA

Theory · World Model · Autonomous Intelligence

Hierarchical JEPA (H-JEPA) is the theoretical capstone of LeCun's vision for autonomous machine intelligence. It proposes a multi-level hierarchy of JEPA modules operating at different timescales and abstraction levels — analogous to how human cognition operates from reflexive motor control to abstract planning.

Hierarchical World Model

H-JEPA hierarchical world model:

  Level 3 (abstract): strategic planning over long horizons (minutes → hours), e.g. "Go to the kitchen"
  Level 2 (mid): tactical planning over medium horizons (seconds → minutes), e.g. "Navigate around the table"
  Level 1 (concrete): motor control over short horizons (milliseconds → seconds), e.g. "Move left foot forward 30 cm"
  Each level feeds top-down context to the level below.

Key Properties of H-JEPA

  • Multi-level abstraction: Higher levels represent more abstract, longer-term predictions
  • Top-down influence: Higher-level goals constrain lower-level predictions and actions
  • Variable time scales: Each level operates at a different temporal granularity
  • Learned, not handcrafted: The hierarchy and abstraction levels emerge from self-supervised training
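
The multi-timescale idea can be illustrated with a toy two-level hierarchy; both "predictors" below are stand-in linear maps, purely illustrative of the structure, not a trained H-JEPA:

```python
import torch

# Two JEPA-style predictors at different timescales: the low level predicts
# one step at a time; the high level predicts the state k steps ahead in a
# single, more abstract jump.
torch.manual_seed(0)
D, k = 4, 8
low = torch.nn.Linear(D, D)       # fine timescale: one step per call
high = torch.nn.Linear(D, D)      # coarse timescale: k steps at once

s = torch.randn(1, D)
with torch.no_grad():
    fine = s
    for _ in range(k):            # k short-horizon predictions, chained
        fine = low(fine)
    coarse = high(s)              # one long-horizon prediction
print(fine.shape == coarse.shape)
```

In a trained hierarchy the coarse prediction would constrain the fine-grained rollouts from the top down; here the two levels are independent.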

Connection to LeCun's Architecture for Intelligence

H-JEPA is one module in LeCun's proposed cognitive architecture, alongside:

Module | Role
Configurator | Sets parameters and attention for other modules based on the current task
Perception | Estimates world state from sensory input
World Model (H-JEPA) | Predicts future world states; the core of the system
Cost Module | Computes energy/cost of predicted states (drives behavior)
Short-term Memory | Stores recent world states and predictions
Actor | Computes action sequences to minimize predicted cost
Current Status: H-JEPA remains largely theoretical. V-JEPA 2-AC represents the closest practical realization — a single-level world model capable of planning in latent space. Fully hierarchical, multi-level JEPA systems are an active area of research.

14. Timeline & Evolution

June 2022
LeCun's Position Paper
"A Path Towards Autonomous Machine Intelligence" — proposes JEPA, H-JEPA, and a full cognitive architecture.
Jan 2023
I-JEPA Paper
First practical JEPA implementation. "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture" — accepted at CVPR 2023.
Nov 2023
A-JEPA
Audio JEPA — extends the framework to audio spectrograms with curriculum masking.
Feb 2024
V-JEPA Released
Video JEPA — self-supervised learning from video with spatiotemporal masking. Open-sourced by Meta.
2024
Domain Variants Emerge
MC-JEPA (motion+content), T-JEPA (trajectories), TS-JEPA (time series), Graph-JEPA, TI-JEPA (text-image) — the JEPA family expands across modalities.
Early 2025
D-JEPA (ICLR 2025)
Denoising JEPA bridges discriminative and generative modeling with diffusion loss. SOTA on ImageNet generation.
June 2025
V-JEPA 2 Released
1.2B parameter world model. Includes V-JEPA 2-AC for robot manipulation with zero-shot planning. Open-sourced for commercial use.
Sep 2025
LLM-JEPA
JEPA applied to large language models. Outperforms standard objectives across Llama 3, OpenELM, Gemma 2, and OLMo.
Dec 2025
VL-JEPA
Vision-language JEPA — predicts continuous text embeddings. 50% fewer trainable params than standard VLMs.
Feb 2026
EB-JEPA Library
Meta releases lightweight energy-based JEPA library. Single-GPU examples for images, video, action-conditioned models, and planning.

15. Code & Implementations

Official Meta Repositories

Repository | Model | Framework | License
facebookresearch/ijepa | I-JEPA | PyTorch | CC-BY-NC 4.0
facebookresearch/jepa | V-JEPA | PyTorch | CC-BY-NC 4.0
facebookresearch/vjepa2 | V-JEPA 2 + AC | PyTorch | Apache 2.0 (commercial)
facebookresearch/eb_jepa | EB-JEPA | PyTorch | MIT

Quick Start with EB-JEPA

# Clone and setup
git clone https://github.com/facebookresearch/eb_jepa.git
cd eb_jepa
pip install -e .

# Train I-JEPA on CIFAR-10 (single GPU, ~1 hour)
python examples/image_ijepa/train.py \
    --dataset cifar10 \
    --batch-size 256 \
    --epochs 100 \
    --encoder vit_small \
    --pred-depth 4

# Train V-JEPA on video (single GPU, few hours)
python examples/video_vjepa/train.py \
    --dataset kinetics400_subset \
    --batch-size 32 \
    --epochs 50

Using V-JEPA 2 Pre-trained Features

import torch
from vjepa2 import load_model

# Load pre-trained V-JEPA 2 encoder
model = load_model("vjepa2-viteh", pretrained=True)
model.eval()

# Extract features from video
# video: (B, T, C, H, W) tensor
video = torch.randn(1, 16, 3, 224, 224)

with torch.no_grad():
    features = model.encode(video)
    # features: (B, N, D) where N = num_patches, D = embed_dim

    # Global average pooling for classification
    global_features = features.mean(dim=1)  # (B, D)

    # Per-patch features for dense prediction
    spatial_features = features  # (B, N, D)

Community Implementations

Package | Source | Description
jepa (PyPI) | Community | General JEPA framework built with PyTorch + Transformers
HuggingFace Course | HuggingFace | I-JEPA tutorial in the Computer Vision Course

16. Resources & Links

Key Papers

Paper | Year | Link
A Path Towards Autonomous Machine Intelligence (LeCun) | 2022 | PDF
I-JEPA: Self-Supervised Learning from Images | 2023 | arXiv
V-JEPA: Video Joint-Embedding Predictive Architecture | 2024 | Meta Blog
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | 2025 | arXiv
VL-JEPA: Vision-Language JEPA | 2025 | arXiv
LLM-JEPA: LLMs Meet JEPA | 2025 | arXiv
A-JEPA: JEPA Can Listen | 2023 | arXiv
EB-JEPA: Lightweight Library for Energy-Based JEPA | 2026 | arXiv
