AV-JEPA: Extending LeJEPA to Audio-Visual Self-Supervised Learning

**The AV‑JEPA training pipeline.** Each clip yields two global views (both modalities) and two local views that alternate between audio-only and video-only. All views pass through a single shared early-fusion ViT‑Base; the LeJEPA loss pulls every embedding toward the joint-modality center while SIGReg enforces an isotropic Gaussian distribution.

Abstract

We present AV‑JEPA, a multimodal extension of LeJEPA to audio-visual self-supervised learning. Using an early-fusion Vision Transformer and modality dropout as masking, the model is trained to align the embeddings of global and per-modality local views, while the SIGReg objective encourages the embeddings to follow an isotropic Gaussian distribution. Training this way achieves cross-modal alignment in the latent space with a simple architecture: no decoder, no EMA teacher, no multi-term loss, and no contrastive negatives. The proposed AV‑JEPA backbone delivers competitive classification performance on VGGSound (57.1% top-1) and AudioSet (32.7 mAP) and supports zero-shot audio-video retrieval out of the box.

How it works

Prior audio-visual SSL methods reconstruct masked inputs through dedicated decoders, often with contrastive losses and EMA teachers on top. AV‑JEPA operates entirely in latent space, built from three ingredients:

Early-fusion encoder

Video tubelets (1568 tokens) and mel-spectrogram patches (400 tokens) are processed jointly by a single shared ViT‑Base from the first layer.
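As a rough illustration of early fusion, the sketch below concatenates the two token streams into one sequence before the first encoder layer. The token counts come from the text and 768 is the standard ViT‑Base width; the function name `fuse_tokens` and the additive modality-type embeddings are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

NUM_VIDEO_TOKENS = 1568  # video tubelet tokens, as stated in the text
NUM_AUDIO_TOKENS = 400   # mel-spectrogram patch tokens, as stated in the text
EMBED_DIM = 768          # standard ViT-Base hidden width


def fuse_tokens(video_tokens, audio_tokens, video_type, audio_type):
    """Concatenate per-modality tokens into one sequence for a shared encoder.

    video_tokens: (batch, num_video, dim), audio_tokens: (batch, num_audio, dim).
    video_type / audio_type: (dim,) hypothetical modality-type embeddings that
    let the shared encoder tell the two streams apart.
    """
    video = video_tokens + video_type  # broadcast over batch and tokens
    audio = audio_tokens + audio_type
    # Token order [video, audio] is an arbitrary choice for this sketch.
    return np.concatenate([video, audio], axis=1)
```

With the counts above, the shared encoder sees 1568 + 400 = 1968 patch tokens per clip, before any additional tokens such as a CLS token.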

Modality dropout as masking

Local views alternate between audio-only and video-only, the other modality zeroed. Aligning them with the joint-modality center makes cross-modal prediction an implicit latent-space task.
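The view construction described here can be sketched as follows. This is a minimal sketch under stated assumptions: zeroing is applied to the token embeddings (the text does not say at which stage it happens), the two global views are plain copies because the augmentations that would distinguish them are omitted, and `make_views` is a hypothetical helper name.

```python
import numpy as np


def make_views(video_tokens, audio_tokens):
    """Build two global and two local views from one clip's token streams.

    video_tokens: (num_video, dim), audio_tokens: (num_audio, dim).
    Global views keep both modalities; each local view zeroes one modality.
    """
    zero_video = np.zeros_like(video_tokens)
    zero_audio = np.zeros_like(audio_tokens)
    joint = np.concatenate([video_tokens, audio_tokens], axis=0)
    # Assumption: real global views would differ by augmentation.
    global_views = [joint.copy(), joint.copy()]
    local_views = [
        np.concatenate([zero_video, audio_tokens], axis=0),  # audio-only view
        np.concatenate([video_tokens, zero_audio], axis=0),  # video-only view
    ]
    return global_views, local_views
```

Because every view keeps the same sequence length, the same shared encoder can process global and local views without any architectural change.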

LeJEPA loss

One invariance term plus SIGReg, which provably drives embeddings toward an isotropic Gaussian and prevents collapse. A single scalar λ is the only loss hyperparameter.
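To make the two terms concrete, here is a NumPy sketch loosely modeled on the published LeJEPA pseudo-code. Several details are assumptions rather than facts from this page: the sliced Epps-Pulley form of SIGReg, the number of random slices, the integration grid, and the convex weighting `(1 - lam) * invariance + lam * sigreg`. The function names are hypothetical.

```python
import numpy as np


def sigreg(z, rng, num_slices=64, num_knots=17, t_max=5.0):
    """Sliced Epps-Pulley-style statistic comparing embeddings z to N(0, I).

    z: (n, dim). Smaller values mean the 1-D projections look more Gaussian.
    """
    n, dim = z.shape
    # Random unit directions; each column defines one 1-D projection.
    dirs = rng.standard_normal((dim, num_slices))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = z @ dirs                                    # (n, num_slices)
    t = np.linspace(-t_max, t_max, num_knots)          # frequency grid
    target_cf = np.exp(-0.5 * t**2)                    # characteristic function of N(0, 1)
    ecf = np.exp(1j * proj[:, :, None] * t).mean(axis=0)  # (num_slices, num_knots)
    err = np.abs(ecf - target_cf) ** 2 * target_cf     # Gaussian-weighted squared error
    dt = t[1] - t[0]
    integral = ((err[:, 1:] + err[:, :-1]) * 0.5 * dt).sum(axis=1)  # trapezoid rule
    return float(n * integral.mean())


def lejepa_style_loss(global_emb, all_emb, lam, rng):
    """global_emb: (G, B, D) global-view embeddings; all_emb: (V, B, D) all views."""
    center = global_emb.mean(axis=0)                   # joint-modality center, (B, D)
    invariance = float(((all_emb - center) ** 2).mean())
    reg = float(np.mean([sigreg(view, rng) for view in all_emb]))
    return (1.0 - lam) * invariance + lam * reg, invariance, reg
```

In this sketch a fully collapsed batch (all embeddings identical) gives zero invariance loss but a large SIGReg value, which is the mechanism that discourages collapse.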

Results

Pretrained on AudioSet‑2M for 57 epochs and fine-tuned per dataset, AV‑JEPA is to our knowledge the first JEPA-based method at this level of audio-visual classification accuracy.

Audio-visual classification, VGGSound top-1 (%)
Method	Type	Pre-train	Epochs	Eval	Top-1
AS-2M → VGGSound fine-tune (ours)
AV‑JEPA	JEPA	AS-2M	57+13	FT (attentive)	57.1
AV‑JEPA	JEPA	AS-2M	57+13	FT (linear)	56.6
Controlled VGGSound-only (ours)
AV‑JEPA	JEPA	VGGS	50+6	FT	49.8
AV‑JEPA	JEPA	VGGS	50	Attentive (frozen)	48.6
AV‑JEPA	JEPA	VGGS	50	Linear (frozen)	46.0
Literature (MAE-based)
MAViL	MAE	AS-2M+IN	80+60	FT	67.1
CAV-MAE	MAE	AS-2M	25+10	FT	65.4
AV-MAE	MAE	VGGS	800+50	FT	63.5
CAV-MAE Sync	MAE	AS-2M	25	Linear (frozen)	52.7

MAE baselines rely on reconstruction decoders and contrastive objectives; AV-MAE pretrains for up to 800 epochs and MAViL adds an ImageNet-pretrained visual encoder.

Audio-visual mAP on the AudioSet eval split
Method	Type	Pre-train	Eval	AS-2M	AS-20k
End-to-end fine-tuning (ours)
AV‑JEPA	JEPA	AS-2M	A+V	32.7	29.6
AV‑JEPA	JEPA	AS-2M	Audio-only	26.0	23.7
AV‑JEPA	JEPA	AS-2M	Video-only	12.8	10.3
Baselines (end-to-end fine-tuning)
AV-MAE	MAE	AS-2M	A+V	47.3	–
CAV-MAE	MAE	AS-2M	A+V	51.2	42.0
MAViL	MAE	AS-2M+IN	A+V	53.3	44.9
CAV-MAE Sync	MAE	AS-2M	A+V	–	30.5*

*Linear probe. The per-modality breakdown (26.0 audio-only vs. 12.8 video-only) shows the learned representation is strongly audio-driven.

Zero-shot cross-modal retrieval

Each clip is encoded once per modality, with the other zeroed, and candidates are ranked by cosine similarity. With no contrastive training, retrieval is far above the roughly 0.05% chance level in both directions.
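The evaluation protocol can be sketched as below, assuming `query` and `gallery` hold the per-modality embeddings of the same N clips in the same order, so that the true match for query i is gallery item i. The function name and the tie-breaking rule (rank counts only strictly higher scores) are assumptions.

```python
import numpy as np


def retrieval_metrics(query, gallery, ks=(1, 5, 10)):
    """Recall@k (%) and median rank for paired embeddings of shape (N, dim)."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sim = q @ g.T                                # (N, N) cosine similarities
    true_sim = np.diag(sim)[:, None]             # score of the correct match
    # Rank of the true match: 1 + number of candidates scoring strictly higher.
    ranks = 1 + (sim > true_sim).sum(axis=1)
    recalls = {k: 100.0 * float((ranks <= k).mean()) for k in ks}
    return recalls, float(np.median(ranks))
```

For reference, a uniformly random ranking gives R@1 of 100/N percent: about 0.065% for N=1545 and about 0.050% for N=2015.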

Recall@k (%) and median rank, balanced 5-per-class eval subsets
Dataset	Direction	R@1	R@5	R@10	Med. rank
VGGSound (N=1545)	A → V	10.6	26.3	35.4	25
VGGSound (N=1545)	V → A	10.2	27.4	36.9	24
AudioSet (N=2015)	A → V	10.6	26.3	35.5	25
AudioSet (N=2015)	V → A	11.2	26.1	35.9	25

Cross-modal attention emerges for free

Last-layer audio→video attention on RGB frames (left) and video→audio attention on mel spectrograms (right). The model attends to the sound-producing region and to the sound's harmonic structure, with no localization supervision.
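One way such maps could be read out of an early-fusion encoder is to slice the cross-modal blocks of the last-layer attention matrix. The sketch below assumes tokens are ordered `[video, audio]` with no extra tokens, and that scores are averaged over heads and over query tokens; the page does not specify these choices, and `cross_modal_attention` is a hypothetical helper.

```python
import numpy as np


def cross_modal_attention(attn, num_video, num_audio):
    """Extract cross-modal attention scores from a fused attention matrix.

    attn: (heads, T, T) attention weights with T = num_video + num_audio.
    Returns (audio_to_video, video_to_audio), each normalized to sum to 1.
    """
    if attn.shape[-1] != num_video + num_audio:
        raise ValueError("attention size does not match token counts")
    mean_attn = attn.mean(axis=0)                          # average over heads
    # Rows are queries, columns are keys.
    a2v = mean_attn[num_video:, :num_video].mean(axis=0)   # audio queries -> video keys
    v2a = mean_attn[:num_video, num_video:].mean(axis=0)   # video queries -> audio keys
    return a2v / a2v.sum(), v2a / v2a.sum()
```

The audio-to-video vector has one score per video token and can be reshaped to the tubelet grid and upsampled onto the RGB frames; the video-to-audio vector maps onto the spectrogram patch grid in the same way.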

**Cross-modal attention** on three VGGSound test clips. Top to bottom: playing guitar, playing flute, flying bird. Left: audio-to-video attention on the guitar clip concentrates on the guitar and the player's hands. Right: video-to-audio attention on the guitar mel spectrogram follows fundamentals and overtones.

The encoder localizes sounding objects

DINOv2‑style feature PCA of last-layer video patch tokens, with one basis fit per class (rows) and the top three components mapped to RGB. The sounding object takes a consistent color across clips and frames, despite training with clip-level objectives only.
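The visualization recipe can be sketched as below: pool the patch tokens of all clips in one class, fit a PCA basis, and map the top three components to RGB. Min-max scaling per component is an assumption (other normalizations are common), and `pca_rgb` is a hypothetical helper name.

```python
import numpy as np


def pca_rgb(patch_tokens):
    """Map patch tokens to RGB colors using their top three principal components.

    patch_tokens: (num_patches_total, dim), pooled over all clips of one class
    so that a single basis is shared across clips and frames.
    """
    centered = patch_tokens - patch_tokens.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    comps = centered @ vt[:3].T                        # (num_patches_total, 3)
    lo, hi = comps.min(axis=0), comps.max(axis=0)
    return (comps - lo) / (hi - lo + 1e-8)             # scale each channel to [0, 1]
```

Sharing one basis per class is what makes colors comparable across clips: the same object part lands on the same principal directions and therefore gets the same color.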

**Feature PCA of video patch tokens.** Left to right: playing piano, playing bass guitar, playing violin. On the four piano clips the keyboard takes a consistent color across clips and frames; on the four bass guitar clips the instrument body and neck do.

A semantically structured embedding space

t-SNE of CLS embeddings for 11,143 VGGSound clips from 60 classes, grouped into six semantic families. Clips form compact, well-separated clusters that respect the family grouping, even though pretraining never sees class labels.

**Semantic structure of the embedding space.** t-SNE of 11,143 VGGSound clip embeddings colored by six semantic families. The residual overlap falls mainly between acoustically related families such as animal calls and human voice.

BibTeX

@inproceedings{robson2026avjepa,
  title     = {{AV-JEPA}: Extending {LeJEPA} to Audio-Visual Self-Supervised Learning},
  author    = {Robson, Benjamin and Mentu, Santeri and Zhao, Wenshuai and Solin, Arno},
  booktitle = {ICML 2026 Workshop on Machine Learning for Audio},
  year      = {2026}
}