Abstract
We present AV‑JEPA, an elegant multimodal extension of LeJEPA to audio-visual self-supervised learning. Using an early-fusion Vision Transformer and modality dropout as masking, the model is trained to align the embeddings of global and per-modality local views, while the SIGReg objective encourages a theoretically optimal distribution. This achieves cross-modal alignment in the latent space, resulting in a remarkably clean architecture with no decoder, EMA teacher, complex multi-term losses, or contrastive negatives. The proposed AV‑JEPA backbone delivers competitive classification performance on VGGSound (57.1% top-1) and AudioSet (32.7 mAP) and supports zero-shot audio-video retrieval out of the box.
How it works
Prior audio-visual SSL methods reconstruct masked inputs through dedicated decoders, often with contrastive losses and EMA teachers on top. AV‑JEPA operates entirely in latent space, built from three ingredients:
Early-fusion encoder
Video tubelets (1568 tokens) and mel-spectrogram patches (400 tokens) are processed jointly by a single shared ViT‑Base from the first layer.
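The two token counts follow from patch-grid arithmetic. The sketch below is only illustrative: the clip length, resolution, spectrogram size, and patch sizes are assumptions chosen because they reproduce the stated counts, not values confirmed here, and the `Grid` helper is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grid:
    """Input extent and patch size along each axis (hypothetical helper)."""
    size: tuple
    patch: tuple

    def num_tokens(self) -> int:
        n = 1
        for extent, p in zip(self.size, self.patch):
            if extent % p:
                raise ValueError(f"extent {extent} is not divisible by patch {p}")
            n *= extent // p
        return n


# ASSUMPTION: 16 frames at 224x224 with 2x16x16 tubelets -> 8 * 14 * 14 tokens.
video = Grid(size=(16, 224, 224), patch=(2, 16, 16))
# ASSUMPTION: 128 mel bins x 800 time frames with 16x16 patches -> 8 * 50 tokens.
audio = Grid(size=(128, 800), patch=(16, 16))

n_video = video.num_tokens()
n_audio = audio.num_tokens()
# Early fusion: both modalities share one sequence, so self-attention
# can mix audio and video tokens from the first layer onward.
n_joint = n_video + n_audio
print(n_video, n_audio, n_joint)  # 1568 400 1968
```

Because attention cost grows with the square of the sequence length, the joint sequence of 1968 tokens is what determines the encoder's cost, not either modality alone.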
Modality dropout as masking
Local views alternate between audio-only and video-only inputs, with the other modality zeroed. Aligning them with the joint-modality center makes cross-modal prediction an implicit latent-space task.
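View construction under modality dropout can be sketched as follows. This is a toy illustration, not the training pipeline: the number of local views, the dictionary layout, and whether zeroing happens on raw inputs or on token features are all assumptions, and `make_views` is a hypothetical name.

```python
def zeros_like(tokens):
    """Same shape as `tokens`, with every value set to zero."""
    return [[0.0] * len(tok) for tok in tokens]


def make_views(video_tokens, audio_tokens, n_local=4):
    """Return one joint global view and `n_local` single-modality local views.

    ASSUMPTION: n_local=4 and the even/odd alternation are illustrative choices.
    """
    global_view = {"kept": "both", "video": video_tokens, "audio": audio_tokens}
    local_views = []
    for i in range(n_local):
        keep_audio = i % 2 == 0  # even index: audio-only, odd index: video-only
        local_views.append({
            "kept": "audio" if keep_audio else "video",
            "video": zeros_like(video_tokens) if keep_audio else video_tokens,
            "audio": audio_tokens if keep_audio else zeros_like(audio_tokens),
        })
    return global_view, local_views


# Toy inputs: 3 video tokens and 2 audio tokens, each a 2-d feature vector.
video = [[0.5, -1.0], [2.0, 0.25], [1.0, 1.0]]
audio = [[0.1, 0.2], [0.3, 0.4]]
global_view, local_views = make_views(video, audio)
print([v["kept"] for v in local_views])  # ['audio', 'video', 'audio', 'video']
```

Since every local view keeps the full token layout, the same encoder can process global and local views without any change in sequence length.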
LeJEPA loss
One invariance term plus SIGReg, which provably drives embeddings toward an isotropic Gaussian and prevents collapse. A single scalar λ is the only loss hyperparameter.
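The structure of the loss can be sketched in a few lines of dependency-free Python. This is a simplified illustration in the spirit of LeJEPA, not the training code: SIGReg is approximated by an Epps-Pulley-style comparison of the empirical characteristic function of random 1-D projections against that of a standard Gaussian, and the number of projections, the integration grid, the (1 − λ)/λ weighting, and the value λ = 0.05 are all assumptions.

```python
import cmath
import math
import random


def sigreg(embs, num_slices=16, num_knots=17, t_max=5.0, seed=0):
    """Gaussianity statistic averaged over random unit-norm 1-D projections."""
    rng = random.Random(seed)
    n, dim = len(embs), len(embs[0])
    ts = [-t_max + 2 * t_max * k / (num_knots - 1) for k in range(num_knots)]
    dt = ts[1] - ts[0]
    total = 0.0
    for _ in range(num_slices):
        a = [rng.gauss(0, 1) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in a))
        proj = [sum(e[d] * a[d] for d in range(dim)) / norm for e in embs]
        errs = []
        for t in ts:
            ecf = sum(cmath.exp(1j * t * p) for p in proj) / n  # empirical CF
            target = math.exp(-0.5 * t * t)                     # CF of N(0, 1)
            errs.append(abs(ecf - target) ** 2 * target)        # weighted error
        # Trapezoidal rule over t, scaled by the sample size.
        total += n * sum(0.5 * dt * (errs[k] + errs[k + 1])
                         for k in range(num_knots - 1))
    return total / num_slices


def lejepa_loss(views, n_global, lam=0.05):
    """views[v][i] is the embedding of sample i under view v; global views come first.

    ASSUMPTION: lam=0.05 is a placeholder, not the value used for AV-JEPA.
    """
    n_views, n, dim = len(views), len(views[0]), len(views[0][0])
    # Per-sample center: mean embedding over the global views.
    centers = [[sum(views[v][i][d] for v in range(n_global)) / n_global
                for d in range(dim)] for i in range(n)]
    inv = sum((views[v][i][d] - centers[i][d]) ** 2
              for v in range(n_views) for i in range(n) for d in range(dim))
    inv /= n_views * n * dim
    reg = sum(sigreg(view, seed=v) for v, view in enumerate(views)) / n_views
    return (1 - lam) * inv + lam * reg


rng = random.Random(1)
gauss = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(256)]
collapsed = [[0.0] * 8 for _ in range(256)]
# Collapsed embeddings match the center perfectly but are penalized by SIGReg.
print(sigreg(gauss) < sigreg(collapsed))
```

The toy comparison shows why the invariance term alone is not enough: a constant embedding drives it to zero, and only the distribution-matching term separates that solution from a well-spread one.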
Results
Pretrained on AudioSet‑2M for 57 epochs and fine-tuned per dataset, AV‑JEPA is, to our knowledge, the first JEPA-based method to reach this level of audio-visual classification accuracy.
| Method | Type | Pre-train | Epochs (pre-train + fine-tune) | Eval | Top-1 (%) |
|---|---|---|---|---|---|
| AS-2M → VGGSound fine-tune (ours) | | | | | |
| AV‑JEPA | JEPA | AS-2M | 57+13 | FT (attentive) | 57.1 |
| AV‑JEPA | JEPA | AS-2M | 57+13 | FT (linear) | 56.6 |
| Controlled VGGSound-only (ours) | | | | | |
| AV‑JEPA | JEPA | VGGS | 50+6 | FT | 49.8 |
| AV‑JEPA | JEPA | VGGS | 50 | Attentive (frozen) | 48.6 |
| AV‑JEPA | JEPA | VGGS | 50 | Linear (frozen) | 46.0 |
| Literature (MAE-based) | | | | | |
| MAViL | MAE | AS-2M+IN | 80+60 | FT | 67.1 |
| CAV-MAE | MAE | AS-2M | 25+10 | FT | 65.4 |
| AV-MAE | MAE | VGGS | 800+50 | FT | 63.5 |
| CAV-MAE Sync | MAE | AS-2M | 25 | Linear (frozen) | 52.7 |
MAE baselines rely on reconstruction decoders and contrastive objectives; AV-MAE pretrains for up to 800 epochs and MAViL adds an ImageNet-pretrained visual encoder.
| Method | Type | Pre-train | Eval | AS-2M (mAP) | AS-20k (mAP) |
|---|---|---|---|---|---|
| End-to-end fine-tuning (ours) | | | | | |
| AV‑JEPA | JEPA | AS-2M | A+V | 32.7 | 29.6 |
| AV‑JEPA | JEPA | AS-2M | Audio-only | 26.0 | 23.7 |
| AV‑JEPA | JEPA | AS-2M | Video-only | 12.8 | 10.3 |
| Baselines (end-to-end fine-tuning) | | | | | |
| AV-MAE | MAE | AS-2M | A+V | 47.3 | – |
| CAV-MAE | MAE | AS-2M | A+V | 51.2 | 42.0 |
| MAViL | MAE | AS-2M+IN | A+V | 53.3 | 44.9 |
| CAV-MAE Sync | MAE | AS-2M | A+V | – | 30.5* |
*Linear probe.

The per-modality breakdown (26.0 audio-only vs. 12.8 video-only) shows the learned representation is strongly audio-driven.
Zero-shot cross-modal retrieval
Each clip is encoded once per modality, with the other modality zeroed, and candidates are ranked by cosine similarity. Although no contrastive loss is used in training, retrieval in both directions is far above the R@1 chance level of 1/N (about 0.06% for N=1545 and 0.05% for N=2015).
| Dataset | Direction | R@1 | R@5 | R@10 | Med. rank |
|---|---|---|---|---|---|
| VGGSound (N=1545) | A → V | 10.6 | 26.3 | 35.4 | 25 |
| VGGSound (N=1545) | V → A | 10.2 | 27.4 | 36.9 | 24 |
| AudioSet (N=2015) | A → V | 10.6 | 26.3 | 35.5 | 25 |
| AudioSet (N=2015) | V → A | 11.2 | 26.1 | 35.9 | 25 |
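The retrieval metrics in the table can be computed along the following lines. This is a minimal sketch that assumes query `i` and gallery item `i` come from the same clip. The function names are hypothetical, the toy embeddings are made up for illustration, and tie handling (strictly higher scores count against the match) is an assumption.

```python
import math
from statistics import median


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def retrieval_metrics(queries, gallery, ks=(1, 5, 10)):
    """Return ({k: R@k in percent}, median rank) for paired embeddings."""
    q = [l2_normalize(v) for v in queries]
    g = [l2_normalize(v) for v in gallery]
    ranks = []
    for i, qi in enumerate(q):
        # Cosine similarity is a dot product once both sides are unit-norm.
        sims = [sum(a * b for a, b in zip(qi, gj)) for gj in g]
        # Rank of the true match: 1 + number of candidates scoring strictly higher.
        ranks.append(1 + sum(s > sims[i] for s in sims))
    recall = {k: 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return recall, median(ranks)


def chance_recall_at_1(n_candidates):
    """R@1 (in percent) of a uniformly random ranking over n_candidates."""
    return 100.0 / n_candidates


# Toy example: audio-only embeddings that sit close to their video-only partners.
video_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
audio_emb = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
recall, med_rank = retrieval_metrics(audio_emb, video_emb, ks=(1,))
print(recall[1], med_rank)  # 100.0 1
print(round(chance_recall_at_1(1545), 3), round(chance_recall_at_1(2015), 3))  # 0.065 0.05
```

Swapping the two arguments gives the V → A direction. Both directions use the same similarity matrix, transposed.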
Cross-modal attention emerges for free
Last-layer audio→video attention on RGB frames (left) and video→audio attention on mel spectrograms (right). The model attends to the sound-producing region and to the sound's harmonic structure, with no localization supervision.
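Extracting such maps from a joint attention matrix amounts to slicing its two off-diagonal blocks. The sketch below assumes a single head-averaged matrix with tokens ordered video first, then audio, and no CLS or register tokens. The actual token layout and head aggregation may differ, and `cross_modal_maps` is a hypothetical helper.

```python
def cross_modal_maps(attn, n_video, n_audio):
    """Split a joint attention matrix into its two cross-modal maps.

    attn[q][k] is the weight that query token q places on key token k.
    ASSUMPTION: token order is [video..., audio...] with no extra tokens.
    """
    if len(attn) != n_video + n_audio:
        raise ValueError("attention matrix does not match the token counts")
    video_idx = range(n_video)
    audio_idx = range(n_video, n_video + n_audio)
    # Audio -> video: how much the audio queries attend to each video patch.
    a2v = [sum(attn[q][k] for q in audio_idx) / n_audio for k in video_idx]
    # Video -> audio: how much the video queries attend to each spectrogram patch.
    v2a = [sum(attn[q][k] for q in video_idx) / n_video for k in audio_idx]
    return a2v, v2a


# Toy matrix with 2 video tokens and 1 audio token (each row sums to 1).
attn = [
    [0.5, 0.3, 0.2],  # video query 0
    [0.1, 0.6, 0.3],  # video query 1
    [0.7, 0.2, 0.1],  # audio query
]
a2v, v2a = cross_modal_maps(attn, n_video=2, n_audio=1)
print(a2v)  # [0.7, 0.2]
```

For visualization, `a2v` would be reshaped to the video patch grid and upsampled onto the RGB frames, and `v2a` to the spectrogram patch grid.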
The encoder localizes sounding objects
DINOv2‑style feature PCA of last-layer video patch tokens, with one basis fit per class (rows) and the top three components mapped to RGB. The sounding object takes a consistent color across clips and frames, despite training with clip-level objectives only.
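The PCA-to-RGB visualization can be sketched without external libraries. This is an illustrative approximation rather than the plotting code used for the figure: the principal directions are estimated by power iteration with deflation, the basis is fit on tokens pooled across clips of one class, and the per-clip min-max scaling, token counts, and feature dimension are all assumptions.

```python
import math
import random


def fit_pca(rows, k=3, iters=100, seed=0):
    """Mean and top-k principal directions of `rows` (power iteration + deflation)."""
    rng = random.Random(seed)
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[d] for r in rows) / n for d in range(dim)]
    x = [[r[d] - mean[d] for d in range(dim)] for r in rows]
    comps = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(dim)]
        for _ in range(iters):
            for c in comps:  # remove directions that were already found
                dot = sum(a * b for a, b in zip(v, c))
                v = [a - dot * b for a, b in zip(v, c)]
            proj = [sum(a * b for a, b in zip(row, v)) for row in x]
            v = [sum(proj[i] * x[i][d] for i in range(n)) for d in range(dim)]  # X^T X v
            norm = math.sqrt(sum(a * a for a in v)) or 1.0
            v = [a / norm for a in v]
        comps.append(v)
    return mean, comps


def pca_to_rgb(tokens, mean, comps):
    """Project tokens on three components and min-max scale each to [0, 1]."""
    proj = [[sum((t[d] - mean[d]) * c[d] for d in range(len(mean))) for c in comps]
            for t in tokens]
    lo = [min(p[j] for p in proj) for j in range(3)]
    hi = [max(p[j] for p in proj) for j in range(3)]
    return [[(p[j] - lo[j]) / ((hi[j] - lo[j]) or 1.0) for j in range(3)] for p in proj]


# Toy data: two "clips" of one class, 49 patch tokens each, 8-d features.
rng = random.Random(0)
clips = [[[rng.gauss(0, 1) for _ in range(8)] for _ in range(49)] for _ in range(2)]
pooled = [tok for clip in clips for tok in clip]
mean, comps = fit_pca(pooled)  # one basis per class, shared by all its clips
rgb = pca_to_rgb(clips[0], mean, comps)
print(len(rgb), len(rgb[0]))  # 49 3
```

Sharing one basis across all clips of a class is what makes colors comparable between clips. Fitting a separate basis per clip would permute and flip the components arbitrarily.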
A semantically structured embedding space
t-SNE of CLS embeddings for 11,143 VGGSound clips from 60 classes, grouped into six semantic families. Clips form compact, well-separated clusters that respect the family grouping, even though pretraining never sees class labels.
BibTeX
@inproceedings{robson2026avjepa,
title = {{AV-JEPA}: Extending {LeJEPA} to Audio-Visual Self-Supervised Learning},
author = {Robson, Benjamin and Mentu, Santeri and Zhao, Wenshuai and Solin, Arno},
booktitle = {ICML 2026 Workshop on Machine Learning for Audio},
year = {2026}
}