Learning to Listen · ICML 2026 Workshop on Machine Learning for Audio

AV‑JEPA: Extending LeJEPA to Audio‑Visual Self‑Supervised Learning

Benjamin Robson · Santeri Mentu · Wenshuai Zhao · Arno Solin

ELLIS Institute Finland & Aalto University

The AV‑JEPA training pipeline. Each clip yields two global views (both modalities) and two local views that alternate audio-only / video-only. All views pass through a single shared early-fusion ViT‑Base, and the LeJEPA loss pulls every embedding toward the joint-modality center while SIGReg enforces an isotropic Gaussian distribution.

Abstract

We present AV‑JEPA, a multimodal extension of LeJEPA to audio-visual self-supervised learning. Using an early-fusion Vision Transformer with modality dropout as masking, the model is trained to align the embeddings of global and per-modality local views, while the SIGReg objective pushes the embedding distribution toward an isotropic Gaussian. This yields cross-modal alignment in latent space with a simple architecture: no decoder, EMA teacher, multi-term loss, or contrastive negatives. The AV‑JEPA backbone reaches competitive classification performance on VGGSound (57.1% top-1) and AudioSet (32.7 mAP) and supports zero-shot audio-video retrieval without additional training.

How it works

Prior audio-visual SSL methods reconstruct masked inputs through dedicated decoders, often with contrastive losses and EMA teachers on top. AV‑JEPA operates entirely in latent space, built from three ingredients:

Early-fusion encoder

Video tubelets (1568 tokens) and mel-spectrogram patches (400 tokens) are processed jointly by a single shared ViT‑Base from the first layer.

Modality dropout as masking

Local views alternate between audio-only and video-only, with the other modality zeroed out. Aligning them with the joint-modality center makes cross-modal prediction an implicit latent-space task.
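The view construction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' code: the helper names `make_views`, `zeros_like`, and `fuse` are hypothetical, the dropped modality is zeroed at the token level, local views start with audio-only, and per-view augmentation is omitted (so the two global views are identical here). Only the token counts come from the text.

```python
import random

N_VIDEO_TOKENS = 1568  # video tubelet tokens per clip (from the text)
N_AUDIO_TOKENS = 400   # mel-spectrogram patch tokens per clip (from the text)


def zeros_like(tokens):
    """Zeroed copy of a token list, standing in for a dropped modality."""
    return [[0.0] * len(tok) for tok in tokens]


def make_views(video, audio, n_global=2, n_local=2):
    """Return (kind, video_tokens, audio_tokens) triples for one clip.

    Global views keep both modalities. Local views alternate between
    audio-only and video-only; starting with audio-only is an assumption.
    """
    views = [("global", video, audio) for _ in range(n_global)]
    for i in range(n_local):
        if i % 2 == 0:
            views.append(("audio-only", zeros_like(video), audio))
        else:
            views.append(("video-only", video, zeros_like(audio)))
    return views


def fuse(video, audio):
    """Early fusion: one joint token sequence for a single shared encoder."""
    return video + audio


rng = random.Random(0)
dim = 4  # toy token width, purely illustrative
video = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(N_VIDEO_TOKENS)]
audio = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(N_AUDIO_TOKENS)]
views = make_views(video, audio)
sequences = [(kind, fuse(v, a)) for kind, v, a in views]
```

Under these assumptions every view, global or local, is a sequence of 1568 + 400 = 1968 tokens, so the same encoder can process all of them without any change in input shape.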

LeJEPA loss

One invariance term plus SIGReg, which provably drives embeddings toward an isotropic Gaussian and prevents collapse. A single scalar λ is the only loss hyperparameter.
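To make the two terms concrete, here is a small standard-library sketch. It is an assumption-laden illustration rather than the AV‑JEPA or LeJEPA implementation: the invariance term is taken as the mean squared distance of each view embedding to the mean of the global views, SIGReg is approximated by an Epps-Pulley-style characteristic-function gap averaged over random 1-D projections, and the two are combined as `(1 - lam) * inv + lam * reg`. The function names, the integration grid, the Gaussian window, and the value of `lam` in the usage lines are all hypothetical choices.

```python
import cmath
import math
import random


def invariance(views, n_global):
    """Mean squared distance of each view embedding to the center,
    where the center is the mean of the global (joint-modality) views."""
    dim = len(views[0])
    center = [sum(v[d] for v in views[:n_global]) / n_global for d in range(dim)]
    dists = [sum((v[d] - center[d]) ** 2 for d in range(dim)) / dim for v in views]
    return sum(dists) / len(dists)


def gaussianity_gap(samples, t_max=3.0, n_points=17):
    """Weighted squared gap between the empirical characteristic function
    of 1-D samples and that of N(0, 1), integrated with the trapezoid rule."""
    n = len(samples)
    dt = 2.0 * t_max / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        t = -t_max + i * dt
        ecf = sum(cmath.exp(1j * t * x) for x in samples) / n
        target = math.exp(-0.5 * t * t)  # characteristic function of N(0, 1)
        window = math.exp(-0.5 * t * t)  # Gaussian weighting (assumption)
        edge = 0.5 if i in (0, n_points - 1) else 1.0
        total += edge * abs(ecf - target) ** 2 * window * dt
    return total


def sigreg(embeddings, n_dirs=16, seed=0):
    """Average the 1-D gap over random unit directions."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(n_dirs):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in direction))
        projected = [sum(e[k] * direction[k] for k in range(dim)) / norm
                     for e in embeddings]
        total += gaussianity_gap(projected)
    return total / n_dirs


def lejepa_loss(batch, n_global, lam):
    """batch[c][v] is the embedding of view v of clip c."""
    inv = sum(invariance(views, n_global) for views in batch) / len(batch)
    n_views = len(batch[0])
    reg = sum(sigreg([views[v] for views in batch]) for v in range(n_views)) / n_views
    return (1.0 - lam) * inv + lam * reg


rng = random.Random(0)
gaussian = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(256)]
collapsed = [[0.0] * 8 for _ in range(256)]
gap_gaussian = sigreg(gaussian)
gap_collapsed = sigreg(collapsed)  # collapse is penalized: larger gap

batch = [[[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)] for _ in range(32)]
loss = lejepa_loss(batch, n_global=2, lam=0.05)  # lam value is illustrative only
```

In this sketch a fully collapsed batch (all embeddings equal) scores a much larger gap than a batch drawn from a standard Gaussian, which is the sense in which the regularizer discourages collapse.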

Results

Pretrained on AudioSet‑2M for 57 epochs and fine-tuned per dataset, AV‑JEPA is, to our knowledge, the first JEPA-based method to reach this level of audio-visual classification accuracy.

Audio-visual classification, VGGSound top-1 (%)
Method | Type | Pre-train | Epochs | Eval | Top-1
AS-2M → VGGSound fine-tune (ours)
AV‑JEPA | JEPA | AS-2M | 57+13 | FT (attentive) | 57.1
AV‑JEPA | JEPA | AS-2M | 57+13 | FT (linear) | 56.6
Controlled VGGSound-only (ours)
AV‑JEPA | JEPA | VGGS | 50+6 | FT | 49.8
AV‑JEPA | JEPA | VGGS | 50 | Attentive (frozen) | 48.6
AV‑JEPA | JEPA | VGGS | 50 | Linear (frozen) | 46.0
Literature (MAE-based)
MAViL | MAE | AS-2M+IN | 80+60 | FT | 67.1
CAV-MAE | MAE | AS-2M | 25+10 | FT | 65.4
AV-MAE | MAE | VGGS | 800+50 | FT | 63.5
CAV-MAE Sync | MAE | AS-2M | 25 | Linear (frozen) | 52.7

MAE baselines rely on reconstruction decoders and contrastive objectives; AV-MAE pretrains for up to 800 epochs and MAViL adds an ImageNet-pretrained visual encoder.

Audio-visual mAP on the AudioSet eval split
Method | Type | Pre-train | Eval | AS-2M | AS-20k
End-to-end fine-tuning (ours)
AV‑JEPA | JEPA | AS-2M | A+V | 32.7 | 29.6
AV‑JEPA | JEPA | AS-2M | Audio-only | 26.0 | 23.7
AV‑JEPA | JEPA | AS-2M | Video-only | 12.8 | 10.3
Baselines (end-to-end fine-tuning)
AV-MAE | MAE | AS-2M | A+V | 47.3
CAV-MAE | MAE | AS-2M | A+V | 51.2 | 42.0
MAViL | MAE | AS-2M+IN | A+V | 53.3 | 44.9
CAV-MAE Sync | MAE | AS-2M | A+V | 30.5*

*Linear probe. The per-modality breakdown (26.0 audio-only vs. 12.8 video-only) shows the learned representation is strongly audio-driven.

Zero-shot cross-modal retrieval

Each clip is encoded once per modality, with the other modality zeroed, and candidates are ranked by cosine similarity. Although the model is trained without a contrastive objective, R@1 in both directions is far above the chance level of 1/N (about 0.06% for VGGSound and 0.05% for AudioSet).

Recall@k (%) and median rank, balanced 5-per-class eval subsets
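Dataset | Direction | R@1 | R@5 | R@10 | Med. rank
VGGSound (N=1545) | A → V | 10.6 | 26.3 | 35.4 | 25
VGGSound (N=1545) | V → A | 10.2 | 27.4 | 36.9 | 24
AudioSet (N=2015) | A → V | 10.6 | 26.3 | 35.5 | 25
AudioSet (N=2015) | V → A | 11.2 | 26.1 | 35.9 | 25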
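The retrieval protocol (rank candidates by cosine similarity, then report Recall@k and median rank) can be sketched as follows. The function names and the toy 2-D embeddings are hypothetical, and ties are resolved in favor of the true match, which is an assumption rather than something the page specifies.

```python
import math
import statistics


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieval_metrics(queries, candidates, ks=(1, 5, 10)):
    """queries[i] and candidates[i] come from the same clip, for example
    its audio-only and video-only embeddings.
    Returns (Recall@k in percent, median rank of the true match)."""
    ranks = []
    for i, q in enumerate(queries):
        sims = [cosine(q, c) for c in candidates]
        # Rank of the true match: 1 + number of candidates scoring strictly higher.
        ranks.append(1 + sum(s > sims[i] for s in sims))
    recall = {k: 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return recall, statistics.median(ranks)


# Toy example with three clips; the values are made up for illustration.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.05]]
recall, median_rank = retrieval_metrics(audio, video, ks=(1, 2, 3))

# Chance-level R@1 in percent for the two evaluation subsets (1/N).
chance_r1 = {n: 100.0 / n for n in (1545, 2015)}
```

With one correct item among N candidates, a random ranking gives R@1 = 1/N, which is where the chance levels of roughly 0.06% (N=1545) and 0.05% (N=2015) come from.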

Cross-modal attention emerges for free

Last-layer audio→video attention on RGB frames (left) and video→audio attention on mel spectrograms (right). The model attends to the sound-producing region and to the sound's harmonic structure, with no localization supervision.

Audio-to-video attention on a guitar clip, concentrated on the guitar and the player's hands
Video-to-audio attention on the guitar mel spectrogram, following fundamentals and overtones
Audio-to-video attention on a flute clip, concentrated on the flute and the player's mouth
Video-to-audio attention on the flute mel spectrogram
Audio-to-video attention on a flying bird clip, concentrated on the bird's body
Video-to-audio attention on the bird mel spectrogram, following the wing-beat envelope
Cross-modal attention on three VGGSound test clips. Top to bottom: playing guitar, playing flute, flying bird.
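One plausible way to obtain an audio→video map like those described above is to take the block of the last-layer attention matrix whose queries are audio tokens and whose keys are video tokens. The sketch below makes several assumptions that the page does not confirm: attention is already averaged over heads, tokens are ordered as video tokens followed by audio tokens with no CLS token in between, video tokens are stored time-major and row-major, and the map is averaged over all audio queries. The function name `audio_to_video_map` is hypothetical.

```python
def audio_to_video_map(attn, n_video, n_audio, grid):
    """attn[q][k] is the attention weight from query token q to key token k.

    Returns a (T, H, W) nested list of attention mass per video token,
    averaged over audio queries and renormalized over video tokens only.
    """
    t, h, w = grid
    assert t * h * w == n_video, "grid must match the number of video tokens"
    audio_rows = attn[n_video:n_video + n_audio]  # queries = audio tokens
    scores = [sum(row[k] for row in audio_rows) / n_audio for k in range(n_video)]
    total = sum(scores)
    scores = [s / total for s in scores]
    return [[[scores[(ti * h + hi) * w + wi] for wi in range(w)]
             for hi in range(h)]
            for ti in range(t)]


# Toy example: 4 video tokens on a 1x2x2 grid and 2 audio tokens.
uniform = [1.0 / 6] * 6
attn = [
    uniform, uniform, uniform, uniform,  # video queries (unused here)
    [0.1, 0.5, 0.1, 0.1, 0.1, 0.1],      # audio query 0
    [0.1, 0.3, 0.2, 0.1, 0.2, 0.1],      # audio query 1
]
heat = audio_to_video_map(attn, n_video=4, n_audio=2, grid=(1, 2, 2))
```

For the full model, the 1568 video tokens would need a grid whose dimensions multiply to 1568 (for instance 8 x 14 x 14, which is a guess, not a stated configuration). The video→audio map follows by swapping the roles of the two token ranges.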

The encoder localizes sounding objects

DINOv2‑style feature PCA of last-layer video patch tokens, with one basis fit per class (rows) and the top three components mapped to RGB. The sounding object takes a consistent color across clips and frames, despite training with clip-level objectives only.

Feature PCA on four piano clips: the keyboard takes a consistent color across clips and frames
Feature PCA on four bass guitar clips: the instrument body and neck take a consistent color
Feature PCA on four violin clips: the violin takes a consistent color
Feature PCA of video patch tokens. Left to right: playing piano, playing bass guitar, playing violin.

A semantically structured embedding space

t-SNE of CLS embeddings for 11,143 VGGSound clips from 60 classes, grouped into six semantic families. Clips form compact, well-separated clusters that respect the family grouping, even though pretraining never sees class labels.

Semantic structure of the embedding space. The residual overlap falls mainly between acoustically related families such as animal calls and human voice.

BibTeX

@inproceedings{robson2026avjepa,
  title     = {{AV-JEPA}: Extending {LeJEPA} to Audio-Visual Self-Supervised Learning},
  author    = {Robson, Benjamin and Mentu, Santeri and Zhao, Wenshuai and Solin, Arno},
  booktitle = {ICML 2026 Workshop on Machine Learning for Audio},
  year      = {2026}
}