V-JEPA 2 Collection A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of V-JEPA (https://ai.meta.com/blog/v-jepa-yann) • 8 items • Updated Jun 13 • 177
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm Paper • 2511.04570 • Published Nov 6 • 210
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation Paper • 2510.01284 • Published Sep 30 • 34
Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction Paper • 2510.03117 • Published Oct 3 • 11
Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding? Paper • 2505.14321 • Published May 20 • 11
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization Paper • 2503.23377 • Published Mar 30 • 57
BSharedRAG: Backbone Shared Retrieval-Augmented Generation for the E-commerce Domain Paper • 2409.20075 • Published Sep 30, 2024 • 2
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering Paper • 2503.16867 • Published Mar 21 • 11
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models Paper • 2410.02740 • Published Oct 3, 2024 • 54