AI Research Roundup: Oct 2025's Top Papers
Hey everyone! Check out the latest and greatest in AI research from October 30, 2025. This week, we're diving into cutting-edge papers on video understanding, world models, and more. If you want an even better reading experience and access to more papers, make sure to check out the GitHub page. Let's get started!
Video Understanding
In the realm of video understanding, several exciting papers have emerged, pushing the boundaries of what machines can comprehend from visual data. From temporal dynamics to multimodal reasoning, these studies explore various facets of video analysis. Let's dive in!
StreamingCoT: Temporal Dynamics and Multimodal Chain-of-Thought Reasoning
StreamingCoT introduces a novel dataset designed for temporal dynamics and multimodal chain-of-thought reasoning in streaming VideoQA. This research addresses the challenge of understanding videos in real time, requiring models to process information sequentially and integrate multiple modalities. The dataset facilitates the development of systems that can not only answer questions about video content but also reason through complex temporal relationships and multimodal cues. This is a crucial step toward building more intelligent and responsive video analysis systems.
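To make the setting concrete, here is a minimal sketch of what a streaming VideoQA sample with chain-of-thought annotations might look like. The field names and example content are assumptions for illustration, not the actual StreamingCoT schema.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for a streaming VideoQA sample with
# chain-of-thought annotations -- field names are illustrative only.
@dataclass
class StreamingCoTSample:
    video_id: str
    question: str
    answer: str
    observed_until: float        # seconds of video seen so far (streaming setting)
    # Intermediate reasoning steps, each grounded in a (start, end) time span.
    cot_steps: list[tuple[str, float, float]] = field(default_factory=list)

sample = StreamingCoTSample(
    video_id="clip_0001",
    question="Why does the goalkeeper dive to the left?",
    answer="The striker's hips open toward the left post before the kick.",
    observed_until=12.4,
    cot_steps=[
        ("Striker approaches the ball at an angle.", 10.1, 11.0),
        ("Hips rotate toward the left post.", 11.0, 11.8),
        ("Goalkeeper reacts and dives left.", 11.8, 12.4),
    ],
)
print(sample.cot_steps[-1])
```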
Video-LMM Post-Training: Deep Dive into Video Reasoning with Large Multimodal Models
This paper provides a deep dive into video reasoning with large multimodal models through Video-LMM post-training. The authors explore techniques to enhance the video understanding capabilities of these models, focusing on how they process and reason about visual information. The work offers insights into the architecture and training strategies that enable more accurate and nuanced video reasoning, making it a valuable guide for researchers and practitioners developing next-generation video understanding systems.
VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning
VideoTG-R1 presents a novel approach to boosting video temporal grounding by applying curriculum reinforcement learning to reflected boundary annotations. The method enhances a model's ability to identify the specific moments or segments within a video that correspond to a given query. By training with a curriculum that gradually increases the difficulty of the grounding tasks, the model learns to focus on the most relevant temporal boundaries, improving accuracy. This technique is a meaningful step toward more efficient and precise video search and retrieval.
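As a rough illustration of the curriculum idea, the sketch below orders training clips from easy to hard and gradually widens the pool the model sees. The "boundary_margin" difficulty signal is invented for the example; VideoTG-R1's actual curriculum is driven by its reflected boundary annotations and RL rewards.

```python
import random

# Minimal curriculum-sampling sketch: easy clips first, harder clips later.
def curriculum_batches(samples, difficulty, epochs, batch_size=4, seed=0):
    rng = random.Random(seed)
    ranked = sorted(samples, key=difficulty)              # easy -> hard
    for epoch in range(1, epochs + 1):
        # Expose a growing fraction of harder clips as training proceeds.
        cutoff = max(batch_size, len(ranked) * epoch // epochs)
        pool = ranked[:cutoff]
        rng.shuffle(pool)
        for i in range(0, len(pool), batch_size):
            yield epoch, pool[i:i + batch_size]

clips = [{"id": i, "boundary_margin": m}
         for i, m in enumerate([0.9, 0.2, 0.5, 0.7, 0.1, 0.4, 0.8, 0.3])]
for epoch, batch in curriculum_batches(
        clips, difficulty=lambda c: 1.0 - c["boundary_margin"], epochs=2):
    print(epoch, [c["id"] for c in batch])
```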
Evaluation of Vision-LLMs in Surveillance Video
This paper evaluates the performance of Vision-LLMs on surveillance video. Accepted as a poster at the NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI, the study examines how well these models can interpret and reason about complex surveillance footage. The findings highlight the strengths and limitations of current Vision-LLMs in this critical application area and point to directions for future research. Models that can process surveillance video reliably could have a significant impact on security and safety applications.
VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding
VideoHallu focuses on evaluating and mitigating multi-modal hallucinations in synthetic video understanding. Hallucinations, where models generate incorrect or nonsensical information, pose a significant challenge in AI. This research identifies the causes of these hallucinations and proposes methods to reduce their occurrence, thereby improving the reliability of video understanding systems. The work is essential for ensuring that AI-driven video analysis can be trusted in real-world scenarios.
MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning
MECD+ introduces a novel method for unlocking event-level causal graph discovery for video reasoning. Accepted by IEEE TPAMI, this work delves into the complex relationships between events in videos, enabling models to understand not just what is happening but why. By constructing causal graphs, the system can infer underlying causes and effects, leading to a deeper comprehension of video content. This approach provides a powerful tool for video analysis and interpretation.
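To give a feel for what an event-level causal graph buys you, here is a toy sketch in the spirit of the idea: nodes are video events, directed edges mean "is a cause of", and a "why" question becomes a traversal over causal ancestors. The events and edges are invented for illustration, not drawn from the paper's data.

```python
import networkx as nx

# Toy event-level causal graph: nodes are events, edges mean "is a cause of".
G = nx.DiGraph()
events = {
    "e1": "ball rolls into the street",
    "e2": "child runs after the ball",
    "e3": "driver brakes hard",
    "e4": "car stops before the crosswalk",
}
G.add_nodes_from(events)
G.add_edges_from([("e1", "e2"), ("e2", "e3"), ("e3", "e4")])

# "Why did the car stop?" -> trace every causal ancestor of e4.
causes = nx.ancestors(G, "e4")
print(sorted(events[c] for c in causes))
```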
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning
Video-Skill-CoT presents a skill-based chain-of-thoughts approach for domain-adaptive video reasoning. This method aims to enhance the ability of models to generalize across different video domains by breaking down complex reasoning tasks into a series of smaller, skill-based steps. By learning these skills, the model can adapt more effectively to new and unseen video content. Visit the project website for more details: https://video-skill-cot.github.io/.
MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence
Accepted to NeurIPS 2025 D&B Track, MUVR introduces a multi-modal untrimmed video retrieval benchmark with multi-level visual correspondence. This benchmark challenges models to retrieve relevant video segments from untrimmed videos using multi-modal queries. The multi-level visual correspondence aspect ensures that the system can match visual elements at different granularities, from individual objects to entire scenes. MUVR sets a high standard for video retrieval systems, pushing the field towards more accurate and comprehensive solutions.
Two Causally Related Needles in a Video Haystack
Accepted to the NeurIPS 2025 D&B Track, this paper addresses the challenge of finding two causally related events in a video haystack. The task requires models to locate two events within a long video and recognize the causal relationship between them. This research is crucial for applications where understanding cause and effect is paramount, such as incident analysis and forensic investigations.
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
InfiniPot-V, presented at NeurIPS 2025, explores memory-constrained KV cache compression for streaming video understanding. The method aims to reduce the memory footprint of video processing systems, making it possible to analyze long videos in real-time without exceeding memory limits. This is particularly important for resource-constrained environments and applications that require continuous video analysis.
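The sketch below illustrates the budget-enforcement loop behind this kind of system: a KV cache that, once it exceeds its token budget, keeps only the entries it scores as most important. The key-norm scoring rule here is a placeholder assumption; InfiniPot-V's actual compression criterion differs.

```python
import torch

# Minimal memory-bounded KV cache sketch for streamed video tokens.
class BoundedKVCache:
    def __init__(self, budget_tokens: int, dim: int = 64):
        self.budget = budget_tokens
        self.keys = torch.empty(0, dim)
        self.values = torch.empty(0, dim)

    def append(self, k: torch.Tensor, v: torch.Tensor):
        self.keys = torch.cat([self.keys, k], dim=0)
        self.values = torch.cat([self.values, v], dim=0)
        if self.keys.shape[0] > self.budget:
            scores = self.keys.norm(dim=-1)               # proxy importance score
            keep = scores.topk(self.budget).indices.sort().values
            self.keys, self.values = self.keys[keep], self.values[keep]

cache = BoundedKVCache(budget_tokens=256)
for _ in range(10):                                       # 10 streamed chunks
    cache.append(torch.randn(40, 64), torch.randn(40, 64))
print(cache.keys.shape)                                   # stays at (256, 64)
```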
HRT1: One-Shot Human-to-Robot Trajectory Transfer for Mobile Manipulation
HRT1 introduces a one-shot human-to-robot trajectory transfer method for mobile manipulation. The system allows robots to learn complex manipulation tasks from a single human demonstration, significantly reducing the amount of training data required and making it easier to deploy robots in real-world environments. For more details, check out the project page: https://irvlutd.github.io/HRT1/.
SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding
SeViCES proposes a method for unifying semantic-visual evidence consensus for long video understanding. The technique combines information from both semantic and visual modalities to improve the accuracy of video analysis. By integrating these different sources of evidence, the system can achieve a more comprehensive understanding of long video sequences.
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
Open-o3 Video focuses on grounded video reasoning with explicit spatio-temporal evidence. The approach aims to improve the interpretability of video understanding systems by making the reasoning process more transparent. By explicitly representing spatio-temporal relationships, the system can provide clear explanations for its conclusions. This is essential for building trust in AI-driven video analysis.
Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Conan introduces a progressive learning method that allows models to reason like a detective over multi-scale visual evidence. The system gradually learns to integrate visual information at different scales, enabling it to solve complex reasoning tasks. The approach is inspired by detective work, where clues are gathered and analyzed to uncover the truth.
PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling
Accepted by NeurIPS 2025, PreFM presents a method for online audio-visual event parsing via predictive future modeling. The system predicts future events based on current audio and visual inputs, enabling it to understand videos in real time. This predictive capability is crucial for applications such as autonomous driving and human-robot interaction.
World Models
World models are a hot topic in AI research, aiming to create systems that can understand and predict the dynamics of the world around them. These models are essential for tasks such as robotics, autonomous driving, and reinforcement learning. Let's explore the latest advancements in this exciting field.
Off-policy Reinforcement Learning with Model-based Exploration Augmentation
This paper explores off-policy reinforcement learning with model-based exploration augmentation. The approach aims to improve the efficiency of reinforcement learning by using a world model to guide exploration: by leveraging the model's understanding of the environment, the agent can make more informed decisions about which actions to take, leading to faster learning.
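A rough sketch of the general idea is below: augment an off-policy replay buffer with short "imagined" rollouts from a learned dynamics model so the agent sees states it has not yet visited. The linear dynamics model, toy policy, and zero-reward placeholder are all assumptions for illustration; the paper's exploration-augmentation scheme is richer than this.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics_model(state, action):
    # Stand-in one-step model: small drift in the action direction plus noise.
    return state + 0.1 * action + 0.01 * rng.normal(size=state.shape)

def augment_buffer(buffer, policy, model, n_starts=32, horizon=3):
    start_ids = rng.choice(len(buffer), size=n_starts, replace=False)
    for idx in start_ids:
        state = buffer[idx][0]
        for _ in range(horizon):
            action = policy(state)
            next_state = model(state, action)
            buffer.append((state, action, 0.0, next_state))   # reward unknown here
            state = next_state
    return buffer

state_dim = 4
real_buffer = [(rng.normal(size=state_dim), rng.normal(size=state_dim), 1.0,
                rng.normal(size=state_dim)) for _ in range(100)]
toy_policy = lambda s: np.tanh(s)                              # exploratory stand-in
augmented = augment_buffer(real_buffer, toy_policy, dynamics_model)
print(len(augmented))                                          # 100 real + 96 imagined
```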
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
FutureSightDrive introduces a method for thinking visually with spatio-temporal CoT for autonomous driving. Accepted to NeurIPS 2025 as a Spotlight Presentation, the approach enhances the ability of autonomous vehicles to anticipate future events by reasoning about spatio-temporal relationships. This is crucial for making safe and effective driving decisions. Check out the code: https://github.com/MIV-XJTU/FSDrive.
Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models
This paper presents a vision-centric approach to 4D occupancy forecasting and planning via implicit residual world models. The method aims to predict the future occupancy of a scene based on visual inputs, enabling robots and autonomous systems to plan their actions more effectively. By using implicit residual models, the system can capture complex dynamics and uncertainties in the environment.
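The core residual intuition can be sketched very simply: rather than predicting the full future occupancy grid, a small network predicts only the change from the current grid. The shapes, the 3D convolution head, and the clamp below are placeholders; the paper's implicit residual world model is considerably more involved.

```python
import torch
import torch.nn as nn

class ResidualOccupancyForecaster(nn.Module):
    def __init__(self, channels=8):
        super().__init__()
        self.residual_head = nn.Sequential(
            nn.Conv3d(channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(16, channels, kernel_size=3, padding=1),
        )

    def forward(self, occupancy):                      # (B, C, X, Y, Z) in [0, 1]
        delta = self.residual_head(occupancy)          # predicted change only
        return (occupancy + delta).clamp(0.0, 1.0)     # next-step occupancy

model = ResidualOccupancyForecaster()
current = torch.rand(1, 8, 16, 16, 4)                  # toy voxel grid
print(model(current).shape)                            # torch.Size([1, 8, 16, 16, 4])
```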
AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians
AtlasGS introduces a technique for Atlanta-world guided surface reconstruction with implicit structured Gaussians. The method aims to create detailed 3D reconstructions of urban environments, which is essential for applications such as mapping and urban planning. The paper will be presented at NeurIPS 2025. Project page: https://zju3dv.github.io/AtlasGS/.
Evolving Diagnostic Agents in a Virtual Clinical Environment
This research focuses on evolving diagnostic agents in a virtual clinical environment. The goal is to develop AI systems that can diagnose diseases and recommend treatments, providing valuable support to healthcare professionals. By training agents in a virtual environment, the system can learn from a wide range of scenarios without the risks associated with real-world clinical settings.
Dual-Mind World Models: A General Framework for Learning in Dynamic Wireless Networks
This paper presents dual-mind world models, a general framework for learning in dynamic wireless networks. The approach aims to improve the performance of wireless communication systems by using world models to predict network conditions and optimize resource allocation. By adapting to the dynamic nature of wireless networks, the system can achieve higher throughput and reliability.
Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning
Multimodal Dreaming explores a global workspace approach to world model-based reinforcement learning. Under review, this method aims to enhance the learning process by integrating information from multiple modalities, such as vision and language. By simulating scenarios in a dream-like state, the agent can explore potential actions and their consequences, leading to more effective learning.
Affordance Representation and Recognition for Autonomous Agents
This paper focuses on affordance representation and recognition for autonomous agents. Affordances are the potential actions that an agent can perform in an environment, such as grasping an object or navigating a space. By learning to recognize affordances, autonomous agents can interact more effectively with the world around them.
LongCat-Video Technical Report
This technical report on LongCat-Video details the architecture and performance of a system designed for long-form video analysis. The report provides insights into the challenges of processing long videos and the techniques used to overcome them.
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
This paper presents a comprehensive survey on general world models and beyond, asking the question: Is Sora a world simulator? The survey provides an overview of the current state of world models, their capabilities, and their limitations. This survey will be regularly updated at: https://github.com/GigaAI-research/General-World-Models-Survey.
Human Machine Social Hybrid Intelligence: A Collaborative Decision Making Framework
This research introduces Human Machine Social Hybrid Intelligence, a collaborative decision-making framework for groups of large-model agents and human experts. The framework aims to leverage the strengths of both AI and human intelligence, enabling more effective decision-making in complex scenarios.
COMPASS: Cross-embodiment Mobility Policy via Residual RL and Skill Synthesis
COMPASS presents a cross-embodiment mobility policy built via residual reinforcement learning and skill synthesis. The method aims to let mobility skills learned with one robot embodiment transfer to other platforms, a crucial step toward making robots more versatile and adaptable.
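The residual-RL part of the recipe is easy to picture: a frozen base policy proposes an action and a small trainable network adds a correction for the target embodiment. The sketch below is a generic residual policy head under assumed dimensions, not COMPASS's actual architecture.

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    def __init__(self, base_policy: nn.Module, obs_dim: int, act_dim: int):
        super().__init__()
        self.base = base_policy.eval()
        for p in self.base.parameters():
            p.requires_grad_(False)                    # keep the base policy frozen
        self.residual = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )

    def forward(self, obs):
        with torch.no_grad():
            base_action = self.base(obs)
        # Small, bounded correction on top of the base action.
        return base_action + 0.1 * torch.tanh(self.residual(obs))

base = nn.Sequential(nn.Linear(12, 2), nn.Tanh())      # stand-in pretrained policy
policy = ResidualPolicy(base, obs_dim=12, act_dim=2)
print(policy(torch.randn(4, 12)).shape)                # torch.Size([4, 2])
```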
Deductive Chain-of-Thought Augmented Socially-aware Robot Navigation World Model
This paper introduces a deductive chain-of-thought augmented socially-aware robot navigation world model. The system combines chain-of-thought reasoning with social awareness, enabling robots to navigate complex environments while interacting safely and effectively with humans.
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training
World-Env explores leveraging a world model as a virtual environment for post-training vision-language-action (VLA) models. The approach aims to improve VLA performance by training in a simulated environment that closely mirrors the real world.
Deep Active Inference with Diffusion Policy and Multiple Timescale World Model
This research presents deep active inference with a diffusion policy and a multiple-timescale world model for real-world exploration and navigation. The system combines active inference, a theory of how the brain controls behavior, with diffusion policies and world models, enabling robots to explore and navigate complex environments effectively.
Multimodal Learning
Multimodal learning focuses on integrating information from multiple sources, such as vision, language, and audio, to build more robust and versatile AI systems. This approach is crucial for tasks that require a comprehensive understanding of the world. Let's take a look at the latest research in this area.
Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
This paper provides a survey and benchmarks for multimodal spatial reasoning in the large model era. The study examines how well large models can reason about spatial relationships using information from multiple modalities. The findings highlight the strengths and limitations of current models, providing guidance for future research.
SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning
SMMILE introduces an expert-driven benchmark for multimodal medical in-context learning. Presented at NeurIPS 2025 (Datasets & Benchmarks Track), the benchmark aims to evaluate the ability of AI systems to learn from medical data, such as images and text, in a clinical context. The benchmark is designed to be challenging and realistic, providing a valuable tool for assessing the performance of AI in healthcare.
Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies
This paper explores multimodal recurrent ensembles for predicting brain responses to naturalistic movies (Algonauts 2025). The research investigates how well AI models can predict human brain activity from multimodal inputs, such as the visual and auditory streams of movies. The work was an invited report for the CCN 2025 Algonauts Project session (3rd-place team). Code: https://github.com/erensemih/Algonauts2025_ModalityRNN.
CGM-Led Multimodal Tracking with Chatbot Support: An Autoethnography in Sub-Health
This research presents an autoethnography in sub-health, focusing on continuous glucose monitoring (CGM)-led multimodal tracking with chatbot support. The study examines how wearable sensors and chatbots can be used to track and manage health conditions.
Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space
Open3D-VQA introduces a benchmark for comprehensive spatial reasoning with multimodal large language models in open space. The benchmark challenges models to answer questions about 3D scenes using information from multiple modalities, such as vision and language.
StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning
StreamingCoT also appears in this category; see the Video Understanding section above for the full summary of this streaming VideoQA dataset for temporal dynamics and multimodal chain-of-thought reasoning.
MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Accepted by SenSys 2026, MMEdge explores accelerating on-device multimodal inference via pipelined sensing and encoding. The method aims to improve the efficiency of multimodal AI systems on edge devices, making it possible to process data in real-time without relying on cloud resources. This is essential for applications such as mobile computing and IoT.
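The pipelining idea itself is simple to demonstrate: while one chunk of sensor data is being encoded, the next chunk is already being captured, so the two stages overlap instead of running back to back. The toy producer/consumer sketch below uses invented timings and stage names; MMEdge's actual pipeline is finer-grained.

```python
import queue
import threading
import time

chunks = queue.Queue(maxsize=4)

def sense(n_chunks=6):
    for i in range(n_chunks):
        time.sleep(0.05)                  # pretend sensor capture
        chunks.put(f"chunk-{i}")
    chunks.put(None)                      # end-of-stream marker

def encode():
    while True:
        item = chunks.get()
        if item is None:
            break
        time.sleep(0.08)                  # pretend feature encoding
        print("encoded", item)

t_sense, t_encode = threading.Thread(target=sense), threading.Thread(target=encode)
start = time.time()
t_sense.start(); t_encode.start()
t_sense.join(); t_encode.join()
print(f"pipelined wall time: {time.time() - start:.2f}s")  # < 6 * (0.05 + 0.08)
```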
Teaching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student
This paper focuses on teaching sarcasm: few-shot multimodal sarcasm detection via distillation to a parameter-efficient student. The research investigates how to train AI systems to detect sarcasm from limited training data. Sarcasm detection is a challenging task that requires models to understand both the literal meaning of words and the context in which they are used.
NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation
Accepted by NeurIPS 2025, NoisyGRPO presents a method for incentivizing multimodal CoT reasoning via noise injection and Bayesian estimation. The approach aims to improve the robustness of multimodal AI systems by training them to handle noisy inputs. Project page: https://artanic30.github.io/project_pages/NoisyGRPO/.
Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection
This research explores a mixture of dual reasoning experts for multimodal sarcasm detection, encouraging models to think twice before making a judgment. The approach combines different reasoning strategies, enabling the system to better understand the nuances of sarcastic language.
Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning
This paper focuses on quantifying multimodal imbalance and proposes a GMM-guided adaptive loss for audio-visual learning. The method aims to address the challenges of training AI systems on datasets where the different modalities are not equally represented.
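As a rough, assumption-heavy illustration of how a Gaussian mixture could guide per-modality loss weights, the sketch below fits a two-component GMM to recent gaps between audio and visual losses and shifts weight toward whichever modality the dominant component says is lagging. This is not the paper's exact loss, only a picture of the general mechanism.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_modality_weights(audio_losses, visual_losses):
    # Per-sample gap between the two modality losses.
    gaps = (np.asarray(audio_losses) - np.asarray(visual_losses)).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(gaps)
    dominant_mean = gmm.means_[np.argmax(gmm.weights_)].item()
    w_audio = 1.0 / (1.0 + np.exp(-dominant_mean))     # positive gap -> boost audio
    return w_audio, 1.0 - w_audio

audio_hist = [2.1, 1.9, 2.3, 2.0, 1.8, 2.2]            # audio branch struggling
visual_hist = [0.9, 1.0, 0.8, 1.1, 0.9, 1.0]
w_a, w_v = gmm_modality_weights(audio_hist, visual_hist)
combined_loss = w_a * audio_hist[-1] + w_v * visual_hist[-1]
print(round(w_a, 2), round(w_v, 2), round(combined_loss, 2))
```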
InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts
InfoChartQA introduces a benchmark for multimodal question answering on infographic charts. The benchmark challenges models to answer questions about charts using information from both visual and textual elements.
Adapter-state Sharing CLIP for Parameter-efficient Multimodal Sarcasm Detection
This research explores adapter-state sharing CLIP for parameter-efficient multimodal sarcasm detection. The method aims to reduce the computational cost of training multimodal AI systems by sharing adapter parameters across the different modalities.
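The parameter-sharing idea can be sketched in a few lines: one small bottleneck adapter is reused for both the image branch and the text branch of a frozen CLIP-style encoder pair, so only the adapter is trained. The dimensions and the fusion step below are placeholders, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SharedAdapter(nn.Module):
    def __init__(self, dim=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual bottleneck adapter

adapter = SharedAdapter()
image_feat = torch.randn(4, 512)                        # frozen image encoder output
text_feat = torch.randn(4, 512)                         # frozen text encoder output
fused = adapter(image_feat) * adapter(text_feat)        # same weights, both modalities
print(sum(p.numel() for p in adapter.parameters()))     # ~66k trainable parameters
```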
LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
LightBagel introduces a lightweight, double-fusion framework for unified multimodal understanding and generation. Note that the preprint has since been withdrawn because the submission was premature and not agreed upon by all collaborators.
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
Accepted at NeurIPS 2025, this paper presents a unified multimodal chain-of-thought reward model trained through reinforcement fine-tuning. Project page: https://codegoat24.github.io/UnifiedReward/think.
Multimodal LLM
Multimodal Large Language Models (LLMs) are a cutting-edge area of research, combining the power of LLMs with the ability to process multiple modalities, such as vision and audio. These models are capable of performing a wide range of tasks, from image captioning to video understanding. Let's delve into the latest developments in this field.
NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables
Accepted by NeurIPS 2025, NeedleInATable explores the long-context capability of large language models towards long-structured tables. The research investigates how well LLMs can process and reason about large amounts of structured data, such as tables. This is a crucial capability for applications such as data analysis and knowledge management.
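To make the evaluation setup concrete, here is a tiny "needle in a table" probe in the spirit of the benchmark: build a large synthetic table, hide one target cell, and ask the model under test to retrieve it. The prompt template and table contents are invented for illustration, not taken from the benchmark.

```python
import csv
import io
import random

def make_probe(n_rows=200, n_cols=8, seed=0):
    rng = random.Random(seed)
    needle_row, needle_col = rng.randrange(n_rows), rng.randrange(1, n_cols)
    needle_value = f"NEEDLE-{rng.randrange(10_000)}"
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id"] + [f"col_{c}" for c in range(1, n_cols)])
    for r in range(n_rows):
        row = [f"row_{r}"] + [rng.randrange(1000) for _ in range(1, n_cols)]
        if r == needle_row:
            row[needle_col] = needle_value              # hide the target cell
        writer.writerow(row)
    question = f"What is the value of col_{needle_col} in row_{needle_row}?"
    return buf.getvalue(), question, needle_value

table_csv, question, answer = make_probe()
prompt = f"{table_csv}\n\n{question}"                  # feed this to the LLM under test
print(question, "->", answer)
```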
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes
This paper presents From Objects to Anywhere, a holistic benchmark for multi-level visual grounding in 3D scenes. Accepted to the NeurIPS 2025 Datasets and Benchmarks track, the latest revision adds evaluations of state-of-the-art multimodal large language models. Project page: https://anywhere-3d.github.io/.
Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier
This research focuses on emotion-coherent reasoning for multimodal LLMs via an emotional rationale verifier. The approach aims to improve the ability of LLMs to understand and respond to emotions in multimodal contexts.
FairJudge: MLLM Judging for Social Attributes and Prompt Image Alignment
FairJudge introduces MLLM judging for social attributes and prompt image alignment. The research investigates how to ensure that multimodal LLMs are fair and unbiased in their judgments, particularly when dealing with social attributes such as gender and race.
LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models
This paper presents LUQ, a method for layerwise ultra-low bit quantization for multimodal large language models. The approach aims to reduce the computational cost and memory footprint of LLMs, making them more accessible for a wider range of applications.
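For intuition, here is a minimal per-layer low-bit quantization sketch: symmetric uniform quantization of one layer's weights with a single per-layer scale, plus the dequantize step. LUQ's layerwise bit allocation and calibration are more elaborate; this only shows the basic round trip.

```python
import torch

def quantize_layer(weight: torch.Tensor, bits: int = 2):
    qmax = 2 ** (bits - 1) - 1                          # e.g. 1 for 2-bit
    scale = weight.abs().max() / qmax                   # per-layer scale
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize_layer(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(256, 256)
q, scale = quantize_layer(w, bits=2)
w_hat = dequantize_layer(q, scale)
print(q.unique().tolist())                              # distinct 2-bit levels used
print(f"reconstruction mse: {torch.mean((w - w_hat) ** 2).item():.4f}")
```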
EasyUUV: An LLM-Enhanced Universal and Lightweight Sim-to-Real Reinforcement Learning Framework for UUV Attitude Control
EasyUUV introduces an LLM-enhanced, universal, and lightweight sim-to-real reinforcement learning framework for UUV attitude control. The system aims to improve the autonomy of underwater vehicles by using LLMs to enhance their learning and decision-making capabilities.
Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning
This research explores evaluating multimodal LLMs on tool-enabled image perception, transformation, and reasoning. The study investigates how well LLMs can use external tools to process and reason about images, pushing the boundaries of what these models can achieve.
Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study
Accepted as a NeurIPS 2025 Spotlight, this paper explores physics-informed spatial intelligence with human priors through an autonomous driving pilot study. The work examines how human priors and physics-informed constraints can strengthen the spatial reasoning abilities of models in a driving setting.
L2M3OF: A Large Language Multimodal Model for Metal-Organic Frameworks
L2M3OF introduces a large language multimodal model for metal-organic frameworks. The system aims to accelerate the discovery of new materials by using AI to predict their properties and performance.
Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs
This paper presents continuous-token diffusion for speaker-referenced TTS in multimodal LLMs. The approach aims to improve the quality and naturalness of text-to-speech synthesis by using diffusion models to generate speech tokens continuously.
Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations
Empathic Prompting focuses on non-verbal context integration for multimodal LLM conversations. The research investigates how to incorporate non-verbal cues, such as facial expressions and body language, into LLM conversations, making them more natural and engaging.
EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence
EmbodiedBrain explores expanding performance boundaries of task planning for embodied intelligence. The research aims to improve the ability of AI systems to plan and execute complex tasks in physical environments.
MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
MCIF introduces a multimodal crosslingual instruction-following benchmark built from scientific talks. The data is available at https://huggingface.co/datasets/FBK-MT/MCIF, and evaluation code and baselines are available at https://github.com/hlt-mt/mcif.
Merge then Realign: Simple and Effective Modality-Incremental Continual Learning for Multimodal LLMs
Accepted to the EMNLP 2025 main conference, this paper presents Merge then Realign, a simple and effective approach to modality-incremental continual learning for multimodal LLMs.
DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents
DaMo introduces a data mixing optimizer for fine-tuning multimodal LLMs as mobile phone agents. The approach aims to improve agent performance by optimizing how training data is mixed during fine-tuning.
Video Foundation Models
Video Foundation Models are large-scale models trained on vast amounts of video data, capable of performing a wide range of video-related tasks. These models are revolutionizing the field of video analysis and generation. Let's explore the latest advancements in this area.
GenLit: Reformulating Single-Image Relighting as Video Generation
GenLit reformulates single-image relighting as video generation. The research investigates how to use video generation techniques to improve the realism of image relighting, making it possible to change the lighting conditions of an image in a convincing way.
Breakdance Video classification in the age of Generative AI
This paper explores breakdance video classification in the age of generative AI. The study investigates how generative AI can be used to improve the accuracy and efficiency of video classification.
Advances in 4D Representation: Geometry, Motion, and Interaction
This research presents advances in 4D representation, spanning geometry, motion, and interaction. Project page: https://mingrui-zhao.github.io/4DRep-GMI/.
TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
TTOM introduces test-time optimization and memorization for compositional video generation. Project page: https://ttom-t2v.github.io/.
Inferring Dynamic Physical Properties from Video Foundation Models
This paper focuses on inferring dynamic physical properties from video foundation models. The research investigates how to use video foundation models to estimate physical properties, such as mass and friction, from videos.
Can World Models Benefit VLMs for World Dynamics?
This research asks the question: Can World Models Benefit VLMs for World Dynamics? Project page: https://dyva-worldlm.github.io.
FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
FantasyWorld explores geometry-consistent world modeling via unified video and 3D prediction. The research aims to build AI systems that can create realistic 3D models of the world based on video inputs.
Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation
Uni3C focuses on unifying precisely 3D-enhanced camera and human motion controls for video generation. Accepted to SIGGRAPH Asia 2025. Project page: https://github.com/ewrfcas/Uni3C.
Simplifying Traffic Anomaly Detection with Video Foundation Models
This paper explores simplifying traffic anomaly detection with video foundation models. Accepted at ICCVW 2025. Code: https://github.com/tue-mps/simple-tad.
Autoregressive Universal Video Segmentation Model
This research introduces an autoregressive universal video segmentation model. The model aims to segment videos into meaningful regions, such as objects and backgrounds.
ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
ToonComposer focuses on streamlining cartoon production with generative post-keyframing. Project Page: https://lg-li.github.io/project/tooncomposer.
SAGOnline: Segment Any Gaussians Online
SAGOnline presents a method to segment any Gaussians online.
TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction
This research introduces TRIBE, a TRImodal Brain Encoder for whole-brain fMRI response prediction.
SAMITE: Position Prompted SAM2 with Calibrated Memory for Visual Object Tracking
SAMITE presents a position-prompted SAM2 with calibrated memory for visual object tracking. The method augments SAM2 with position prompts and a calibrated memory to make video object tracking more robust.
SeqTex: Generate Mesh Textures in Video Sequence
This paper explores generating mesh textures in a video sequence, titled SeqTex.
Conclusion
Wow, what a week for AI research! From video understanding to world models and multimodal LLMs, the field is advancing at an incredible pace. These papers offer a glimpse into the future of AI, where machines understand the world around them with increasing accuracy and sophistication. Make sure you check out the GitHub page for even more papers and a better reading experience. Until next time, keep exploring!