Topic

Multimodal

331 related items · from the historical archive

Tips & Opinions · NEW

Yesterday

Nativ: Run AI models locally on your Mac

Prince Canuma is the developer behind the excellent MLX-VLM Python library for running vision-LLMs using MLX on a Mac. I'm really excited a…

Source: Simon Willison

Research Papers · NEW

Yesterday

Simple Domain Generalization for Strong Pixel-Level Image Tampering Detection in Modern VLMs

Modern vision-language models (VLMs) have significantly improved image generation and editing capabilities, making pixel-level image tampering detection increasingly important yet challenging under cr…

Source: arXiv

Research Papers · NEW

7/20 04:00

HOMIE: Human-object Centric Video Personalization via Multimodal Intelligent Enchancement

Human-object centric video personalization (HOCVP) is a core task within subject-driven video generation. However, existing methods suffer from two key limitations. First, most approaches focusing on…

Source: HuggingFace Papers

Research Papers · NEW

7/20 04:00

FlashRT: Agent Harness for Guiding Agents to Deploy Real-Time Multimodal Applications

Real-time multimodal applications, including voice agents and interactive video generation, compose heterogeneous models into pipelines whose efficient deployment requires application-specific decisio…

Source: HuggingFace Papers

Research Papers · NEW

7/19 04:00

TimeLens2: Generalist Video Temporal Grounding with Multimodal LLMs

Video multimodal large language models (MLLMs) can describe what happens in a video, but rarely identify when the supporting evidence occurs. We study generalist video temporal grounding, in which one…

Source: HuggingFace Papers

Industry News

7/18 16:54

Behind Doubao Video Calls: Volcano Engine Rebuilds the Multimodal Transport Foundation for the Agent Era

Source: InfoQ

Research Papers · NEW

7/18 04:00

Can Multimodal Large Language Models Understand OCT?

Optical coherence tomography (OCT) imaging is essential for the diagnosis and treatment of retinal diseases. Although multimodal large language models (MLLMs) have demonstrated considerable potential…

Source: HuggingFace Papers

Research Papers · NEW

7/18 01:11

ToolSciVer: Multimodal Scientific Claim Verification with Visual Tool Augmented Reinforcement Learning

Multimodal Scientific Claim Verification (MSCV) requires models to verify scientific claims using visually grounded evidence from papers, including figures, tables, charts, and textual context. Howeve…

Source: arXiv

Industry News

7/17 18:45

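Top 50 AI/Embodied-Intelligence Companies Most Watched by Investors in 2026 Announced

Artificial intelligence is entering a new industrial cycle. Over the past year, large-model capabilities have kept evolving, and technical directions such as generative AI, multimodal interaction, and agents have advanced rapidly; embodied intelligence, meanwhile, has moved from early technical exploration into the deep waters of industrial validation, with robots becoming an important vehicle connecting AI to the real world. The market was the first to respond. According to estimates by the 36Kr Research Institute, China's embodied-intelligence market grew from 213.3 billion yuan in 2018 to 915.0 billion yuan in 2025, and in 2026 is expected to exceed…

Source: 36Kr

Industry News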

7/17 15:22

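Tencent Releases Embodied VLM Foundation Model Hy-Embodied-VLM-1.0; A3B-Scale Overall Performance Approaches the Previous-Generation A32B Model

ITHome, July 17: Hy-Embodied-VLM-1.0, the second-generation embodied VLM foundation model built by Tencent Robotics X Lab and Futian Lab together with Tencent Hunyuan, was officially released yesterday. According to the company, in an embodied-capability evaluation suite covering 37 tasks, across the three dimensions of physical-state understanding, action-change reasoning, and temporal and adaptive reasoning, Hy-Embodied-VLM-1.0 scored 68.6, 6…

AI commentary · With parameter scale cut roughly tenfold and performance close to the previous-generation large model, embodied intelligence is moving toward efficient, practical use.

Source: ITHome

Industry News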

7/17 10:39

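36Kr Exclusive | HKUST PhD Founds Startup Building Whole-Body Tactile Systems for Robots, Backed by Sequoia, Lingzhi (瓴智), and AgiBot

By Qiao Yujie | Edited by Yuan Silai. Hard Kr (硬氪) has learned that MoSense (模感科技), a provider of whole-body multimodal fused tactile solutions, recently closed an angel round worth tens of millions of yuan, with investors including Sequoia China, Hillhouse Ventures, and AgiBot. The funds will mainly go toward accelerating R&D, expanding the team, investing in compute, and building a mass-production testing system. Founded in May 2026, MoSense is registered in Shanghai with an R&D center in Qianhai, Shenzhen, and focuses on whole-body multimodal tactile perception systems for robots. The company officially…

AI commentary · Sequoia, Hillhouse, and AgiBot are jointly betting on robot tactile sensing, a field with high technical barriers and broad application prospects.

Source: 36Kr

Research Papers · NEW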

7/17 04:00

S1-Omni: A Unified Multimodal Reasoning Model for Scientific Understanding, Prediction, and Generation

We present S1-Omni, a unified multimodal reasoning model for scientific understanding, prediction, and generation. AI for Science (AI4S) has advanced significantly through domain-specific models, tool…

Source: HuggingFace Papers

Research Papers · NEW

7/17 04:00

JoyNexus: Service-Oriented Multi-Tenant Post-Training for VLA Models

The post-training of Vision-Language-Action (VLA) models is essential due to the diversity of simulators, robot embodiments, and task objectives. Existing compute services, whether offered as direct a…

Source: HuggingFace Papers

Research Papers

7/17 01:38

Beyond the Leaderboard: Design Lessons for Trustworthy Multimodal VQA

Healthcare multimodal AI must combine visual and textual evidence while remaining reliable and interpretable. Using MediaEval Medico 2025 as a retrospective GI endoscopy case study, we analyze design…

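AI commentary · A step forward in reliability-oriented design for multimodal medical AI, using a GI endoscopy case study to show what matters for interpretability.

Source: arXiv

Research Papers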

7/17 01:37

TikStance: A Multimodal and Hierarchical Dataset for Multi-target Stance Analysis in TikTok Political Conversations

Political discourse has increasingly moved to short-video platforms, yet computational analysis of such content remains constrained by the scarcity of datasets that jointly preserve audiovisual inform…

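AI commentary · A multimodal dataset that fills the gap in political stance analysis for short videos and advances social media research.

Source: arXiv

Research Papers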

7/16 04:00

VIABench: A Comprehensive Video Benchmark Collected from Blind Individuals for Visual Impairment Assistance

Visually impaired individuals (VIIs) encounter significant daily challenges due to limited access to visual information. Although Multimodal Large Language Models (MLLMs) have achieved impressive resu…

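AI commentary · The first video benchmark designed specifically for blind users, helping bring AI assistance for visual impairment into practice.

Source: HuggingFace Papers

Research Papers · NEW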

7/16 04:00

RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources

Skills are a useful abstraction for software agents, turning human and agent experience into reusable procedural knowledge. Yet existing skill libraries are mostly hand-written, text-centric, or deriv…

Source: HuggingFace Papers

Research Papers · NEW

7/16 04:00

Xiaomi-Robotics-1: Scaling Vision-Language-Action Models with over 100K Hours of Real-World Trajectories

We present Xiaomi-Robotics-1, a foundational vision-language-action (VLA) model capable of (1) following diverse language instructions to perform a wide range of mobile manipulation tasks in unseen en…

Source: HuggingFace Papers

Research Papers

7/16 00:13

Multimodal Empirical Bayes Variational Autoencoders for Joint Longitudinal and Time-to-Event Modeling

Longitudinal tumor measurements, dropout information, and genetic covariates provide complementary information about treatment response, but integrating these data sources within a single population m…

Source: arXiv

Research Papers

7/15 01:24

FormalAnalyticGeo: A Neural-Symbolic Based Framework for Multimodal Analytic Geometry Problem Generation

Math reasoning has achieved significant progress with the rapid advancement of Multimodal Large Language Models (MLLMs), however analytic geometry remains largely underexplored, primarily due to the s…

Source: arXiv

Research Papers

7/14 04:00

Self in Space: Benchmarking Self-Awareness and Spatial Cognition in UAV Embodied Intelligence

Autonomous UAV systems increasingly rely on multimodal large language models (MLLMs) to operate in complex real-world environments. Such embodied scenarios require not only understanding the surroundi…

Source: HuggingFace Papers

Research Papers

7/14 04:00

Boogu-Image-0.1: Boosting Open-Source Unified Multimodal Understanding and Generation

We introduce Boogu-Image-0.1, an open-source unified multimodal understanding and generation model family, comprising Base, Turbo, Edit, and Edit-Turbo variants. It delivers competitive performance in…

Source: HuggingFace Papers

Research Papers

7/14 04:00

Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering

Despite the success of Vision-Language Models (VLMs), misleading charts remain a significant challenge due to their deceptive visual structures and distorted data representations. We present ChartCyni…

Source: HuggingFace Papers

Research Papers · NEW

7/14 04:00

ReflectWorld-MM: An Entity-Oriented Multimodal Memory System for Open-Ended Video Streams

Building assistants that can continually watch the world, remember what they see, and reason over their accumulated experience is a long-standing goal, and recently multimodal agents equipped with lon…

Source: HuggingFace Papers

Research Papers

7/14 01:27

LoRA-Based Cascaded Multimodal Fusion for Action Recognition in Medical Training Environments

This paper presents a cascaded Low-Rank Adaptation (LoRA)-based multimodal fusion framework for action and activity recognition in healthcare-oriented training environments. The proposed architecture…

Source: arXiv

Industry News

7/13 19:15

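UniRL: Optimizing a Unified Multimodal RL Framework for 2.4x End-to-End Performance | AICon Shenzhen

AI commentary · A substantial performance gain for a unified multimodal framework; the end-to-end optimization approach is worth watching.

Source: InfoQ

Industry News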

7/13 16:34

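ByteDance Explores Autonomous Driving, Led by Seed's World Model Team | 36Kr Exclusive

36Kr has learned from multiple industry sources that ByteDance is exploring an entry into autonomous driving. The project is currently run by the world model team under Zhou Chang at Seed. Seed reportedly houses not only Zhou Chang's multimodal-model and world-model teams but also a large language model track, and the technical approaches of autonomous driving and world models overlap. Another source told 36Kr that, on the business side, the autonomous-driving scenarios ByteDance intends to pursue include unmanned logistics, a business that sits under the automotive industry line of ByteDance's Volcano Engine. Some…

Source: 36Kr

Industry News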

7/13 10:39

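A Conversation with Om AI's Zhao Tiancheng: Years of Persistence, Betting on a Physical-AI-Native "Streaming" Future

A multimodal model that had never seen surveillance footage understood surveillance better than veteran small models trained on surveillance data for years. This is not science fiction; it was an unplanned success at Om AI (联汇) in 2023, and the moment that strengthened the conviction of CEO and chief scientist Dr. Zhao Tiancheng that "multimodal training can bring generalization to the open physical world." At the time, the AI industry was chasing generative AI centered on large language models. Three years later, that multimodal model has evolved into VLX: the world's first physical-AI-oriented…

AI commentary · A bet on a physical-AI-native streaming architecture, with multimodal generalization overcoming the limits of traditional small models.

Source: 36Kr

Research Papers · NEW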

7/13 04:00

SVR-R1: Bootstrapping Multi-modal Reasoning with Self-verification in Reinforcement Learning

We introduce Self-Verified Reasoner (SVR-R1), a multi-turn RL framework that turns a model's own verification into a learning signal for multimodal reasoning. For each query, the model proposes an ans…

Source: HuggingFace Papers

Research Papers · NEW

7/13 04:00

See like a Robot: Robot-Centric Pointmaps for Vision-Language-Action Models

Vision-language-action (VLA) models predict robot actions from visual observations and language instructions. These actions are defined in the robot's own 3D coordinate frame, yet most VLAs observe th…

Source: HuggingFace Papers

Product Releases/Updates

7/12 23:12

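Meta Releases Multimodal Reasoning Model Muse Spark 1.1, Strengthening AI Agent Task Capabilities

ITHome, July 12: On July 9, Meta officially released version 1.1 of Muse Spark, a multimodal reasoning model for AI agents. The release focuses on improving planning, coordination, and execution in agent tasks, and strengthens tool calling, code development, and app operation. Meta says Muse Spark 1.1 reinforces its multi-agent collaboration mechanism: a lead agent gathers information and drafts a plan, then splits the work into tasks and assigns…

AI commentary · Multi-agent collaboration is a key enabler for putting AI into practice; with this release Meta strengthens task decomposition and delegation.

Source: ITHome

Research Papers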

7/11 19:24

ABot-AgentOS: A General Robotic Agent OS with Lifelong Multi-modal Memory

Recent VLM and VLA systems have improved robotic perception and action prediction, yet long-horizon embodied agents still require a general runtime layer for reasoning, memory, tool use, verification,…

Source: HuggingFace Papers

Research Papers

7/11 04:00

SynthDocBench: Controlled Benchmark for Long-Context Visual Document Understanding

Vision language models (VLMs) have achieved strong performance on visual document understanding benchmarks such as DocVQA, ChartQA, and MMLongBench-Doc. However, real-world documents combine multiple…

Source: HuggingFace Papers

Research Papers

7/11 01:53

Evolution of Accuracy and Visual-Cognitive Errors in a Decade of Vision-Language AI Models

Vision language models (VLMs) have made remarkable progress in visual reasoning during the last decade. Most evaluations have used simple scenes (MS-COCO) that do not showcase complex human interactio…

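AI commentary · A decade of vision-language model evolution shows that, even as accuracy improves, visual-cognitive errors expose deep differences between human and AI perception.

Source: arXiv

Research Papers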

7/11 01:22

Task-Specific Multimodal Question Answering Agents via Confidence Calibration and Incremental Reasoning for QANTA 2026

We present our submission to the QANTA 2026 shared challenge at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). QANTA evaluates multimodal quizbowl systems that answer pyr…

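AI commentary · Combining confidence calibration with incremental reasoning offers an efficient new approach to multimodal question answering; the technical design is worth watching.

Source: arXiv

Research Papers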

7/11 00:42

PAC-ACT: Post-training Actor-Critic for Action Chunking Transformers

Precision industrial contact manipulation requires reliable robot policies under pose perturbations and contact-force constraints. Vision-language-action models offer broad generalization but often in…

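AI commentary · Fine-tunes action chunks with reinforcement learning to improve the robustness and generalization of robotic contact manipulation.

Source: arXiv

Industry News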

7/10 18:00

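vLLM Inference Optimization in Practice for Multimodal Models | AICon Shenzhen

AI commentary · Focused on multimodal inference efficiency, vLLM's practices offer a key reference for cutting costs and raising efficiency in AI applications.

Source: InfoQ

Research Papers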

7/10 01:46

AUTOPILOT VQA: Benchmarking Vision-Language Models for Incident-Centric Dashcam Understanding

Recent advances in Vision-Language Models, Large Language Models, and Multimodal Large Language Models have improved autonomous driving tasks such as scene understanding, decision making, trajectory p…

Source: arXiv

Research Papers

7/9 04:00

UniClawBench: A Universal Benchmark for Proactive Agents on Real-World Tasks

The rapid development of large language models and multimodal large language models has accelerated the emergence of proactive agents capable of operating everyday tools and assisting users in real-wo…

Source: HuggingFace Papers

Research Papers

7/9 04:00

Blind-Spots-Bench: Evaluating Blind Spots in Multimodal Models

Modern AI models achieve strong performance on many established benchmarks, yet they still fail on tasks that humans find almost trivial, such as manipulating a string or drawing a dog with five legs.…

Source: HuggingFace Papers

Research Papers

7/9 01:26

MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models

Medicine is inherently multimodal, requiring clinicians to synthesize information across diverse data streams. Yet the development of multimodal foundation models is constrained by limited access to l…

Source: arXiv

Research Papers

7/8 04:00

Dual Latent Memory in Vision-Language-Action Models for Robotic Manipulation

Mainstream Vision-Language-Action (VLA) models predict actions primarily from the current observation under a Markovian assumption, thus struggling with long-horizon, temporally dependent tasks. Exist…

Source: HuggingFace Papers

Research Papers

7/8 04:00

MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models

Source: HuggingFace Papers

Research Papers

7/8 01:38

The Large Cancer Assistant (LCA): A Model-Agnostic Orchestration Framework for Scalable Clinical Decision Support in Oncology

- Objective: Multimodal deep learning models in oncology are currently limited by monolithic designs that rigidly couple data ingestion, clinical routing, and artificial intelligence (AI) inference. T…

Source: arXiv

Research Papers

7/8 01:27

Bridging Physical Reasoning and Task Generalization via Visual Action Outcome Reasoning Alignment

Vision-language models (VLMs) struggle to generalize in interactive physical reasoning, particularly under unseen tasks and environments. Two key failure modes are prominent: hallucinated chain-of-tho…

Source: arXiv

Product Releases/Updates

7/7 16:38

caseclose/cma-harness

Cognitive-structured Multimodal Agent (CMA-Harness): a memory-centric agent for long-horizon multimodal understanding, generation, and editing — externalizing v…

Source: GitHub

Product Releases/Updates

7/7 13:03

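Giving Skills Images to Rely On: openJiuwen Debuts Skill-Omni, a Multimodal Skill Paradigm

From text manuals to a multimodal experience library

Source: QbitAI

Industry News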

7/7 09:05

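Using AI to "Replicate" Human Cells and Predict Drug Efficacy, Huayuan Zhiyin (华源智因) Raises a Seed Round in the Tens of Millions of RMB | 36Kr Exclusive

By Hu Xiangyun | Edited by Hai Ruojing. 36Kr has learned that Huayuan Zhiyin, an AI virtual cell (AIVC) company, recently closed a seed round in the tens of millions of RMB. The round was led by Shuimu Ventures (水木创投), and the proceeds will mainly fund iteration of its underlying multimodal sequencing technology, expanded partnerships with leading Grade-A tertiary hospitals, and team growth. The team also plans to launch a new funding round. The founding team consists of veteran pharmaceutical-industry professionals and computational biology researchers, and has brought in experts from institutions including the China National GeneBank in Shenzhen…

Source: 36Kr

Research Papers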

7/7 04:00

Image2Sim: Scaling Embodied Navigation via Generative Neural Simulator

Embodied navigation aims to build agents that interpret multimodal goals, reason in 3D space, and reach target destinations reliably in the real world. However, progress remains constrained by the lac…

Source: HuggingFace Papers

Research Papers

7/7 04:00

Vision as Unified Multimodal Generation

We formulate computer vision as unified multimodal generation, where heterogeneous visual tasks are expressed in the native text and image generation spaces of a unified multimodal model, without task…

Source: HuggingFace Papers

Research Papers

7/7 04:00

SIEVE: Structure-Aware Data Selection for Imitation Learning with VLA Models

Vision-Language-Action (VLA) models are typically trained by imitation learning on large-scale robot demonstration datasets, but more data does not necessarily yield better policies due to redundancy,…

Source: HuggingFace Papers

Research Papers

7/7 04:00

VaseMuseum: Digital Intelligent Museum for Ancient Greek Pottery

Vision-language models (VLMs) have made interactive digital museums increasingly feasible by connecting 3D digitization with natural-language artifact exploration. However, in cultural heritage domain…

Source: HuggingFace Papers

Research Papers

7/7 01:59

From Fixed to Free Cameras: Calibration-Free View-Robust Vision-Language-Action Model

Real-world robot deployment rarely maintains the training-stage camera setup, where cameras often experience repositioning or remounting depending on actual scenarios. Existing view-robust Vision-Lang…

Source: arXiv

Research Papers

7/7 01:55

Cortex: A Bidirectionally Aligned Embodied Agent Framework for Long-horizon Manipulation

While recent Vision-Language-Action (VLA) models show promise toward generalist manipulation policies, they struggle with long-horizon tasks due to their Markovian nature, relying solely on current obs…

Source: arXiv

Research Papers

7/6 04:00

Do All Visual Tokens Matter Equally? Object-Evidence Preserving Token Merging for Vision-Language Retrieval

Multi-vector vision-language retrieval preserves fine-grained visual evidence through maximum-similarity late interaction, but dense image-side tokens make storage and scoring expensive. Existing toke…

Source: HuggingFace Papers

Research Papers

7/6 04:00

HunyuanOCR-1.5: Making Lightweight OCR VLMs Faster and Better

We present HunyuanOCR-1.5, a lightweight end-to-end OCR-specialized vision-language model. HunyuanOCR unifies document parsing, text spotting, information extraction, text-image translation, and multi…

Source: HuggingFace Papers

Research Papers

7/6 04:00

Light-Omni: Reflex over Reasoning in Agentic Video Understanding with Long-Term Memory

Agentic video understanding equips models with long-term memory to autonomously process and respond to continuous, long-horizon multimodal streams. However, advanced video agents often rely on "detec…

Source: HuggingFace Papers

Product Releases/Updates

7/5 13:38

zhiweio/EagleRAG

Search knowledge by what documents mean and how they look — not one or the other.

Source: GitHub

Research Papers

7/5 04:00

AI Wizards at EXIST 2026: Hierarchical Soft-Label Learning for Multimodal Sexism Identification in Memes

We present the AI Wizards submission to EXIST 2026 for multimodal sexism identification in memes. The task is composed of three, increasingly harder subtasks. We model them hierarchically as condition…

Source: HuggingFace Papers

Research Papers

7/5 04:00

UI-MOPD: Multi-Platform On-Policy Distillation for Continual GUI Agent Learning

Recent advances in multimodal foundation models and agent systems have driven GUI agents from single-platform task execution toward cross-platform interaction. However, building multi-platform GUI age…

Source: HuggingFace Papers

Product Releases/Updates

7/4 16:20

JT-Sun/UAVReason

🚁 Can Vision-Language Models Think from the Sky? UAVReason for Aerial Reasoning and Generation

Source: GitHub

Research Papers

7/4 04:00

Look Before You Leap: Distilling Tree Search into Action Evaluation for Frozen VLA Models

Vision-Language-Action (VLA) models acquire broad embodied capabilities through large-scale pretraining, yet their generalization remains far more fragile than that of LLMs and VLMs. The prevailing re…

Source: HuggingFace Papers

Research Papers

7/4 04:00

Attending to Multimodal Generation One Token at a Time

Multimodal large language models (MLLMs) generate responses autoregressively, integrating visual and linguistic information in an evolving context. Prior work on interpretability has focused on indivi…

Source: HuggingFace Papers

Research Papers

7/3 04:00

MentalThink: Shaping Thoughts in Mental SVG World

We introduce MentalThink, a visual-symbolic reasoning paradigm that equips Multimodal LLMs (MLLMs) with an executable mechanism for "mental" visualization. The core of MentalThink is a think-with-SVG…

Source: HuggingFace Papers

Research Papers

7/3 01:55

Towards Robustness against Typographic Attack with Training-free Concept Localization

Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP…

Source: arXiv

Research Papers

7/3 01:53

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

Large vision-language models can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisio…

Source: arXiv

Research Papers

7/2 04:00

Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale.…

Source: HuggingFace Papers

Research Papers

7/2 04:00

AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models

Vision-Language Models (VLMs) have demonstrated immense promise in Spatio-Temporal Video Grounding (STVG). However, current evaluation protocols are largely confined to zero-shot assessments on genera…

Source: HuggingFace Papers

Research Papers

7/2 04:00

VLA-Corrector: Lightweight Detect-and-Correct Inference for Adaptive Action Horizon

Vision-Language-Action (VLA) foundation models have recently achieved strong progress in embodied intelligence. To reduce policy-call frequency while preserving temporal coherence, most generative pol…

Source: HuggingFace Papers

Research Papers

7/2 04:00

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

Embodied AI models now span vision-language-action (VLA) models and world-action models (WAMs), but practical deployment remains fragmented across model-specific Python stacks, backend assumptions, an…

Source: HuggingFace Papers

Research Papers

7/2 04:00

Rank-Then-Act: Reward-Free Control from Frame-Order Progress

We introduce Rank-Then-Act (RTA), a framework for learning control policies from expert video demonstrations without environment rewards. RTA trains a Vision-Language Model (VLM) offline as a progress…

Source: HuggingFace Papers

Research Papers

7/2 04:00

Gemma 4 Technical Report

We introduce Gemma 4, a new generation of open-weight, natively multimodal language models in the Gemma model family. Designed to advance compute efficiency and reasoning, the Gemma 4 model suite feat…

Source: HuggingFace Papers

Research Papers

7/2 01:51

FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model

Current work on robot furniture assembly mostly focuses on toy-scale settings or single-arm manipulation. We introduce FurnitureVLA, the first systematic study of real-scale bimanual furniture assembl…

Source: arXiv

Product Releases/Updates

7/1 11:46

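Om AI (联汇) Releases VLX: The World's First On-Device Streaming Multimodal Model for the Physical World

The next step for physical-world AI

Source: QbitAI

Model Releases/Updates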

7/1 07:29

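Google Releases New AI Creation Tools to Accelerate Multimodal Content Generation

At its recent I/O developer conference, Google announced a series of upgrades to its developer-facing AI creation tools, aiming to lower the barrier to generating multimedia content and improve efficiency with the latest Gemini model family. For video and multimodal creation, Google introduced the new Gemini Omni model, which can understand and process text, image, audio, and video inputs and generate coherent video. Its standout feature is conversational editing: users simply describe the change they want in natural language, such as swapping a character…

AI commentary · Conversational multimodal editing lowers the barrier to video creation, and natural-language interaction reshapes the content production workflow.

Source: 36Kr

Research Papers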

7/1 04:00

Multimodal Continuous Reasoning via Asymmetric Mutual Variational Learning

Multimodal Large Language Models (MLLMs) are often constrained by a language-space bottleneck, forcing complex visual reasoning into discrete tokens which can lose perceptual nuance. A promising alter…

Source: HuggingFace Papers

Research Papers

7/1 04:00

Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

Fine-grained visual reasoning remains challenging for vision-language models, especially when small but critical visual cues are buried in high-resolution images. Existing approaches rely on repeated…

Source: HuggingFace Papers

Research Papers

7/1 04:00

Domain Arithmetic: One-Shot VLA Adaptation under Environmental Shifts

Vision-Language-Action (VLA) models often fail to perform the same learned tasks under environmental shifts, such as changes in camera pose and shifts to a different but similar robot (e.g., from Pand…

Source: HuggingFace Papers

Research Papers

7/1 04:00

MultAttnAttrib: Training-Free Multimodal Attribution in Long Document Question Answering

As grounded QA systems are increasingly deployed in AI assistants, accurately attributing generated answers to evidence is critical for user trust and model safety. While unimodal attributions have be…

Source: HuggingFace Papers

Research Papers

7/1 04:00

Wake up for Touch! Mask-isolated Tactile Alignment Learning in MLLMs

Touch supplies the physical grounding needed to perceive intrinsic material properties, such as friction and compliance, that vision alone often cannot resolve. Recent efforts for equipping multimodal…

Source: HuggingFace Papers

Research Papers

7/1 01:47

FedLAB: Traceable Semantic Codebooks for Federated Multimodal Graph Foundation Learning

Multimodal graph foundation models aim to learn reusable knowledge from graphs enriched with text, images, attributes, and relational topology, thereby supporting diverse graph-centric and modality-ce…

Source: arXiv

Research Papers

7/1 01:46

CoMet: Context and Multiplicity Decomposition for Multimodal Uncertainty Estimation

Uncertainty estimation has been a long-standing challenge in AI models; it amounts to "knowing what you don't know," and metacognition is notoriously difficult even for humans (cf. the Dunning-Kruger…

Source: arXiv

Industry News

6/30 23:27

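Huawei Announces Large-Scale Deployment of the World's First Commercial Multimodal Culture-and-Tourism Large Model

ITHome, June 30: Huawei China announced that on June 29, 2026, the world's first commercial multimodal culture-and-tourism large model, the "Boguan Culture and Tourism Large Model" (博观文旅大模型), entered large-scale use in Xi'an. As of March this year, AI travel-companion agents built on Boguan had reached more than 4 million users, and sales of products derived from the intangible-cultural-heritage digital IP it created had exceeded 2 million. ITHome found that Shaanxi Culture Industry Investment (陕文投), Huawei, and others jointly developed in September 2025 the "Boguan culture and tourism…

Source: ITHome

Product Releases/Updates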

6/30 13:53

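24-Hour Livestreaming from a Single Photo? Huya's Real-Time Multimodal Digital Human VAM 1.0 Is First to Break Through the Industry's Three Walls

It can chat, sing and dance, and play games with you

Source: QbitAI

Research Papers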

6/30 04:00

Xiaomi-GUI-0 Technical Report

Graphical user interface (GUI) agents build on vision-language models to complete user tasks end-to-end in real applications through interface actions such as tapping, swiping, text entry, and navigat…

Source: HuggingFace Papers

Research Papers

6/30 04:00

ASPIRE: Agentic /Skills Discovery for Robotics

Traditional robot programming is challenging: it requires orchestrating multimodal perception, managing physical contact dynamics, and handling diverse configurations and execution failures. We introd…

Source: HuggingFace Papers

Research Papers

6/30 04:00

Breaking Failure Cascades: Step-Aware Reinforcement Learning for Medical Multimodal Reasoning

Recent multimodal large language models have shown great promise in clinical image reasoning, but existing post-training pipelines remain predominantly outcome-centric, relying on final answer correct…

Source: HuggingFace Papers

Research Papers

6/30 04:00

Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue

In collaborative dialogue, shared perception does not guarantee shared interpretation. Mutual understanding must be established through interaction. We investigate whether vision-language models (VLMs…

Source: HuggingFace Papers

Research Papers

6/30 04:00

3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance

Hierarchical Vision-Language-Action (VLA) models decouple high-level planning from low-level control to improve generalization in robot manipulation. Recent work in this paradigm uses 2D end-effector…

Source: HuggingFace Papers

Product Releases/Updates

6/29 16:49

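OceanBase Releases an AI Database: One Engine Unifying Data Lake, Database, and Multimodal Data

Letting AI truly "understand" the enterprise

Source: QbitAI

Research Papers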

6/29 04:00

Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation

The advancement of generative AI models capable of producing text and image marks a critical step forward in the realm of multimodal intelligence, particularly for tasks involving the interleaving of…

Source: HuggingFace Papers

Research Papers

6/29 04:00

TACO: Tool-Augmented Credit Optimization for Agentic Tool Use

Agentic multimodal models perform diverse operations on an image via code and reason over the returned view, an effective paradigm for fine-grained visual question answering. However, code operations…

Source: HuggingFace Papers

Research Papers

6/29 04:00

Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

The materials science literature encodes decades of experimental knowledge in figures, yet this visual record remains locked away and inaccessible to AI at scale. The core difficulty is structural: mo…

Source: HuggingFace Papers

Research Papers

6/29 04:00

Orca: The World is in Your Mind

We introduce Orca, an initial instantiation of a general world foundation model. Orca learns a unified world latent space from multimodal world signals and exposes it through multimodal readout interf…

Source: HuggingFace Papers

Research Papers

6/29 04:00

JD Oxygen AI Item Center (Oxygen AIIC) V1: An Industrial-Scale LLM/VLM-Centric Solution for Item Understanding, Management, and Applications

JD.com, one of the world's largest e-commerce platforms, serves over 700 million active users and millions of merchants, with a catalog of tens of billions of SKUs. At this scale, high-quality, struct…

Source: HuggingFace Papers

Research Papers

6/28 04:00

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

Video understanding is a fundamental capability for multimodal intelligence, and recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance on Video Question Answering (Video…

Source: HuggingFace Papers

Research Papers

6/28 04:00

Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation

Vision-language dataset distillation (VLDD) compresses a large image-text paired dataset into a small set of synthetic pairs that can efficiently train contrastive vision-language models under strict…

Source: HuggingFace Papers

Product Releases/Updates

6/27 20:19

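CVPR 2026's Hottest Direction: A Hangzhou Team Is First to Run It On-Device!

Another release after VLM-R1! The world's first on-device streaming multimodal model is here!

AI commentary · Running a multimodal large model on-device sharply lowers the cost and barrier of AI applications, a key breakthrough for industrial deployment.

Source: QbitAI

Product Releases/Updates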

6/27 14:12

odebo/CC-Vision

Give non-multimodal Claude Code main models the ability to see pasted screenshots — a ~200-line UserPromptSubmit hook.

Source: GitHub

Industry News

6/27 07:30

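Chasing FSD V14: What Is Li Auto Catching Up On? | Frontline

Over the past few years, the competitive focus of the intelligent-driving industry has shifted noticeably several times. At first the contest was about hardware: whether to fit lidar, how many cameras to install, how many TOPS of compute to reach. Then came the large-model era, and competition turned to approaches such as end-to-end, VLA (Vision-Language-Action), and world models. Today, more and more companies are finding that a bigger model alone is no longer enough to create a generational advantage; what really sets the ceiling is starting to become…

AI commentary · Examines the technical gaps Li Auto must close to catch up with Tesla's FSD V14, and points to a shift in intelligent-driving competition from model scale to system integration.

Source: 36Kr

Research Papers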

6/27 01:16

Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models

Vision-language models must reconcile visual evidence with memorized world knowledge when the two conflict. How they resolve this conflict shapes the reliability of multimodal systems, yet prior work…

Source: arXiv

Product Releases/Updates

6/26 20:23

FudanCVL/Unison

[ICML 2026] Unison: Benchmarking Unified Multimodal Models via Synergistic Understanding and Generation

Source: GitHub

Research Papers

6/26 04:00

ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering

Knowledge-based Visual Question Answering (KB-VQA) requires models to combine image understanding with external knowledge. Most prior methods use a fixed retrieve-then-generate pipeline with a pre-sel…

Source: HuggingFace Papers

Research Papers

6/26 04:00

Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models?

Vision-Language-Action (VLA) models enable instruction-driven robotic manipulation, but they inherit oversized language backbones from pretrained VLMs whose capacity far exceeds what is needed for sho…

Source: HuggingFace Papers

Research Papers

6/26 04:00

ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval

Adapting a foundation vision-language encoder to a specialized retrieval task creates a fundamental tradeoff: gains on the target distribution come at the cost of the foundation model's broad generali…

Source: HuggingFace Papers

Research Papers

6/26 04:00

Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

Recent interest in multimodal large language models (MLLMs) raises a central question: can they reason over dynamic visual evidence rather than merely recognize objects or events in individual frames?…

Source: HuggingFace Papers

Research Papers

6/26 04:00

PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception

We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic mat…

Source: HuggingFace Papers

Research Papers

6/26 04:00

DataComp-VLM: Improved Open Datasets for Vision-Language Models

Building performant Vision-Language Models (VLMs) requires carefully curating large-scale training datasets, yet the community lacks systematic benchmarks for evaluating such curation strategies. We i…

Source: HuggingFace Papers

Research Papers

6/26 01:44

Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning

Multimodal web agents can assist humans in operating repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. While small open source MLL…

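Source: arXiv

Product Releases/Updates

6/25 15:42

RoboScience Releases Visics General-Purpose Embodied Foundation Model, Generalizing Across Embodiments, Objects, and Tasks | Frontline

By Huang Nan | Edited by Yuan Silai. On June 24, general-purpose embodied intelligence company RoboScience released its general-purpose embodied foundation model, disclosing in full for the first time the VLOA (Vision-Language-Object-Action) architecture behind its in-house Visics model and demonstrating the model in real-world scenarios including furniture assembly, dexterous grasping, and dynamic production lines. Large language models have standard text tokens, and autonomous driving has unified visual or point-cloud representations; these…

Source: 36Kr

Research Papers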

6/25 04:00

In-Context World Modeling for Robotic Control

Modern Vision-Language-Action (VLA) models often fail to generalize to novel setups, such as altered camera viewpoints or robot morphologies, because they are typically conditioned only on current obs…

Source: HuggingFace Papers

Research Papers

6/25 04:00

ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

A unified representation for text and vision is a natural pursuit, as it enables simpler multimodal modeling and more efficient training. However, representing images as discrete signals in the same w…

Source: HuggingFace Papers

Research Papers

6/25 01:59

Learning Action Priors for Cross-embodiment Robot Manipulation

Most Vision-Language-Action (VLA) models build on a Vision-Language Model (VLM) backbone by attaching an action module and optimizing the full policy jointly. This design inherits strong visual and li…

Source: arXiv

Research Papers

6/25 01:53

Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability proper…

Source: arXiv

Research Papers

6/25 01:15

How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and increasingly focused on text-rich understanding, but their robustness under controlled visual degradation rem…

Source: arXiv

Research Papers

6/25 01:00

TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs

Multimodal Large Language Models (MLLMs) demonstrate strong performance on standard visual question answering benchmarks, yet their scalability under controlled structural complexity remains poorly un…

Source: arXiv

Research Papers

6/24 04:00

V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods ty…

Source: HuggingFace Papers

Product Releases/Updates

6/23 15:11

vancyland/DataClaw0

DataClaw: Agentic Tailoring Multimodal Data from Raw Streams — coming soon (code, weights, dataset & DataClaw-val upon acceptance).

Source: GitHub

Research Papers

6/23 04:00

InSight: Self-Guided Skill Acquisition via Steerable VLAs

Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unloc…

Source: HuggingFace Papers

Research Papers

6/23 04:00

FlowR2A: Learning Reward-to-Action Distribution for Multimodal Driving Planning

Multimodal driving planning faces a long-standing tension between two paradigms: scoring-based methods benefit from dense reward supervision but are confined to a fixed action vocabulary, while anchor…

Source: HuggingFace Papers

Research Papers

6/23 04:00

ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection

Multimodal misinformation detection is increasingly important because viral posts now combine long multilingual narratives, several images, mixed provenance, and subtle text-image framing errors. Exi…

Source: HuggingFace Papers

Research Papers

6/23 01:58

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. The existing literature…

Source: arXiv

Research Papers

6/23 01:31

TailorMind: Towards Preference-Aligned Multimodal Content Generation

Personalized content systems depend on available UGC and struggle when suitable content is absent, delayed, or costly to create. Although multimodal generators can synthesize content on demand, how to…

Source: arXiv

Tips & Opinions

6/23 00:32

Embed the world: Multimodal AI for searchable aerial imagery at scale

In this post, we walk through the problem space, our architecture on Amazon Bedrock and Amazon OpenSearch Serverless, the evaluation methodology we built on OpenStreetMap ground tr…

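AI Commentary · Multimodal AI turns aerial imagery into searchable data, enabling intelligent retrieval of geospatial information at scale.

Source: AWS ML

Industry News

6/22 18:00

Fudan University Professor Qiu Xipeng Confirmed for AICon Shanghai, Sharing Innovations and Practice with the MOSS Multimodal Model

Source: InfoQ

Research Papers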

6/22 04:00

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable. Yet existing d…

Source: HuggingFace Papers

Research Papers

6/22 04:00

ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation

ABACUS is a unified vision-language model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without any benchmark-specific training requi…

Source: HuggingFace Papers

Research Papers

6/22 04:00

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal…

Source: HuggingFace Papers

Research Papers

6/22 04:00

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision enco…

Source: HuggingFace Papers

Research Papers

6/21 04:00

PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models

Vision-Language-Action (VLA) models provide a unified paradigm for robotic manipulation, yet their real-world deployment is often bottlenecked by execution efficiency. While existing efforts predomina…

Source: HuggingFace Papers

Research Papers

6/21 04:00

Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

Chain-of-Thought (CoT) has become a standard method for improving reasoning capabilities in large language models (LLMs) by eliciting step-by-step thinking, but its effectiveness in multimodal tasks r…

Source: HuggingFace Papers

Research Papers

6/20 04:00

BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language

We present BioMatrix, the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architectu…

Source: HuggingFace Papers

Product Releases/Updates

6/19 20:22

Goldezstevie/amd-multimodal-lab

Vision-Language Models on AMD GPUs — LLaVA, MiniGPT-4, Idefics on ROCm 🚀

Source: GitHub

Research Papers

6/19 04:00

DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

Massive unstructured multimodal streams suffer from high "data entropy," impeding both efficient human knowledge acquisition and high-quality AI post-training. Existing passive annotation paradigms, h…

Source: HuggingFace Papers

Research Papers

6/19 01:39

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly under…

Source: arXiv

Research Papers

6/19 01:38

SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR-optical datasets largel…

Source: arXiv

Research Papers

6/18 04:00

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Source: HuggingFace Papers

Research Papers

6/18 04:00

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over ti…

Source: HuggingFace Papers

Research Papers

6/18 01:20

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain a…

Source: arXiv

Product Releases/Updates

6/17 13:11

kaistmm/SeeandSniff

[ECCV 2026] Official PyTorch implementation for See & Sniff: Learning Visuo-Olfactory Representations

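Source: GitHub

Model Releases/Updates

6/17 07:13

Google Rolls Out Stable Android 17 with Deep AI Integration

ITHome, June 17: Google officially rolled out the stable release of Android 17 on Tuesday local time and also released the smartwatch operating system Wear OS 7. The new system will debut on Google's own Pixel devices, alongside a Pixel-exclusive feature update that adds several AI-related features, including support for the latest AI models such as the music generation model Lyria 3 and the multimodal large model Gemini Omni,…

AI Commentary · With AI deeply integrated at the system level, Android 17 marks the mobile platform's formal entry into the AI-native era; the thing to watch is its ecosystem impact.

Source: ITHome

Model Releases/Updates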

6/17 06:30

Hands Free, AIs Forward: NVIDIA XR AI Brings Agents to AR Glasses

NVIDIA XR AI is now available in public beta, giving developers a framework for building multimodal AI agents for AR glasses and XR devices.

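AI Commentary · NVIDIA XR AI enters public beta for building multimodal AI agents on AR glasses, advancing a new hands-free interaction paradigm.

Source: NVIDIA

Research Papers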

6/17 04:00

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning,…

Source: HuggingFace Papers

Research Papers

6/17 04:00

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the ful…

Source: HuggingFace Papers

Research Papers

6/17 04:00

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency…

Source: HuggingFace Papers

Research Papers

6/17 04:00

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether t…

Source: HuggingFace Papers

Research Papers

6/17 04:00

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding exe…

Source: HuggingFace Papers

Research Papers

6/17 04:00

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Source: HuggingFace Papers

Industry News

6/16 17:44

Gemma 4 12B Enables On-Device Multimodal Agentic Workflows with an Encoder-Free Architecture

Source: InfoQ

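Industry News

6/16 06:59

China Merchants Bank Launches "Amex Engineer Credit Card," Offering New Cardholders "Exclusive AI Benefits" of Up to 1.8 Billion M3 Tokens per Month

ITHome, June 16: China Merchants Bank announced an "Amex Engineer Credit Card," emphasizing that the card carries "exclusive AI benefits." According to the official description, new cardholders who meet the requirements on their first participation in the promotion can receive up to 1.8 billion M3 tokens of usage per month, usable directly for calls to multimodal models covering documents, images, audio, and video, as well as high-frequency AI scenarios such as MaxClaw "lobster" deployment. Specifically, the card's "AI benefits system" offers three…

AI Commentary · Bundling a bank credit card with AI compute squarely targets engineers' high-frequency needs and opens a new style of cross-industry finance-plus-AI benefits.

Source: ITHome

Research Papers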

6/16 04:00

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits t…

Source: HuggingFace Papers

Research Papers

6/16 04:00

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreove…

Source: HuggingFace Papers

Research Papers

6/16 04:00

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-…

Source: HuggingFace Papers

Research Papers

6/16 04:00

Guava: An Effective and Universal Harness for Embodied Manipulation

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-t…

Source: HuggingFace Papers

Research Papers

6/16 04:00

GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for con…

Source: HuggingFace Papers

Research Papers

6/16 04:00

Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

While Large Language Models (LLMs) have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, vector drawings,…

Source: HuggingFace Papers

Research Papers

6/16 01:59

Context-Aware RL for Agentic and Multimodal LLMs

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle d…

Source: arXiv

Research Papers

6/16 01:49

FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

Remote sensing vision-language models have advanced Earth observation understanding, but most existing work remains centered on RGB imagery, leaving the complementary information in infrared data unde…

Source: arXiv

Research Papers

6/16 01:45

ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning

Human interventions provide crucial corrective signals for post-training Vision-Language-Action (VLA) models. However, enabling seamless humanoid interventions is a formidable systems challenge due to…

Source: arXiv

Product Releases/Updates

6/15 17:15

volcengine/ark-cli

The fastest way to put Volcengine Ark in your terminal and your AI agent — go from prompt to generated media, multimodal answer, or deployed endpoint in a sin…

Source: GitHub

Research Papers

6/15 04:00

Geometric Action Model for Robot Policy Learning

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and…

Source: HuggingFace Papers

Research Papers

6/15 04:00

VisualClaw: A Real-Time, Personalized Agent for the Physical World

Vision language models are serving as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing de…

Source: HuggingFace Papers

Research Papers

6/15 04:00

UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing U…

Source: HuggingFace Papers

Research Papers

6/15 04:00

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale ego…

Source: HuggingFace Papers

Research Papers

6/15 04:00

Context-Aware RL for Agentic and Multimodal LLMs

Source: HuggingFace Papers

Research Papers

6/15 04:00

Thinking with Visual Grounding

Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the support…

Source: HuggingFace Papers

Research Papers

6/15 04:00

How Post-Training Shapes Biological Reasoning Models

Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-trai…

Source: HuggingFace Papers

Product Releases/Updates

6/14 21:26

Egoist-Machines/LodeDB

World's fastest and most compact embedded vector database: exact by default, multimodal, local-first, and GPU-accelerated

Source: GitHub

Research Papers

6/14 19:45

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model beh…

Source: HuggingFace Papers

Research Papers

6/14 16:06

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Actio…

Source: HuggingFace Papers

Research Papers

6/14 04:00

Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection an…

Source: HuggingFace Papers

Industry News

6/13 17:00

The Future of Analytics Is Multimodal, and It's All About Vibe | Tech Trends

Source: InfoQ

Product Releases/Updates

6/13 14:50

ratschlab/DeepSpotM

Multimodal foundation model predicting transcriptome-wide virtual spatial transcriptomics from histology.

Source: GitHub

Research Papers

6/13 04:00

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenari…

Source: HuggingFace Papers

Research Papers

6/13 04:00

MotionVLA: Vision-Language-Action Model for Humanoid Motion

Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a s…

Source: HuggingFace Papers

Research Papers

6/13 04:00

Reinforcement Learning-Guided Retrieval with Soft Fusion for Robust Multimodal Imitation Learning under Missing Modalities

Robotic systems perceive the world through multiple input modalities, including visual camera streams and natural language instructions, and must select appropriate actions based on these signals.…

Source: HuggingFace Papers

Research Papers

6/13 01:59

Gaze Heads: How VLMs Look at What They Describe

How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its…

Source: arXiv

Research Papers

6/13 01:54

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods…

Source: arXiv

Research Papers

6/12 04:00

RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space

Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emer…

Source: HuggingFace Papers

Research Papers

6/12 04:00

Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack

In this report, we present Hy-Embodied-0.5-VLA, abbreviated as HyVLA-0.5, an end-to-end system that spans the full robot learning stack: data collection, model design, continued pre-training and super…

Source: HuggingFace Papers

Research Papers

6/12 04:00

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but…

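Source: HuggingFace Papers

Model Releases/Updates

6/11 04:10

Xiaomi Releases and Open-Sources MiMo Code V0.1.0, an Exploratory AI Coding Assistant: Built on OpenCode, MIT Licensed

ITHome, June 11: Early this morning, Xiaomi MiMo officially released and open-sourced MiMo Code V0.1.0, an exploratory AI coding assistant that runs in the terminal. According to the announcement, MiMo Code is built on the open-source project OpenCode and is released under the MIT license. It also bundles the multimodal model MiMo-V2.5, free for a limited time, and supports connecting to DeepSeek, Kimi, and…

AI Commentary · Xiaomi open-sources an AI coding assistant, lowering the barrier for developers and encouraging ecosystem co-building; the MIT license favors broad commercial use.

Source: ITHome

Research Papers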

6/11 04:00

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in…

Source: HuggingFace Papers

Research Papers

6/11 04:00

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM t…

Source: HuggingFace Papers

Research Papers

6/11 04:00

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hy…

Source: HuggingFace Papers

Research Papers

6/11 04:00

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attemp…

Source: HuggingFace Papers

Research Papers

6/11 04:00

Self-Evolving Visual Questioner

Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric and grounded questions remains underexplored. Existin…

Source: HuggingFace Papers

Research Papers

6/11 01:58

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe th…

Source: arXiv

Research Papers

6/11 01:31

Latent World Recovery for Multimodal Learning with Missing Modalities

We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need…

Source: arXiv

Research Papers

6/10 04:00

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token re…

Source: HuggingFace Papers

Research Papers

6/10 04:00

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long…

Source: HuggingFace Papers

Research Papers

6/10 04:00

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on stat…

Source: HuggingFace Papers

Research Papers

6/10 04:00

From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion

Multimodal image fusion aims to integrate complementary information from different modalities into a fused image that preserves rich local details while maintaining globally consistent appearance. Exi…

Source: HuggingFace Papers

Research Papers

6/10 04:00

APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet generalization to out-of-dist…

Source: HuggingFace Papers

Research Papers

6/10 04:00

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Y…

Source: HuggingFace Papers

Research Papers

6/10 01:59

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

Cross-modal alignment (CA) and cross-modal prediction (CP) are the dominant paradigms for multimodal representation learning, yet there is no systematic understanding of when each succeeds, when each…

Source: arXiv

Research Papers

6/10 01:51

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for con…

Source: arXiv

Model Releases/Updates

6/9 22:10

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Source: Google DeepMind

Product Releases/Updates

6/9 19:20

wanshuiyin/ARIS-Movie-Director

Agentic, long-horizon visual generation: a fuzzy story → a cross-model-audited image-based movie. Brings ARIS's research-wiki + multi-agent debate to multimodal…

Source: GitHub

Product Releases/Updates

6/9 15:08

kenchan0226/multimodal-docs-public

[EMNLP 2025] M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

Source: GitHub

Research Papers

6/9 12:36

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

External memory effectively grounds large language models (LLMs) and vision-language models (VLMs)-based question answering (QA) in relevant multimodal evidence. However, existing memory paradigms rep…

Source: HuggingFace Papers

Research Papers

6/9 04:00

Kwai Keye-VL-2.0 Technical Report

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the challen…

Source: HuggingFace Papers

Research Papers

6/9 04:00

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on thre…

Source: HuggingFace Papers

Research Papers

6/9 04:00

P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning

Multimodal large language models can write code to produce complex programs as well as use programs to do 3D modeling, which opens up a new avenue for 3D generation powered by their priors, world know…

Source: HuggingFace Papers

Research Papers

6/9 04:00

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Source: HuggingFace Papers

Research Papers

6/9 01:31

Discovering Functionally Selective Brain Regions with a Deep Topographic Multimodal Model

Nearby neurons in cortex share similar response profiles, producing systematic spatial organization across sensory and cognitive systems. Recent topographic models reproduce aspects of this structure…

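AI Commentary · Uses a deep multimodal model to reveal functional regions of the brain, offering a new perspective on brain cognition mechanisms and AI neural architectures.

Source: arXiv

Research Papers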

6/8 18:58

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multim…

Source: HuggingFace Papers

Research Papers

6/8 04:00

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair,…

Source: HuggingFace Papers

Research Papers

6/8 04:00

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passiv…

Source: HuggingFace Papers

Research Papers

6/8 04:00

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language t…

Source: HuggingFace Papers

Research Papers

6/8 04:00

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocent…

Source: HuggingFace Papers

Product Releases/Updates

6/7 06:06

GaoxiangLuo/MM-FM

[CVPR 2026] Flow Matching for Multimodal Distributions

Source: GitHub

Research Papers

6/6 04:00

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. While existing…

Source: HuggingFace Papers

Research Papers

6/6 04:00

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a leading paradigm for enhancing visual reasoning in Multimodal Large Language Models (MLLMs). However, existing RLVR methods optim…

Source: HuggingFace Papers

Research Papers

6/6 01:59

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduc…

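AI Commentary · A hierarchical memory architecture tackles the long-video understanding bottleneck, using graph memory and agentic retrieval to decouple perception from reasoning and substantially cut compute cost.

Source: arXiv

Industry News

6/5 18:53

BAAI and Tsinghua Collaboration Published in Science: Multimodal Brain-Science Foundation Model Brainμ Helps Reveal the Neural Mechanisms of "Memory-Sleep" Regulation

The study shows that memory reactivation during sleep helps regulate sleep dynamics, providing new experimental evidence for understanding the bidirectional "memory-sleep" mechanism.

Source: QbitAI

Research Papers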

6/5 04:00

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, requiring complete scene observations or predefined video clips. In this paper, we present…

Source: HuggingFace Papers

Research Papers

6/5 04:00

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These sce…

来源：HuggingFace Papers

论文研究

6/5 04:00

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. W…

来源：HuggingFace Papers

论文研究

6/5 04:00

Robotic Policy Adaptation via Weight-Space Meta-Learning

Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. How…

来源：HuggingFace Papers

论文研究

6/5 04:00

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

来源：HuggingFace Papers

论文研究

6/5 04:00

Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking

Deep research agents have attracted increasing attention for their ability to collect large-scale online information to acquire target knowledge, with recent efforts shifting from purely text-based in…

来源：HuggingFace Papers

技巧与观点

6/5 02:57

Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI

AI commentary · Offers enterprises a customizable multimodal safety solution, filling a gap in global AI content compliance.

来源：HuggingFace Blog

论文研究

6/5 01:59

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VL…

来源：arXiv

行业动态

6/4 11:06

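Daimon Robotics closes a 100-million-yuan-level funding round; a leading multimodal researcher from Alibaba's Tongyi joins to work on physical world models

Breaking away from the vision-only rat race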

来源：量子位

论文研究

6/4 04:00

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across…

来源：HuggingFace Papers

论文研究

6/4 04:00

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch…

来源：HuggingFace Papers

论文研究

6/4 04:00

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs is a promising yet challenging frontier field. Existing unified frameworks predominantly re…

来源：HuggingFace Papers

论文研究

6/4 04:00

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thou…

来源：HuggingFace Papers

论文研究

6/4 04:00

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existing multimodal benchmarks expand task types without capturing the visual diversity needed to…

来源：HuggingFace Papers

论文研究

6/4 04:00

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimod…

来源：HuggingFace Papers

论文研究

6/4 04:00

Robots Need More than VLA and World Models

Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In th…

来源：HuggingFace Papers

论文研究

6/4 04:00

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visua…

来源：HuggingFace Papers

论文研究

6/4 04:00

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than…

来源：HuggingFace Papers

论文研究

6/4 04:00

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across…

来源：HuggingFace Papers

行业动态

6/4 00:04

Gemma 4 12B: A unified, encoder-free multimodal model

来源：Hacker News

产品发布/更新

6/3 22:37

Able-rip/cc-VisionRouter

Transparent proxy for Claude Code that auto-routes image-bearing requests to a multimodal model — so a non-multimodal primary model never crashes your long-runn…

来源：GitHub

行业动态

6/3 12:00

MIT researchers teach AI models to interpret charts

The new ChartNet training dataset could improve the accuracy of vision-language models that help analyze business trends or interpret scientific figures.

来源：MIT News

产品发布/更新

6/3 11:29

0xzall/multimodal-doc-qa

Multimodal document QA: vision + retrieval over PDFs (LLaVA + LlamaIndex)

来源：GitHub

论文研究

6/3 04:00

BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding

Learning representations of CAD models is a largely open problem. While 3D representation learning has flourished around point clouds and meshes, the native format of CAD - boundary representations BR…

来源：HuggingFace Papers

论文研究

6/3 04:00

Video2LoRA: Parametric Video Internalization for Vision-Language Models

Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Video2LoRA, a method…

来源：HuggingFace Papers

论文研究

6/3 04:00

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inf…

来源：HuggingFace Papers

论文研究

6/3 01:59

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

来源：arXiv

论文研究

6/3 01:42

VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring

As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount -- physical actions carry immediate and irreversible consequences that digital errors do not. We introduce…

来源：arXiv

论文研究

6/2 21:25

TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics

Vision-language models (VLMs) have been explored for visual programming, where they generate code to solve visual tasks. However, most prior work focuses on visual programming for productivity; it rem…

来源：arXiv

论文研究

6/2 21:09

Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding

When asked what a meme or sarcastic post means, Large Vision Language Models (LVLMs) tend to describe what the image shows rather than what the author is trying to communicate. Standard instruction tu…

来源：arXiv

论文研究

6/2 21:07

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual r…

来源：arXiv

行业动态

6/2 18:22

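ByteDance Seed reorganization: Zhou Chang's remit expands, embodied AI business brought into the core

The remit of Zhou Chang, ByteDance's head of multimodal, has expanded again: the SeedRobotics team formerly led by Li Hang has been reporting to Zhou Chang for more than a month, and Li Hang now oversees academic collaboration as an advisor. ByteDance is also recruiting a head of embodied-intelligence technology to own overall planning for the robotics business; the role is pegged at L8, comparable to Alibaba's P10-P11, and will report to Zhou Chang. Candidates come mainly from technical leads at top embodied-intelligence startups. (LatePost)

AI commentary · The reorganization shows ByteDance accelerating resource consolidation, with embodied intelligence becoming a strategic core; the technical-lead hire signals an escalating talent war in the industry.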

来源：36氪

产品发布/更新

6/2 11:15

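Qwen3.7-Plus is live! A new foundation for multimodal agents that replicates professional desktop software in one click

Qwen3.7-Plus is now available on Alibaba Cloud Model Studio (Bailian)

AI commentary · Spanning multimodal tasks and desktop software, it takes AI agent capability up another notch and brings a new variable to the developer ecosystem.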

来源：量子位

产品发布/更新

6/2 06:38

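Alibaba releases the Qwen3.7-Plus model, upgrading its multimodal interactive hybrid AI agent

IT Home, June 2: Alibaba's Qwen team published a blog post today (June 2) announcing the Qwen3.7-Plus model, positioned as a multimodal interactive hybrid agent. Qwen3.7-Plus is the multimodal upgrade of Qwen3.7, with a core positioning as an agent foundation that unifies vision and language. It retains text, coding, tool-use, and productivity-workflow capabilities while strengthening visual understanding, visual reasoning, and cross-modal task handling. The model is already available through Alibaba Cloud…

AI commentary · The fusion of multimodality and agents may be a key step in accelerating AI's move from "conversation" to "action".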

来源：IT之家

论文研究

6/2 04:00

Benchmarking Visual State Tracking in Multimodal Video Understanding

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to vi…

来源：HuggingFace Papers

论文研究

6/2 04:00

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence outside the current view. Existing benchmarks ei…

来源：HuggingFace Papers

论文研究

6/2 04:00

MAOAM: Unified Object and Material Selection with Vision-Language Models

Selection is a core operation in interactive image editing. To be practical, a user should be able to specify and disambiguate the desired selection region through either text or click-based interacti…

来源：HuggingFace Papers

论文研究

6/2 04:00

GridVQA-X: A Framework for Evaluating Multimodal Explainability Methods

With the increasing development of Vision-Language Models, it becomes imperative that their predictions are readily explainable to relevant stakeholders. However, the field of explainability has not k…

来源：HuggingFace Papers

论文研究

6/2 01:59

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts…

AI commentary · Uses perceptual perturbation and reward modeling to neatly address visual bias when multimodal LLMs act as judges.

来源：arXiv

论文研究

6/2 01:59

ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making…

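AI commentary · Uses prototype-guided adaptive expansion and geometric consolidation to tackle catastrophic forgetting in multimodal continual learning.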

来源：arXiv

论文研究

6/2 01:56

AdaCodec: A Predictive Visual Code for Video MLLMs

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame a…

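AI commentary · Offers a new angle on temporal redundancy in video, compressing frames with predictive coding; it could substantially lower the compute cost of video multimodal models.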

来源：arXiv

行业动态

6/2 01:55

Qwen3.7-Plus: Multimodal Agent Intelligence

AI commentary · Qwen3.7-Plus combines multimodal and agent capabilities and may open a new paradigm for AI applications.

来源：Hacker News

论文研究

6/2 01:32

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains undere…

AI commentary · Focuses on how accurately video MLLMs judge momentary visual events, filling a gap in temporal-fidelity evaluation.

来源：arXiv

论文研究

6/1 13:50

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes hu…

来源：HuggingFace Papers

模型发布/更新

6/1 11:36

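MiniMax M3 officially released: frontier coding ability, 1M context, native multimodality

MiniMax M3 was officially released today. MiniMax M3 reaches frontier capability on professional tasks such as programming and agents. It uses a new attention architecture, MSA (MiniMax Sparse Attention), and supports ultra-long contexts of up to 1M. It is also a natively multimodal model that accepts image and video input and can operate a computer desktop. On SWE-Bench Pro, which measures coding ability, MiniMa…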

来源：开源中国

论文研究

6/1 04:00

Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigat…

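AI commentary · Vision-language models carry inverse graphics from reasoning to editable 3D scenes, moving past the limits of traditional reconstruction.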

来源：HuggingFace Papers

论文研究

6/1 04:00

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuni…

AI commentary · A key benchmark for evaluating semantic understanding in VLA models, exposing deep flaws in robot action prediction.

来源：HuggingFace Papers

论文研究

6/1 04:00

Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging

Instruction tuning aligns large language models, including multimodal ones, with diverse user intents, but scaling to heterogeneous mixtures is hindered by gradient interference and bandwidth-heavy sy…

来源：HuggingFace Papers

论文研究

6/1 04:00

The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset

Existing autonomous driving datasets have enabled major progress, but fall short in sensor fidelity, map completeness, or geographic diversity. We present KITScenes Multimodal, a European dataset buil…

来源：HuggingFace Papers

论文研究

6/1 04:00

AdaCodec: A Predictive Visual Code for Video MLLMs

来源：HuggingFace Papers

论文研究

5/31 04:00

HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. While English benchmarks have advanced rapidly, non-English counterpa…

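AI commentary · Japan's first chart-and-table QA benchmark built from government white papers, filling a gap in non-English VLM evaluation and advancing multilingual document understanding.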

来源：HuggingFace Papers

论文研究

5/30 04:00

RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existi…

AI commentary · Benchmarks how well VLMs withstand stress in physical scenes, filling a gap in robot perception robustness.

来源：HuggingFace Papers

论文研究

5/30 04:00

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowle…

来源：HuggingFace Papers

论文研究

5/30 01:48

Giving Sensors a Voice: Multimodal JEPA for Semantic Time-Series Embeddings

Transformer-based architectures have advanced sequence modeling in language and vision, yet general-purpose representation learning for heterogeneous multivariate time series remains underexplored. We…

AI commentary · Multimodal joint embeddings let sensor data "speak", breaking the bottleneck in general-purpose time-series representation.

来源：arXiv

论文研究

5/30 01:20

Vision-Language Models Suppress Female Representations Under Ambiguous Input

Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in f…

AI commentary · Shows that VLMs still suppress female representations under ambiguous input, exposing a deep blind spot in AI fairness research.

来源：arXiv

模型发布/更新

5/29 19:18

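Knocking GPT-4o off its pedestal! 星源智 and Peking University launch RoboAgent, letting a 3B VLM reach a 94% success rate in unseen scenes

AI commentary · A small 3B model surpasses GPT-4o; low cost and strong generalization open a new path for embodied intelligence.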

来源：InfoQ

产品发布/更新

5/29 15:25

StarTrail-org/PixelRAG

The end of web parsing. The beginning of scalable pixel-native search.

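AI commentary · A breakthrough in pixel-level search that ends traditional web parsing and opens a new paradigm of vision-native retrieval.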

来源：GitHub

论文研究

5/29 04:00

Task-Focused Memorization for Multimodal Agents

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory…

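AI commentary · Focuses on building long-term memory for multimodal agents, moving beyond the limits of conventional memory to enable continual learning and knowledge accumulation.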

来源：HuggingFace Papers

论文研究

5/29 04:00

Linear Scaling Video VLMs for Long Video Understanding

Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to gr…

AI commentary · Breaks the compute bottleneck of video models with linear scaling, paving the way for real-time long-video understanding.

来源：HuggingFace Papers

论文研究

5/29 04:00

Representation Forcing for Bottleneck-Free Unified Multimodal Models

Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structu…

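AI commentary · Removes the bottleneck of multimodal models' reliance on a pretrained VAE, truly unifying perception and generation; a key step toward efficient AI.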

来源：HuggingFace Papers

论文研究

5/29 04:00

Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliabil…

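AI commentary · Proposes a hide-and-seek approach over trajectories that proactively uncovers hidden runtime failure signals in VLA models, improving robot reliability.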

来源：HuggingFace Papers

论文研究

5/29 04:00

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference pha…

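AI commentary · Uses reinforcement learning to make the model internalize visual reasoning, breaking the inference-stage bottleneck of multimodal models.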

来源：HuggingFace Papers

论文研究

5/29 04:00

How can embedding models bind concepts?

Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models such as CLIP struggle with binding: they recogn…

来源：HuggingFace Papers

论文研究

5/29 04:00

MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains un…

AI commentary · The first benchmark to use Minecraft to evaluate MLLMs' open-world exploration, filling a testing gap in the field.

来源：HuggingFace Papers

论文研究

5/29 04:00

PaintBench: Deterministic Evaluation of Precise Visual Editing

While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dy…

来源：HuggingFace Papers

论文研究

5/29 04:00

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent…

来源：HuggingFace Papers

论文研究

5/29 04:00

MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding

Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question answering (VQA) tasks. However, they remain brittle on mechanical engineering drawings, w…

来源：HuggingFace Papers

论文研究

5/29 01:59

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report unsatisfactory performance when applying large language or multimodal…

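AI commentary · Achieves trustworthy vision-language reasoning with a lightweight model, breaking the efficiency bottleneck in time-series anomaly detection.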

来源：arXiv

产品发布/更新

5/28 09:46

modelstudioai/cli

Official Model Studio CLI (Alibaba Cloud Bailian CLI) built for AI Agent frameworks, exposing models, search, multimodal, and workflow capabilities as structured tool calls.

来源：GitHub

论文研究

5/28 04:00

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. H…

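AI commentary · Multi-agent collaboration produces verifiable long-form reports, breaking the credibility bottleneck of deep research and moving AI from search toward argumentation.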

来源：HuggingFace Papers

论文研究

5/28 04:00

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the r…

AI commentary · Evaluates the dynamic memory of multimodal agents, pushing the leap from simple recall to world modeling.

来源：HuggingFace Papers

论文研究

5/28 04:00

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally…

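AI commentary · Proposes a local modality substitution strategy that breaks the vision-language fusion bottleneck and markedly deepens multimodal understanding.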

来源：HuggingFace Papers

论文研究

5/28 04:00

GenClaw: Code-Driven Agentic Image Generation

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at t…

AI commentary · Code-driven image generation bridges the gap between language and vision and opens a new paradigm for agents.

来源：HuggingFace Papers

论文研究

5/28 04:00

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific…

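AI commentary · Injects 3D spatial priors into VLMs in a breakthrough way, markedly improving geometric reasoning and opening a new path for AI to understand the 3D world.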

来源：HuggingFace Papers

论文研究

5/28 04:00

UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and l…

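AI commentary · A lightweight GUI agent achieves efficient behavior exploration via a knowledge graph, breaking the dependence on large models.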

来源：HuggingFace Papers

论文研究

5/28 04:00

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks,…

AI commentary · Unifies vision-language and action modeling, moving past single-task limits and boosting robots' cross-scenario generalization.

来源：HuggingFace Papers

论文研究

5/28 04:00

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts…

AI commentary · Reveals the nature of spatial cognition in multimodal models and challenges deep limitations in AI visual reasoning.

来源：HuggingFace Papers

论文研究

5/28 04:00

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

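AI commentary · A lightweight model handles time-series anomaly detection, breaking the efficiency bottleneck of large models; worth watching.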

来源：HuggingFace Papers

论文研究

5/28 04:00

VLM3: Vision Language Models Are Native 3D Learners

Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still l…

AI commentary · Breaks VLMs' limitations in 3D understanding and opens a new paradigm of native 3D learning.

来源：HuggingFace Papers

论文研究

5/28 04:00

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world:…

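AI commentary · Points out VLMs' blind spots and overconfident misjudgments in spatial QA, revealing a key flaw at the cognitive boundary of vision-language models.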

来源：HuggingFace Papers

论文研究

5/28 04:00

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly te…

来源：HuggingFace Papers

论文研究

5/28 04:00

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by train…

AI commentary · Compresses visual tokens efficiently, easing the inference compute bottleneck and speeding up VLM deployment.

来源：HuggingFace Papers

论文研究

5/28 04:00

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision by fine-tuning a pretrained layer decomposition model using only feedback from a vision-lan…

来源：HuggingFace Papers

论文研究

5/28 04:00

Multimodal Music Recommendation System using LLMs

Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction histories which overlooks semantic or acoustic content. Prior work has explored LLM-augmented,…

来源：HuggingFace Papers

论文研究

5/27 04:00

GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world know…

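AI commentary · Causal internalization and a density-aware sampling strategy break the real-world task bottleneck of GUI agents; worth watching.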

来源：HuggingFace Papers

论文研究

5/27 04:00

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely us…

AI commentary · Compares vision-language and video generation models to show which pretraining paradigm better serves spatial intelligence.

来源：HuggingFace Papers

论文研究

5/26 04:00

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to a…

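AI commentary · Unified multimodal models use visual thinking for spatial reasoning, moving past the limits of language and improving cross-view geometric reasoning.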

来源：HuggingFace Papers

论文研究

5/26 04:00

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raise…

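AI commentary · Online skill distillation greatly improves the efficiency of multimodal AI agents and reduces inference compute cost; a highly practical innovation.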

来源：HuggingFace Papers

产品发布/更新

5/24 01:13

zdshang/JoyCapture-UR5

Gamepad-Guided Multimodal Demonstration Capture for UR5 Manipulation

来源：GitHub

论文研究

5/23 04:00

Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems

Physical AI systems increasingly map multimodal observations, language instructions, and learned world representations into physically consequential actions. Robotics foundation models, vision-languag…

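AI commentary · Focuses on safety blind spots in physical AI, systematically reviewing runtime action authorization mechanisms and providing key academic support for risk control in autonomous systems.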

来源：HuggingFace Papers

论文研究

5/22 04:00

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers ap…

来源：HuggingFace Papers

产品发布/更新

5/21 19:14

wangchuxiaoji-oss/doubao2api

Reverse-engineered Doubao (豆包) API → OpenAI-compatible REST service. Free multimodal chat, image/video/music generation, and file hosting for AI agents.

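AI commentary · Reverse-engineers the Doubao API to provide free multimodal services, greatly lowering the barrier to AI application development.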

来源：GitHub

论文研究

5/21 04:00

EMMA: Extracting Multiple physical parameters from Multimodal Data

We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unl…

来源：HuggingFace Papers

论文研究

5/20 04:00

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as acti…

来源：HuggingFace Papers

产品发布/更新

5/19 11:58

fudan-generative-vision/PromptReinjection

[ICML 2026] Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

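AI commentary · Prompt forgetting in multimodal diffusion models is systematically addressed for the first time, a breakthrough for AI image generation.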

来源：GitHub

论文研究

5/16 04:00

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understa…

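AI commentary · Unifies 3D understanding and generation; the Mixture-of-Transformers architecture breaks the modality-fusion bottleneck, advancing multimodal large…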

来源：HuggingFace Papers

产品发布/更新

5/13 15:17

GoodQ02/goodq4all

Local-first multimodal epistemic memory for scene-level video, audio, and text intelligence.

来源：GitHub

产品发布/更新

5/7 21:28

InternLM/ETCHR

A question-conditioned, reasoning-aware image editor designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models (MLLMs).

AI commentary · Splits visual reasoning into a standalone module, giving MLLMs more precise image editing capability.

来源：GitHub

产品发布/更新

5/3 21:37

VeniVeci/VLM-wiki

A full-modal personal knowledge base built on the Karpathy LLM Wiki concept.

来源：GitHub

产品发布/更新

5/3 19:04

shawn0728/OpenSearch-VL

🔍 OpenSearch-VL provides a fully open recipe for training strong multimodal deep search agents through high-quality data curation, diverse visual/search tools,…

来源：GitHub

产品发布/更新

5/2 00:24

cyeinfpro/Lumen

Self-hosted multimodal AI workspace — chat, vision QA, text-to-image, image-to-image in one conversation

来源：GitHub

模型发布/更新

4/28 23:58

Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents

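AI commentary · NVIDIA's new model handles documents, audio, and video in one system, pushing long-context multimodal intelligence forward; it will power the next generation of AI agent applications.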

来源：HuggingFace Blog

技巧与观点

6/4 22:00

AGI Is Not Multimodal

"In projecting language back as the model for thought, we lose sight of the tacit embodied understanding that undergirds our intelligence." –Terry Winograd The recent successes of…

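AI commentary · Challenges language-centrism and highlights the core value of embodied cognition for general intelligence, reshaping the path of AI development.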

来源：The Gradient

论文研究

4/8 18:30

Repurposing Protein Folding Models for Generation with Latent Diffusion

PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure, by learning the latent space of protein folding models. The awarding of t…

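AI commentary · Combines protein folding models with latent diffusion to generate sequence and structure together, opening a new path for AI drug design.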

来源：Berkeley AI Research