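The management scope of ByteDance multimodal lead Zhou Chang (周畅) has expanded again: the SeedRobotics team, previously led by Li Hang (李航), has been reporting to Zhou for over a month, and Li now oversees academic collaboration in an advisory role. ByteDance is also recruiting a head of embodied-intelligence technology to own overall planning for the robotics business; the role is leveled at L8, comparable to Alibaba's P10-P11, and will report to Zhou. Candidates are mainly technical leads at top embodied-intelligence startups. (晚点 LatePost)
AI commentary · The reorganization suggests ByteDance is speeding up resource consolidation, with embodied intelligence becoming a strategic core; the search for a technical lead points to an escalating industry talent war.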
67 related items in total · from the historical archive
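Qwen3.7-Plus is now available on Alibaba Cloud Bailian (阿里云百炼)
AI commentary · Covers both multimodal tasks and desktop software; AI agent capabilities step up a level, adding a new variable to the developer ecosystem.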

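IT之家 (ITHome), June 2: Alibaba's Qwen (千问) published a blog post today (June 2) announcing the Qwen3.7-Plus model, positioned as a multimodal interactive hybrid agent. Qwen3.7-Plus is the multimodal upgrade of Qwen3.7, with a core positioning as an agent foundation that unifies vision and language. It retains text, coding, tool-use, and productivity-workflow capabilities while strengthening visual understanding, visual reasoning, and cross-modal task handling. The model has already, via Alibaba Cloud…
AI commentary · Merging multimodality with agents may be a key step in moving AI from "conversation" to "action."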
Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts…
AI commentary · Uses perceptual perturbation and reward modeling to address visual bias when multimodal LLMs act as judges.
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making…
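AI commentary · Uses prototype-guided adaptive expansion and geometric consolidation to tackle catastrophic forgetting in multimodal continual learning.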
Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame a…
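AI commentary · Offers a new perspective on temporal redundancy in video, compressing frames with predictive coding; this could substantially cut the compute cost of video multimodal models.
AI commentary · Qwen3.7-Plus combines multimodal and agent capabilities and may open a new paradigm for AI applications.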
Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains undere…
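AI commentary · Focuses on how precisely video LLMs judge momentary visual events, filling a gap in temporal-fidelity evaluation.
MiniMax M3 was officially released today. MiniMax M3 reaches frontier-level capability on professional tasks such as coding and agents. It uses a new attention architecture, MSA (MiniMax Sparse Attention), and supports ultra-long contexts of up to 1M. It is also a natively multimodal model that accepts image and video input and can operate a computer desktop. On SWE-Bench Pro, which measures coding ability, MiniMa…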
Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigat…
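AI commentary · Vision-language models take inverse graphics from reasoning to editable 3D scenes, moving past the limits of traditional reconstruction.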
Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuni…
AI commentary · A key benchmark for evaluating semantic understanding in VLA models, exposing deeper flaws in robot action prediction.
Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. While English benchmarks have advanced rapidly, non-English counterpa…
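AI commentary · The first chart question-answering benchmark built on Japanese government white papers; it fills a gap in non-English VLM evaluation and advances multilingual document understanding.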
Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existi…
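AI commentary · Evaluates how well vision-language models hold up under stress in physical scenes, filling a gap in robot perception robustness.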
Transformer-based architectures have advanced sequence modeling in language and vision, yet general-purpose representation learning for heterogeneous multivariate time series remains underexplored. We…
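AI commentary · Multimodal joint embedding lets sensor data "speak," breaking through the bottleneck in general-purpose time-series representation.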
Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in f…
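AI commentary · Shows that vision-language models still suppress female representation in ambiguous settings, exposing a deep blind spot in AI fairness research.
AI commentary · A 3B model outperforms GPT-4o; low cost and strong generalization open a new path for embodied intelligence.
AI commentary · A small model with big gains: 3B parameters come out ahead in unseen scenarios, with great potential for low-cost inference.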
The end of web parsing. The beginning of scalable pixel-native search.
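AI commentary · A breakthrough in pixel-level search that ends traditional web parsing and opens a new paradigm of visually native retrieval.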
Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory…
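AI commentary · Focuses on building long-term memory for multimodal agents, moving beyond the limits of conventional memory to enable continual learning and knowledge accumulation.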
Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to gr…
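AI commentary · Breaks the compute bottleneck of video models with linear scaling, paving the way for real-time understanding of long video.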
Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structu…
AI commentary · Removes multimodal models' dependence on a pretrained VAE to truly unify perception and generation, a key step toward efficient AI.
Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliabil…
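AI commentary · Proposes a trajectory hide-and-seek method that actively uncovers hidden failure signals in VLA models at runtime, improving robot reliability.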
While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference pha…
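AI commentary · Uses reinforcement learning to have the model internalize visual reasoning, breaking through the inference-stage bottleneck of multimodal models.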
Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models such as CLIP struggle with binding: they recogn…
Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains un…
AI commentary · The first benchmark to use Minecraft to evaluate open-world exploration in multimodal LLMs, filling a testing gap in this area.
Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report unsatisfactory performance when applying large language or multimodal…
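AI commentary · Achieves trustworthy vision-language reasoning with a lightweight model, breaking the efficiency bottleneck in time-series anomaly detection.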
Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. H…
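AI commentary · Multi-agent collaboration produces verifiable long-form reports, breaking the credibility bottleneck in deep research and moving AI from search toward argumentation.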
Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the r…
AI commentary · Evaluates dynamic memory in multimodal agents, pushing the leap from simple recall to world modeling.
Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally…
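AI commentary · Proposes a partial modality replacement strategy that breaks the vision-language fusion bottleneck and markedly deepens multimodal understanding.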
Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at t…
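AI commentary · Code-driven image generation bridges the gap between language and vision and opens a new paradigm for intelligent agents.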
Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific…
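AI commentary · Injects 3D spatial priors into vision-language models, markedly improving geometric reasoning and opening a new path for AI to understand the 3D world.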
Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and l…
AI commentary · A lightweight GUI agent uses a knowledge graph for efficient behavior exploration, breaking the dependence on large models.
Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks,…
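AI commentary · Unifies vision-language and action modeling, moving beyond single-task limits and boosting robots' cross-scenario generalization.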
Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts…
AI commentary · Reveals the nature of spatial cognition in multimodal models and challenges the deeper limits of AI visual reasoning.
Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still l…
AI commentary · Breaks the limits of vision-language models in 3D understanding and opens a new paradigm of native 3D learning.
Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world:…
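AI commentary · Points out the blind spots and overconfident misjudgments of VLMs in spatial question answering, revealing key flaws at the boundary of vision-language model cognition.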
Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly te…
Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by train…
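AI commentary · Efficiently compresses visual tokens, easing the inference compute bottleneck and speeding up deployment of vision-language models.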
Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world know…
AI commentary · Causal internalization and density sampling strategies break the real-world task bottleneck for GUI agents; worth watching.
Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely us…
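AI commentary · Compares vision-language and video-generation models to reveal which pre-training paradigm better serves spatial intelligence.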
Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to a…
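AI commentary · A unified multimodal model uses visual thinking for spatial reasoning, moving beyond the limits of language and improving cross-view geometric reasoning.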
Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raise…
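AI commentary · Online skill distillation substantially improves the efficiency of multimodal AI agents and cuts inference compute cost; a practical and innovative approach.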
Gamepad-Guided Multimodal Demonstration Capture for UR5 Manipulation
AI commentary · A gamepad-guided multimodal data capture setup that greatly lowers the barrier to robot demonstration.
Physical AI systems increasingly map multimodal observations, language instructions, and learned world representations into physically consequential actions. Robotics foundation models, vision-languag…
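AI commentary · Focuses on safety blind spots in physical AI and systematically surveys runtime action authorization mechanisms, providing key academic support for risk control in autonomous systems.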
Reverse-engineered Doubao (豆包) API → OpenAI-compatible REST service. Free multimodal chat, image/video/music generation, and file hosting for AI agents.
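AI commentary · Reverse-engineers the Doubao API to offer free multimodal services, greatly lowering the barrier to AI application development.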
The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as acti…
[ICML 2026] Alleviating Prompt Forgetting in Multimodal Diffusion Transformers
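AI commentary · Prompt forgetting in multimodal diffusion models is systematically addressed for the first time, bringing a breakthrough to AI image generation.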
AI commentary · Uses reinforcement learning to alleviate prompt forgetting in multimodal diffusion models, opening a new path to better AI generation quality.
This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understa…
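AI commentary · Unifies 3D understanding and generation; the Mixture-of-Transformers architecture breaks the modality-fusion bottleneck, advancing multimodal large…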
A question-conditioned, reasoning-aware image editor designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models (MLLMs).
AI commentary · Splits visual reasoning out into a standalone module, giving multimodal LLMs more precise image-editing capability.
A full-modal personal knowledge base built on the Karpathy LLM Wiki concept.
🔍 OpenSearch-VL provides a fully open recipe for training strong multimodal deep search agents through high-quality data curation, diverse visual/search tools,…
Self-hosted multimodal AI workspace — chat, vision QA, text-to-image, image-to-image in one conversation
Archived snapshot of Thinking-with-Visual-Primitives
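AI commentary · NVIDIA's new model handles documents, audio, and video in a unified way, advancing long-context multimodal intelligence and set to power next-generation AI agent applications.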

"In projecting language back as the model for thought, we lose sight of the tacit embodied understanding that undergirds our intelligence." –Terry Winograd The recent successes of…
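AI commentary · Challenges language-centrism and highlights the core value of embodied cognition for general intelligence, reshaping the path of AI development.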

PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure, by learning the latent space of protein folding models. The awarding of t…
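AI commentary · Combines protein folding models with latent diffusion to generate sequence and structure simultaneously, opening a new path for AI drug design.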