AAI Search
Topic

Multimodal

67 related items · from the historical archive

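Industry News · NEW
2 hours ago
ByteDance Seed reorganizes: Zhou Chang's remit expands, embodied AI brought into the core

The management scope of Zhou Chang, ByteDance's head of multimodal, has expanded again. The SeedRobotics team, previously led by Li Hang, has been reporting to Zhou Chang for more than a month, and Li Hang now oversees academic collaboration in an advisory role. ByteDance is also recruiting a technical lead for embodied intelligence who will be responsible for overall planning of the robotics business. The position is graded L8, benchmarked against Alibaba's P10-P11, and will report to Zhou Chang. Candidates for the role come mainly from the technical leads of top embodied-intelligence startups. (晚点 LatePost)

AI commentary · The reorganization shows ByteDance accelerating its consolidation of resources, with embodied intelligence becoming a strategic core; the search for a technical lead signals an escalating industry-wide competition for talent.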

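Product Releases/Updates · NEW
14 hours ago
Alibaba releases the Qwen3.7-Plus model, upgrading its multimodal-interaction hybrid AI agent

IT Home (IT之家), June 2: Alibaba's Qwen large-model team published a blog post today (June 2) announcing the Qwen3.7-Plus model, positioned as a hybrid agent for multimodal interaction. Qwen3.7-Plus is the multimodal upgrade of Qwen3.7, and its core positioning is an agent foundation that unifies vision and language. It retains text, coding, tool-use, and productivity-workflow capabilities while strengthening visual understanding, visual reasoning, and cross-modal task handling. The model has, through Alibaba Cloud…

AI commentary · The convergence of multimodality and agents may be a key step in accelerating AI's move from "conversation" to "action".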

Research Papers · NEW
19 hours ago
AdaCodec: A Predictive Visual Code for Video MLLMs

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame a…

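AI commentary · Takes a new perspective on temporal redundancy in video and compresses frames with predictive coding, which could substantially reduce the compute cost of video multimodal models.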

Product Releases/Updates
5/29 15:25
StarTrail-org/PixelRAG

The end of web parsing. The beginning of scalable pixel-native search.

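AI commentary · A breakthrough in pixel-level search that aims to end traditional web parsing and open a new paradigm of vision-native retrieval.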

Research Papers · NEW
5/29 04:00
Task-Focused Memorization for Multimodal Agents

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory…

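AI commentary · Focuses on building long-term memory for multimodal agents, moving past the limits of conventional memory to enable continual learning and knowledge accumulation.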

Research Papers · NEW
5/29 04:00
Linear Scaling Video VLMs for Long Video Understanding

Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to gr…

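AI commentary · Breaks through the compute bottleneck of video models with linear scaling, paving the way for real-time understanding of long videos.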

Research Papers
5/28 04:00
GenClaw: Code-Driven Agentic Image Generation

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at t…

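AI commentary · Code-driven image generation bridges the gap between language and vision and opens a new paradigm for intelligent agents.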

Research Papers · NEW
5/28 04:00
VLM3: Vision Language Models Are Native 3D Learners

Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still l…

AI commentary · Breaks through the limits of vision-language models in 3D understanding and opens a new paradigm of native 3D learning.

Product Releases/Updates · NEW
5/24 01:13
zdshang/JoyCapture-UR5

Gamepad-Guided Multimodal Demonstration Capture for UR5 Manipulation

AI commentary · A gamepad-guided approach to multimodal data collection that greatly lowers the barrier to robot demonstration.

Product Releases/Updates
5/21 19:14
wangchuxiaoji-oss/doubao2api

Reverse-engineered Doubao (豆包) API → OpenAI-compatible REST service. Free multimodal chat, image/video/music generation, and file hosting for AI agents.

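AI commentary · Reverse-engineers the Doubao API to provide free multimodal services, greatly lowering the barrier to AI application development.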

Product Releases/Updates
5/19 11:58
fudan-generative-vision/PromptReinjection

[ICML 2026] Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

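AI commentary · Prompt forgetting in multimodal diffusion models is systematically addressed for the first time, bringing a breakthrough to AI image generation.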

Product Releases/Updates
5/7 21:28
InternLM/ETCHR

A question-conditioned, reasoning-aware image editor designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models (MLLMs).

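AI commentary · Splits visual reasoning out into a standalone module, giving multimodal large language models more precise image-editing capability.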

Product Releases/Updates
5/3 19:04
shawn0728/OpenSearch-VL

🔍 OpenSearch-VL provides a fully open recipe for training strong multimodal deep search agents through high-quality data curation, diverse visual/search tools,…

Product Releases/Updates
5/2 00:24
cyeinfpro/Lumen

Self-hosted multimodal AI workspace — chat, vision QA, text-to-image, image-to-image in one conversation

Tips & Opinions · NEW
6/4 22:00
AGI Is Not Multimodal

"In projecting language back as the model for thought, we lose sight of the tacit embodied understanding that undergirds our intelligence." –Terry Winograd The recent successes of…

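AI commentary · Challenges language-centrism and highlights the core value of embodied cognition for general intelligence, reshaping the path of AI development.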