Mobile-Agent: An Autonomous Multimodal Agent for Mobile Device Operations

Researchers from Beijing Jiaotong University and Alibaba Group have developed Mobile-Agent, an autonomous multimodal agent designed to operate a variety of mobile applications using a unified visual perception framework. Unlike previous solutions that relied on XML files or mobile system metadata, Mobile-Agent employs visual perception tools to accurately identify and locate visual and textual elements within an app’s interface.

By leveraging its perception abilities, Mobile-Agent autonomously plans and executes complex operation tasks, navigating through mobile apps step by step. The framework utilizes OCR tools for text localization and CLIP for icon localization, enabling the agent to perform tasks such as opening apps, clicking text or icons, typing, and navigating.
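To make the action space concrete, here is a minimal sketch of how such an agent might dispatch operations like tapping text or icons. All names (the `Box` type, the `execute` helper, the adb-style command strings) are illustrative assumptions, not the authors' API; in the real system, text boxes would come from an OCR tool and icon boxes from CLIP-based matching rather than the mocked list below.

```python
# Hypothetical sketch of a Mobile-Agent-style action space (illustrative
# names, not the paper's implementation). Perception is mocked: real text
# boxes come from OCR and icon boxes from CLIP-based icon localization.

from dataclasses import dataclass

@dataclass
class Box:
    label: str   # recognized text or icon description
    x: int       # tap coordinates (box center)
    y: int

def locate(label, boxes):
    """Find an on-screen element by its OCR text or icon description."""
    for box in boxes:
        if box.label == label:
            return box
    return None

def execute(action, arg, boxes):
    """Dispatch one agent-chosen action to a (mock) device controller."""
    if action == "open_app":
        return f"adb: launch {arg}"
    if action in ("click_text", "click_icon"):
        box = locate(arg, boxes)
        if box is None:
            return "error: element not found"   # would trigger self-reflection
        return f"adb: tap {box.x} {box.y}"
    if action == "type":
        return f"adb: input text {arg!r}"
    if action == "back":
        return "adb: keyevent BACK"
    raise ValueError(f"unknown action {action}")

screen = [Box("Search", 540, 120), Box("Send", 980, 1800)]
print(execute("click_text", "Search", screen))  # adb: tap 540 120
```

Grounding every action in detected boxes rather than system metadata is what lets the same dispatch logic work across apps and operating environments.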

The key advantage of Mobile-Agent lies in its vision-centric approach, which enhances adaptability across different mobile operating environments without the need for system-specific customizations. Through iterative self-planning and self-reflection, Mobile-Agent combines the user's instruction with real-time screenshot analysis to decide each step. When an operation fails during execution, the agent invokes a self-reflection method to recover, improving the rate at which instructions are completed.
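The plan-act-reflect loop described above can be sketched as follows. The loop structure, the toy planner, and all function names are illustrative assumptions rather than the paper's implementation; the real agent queries a multimodal LLM with the current screenshot and its operation history at each step.

```python
# Minimal sketch of an iterative plan-act-reflect loop (an assumption-laden
# illustration, not the authors' code). On failure, the outcome is recorded
# so the next planning call can choose a different operation.

def run_agent(instruction, plan_step, act, max_steps=10):
    """Loop: plan the next action, execute it, and self-reflect on failures."""
    history = []
    for _ in range(max_steps):
        action = plan_step(instruction, history)
        if action == "done":
            return history
        ok = act(action)
        # Self-reflection: failures stay in the history, steering the planner
        # toward an alternative operation on the next iteration.
        history.append((action, "ok" if ok else "failed"))
    return history

# Toy planner: retry with text localization after an icon click fails.
def plan_step(instruction, history):
    if not history:
        return "click_icon:settings"
    last_action, status = history[-1]
    if status == "failed":
        return "click_text:Settings"   # fall back to text localization
    return "done"

def act(action):
    return action.startswith("click_text")  # pretend only text clicks succeed

trace = run_agent("open settings", plan_step, act)
print(trace)  # [('click_icon:settings', 'failed'), ('click_text:Settings', 'ok')]
```

The design choice worth noting is that reflection here is just conditioning the planner on its own failure history, which is how the error-recovery behavior described in the article can emerge without any system-specific error handling.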

To evaluate Mobile-Agent comprehensively, the researchers introduced Mobile-Eval, a benchmark covering 10 popular mobile apps with three instructions of increasing difficulty for each app. The framework achieved completion rates of 91%, 82%, and 82% across the three instruction types, with a Process Score of around 80%. Measured against the number of steps a human operator needs, Mobile-Agent reached a relative efficiency of 80%.
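As an illustration of what such benchmark numbers measure, the snippet below computes completion rate, a per-step process score, and step efficiency relative to a human from toy per-task records. The record fields and the exact formulas are assumptions for exposition; the paper defines the Mobile-Eval metrics precisely.

```python
# Illustrative benchmark-metric arithmetic (a sketch; the exact metric
# definitions are those in the Mobile-Eval paper, not these formulas).
# Each record: did the task finish, how many of the agent's steps were
# correct, total agent steps, and steps a human operator would need.

def completion_rate(records):
    """Fraction of tasks the agent finished end to end."""
    return sum(r["completed"] for r in records) / len(records)

def process_score(records):
    """Fraction of agent steps that were correct, averaged over tasks."""
    return sum(r["correct_steps"] / r["agent_steps"] for r in records) / len(records)

def relative_efficiency(records):
    """Human steps divided by agent steps; 1.0 means human-level step count."""
    return sum(r["human_steps"] for r in records) / sum(r["agent_steps"] for r in records)

records = [
    {"completed": True,  "correct_steps": 4, "agent_steps": 5, "human_steps": 4},
    {"completed": True,  "correct_steps": 3, "agent_steps": 3, "human_steps": 3},
    {"completed": False, "correct_steps": 2, "agent_steps": 4, "human_steps": 3},
]
print(round(completion_rate(records), 2))      # 0.67
print(round(relative_efficiency(records), 2))  # 0.83
```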

The study showcases the effectiveness and efficiency of Mobile-Agent as a versatile and adaptable solution for language-agnostic interaction with mobile applications. With its robust performance and self-reflective capabilities, Mobile-Agent has the potential to revolutionize mobile device operations and serve as a reliable mobile device assistant.

For more details about Mobile-Agent, refer to the research paper and GitHub repository.
