
Apple is developing “Ferret-UI”, an artificial intelligence model that can understand the iPhone UI


Apple researchers have published another paper on artificial intelligence (AI) models, this time focused on understanding and navigating smartphone user interfaces (UI). The research paper, which has not yet been peer-reviewed, describes a large language model (LLM) called Ferret-UI that can go beyond traditional computer vision and understand complex smartphone screens. It’s worth noting that this isn’t the first paper on artificial intelligence published by the tech giant’s research arm: it has previously published a paper on multimodal large language models (MLLMs) and another on on-device AI models.

A preprint of the research paper has been posted on arXiv, an open-access online repository of academic papers. The paper, titled “Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs”, focuses on extending the use cases of MLLMs. It emphasizes that most language models with multimodal capabilities cannot understand content beyond natural images and are “limited” in their functionality. It also points out that AI models need to be able to understand complex and dynamic interfaces, such as those on smartphones.

According to the paper, Ferret-UI “is designed to perform precise referring and grounding tasks specific to UI screens, while proficiently interpreting and acting on open-ended language instructions.” In short, the vision-language model can not only process a smartphone screen containing multiple elements that represent different pieces of information, but can also describe those elements to the user when queried.


How Ferret UI handles information on the screen
Photo credit: Apple

Based on the images shared in the paper, the model can understand and classify widgets and identify icons. It can also answer questions like “Where is the launch icon?” and “How do I open the Reminders app?” This shows that the AI can not only interpret the screen it sees, but also navigate to different parts of the iPhone based on prompts.

To train Ferret-UI, Apple researchers curated their own training data of varying complexity. This helps the model learn basic tasks and understand single-step processes. “For advanced tasks, we use GPT-4 [40] to generate data, including detailed description, conversation perception, conversation interaction, and function inference. These advanced tasks enable the model to engage in more nuanced discussions about visual components, develop action plans with specific goals, and explain the general purpose of a screen,” the paper explains.
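To make the distinction between basic and advanced tasks concrete, here is a minimal, purely hypothetical Python sketch of how such training samples might be structured. The field names, task labels, and example values are illustrative assumptions only and are not taken from Apple’s actual Ferret-UI dataset.

```python
# Hypothetical sketch of UI-understanding training samples, for illustration only.
# Field names and values are assumptions, not Apple's actual Ferret-UI data format.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class UISample:
    screenshot: str                       # path to a screen capture
    task: str                             # e.g. "widget_classification", "function_inference"
    prompt: str                           # instruction given to the model
    target: str                           # expected answer
    regions: List[Tuple[int, int, int, int]] = field(default_factory=list)  # boxes for grounding tasks


# Elementary task: identify a single widget at a referenced region.
elementary = UISample(
    screenshot="screens/home_0001.png",
    task="widget_classification",
    prompt="What kind of UI element is in the highlighted region?",
    target="A toggle switch.",
    regions=[(120, 480, 220, 520)],
)

# Advanced task: reason about the overall purpose of the screen.
advanced = UISample(
    screenshot="screens/reminders_0042.png",
    task="function_inference",
    prompt="What is the general purpose of this screen?",
    target="It lets the user create and review reminder lists.",
)

print(elementary.task, advanced.task)
```

In this sketch, elementary samples pair a prompt with a specific screen region, while advanced samples ask for open-ended reasoning about the whole screen, mirroring the split the paper describes.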

The research is promising, and if it passes peer review, Apple may be able to use it to add powerful tools to the iPhone that perform complex UI navigation tasks through simple text or verbal prompts. Such a capability seems ideal for Siri.



