
Apple researchers are working on MM1, a family of multi-modal AI models


Apple researchers have shared their work on building multimodal artificial intelligence (AI) large language models (LLMs) in a preprint paper. The paper, published on an online portal on March 14, focuses on how to achieve advanced multimodal capabilities by training foundation models on both plain text data and images. The Cupertino-based tech giant is making new strides in artificial intelligence after CEO Tim Cook, speaking on the company’s earnings call, said AI features could roll out later this year.

The preprint version of the research paper was published on arXiv, an open-access online repository of scholarly papers; papers posted there have not been peer-reviewed. While Apple is not mentioned in the paper itself, most of the listed researchers are affiliated with the company’s machine learning (ML) division, so the project is believed to belong to the iPhone maker as well.

According to the researchers, they are working on MM1, a family of multimodal models containing up to 30 billion parameters. The paper’s authors describe it as a family of performant multimodal large language models (MLLMs), emphasizing image encoders, vision-language connectors, and other architectural components and data choices used to create AI models capable of understanding both text- and image-based input.

The paper gives an example, stating: “We demonstrate that for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results.”

In detail, the AI model is currently at the pre-training stage, meaning it has not yet been trained to the point of producing the desired output. This is the stage at which the algorithms and architecture that determine the model’s workflow and how it processes data are established. The team of Apple researchers was able to add computer vision to the model using an image encoder and a vision-language connector. When tested on a mix of image-caption, interleaved image-text, and text-only datasets, the team found the results to be competitive with existing models at the same stage.
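For readers curious what an “image encoder plus vision-language connector” pipeline looks like in practice, here is a minimal illustrative sketch. It is not Apple’s code; the class name, layer sizes, and the simple MLP projection are assumptions made for the example. The idea it shows is the general one described in the paper: features from an image encoder are projected into the language model’s embedding space so image and text tokens can be processed by the same model.

```python
# Illustrative sketch only -- not Apple's MM1 implementation.
# A vision-language connector projects image-encoder patch features into the
# language model's token-embedding space. All names and dimensions here are
# assumptions chosen for the example.
import torch
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    """Maps image-encoder features to the LLM's embedding dimension."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # A simple two-layer MLP projection; real connector designs vary.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        return self.proj(image_features)  # (batch, num_patches, llm_dim)


# Toy usage: stand-in tensors play the role of image-encoder output and
# text-token embeddings; the projected image tokens are concatenated with the
# text embeddings to form one multimodal input sequence.
batch, patches, vision_dim, llm_dim = 2, 16, 1024, 4096
image_features = torch.randn(batch, patches, vision_dim)
text_embeddings = torch.randn(batch, 32, llm_dim)

connector = VisionLanguageConnector(vision_dim, llm_dim)
image_tokens = connector(image_features)
multimodal_sequence = torch.cat([image_tokens, text_embeddings], dim=1)
print(multimodal_sequence.shape)  # torch.Size([2, 48, 4096])
```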

While the research is significant, this paper alone is not enough to confirm that a multimodal AI chatbot will be added to Apple’s operating systems. At this stage, it is difficult even to say whether the model will be multimodal only in the inputs it accepts or also in the outputs it produces (that is, whether it could generate AI images). But if the results hold up after peer review, it can be said that the tech giant has taken another big step toward building a foundation model for native generative AI.




Surja, a dedicated blog writer and explorer of diverse topics, holds a Bachelor’s degree in Science. Her writing journey unfolds as a fascinating exploration of knowledge and creativity. With a background in B.Sc., Surja brings a unique perspective to the world of blogging. Her articles delve into a wide array of subjects, showcasing her versatility and passion for learning. Whether she’s decoding scientific phenomena or sharing insights from her explorations, Surja’s blogs reflect a commitment to making complex ideas accessible.