DeepMind’s Highly Capable Multimodal Model Gemini Reaches Human-Expert Level

A Google DeepMind research team introduces Gemini, a family of multimodal models that shows strong proficiency across image, audio, video, and text comprehension and advances the state of the art in large-scale language modeling, image interpretation, audio processing, and video understanding.

Multimodal Large Language Models (MLLMs) have recently emerged as a prominent research focus, harnessing powerful Large Language Models (LLMs) to undertake diverse multimodal tasks. Their capabilities, such as writing narratives based on images and OCR-free math reasoning, mark a departure from conventional methods and suggest a potential path toward artificial general intelligence.

Following this path, a Google DeepMind research team introduces a new family of multimodal models in their latest paper, titled “Gemini: A Family of Highly Capable Multimodal Models.” The Gemini models are designed to understand and reason jointly over image, audio, video, and text inputs.

Gemini models build on Transformer decoders (Vaswani et al., 2017), with enhancements in architecture and model optimization that enable stable training at scale and optimized inference on Google’s Tensor Processing Units. The first version, Gemini 1.0, comes in three sizes, each addressing different computational limits and application requirements: Ultra, designed for highly complex tasks; Pro, offering enhanced performance and deployability at scale; and Nano, tailored for on-device applications.
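To make the term "Transformer decoder" concrete, here is a minimal sketch of the causal self-attention step from Vaswani et al. (2017). It is a textbook illustration only: it is not Gemini's implementation, the article does not describe the architectural enhancements, and the function names and toy values are assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position i attends only to positions 0..i, which is what lets a
    decoder generate text left to right.
    """
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Causal mask: only keys at positions <= i are visible.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs

# Toy input: three token positions with 2-dimensional vectors (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_self_attention(x, x, x)
```

Because of the mask, the output at a given position does not change when later tokens are appended, which is the property that makes autoregressive generation possible.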

Gemini models are trained to accept textual input interleaved with a wide range of audio and visual inputs, including natural images, charts, screenshots, PDFs, and videos, and they can produce text and image outputs. Notably, Gemini can handle variable input resolution for video understanding, spending more compute on tasks that demand fine-grained comprehension. Gemini also captures nuances that are typically lost when audio is naively mapped to text input.

Developing the Gemini family of models required innovations in training algorithms, datasets, and infrastructure. The Pro model, benefiting from the inherent scalability of the infrastructure and learning algorithms, completes pretraining in a matter of weeks, using a fraction of the Ultra model’s resources. The Nano series leverages advances in distillation and training algorithms to produce top-tier small language models, well suited to tasks like summarization and reading comprehension, that power the next generation of on-device experiences.
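The article does not describe the distillation recipe used for Nano. As a generic illustration of the idea, the sketch below computes the classic soft-target distillation loss, in which a small student model is trained to match the temperature-softened output distribution of a larger teacher. The function names, the temperature value, and the toy logits are all assumptions for this example, not details from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the conventional correction that keeps gradient
    magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over a 3-token vocabulary.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. In practice it is usually combined with the ordinary next-token loss on ground-truth data.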

Extensive evaluation across diverse benchmarks shows that the Gemini Ultra model achieves state-of-the-art results on 30 of 32 benchmarks, notably reaching human-expert performance on the widely studied exam benchmark MMLU. The team is optimistic about the potential of Gemini models in cross-modal reasoning and language understanding and envisions a wide range of use cases. The researchers also emphasize their commitment to deploying these models to users responsibly.


The paper Gemini: A Family of Highly Capable Multimodal Models is available on arXiv.


Author: Hecate He | Editor: Chain Zhang


