OmniBench

Towards The Future of Universal Omni-Language Models

Yizhi Li*1,2, Ge Zhang*†1,3, Yinghao Ma*1,4, Ruibin Yuan1,5, Kang Zhu1,3,
Hangyu Guo1, Yiming Liang1, Jiaheng Liu1, Jian Yang1, Siwei Wu1,2,
Xingwei Qu1,2, Jinjie Shi4, Xinyue Zhang1, Zhenzhu Yang1, Xiangzhou Wang1,
Zhaoxiang Zhang6, Zachary Liu7, Emmanouil Benetos4, Wenhao Huang1,3, Chenghua Lin†1,2

1m-a-p.ai, 2University of Manchester, 301.ai, 4Queen Mary University of London,
5Hong Kong University of Science and Technology, 6Nanjing University, 7Dartmouth College

*Core Contributors
†Corresponding to: yizhi.li@hotmail.com, gezhang@umich.edu, c.lin@manchester.ac.uk

🔔News

🔥[2024-09-22]: We release OmniBench, a new benchmark for text, image, and audio large language models!

Introduction

Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) open-source OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) these baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities.

Data Samples Across Categories

Action and Activity

Contextual and Environmental

Count and Quantity

Identity and Relationship

Object Identification and Description

Plot Inference

Story Description

Text and Symbols

OmniBench

Overview

OmniBench aims to be the first comprehensive benchmark for evaluating multimodal large language models that support simultaneous image, audio, and text inputs. OmniBench is designed to evaluate the understanding capability of MLLMs on cross-modality complementary information: the models are required to interpret the multimodal input and provide an accurate textual answer. The problem can be formulated as follows: given a tuple of (image, audio, text), the model is required to recognize the objects, rebuild the context, and conduct reasoning based on the given information. The design logic, dataset statistics, and annotation protocols are introduced in this section.
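To make this formulation concrete, below is a minimal sketch of a benchmark item and an inference wrapper. The class and function names (OmniBenchItem, answer_question, model.generate) are illustrative placeholders, not the released data schema or an official API.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class OmniBenchItem:
        """One tri-modal benchmark instance (field names are illustrative)."""
        image_path: str     # visual input
        audio_path: str     # acoustic input
        question: str       # textual question
        options: List[str]  # four candidate answers
        answer: str         # gold option letter, e.g. "B"

    def answer_question(model, item: OmniBenchItem) -> str:
        """Hypothetical wrapper: the OLM must recognize objects in the image
        and audio, rebuild the shared context, and reason toward one option."""
        prompt = item.question + "\n" + "\n".join(
            f"{chr(65 + i)}. {opt}" for i, opt in enumerate(item.options)
        )
        # `model.generate` stands in for whatever tri-modal inference call a
        # given OLM exposes; it is not a specific library API.
        return model.generate(image=item.image_path, audio=item.audio_path, text=prompt)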

Figure: Statistics of the OmniBench dataset.

We propose a novel task type categorization in OmniBench that assesses a broad spectrum of reasoning and cognitive abilities. Our taxonomy progresses from fundamental perception (Object Identification & Description) to complex inference (Contextual & Environmental, Identity & Relationship). It incorporates temporal and logical understanding of event order (Action & Activity, Story Description, Plot Inference), spatial awareness (Contextual & Environmental), entity recognition (Object Identification & Description), symbolic processing (Text & Symbols), and quantitative reasoning (Count & Quantity). This comprehensive design evaluates both low-level perceptual skills and high-level cognitive functions, enabling a holistic assessment of multimodal large language models' (MLLMs) capabilities to recognize, describe, integrate information, understand context, and make nuanced inferences. OmniBench comprises 1,142 question-answer pairs; the figure above summarizes the task type distribution, text lengths, and image and audio characteristics. The dataset's audio content falls into three categories: speech (human vocal communication), sound events (non-speech natural, environmental, and mechanical sounds), and music (various compositions and performances).
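For per-category analysis, the taxonomy and audio categories can be written out as plain constants and used to aggregate accuracy, as in the sketch below. The string values mirror the category names above but are not guaranteed to match the released dataset's exact field values.

    from collections import defaultdict

    # Task-type taxonomy and audio categories described above (string values
    # are ours, not necessarily the dataset's exact field strings).
    TASK_TYPES = [
        "Action and Activity",
        "Contextual and Environmental",
        "Count and Quantity",
        "Identity and Relationship",
        "Object Identification and Description",
        "Plot Inference",
        "Story Description",
        "Text and Symbols",
    ]
    AUDIO_TYPES = ["speech", "sound event", "music"]

    def accuracy_by_category(records, key="task_type"):
        """records: iterable of dicts with a category key and a boolean 'correct'."""
        hits, totals = defaultdict(int), defaultdict(int)
        for r in records:
            totals[r[key]] += 1
            hits[r[key]] += int(r["correct"])
        return {cat: hits[cat] / totals[cat] for cat in totals}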


Figure: The annotation protocol of OmniBench.

Our annotation scheme is built upon a fundamental principle: the correct answer to each question must require information from both the image and audio components. This ensures that the benchmark effectively evaluates a model's ability to analyze information across modalities. Our quality control process was two-fold, comprising a human inspection round and an automatic inspection round assisted by an MLLM.
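One plausible form of the MLLM-assisted automatic round, consistent with the stated principle, is to probe whether an item is already answerable from a single modality. The mllm_judge client and its answer method below are hypothetical stand-ins, not the authors' actual pipeline.

    def flag_single_modality_answerable(mllm_judge, item) -> bool:
        """Hypothetical check: if a strong MLLM can pick the gold option from
        the image alone or the audio alone, the item may not require genuine
        tri-modal reasoning and is flagged for human re-annotation."""
        image_only = mllm_judge.answer(image=item.image_path, audio=None,
                                       question=item.question, options=item.options)
        audio_only = mllm_judge.answer(image=None, audio=item.audio_path,
                                       question=item.question, options=item.options)
        return item.answer in (image_only, audio_only)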


Experiment Results

Leaderboard

The main focus of OmniBench is to evaluate how well omni-language models (OLMs) can understand and reconstruct the context from the information given in the image, audio, and text modalities. Each question provides four candidate options, and we use accuracy as the evaluation metric, i.e., the proportion of responses whose selected option letter matches the correct option (n.b., a random-guess baseline achieves 25% accuracy under this setting).
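A minimal scorer consistent with this metric might look like the following; the regex-based extraction of the option letter from a free-form response is our assumption, not the official evaluation script.

    import re
    from typing import Optional

    def extract_option_letter(response: str) -> Optional[str]:
        """Pull the first standalone option letter (A-D) from a model response."""
        match = re.search(r"\b([A-D])\b", response.strip().upper())
        return match.group(1) if match else None

    def accuracy(responses, gold_letters) -> float:
        """Fraction of responses whose extracted letter matches the gold option.
        Random guessing over four options yields about 25% under this metric."""
        correct = sum(extract_option_letter(r) == g
                      for r, g in zip(responses, gold_letters))
        return correct / len(gold_letters)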


Model categories: Open-Source OLM, Proprietary OLM, Open-Source VLM or ALM, Proprietary VLM or ALM.

The first row indicates the input context, where "Img. & Aud." refers to the vanilla image and audio inputs, and "(T)" refers to a textual alternative of the image or audio. Click on the four setting columns to expand detailed results.

Name | Size | Date | Overall (Img. & Aud.) | Overall (Img. (T) & Aud.) | Overall (Img. & Aud. (T)) | Overall (Img. (T) & Aud. (T))

Overall results of different models on the OmniBench leaderboard.
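To make the four settings concrete, the sketch below shows how the inputs could be swapped between their raw form and a textual alternative (e.g., an image caption or an audio transcript). The attribute names image_caption and audio_transcript are assumptions for illustration.

    def build_inputs(item, image_as_text: bool, audio_as_text: bool):
        """Compose model inputs for one of the four leaderboard settings.

        image_as_text / audio_as_text select the "(T)" columns: the raw image
        or audio is replaced by its textual alternative. The attribute names
        used here (image_caption, audio_transcript) are illustrative.
        """
        image = None if image_as_text else item.image_path
        audio = None if audio_as_text else item.audio_path
        extra_text = []
        if image_as_text:
            extra_text.append("Image description: " + item.image_caption)
        if audio_as_text:
            extra_text.append("Audio transcript: " + item.audio_transcript)
        return image, audio, "\n".join(extra_text)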

BibTeX


    @misc{li2024omnibench,
        title={OmniBench: Towards The Future of Universal Omni-Language Models}, 
        author={Yizhi Li and Ge Zhang and Yinghao Ma and Ruibin Yuan and Kang Zhu and Hangyu Guo and Yiming Liang and Jiaheng Liu and Jian Yang and Siwei Wu and Xingwei Qu and Jinjie Shi and Xinyue Zhang and Zhenzhu Yang and Xiangzhou Wang and Zhaoxiang Zhang and Zachary Liu and Emmanouil Benetos and Wenhao Huang and Chenghua Lin},
        year={2024},
        eprint={2409.15272},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2409.15272}, 
    }