🔥[2024-09-22]: We release OmniBench, a new benchmark for text, image, and audio large language models!
Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) open-source OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) these baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities.
OmniBench aims to be the first comprehensive benchmark for evaluating multimodal large language models that support simultaneous image, audio, and text inputs. Since OmniBench is designed to evaluate how well MLLMs understand complementary information across modalities, the models are required to interpret the multimodal input and provide an accurate textual answer. The problem can be formulated as follows: given a tuple of (image, audio, text), the model is required to recognize the objects, rebuild the context, and conduct reasoning based on the given information, as sketched in the example below. The design logic, dataset statistics, and annotation protocols are introduced in this section.
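As a minimal illustration of this formulation, the sketch below shows one plausible way to represent a single (image, audio, text) item and the interface a tri-modal model would need to expose. The `OmniSample` dataclass, the `OmniLanguageModel` protocol, and the `build_prompt` helper are illustrative assumptions, not the benchmark's actual data schema or API.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class OmniSample:
    """One OmniBench-style item: an (image, audio, text) tuple with a question."""
    image_path: str     # path to the image input
    audio_path: str     # path to the audio input
    question: str       # textual question that refers to both modalities
    options: List[str]  # four candidate answers, labelled A-D
    answer: str         # ground-truth option letter, e.g. "B"


class OmniLanguageModel(Protocol):
    """Minimal interface an omni-language model (OLM) would need to expose."""
    def answer(self, image_path: str, audio_path: str, prompt: str) -> str:
        """Return the model's free-form textual response."""
        ...


def build_prompt(sample: OmniSample) -> str:
    """Format the question and its four options into a single text prompt."""
    letters = ["A", "B", "C", "D"]
    option_lines = "\n".join(f"{l}. {o}" for l, o in zip(letters, sample.options))
    return (
        f"{sample.question}\n{option_lines}\n"
        "Answer with the letter of the correct option only."
    )
```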
We propose a novel task type categorization in OmniBench that assesses a broad spectrum of reasoning and cognitive abilities. Our taxonomy progresses from fundamental perception (Object Identification & Description) to complex inference (Contextual & Environmental, Identity & Relationship). It incorporates temporal and logical order understanding of events (Action & Activity, Story Description, Plot Inference), spatial awareness (Contextual & Environmental), entity recognition (Object Identification & Description), symbolic processing (Text & Symbols), and quantitative reasoning (Count & Quantity). This comprehensive design evaluates both low-level perceptual skills and high-level cognitive functions, enabling a holistic assessment of MLLMs' capabilities to recognize, describe, integrate information, understand context, and make nuanced inferences. OmniBench comprises 1,142 question-answer pairs; their distribution over task types, text lengths, and image and audio characteristics is summarized below. The dataset's audio content falls into three categories: speech (human vocal communication), sound events (non-speech natural, environmental, and mechanical sounds), and music (various compositions and performances).
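For readers who want to reproduce these counts, the snippet below sketches how the per-task and per-audio-type distributions could be tallied, assuming the dataset is published on the Hugging Face Hub; the dataset identifier, split name, and field names used here are assumptions and may differ from the released version.

```python
from collections import Counter

from datasets import load_dataset  # pip install datasets

# Assumed identifier, split, and field names; adjust to the released dataset.
ds = load_dataset("m-a-p/OmniBench", split="train")

task_counts = Counter(example["task type"] for example in ds)
audio_counts = Counter(example["audio type"] for example in ds)

print(f"total samples: {len(ds)}")           # expected: 1,142 question-answer pairs
print("by task type:", dict(task_counts))    # e.g. Action & Activity, Plot Inference, ...
print("by audio type:", dict(audio_counts))  # speech, sound event, music
```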
Our annotation scheme is built upon a fundamental principle: the correct answer to each question must require information from both the image and audio components. This ensures that the benchmark effectively evaluates the model's ability to analyze information across modalities. Our quality control process was two-fold, consisting of a human inspection round and an automatic inspection round assisted by an MLLM.
The Distribution of Inspection Frequency of the Passed Samples in OmniBench.
The Data Distribution of OmniBench. Each chart represents the data samples grouped by task types and distinguished by audio types.
The main focus of OmniBench is to evaluate how well omni-language models (OLMs) can understand and reconstruct the context given information from the image, audio, and text modalities. Each question is presented with four candidate options, and we use accuracy, i.e., the proportion of responses whose selected option letter matches the correct option, as the evaluation metric (n.b., a random-guess model achieves 25% accuracy under this setting).
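A minimal sketch of this scoring scheme is shown below; it assumes the model's free-form response can be reduced to a single option letter, and the `extract_option_letter` helper is an illustrative parser rather than the benchmark's official one.

```python
import re
from typing import List, Optional


def extract_option_letter(response: str) -> Optional[str]:
    """Pull the first standalone A/B/C/D letter out of a free-form response."""
    match = re.search(r"\b([ABCD])\b", response.strip().upper())
    return match.group(1) if match else None


def accuracy(predictions: List[str], gold_letters: List[str]) -> float:
    """Fraction of responses whose parsed option letter matches the gold letter."""
    correct = sum(
        1
        for pred, gold in zip(predictions, gold_letters)
        if extract_option_letter(pred) == gold
    )
    return correct / len(gold_letters)


# A random-guess baseline picks uniformly among the four options,
# so its expected accuracy is 1/4 = 25%.
print(accuracy(["The answer is B.", "C", "I think (A)"], ["B", "D", "A"]))  # ~0.667
```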
The first row indicates the input context, where "Img. & Aud." refers to the vanilla image and audio inputs, and "(T)" refers to a textual alternative of the image or audio. Detailed per-task results are reported for each of the four settings.
Leaderboard columns: Name, Size, and Date of each model, followed by Overall accuracy and per-task accuracies (Action & Activity, Story Description, Plot Inference, Object Identification & Description, Contextual & Environmental, Identity & Relationship, Text & Symbols, Count & Quantity) under each of the four settings: Img. & Aud., Img. (T) & Aud., Img. & Aud. (T), and Img. (T) & Aud. (T).
Overall results of different models on the OmniBench leaderboard.
@misc{li2024omnibench,
title={OmniBench: Towards The Future of Universal Omni-Language Models},
author={Yizhi Li and Ge Zhang and Yinghao Ma and Ruibin Yuan and Kang Zhu and Hangyu Guo and Yiming Liang and Jiaheng Liu and Jian Yang and Siwei Wu and Xingwei Qu and Jinjie Shi and Xinyue Zhang and Zhenzhu Yang and Xiangzhou Wang and Zhaoxiang Zhang and Zachary Liu and Emmanouil Benetos and Wenhao Huang and Chenghua Lin},
year={2024},
eprint={2409.15272},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.15272},
}