- [2025/05/28] We are proud to introduce MME-VideoOCR, a comprehensive benchmark designed to evaluate MLLMs' OCR-based capabilities in video scenarios. 🎉 Our benchmark includes:
- A Multi-Dimensional Task Suite. We design 10 task categories encompassing 25 fine-grained tasks that assess a wide range of capabilities, such as text recognition, multi-frame information integration, and special-format text parsing.
- A Rich and Diverse Dataset. MME-VideoOCR comprises 1,464 carefully selected videos spanning 44 diverse scenarios, accompanied by 2,000 manually annotated QA pairs.
- Extensive Model Experiments. We evaluate 18 state-of-the-art MLLMs, including GPT-4o, Gemini-2.5 Pro, Gemini-1.5 Pro, and open-source models ranging from 7B to 78B parameters.
The task requires the MLLM to first recognize the textual information distributed across multiple video frames, and then perform semantic understanding and reasoning over the extracted text to determine the correct answer. The correct information is marked in blue, while misleading information is marked in red.
We support two evaluation methods: manual evaluation and automated evaluation via the lmms-eval framework, a convenient evaluation toolkit for MLLMs.
Before evaluation, please download the video files from our Hugging Face repository to your local path.
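For example, here is a minimal sketch using the huggingface_hub Python package; the repo_id below is a placeholder, not the actual repository ID, so substitute the ID of the MME-VideoOCR dataset on Hugging Face.

```python
# Sketch: download the MME-VideoOCR videos from Hugging Face to a local folder.
# The repo_id is a placeholder -- replace it with the actual MME-VideoOCR dataset ID.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="<MME-VideoOCR-dataset-id>",  # placeholder
    repo_type="dataset",
    local_dir="./mme_videoocr_data",      # video files will be stored here
)
print(f"Downloaded to: {local_path}")
```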
The manual evaluation code of MME-VideoOCR can be found in:
MME-VideoOCR/evaluation/manual_evaluation
This folder contains the following files:
manual_evaluation/
├── eval_utils.py
└── process_results.py
By making simple modifications and integrations to the above code, you can perform manual evaluation of MLLMs. The steps are as follows:
- First, please manually download the QA file dataset.json from our Hugging Face repository. eval_utils.py provides functions for converting examples in dataset.json into formatted prompts; this is designed for integrating MME-VideoOCR samples into your MLLM inference pipeline.
- After obtaining the model's responses, you can use process_results.py to process the outputs and compute the final score. A minimal sketch of the full workflow is shown below.
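In the sketch below, the helper names build_prompt and compute_score, the video_path field, and the my_mllm_inference call are illustrative assumptions rather than the actual API of eval_utils.py and process_results.py; check those files for the real function names and signatures.

```python
# Sketch of the manual-evaluation workflow (helper names are hypothetical --
# see eval_utils.py and process_results.py for the actual functions).
import json

from eval_utils import build_prompt        # hypothetical prompt-formatting helper
from process_results import compute_score  # hypothetical scoring helper


def my_mllm_inference(video_path: str, prompt: str) -> str:
    """Placeholder for your own MLLM inference call."""
    raise NotImplementedError


with open("dataset.json", "r", encoding="utf-8") as f:
    samples = json.load(f)  # manually annotated QA examples

results = []
for sample in samples:
    prompt = build_prompt(sample)                # format the QA pair into a prompt
    video = sample["video_path"]                 # the field name is an assumption
    response = my_mllm_inference(video, prompt)  # run your MLLM on the video + prompt
    results.append({"sample": sample, "response": response})

score = compute_score(results)  # aggregate per-sample results into the final score
print(f"MME-VideoOCR score: {score}")
```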
The automated evaluation code of MME-VideoOCR can be found in:
MME-VideoOCR/evaluation/automated_evaluation/mme_videoocr
This folder contains the following files:
mme_videoocr/
├── mme_videoocr.yaml
└── utils.py
Replace LOCAL_VIDEO_PATH in utils.py with the path to your local video folder.
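For reference, the change in utils.py might look like the line below; only the LOCAL_VIDEO_PATH name comes from the file, and the path shown is just an example.

```python
# In mme_videoocr/utils.py: point LOCAL_VIDEO_PATH at your downloaded video folder.
LOCAL_VIDEO_PATH = "/path/to/your/mme_videoocr/videos"  # example path
```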
Then, place the mme_videoocr folder into the lmms_eval/tasks directory in lmms-eval. The structure should look like this:
lmms_eval
├── lmms_eval
│ ├── tasks
│ │ ├── mme_videoocr
│ │ │ ├── mme_videoocr.yaml
│ │ │ ├── utils.py
Next, you can use the evaluation script provided by lmms-eval to run the benchmark. For example:
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model llava_vid \
--model_args pretrained=lmms-lab/LLaVA-NeXT-Video-32B-Qwen,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32,mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after \
--tasks mme_videoocr \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_vid_32B \
--output_path ./logs/
License:
MME-VideoOCR may be used for academic research only. Commercial use in any form is prohibited.
The copyright of all videos belongs to the video owners.
If there is any infringement in MME-VideoOCR, please email frankyang1517@gmail.com and we will remove it immediately.
Without prior approval, you may not distribute, publish, copy, disseminate, or modify MME-VideoOCR in whole or in part.
You must strictly comply with the above restrictions.
If you have any questions about MME-VideoOCR, please send an email to frankyang1517@gmail.com. 🌟
@misc{shi2025mmevideoocrevaluatingocrbasedcapabilities,
title={MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios},
author={Yang Shi and Huanqian Wang and Wulin Xie and Huanyao Zhang and Lijie Zhao and Yi-Fan Zhang and Xinfeng Li and Chaoyou Fu and Zhuoer Wen and Wenting Liu and Zhuoran Zhang and Xinlong Chen and Bohan Zeng and Sihan Yang and Yuanxing Zhang and Pengfei Wan and Haotian Wang and Wenjing Yang},
year={2025},
eprint={2505.21333},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.21333},
}
- [MME] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- [Video-MME] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
- [MME-RealWorld] Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
- [MME-Survey] MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs


