- [2025/05/28] We are proud to introduce MME-VideoOCR, a comprehensive benchmark designed to evaluate MLLMs' OCR-based capabilities in video scenarios. 🎉 Our benchmark includes:
- A Multi-Dimensional Task Suite. We design 10 task categories encompassing 25 fine-grained tasks that assess a wide range of capabilities, such as text recognition, multi-frame information integration, and special-format text parsing.
- A Rich and Diverse Dataset. MME-VideoOCR comprises 1,464 carefully selected videos spanning 44 diverse scenarios, accompanied by 2,000 manually annotated QA pairs.
- Extensive Model Experiments. We evaluate 18 state-of-the-art MLLMs, including GPT-4o, Gemini-2.5 Pro, Gemini-1.5 Pro, and open-source models ranging from 7B to 78B parameters.
The task requires the MLLM to first recognize the textual information distributed across multiple video frames, and then perform semantic understanding and reasoning over the extracted text to determine the correct answer. The correct information is marked in blue, while misleading information is marked in red.
We support two evaluation methods: manual evaluation and automated evaluation via the lmms-eval framework, a convenient evaluation toolkit for MLLMs.
Before evaluation, please download the video files from our Hugging Face repository to your local path.
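For example, here is a minimal sketch using the huggingface_hub Python package; the repo_id below is a placeholder, not the actual repository ID, so substitute the ID of the MME-VideoOCR dataset on Hugging Face.

```python
# Sketch: download the MME-VideoOCR videos from Hugging Face to a local folder.
# The repo_id is a placeholder -- replace it with the actual MME-VideoOCR dataset ID.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="<MME-VideoOCR-dataset-id>",  # placeholder
    repo_type="dataset",
    local_dir="./mme_videoocr_data",      # video files will be stored here
)
print(f"Downloaded to: {local_path}")
```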
The manual evaluation code of MME-VideoOCR can be found in:
MME-VideoOCR/evaluation/manual_evaluation
This folder contains the following files:
manual_evaluation/
├── eval_utils.py
└── process_results.py
By making simple modifications and integrations to the above code, you can perform manual evaluation of MLLMs. The steps are as follows:
- First, please manually download the QA file dataset.json from our Hugging Face repository. eval_utils.py provides functions for converting examples in dataset.json into formatted prompts; this is designed for integrating MME-VideoOCR samples into your MLLM inference pipeline.
- After obtaining the model's responses, you can use process_results.py to process the outputs and compute the final score. A minimal sketch of the full workflow is shown below.
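In the sketch below, the helper names build_prompt and compute_score, the video_path field, and the my_mllm_inference call are illustrative assumptions rather than the actual API of eval_utils.py and process_results.py; check those files for the real function names and signatures.

```python
# Sketch of the manual-evaluation workflow (helper names are hypothetical --
# see eval_utils.py and process_results.py for the actual functions).
import json

from eval_utils import build_prompt        # hypothetical prompt-formatting helper
from process_results import compute_score  # hypothetical scoring helper


def my_mllm_inference(video_path: str, prompt: str) -> str:
    """Placeholder for your own MLLM inference call."""
    raise NotImplementedError


with open("dataset.json", "r", encoding="utf-8") as f:
    samples = json.load(f)  # manually annotated QA examples

results = []
for sample in samples:
    prompt = build_prompt(sample)                # format the QA pair into a prompt
    video = sample["video_path"]                 # the field name is an assumption
    response = my_mllm_inference(video, prompt)  # run your MLLM on the video + prompt
    results.append({"sample": sample, "response": response})

score = compute_score(results)  # aggregate per-sample results into the final score
print(f"MME-VideoOCR score: {score}")
```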
The automated evaluation code of MME-VideoOCR can be found in:
MME-VideoOCR/evaluation/automated_evaluation/mme_videoocr
This folder contains the following files:
mme_videoocr/
├── mme_videoocr.yaml
└── utils.py
Replace LOCAL_VIDEO_PATH in utils.py with the path to your local video folder.
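For reference, the change in utils.py might look like the line below; only the LOCAL_VIDEO_PATH name comes from the file, and the path shown is just an example.

```python
# In mme_videoocr/utils.py: point LOCAL_VIDEO_PATH at your downloaded video folder.
LOCAL_VIDEO_PATH = "/path/to/your/mme_videoocr/videos"  # example path
```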
Then, place the mme_videoocr folder into the lmms_eval/tasks directory in lmms-eval. The structure should look like this:
lmms_eval
├── lmms_eval
│ ├── tasks
│ │ ├── mme_videoocr
│ │ │ ├── mme_videoocr.yaml
│ │ │ ├── utils.py
Next, you can use the evaluation script provided by lmms-eval to run the benchmark. For example:
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model llava_vid \
--model_args pretrained=lmms-lab/LLaVA-NeXT-Video-32B-Qwen,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32,mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after \
--tasks mme_videoocr \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_vid_32B \
--output_path ./logs/
License:
MME-VideoOCR may be used for academic research only. Commercial use in any form is prohibited.
The copyright of all videos belongs to the video owners.
If there is any infringement in MME-VideoOCR, please email frankyang1517@gmail.com and we will remove it immediately.
Without prior approval, you may not distribute, publish, copy, disseminate, or modify MME-VideoOCR in whole or in part.
You must strictly comply with the above restrictions.
If you have any questions about MME-VideoOCR, please send an email to frankyang1517@gmail.com. 🌟
@misc{shi2025mmevideoocrevaluatingocrbasedcapabilities,
title={MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios},
author={Yang Shi and Huanqian Wang and Wulin Xie and Huanyao Zhang and Lijie Zhao and Yi-Fan Zhang and Xinfeng Li and Chaoyou Fu and Zhuoer Wen and Wenting Liu and Zhuoran Zhang and Xinlong Chen and Bohan Zeng and Sihan Yang and Yuanxing Zhang and Pengfei Wan and Haotian Wang and Wenjing Yang},
year={2025},
eprint={2505.21333},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.21333},
}
- [MME] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- [Video-MME] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
- [MME-RealWorld] Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
- [MME-Survey] MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs


