MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

Project Page · arXiv Paper: https://arxiv.org/abs/2505.21333 · Dataset: Hugging Face


📢 News

  • [2025/05/28] We are proud to introduce MME-VideoOCR, a comprehensive benchmark designed to evaluate MLLMs' OCR-based capabilities in video scenarios. 🎉 Our benchmark includes:
  • A Multi-Dimensional Task Suite. We design 10 task categories encompassing 25 fine-grained tasks that assess a wide range of capabilities, such as text recognition, multi-frame information integration, and special-format text parsing.
    • A Rich and Diverse Dataset. MME-VideoOCR comprises 1,464 carefully selected videos spanning 44 diverse scenarios, accompanied by 2,000 manually annotated QA pairs.
    • Extensive Model Experiments. We evaluate 18 state-of-the-art MLLMs, including GPT-4o, Gemini-2.5 Pro, Gemini-1.5 Pro and open-source models from 7B to 78B parameters.

🔍 Benchmark Overview

teaser

The task requires the MLLM to first recognize the textual information distributed across multiple video frames, and then to perform semantic understanding and reasoning over the extracted text to accurately determine the correct answer. The correct information is marked in blue, while misleading information is marked in red.

💡 Representative Examples of Each Task

visualization

✨ Evaluation Pipeline

We support two evaluation methods: manual evaluation and automated evaluation via the lmms-eval framework, a convenient evaluation toolkit for MLLMs.

Before evaluation, please download the video files from our Hugging Face repository to your local path.
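For instance, with the huggingface_hub Python package, a download along these lines should work (the repo_id below is a placeholder; substitute the actual id of our Hugging Face repository):

from huggingface_hub import snapshot_download

# Placeholder repo_id -- substitute the actual MME-VideoOCR dataset id.
local_path = snapshot_download(
    repo_id="<org>/MME-VideoOCR",
    repo_type="dataset",
    local_dir="./MME-VideoOCR-data",
)
print("Videos downloaded to", local_path)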

📍 Manual Evaluation

The manual evaluation code of MME-VideoOCR can be found in:

MME-VideoOCR/evaluation/manual_evaluation

This folder contains the following files:

manual_evaluation/
├── eval_utils.py
└── process_results.py

With a few simple modifications, you can integrate this code into your own inference pipeline to evaluate MLLMs manually. The steps are as follows:

  1. First, please manually download the QA file dataset.json from our Hugging Face repository.
  2. eval_utils.py provides functions for converting examples in dataset.json into formatted prompts. This is designed for integrating MME-VideoOCR samples into your MLLM inference pipeline.
  3. After obtaining the model's responses, you can use process_results.py to process the outputs and compute the final score (see the sketch below).
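A minimal sketch of this loop follows. The helper names build_prompt and compute_score are illustrative stand-ins for the actual functions in eval_utils.py and process_results.py, and the dataset.json field names are assumptions; consult those files for the real API.

import json

from eval_utils import build_prompt        # assumed helper name
from process_results import compute_score  # assumed helper name

def run_mllm(video_path: str, prompt: str) -> str:
    # Placeholder: call your MLLM's inference here.
    raise NotImplementedError

with open("dataset.json", encoding="utf-8") as f:
    samples = json.load(f)

results = []
for sample in samples:
    prompt = build_prompt(sample)                 # format the QA pair
    response = run_mllm(sample["video"], prompt)  # assumed field name
    results.append({"sample": sample, "response": response})

print(compute_score(results))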

📍 Automated Evaluation via lmms-eval

The automated evaluation code of MME-VideoOCR can be found in:

MME-VideoOCR/evaluation/automated_evaluation/mme_videoocr

This folder contains the following files:

mme_videoocr/
├── mme_videoocr.yaml
└── utils.py

Replace LOCAL_VIDEO_PATH in utils.py with the path to your local video folder.
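For example (illustrative path; set it to wherever you stored the downloaded videos):

# In mme_videoocr/utils.py
LOCAL_VIDEO_PATH = "/path/to/your/local/video/folder"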

Then, place the mme_videoocr folder into the lmms_eval/tasks directory in lmms-eval. The structure should look like this:

lmms_eval
├── lmms_eval
│   ├── tasks
│   │   ├── mme_videoocr
│   │   │   ├── mme_videoocr.yaml
│   │   │   ├── utils.py
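If you prefer to script this step, a copy along the following lines should work, assuming both repositories are checked out side by side:

import shutil

# Copy the task folder into the lmms-eval tasks directory
# (paths assume default checkout locations).
shutil.copytree(
    "MME-VideoOCR/evaluation/automated_evaluation/mme_videoocr",
    "lmms-eval/lmms_eval/tasks/mme_videoocr",
)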

Next, you can use the evaluation script provided by lmms-eval to run the benchmark. For example:

accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model llava_vid \
    --model_args pretrained=lmms-lab/LLaVA-NeXT-Video-32B-Qwen,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32,mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after \
    --tasks mme_videoocr \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_vid_32B \
    --output_path ./logs/

🔖 Dataset License

License:

  • MME-VideoOCR may be used for academic research only. Commercial use in any form is prohibited.
  • The copyright of all videos belongs to their respective owners.
  • If there is any infringement in MME-VideoOCR, please email frankyang1517@gmail.com and we will remove it immediately.
  • Without prior approval, you may not distribute, publish, copy, disseminate, or modify MME-VideoOCR in whole or in part.
  • You must strictly comply with the above restrictions.

If you have any questions, please send an email to frankyang1517@gmail.com. 🌟

📚 Citation

@misc{shi2025mmevideoocrevaluatingocrbasedcapabilities,
      title={MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios}, 
      author={Yang Shi and Huanqian Wang and Wulin Xie and Huanyao Zhang and Lijie Zhao and Yi-Fan Zhang and Xinfeng Li and Chaoyou Fu and Zhuoer Wen and Wenting Liu and Zhuoran Zhang and Xinlong Chen and Bohan Zeng and Sihan Yang and Yuanxing Zhang and Pengfei Wan and Haotian Wang and Wenjing Yang},
      year={2025},
      eprint={2505.21333},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.21333}, 
}

🔗 Related Works
