arxiv:2403.12895

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Published on Mar 19
Submitted by akhaliq on Mar 20
#1 Paper of the day
Abstract

Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features by merging horizontal adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set DocStruct4M to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset DocReason25K to trigger the detailed explanation ability in the document domain. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points in 5/10 benchmarks. Our codes, models, and datasets are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5.
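The abstract's key architectural idea, H-Reducer, is a convolutional vision-to-text module that merges horizontally adjacent visual patches so the LLM sees a shorter token sequence without losing the page layout. Below is a minimal PyTorch sketch of that idea; the (1, 4) kernel and stride, the feature dimensions, and the class and argument names are illustrative assumptions rather than the paper's exact configuration (see the official repo for the real implementation).

```python
import torch
import torch.nn as nn

class HReducer(nn.Module):
    """Sketch of an H-Reducer-style vision-to-text module: a convolution
    fuses horizontally adjacent patch features, shortening the sequence
    while preserving the 2-D layout. Hyperparameters are assumptions."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096, merge: int = 4):
        super().__init__()
        # 1 x merge convolution: each output feature fuses `merge`
        # horizontally adjacent patches (stride == kernel, so no overlap).
        self.conv = nn.Conv2d(vis_dim, vis_dim,
                              kernel_size=(1, merge), stride=(1, merge))
        self.proj = nn.Linear(vis_dim, llm_dim)  # align with the LLM embedding size

    def forward(self, feats: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # feats: (batch, h * w, vis_dim) patch features from the vision encoder
        b, _, c = feats.shape
        x = feats.transpose(1, 2).reshape(b, c, h, w)  # restore the 2-D patch grid
        x = self.conv(x)                               # (b, c, h, w // merge)
        x = x.flatten(2).transpose(1, 2)               # back to a token sequence
        return self.proj(x)                            # (b, h * w // merge, llm_dim)

# Usage: a 32x32 grid of patches is reduced 4x along the horizontal axis.
reducer = HReducer()
patches = torch.randn(1, 32 * 32, 1024)
tokens = reducer(patches, h=32, w=32)  # -> (1, 256, 4096)
```

Merging only along the horizontal axis, rather than pooling in both directions, matches how text flows in documents: fusing left-to-right neighbors cuts the sequence length roughly 4x while keeping each text row's layout readable to the LLM.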

Community

Great work! Would love to see this evaluated on DUDE, a really challenging document VQA dataset. Happy to help ;)

Paper author

Thanks for your kind words and suggestion. We will consider multi-page document understanding in our next version, and DUDE would be a suitable dataset.


Hi,

This looks interesting.

Is it an open-source project or proprietary? Will you share the model?

Best,
Yannick

Paper author

mPLUG-DocOwl 1.5 is an open-source project. The data, models, and code are scheduled to be shared at https://github.com/X-PLUG/mPLUG-DocOwl next week.

Thanks, Anwen Hu, for the prompt reply.

Amazing, I can't wait to test it and will definitely give you feedback.

Have a great day.

Cheers,
Yannick

Mastering Text-Rich Images: Discover mPLUG-DocOwl 1.5's OCR-Free Revolution!

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix
