A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

Feng, Yukang; Sun, Jianwen; Li, Chuanhao; Li, Zizhen; Ai, Jiaxin; Zhang, Fanrui; Chang, Yifan; Zhou, Sizhuo; Zhang, Shenglin; Dai, Yu; Zhang, Kaipeng

arXiv:2506.09427 (cs)

[Submitted on 11 Jun 2025 (v1), last revised 2 Mar 2026 (this version, v2)]

Abstract: Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a dataset that features: (1) large scale, comprising 1.8M multimodal samples; (2) high quality, supported by our proposed Self-Evaluation with Iterative Refinement (SEIR) method for rigorous automated quality refinement; (3) rich instructional diversity, ensured through diverse, well-designed question templates that are based on human preferences and cover a 3500-topic hierarchy. These characteristics make InterSyn particularly well-suited for training LMMs for interactive image-text generation. To evaluate these capabilities, we propose SynJudge, a reliable automatic evaluator that aligns closely with human judgment and outputs four interpretable scores: Text Content Completeness (TCC), Image Content Completeness (ICC), Image Quality (IQ), and Image-Text Synergy (ITS). These scores are complementary, covering content, quality, and cross-modal interaction, and together they form a comprehensive evaluation framework. Experimental results on InterSyn subsets of up to 200K samples show that 25K-50K samples already yield substantial improvements, while scaling to 100K/200K brings further gains in TCC, ICC, and especially ITS. These results highlight InterSyn's (1) scalability, as performance consistently improves with more data, and (2) efficiency, as significant gains are achievable even with smaller subsets, making it accessible to researchers with varying computational resources.

Comments:	Accepted at ICLR 2026
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2506.09427 [cs.CV]
	(or arXiv:2506.09427v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2506.09427

Submission history

From: Yukang Feng
[v1] Wed, 11 Jun 2025 06:21:20 UTC (3,285 KB)
[v2] Mon, 2 Mar 2026 08:02:37 UTC (6,457 KB)
