My research focuses on developing self-evolving large multimodal models for generalizable multimodal intelligence, set within the broader context of multimodal representation learning for reasoning. I also work on unified large-scale models for image understanding and generation, with a focus on geometry-aware representations and neural rendering in computer graphics.
EvoLMM is a fully unsupervised self-evolving framework for large multimodal models (LMMs) that improves visual reasoning using only raw images, by coupling a Proposer and a Solver trained via continuous self-consistency rewards.
Abstract:
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion, without any annotated data or reward distillation. To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which answers them, with learning driven by a continuous self-rewarding process based on the Solver's internal consistency. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth answers or human judgments. Using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to ∼3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline that eases future research on self-improving LMMs in a fully unsupervised fashion.
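To make the self-rewarding signal concrete, below is a minimal Python sketch of one plausible form of a continuous self-consistency reward: the Solver samples k answers to a Proposer question, and the reward is the fraction of samples agreeing with the consensus answer. The proposer and solver stubs here are hypothetical placeholders, not the paper's implementation, and the exact reward shaping used in EvoLMM may differ.
Illustrative sketch (Python, hypothetical stubs):
import random
from collections import Counter

def self_consistency_reward(answers):
    """Continuous self-consistency reward: the fraction of sampled answers
    that agree with the consensus (majority) answer. A reward of 1.0 means
    all samples agree; lower values signal ambiguity.
    NOTE: an illustrative stand-in based on the abstract's description,
    not the exact reward used in EvoLMM."""
    consensus, count = Counter(answers).most_common(1)[0]
    return count / len(answers), consensus

# Hypothetical stubs standing in for the two roles; in the paper both are
# instantiated from a single backbone LMM (e.g. Qwen2.5-VL).
def proposer(image):
    # Would prompt the LMM to generate an image-grounded question.
    return "How many bars in the chart exceed the value 50?"

def solver(image, question, k=8):
    # Would sample k stochastic answers from the LMM; random stand-in here.
    return [random.choice(["3", "3", "4"]) for _ in range(k)]

image = None  # placeholder for a raw training image
question = proposer(image)
answers = solver(image, question)
reward, consensus = self_consistency_reward(answers)
print(f"Q: {question}")
print(f"consensus={consensus!r}  reward={reward:.2f}")
Because this reward is a continuous fraction rather than a binary verification, it can provide a graded learning signal even when no ground-truth answer is available.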
BibTeX:
@misc{thawakar2025evolmmselfevolvinglargemultimodal,
  title={EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards},
  author={Omkar Thawakar and Shravan Venkatraman and Ritesh Thawkar and Abdelrahman Shaker and Hisham Cholakkal and Rao Muhammad Anwer and Salman Khan and Fahad Khan},
  year={2025},
  eprint={2511.16672},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.16672},
}
Making NeRF See Structure, Not Just Light: Enforcing PDE-Based Surface Constraints