My research focuses on developing self-evolving large multimodal models for generalizable multimodal
intelligence, within the broader context of multimodal representation
learning for reasoning. I also work on unified large-scale models for image understanding and
generation, and (on the side) I'm interested in geometry-aware representations and neural
rendering for computer graphics.
Dec 12, 2024 – Proud to have been selected as a recipient of the Sir C. V. Raman Award by VIT Chennai for my research!
Jun 23, 2024 – I presented our paper on attention-fused deep CNNs at ICRAS 2024 in Tokyo, Japan!
Selected Publications
EvoLMM: Self-Evolving Large Multimodal Models with Continuous
Rewards
CVPR 2026 (Findings)
EvoLMM is a fully unsupervised self-evolving framework for large multimodal models (LMMs) that
improves visual reasoning from raw images only by coupling a Proposer and a Solver trained via
continuous self-consistency rewards.
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception
abilities, yet most existing training pipelines still depend on human-curated data or externally
verified reward models, limiting their autonomy and scalability. In this work, we strive to
improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or
reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that
instantiates two cooperative agents from a single backbone model: a Proposer, which generates
diverse, image-grounded questions, and a Solver, which solves them through internal consistency,
where learning proceeds through a continuous self-rewarding process. This dynamic feedback
encourages both the generation of informative queries and the refinement of structured reasoning
without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base
model, our EvoLMM yields consistent gains up to ∼3% on multimodal math-reasoning benchmarks,
including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline that eases future research on self-improving LMMs in a fully unsupervised setting.
BibTeX:
@misc{thawakar2025evolmmselfevolvinglargemultimodal,
  title={EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards},
  author={Omkar Thawakar and Shravan Venkatraman and Ritesh Thawkar and Abdelrahman Shaker and Hisham Cholakkal and Rao Muhammad Anwer and Salman Khan and Fahad Khan},
  year={2025},
  eprint={2511.16672},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.16672},
}
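To give a flavor of the continuous self-consistency reward described above, here is a minimal illustrative sketch: the Solver is sampled several times on a Proposer question, and the reward is the fraction of sampled answers agreeing with the majority. This is my own simplified illustration, not the exact reward formulation from the EvoLMM paper; the function name and example answers are hypothetical.

```python
from collections import Counter

def self_consistency_reward(answers):
    """Continuous reward in [0, 1]: fraction of sampled Solver answers
    agreeing with the majority answer. Illustrative sketch only; the
    actual EvoLMM reward is defined in the paper."""
    if not answers:
        return 0.0
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / len(answers)

# Hypothetical example: the Solver sampled 8 times on one question.
samples = ["42", "42", "42", "41", "42", "42", "7", "42"]
reward = self_consistency_reward(samples)  # 6/8 = 0.75
```

Because the reward is a fraction rather than a binary agree/disagree signal, it varies smoothly as the Solver's answer distribution sharpens, which is what makes it usable as a training signal without ground-truth labels.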
Making NeRF See Structure, Not Just Light: Enforcing
PDE-Based Surface Constraints
Under Review
Enforcing physical surface properties through PDE constraints yields geometrically accurate neural
scene representations from sparse views.
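As one concrete illustration of a PDE-based surface constraint (not necessarily the specific constraints used in this paper), the eikonal equation |∇f| = 1 is a standard regularizer that encourages an implicit field f to behave like a signed-distance function. A minimal NumPy sketch of its residual, with hypothetical function and variable names:

```python
import numpy as np

def eikonal_residual(grad_f):
    """Mean squared residual of the eikonal PDE |grad f| = 1,
    a common surface constraint for signed-distance fields.
    grad_f: (N, 3) array of field gradients at N sample points."""
    norms = np.linalg.norm(grad_f, axis=-1)
    return np.mean((norms - 1.0) ** 2)

# Gradients of a true distance field are unit vectors, so the
# residual vanishes; here we normalize random directions to unit norm.
pts = np.random.randn(100, 3)
grads = pts / np.linalg.norm(pts, axis=-1, keepdims=True)
loss = eikonal_residual(grads)  # ~0 for unit-norm gradients
```

In a training loop, a residual like this would be added to the rendering loss so that geometry is supervised by the PDE even where sparse views provide little photometric signal.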