My research focuses on developing self-evolving large multimodal models for generalizable multimodal
intelligence, within the broader context of multimodal representation
learning for reasoning. I also work on unified large-scale models for image understanding and
generation, and (on the side) I'm interested in geometry-aware representations and neural
rendering for computer graphics.
Dec 12, 2024 – Proud to have been selected as a recipient of the Sir C. V. Raman Award by VIT Chennai for my research!
Jun 23, 2024 – I presented our paper on attention-fused deep CNNs at ICRAS 2024 in Tokyo, Japan!
Selected Publications
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
arXiv
EvoLMM is a fully unsupervised self-evolving framework for large multimodal models (LMMs) that
improves visual reasoning from raw images only by coupling a Proposer and a Solver trained via
continuous self-consistency rewards.
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception
abilities, yet most existing training pipelines still depend on human-curated data or externally
verified reward models, limiting their autonomy and scalability. In this work, we strive to
improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or
reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that
instantiates two cooperative agents from a single backbone model: a Proposer, which generates
diverse, image-grounded questions, and a Solver, which solves them through internal consistency,
where learning proceeds through a continuous self-rewarding process. This dynamic feedback
encourages both the generation of informative queries and the refinement of structured reasoning
without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base
model, our EvoLMM yields consistent gains of up to ∼3% on multimodal math-reasoning benchmarks,
including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple
yet effective approach will serve as a solid baseline easing future research in self-improving
LMMs in a fully unsupervised fashion.
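As a rough illustration of the continuous self-consistency idea in the abstract above, the sketch below scores a batch of Solver answers sampled for one Proposer question. It is a minimal sketch under stated assumptions, not the reward used in the paper: the function names, the answer normalization, and the `4 * p * (1 - p)` shape of the Proposer reward are all hypothetical choices made for this example.

```python
from collections import Counter


def _normalize(answer):
    # Assumption: answers are short strings; compare them case-insensitively.
    return answer.strip().lower()


def solver_rewards(answers):
    """Give each sampled answer a reward in (0, 1]: the fraction of samples
    that agree with it. No ground-truth label is needed."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [_normalize(a) for a in answers]
    counts = Counter(normalized)
    n = len(normalized)
    return [counts[a] / n for a in normalized]


def agreement(answers):
    """Fraction of samples that match the most common answer."""
    return max(solver_rewards(answers))


def proposer_reward(answers):
    """Hypothetical Proposer signal: highest when the Solver is split
    (agreement near 0.5), zero when every sample agrees."""
    p = agreement(answers)
    return 4.0 * p * (1.0 - p)


samples = ["12", "12", "15", "12 "]
print(solver_rewards(samples))
print(agreement(samples), proposer_reward(samples))
```

Because the rewards are fractions rather than a hard 0/1 majority vote, a partially consistent batch still produces a graded signal, which is the property the word "continuous" points at here.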
BibTeX:
@misc{thawakar2025evolmmselfevolvinglargemultimodal,
title={EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards},
author={Omkar Thawakar and Shravan Venkatraman and Ritesh Thawkar and Abdelrahman Shaker and Hisham Cholakkal and Rao Muhammad Anwer and Salman Khan and Fahad Khan},
year={2025},
eprint={2511.16672},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.16672},
}
Making NeRF See Structure, Not Just Light: Enforcing PDE-Based Surface Constraints
Bachelor's Thesis
Enforcing physical surface properties through PDE constraints yields geometrically accurate neural
scene representations from sparse views.
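SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers
Vision Transformers (ViTs) have redefined image classification by leveraging self-attention to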
capture complex patterns and long-range dependencies between image patches. However, a key
challenge for ViTs is efficiently incorporating multi-scale feature representations, which are
inherent in convolutional neural networks (CNNs) through their hierarchical structure. Graph
transformers have made strides in addressing this by leveraging graph-based modeling, but they
often lose or insufficiently represent spatial hierarchies, especially since redundant or less
relevant areas dilute the image's contextual representation. To bridge this gap, we propose
SAG-ViT, a Scale-Aware Graph Attention ViT that integrates multi-scale feature capabilities of
CNNs, representational power of ViTs, and graph-attended patching to enable richer contextual
representation. Using EfficientNetV2 as a backbone, the model extracts multi-scale feature maps,
dividing them into patches to preserve richer semantic information compared to directly patching
the input images. The patches are structured into a graph using spatial and feature similarities,
where a Graph Attention Network (GAT) refines the node embeddings. This refined graph
representation is then processed by a Transformer encoder, capturing long-range dependencies and
complex interactions. We evaluate SAG-ViT on benchmark datasets across various domains, validating
its effectiveness in advancing image classification tasks. Our code and weights are available at
https://github.com/shravan-18/SAG-ViT.
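The graph-construction step described in the abstract above (patches linked by spatial and feature similarity) can be sketched in a few lines. This is only an illustration under assumptions and not the code from the linked repository: the 8-neighbour spatial rule, the cosine-similarity threshold, and the names `build_patch_graph` and `sim_threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_patch_graph(features, grid_w, sim_threshold=0.5):
    """Build an undirected adjacency list over patches.

    features: patch feature vectors in row-major order over a grid that is
    grid_w patches wide. Two patches are linked when they are spatial
    neighbours (8-neighbourhood) AND their features are similar enough.
    """
    n = len(features)
    if grid_w <= 0 or n % grid_w != 0:
        raise ValueError("features must fill a grid that is grid_w wide")
    edges = {i: [] for i in range(n)}
    for i in range(n):
        row_i, col_i = divmod(i, grid_w)
        for j in range(i + 1, n):
            row_j, col_j = divmod(j, grid_w)
            # Spatial criterion: skip patches that are not grid neighbours.
            if max(abs(row_i - row_j), abs(col_i - col_j)) > 1:
                continue
            # Feature criterion: keep the edge only for similar patches.
            if cosine(features[i], features[j]) >= sim_threshold:
                edges[i].append(j)
                edges[j].append(i)
    return edges


# Toy 2x2 grid of patch features; patch 2 is dissimilar to the rest.
toy = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.9, 0.1]]
print(build_patch_graph(toy, grid_w=2))
```

In the pipeline the abstract describes, an adjacency structure like this would be handed to a graph attention layer that refines the node embeddings; that step is omitted here.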
BibTeX:
@misc{SAGViT,
title={SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers},
author={Shravan Venkatraman and Jaskaran Singh Walia and Joe Dhanith P R},
year={2025},
eprint={2411.09420},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.09420},
}