Shravan Venkatraman

News

Jun 2026 – Excited to start my research internship at Aalto University in the Probabilistic ML and Generative AI Lab, working with Dr. Arno Solin, as part of the AScI Program! 🇫🇮

Apr 28, 2026 – Honored to have been elected President of the MBZUAI Student Council for AY 2026–27!

Apr 07, 2026 – Three papers (PCM-NeRF, TIDE, NTRM [Oral]) have been accepted to CVPR'26 Workshops! 🎉

Jan 13, 2026 – I've been selected for the MBZUAI ML Winter School 2026 on Representation Learning & GenAI!

Dec 16, 2025 – Grateful to have received the Sir C. V. Raman Award from VIT Chennai for the third time (post-graduation, too) for my work on SAG-ViT.

Nov 21, 2025 – Our paper on EvoLMM, a purely self-evolving framework for LMMs, is now on arXiv!

Nov 27, 2025 – SPROUT has been accepted to Neurocomputing!

Oct 27, 2025 – RG-ViT has been accepted to Computers in Biology and Medicine!

Sep 18, 2025 – I'm honored to serve as Student Representative for the Computer Vision department at MBZUAI!

Aug 10, 2025 – I'm excited to start my MSc. in Computer Vision at MBZUAI!

Jul 27, 2025 – SAG-ViT has been accepted to Complex and Intelligent Systems!

Jul 14, 2025 – UGPL has been accepted to ICCV'25 Workshops: CVAMD! Paper and Code are available!

Apr 17, 2025 – I have successfully defended my bachelor's thesis (titled: Making NeRF See Structure, Not Just Light) at VIT Chennai!

Feb 28, 2025 – FUSION has been accepted to CVPR'25 Workshops: NTIRE!

Apr 07, 2025 – Honored to receive the Sir C. V. Raman Award from VIT Chennai for the second time in recognition of my research!

Feb 28, 2025 – We showcased and presented CerviLens at IInvenTiv'25 @IIT Madras, representing MedxAI Innovations!

Feb 06, 2025 – Our paper, "Leveraging Bi-Focal Perspectives and Granular Feature Integration for Accurate, Reliable Early Alzheimer's Detection", has been accepted to IEEE Access!

Jan 25, 2025 – I am honored to have been admitted to the MSc. Computer Vision program at MBZUAI!

Dec 12, 2024 – Proud to have been selected as a recipient of the Sir C. V. Raman Award by VIT Chennai for my research!

Jun 23, 2024 – I presented our paper on attention-fused deep CNNs at ICRAS 2024 in Tokyo, Japan!

Abstract:

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground truth or human judgments. With the popular Qwen2.5-VL as the base model, EvoLMM yields consistent gains of up to ∼3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline that eases future research on fully unsupervised, self-improving LMMs.
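To make the idea of "solving through internal consistency" more concrete, here is a minimal sketch of one way a label-free reward could be computed from several sampled answers to the same question. It is an illustration only, not the EvoLMM reward itself: the function name `consistency_rewards`, the agreement-fraction reward, and the normalized-entropy signal are all assumptions introduced here.

```python
import math
from collections import Counter


def consistency_rewards(answers):
    """Score sampled answers by how much they agree with each other.

    Hypothetical helper (not taken from the EvoLMM paper). Returns a tuple:
      - per-sample rewards: fraction of samples sharing that sample's answer
      - normalized entropy of the answer distribution, in [0, 1]
    """
    if not answers:
        raise ValueError("need at least one sampled answer")

    n = len(answers)
    counts = Counter(answers)

    # A sample earns more reward the more other samples agree with it.
    rewards = [counts[a] / n for a in answers]

    # Shannon entropy of the empirical answer distribution.
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())

    # Scale by the maximum possible entropy (all n answers different).
    max_entropy = math.log(n) if n > 1 else 1.0
    return rewards, entropy / max_entropy


# Four sampled answers to one generated question (made-up values).
rewards, uncertainty = consistency_rewards(["12", "12", "15", "12"])
print(rewards)      # agreement fraction for each sample
print(uncertainty)  # 0 means full agreement, 1 means no two samples agree
```

In a sketch like this, the agreement fraction could act as a continuous signal for the Solver, while the entropy gives a rough indication of how hard or ambiguous a generated question is. How such signals would be combined to train the Proposer is not specified in the abstract and is left out here.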

Abstract:

Vision Transformers (ViTs) have redefined image classification by leveraging self-attention to capture complex patterns and long-range dependencies between image patches. However, a key challenge for ViTs is efficiently incorporating multi-scale feature representations, which are inherent in convolutional neural networks (CNNs) through their hierarchical structure. Graph transformers have made strides in addressing this by leveraging graph-based modeling, but they often lose or insufficiently represent spatial hierarchies, especially since redundant or less relevant areas dilute the image's contextual representation. To bridge this gap, we propose SAG-ViT, a Scale-Aware Graph Attention ViT that integrates the multi-scale feature capabilities of CNNs, the representational power of ViTs, and graph-attended patching to enable richer contextual representation. Using EfficientNetV2 as a backbone, the model extracts multi-scale feature maps and divides them into patches, which preserves richer semantic information than patching the input images directly. The patches are structured into a graph using spatial and feature similarities, where a Graph Attention Network (GAT) refines the node embeddings. This refined graph representation is then processed by a Transformer encoder, capturing long-range dependencies and complex interactions. We evaluate SAG-ViT on benchmark datasets across various domains, validating its effectiveness in advancing image classification tasks. Our code and weights are available at https://github.com/shravan-18/SAG-ViT.
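The graph-construction step described above (patches connected using spatial and feature similarities) can be sketched as follows. This is an illustrative toy in plain Python, not the SAG-ViT implementation: the 8-neighbour grid adjacency, the cosine similarity measure, the `min_sim` threshold, and the function names are assumptions made for this example. The linked repository is the reference for the actual method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_patch_graph(features, rows, cols, min_sim=0.5):
    """Connect grid-adjacent patches whose features are similar enough.

    Hypothetical helper. `features` holds rows * cols patch vectors in
    row-major order. Returns undirected edges as (i, j) pairs with i < j.
    """
    if len(features) != rows * cols:
        raise ValueError("expected rows * cols feature vectors")

    edges = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            # Visit only "forward" neighbours so each pair is checked once.
            for dr, dc in ((0, 1), (1, -1), (1, 0), (1, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    j = nr * cols + nc
                    if cosine(features[i], features[j]) >= min_sim:
                        edges.append((i, j))
    return edges


# A 2x2 grid of toy patch features: the top row and bottom row differ.
toy_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(build_patch_graph(toy_features, rows=2, cols=2))
```

An edge list of this kind is the usual input format for graph attention layers, which would then refine the node (patch) embeddings before they are passed to a Transformer encoder.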