How Can AI Look at an Image and Answer Questions Like Humans Do?
Artificial intelligence researchers are making new progress in enabling machines to understand images and answer questions with human-like reasoning through advances in the field of Visual Question Answering (VQA). While humans can naturally identify objects, interpret actions, and recognize relationships within an image, VQA remains highly complex for AI systems, which must jointly process visual and textual information and often struggle with multi-step reasoning and understanding complex relationships.
A recent study titled ‘MRAN-VQA: Multimodal Recursive Attention Network for Visual Question Answering’ proposes a novel framework designed to overcome these challenges. The multimodal model allows AI systems to interpret images more deeply and respond to questions with higher accuracy by improving object recognition, counting ability, and relational reasoning. This is achieved through recursive attention mechanisms and hierarchical feature fusion across both visual and textual modalities, enabling more effective reasoning than traditional approaches.
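To give a rough sense of what recursive attention over two modalities can look like in practice, the sketch below shows a minimal, hypothetical PyTorch module in which question features repeatedly attend over image-region features and are refined on each pass. The class name, dimensions, number of recursive steps, and the concatenate-then-project fusion are illustrative assumptions, not the architecture described in the MRAN-VQA paper.

```python
# Hypothetical sketch of recursive cross-modal attention (assumed design,
# not the authors' implementation): the question representation attends
# over image regions, is fused with the attended features, and the result
# is fed back in for the next recursive pass.
import torch
import torch.nn as nn

class RecursiveCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8, steps=3):
        super().__init__()
        self.steps = steps                                    # number of recursive passes (assumed)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)                   # simple fusion of question + attended image features

    def forward(self, question_feats, image_feats):
        # question_feats: (batch, q_len, dim); image_feats: (batch, regions, dim)
        q = question_feats
        for _ in range(self.steps):
            attended, _ = self.attn(q, image_feats, image_feats)   # attend over image regions
            q = self.fuse(torch.cat([q, attended], dim=-1))        # fuse and recurse on the refined question
        return q                                                   # refined multimodal representation
```

In a full VQA pipeline, a representation like this would typically be pooled and passed to an answer classifier; the actual MRAN-VQA fusion and reasoning components are detailed in the paper itself.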
The research was led by Abu Tareq Rony, a former 13th batch student in the Department of Statistics at Noakhali Science and Technology University (NSTU), who is currently pursuing a PhD in Computer Science at Iowa State University, USA, working with an international team of collaborators.
Speaking about the research significance, Abu Tareq Rony said the work aims to make AI systems more capable of understanding the real world rather than simply detecting objects. ‘Our goal was to design a model that can think step-by-step while looking at an image, much like humans do. We hope this research will contribute to building AI systems that are more reliable, interpretable, and useful across different languages and environments,’ he said.
The paper has been published in an Elsevier journal recognized as a Q1 open-access venue with a CiteScore of 11.3 and an Impact Factor of 5.4, reflecting the academic quality and global reach of the research. Experimental results reported in the study show that the proposed MRAN-VQA model outperforms existing state-of-the-art methods on widely used VQA benchmarks, particularly excelling in multi-step reasoning and multilingual settings.
According to his ResearchGate profile, Abu Tareq Rony has accumulated over 470 citations, holds an h-index of 12, and maintains a strong research interest score, underscoring his sustained research contribution in artificial intelligence, machine learning, and computer vision.