Knowledge-Based Visual Question Answering System Using Multimodal Deep Learning

Noorbhasha Junnubabu; K. Geethanjali; B. Bhuvaneswari; J. Gnaneswari; D. Bhanuprakash

doi:10.1051/matecconf/202641901019

Open Access

Issue: MATEC Web Conf. Volume 419, 2026, International Conference on Mechanical and Materials Engineering (ICMME 2025)
Article Number: 01019
Number of page(s): 10
DOI: https://doi.org/10.1051/matecconf/202641901019
Published online: 18 March 2026

MATEC Web of Conferences 419, 01019 (2026)

Noorbhasha Junnubabu, K. Geethanjali, B. Bhuvaneswari, J. Gnaneswari and D. Bhanuprakash^*

Department of CST, Madanapalle Institute of Technology & Science, Andhra Pradesh, India

^* Corresponding author

Abstract

Knowledge-driven Visual Question Answering (VQA) requires combining external information with an image’s visual content to produce accurate, contextually appropriate answers. Although Large Language Models (LLMs) show considerable promise in this area, their lack of structured reasoning and limited access to specialized information restrict their effectiveness, especially in domains such as medical diagnostics and patient care. In this study, we introduce a versatile, robust, and domain-independent framework that improves LLM-powered VQA systems by incorporating structured reasoning and external knowledge. Our system uses ResNet50 for image feature extraction and FLAN-T5 for language-driven question answering, and couples them with a reasoning module to improve accuracy. ResNet50 was chosen for its dependable efficiency and low computational demands, and FLAN-T5 for its robust reasoning at lower complexity than larger models; together they offer an effective balance of performance and computational efficiency compared with more complex models such as ViT or GPT-4. In contrast to conventional end-to-end fine-tuning methods, our framework integrates with both open-source and commercial LLMs, lowering computational cost while preserving high accuracy in zero-shot and few-shot settings. By using multi-query ensemble techniques, context-sensitive feature selection, and retrieval of external domain knowledge, the system greatly improves explainability and reliability, making it especially suitable for medical VQA applications. The combination of ResNet50 for image understanding, FLAN-T5 for reasoning, and direction-guided prompts that integrate structured knowledge provides a scalable and effective solution for real-time, knowledge-driven VQA. On the OK-VQA dataset, the proposed approach improves accuracy by 7% and reduces response time by 30% compared with baseline techniques.
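
The abstract names a multi-query ensemble over prompts that combine visual cues, retrieved knowledge, and the question, but it does not give the details. The following is a minimal sketch of one plausible reading, assuming majority voting over normalized answers. The function names, the prompt template, and `stub_model` (a canned stand-in for a FLAN-T5 call) are hypothetical, and the visual labels stand in for whatever the ResNet50 stage yields; none of this is taken from the paper.

```python
from collections import Counter
from typing import Callable, Sequence


def build_prompt(question: str, visual_labels: Sequence[str], facts: Sequence[str]) -> str:
    """Combine visual cues, retrieved facts, and the question into one text prompt.

    The template is an illustrative assumption, not the paper's prompt.
    """
    objects = ", ".join(visual_labels) if visual_labels else "unknown"
    context = "; ".join(facts) if facts else "none"
    return (
        f"Image contains: {objects}.\n"
        f"Background knowledge: {context}.\n"
        f"Question: {question}\n"
        "Answer briefly:"
    )


def normalize(answer: str) -> str:
    """Lowercase, trim spaces and periods, and collapse inner whitespace."""
    return " ".join(answer.lower().strip(" .").split())


def ensemble_answer(prompts: Sequence[str], answer_fn: Callable[[str], str]) -> str:
    """Query the model once per prompt and return the most frequent answer."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    votes = Counter(normalize(answer_fn(p)) for p in prompts)
    return votes.most_common(1)[0][0]


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a language model call; returns canned answers."""
    return "Potassium." if "mineral" in prompt else "fiber"


# Three paraphrases of the same question form the multi-query set.
paraphrases = [
    "Which mineral is this fruit known for?",
    "What mineral does this fruit provide?",
    "What is this fruit a good source of?",
]
prompts = [
    build_prompt(q, ["banana", "bowl"], ["bananas are rich in potassium"])
    for q in paraphrases
]
final_answer = ensemble_answer(prompts, stub_model)
print(final_answer)  # potassium
```

In a real pipeline, `stub_model` would be replaced by a call to the chosen LLM. Because the ensemble only needs a callable that maps a prompt to text, the same voting code works with open-source or commercial models without fine-tuning.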

Key words: Visual Question Answering / Large Language Models (LLMs) / Convolutional Neural Network / Image Classification / Feature Extraction

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
