
Electronic data

  • 2025.indonlp-1.1

    Final published version, 410 KB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License


Hindi Reading Comprehension: Do Large Language Models Exhibit Semantic Understanding?

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

Hindi Reading Comprehension: Do Large Language Models Exhibit Semantic Understanding? / Lal, Daisy Monika; Rayson, Paul; El-Haj, Mo.
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages. ed. / Ruvan Weerasinghe; Isuri Anuradha; Deshan Sumanathilaka. Abu Dhabi: Association for Computational Linguistics, 2025. p. 1-10.


Harvard

Lal, DM, Rayson, P & El-Haj, M 2025, Hindi Reading Comprehension: Do Large Language Models Exhibit Semantic Understanding? in R Weerasinghe, I Anuradha & D Sumanathilaka (eds), Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages. Association for Computational Linguistics, Abu Dhabi, pp. 1-10. <https://aclanthology.org/2025.indonlp-1.1/>

APA

Lal, D. M., Rayson, P., & El-Haj, M. (2025). Hindi Reading Comprehension: Do Large Language Models Exhibit Semantic Understanding? In R. Weerasinghe, I. Anuradha, & D. Sumanathilaka (Eds.), Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages (pp. 1-10). Association for Computational Linguistics. https://aclanthology.org/2025.indonlp-1.1/

Vancouver

Lal DM, Rayson P, El-Haj M. Hindi Reading Comprehension: Do Large Language Models Exhibit Semantic Understanding? In Weerasinghe R, Anuradha I, Sumanathilaka D, editors, Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages. Abu Dhabi: Association for Computational Linguistics. 2025. p. 1-10

Author

Lal, Daisy Monika ; Rayson, Paul ; El-Haj, Mo. / Hindi Reading Comprehension : Do Large Language Models Exhibit Semantic Understanding?. Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages. editor / Ruvan Weerasinghe ; Isuri Anuradha ; Deshan Sumanathilaka. Abu Dhabi : Association for Computational Linguistics, 2025. pp. 1-10

Bibtex

@inproceedings{710159cdb24642cba04e7f26c5bc2729,
title = "Hindi Reading Comprehension: Do Large Language Models Exhibit Semantic Understanding?",
abstract = "In this study, we explore the performance of four advanced Generative AI models (GPT-3.5, GPT-4, Llama3, and HindiGPT) on the Hindi reading comprehension task. Using a zero-shot, instruction-based prompting strategy, we assess model responses through a comprehensive triple evaluation framework using the HindiRC dataset. Our framework combines (1) automatic evaluation using ROUGE, BLEU, BLEURT, METEOR, and Cosine Similarity; (2) rating-based assessments focussing on correctness, comprehension depth, and informativeness; and (3) preference-based selection to identify the best responses. Human ratings indicate that GPT-4 outperforms the other LLMs on all parameters, followed by HindiGPT, GPT-3.5, and then Llama3. Preference-based evaluation similarly placed GPT-4 (80%) as the best model, followed by HindiGPT (74%). However, automatic evaluation showed GPT-4 to be the lowest performer on n-gram metrics, yet the best performer on semantic metrics, suggesting it captures deeper meaning and semantic alignment over direct lexical overlap, which aligns with its strong human evaluation scores. This study also highlights that even though the models mostly address literal factual recall questions with high precision, they still struggle at times with specificity and interpretive bias.",
author = "Lal, {Daisy Monika} and Paul Rayson and Mo El-Haj",
year = "2025",
month = jan,
day = "20",
language = "English",
pages = "1--10",
editor = "Ruvan Weerasinghe and Isuri Anuradha and Deshan Sumanathilaka",
booktitle = "Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages",
publisher = "Association for Computational Linguistics",

}
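The abstract lists the automatic metrics used in the study (ROUGE, BLEU, BLEURT, METEOR, and Cosine Similarity). As a minimal illustration of the simplest of these, the sketch below computes a bag-of-words cosine similarity between a model response and a reference answer. This token-count variant is only a self-contained stand-in; the paper's evaluation may well use embedding-based similarity, and the example strings are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity over whitespace-token count vectors (bag of words)."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    # Dot product over tokens the two texts share.
    dot = sum(a[tok] * b[tok] for tok in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical reference answer and model response in Hindi.
reference = "राम ने किताब पढ़ी"      # "Ram read the book"
response = "राम ने एक किताब पढ़ी"   # "Ram read a book"
print(round(cosine_similarity(reference, response), 3))  # → 0.894
```

A lexical score like this penalizes any wording change, which is consistent with the abstract's observation that GPT-4 scores lowest on n-gram overlap metrics while scoring highest on semantic ones.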

RIS

TY - GEN

T1 - Hindi Reading Comprehension

T2 - Do Large Language Models Exhibit Semantic Understanding?

AU - Lal, Daisy Monika

AU - Rayson, Paul

AU - El-Haj, Mo

PY - 2025/1/20

Y1 - 2025/1/20

N2 - In this study, we explore the performance of four advanced Generative AI models (GPT-3.5, GPT-4, Llama3, and HindiGPT) on the Hindi reading comprehension task. Using a zero-shot, instruction-based prompting strategy, we assess model responses through a comprehensive triple evaluation framework using the HindiRC dataset. Our framework combines (1) automatic evaluation using ROUGE, BLEU, BLEURT, METEOR, and Cosine Similarity; (2) rating-based assessments focussing on correctness, comprehension depth, and informativeness; and (3) preference-based selection to identify the best responses. Human ratings indicate that GPT-4 outperforms the other LLMs on all parameters, followed by HindiGPT, GPT-3.5, and then Llama3. Preference-based evaluation similarly placed GPT-4 (80%) as the best model, followed by HindiGPT (74%). However, automatic evaluation showed GPT-4 to be the lowest performer on n-gram metrics, yet the best performer on semantic metrics, suggesting it captures deeper meaning and semantic alignment over direct lexical overlap, which aligns with its strong human evaluation scores. This study also highlights that even though the models mostly address literal factual recall questions with high precision, they still struggle at times with specificity and interpretive bias.

AB - In this study, we explore the performance of four advanced Generative AI models (GPT-3.5, GPT-4, Llama3, and HindiGPT) on the Hindi reading comprehension task. Using a zero-shot, instruction-based prompting strategy, we assess model responses through a comprehensive triple evaluation framework using the HindiRC dataset. Our framework combines (1) automatic evaluation using ROUGE, BLEU, BLEURT, METEOR, and Cosine Similarity; (2) rating-based assessments focussing on correctness, comprehension depth, and informativeness; and (3) preference-based selection to identify the best responses. Human ratings indicate that GPT-4 outperforms the other LLMs on all parameters, followed by HindiGPT, GPT-3.5, and then Llama3. Preference-based evaluation similarly placed GPT-4 (80%) as the best model, followed by HindiGPT (74%). However, automatic evaluation showed GPT-4 to be the lowest performer on n-gram metrics, yet the best performer on semantic metrics, suggesting it captures deeper meaning and semantic alignment over direct lexical overlap, which aligns with its strong human evaluation scores. This study also highlights that even though the models mostly address literal factual recall questions with high precision, they still struggle at times with specificity and interpretive bias.

M3 - Conference contribution/Paper

SP - 1

EP - 10

BT - Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages

A2 - Weerasinghe, Ruvan

A2 - Anuradha, Isuri

A2 - Sumanathilaka, Deshan

PB - Association for Computational Linguistics

CY - Abu Dhabi

ER -