Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark
Tags: large language models, Greek QA, DemosQA dataset, monolingual models, multilingual models, prompt engineering, under‑resourced languages, social media data, evaluation framework
📌 Key Takeaways
Introduction of DemosQA, a Greek QA dataset derived from social media questions and community‑reviewed answers
Development of a memory‑efficient evaluation framework adaptable to various QA datasets and languages
Extensive benchmarking of 11 monolingual and multilingual LLMs across 6 human‑curated Greek QA datasets
Assessment of three prompting strategies for evaluating LLM performance
Public release of code and data to facilitate reproducibility and further research
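The memory‑efficient evaluation framework itself is not detailed in this summary, but its core loop — scoring every (model, dataset, prompting strategy) combination while holding only one model in memory at a time — can be sketched as below. The loader callback, dataset shape, and SQuAD‑style scoring metrics are illustrative assumptions, not the authors' implementation.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()

def exact_match(pred: str, gold: str) -> float:
    """1.0 if prediction and reference match after normalization, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between prediction and reference answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def evaluate(model_ids, datasets, strategies, load_model):
    """Score each (model, dataset, strategy) combination.

    Models are loaded one at a time and released before the next,
    so peak memory stays near the footprint of a single model --
    the key idea behind a memory-efficient benchmarking loop.
    `load_model` is a hypothetical callback returning a callable
    that maps a prompt string to a generated answer string.
    """
    results = {}
    for model_id in model_ids:
        model = load_model(model_id)
        for ds_name, examples in datasets.items():
            for strategy in strategies:
                scores = [
                    token_f1(model(strategy.format(question=ex["question"])),
                             ex["answer"])
                    for ex in examples
                ]
                results[(model_id, ds_name, strategy)] = sum(scores) / len(scores)
        del model  # free the model before loading the next one
    return results
```

A toy run with a stub model that always answers "Αθήνα" scores 1.0 on a single-example dataset whose reference answer is "Αθήνα", confirming the plumbing works end to end.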
📖 Full Retelling
Who: The study was conducted by researchers Charalampos Mastrokostas, Nikolaos Giarelis, and Nikos Karacapilidis.
What: It evaluates the performance of 11 monolingual and multilingual large language models (LLMs) on Greek question‑answering tasks, introduces the DemosQA benchmark dataset, and presents a memory‑efficient LLM evaluation framework.
Where: The research focuses on Greek, leveraging social‑media user questions and community‑reviewed answers gathered via Greek online platforms.
When: The work was submitted as an arXiv preprint on 18 February 2026.
Why: The authors address the gap in LLM research for under‑resourced languages by comparing monolingual and multilingual models on culturally relevant Greek QA tasks and providing reproducible tools and data for future study.
🏷️ Themes
Natural Language Processing, Large Language Model Evaluation, Under‑resourced Language Research, Greek Question Answering, Dataset Creation
Deep Analysis
Why It Matters
The study introduces a Greek QA benchmark that reflects local culture, addressing the data bias in multilingual LLMs. It enables fair evaluation of monolingual versus multilingual models for an under-resourced language.
Context & Background
Under-resourced languages are often underrepresented in LLM training data
Existing QA benchmarks focus on high-resource languages like English
The DemosQA dataset is built from social media questions and community answers
What Happens Next
The released code and data will allow researchers to benchmark new Greek models and improve performance. Future work may extend the framework to other under-resourced languages and refine prompting strategies.
Frequently Asked Questions
What is DemosQA?
A Greek question answering dataset derived from social media user questions and community-reviewed answers.
How many models were evaluated?
Eleven monolingual and multilingual LLMs were tested across six Greek QA datasets.
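The summary does not name the three prompting strategies compared in the paper. A common trio in QA benchmarking is direct zero‑shot prompting, few‑shot prompting with worked examples, and context‑grounded instruction prompting; the prompt builders below sketch that assumed setup, and their wording is purely illustrative.

```python
# Illustrative prompt builders for three common QA prompting strategies.
# Both the strategy names and the template wording are assumptions;
# the source only states that three strategies were evaluated.

def zero_shot(question: str) -> str:
    """Ask the question directly, with no demonstrations."""
    return f"Answer the following question.\nQuestion: {question}\nAnswer:"

def few_shot(question: str, examples: list[tuple[str, str]]) -> str:
    """Prepend worked question-answer pairs before the target question."""
    demos = "\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{demos}\nQuestion: {question}\nAnswer:"

def context_prompt(question: str, context: str) -> str:
    """Ground the answer in supplied context (e.g. a reviewed answer thread)."""
    return (
        "Using only the context below, answer the question.\n"
        f"Context: {context}\nQuestion: {question}\nAnswer:"
    )
```

Swapping templates while holding the model and dataset fixed is what lets a benchmark attribute score differences to the prompting strategy alone.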
Original Source
Computer Science > Computation and Language
arXiv:2602.16811 [Submitted on 18 Feb 2026]
Title: Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark
Authors: Charalampos Mastrokostas, Nikolaos Giarelis, Nikos Karacapilidis
Abstract: Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state of the art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models. However, these models demonstrate a training data bias towards a small number of popular languages or rely on transfer learning from high- to under-resourced languages; this may lead to a misrepresentation of social, cultural, and historical aspects. To address this challenge, monolingual LLMs have been developed for under-resourced languages; however, their effectiveness remains less studied when compared to multilingual counterparts on language-specific tasks. In this study, we address this research gap in Greek QA by contributing: i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist; ii) a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; iii) an extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies. We release our code and data to facilitate reproducibility.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (c...