{
    "status": "completed",
    "meta": {
        "engine": "Axus Search API v16.0",
        "query": "Retrieval-Augmented Generation in clinical trials",
        "total_sources": 18,
        "sources_debug": {
            "wikipedia": "OK (1)",
            "crossref": "OK (19)"
        },
        "total_tokens_yield": 10743,
        "data_quality": "enterprise_clean_verified",
        "format_compatibility": [
            "RAG_Ready",
            "LLM_FineTuning_JSONL"
        ],
        "completed_at": "2026-05-23T07:10:14.771Z",
        "data_quality_report": {
            "relevance_score": 0.9,
            "average_trust": 0.98
        }
    },
    "data": [
        {
            "id": "doi_10.59350%2Fr9dj1-zkx52",
            "title": "Efficient Information Retrieval and Response Generation with Retrieval-Augmented Generation (RAG)",
            "url": "https://doi.org/10.59350/r9dj1-zkx52",
            "source": "Crossref Academic Index",
            "license": "https://creativecommons.org/licenses/by/4.0/legalcode",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 113,
                "chunks_count": 1,
                "language": "en",
                "char_length": 449,
                "is_jsonl_ready": true,
                "relevance_match_score": 1,
                "doi": "10.59350/r9dj1-zkx52",
                "publisher": "Front Matter",
                "publication_year": 2024,
                "citation_count": 1,
                "journal_title": null,
                "type": "posted-content",
                "issn": null
            },
            "text": "lt;strong gt; How to efficiently retrieve information for different applications lt; strong gt; Author Wenyi Pi (ORCID: 0009-0002-2884-2771) This article aims to explore various ways in which Retrieval-Augmented Generation (RAG) can be utilised to retrieve information and generate responses effectively within the dialogue system. The rationale behind utilising RAG as well as potential ways in which it can be employed effectively will be covered.",
            "chunks": [
                "lt;strong gt; How to efficiently retrieve information for different applications lt; strong gt; Author Wenyi Pi (ORCID: 0009-0002-2884-2771) This article aims to explore various ways in which Retrieval-Augmented Generation (RAG) can be utilised to retrieve information and generate responses effectively within the dialogue system. The rationale behind utilising RAG as well as potential ways in which it can be employed effectively will be covered."
            ]
        },
        {
            "id": "doi_10.2139%2Fssrn.6608518",
            "title": "MedRAG: Retrieval-Augmented Generation for Medical QA-Comparing Base and RAG-Augmented LLM Performance on Evidence from Peer-Reviewed Clinical Research",
            "url": "https://doi.org/10.2139/ssrn.6608518",
            "source": "Crossref Academic Index",
            "license": "Publisher Proprietary / OpenAccess Check Required",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 391,
                "chunks_count": 2,
                "language": "en",
                "char_length": 1561,
                "is_jsonl_ready": true,
                "relevance_match_score": 2,
                "doi": "10.2139/ssrn.6608518",
                "publisher": "Elsevier BV",
                "publication_year": 2026,
                "citation_count": 0,
                "journal_title": null,
                "type": "posted-content",
                "issn": null
            },
            "text": "Large language models (LLMs) deployed for medical question answering must provide precise, evidencegrounded responses-yet their parametric knowledge frequently fails on specific quantitative findings from recent clinical trials and systematic reviews. We build a medical retrieval-augmented generation (MedRAG) system by indexing 5025 peer-reviewed papers across 104 clinical domains-retrieved via PubMed and the Consensus academic search engine, embedded using sentence-transformers, and stored in a ChromaDB vector database. We evaluate a local LLaMA 3. 2 base model against its RAG-augmented counterpart on 25 medical QA questions stratified by domain and difficulty. The base model achieves 0. 760 accuracy while the RAG-augmented system achieves 1. 000, a gain of 24 percentage points. by domain and difficulty. The base model achieves 0. 760 accuracy while the RAG-augmented system achieves 1. 000, a gain of 24 percentage points. Qualitative analysis of the 6 base-model failures reveals two distinct hallucination patterns: (i) fabrication, where the model invents specific trial names, authors, and effect sizes with confident language; and (ii) confident substitution, where the model returns plausible but numerically wrong statistics. RAG eliminates both patterns by grounding responses in retrieved evidence. Mean gold-keyword hit rate improves from 2. 36 to 4. 20 keywords per question. These results demonstrate that lightweight RAG over curated clinical literature substantially improves LLM factual accuracy for medical QA on consumer hardware.",
            "chunks": [
                "Large language models (LLMs) deployed for medical question answering must provide precise, evidencegrounded responses-yet their parametric knowledge frequently fails on specific quantitative findings from recent clinical trials and systematic reviews. We build a medical retrieval-augmented generation (MedRAG) system by indexing 5025 peer-reviewed papers across 104 clinical domains-retrieved via PubMed and the Consensus academic search engine, embedded using sentence-transformers, and stored in a ChromaDB vector database. We evaluate a local LLaMA 3. 2 base model against its RAG-augmented counterpart on 25 medical QA questions stratified by domain and difficulty. The base model achieves 0. 760 accuracy while the RAG-augmented system achieves 1. 000, a gain of 24 percentage points.",
                "by domain and difficulty. The base model achieves 0. 760 accuracy while the RAG-augmented system achieves 1. 000, a gain of 24 percentage points. Qualitative analysis of the 6 base-model failures reveals two distinct hallucination patterns: (i) fabrication, where the model invents specific trial names, authors, and effect sizes with confident language; and (ii) confident substitution, where the model returns plausible but numerically wrong statistics. RAG eliminates both patterns by grounding responses in retrieved evidence. Mean gold-keyword hit rate improves from 2. 36 to 4. 20 keywords per question. These results demonstrate that lightweight RAG over curated clinical literature substantially improves LLM factual accuracy for medical QA on consumer hardware."
            ]
        },
        {
            "id": "doi_10.59350%2Fxcq3s-jvk04",
            "title": "Understanding Retrieval Pitfalls: Challenges Faced by Retrieval Augmented Generation (RAG) models",
            "url": "https://doi.org/10.59350/xcq3s-jvk04",
            "source": "Crossref Academic Index",
            "license": "https://creativecommons.org/licenses/by/4.0/legalcode",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 84,
                "chunks_count": 1,
                "language": "en",
                "char_length": 333,
                "is_jsonl_ready": true,
                "relevance_match_score": 1,
                "doi": "10.59350/xcq3s-jvk04",
                "publisher": "Front Matter",
                "publication_year": 2024,
                "citation_count": 0,
                "journal_title": null,
                "type": "posted-content",
                "issn": null
            },
            "text": "Introduction Large language models (LLMs) like GPT-4, the engine of products like ChatGPT, have taken centre stage in recent years due to their astonishing capabilities. Yet, they are far from perfect. Many of us have since learnt perhaps when asking ChatGPT a question or employing it to write our reports that LLMs can hallucinate.",
            "chunks": [
                "Introduction Large language models (LLMs) like GPT-4, the engine of products like ChatGPT, have taken centre stage in recent years due to their astonishing capabilities. Yet, they are far from perfect. Many of us have since learnt perhaps when asking ChatGPT a question or employing it to write our reports that LLMs can hallucinate."
            ]
        },
        {
            "id": "doi_10.36085%2Fjsai.v9i1.9897",
            "title": "Evaluation of Retrieval Methods in Domain-Specific Chatbots Based on Retrieval-Augmented Generation",
            "url": "https://doi.org/10.36085/jsai.v9i1.9897",
            "source": "Crossref Academic Index",
            "license": "https://creativecommons.org/licenses/by-nc-nd/4.0",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 442,
                "chunks_count": 3,
                "language": "en",
                "char_length": 1767,
                "is_jsonl_ready": true,
                "relevance_match_score": 3,
                "doi": "10.36085/jsai.v9i1.9897",
                "publisher": "Universitas Muhammadiyah Bengkulu",
                "publication_year": 2026,
                "citation_count": 0,
                "journal_title": "JSAI (Journal Scientific and Applied Informatics)",
                "type": "journal-article",
                "issn": "2614-3054"
            },
            "text": "This study evaluated retrieval methods in the implementation of a domain-specific chatbot based on Retrieval-Augmented Generation to improve information accuracy and relevance while reducing hallucination risks. The primary problem addressed was the incorrect selection and prioritization of contextual documents in chatbot systems built on large language models, particularly in technical domains. An experimental approach was applied by comparing three retrieval strategies: lexical retrieval based on term frequency inverse document frequency, semantic retrieval using vector representations, and a hybrid retrieval method combining lexical and semantic signals. inverse document frequency, semantic retrieval using vector representations, and a hybrid retrieval method combining lexical and semantic signals. System performance was measured using Recall at different ranking thresholds and Mean Reciprocal Rank to assess both document discovery and ranking quality. The results demonstrated that lexical retrieval achieved the highest precision at the top-ranked position, while semantic retrieval showed reduced effectiveness due to semantic drift in technical documents. The hybrid approach improved mid-range recall performance but still exhibited ranking ambiguity for top-ranked results. drift in technical documents. The hybrid approach improved mid-range recall performance but still exhibited ranking ambiguity for top-ranked results. These findings indicated that retrieval quality in Retrieval-Augmented Generation systems depended more on effective ranking and context prioritization than on document availability alone. The study concluded that systematic evaluation of retrieval methods was essential for developing reliable domain-specific chatbots.",
            "chunks": [
                "This study evaluated retrieval methods in the implementation of a domain-specific chatbot based on Retrieval-Augmented Generation to improve information accuracy and relevance while reducing hallucination risks. The primary problem addressed was the incorrect selection and prioritization of contextual documents in chatbot systems built on large language models, particularly in technical domains. An experimental approach was applied by comparing three retrieval strategies: lexical retrieval based on term frequency inverse document frequency, semantic retrieval using vector representations, and a hybrid retrieval method combining lexical and semantic signals.",
                "inverse document frequency, semantic retrieval using vector representations, and a hybrid retrieval method combining lexical and semantic signals. System performance was measured using Recall at different ranking thresholds and Mean Reciprocal Rank to assess both document discovery and ranking quality. The results demonstrated that lexical retrieval achieved the highest precision at the top-ranked position, while semantic retrieval showed reduced effectiveness due to semantic drift in technical documents. The hybrid approach improved mid-range recall performance but still exhibited ranking ambiguity for top-ranked results.",
                "drift in technical documents. The hybrid approach improved mid-range recall performance but still exhibited ranking ambiguity for top-ranked results. These findings indicated that retrieval quality in Retrieval-Augmented Generation systems depended more on effective ranking and context prioritization than on document availability alone. The study concluded that systematic evaluation of retrieval methods was essential for developing reliable domain-specific chatbots."
            ]
        },
        {
            "id": "doi_10.20944%2Fpreprints202602.0996.v1",
            "title": "Clinical Large Language Models with Multi-Stage Instruction Tuning and Advanced Retrieval-Augmented Generation",
            "url": "https://doi.org/10.20944/preprints202602.0996.v1",
            "source": "Crossref Academic Index",
            "license": "http://creativecommons.org/licenses/by/4.0",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 502,
                "chunks_count": 3,
                "language": "en",
                "char_length": 2008,
                "is_jsonl_ready": true,
                "relevance_match_score": 3,
                "doi": "10.20944/preprints202602.0996.v1",
                "publisher": "MDPI AG",
                "publication_year": 2026,
                "citation_count": 0,
                "journal_title": null,
                "type": "posted-content",
                "issn": null
            },
            "text": "The demand for efficient and accurate Clinical Decision Support Systems (CDSS) is growing rapidly, driven by the escalating volume of medical data. While Large Language Models (LLMs) offer significant potential, their direct application in healthcare is limited by issues like hallucinations and lack of domain-specific knowledge. Retrieval-Augmented Generation (RAG) addresses these challenges by grounding LLMs with external knowledge, and recent lightweight RAG-based CDSS have shown promise. Building on this, we propose Enhanced Clinical RAG-LLM (ECRAG-LLM), a novel system designed to elevate performance in complex clinical scenarios. Building on this, we propose Enhanced Clinical RAG-LLM (ECRAG-LLM), a novel system designed to elevate performance in complex clinical scenarios. ECRAG-LLM utilizes a robust yet lightweight Mistral-based LLM, integrated with a multi-stage instruction tuning strategy that first adapts to general medical knowledge and then reinforces context-aware and causal reasoning using a custom dataset of structured clinical cases. We employ BioSimCSE for domain-specific embeddings and introduce an enhanced RAG architecture featuring hybrid retrieval, cross-encoder-based contextual re-ranking, and context summarization to optimize retrieved information. RAG architecture featuring hybrid retrieval, cross-encoder-based contextual re-ranking, and context summarization to optimize retrieved information. Extensive experiments on medical benchmarks demonstrate that ECRAG-LLM consistently outperforms baseline lightweight fine-tuned LLMs, achieving significant improvements in diagnostic accuracy, treatment appropriateness, and explanatory quality, particularly in tasks requiring deep clinical reasoning. 
An ablation study confirms the synergistic contributions of our innovations, and an error analysis highlights a substantial reduction in critical errors, positioning ECRAG-LLM as a more reliable and intelligent solution for resource-constrained clinical environments.",
            "chunks": [
                "The demand for efficient and accurate Clinical Decision Support Systems (CDSS) is growing rapidly, driven by the escalating volume of medical data. While Large Language Models (LLMs) offer significant potential, their direct application in healthcare is limited by issues like hallucinations and lack of domain-specific knowledge. Retrieval-Augmented Generation (RAG) addresses these challenges by grounding LLMs with external knowledge, and recent lightweight RAG-based CDSS have shown promise. Building on this, we propose Enhanced Clinical RAG-LLM (ECRAG-LLM), a novel system designed to elevate performance in complex clinical scenarios.",
                "Building on this, we propose Enhanced Clinical RAG-LLM (ECRAG-LLM), a novel system designed to elevate performance in complex clinical scenarios. ECRAG-LLM utilizes a robust yet lightweight Mistral-based LLM, integrated with a multi-stage instruction tuning strategy that first adapts to general medical knowledge and then reinforces context-aware and causal reasoning using a custom dataset of structured clinical cases. We employ BioSimCSE for domain-specific embeddings and introduce an enhanced RAG architecture featuring hybrid retrieval, cross-encoder-based contextual re-ranking, and context summarization to optimize retrieved information.",
                "RAG architecture featuring hybrid retrieval, cross-encoder-based contextual re-ranking, and context summarization to optimize retrieved information. Extensive experiments on medical benchmarks demonstrate that ECRAG-LLM consistently outperforms baseline lightweight fine-tuned LLMs, achieving significant improvements in diagnostic accuracy, treatment appropriateness, and explanatory quality, particularly in tasks requiring deep clinical reasoning. An ablation study confirms the synergistic contributions of our innovations, and an error analysis highlights a substantial reduction in critical errors, positioning ECRAG-LLM as a more reliable and intelligent solution for resource-constrained clinical environments."
            ]
        },
        {
            "id": "doi_10.51473%2Frcmos.v1i2.2024.736",
            "title": "Protótipo de Pesquisa Documental para a Polícia Militar do Paraná com Retrieval Augmented Generation e Gemini AI em Ambiente Dockerizado",
            "url": "https://doi.org/10.51473/rcmos.v1i2.2024.736",
            "source": "Crossref Academic Index",
            "license": "https://creativecommons.org/licenses/by/4.0",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 275,
                "chunks_count": 2,
                "language": "en",
                "char_length": 1097,
                "is_jsonl_ready": true,
                "relevance_match_score": 2,
                "doi": "10.51473/rcmos.v1i2.2024.736",
                "publisher": "Editora Aluz",
                "publication_year": 2024,
                "citation_count": 0,
                "journal_title": "RCMOS - Revista Científica Multidisciplinar O Saber",
                "type": "journal-article",
                "issn": "2675-9128"
            },
            "text": "Este trabalho apresenta o desenvolvimento de um protótipo de pesquisa documental para as diretrizes e doutrinas da Polícia Militar do Paraná (PMPR) utilizando a estratégia Retrieval Augmented Generation (RAG) e o modelo de linguagem Gemini AI Flash 1. 5. O protótipo foi implementado em um ambiente containerizado com Docker, visando garantir portabilidade e reprodutibilidade. A estratégia RAG combina a busca tradicional com modelos de linguagem avançados para gerar respostas mais precisas e completas às consultas dos usuários. O protótipo foi testado com perguntas reais e os resultados preliminares demonstram a capacidade do sistema em compreender as perguntas e fornecer respostas relevantes, com base nas informações contidas nos documentos da PMPR. a capacidade do sistema em compreender as perguntas e fornecer respostas relevantes, com base nas informações contidas nos documentos da PMPR. O trabalho discute o potencial do protótipo para auxiliar os policiais militares no acesso às informações relevantes, superando as limitações da atual forma de pesquisa documental na instituição.",
            "chunks": [
                "Este trabalho apresenta o desenvolvimento de um protótipo de pesquisa documental para as diretrizes e doutrinas da Polícia Militar do Paraná (PMPR) utilizando a estratégia Retrieval Augmented Generation (RAG) e o modelo de linguagem Gemini AI Flash 1. 5. O protótipo foi implementado em um ambiente containerizado com Docker, visando garantir portabilidade e reprodutibilidade. A estratégia RAG combina a busca tradicional com modelos de linguagem avançados para gerar respostas mais precisas e completas às consultas dos usuários. O protótipo foi testado com perguntas reais e os resultados preliminares demonstram a capacidade do sistema em compreender as perguntas e fornecer respostas relevantes, com base nas informações contidas nos documentos da PMPR.",
                "a capacidade do sistema em compreender as perguntas e fornecer respostas relevantes, com base nas informações contidas nos documentos da PMPR. O trabalho discute o potencial do protótipo para auxiliar os policiais militares no acesso às informações relevantes, superando as limitações da atual forma de pesquisa documental na instituição."
            ]
        },
        {
            "id": "doi_10.5121%2Fcsit.2025.150904",
            "title": "Large Language Models in Clinical Advice: Direct Generation and Retrieval Augmented Generation vs Expert Advice",
            "url": "https://doi.org/10.5121/csit.2025.150904",
            "source": "Crossref Academic Index",
            "license": "Publisher Proprietary / OpenAccess Check Required",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 465,
                "chunks_count": 3,
                "language": "en",
                "char_length": 1858,
                "is_jsonl_ready": true,
                "relevance_match_score": 3,
                "doi": "10.5121/csit.2025.150904",
                "publisher": "Academy & Industry Research Collaboration Center",
                "publication_year": 2025,
                "citation_count": 2,
                "journal_title": "Advanced Natural Language Processing 2025",
                "type": "proceedings-article",
                "issn": null
            },
            "text": "The NHS faces mounting pressures, resulting in workforce attrition and growing care backlogs. Pharmacy services, critical for ensuring medication safety and effectiveness, are often overlooked in digital innovation efforts. This pilot study investigates the potential of Large Language Models (LLMs) to alleviate pharmacy pressures by answering clinical pharmaceutical queries. Two retrieval techniques were evaluated: Vanilla Retrieval Augmented Generation (RAG) and Graph RAG, supported by an external knowledge source designed specifically for this study. ChatGPT 4o without retrieval served as a control. Quantitative and qualitative evaluations were conducted, including expert human assessments for response accuracy, relevance, and safety. a control. Quantitative and qualitative evaluations were conducted, including expert human assessments for response accuracy, relevance, and safety. Results demonstrated that LLMs can generate high-quality responses. In expert evaluations, Vanilla RAG outperformed other models and even human reference answers for accuracy and risk. Graph RAG revealed challenges related to retrieval accuracy. Despite the promise of LLMs, hallucinations and the ambiguity around LLM evaluations in healthcare remain key barriers to clinical deployment. This pilot study underscores the importance of robust evaluation frameworks to ensure the safe integration of LLMs into clinical workflows. However, regulatory bodies have yet to catch up with the rapid pace of LLM development. ensure the safe integration of LLMs into clinical workflows. However, regulatory bodies have yet to catch up with the rapid pace of LLM development. Guidelines are urgently needed to address the issues of transparency, explainability, data protection, and validation, to facilitate the safe and effective deployment of LLMs in clinical practice.",
            "chunks": [
                "The NHS faces mounting pressures, resulting in workforce attrition and growing care backlogs. Pharmacy services, critical for ensuring medication safety and effectiveness, are often overlooked in digital innovation efforts. This pilot study investigates the potential of Large Language Models (LLMs) to alleviate pharmacy pressures by answering clinical pharmaceutical queries. Two retrieval techniques were evaluated: Vanilla Retrieval Augmented Generation (RAG) and Graph RAG, supported by an external knowledge source designed specifically for this study. ChatGPT 4o without retrieval served as a control. Quantitative and qualitative evaluations were conducted, including expert human assessments for response accuracy, relevance, and safety.",
                "a control. Quantitative and qualitative evaluations were conducted, including expert human assessments for response accuracy, relevance, and safety. Results demonstrated that LLMs can generate high-quality responses. In expert evaluations, Vanilla RAG outperformed other models and even human reference answers for accuracy and risk. Graph RAG revealed challenges related to retrieval accuracy. Despite the promise of LLMs, hallucinations and the ambiguity around LLM evaluations in healthcare remain key barriers to clinical deployment. This pilot study underscores the importance of robust evaluation frameworks to ensure the safe integration of LLMs into clinical workflows. However, regulatory bodies have yet to catch up with the rapid pace of LLM development.",
                "ensure the safe integration of LLMs into clinical workflows. However, regulatory bodies have yet to catch up with the rapid pace of LLM development. Guidelines are urgently needed to address the issues of transparency, explainability, data protection, and validation, to facilitate the safe and effective deployment of LLMs in clinical practice."
            ]
        },
        {
            "id": "doi_10.21070%2Fijins.v27i1.1836",
            "title": "Comparative Performance of Retrieval Augmented Generation Tourism Chatbots",
            "url": "https://doi.org/10.21070/ijins.v27i1.1836",
            "source": "Crossref Academic Index",
            "license": "https://creativecommons.org/licenses/by/4.0",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 585,
                "chunks_count": 4,
                "language": "en",
                "char_length": 2340,
                "is_jsonl_ready": true,
                "relevance_match_score": 4,
                "doi": "10.21070/ijins.v27i1.1836",
                "publisher": "Universitas Muhammadiyah Sidoarjo",
                "publication_year": 2026,
                "citation_count": 0,
                "journal_title": "Indonesian Journal of Innovation Studies",
                "type": "journal-article",
                "issn": "2598-9936"
            },
            "text": "General Background: The rapid adoption of artificial intelligence in smart tourism has increased the use of contextual chatbots to deliver destination information efficiently. Specific Background: However, tourism chatbots based on Large Language Models frequently encounter information hallucination, reducing reliability when handling dynamic and local tourism data. Knowledge Gap: Existing studies mainly focus on rule-based or single-model chatbot implementations and provide limited comparative evaluation of Retrieval Augmented Generation configurations combining embedding models and Large Language Models. and provide limited comparative evaluation of Retrieval Augmented Generation configurations combining embedding models and Large Language Models. Aims: This study aims to comparatively evaluate multiple Retrieval Augmented Generation configurations to identify the most suitable combination for contextual tourism chatbots and to analyze differences between large multilingual and small monolingual embedding models using a local tourism dataset. Results: Experimental evaluation using data from 49 tourist destinations in Banyumas Regency shows that the Multilingual-E5-Large embedding model consistently achieves perfect Precision, Recall, and F1-Score across all tested Large Language Models. The combination of Multilingual-E5-Large and GPT-4. achieves perfect Precision, Recall, and F1-Score across all tested Large Language Models. The combination of Multilingual-E5-Large and GPT-4. 1-Mini demonstrates the most balanced performance, achieving a BERTScore F1 of 0. 7515 with an average response time of 1. 555 seconds. Novelty: This research provides a systematic comparative assessment of embedding capacity and Large Language Model selection within a unified Retrieval Augmented Generation framework for tourism chatbots. 
Implications: The findings offer practical guidance for selecting model configurations that ensure accurate retrieval, high-quality responses, and efficient system performance in contextual tourism information services. configurations that ensure accurate retrieval, high-quality responses, and efficient system performance in contextual tourism information services. Highlights Multilingual embedding models deliver consistently higher retrieval accuracy across all tested configurations GPT-4.",
            "chunks": [
                "General Background: The rapid adoption of artificial intelligence in smart tourism has increased the use of contextual chatbots to deliver destination information efficiently. Specific Background: However, tourism chatbots based on Large Language Models frequently encounter information hallucination, reducing reliability when handling dynamic and local tourism data. Knowledge Gap: Existing studies mainly focus on rule-based or single-model chatbot implementations and provide limited comparative evaluation of Retrieval Augmented Generation configurations combining embedding models and Large Language Models.",
                "and provide limited comparative evaluation of Retrieval Augmented Generation configurations combining embedding models and Large Language Models. Aims: This study aims to comparatively evaluate multiple Retrieval Augmented Generation configurations to identify the most suitable combination for contextual tourism chatbots and to analyze differences between large multilingual and small monolingual embedding models using a local tourism dataset. Results: Experimental evaluation using data from 49 tourist destinations in Banyumas Regency shows that the Multilingual-E5-Large embedding model consistently achieves perfect Precision, Recall, and F1-Score across all tested Large Language Models. The combination of Multilingual-E5-Large and GPT-4.",
                "achieves perfect Precision, Recall, and F1-Score across all tested Large Language Models. The combination of Multilingual-E5-Large and GPT-4. 1-Mini demonstrates the most balanced performance, achieving a BERTScore F1 of 0. 7515 with an average response time of 1. 555 seconds. Novelty: This research provides a systematic comparative assessment of embedding capacity and Large Language Model selection within a unified Retrieval Augmented Generation framework for tourism chatbots. Implications: The findings offer practical guidance for selecting model configurations that ensure accurate retrieval, high-quality responses, and efficient system performance in contextual tourism information services.",
                "configurations that ensure accurate retrieval, high-quality responses, and efficient system performance in contextual tourism information services. Highlights Multilingual embedding models deliver consistently higher retrieval accuracy across all tested configurations GPT-4."
            ]
        },
        {
            "id": "doi_10.36227%2Ftechrxiv.171837853.31531482%2Fv1",
            "title": "Large Language Model with Federated Retrieval-Augmented Generation for Improved Knowledge Retrieval",
            "url": "https://doi.org/10.36227/techrxiv.171837853.31531482/v1",
            "source": "Crossref Academic Index",
            "license": "https://creativecommons.org/licenses/by-nc-sa/4.0/",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 359,
                "chunks_count": 2,
                "language": "en",
                "char_length": 1434,
                "is_jsonl_ready": true,
                "relevance_match_score": 2,
                "doi": "10.36227/techrxiv.171837853.31531482/v1",
                "publisher": "Institute of Electrical and Electronics Engineers (IEEE)",
                "publication_year": 2024,
                "citation_count": 0,
                "journal_title": null,
                "type": "posted-content",
                "issn": null
            },
            "text": "Natural language processing models have shown lots of advancements in generating coherent and contextually relevant responses, yet they often struggle with retrieving precise and upto-date information due to the static nature of their training data. Introducing Federated Retrieval-Augmented Generation (RAG) represents a novel and significant approach by integrating federated learning with dynamic retrieval mechanisms to enhance information retrieval and response generation. This article presents the implementation of Federated RAG on Mistral 8x7b, an open-source large language model, demonstrating substantial improvements in retrieval quality and response accuracy. RAG on Mistral 8x7b, an open-source large language model, demonstrating substantial improvements in retrieval quality and response accuracy. The federated learning framework facilitated distributed training across multiple nodes, ensuring data privacy while enabling the model to leverage diverse information sources. Comprehensive evaluation on the MMLU benchmark revealed that the Federated RAG model consistently outperformed the baseline RAG model, achieving higher accuracy and relevance in the generated responses. Detailed analysis and optimization of the retrieval mechanisms and training processes contributed to the model s enhanced performance, highlighting the potential of Federated RAG as a scalable solution for knowledge-intensive applications.",
            "chunks": [
                "Natural language processing models have shown lots of advancements in generating coherent and contextually relevant responses, yet they often struggle with retrieving precise and upto-date information due to the static nature of their training data. Introducing Federated Retrieval-Augmented Generation (RAG) represents a novel and significant approach by integrating federated learning with dynamic retrieval mechanisms to enhance information retrieval and response generation. This article presents the implementation of Federated RAG on Mistral 8x7b, an open-source large language model, demonstrating substantial improvements in retrieval quality and response accuracy.",
                "RAG on Mistral 8x7b, an open-source large language model, demonstrating substantial improvements in retrieval quality and response accuracy. The federated learning framework facilitated distributed training across multiple nodes, ensuring data privacy while enabling the model to leverage diverse information sources. Comprehensive evaluation on the MMLU benchmark revealed that the Federated RAG model consistently outperformed the baseline RAG model, achieving higher accuracy and relevance in the generated responses. Detailed analysis and optimization of the retrieval mechanisms and training processes contributed to the model s enhanced performance, highlighting the potential of Federated RAG as a scalable solution for knowledge-intensive applications."
            ]
        },
        {
            "id": "doi_10.36227%2Ftechrxiv.174060240.09460752%2Fv1",
            "title": "Enhancing Context-Aware Search with Retrieval-Augmented Generation",
            "url": "https://doi.org/10.36227/techrxiv.174060240.09460752/v1",
            "source": "Crossref Academic Index",
            "license": "https://creativecommons.org/licenses/by/4.0/",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 527,
                "chunks_count": 3,
                "language": "en",
                "char_length": 2108,
                "is_jsonl_ready": true,
                "relevance_match_score": 3,
                "doi": "10.36227/techrxiv.174060240.09460752/v1",
                "publisher": "Institute of Electrical and Electronics Engineers (IEEE)",
                "publication_year": 2025,
                "citation_count": 1,
                "journal_title": null,
                "type": "posted-content",
                "issn": null
            },
            "text": "Digital content has grown exponentially, requiring efficient and context-aware information retrieval systems. Search methods which utilize key word based functions, such as BM25, rely on lexical matching but often fail to capture semantic relationships. Using dense vector embeddings models, Small Language Models (SLMs) can improve retrieval accuracy, however Retrieval-Augmented Generation (RAG) incorporates the concept of contextual ranking through generative artificial intelligence to further enhance search relevance. An experiment on a single document corpus and a cryptography-related query evaluates BM25, SLM embeddings, and SLM RAG for document retrieval. Based on experimental results, BM25 achieves a moderate relevance score of 0. evaluates BM25, SLM embeddings, and SLM RAG for document retrieval. Based on experimental results, BM25 achieves a moderate relevance score of 0. 500, retrieving documents based on exact matches but lacking contextual understanding. As SLM embeddings identify semantically similar concepts, they increase recall for queries with conceptual variations and a relevance score 0. 7668. SLM RAG outperforms both approaches' relevance score 0. 9202, retrieving the most relevant document with accuracy and contextually enriched responses. A hybrid retrieval model that combines dense embeddings and generative ranking improves search quality substantially. enriched responses. A hybrid retrieval model that combines dense embeddings and generative ranking improves search quality substantially. This exploratory study has multiple implications for information retrieval and including the need for scalable multi-document retrieval, FAISS DPR integration for efficient vector searches, and domain-specific fine-tuning of SLMs. Using this approach, enterprises can achieve faster, more accurate, and contextually aware document retrieval by combining SLMs and RAG methods. 
Hybrid approaches could be explored in large-scale retrieval settings, with stable workflow integration processes across diverse industries like legal research, healthcare, government, and finance.",
            "chunks": [
                "Digital content has grown exponentially, requiring efficient and context-aware information retrieval systems. Search methods which utilize key word based functions, such as BM25, rely on lexical matching but often fail to capture semantic relationships. Using dense vector embeddings models, Small Language Models (SLMs) can improve retrieval accuracy, however Retrieval-Augmented Generation (RAG) incorporates the concept of contextual ranking through generative artificial intelligence to further enhance search relevance. An experiment on a single document corpus and a cryptography-related query evaluates BM25, SLM embeddings, and SLM RAG for document retrieval. Based on experimental results, BM25 achieves a moderate relevance score of 0.",
                "evaluates BM25, SLM embeddings, and SLM RAG for document retrieval. Based on experimental results, BM25 achieves a moderate relevance score of 0. 500, retrieving documents based on exact matches but lacking contextual understanding. As SLM embeddings identify semantically similar concepts, they increase recall for queries with conceptual variations and a relevance score 0. 7668. SLM RAG outperforms both approaches' relevance score 0. 9202, retrieving the most relevant document with accuracy and contextually enriched responses. A hybrid retrieval model that combines dense embeddings and generative ranking improves search quality substantially.",
                "enriched responses. A hybrid retrieval model that combines dense embeddings and generative ranking improves search quality substantially. This exploratory study has multiple implications for information retrieval and including the need for scalable multi-document retrieval, FAISS DPR integration for efficient vector searches, and domain-specific fine-tuning of SLMs. Using this approach, enterprises can achieve faster, more accurate, and contextually aware document retrieval by combining SLMs and RAG methods. Hybrid approaches could be explored in large-scale retrieval settings, with stable workflow integration processes across diverse industries like legal research, healthcare, government, and finance."
            ]
        },
        {
            "id": "doi_10.36227%2Ftechrxiv.177155637.75354706%2Fv1",
            "title": "Adaptive Optimization for Retrieval-Augmented Generation via Diagnostic-Driven Strategy Generation",
            "url": "https://doi.org/10.36227/techrxiv.177155637.75354706/v1",
            "source": "Crossref Academic Index",
            "license": "https://creativecommons.org/licenses/by/4.0/",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 455,
                "chunks_count": 3,
                "language": "en",
                "char_length": 1817,
                "is_jsonl_ready": true,
                "relevance_match_score": 3,
                "doi": "10.36227/techrxiv.177155637.75354706/v1",
                "publisher": "Institute of Electrical and Electronics Engineers (IEEE)",
                "publication_year": 2026,
                "citation_count": 0,
                "journal_title": null,
                "type": "posted-content",
                "issn": null
            },
            "text": "Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) are essential for precise applications, yet their performance suffers from a significant gap between problem diagnosis and effective optimization. While recent diagnostic frameworks like THELMA offer fine-grained RAG performance insights, they still require manual expert intervention for resolution, leading to slow, experience-dependent optimization cycles. To bridge this \"diagnosis-to-optimization\" gap, we propose DynaRAG (Dynamic RAG Optimization Framework), an adaptive system that extends diagnostic capabilities to intelligent, LLMpowered strategy generation and semi-automated pipeline adjustment. an adaptive system that extends diagnostic capabilities to intelligent, LLMpowered strategy generation and semi-automated pipeline adjustment. DynaRAG comprises a Granular Diagnosis Module (GDM) using THELMA, a novel RAG Component Adaptation Knowledge Base (RCAK) mapping diagnostic patterns to root causes and optimization strategies, and an Adaptive Strategy Generation amp; Recommendation Module (ASGR) generating specific, actionable advice, including configuration snippets. Experiments on the THELMA-WikiEval dataset and an expert-annotated subset demonstrate DynaRAG's superior performance across key metrics: achieving a Diagnosis Accuracy of 0. 93, a Strategy Match Score of 0. 87, an Actionability Score of 0. 90, and an Average Performance Improvement of 0. a Diagnosis Accuracy of 0. 93, a Strategy Match Score of 0. 87, an Actionability Score of 0. 90, and an Average Performance Improvement of 0. 15, significantly outperforming existing diagnostic-only or heuristic-based approaches. DynaRAG thus represents a significant step towards continuous, efficient, and semi-automated performance optimization for RAG applications.",
            "chunks": [
                "Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) are essential for precise applications, yet their performance suffers from a significant gap between problem diagnosis and effective optimization. While recent diagnostic frameworks like THELMA offer fine-grained RAG performance insights, they still require manual expert intervention for resolution, leading to slow, experience-dependent optimization cycles. To bridge this \"diagnosis-to-optimization\" gap, we propose DynaRAG (Dynamic RAG Optimization Framework), an adaptive system that extends diagnostic capabilities to intelligent, LLMpowered strategy generation and semi-automated pipeline adjustment.",
                "an adaptive system that extends diagnostic capabilities to intelligent, LLMpowered strategy generation and semi-automated pipeline adjustment. DynaRAG comprises a Granular Diagnosis Module (GDM) using THELMA, a novel RAG Component Adaptation Knowledge Base (RCAK) mapping diagnostic patterns to root causes and optimization strategies, and an Adaptive Strategy Generation amp; Recommendation Module (ASGR) generating specific, actionable advice, including configuration snippets. Experiments on the THELMA-WikiEval dataset and an expert-annotated subset demonstrate DynaRAG's superior performance across key metrics: achieving a Diagnosis Accuracy of 0. 93, a Strategy Match Score of 0. 87, an Actionability Score of 0. 90, and an Average Performance Improvement of 0.",
                "a Diagnosis Accuracy of 0. 93, a Strategy Match Score of 0. 87, an Actionability Score of 0. 90, and an Average Performance Improvement of 0. 15, significantly outperforming existing diagnostic-only or heuristic-based approaches. DynaRAG thus represents a significant step towards continuous, efficient, and semi-automated performance optimization for RAG applications."
            ]
        },
        {
            "id": "doi_10.20944%2Fpreprints202504.0443.v1",
            "title": "Retrieval-Augmented Text Generation: Methods, Challenges, and Applications",
            "url": "https://doi.org/10.20944/preprints202504.0443.v1",
            "source": "Crossref Academic Index",
            "license": "http://creativecommons.org/licenses/by/4.0",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 707,
                "chunks_count": 4,
                "language": "en",
                "char_length": 2825,
                "is_jsonl_ready": true,
                "relevance_match_score": 4,
                "doi": "10.20944/preprints202504.0443.v1",
                "publisher": "MDPI AG",
                "publication_year": 2025,
                "citation_count": 1,
                "journal_title": null,
                "type": "posted-content",
                "issn": null
            },
            "text": "Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, they are inherently constrained by the static nature of their pretraining data, leading to challenges such as knowledge obsolescence, hallucination, and limited factual grounding. Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm that addresses these limitations by dynamically integrating external knowledge retrieval with generative text modeling. By retrieving relevant documents or structured knowledge at inference time, RAG enhances model reliability, improves factual accuracy, and enables real-time knowledge adaptation. or structured knowledge at inference time, RAG enhances model reliability, improves factual accuracy, and enables real-time knowledge adaptation. This survey provides a comprehensive overview of RAG, covering its foundational principles, retrieval mechanisms, generative strategies, and integration methodologies. We discuss various retrieval approaches, including sparse and dense retrieval, hybrid search models, and reinforcement learning-based retrieval optimization. We explore different fusion techniques for incorporating retrieved knowledge into generation, such as prompt concatenation, attention-based integration, and iterative refinement. for incorporating retrieved knowledge into generation, such as prompt concatenation, attention-based integration, and iterative refinement. Additionally, we examine the diverse applications of RAG across domains such as open-domain question answering, conversational AI, scientific literature summarization, code generation, legal document analysis, and biomedical research. Despite its advantages, RAG introduces new challenges, including retrieval noise, latency constraints, security vulnerabilities, and bias in retrieved content. 
We highlight key research directions to address these challenges, including scalable retrieval architectures, multimodal knowledge integration, continual learning for adaptive retrieval, and bias-aware ranking techniques. scalable retrieval architectures, multimodal knowledge integration, continual learning for adaptive retrieval, and bias-aware ranking techniques. Furthermore, we discuss the broader implications of RAG in enabling explainable AI, bridging structured and unstructured knowledge sources, and democratizing access to real-time information. By synthesizing recent advancements and outlining future research opportunities, this survey serves as a foundational resource for researchers and practitioners working on retrieval-augmented systems. As RAG continues to evolve, it is poised to redefine the landscape of AI-driven text generation, paving the way for more accurate, interpretable, and knowledge-aware artificial intelligence systems.",
            "chunks": [
                "Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, they are inherently constrained by the static nature of their pretraining data, leading to challenges such as knowledge obsolescence, hallucination, and limited factual grounding. Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm that addresses these limitations by dynamically integrating external knowledge retrieval with generative text modeling. By retrieving relevant documents or structured knowledge at inference time, RAG enhances model reliability, improves factual accuracy, and enables real-time knowledge adaptation.",
                "or structured knowledge at inference time, RAG enhances model reliability, improves factual accuracy, and enables real-time knowledge adaptation. This survey provides a comprehensive overview of RAG, covering its foundational principles, retrieval mechanisms, generative strategies, and integration methodologies. We discuss various retrieval approaches, including sparse and dense retrieval, hybrid search models, and reinforcement learning-based retrieval optimization. We explore different fusion techniques for incorporating retrieved knowledge into generation, such as prompt concatenation, attention-based integration, and iterative refinement.",
                "for incorporating retrieved knowledge into generation, such as prompt concatenation, attention-based integration, and iterative refinement. Additionally, we examine the diverse applications of RAG across domains such as open-domain question answering, conversational AI, scientific literature summarization, code generation, legal document analysis, and biomedical research. Despite its advantages, RAG introduces new challenges, including retrieval noise, latency constraints, security vulnerabilities, and bias in retrieved content. We highlight key research directions to address these challenges, including scalable retrieval architectures, multimodal knowledge integration, continual learning for adaptive retrieval, and bias-aware ranking techniques.",
                "scalable retrieval architectures, multimodal knowledge integration, continual learning for adaptive retrieval, and bias-aware ranking techniques. Furthermore, we discuss the broader implications of RAG in enabling explainable AI, bridging structured and unstructured knowledge sources, and democratizing access to real-time information. By synthesizing recent advancements and outlining future research opportunities, this survey serves as a foundational resource for researchers and practitioners working on retrieval-augmented systems. As RAG continues to evolve, it is poised to redefine the landscape of AI-driven text generation, paving the way for more accurate, interpretable, and knowledge-aware artificial intelligence systems."
            ]
        },
        {
            "id": "doi_10.1101%2F2025.11.25.25341010",
            "title": "Development and validation of Retrieval Augmented Generation (RAG) and GraphRAG for complex clinical cases",
            "url": "https://doi.org/10.1101/2025.11.25.25341010",
            "source": "Crossref Academic Index",
            "license": "http://creativecommons.org/licenses/by-nc/4.0/",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 832,
                "chunks_count": 5,
                "language": "en",
                "char_length": 3325,
                "is_jsonl_ready": true,
                "relevance_match_score": 5,
                "doi": "10.1101/2025.11.25.25341010",
                "publisher": "openRxiv",
                "publication_year": 2025,
                "citation_count": 0,
                "journal_title": null,
                "type": "posted-content",
                "issn": null
            },
            "text": "Abstract Objective Chronic Kidney Disease (CKD) is a progressive condition requiring evidence-based management, but adherence to complex guidelines remains challenging. Large Language Models (LLMs) could support clinical decision-making, yet their unreliability limits direct use. This study aimed to evaluate whether Retrieval-Augmented Generation (RAG), particularly a knowledge graph-enhanced pipeline (GraphRAG), improves guideline-based clinical decision support (CDS) in CKD management. Methods and Analysis We compared three approaches: a baseline LLM (GPT-4o), a vector-indexed RAG pipeline, and a GraphRAG pipeline. Each model answered nine clinically relevant questions for a synthetic cohort of 70 CKD patients. RAG pipeline, and a GraphRAG pipeline. Each model answered nine clinically relevant questions for a synthetic cohort of 70 CKD patients. Outputs were assessed for clinical correctness, patient-specificity, and clarity, using both clinician-led evaluations and an LLM-as-Judge framework. Results RAG-based methods outperformed the baseline LLM in clinical correctness and guideline adherence. GraphRAG achieved the highest patient-specificity by leveraging multi-hop relationships across a knowledge graph derived from NICE CKD guidelines, particularly for tasks involving thresholds, algorithmic decisions, or open-ended management. However, GraphRAG scored lower in clarity, as its graph walks often returned long guideline excerpts that obscured key recommendations. management. However, GraphRAG scored lower in clarity, as its graph walks often returned long guideline excerpts that obscured key recommendations. All RAG systems were limited by the scope of the indexed guideline and performed poorly when essential information was missing. Conclusions RAG and GraphRAG provide a scalable, auditable foundation for guideline-aligned CDS in CKD, with GraphRAG showing particular strengths in tailoring advice to patient data. 
Nonetheless, trade-offs remain between specificity and clarity, and effective deployment will require robust content management, transparent validation pipelines, and integration within established clinical governance frameworks. will require robust content management, transparent validation pipelines, and integration within established clinical governance frameworks. Key points - LLMs have comprehensive medical knowledge but require access to up-to-date, evidence-based, and locally relevant guidelines to be effective in CDS. - Hallucinations (the generation of inaccurate or misleading information) remain a major limitation for LLMs in healthcare. - Traditional information retrieval methods face several challenges in providing accurate, context-specific evidence. - Retrieval-Augmented Generation (RAG) and graph-based RAG approaches have emerged as promising solutions to overcome these limitations. evidence. - Retrieval-Augmented Generation (RAG) and graph-based RAG approaches have emerged as promising solutions to overcome these limitations. - Renal medicine provides an ideal test domain to evaluate these models, given its complexity and reliance on nuanced, multidisciplinary decision-making. - Studying LLM performance in kidney health can yield valuable insights into how such models can safely and effectively support complex clinical decision-making.",
            "chunks": [
                "Abstract Objective Chronic Kidney Disease (CKD) is a progressive condition requiring evidence-based management, but adherence to complex guidelines remains challenging. Large Language Models (LLMs) could support clinical decision-making, yet their unreliability limits direct use. This study aimed to evaluate whether Retrieval-Augmented Generation (RAG), particularly a knowledge graph-enhanced pipeline (GraphRAG), improves guideline-based clinical decision support (CDS) in CKD management. Methods and Analysis We compared three approaches: a baseline LLM (GPT-4o), a vector-indexed RAG pipeline, and a GraphRAG pipeline. Each model answered nine clinically relevant questions for a synthetic cohort of 70 CKD patients.",
                "RAG pipeline, and a GraphRAG pipeline. Each model answered nine clinically relevant questions for a synthetic cohort of 70 CKD patients. Outputs were assessed for clinical correctness, patient-specificity, and clarity, using both clinician-led evaluations and an LLM-as-Judge framework. Results RAG-based methods outperformed the baseline LLM in clinical correctness and guideline adherence. GraphRAG achieved the highest patient-specificity by leveraging multi-hop relationships across a knowledge graph derived from NICE CKD guidelines, particularly for tasks involving thresholds, algorithmic decisions, or open-ended management. However, GraphRAG scored lower in clarity, as its graph walks often returned long guideline excerpts that obscured key recommendations.",
                "management. However, GraphRAG scored lower in clarity, as its graph walks often returned long guideline excerpts that obscured key recommendations. All RAG systems were limited by the scope of the indexed guideline and performed poorly when essential information was missing. Conclusions RAG and GraphRAG provide a scalable, auditable foundation for guideline-aligned CDS in CKD, with GraphRAG showing particular strengths in tailoring advice to patient data. Nonetheless, trade-offs remain between specificity and clarity, and effective deployment will require robust content management, transparent validation pipelines, and integration within established clinical governance frameworks.",
                "will require robust content management, transparent validation pipelines, and integration within established clinical governance frameworks. Key points - LLMs have comprehensive medical knowledge but require access to up-to-date, evidence-based, and locally relevant guidelines to be effective in CDS. - Hallucinations (the generation of inaccurate or misleading information) remain a major limitation for LLMs in healthcare. - Traditional information retrieval methods face several challenges in providing accurate, context-specific evidence. - Retrieval-Augmented Generation (RAG) and graph-based RAG approaches have emerged as promising solutions to overcome these limitations.",
                "evidence. - Retrieval-Augmented Generation (RAG) and graph-based RAG approaches have emerged as promising solutions to overcome these limitations. - Renal medicine provides an ideal test domain to evaluate these models, given its complexity and reliance on nuanced, multidisciplinary decision-making. - Studying LLM performance in kidney health can yield valuable insights into how such models can safely and effectively support complex clinical decision-making."
            ]
        },
        {
            "id": "doi_10.36227%2Ftechrxiv.173152556.61823435%2Fv1",
            "title": "A Multimodal Framework for Quantifying Retrieval-Augmented Generation Efficacy",
            "url": "https://doi.org/10.36227/techrxiv.173152556.61823435/v1",
            "source": "Crossref Academic Index",
            "license": "https://creativecommons.org/licenses/by-sa/4.0/",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 438,
                "chunks_count": 3,
                "language": "en",
                "char_length": 1752,
                "is_jsonl_ready": true,
                "relevance_match_score": 3,
                "doi": "10.36227/techrxiv.173152556.61823435/v1",
                "publisher": "Institute of Electrical and Electronics Engineers (IEEE)",
                "publication_year": 2024,
                "citation_count": 0,
                "journal_title": null,
                "type": "posted-content",
                "issn": null
            },
            "text": "This paper presents a new hybrid methodology for evaluating Retrieval-Augmented Generation (RAG) systems. The existing approaches provide a limited, one-dimensional assessment of RAG that lacks generalizability and is not suitable for interdisciplinary applications which cannot be extended to more than one domain. To address this, we propose a comprehensive framework that incorporates semantic metrics assessing several aspects, such as query relevance, factual accuracy, context consistency, semantic coherence, relevance, hallucination detection, and answer correctness using modern natural language processing tools. Our proposed methodology was compared to the existing state-of-the-art like llm-as-judge approach and outperformed the competitors by upto 10 . Our proposed methodology was compared to the existing state-of-the-art like llm-as-judge approach and outperformed the competitors by upto 10 . The architecture is designed for flexibility, making it applicable not only to RAG systems but also to a range of natural language generation tasks. This work extends the existing body of knowledge in both RAG systems as well as natural language generation tasks by providing a robust multidimensional evaluation approach. Robust scoring system that penalizes lower scores with Harmonic Means together with PCA, Adaptive, and Entropy weighting approaches identifies areas for improvement and provides specific retrieval and chunking method recommendations. PCA, Adaptive, and Entropy weighting approaches identifies areas for improvement and provides specific retrieval and chunking method recommendations. Semantic metrics provides a low-cost alternative evaluation technique suitable for closed or offline contexts across multiple domains.",
            "chunks": [
                "This paper presents a new hybrid methodology for evaluating Retrieval-Augmented Generation (RAG) systems. The existing approaches provide a limited, one-dimensional assessment of RAG that lacks generalizability and is not suitable for interdisciplinary applications which cannot be extended to more than one domain. To address this, we propose a comprehensive framework that incorporates semantic metrics assessing several aspects, such as query relevance, factual accuracy, context consistency, semantic coherence, relevance, hallucination detection, and answer correctness using modern natural language processing tools. Our proposed methodology was compared to the existing state-of-the-art like llm-as-judge approach and outperformed the competitors by upto 10 .",
                "Our proposed methodology was compared to the existing state-of-the-art like llm-as-judge approach and outperformed the competitors by upto 10 . The architecture is designed for flexibility, making it applicable not only to RAG systems but also to a range of natural language generation tasks. This work extends the existing body of knowledge in both RAG systems as well as natural language generation tasks by providing a robust multidimensional evaluation approach. Robust scoring system that penalizes lower scores with Harmonic Means together with PCA, Adaptive, and Entropy weighting approaches identifies areas for improvement and provides specific retrieval and chunking method recommendations.",
                "PCA, Adaptive, and Entropy weighting approaches identifies areas for improvement and provides specific retrieval and chunking method recommendations. Semantic metrics provides a low-cost alternative evaluation technique suitable for closed or offline contexts across multiple domains."
            ]
        },
        {
            "id": "doi_10.2196%2Fpreprints.79922",
            "title": "Imperatives for Retrieval-Augmented Generation in Clinical Nursing: Ensuring Responsible AI Implementation (Preprint)",
            "url": "https://doi.org/10.2196/preprints.79922",
            "source": "Crossref Academic Index",
            "license": "Publisher Proprietary / OpenAccess Check Required",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 307,
                "chunks_count": 2,
                "language": "en",
                "char_length": 1225,
                "is_jsonl_ready": true,
                "relevance_match_score": 2,
                "doi": "10.2196/preprints.79922",
                "publisher": "JMIR Publications Inc.",
                "publication_year": 2025,
                "citation_count": 0,
                "journal_title": null,
                "type": "posted-content",
                "issn": null
            },
            "text": "UNSTRUCTURED Retrieval-Augmented Generation models have emerged as a powerful technique for optimizing general large language models in specialized domains, and are being increasingly adopted by researchers in the medical field. This article acknowledges the significant potential of RAG to enhance clinical decision-making. However, it argues that researchers and practitioners must proactively address the ethical risks associated with RAG implementation in healthcare. Key considerations include ensuring accuracy, fairness, transparency, and accountability, as well as maintaining essential human oversight, as discussed in detail. We propose that robust data governance, explainable AI techniques, and continuous monitoring are critical components of a responsible RAG implementation strategy. robust data governance, explainable AI techniques, and continuous monitoring are critical components of a responsible RAG implementation strategy. Ultimately, realizing the benefits of RAG while mitigating ethical concerns requires collaboration among healthcare professionals, AI developers, and policymakers, fostering a future where AI supports patient safety, reduces disparities, and improves the quality of nursing care.",
            "chunks": [
                "UNSTRUCTURED Retrieval-Augmented Generation models have emerged as a powerful technique for optimizing general large language models in specialized domains, and are being increasingly adopted by researchers in the medical field. This article acknowledges the significant potential of RAG to enhance clinical decision-making. However, it argues that researchers and practitioners must proactively address the ethical risks associated with RAG implementation in healthcare. Key considerations include ensuring accuracy, fairness, transparency, and accountability, as well as maintaining essential human oversight, as discussed in detail. We propose that robust data governance, explainable AI techniques, and continuous monitoring are critical components of a responsible RAG implementation strategy.",
                "robust data governance, explainable AI techniques, and continuous monitoring are critical components of a responsible RAG implementation strategy. Ultimately, realizing the benefits of RAG while mitigating ethical concerns requires collaboration among healthcare professionals, AI developers, and policymakers, fostering a future where AI supports patient safety, reduces disparities, and improves the quality of nursing care."
            ]
        },
        {
            "id": "doi_10.2196%2Fpreprints.82026",
            "title": "Multi-Evidence Clinical Reasoning With Retrieval-Augmented Generation for Emergency Triage: Retrospective Evaluation Study (Preprint)",
            "url": "https://doi.org/10.2196/preprints.82026",
            "source": "Crossref Academic Index",
            "license": "Publisher Proprietary / OpenAccess Check Required",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 1040,
                "chunks_count": 6,
                "language": "en",
                "char_length": 4157,
                "is_jsonl_ready": true,
                "relevance_match_score": 6,
                "doi": "10.2196/preprints.82026",
                "publisher": "JMIR Publications Inc.",
                "publication_year": 2025,
                "citation_count": 1,
                "journal_title": null,
                "type": "posted-content",
                "issn": null
            },
            "text": "BACKGROUND Emergency triage accuracy is critical but varies with clinician experience, cognitive load, and case complexity. Mis-triage can delay care for high-risk patients and exacerbate crowding through unnecessary prioritization. Large language models (LLMs) show promise as triage decision-support tools but are vulnerable to hallucinations. Retrieval-augmented generation (RAG) may improve reliability by grounding LLM reasoning in authoritative guidelines and real clinical cases. OBJECTIVE This study aimed to evaluate whether a dual-source RAG system that integrates guideline- and case-based evidence improves emergency triage performance versus a baseline LLM and to assess how closely its urgency assignments align with expert consensus and outcome-defined clinical severity. versus a baseline LLM and to assess how closely its urgency assignments align with expert consensus and outcome-defined clinical severity. METHODS We developed a dual-source RAG system Multi-Evidence Clinical Reasoning RAG (MECR-RAG) that retrieves sections from the Hong Kong Accident and Emergency Triage Guidelines (HKAETG) and cases from a database of 3000 emergency department triage encounters. In a retrospective single center evaluation, MECR RAG and a prompt only baseline LLM (both DeepSeek V3) were tested on 236 routine triage encounters to predict 5 level triage categories. Expert consensus reference labels were assigned by blinded senior triage nurses. Primary outcomes were quadratic weighted kappa (QWK) and accuracy versus consensus labels. labels were assigned by blinded senior triage nurses. Primary outcomes were quadratic weighted kappa (QWK) and accuracy versus consensus labels. Secondary analyses examined performance within 3 operationally and clinically relevant triage bands immediate (categories 1 and 2), urgent (category 3), and nonurgent (categories 4 and 5). 
In 226 encounters with follow up, we also assigned outcome based severity tiers (R1-R3) using a published 3 level urgency reference standard and defined a disposition safety composite. RESULTS MECR RAG achieved a mean QWK of 0. 902 (SD 0. 0021; 95 CI 0. 901-0. 904) and accuracy of 0. 802 (SD 0. 0082; 95 CI 0. 795-0. 808), outperforming the baseline LLM (QWK 0. 801, SD 0. 004; accuracy 0. 542, SD 0. 0073; both lt;i gt;P lt; i gt; amp;lt;. 0. 0082; 95 CI 0. 795-0. 808), outperforming the baseline LLM (QWK 0. 801, SD 0. 004; accuracy 0. 542, SD 0. 0073; both lt;i gt;P lt; i gt; amp;lt;. 001) and demonstrating expert comparable agreement with triage nurses (interrater QWK 0. 887). In 3 group analysis, MECR RAG reduced overtriage from 68 236 (28. 8 ) with the baseline LLM to 30 236 (12. 7 ) and maintained low undertriage from 4 236 (1. 7 ) to 3 236 (1. 3 ), with the largest gains in the diagnostically ambiguous yet operationally important categories 3 and 4. In a secondary outcome based analysis defining high severity courses as R1 R2, MECR RAG detected high-risk patients more sensitively than initial nurse triage (124 130, 95. 4 vs 117 130, 90. 0 ; lt;i gt;P lt; i gt; . 02) while maintaining nurse level specificity. more sensitively than initial nurse triage (124 130, 95. 4 vs 117 130, 90. 0 ; lt;i gt;P lt; i gt; . 02) while maintaining nurse level specificity. MECR RAG yielded the lowest weighted harm index (13. 7, 19. 5, and 20. 3 per 100 patients for MECR RAG, nurses, and the baseline LLM, respectively). CONCLUSIONS A dual source RAG triage system that combines guideline based rules with case based reasoning achieved expert comparable agreement, reduced overtriage, and better aligned urgency assignments than a prompt only baseline LLM. based reasoning achieved expert comparable agreement, reduced overtriage, and better aligned urgency assignments than a prompt only baseline LLM. 
Secondary outcome based analyses in this cohort suggested more favorable triage patterns than initial nurse triage, supporting MECR RAG as a concurrent decision support layer that flags discordant or high risk assignments; prospective multicenter implementation studies are needed to determine effects on emergency department crowding, delays, and patient outcomes.",
            "chunks": [
                "BACKGROUND Emergency triage accuracy is critical but varies with clinician experience, cognitive load, and case complexity. Mis-triage can delay care for high-risk patients and exacerbate crowding through unnecessary prioritization. Large language models (LLMs) show promise as triage decision-support tools but are vulnerable to hallucinations. Retrieval-augmented generation (RAG) may improve reliability by grounding LLM reasoning in authoritative guidelines and real clinical cases. OBJECTIVE This study aimed to evaluate whether a dual-source RAG system that integrates guideline- and case-based evidence improves emergency triage performance versus a baseline LLM and to assess how closely its urgency assignments align with expert consensus and outcome-defined clinical severity.",
                "versus a baseline LLM and to assess how closely its urgency assignments align with expert consensus and outcome-defined clinical severity. METHODS We developed a dual-source RAG system Multi-Evidence Clinical Reasoning RAG (MECR-RAG) that retrieves sections from the Hong Kong Accident and Emergency Triage Guidelines (HKAETG) and cases from a database of 3000 emergency department triage encounters. In a retrospective single center evaluation, MECR RAG and a prompt only baseline LLM (both DeepSeek V3) were tested on 236 routine triage encounters to predict 5 level triage categories. Expert consensus reference labels were assigned by blinded senior triage nurses. Primary outcomes were quadratic weighted kappa (QWK) and accuracy versus consensus labels.",
                "labels were assigned by blinded senior triage nurses. Primary outcomes were quadratic weighted kappa (QWK) and accuracy versus consensus labels. Secondary analyses examined performance within 3 operationally and clinically relevant triage bands immediate (categories 1 and 2), urgent (category 3), and nonurgent (categories 4 and 5). In 226 encounters with follow up, we also assigned outcome based severity tiers (R1-R3) using a published 3 level urgency reference standard and defined a disposition safety composite. RESULTS MECR RAG achieved a mean QWK of 0. 902 (SD 0. 0021; 95 CI 0. 901-0. 904) and accuracy of 0. 802 (SD 0. 0082; 95 CI 0. 795-0. 808), outperforming the baseline LLM (QWK 0. 801, SD 0. 004; accuracy 0. 542, SD 0. 0073; both lt;i gt;P lt; i gt; amp;lt;.",
                "0. 0082; 95 CI 0. 795-0. 808), outperforming the baseline LLM (QWK 0. 801, SD 0. 004; accuracy 0. 542, SD 0. 0073; both lt;i gt;P lt; i gt; amp;lt;. 001) and demonstrating expert comparable agreement with triage nurses (interrater QWK 0. 887). In 3 group analysis, MECR RAG reduced overtriage from 68 236 (28. 8 ) with the baseline LLM to 30 236 (12. 7 ) and maintained low undertriage from 4 236 (1. 7 ) to 3 236 (1. 3 ), with the largest gains in the diagnostically ambiguous yet operationally important categories 3 and 4. In a secondary outcome based analysis defining high severity courses as R1 R2, MECR RAG detected high-risk patients more sensitively than initial nurse triage (124 130, 95. 4 vs 117 130, 90. 0 ; lt;i gt;P lt; i gt; . 02) while maintaining nurse level specificity.",
                "more sensitively than initial nurse triage (124 130, 95. 4 vs 117 130, 90. 0 ; lt;i gt;P lt; i gt; . 02) while maintaining nurse level specificity. MECR RAG yielded the lowest weighted harm index (13. 7, 19. 5, and 20. 3 per 100 patients for MECR RAG, nurses, and the baseline LLM, respectively). CONCLUSIONS A dual source RAG triage system that combines guideline based rules with case based reasoning achieved expert comparable agreement, reduced overtriage, and better aligned urgency assignments than a prompt only baseline LLM.",
                "based reasoning achieved expert comparable agreement, reduced overtriage, and better aligned urgency assignments than a prompt only baseline LLM. Secondary outcome based analyses in this cohort suggested more favorable triage patterns than initial nurse triage, supporting MECR RAG as a concurrent decision support layer that flags discordant or high risk assignments; prospective multicenter implementation studies are needed to determine effects on emergency department crowding, delays, and patient outcomes."
            ]
        },
        {
            "id": "doi_10.20944%2Fpreprints202512.2609.v1",
            "title": "An Enhanced Lightweight Clinical Decision Support System via Refined Fine-Tuning and Intelligent Retrieval-Augmented Generation",
            "url": "https://doi.org/10.20944/preprints202512.2609.v1",
            "source": "Crossref Academic Index",
            "license": "http://creativecommons.org/licenses/by/4.0",
            "image_url": null,
            "trust_score": 0.99,
            "metadata": {
                "estimated_tokens": 394,
                "chunks_count": 2,
                "language": "en",
                "char_length": 1576,
                "is_jsonl_ready": true,
                "relevance_match_score": 2,
                "doi": "10.20944/preprints202512.2609.v1",
                "publisher": "MDPI AG",
                "publication_year": 2025,
                "citation_count": 0,
                "journal_title": null,
                "type": "posted-content",
                "issn": null
            },
            "text": "The increasing complexity of clinical decision-making demands advanced support, yet traditional Clinical Decision Support Systems (CDSS) lack flexibility, and general Large Language Models (LLMs) struggle with medical specificity, factual accuracy, and resource demands. This paper presents an Enhanced Lightweight Clinical Decision Support System, optimizing the \"lightweight LLM Retrieval-Augmented Generation (RAG)\" architecture for superior accuracy, robustness, and resource efficiency. Our method employs a QLoRA fine-tuned base model and features two key innovations: a refined medical domain data fine-tuning strategy using semantic labeling and ontology-based domain balancing to enhance specialized knowledge; and an intelligent context optimization module within the RAG pipeline. and ontology-based domain balancing to enhance specialized knowledge; and an intelligent context optimization module within the RAG pipeline. This module utilizes secondary relevance re-ranking with a lightweight cross-encoder, redundancy reduction, and key information extraction to provide the LLM with precise and compact context. Experiments on medical benchmarks demonstrate that our system consistently outperforms a standard QLoRA fine-tuned model, achieving notable accuracy improvements in challenging domains such as College Medicine and Medical Genetics. This enhanced performance is achieved while maintaining a lightweight computational footprint, making our system a practical and reliable tool for clinical decision support, especially in resource-constrained settings.",
            "chunks": [
                "The increasing complexity of clinical decision-making demands advanced support, yet traditional Clinical Decision Support Systems (CDSS) lack flexibility, and general Large Language Models (LLMs) struggle with medical specificity, factual accuracy, and resource demands. This paper presents an Enhanced Lightweight Clinical Decision Support System, optimizing the \"lightweight LLM Retrieval-Augmented Generation (RAG)\" architecture for superior accuracy, robustness, and resource efficiency. Our method employs a QLoRA fine-tuned base model and features two key innovations: a refined medical domain data fine-tuning strategy using semantic labeling and ontology-based domain balancing to enhance specialized knowledge; and an intelligent context optimization module within the RAG pipeline.",
                "and ontology-based domain balancing to enhance specialized knowledge; and an intelligent context optimization module within the RAG pipeline. This module utilizes secondary relevance re-ranking with a lightweight cross-encoder, redundancy reduction, and key information extraction to provide the LLM with precise and compact context. Experiments on medical benchmarks demonstrate that our system consistently outperforms a standard QLoRA fine-tuned model, achieving notable accuracy improvements in challenging domains such as College Medicine and Medical Genetics. This enhanced performance is achieved while maintaining a lightweight computational footprint, making our system a practical and reliable tool for clinical decision support, especially in resource-constrained settings."
            ]
        },
        {
            "id": "wiki_62329",
            "title": "Meta-analysis",
            "url": "https://en.wikipedia.org/wiki/Meta-analysis",
            "source": "Wikipedia Core",
            "license": "CC BY-SA 3.0",
            "image_url": null,
            "trust_score": 0.85,
            "metadata": {
                "estimated_tokens": 2827,
                "chunks_count": 16,
                "language": "en",
                "char_length": 11308,
                "is_jsonl_ready": true,
                "relevance_match_score": 16
            },
            "text": "who stated \"Meta-analysis refers to the analysis of analyses\". Glass's work aimed at describing aggregated measures of relationships and effects. While Glass is credited with authoring the first modern meta-analysis, a paper published in 1904 by the statistician Karl Pearson in the British Medical Journal collated data from several studies of typhoid inoculation and is seen as the first time a meta-analytic approach was used to aggregate the outcomes of multiple clinical studies. Numerous other examples of early meta-analyses can be found including occupational aptitude testing, and agriculture. The first model meta-analysis was published in 1978 on the effectiveness of psychotherapy outcomes by Mary Lee Smith and Gene Glass. the statistical error and are potentially\noverconfident in their conclusions. Several fixes have been suggested but the debate continues on. A further concern is that the average treatment effect can sometimes be even less conservative compared to the fixed effect model and therefore misleading in practice. One interpretational fix that has been suggested is to create a prediction interval around the random effects estimate to portray the range of possible effects in practice. However, an assumption behind the calculation of such a prediction interval is that trials are considered more or less homogeneous entities and that included patient populations and comparator treatments should be considered exchangeable and this is usually unattainable in practice. to the contribution of variance due to random error that is used in any fixed effects meta-analysis model to generate weights for each study. The strength of the quality effects meta-analysis is that it allows available methodological evidence to be used over subjective random effects, and thereby helps to close the damaging gap which has opened up between methodology and statistics in clinical research. 
To do this a synthetic bias variance is computed based on quality information to adjust inverse variance weights and the quality adjusted weight of the ith study is introduced. These adjusted weights are then used in meta-analysis. methods (also called network meta-analyses, in particular when multiple treatments are assessed simultaneously) generally use two main methodologies. First, is the Bucher method, which is a single or repeated comparison of a closed loop of three-treatments such that one of them is common to the two studies and forms the node where the loop begins and ends. Therefore, multiple two-by-two comparisons (3-treatment loops) are needed to compare multiple treatments. This methodology requires that trials with more than two arms have two arms only selected as independent pair-wise comparisons are required. The alternative methodology uses complex statistical modelling to include the multiple arm trials and comparisons simultaneously between all competing treatments. These have been executed using Bayesian methods, mixed linear models and meta-regression approaches. Bayesian framework\nSpecifying a Bayesian network meta-analysis model involves writing a directed acyclic graph (DAG) model for general-purpose Markov chain Monte Carlo (MCMC) software such as WinBUGS. In addition, prior distributions have to be specified for a number of the parameters, and the data have to be supplied in a specific format. Together, the DAG, priors, and data form a Bayesian hierarchical model. method has been developed for complex networks by some researchers as a way to make this methodology available to the mainstream research community. This proposal does restrict each trial to two interventions, but also introduces a workaround for multiple arm trials: a different fixed control node can be selected in different runs. 
It also utilizes robust meta-analysis methods so that many of the problems highlighted above are avoided. Further research around this framework is required to determine if this is indeed superior to the Bayesian or multivariate frequentist frameworks. Researchers willing to try this out have access to this framework through a free software. Two commonly used models are the bivariate random-effects model and the hierarchical summary receiver operating characteristic (HSROC) model. These approaches are recommended by the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy and are widely used in reviews of screening tests, imaging tools, and laboratory diagnostics. Beyond the standard hierarchical models, other approaches have been developed to address various complexities in diagnostic accuracy synthesis. These include methods that incorporate differences in threshold effects, account for covariates through meta-regression, or improve applicability by considering test setting and clinical variation. Some frameworks aim to adapt the synthesis to reflect intended use conditions more directly. These extensions are part of an evolving body of methodology that reflects growing experience in the field and increasing demands from clinical and policy decision-makers. Aggregating IPD and AD\nMeta-analysis can also be applied to combine IPD and AD. This is convenient when the researchers who conduct the analysis have their own raw data while collecting aggregate or summary data from the literature. The generalized integration model (GIM) is a generalization of the meta-analysis. It allows that the model fitted on the individual participant data (IPD) is different from the ones used to compute the aggregate data (AD). and the mechanism by which the data came into being. 
A random effect can be present in either of these roles, but the two roles are quite distinct. There's no reason to think the analysis model and data-generation mechanism (model) are similar in form, but many sub-fields of statistics have developed the habit of assuming, for theory and simulations, that the data-generation mechanism (model) is identical to the analysis model we choose (or would like others to choose). As a hypothesized mechanism for producing the data, the random effect model for meta-analysis is silly and it is more appropriate to think of this model as a superficial description and something we choose as an analytical tool but this choice for meta-analysis may not work because the study effects are a fixed feature of the respective meta-analysis and the probability distribution is only a descriptive tool. Problems arising from agenda-driven bias\nThe most severe fault in meta-analysis often occurs when the person or persons doing the meta-analysis have an economic, social, or political agenda such as the passage or defeat of legislation. overall political, social, or economic goals in ways such as selecting small favorable data sets and not incorporating larger unfavorable data sets. The influence of such biases on the results of a meta-analysis is possible because the methodology of meta-analysis is highly malleable. A 2011 study done to disclose possible conflicts of interests in underlying research studies used for medical meta-analyses reviewed 29 meta-analyses and found that conflicts of interests in the studies underlying the meta-analyses were rarely disclosed. The 29 meta-analyses included 11 from general medicine journals, 15 from specialty medicine journals, and three from the Cochrane Database of Systematic Reviews. 
The 29 meta-analyses reviewed a total of 509 randomized controlled trials (RCTs). Of these, 318 RCTs reported funding sources, with 219 (69%) receiving funding from industry (i.e. one or more authors having financial ties to the pharmaceutical industry). Of the 509 RCTs, 132 reported author conflict of interest disclosures, with 91 studies (69%) disclosing one or more authors having financial ties to industry. The information was, however, seldom reflected in the meta-analyses. Only two (7%) reported RCT funding sources and none reported RCT author-industry ties. open data and open protocols may often not mitigate such problems, for instance as relevant factors and criteria could be unknown or not be recorded. There is a debate about the appropriate balance between testing with as few animals or humans as possible and the need to obtain robust, reliable findings. It has been argued that unreliable research is inefficient and wasteful and that studies are not just wasteful when they stop too late but also when they stop too early. In large clinical trials, planned, sequential analyses are sometimes used if there is considerable expense or potential harm associated with testing participants. In applied behavioural science, \"megastudies\" have been proposed to investigate the efficacy of many different interventions designed in an interdisciplinary manner by separate teams. One such study used a fitness chain to recruit a large number of participants. 
It has been suggested that behavioural interventions are often hard to compare in meta-analyses and reviews, as \"different scientists test different intervention ideas in different samples using different outcomes over different time intervals\", causing a lack of comparability of such individual investigations which limits \"their potential to inform policy\". Weak inclusion standards lead to misleading conclusions\nMeta-analyses in education are often not restrictive enough in regards to the methodological quality of the studies they include. For example, studies that include small samples or researcher-made measures lead to inflated effect size estimates. However, this problem also troubles meta-analysis of clinical trials. The use of different quality assessment tools (QATs) leads to including different studies and obtaining conflicting estimates of average treatment effects. to reduce variance of the estimator (see statistical models above). Thus some methodological weaknesses in studies can be corrected statistically. Other uses of meta-analytic methods include the development and validation of clinical prediction models, where meta-analysis may be used to combine individual participant data from different research centers and to assess the model's generalisability, or even to aggregate existing prediction models. Meta-analysis can be done with single-subject design as well as group research designs. This is important because much research has been done with single-subject research designs. Considerable dispute exists for the most appropriate meta-analytic technique for single subject research.",
            "chunks": [
                "who stated \"Meta-analysis refers to the analysis of analyses\". Glass's work aimed at describing aggregated measures of relationships and effects. While Glass is credited with authoring the first modern meta-analysis, a paper published in 1904 by the statistician Karl Pearson in the British Medical Journal collated data from several studies of typhoid inoculation and is seen as the first time a meta-analytic approach was used to aggregate the outcomes of multiple clinical studies. Numerous other examples of early meta-analyses can be found including occupational aptitude testing, and agriculture. The first model meta-analysis was published in 1978 on the effectiveness of psychotherapy outcomes by Mary Lee Smith and Gene Glass.",
                "the statistical error and are potentially\noverconfident in their conclusions. Several fixes have been suggested but the debate continues on. A further concern is that the average treatment effect can sometimes be even less conservative compared to the fixed effect model and therefore misleading in practice. One interpretational fix that has been suggested is to create a prediction interval around the random effects estimate to portray the range of possible effects in practice. However, an assumption behind the calculation of such a prediction interval is that trials are considered more or less homogeneous entities and that included patient populations and comparator treatments should be considered exchangeable and this is usually unattainable in practice.",
                "to the contribution of variance due to random error that is used in any fixed effects meta-analysis model to generate weights for each study. The strength of the quality effects meta-analysis is that it allows available methodological evidence to be used over subjective random effects, and thereby helps to close the damaging gap which has opened up between methodology and statistics in clinical research. To do this a synthetic bias variance is computed based on quality information to adjust inverse variance weights and the quality adjusted weight of the ith study is introduced. These adjusted weights are then used in meta-analysis.",
                "methods (also called network meta-analyses, in particular when multiple treatments are assessed simultaneously) generally use two main methodologies. First, is the Bucher methodwhich is a single or repeated comparison of a closed loop of three-treatments such that one of them is common to the two studies and forms the node where the loop begins and ends. Therefore, multiple two-by-two comparisons (3-treatment loops) are needed to compare multiple treatments. This methodology requires that trials with more than two arms have two arms only selected as independent pair-wise comparisons are required. The alternative methodology uses complex statistical modelling to include the multiple arm trials and comparisons simultaneously between all competing treatments.",
                "methodology uses complex statistical modelling to include the multiple arm trials and comparisons simultaneously between all competing treatments. These have been executed using Bayesian methods, mixed linear models and meta-regression approaches. Bayesian framework\nSpecifying a Bayesian network meta-analysis model involves writing a directed acyclic graph (DAG) model for general-purpose Markov chain Monte Carlo (MCMC) software such as WinBUGS. In addition, prior distributions have to be specified for a number of the parameters, and the data have to be supplied in a specific format. Together, the DAG, priors, and data form a Bayesian hierarchical model.",
                "method has been developed for complex networks by some researchers as a way to make this methodology available to the mainstream research community. This proposal does restrict each trial to two interventions, but also introduces a workaround for multiple arm trials: a different fixed control node can be selected in different runs. It also utilizes robust meta-analysis methods so that many of the problems highlighted above are avoided. Further research around this framework is required to determine if this is indeed superior to the Bayesian or multivariate frequentist frameworks. Researchers willing to try this out have access to this framework through a free software.",
                "Two commonly used models are the bivariate random-effects model and the hierarchical summary receiver operating characteristic (HSROC) model. These approaches are recommended by the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy and are widely used in reviews of screening tests, imaging tools, and laboratory diagnostics. Beyond the standard hierarchical models, other approaches have been developed to address various complexities in diagnostic accuracy synthesis. These include methods that incorporate differences in threshold effects, account for covariates through meta-regression, or improve applicability by considering test setting and clinical variation. Some frameworks aim to adapt the synthesis to reflect intended use conditions more directly.",
                "by considering test setting and clinical variation. Some frameworks aim to adapt the synthesis to reflect intended use conditions more directly. These extensions are part of an evolving body of methodology that reflects growing experience in the field and increasing demands from clinical and policy decision-makers. Aggregating IPD and AD\nMeta-analysis can also be applied to combine IPD and AD. This is convenient when the researchers who conduct the analysis have their own raw data while collecting aggregate or summary data from the literature. The generalized integration model (GIM) is a generalization of the meta-analysis. It allows that the model fitted on the individual participant data (IPD) is different from the ones used to compute the aggregate data (AD).",
                "and the mechanism by which the data came into being. A random effect can be present in either of these roles, but the two roles are quite distinct. There's no reason to think the analysis model and data-generation mechanism (model) are similar in form, but many sub-fields of statistics have developed the habit of assuming, for theory and simulations, that the data-generation mechanism (model) is identical to the analysis model we choose (or would like others to choose).",
                "for theory and simulations, that the data-generation mechanism (model) is identical to the analysis model we choose (or would like others to choose). As a hypothesized mechanisms for producing the data, the random effect model for meta-analysis is silly and it is more appropriate to think of this model as a superficial description and something we choose as an analytical tool but this choice for meta-analysis may not work because the study effects are a fixed feature of the respective meta-analysis and the probability distribution is only a descriptive tool. Problems arising from agenda-driven bias\nThe most severe fault in meta-analysis often occurs when the person or persons doing the meta-analysis have an economic, social, or political agenda such as the passage or defeat of legislation.",
                "overall political, social, or economic goals in ways such as selecting small favorable data sets and not incorporating larger unfavorable data sets. The influence of such biases on the results of a meta-analysis is possible because the methodology of meta-analysis is highly malleable. A 2011 study done to disclose possible conflicts of interests in underlying research studies used for medical meta-analyses reviewed 29 meta-analyses and found that conflicts of interests in the studies underlying the meta-analyses were rarely disclosed. The 29 meta-analyses included 11 from general medicine journals, 15 from specialty medicine journals, and three from the Cochrane Database of Systematic Reviews. The 29 meta-analyses reviewed a total of 509 randomized controlled trials (RCTs).",
                "and three from the Cochrane Database of Systematic Reviews. The 29 meta-analyses reviewed a total of 509 randomized controlled trials (RCTs). Of these, 318 RCTs reported funding sources, with 219 (69 ) receiving funding from industry (i. e. one or more authors\nhaving financial ties to the pharmaceutical industry). Of the 509 RCTs, 132 reported author conflict of interest disclosures, with 91 studies (69 ) disclosing one or more authors having financial ties to industry. The information was, however, seldom reflected in the meta-analyses. Only two (7 ) reported RCT funding sources and none reported RCT author-industry ties.",
                "open data and open protocols may often not mitigate such problems, for instance as relevant factors and criteria could be unknown or not be recorded. There is a debate about the appropriate balance between testing with as few animals or humans as possible and the need to obtain robust, reliable findings. It has been argued that unreliable research is inefficient and wasteful and that studies are not just wasteful when they stop too late but also when they stop too early. In large clinical trials, planned, sequential analyses are sometimes used if there is considerable expense or potential harm associated with testing participants.",
                "trials, planned, sequential analyses are sometimes used if there is considerable expense or potential harm associated with testing participants. In applied behavioural science, \"megastudies\" have been proposed to investigate the efficacy of many different interventions designed in an interdisciplinary manner by separate teams. One such study used a fitness chain to recruit a large number participants. It has been suggested that behavioural interventions are often hard to compare in meta-analyses and reviews , as \"different scientists test different intervention ideas in different samples using different outcomes over different time intervals\", causing a lack of comparability of such individual investigations which limits \"their potential to inform policy\".",
                "over different time intervals\", causing a lack of comparability of such individual investigations which limits \"their potential to inform policy\". Weak inclusion standards lead to misleading conclusions\nMeta-analyses in education are often not restrictive enough in regards to the methodological quality of the studies they include. For example, studies that include small samples or researcher-made measures lead to inflated effect size estimates. However, this problem also troubles meta-analysis of clinical trials. The use of different quality assessment tools (QATs) lead to including different studies and obtaining conflicting estimates of average treatment effects.",
                "to reduce variance of the estimator (see statistical models above). Thus some methodological weaknesses in studies can be corrected statistically. Other uses of meta-analytic methods include the development and validation of clinical prediction models, where meta-analysis may be used to combine individual participant data from different research centers and to assess the model's generalisability, or even to aggregate existing prediction models. Meta-analysis can be done with single-subject design as well as group research designs. This is important because much research has been done with single-subject research designs. Considerable dispute exists for the most appropriate meta-analytic technique for single subject research."
            ]
        }
    ]
}