MEDHALU: Hallucinations in Responses to Healthcare Queries by Large Language Models


Abstract

Large language models (LLMs) are starting to complement traditional information-seeking mechanisms such as web search. LLM-powered chatbots like ChatGPT are gaining prominence among the general public, and AI chatbots are increasingly producing content on social media platforms. However, LLMs are also prone to hallucinations, generating plausible yet factually incorrect or fabricated information. This becomes a critical problem when laypeople seek information about sensitive issues such as healthcare. Existing work on LLM hallucinations in the medical domain mainly focuses on testing the medical knowledge of LLMs through standardized medical exam questions, which are often well-defined and clear-cut with definitive answers. However, these approaches may not fully capture how LLMs perform during real-world interactions with patients.
This work conducts a pioneering study on hallucinations in LLM-generated responses to real-world healthcare queries from patients. We introduce MEDHALU, a novel medical hallucination benchmark featuring diverse health-related topics and hallucinated responses from LLMs, with detailed annotation of the hallucination types and text spans. We also propose MEDHALUDETECT, a comprehensive framework for evaluating LLMs’ abilities to detect hallucinations. Furthermore, we study the vulnerability to medical hallucinations among three groups — medical experts, LLMs, and laypeople. Notably, LLMs significantly underperform human experts and, in some cases, even laypeople in detecting medical hallucinations. To improve hallucination detection, we propose an expert-in-the-loop approach that integrates expert reasoning into LLM inputs, significantly improving hallucination detection for all LLMs, including a 6.3% macro-F1 improvement for GPT-4.
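To make the expert-in-the-loop idea concrete, the sketch below shows one way expert reasoning could be added to a detection prompt before querying an LLM. The prompt wording and the call_llm helper are illustrative assumptions, not the exact prompts or code used in the paper.

    from typing import Callable, Optional

    def build_detection_prompt(query: str, response: str,
                               expert_reasoning: Optional[str] = None) -> str:
        # Compose a hallucination-detection prompt; optionally include expert reasoning.
        prompt = (
            "You are given a healthcare question and an answer to it.\n"
            f"Question: {query}\n"
            f"Answer: {response}\n"
        )
        if expert_reasoning:
            prompt += f"Expert reasoning about the answer: {expert_reasoning}\n"
        prompt += ("Does the answer contain hallucinated content, i.e., plausible but "
                   "factually incorrect or fabricated information? Reply 'yes' or 'no'.")
        return prompt

    def detect_hallucination(call_llm: Callable[[str], str], query: str, response: str,
                             expert_reasoning: Optional[str] = None) -> bool:
        # call_llm is any function mapping a prompt string to the model's text reply
        # (e.g., a thin wrapper around a chat-completion API).
        reply = call_llm(build_detection_prompt(query, response, expert_reasoning))
        return reply.strip().lower().startswith("yes")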

Dataset and Code

An anonymized version of the dataset and the code used in our paper are available to the research community.

  1. MedHalu Dataset: This dataset features real-world queries on diverse health-related topics and hallucinated responses from LLMs, with detailed annotation of the hallucination types and hallucinated text spans.

  2. Code: Implementation details and code for MedHalu are available on GitHub.

You can find the format of the MedHalu dataset here.
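As a rough illustration, a record could be read as in the sketch below. The file name and field names are hypothetical placeholders; the authoritative schema is described at the link above.

    import json

    # Hypothetical example of iterating over MedHalu records stored as JSON Lines.
    # The path and field names are placeholders; see the linked format description
    # for the actual schema.
    with open("medhalu.jsonl", "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            query = record["question"]                  # real-world healthcare query
            answer = record["llm_response"]             # LLM-generated response
            hallu_type = record["hallucination_type"]   # annotated hallucination type
            hallu_spans = record["hallucinated_spans"]  # annotated hallucinated text spans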


Contact Us


If you are interested in using this data, please fill out the form to request the specific data you need; you will then receive a link from which you can download it.

We are sharing the dataset under the terms and conditions specified below. Please note that submitting the form indicates that you accept these terms and conditions. In the form, please indicate which part of the dataset you need. If you do not receive an email notification for your request within 24 hours, please e-mail us at netsys.noreply[at]gmail.com.

Dataset Terms and Conditions

  1. You will use the data solely for the purpose of non-profit research or non-profit education.

  2. You will respect the privacy of end users and organizations that may be identified in the data. You will not attempt to reverse engineer, decrypt, de-anonymize, derive, or otherwise re-identify anonymized information.

  3. You will not distribute the data beyond your immediate research group.

  4. If you create a publication using our dataset, please cite our paper as follows.

            @article{agarwal2024medhalu,
              title={MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models},
              author={Agarwal, Vibhor and Jin, Yiqiao and Chandra, Mohit and De Choudhury, Munmun and Kumar, Srijan and Sastry, Nishanth},
              journal={arXiv preprint arXiv:2409.19492},
              year={2024}
            }