
Excellent! Next you can create a new website with this list, or embed it in an existing web page by copying & pasting any of the following snippets.

JavaScript (easiest):

<script src="https://bibbase.org/show?bib=http://fenway.cs.uml.edu/papers/pubs-all.bib&nocache=1&simplegroups=1&groupby=year&proxy=1&jsonp=1"></script>

PHP:

<?php
$contents = file_get_contents("https://bibbase.org/show?bib=http://fenway.cs.uml.edu/papers/pubs-all.bib&nocache=1&simplegroups=1&groupby=year&proxy=1");
print_r($contents);
?>

iFrame (not recommended):

<iframe src="https://bibbase.org/show?bib=http://fenway.cs.uml.edu/papers/pubs-all.bib&nocache=1&simplegroups=1&groupby=year&proxy=1"></iframe>
For more details, see the documentation.
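As an illustration, a minimal stand-alone page that embeds this list with the JavaScript snippet might look like the sketch below. The page title, heading, and comments are placeholders rather than part of the BibBase output; the script tag is the one shown above, and the rendered list appears at the point where that tag is placed.

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <!-- Placeholder title; replace with your own. -->
    <title>Publications</title>
  </head>
  <body>
    <h1>Publications</h1>
    <!-- The BibBase script injects the grouped publication list here. -->
    <script src="https://bibbase.org/show?bib=http://fenway.cs.uml.edu/papers/pubs-all.bib&nocache=1&simplegroups=1&groupby=year&proxy=1&jsonp=1"></script>
  </body>
</html>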
2023 (23)
TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records.
Yang, Z.; Mitra, A.; Liu, W.; Berlowitz, D.; and Yu, H.
Nature Communications, 14(1): 1–10. November 2023.
Publisher: Nature Publishing Group
@article{yang_transformehr_2023, title = {{TransformEHR}: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records}, volume = {14}, copyright = {2023 This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply}, issn = {2041-1723}, shorttitle = {{TransformEHR}}, url = {https://www.nature.com/articles/s41467-023-43715-z}, doi = {10.1038/s41467-023-43715-z}, abstract = {Deep learning transformer-based models using longitudinal electronic health records (EHRs) have shown a great success in prediction of clinical diseases or outcomes. Pretraining on a large dataset can help such models map the input space better and boost their performance on relevant tasks through finetuning with limited data. In this study, we present TransformEHR, a generative encoder-decoder model with transformer that is pretrained using a new pretraining objective—predicting all diseases and outcomes of a patient at a future visit from previous visits. TransformEHR’s encoder-decoder framework, paired with the novel pretraining objective, helps it achieve the new state-of-the-art performance on multiple clinical prediction tasks. Comparing with the previous model, TransformEHR improves area under the precision–recall curve by 2\% (p \< 0.001) for pancreatic cancer onset and by 24\% (p = 0.007) for intentional self-harm in patients with post-traumatic stress disorder. The high performance in predicting intentional self-harm shows the potential of TransformEHR in building effective clinical intervention systems. TransformEHR is also generalizable and can be easily finetuned for clinical prediction tasks with limited data. Using AI to predict disease can improve interventions slow down or prevent disease. Here, the authors show that generative AI models built on the framework of Transformer, the model that also empowers ChatGPT, can achieve state-of-the-art performance on disease predictions based on longitudinal electronic records.}, language = {en}, number = {1}, urldate = {2023-11-30}, journal = {Nature Communications}, author = {Yang, Zhichao and Mitra, Avijit and Liu, Weisong and Berlowitz, Dan and Yu, Hong}, month = nov, year = {2023}, note = {Number: 1 Publisher: Nature Publishing Group}, keywords = {Computer science, Disease prevention, Experimental models of disease}, pages = {1--10}, }
Deep learning transformer-based models using longitudinal electronic health records (EHRs) have shown a great success in prediction of clinical diseases or outcomes. Pretraining on a large dataset can help such models map the input space better and boost their performance on relevant tasks through finetuning with limited data. In this study, we present TransformEHR, a generative encoder-decoder model with transformer that is pretrained using a new pretraining objective—predicting all diseases and outcomes of a patient at a future visit from previous visits. TransformEHR’s encoder-decoder framework, paired with the novel pretraining objective, helps it achieve the new state-of-the-art performance on multiple clinical prediction tasks. Comparing with the previous model, TransformEHR improves area under the precision–recall curve by 2% (p < 0.001) for pancreatic cancer onset and by 24% (p = 0.007) for intentional self-harm in patients with post-traumatic stress disorder. The high performance in predicting intentional self-harm shows the potential of TransformEHR in building effective clinical intervention systems. TransformEHR is also generalizable and can be easily finetuned for clinical prediction tasks with limited data. Using AI to predict disease can improve interventions slow down or prevent disease. Here, the authors show that generative AI models built on the framework of Transformer, the model that also empowers ChatGPT, can achieve state-of-the-art performance on disease predictions based on longitudinal electronic records.
NoteChat: A Dataset of Synthetic Doctor-Patient Conversations Conditioned on Clinical Notes.
Wang, J.; Yao, Z.; Yang, Z.; Zhou, H.; Li, R.; Wang, X.; Xu, Y.; and Yu, H.
October 2023.
arXiv:2310.15959 [cs]
@misc{wang_notechat_2023, title = {{NoteChat}: {A} {Dataset} of {Synthetic} {Doctor}-{Patient} {Conversations} {Conditioned} on {Clinical} {Notes}}, shorttitle = {{NoteChat}}, url = {http://arxiv.org/abs/2310.15959}, abstract = {The detailed clinical records drafted by doctors after each patient's visit are crucial for medical practitioners and researchers. Automating the creation of these notes with language models can reduce the workload of doctors. However, training such models can be difficult due to the limited public availability of conversations between patients and doctors. In this paper, we introduce NoteChat, a cooperative multi-agent framework leveraging Large Language Models (LLMs) for generating synthetic doctor-patient conversations conditioned on clinical notes. NoteChat consists of Planning, Roleplay, and Polish modules. We provide a comprehensive automatic and human evaluation of NoteChat, comparing it with state-of-the-art models, including OpenAI's ChatGPT and GPT-4. Results demonstrate that NoteChat facilitates high-quality synthetic doctor-patient conversations, underscoring the untapped potential of LLMs in healthcare. This work represents the first instance of multiple LLMs cooperating to complete a doctor-patient conversation conditioned on clinical notes, offering promising avenues for the intersection of AI and healthcare}, urldate = {2023-11-15}, publisher = {arXiv}, author = {Wang, Junda and Yao, Zonghai and Yang, Zhichao and Zhou, Huixue and Li, Rumeng and Wang, Xun and Xu, Yucheng and Yu, Hong}, month = oct, year = {2023}, note = {Number: arXiv:2310.15959 arXiv:2310.15959 [cs]}, keywords = {Computer Science - Computation and Language}, }
The detailed clinical records drafted by doctors after each patient's visit are crucial for medical practitioners and researchers. Automating the creation of these notes with language models can reduce the workload of doctors. However, training such models can be difficult due to the limited public availability of conversations between patients and doctors. In this paper, we introduce NoteChat, a cooperative multi-agent framework leveraging Large Language Models (LLMs) for generating synthetic doctor-patient conversations conditioned on clinical notes. NoteChat consists of Planning, Roleplay, and Polish modules. We provide a comprehensive automatic and human evaluation of NoteChat, comparing it with state-of-the-art models, including OpenAI's ChatGPT and GPT-4. Results demonstrate that NoteChat facilitates high-quality synthetic doctor-patient conversations, underscoring the untapped potential of LLMs in healthcare. This work represents the first instance of multiple LLMs cooperating to complete a doctor-patient conversation conditioned on clinical notes, offering promising avenues for the intersection of AI and healthcare
Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations.
Yang, Z.; Yao, Z.; Tasmin, M.; Vashisht, P.; Jang, W. S.; Ouyang, F.; Wang, B.; Berlowitz, D.; and Yu, H.
November 2023.
medRxiv preprint 2023.10.26.23297629
@misc{yang_performance_2023, title = {Performance of {Multimodal} {GPT}-{4V} on {USMLE} with {Image}: {Potential} for {Imaging} {Diagnostic} {Support} with {Explanations}}, copyright = {© 2023, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0/}, shorttitle = {Performance of {Multimodal} {GPT}-{4V} on {USMLE} with {Image}}, url = {https://www.medrxiv.org/content/10.1101/2023.10.26.23297629v2}, doi = {10.1101/2023.10.26.23297629}, abstract = {Background Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. Past research, however, has not had the scale and accuracy for use in clinical decision making. The power of large language models (LLMs) may be changing this. In this study, we evaluated the performance and interpretability of Generative Pre-trained Transformer 4 Vision (GPT-4V), a multimodal LLM, on medical licensing examination questions with images. Methods We used three sets of multiple-choice questions with images from United States Medical Licensing Examination (USMLE), USMLE question bank for medical students (AMBOSS), and Diagnostic Radiology Qualifying Core Exam (DRQCE) to test GPT-4V’s accuracy and explanation quality. We compared GPT-4V with two other large language models, GPT-4 and ChatGPT, which cannot process images. We also assessed the preference and feedback of healthcare professionals on GPT-4V’s explanations. Results GPT-4V achieved high accuracies on USMLE (86.2\%), AMBOSS (62.0\%), and DRQCE (73.1\%), outperforming ChatGPT and GPT-4 by relative increase of 131.8\% and 64.5\% on average. GPT-4V was in the 70th - 80th percentile with AMBOSS users preparing for the exam. GPT-4V also passed the full USMLE exam with an accuracy of 90.7\%. GPT-4V’s explanations were preferred by healthcare professionals when it answered correctly, but they revealed several issues such as image misunderstanding, text hallucination, and reasoning error when it answered incorrectly. Conclusion GPT-4V showed promising results for medical licensing examination questions with images, suggesting its potential for clinical decision support. However, GPT-4V needs to improve its explanation quality and reliability for clinical use. 1-2 sentence description AI models offer potential for imaging diagnostic support tool, but their performance and interpretability are often unclear. Here, the authors show that GPT-4V, a large multimodal language model, can achieve high accuracy on medical licensing exams with images, but also reveal several issues in its explanation quality.}, language = {en}, urldate = {2023-11-14}, publisher = {medRxiv}, author = {Yang, Zhichao and Yao, Zonghai and Tasmin, Mahbuba and Vashisht, Parth and Jang, Won Seok and Ouyang, Feiyun and Wang, Beining and Berlowitz, Dan and Yu, Hong}, month = nov, year = {2023}, note = {Pages: 2023.10.26.23297629}, }
Background Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. Past research, however, has not had the scale and accuracy for use in clinical decision making. The power of large language models (LLMs) may be changing this. In this study, we evaluated the performance and interpretability of Generative Pre-trained Transformer 4 Vision (GPT-4V), a multimodal LLM, on medical licensing examination questions with images. Methods We used three sets of multiple-choice questions with images from United States Medical Licensing Examination (USMLE), USMLE question bank for medical students (AMBOSS), and Diagnostic Radiology Qualifying Core Exam (DRQCE) to test GPT-4V’s accuracy and explanation quality. We compared GPT-4V with two other large language models, GPT-4 and ChatGPT, which cannot process images. We also assessed the preference and feedback of healthcare professionals on GPT-4V’s explanations. Results GPT-4V achieved high accuracies on USMLE (86.2%), AMBOSS (62.0%), and DRQCE (73.1%), outperforming ChatGPT and GPT-4 by relative increase of 131.8% and 64.5% on average. GPT-4V was in the 70th - 80th percentile with AMBOSS users preparing for the exam. GPT-4V also passed the full USMLE exam with an accuracy of 90.7%. GPT-4V’s explanations were preferred by healthcare professionals when it answered correctly, but they revealed several issues such as image misunderstanding, text hallucination, and reasoning error when it answered incorrectly. Conclusion GPT-4V showed promising results for medical licensing examination questions with images, suggesting its potential for clinical decision support. However, GPT-4V needs to improve its explanation quality and reliability for clinical use. 1-2 sentence description AI models offer potential for imaging diagnostic support tool, but their performance and interpretability are often unclear. Here, the authors show that GPT-4V, a large multimodal language model, can achieve high accuracy on medical licensing exams with images, but also reveal several issues in its explanation quality.
SELF-EXPLAIN: Teaching Large Language Models to Reason Complex Questions by Themselves.
Zhao, J.; Yao, Z.; Yang, Z.; and Yu, H.
November 2023.
arXiv:2311.06985 [cs]. R0-FoMo: Workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models at NeurIPS 2023.
@misc{zhao_self-explain_2023, title = {{SELF}-{EXPLAIN}: {Teaching} {Large} {Language} {Models} to {Reason} {Complex} {Questions} by {Themselves}}, shorttitle = {{SELF}-{EXPLAIN}}, url = {http://arxiv.org/abs/2311.06985}, abstract = {Large language models (LLMs) can generate intermediate reasoning steps. To elicit the reliable reasoning, the common practice is to employ few-shot chain-of-thought prompting, where several in-context demonstrations for reasoning are prepended to the question. However, such chain-of-thought examples are expensive to craft, especially for professional domains, and can have high variance depending on human annotators. Therefore, this work investigates whether LLMs can teach themselves to reason without human-crafted demonstrations. We propose SELF-EXPLAIN to generate CoT examples by LLMs inspired by "encoding specificity" in human memory retrieval. We find using self-explanations makes LLMs more confident, more calibrated and less biased when answering complex questions. Moreover, we find prompting with self-explanations can even significantly outperform using human-crafted CoTs on several complex question answering dataset.}, urldate = {2023-11-14}, publisher = {arXiv}, author = {Zhao, Jiachen and Yao, Zonghai and Yang, Zhichao and Yu, Hong}, month = nov, year = {2023}, note = {Number: arXiv:2311.06985 arXiv:2311.06985 [cs] R0-FoMo: Workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models at NeurIPS 2023.}, keywords = {Computer Science - Computation and Language}, }
Large language models (LLMs) can generate intermediate reasoning steps. To elicit the reliable reasoning, the common practice is to employ few-shot chain-of-thought prompting, where several in-context demonstrations for reasoning are prepended to the question. However, such chain-of-thought examples are expensive to craft, especially for professional domains, and can have high variance depending on human annotators. Therefore, this work investigates whether LLMs can teach themselves to reason without human-crafted demonstrations. We propose SELF-EXPLAIN to generate CoT examples by LLMs inspired by "encoding specificity" in human memory retrieval. We find using self-explanations makes LLMs more confident, more calibrated and less biased when answering complex questions. Moreover, we find prompting with self-explanations can even significantly outperform using human-crafted CoTs on several complex question answering dataset.
BioInstruct: Instruction Tuning of Large Language Models for Biomedical Natural Language Processing.
Tran, H.; Yang, Z.; Yao, Z.; and Yu, H.
November 2023.
arXiv:2310.19975 [cs]
@misc{tran_bioinstruct_2023, title = {{BioInstruct}: {Instruction} {Tuning} of {Large} {Language} {Models} for {Biomedical} {Natural} {Language} {Processing}}, shorttitle = {{BioInstruct}}, url = {http://arxiv.org/abs/2310.19975}, doi = {10.48550/arXiv.2310.19975}, abstract = {To enhance the performance of large language models (LLMs) in biomedical natural language processing (BioNLP) by introducing a domain-specific instruction dataset and examining its impact when combined with multi-task learning principles. We created the BioInstruct, comprising 25,005 instructions to instruction-tune LLMs(LLaMA 1 \& 2, 7B \& 13B version). The instructions were created by prompting the GPT-4 language model with three-seed samples randomly drawn from an 80 human curated instructions. We employed Low-Rank Adaptation(LoRA) for parameter-efficient fine-tuning. We then evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into three major categories: question answering(QA), information extraction(IE), and text generation(GEN). We also examined whether categories(e.g., QA, IE, and generation) of instructions impact model performance. Comparing with LLMs without instruction-tuned, our instruction-tuned LLMs demonstrated marked performance gains: 17.3\% in QA, 5.7\% in IE, and 96\% in Generation tasks. Our 7B-parameter instruction-tuned LLaMA 1 model was competitive or even surpassed other LLMs in the biomedical domain that were also fine-tuned from LLaMA 1 with vast domain-specific data or a variety of tasks. Our results also show that the performance gain is significantly higher when instruction fine-tuning is conducted with closely related tasks. Our findings align with the observations of multi-task learning, suggesting the synergies between two tasks. The BioInstruct dataset serves as a valuable resource and instruction tuned LLMs lead to the best performing BioNLP applications.}, urldate = {2023-11-14}, publisher = {arXiv}, author = {Tran, Hieu and Yang, Zhichao and Yao, Zonghai and Yu, Hong}, month = nov, year = {2023}, note = {Number: arXiv:2310.19975 arXiv:2310.19975 [cs]}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
To enhance the performance of large language models (LLMs) in biomedical natural language processing (BioNLP) by introducing a domain-specific instruction dataset and examining its impact when combined with multi-task learning principles. We created the BioInstruct, comprising 25,005 instructions to instruction-tune LLMs(LLaMA 1 & 2, 7B & 13B version). The instructions were created by prompting the GPT-4 language model with three-seed samples randomly drawn from an 80 human curated instructions. We employed Low-Rank Adaptation(LoRA) for parameter-efficient fine-tuning. We then evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into three major categories: question answering(QA), information extraction(IE), and text generation(GEN). We also examined whether categories(e.g., QA, IE, and generation) of instructions impact model performance. Comparing with LLMs without instruction-tuned, our instruction-tuned LLMs demonstrated marked performance gains: 17.3% in QA, 5.7% in IE, and 96% in Generation tasks. Our 7B-parameter instruction-tuned LLaMA 1 model was competitive or even surpassed other LLMs in the biomedical domain that were also fine-tuned from LLaMA 1 with vast domain-specific data or a variety of tasks. Our results also show that the performance gain is significantly higher when instruction fine-tuning is conducted with closely related tasks. Our findings align with the observations of multi-task learning, suggesting the synergies between two tasks. The BioInstruct dataset serves as a valuable resource and instruction tuned LLMs lead to the best performing BioNLP applications.
Context Variance Evaluation of Pretrained Language Models for Prompt-based Biomedical Knowledge Probing.
Yao, Z.; Cao, Y.; Yang, Z.; and Yu, H.
AMIA Summits on Translational Science Proceedings, 2023: 592–601. June 2023.
@article{yao_context_2023, title = {Context {Variance} {Evaluation} of {Pretrained} {Language} {Models} for {Prompt}-based {Biomedical} {Knowledge} {Probing}}, volume = {2023}, issn = {2153-4063}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10283095/}, abstract = {Pretrained language models (PLMs) have motivated research on what kinds of knowledge these models learn. Fill-in-the-blanks problem (e.g., cloze tests) is a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses the Top-k accuracy metric to evaluate different PLMs’ knowledge. However, existing research has shown that such prompt-based knowledge probing methods can only probe a lower bound of knowledge. Many factors like prompt-based probing biases make the LAMA benchmark unreliable and unstable. This problem is more prominent in BioLAMA. The severe long-tailed distribution in vocabulary and large-N-M relation make the performance gap between LAMA and BioLAMA remain notable. To address these, we introduced context variance into the prompt generation and proposed a new rank-change-based evaluation metric. Different from the previous known-unknown evaluation criteria, we proposed the concept of ”Misunderstand” in LAMA for the first time. Through experiments on 12 PLMs, we showed that our context variance prompts and Understand-Confuse-Misunderstand (UCM) metric make BioLAMA more friendly to large-N-M relations and rare relations. We also conducted a set of control experiments to disentangle ”understand” from just ”read and copy”.}, urldate = {2023-11-14}, journal = {AMIA Summits on Translational Science Proceedings}, author = {Yao, Zonghai and Cao, Yi and Yang, Zhichao and Yu, Hong}, month = jun, year = {2023}, pmid = {37350903}, pmcid = {PMC10283095}, pages = {592--601}, }
Pretrained language models (PLMs) have motivated research on what kinds of knowledge these models learn. Fill-in-the-blanks problem (e.g., cloze tests) is a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses the Top-k accuracy metric to evaluate different PLMs’ knowledge. However, existing research has shown that such prompt-based knowledge probing methods can only probe a lower bound of knowledge. Many factors like prompt-based probing biases make the LAMA benchmark unreliable and unstable. This problem is more prominent in BioLAMA. The severe long-tailed distribution in vocabulary and large-N-M relation make the performance gap between LAMA and BioLAMA remain notable. To address these, we introduced context variance into the prompt generation and proposed a new rank-change-based evaluation metric. Different from the previous known-unknown evaluation criteria, we proposed the concept of ”Misunderstand” in LAMA for the first time. Through experiments on 12 PLMs, we showed that our context variance prompts and Understand-Confuse-Misunderstand (UCM) metric make BioLAMA more friendly to large-N-M relations and rare relations. We also conducted a set of control experiments to disentangle ”understand” from just ”read and copy”.
Extracting Biomedical Factual Knowledge Using Pretrained Language Model and Electronic Health Record Context.
Yao, Z.; Cao, Y.; Yang, Z.; Deshpande, V.; and Yu, H.
AMIA Annual Symposium Proceedings, 2022: 1188–1197. April 2023.
@article{yao_extracting_2023, title = {Extracting {Biomedical} {Factual} {Knowledge} {Using} {Pretrained} {Language} {Model} and {Electronic} {Health} {Record} {Context}}, volume = {2022}, issn = {1942-597X}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10148358/}, abstract = {Language Models (LMs) have performed well on biomedical natural language processing applications. In this study, we conducted some experiments to use prompt methods to extract knowledge from LMs as new knowledge Bases (LMs as KBs). However, prompting can only be used as a low bound for knowledge extraction, and perform particularly poorly on biomedical domain KBs. In order to make LMs as KBs more in line with the actual application scenarios of the biomedical domain, we specifically add EHR notes as context to the prompt to improve the low bound in the biomedical domain. We design and validate a series of experiments for our Dynamic-Context-BioLAMA task. Our experiments show that the knowledge possessed by those language models can distinguish the correct knowledge from the noise knowledge in the EHR notes, and such distinguishing ability can also be used as a new metric to evaluate the amount of knowledge possessed by the model.}, urldate = {2023-11-14}, journal = {AMIA Annual Symposium Proceedings}, author = {Yao, Zonghai and Cao, Yi and Yang, Zhichao and Deshpande, Vijeta and Yu, Hong}, month = apr, year = {2023}, pmid = {37128373}, pmcid = {PMC10148358}, pages = {1188--1197}, }
Language Models (LMs) have performed well on biomedical natural language processing applications. In this study, we conducted some experiments to use prompt methods to extract knowledge from LMs as new knowledge Bases (LMs as KBs). However, prompting can only be used as a low bound for knowledge extraction, and perform particularly poorly on biomedical domain KBs. In order to make LMs as KBs more in line with the actual application scenarios of the biomedical domain, we specifically add EHR notes as context to the prompt to improve the low bound in the biomedical domain. We design and validate a series of experiments for our Dynamic-Context-BioLAMA task. Our experiments show that the knowledge possessed by those language models can distinguish the correct knowledge from the noise knowledge in the EHR notes, and such distinguishing ability can also be used as a new metric to evaluate the amount of knowledge possessed by the model.
UMASS_BioNLP at MEDIQA-Chat 2023: Can LLMs generate high-quality synthetic note-oriented doctor-patient conversations?
Wang, J.; Yao, Z.; Mitra, A.; Osebe, S.; Yang, Z.; and Yu, H.
In Naumann, T.; Ben Abacha, A.; Bethard, S.; Roberts, K.; and Rumshisky, A., editors, Proceedings of the 5th Clinical Natural Language Processing Workshop, pages 460–471, Toronto, Canada, July 2023. Association for Computational Linguistics.
@inproceedings{wang_umass_bionlp_2023, address = {Toronto, Canada}, title = {{UMASS}\_BioNLP at {MEDIQA}-{Chat} 2023: {Can} {LLMs} generate high-quality synthetic note-oriented doctor-patient conversations?}, shorttitle = {{UMASS}\_BioNLP at {MEDIQA}-{Chat} 2023}, url = {https://aclanthology.org/2023.clinicalnlp-1.49}, doi = {10.18653/v1/2023.clinicalnlp-1.49}, abstract = {This paper presents UMASS\_BioNLP team participation in the MEDIQA-Chat 2023 shared task for Task-A and Task-C. We focus especially on Task-C and propose a novel LLMs cooperation system named a doctor-patient loop to generate high-quality conversation data sets. The experiment results demonstrate that our approaches yield reasonable performance as evaluated by automatic metrics such as ROUGE, medical concept recall, BLEU, and Self-BLEU. Furthermore, we conducted a comparative analysis between our proposed method and ChatGPT and GPT-4. This analysis also investigates the potential of utilizing cooperation LLMs to generate high-quality datasets.}, urldate = {2023-11-14}, booktitle = {Proceedings of the 5th {Clinical} {Natural} {Language} {Processing} {Workshop}}, publisher = {Association for Computational Linguistics}, author = {Wang, Junda and Yao, Zonghai and Mitra, Avijit and Osebe, Samuel and Yang, Zhichao and Yu, Hong}, editor = {Naumann, Tristan and Ben Abacha, Asma and Bethard, Steven and Roberts, Kirk and Rumshisky, Anna}, month = jul, year = {2023}, pages = {460--471}, }
This paper presents UMASS_BioNLP team participation in the MEDIQA-Chat 2023 shared task for Task-A and Task-C. We focus especially on Task-C and propose a novel LLMs cooperation system named a doctor-patient loop to generate high-quality conversation data sets. The experiment results demonstrate that our approaches yield reasonable performance as evaluated by automatic metrics such as ROUGE, medical concept recall, BLEU, and Self-BLEU. Furthermore, we conducted a comparative analysis between our proposed method and ChatGPT and GPT-4. This analysis also investigates the potential of utilizing cooperation LLMs to generate high-quality datasets.
EHRTutor: Enhancing Patient Understanding of Discharge Instructions.
Zhang, Z.; Yao, Z.; Zhou, H.; Ouyang, F.; and Yu, H.
October 2023.
To appear in the NeurIPS'23 Workshop on Generative AI for Education (GAIED), December 2023, New Orleans.
@misc{zhang_ehrtutor_2023, title = {{EHRTutor}: {Enhancing} {Patient} {Understanding} of {Discharge} {Instructions}}, shorttitle = {{EHRTutor}}, url = {http://arxiv.org/abs/2310.19212}, doi = {10.48550/arXiv.2310.19212}, abstract = {Large language models have shown success as a tutor in education in various fields. Educating patients about their clinical visits plays a pivotal role in patients' adherence to their treatment plans post-discharge. This paper presents EHRTutor, an innovative multi-component framework leveraging the Large Language Model (LLM) for patient education through conversational question-answering. EHRTutor first formulates questions pertaining to the electronic health record discharge instructions. It then educates the patient through conversation by administering each question as a test. Finally, it generates a summary at the end of the conversation. Evaluation results using LLMs and domain experts have shown a clear preference for EHRTutor over the baseline. Moreover, EHRTutor also offers a framework for generating synthetic patient education dialogues that can be used for future in-house system training.}, urldate = {2023-11-01}, publisher = {arXiv}, author = {Zhang, Zihao and Yao, Zonghai and Zhou, Huixue and ouyang, Feiyun and Yu, Hong}, month = oct, year = {2023}, note = {To appear in NeurIPS'23 Workshop on Generative AI for Education (GAIED), December, New Orleans}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Large language models have shown success as a tutor in education in various fields. Educating patients about their clinical visits plays a pivotal role in patients' adherence to their treatment plans post-discharge. This paper presents EHRTutor, an innovative multi-component framework leveraging the Large Language Model (LLM) for patient education through conversational question-answering. EHRTutor first formulates questions pertaining to the electronic health record discharge instructions. It then educates the patient through conversation by administering each question as a test. Finally, it generates a summary at the end of the conversation. Evaluation results using LLMs and domain experts have shown a clear preference for EHRTutor over the baseline. Moreover, EHRTutor also offers a framework for generating synthetic patient education dialogues that can be used for future in-house system training.
Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization.
Mishra, P.; Yao, Z.; Chen, S.; Wang, B.; Mittal, R.; and Yu, H.
October 2023.
Accepted at the SyntheticData4ML workshop at NeurIPS 2023.
@misc{mishra_synthetic_2023, title = {Synthetic {Imitation} {Edit} {Feedback} for {Factual} {Alignment} in {Clinical} {Summarization}}, url = {http://arxiv.org/abs/2310.20033}, abstract = {Large Language Models (LLMs) like the GPT and LLaMA families have demonstrated exceptional capabilities in capturing and condensing critical contextual information and achieving state-of-the-art performance in the summarization task. However, community concerns about these models' hallucination issues continue to rise. LLMs sometimes generate factually hallucinated summaries, which can be extremely harmful in the clinical domain NLP tasks (e.g., clinical note summarization), where factually incorrect statements can lead to critically erroneous diagnoses. Fine-tuning LLMs using human feedback has shown the promise of aligning LLMs to be factually consistent during generation, but such training procedure requires high-quality human-annotated data, which can be extremely expensive to get in the clinical domain. In this work, we propose a new pipeline using ChatGPT instead of human experts to generate high-quality feedback data for improving factual consistency in the clinical note summarization task. We focus specifically on edit feedback because recent work discusses the shortcomings of human alignment via preference feedback in complex situations (such as clinical NLP tasks that require extensive expert knowledge), as well as some advantages of collecting edit feedback from domain experts. In addition, although GPT has reached the expert level in many clinical NLP tasks (e.g., USMLE QA), there is not much previous work discussing whether GPT can generate expert-level edit feedback for LMs in the clinical note summarization task. We hope to fill this gap. Finally, our evaluations demonstrate the potential use of GPT edits in human alignment, especially from a factuality perspective.}, urldate = {2023-11-01}, publisher = {arXiv}, author = {Mishra, Prakamya and Yao, Zonghai and Chen, Shuwei and Wang, Beining and Mittal, Rohan and Yu, Hong}, month = oct, year = {2023}, note = {NeurIPS 2023 Workshop SyntheticData4ML Accepted}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Large Language Models (LLMs) like the GPT and LLaMA families have demonstrated exceptional capabilities in capturing and condensing critical contextual information and achieving state-of-the-art performance in the summarization task. However, community concerns about these models' hallucination issues continue to rise. LLMs sometimes generate factually hallucinated summaries, which can be extremely harmful in the clinical domain NLP tasks (e.g., clinical note summarization), where factually incorrect statements can lead to critically erroneous diagnoses. Fine-tuning LLMs using human feedback has shown the promise of aligning LLMs to be factually consistent during generation, but such training procedure requires high-quality human-annotated data, which can be extremely expensive to get in the clinical domain. In this work, we propose a new pipeline using ChatGPT instead of human experts to generate high-quality feedback data for improving factual consistency in the clinical note summarization task. We focus specifically on edit feedback because recent work discusses the shortcomings of human alignment via preference feedback in complex situations (such as clinical NLP tasks that require extensive expert knowledge), as well as some advantages of collecting edit feedback from domain experts. In addition, although GPT has reached the expert level in many clinical NLP tasks (e.g., USMLE QA), there is not much previous work discussing whether GPT can generate expert-level edit feedback for LMs in the clinical note summarization task. We hope to fill this gap. Finally, our evaluations demonstrate the potential use of GPT edits in human alignment, especially from a factuality perspective.
Improving Summarization with Human Edits.
Yao, Z.; Schloss, B. J.; and Selvaraj, S. P.
December 2023.
EMNLP 2023
@misc{yao_improving_2023, title = {Improving {Summarization} with {Human} {Edits}}, url = {http://arxiv.org/abs/2310.05857}, abstract = {Recent work has shown the promise of learning with human feedback paradigms to produce human-determined high-quality text. Existing works use human feedback to train large language models (LLMs) in general domain abstractive summarization and have obtained summary quality exceeding traditional likelihood training. In this paper, we focus on a less explored form of human feedback -- Human Edits. We propose Sequence Alignment (un)Likelihood Training (SALT), a novel technique to use both the human-edited and model-generated data together in the training loop. In addition, we demonstrate simulating Human Edits with ground truth summaries coming from existing training data -- Imitation edits, along with the model-generated summaries obtained after the training, to reduce the need for expensive human-edit data. In our experiments, we extend human feedback exploration from general domain summarization to medical domain summarization. Our results demonstrate the effectiveness of SALT to improve the summary quality with Human and Imitation Edits.}, urldate = {2023-10-10}, publisher = {arXiv}, author = {Yao, Zonghai and Schloss, Benjamin J. and Selvaraj, Sai P.}, month = dec, year = {2023}, note = {EMNLP 2023}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning}, }
Recent work has shown the promise of learning with human feedback paradigms to produce human-determined high-quality text. Existing works use human feedback to train large language models (LLMs) in general domain abstractive summarization and have obtained summary quality exceeding traditional likelihood training. In this paper, we focus on a less explored form of human feedback – Human Edits. We propose Sequence Alignment (un)Likelihood Training (SALT), a novel technique to use both the human-edited and model-generated data together in the training loop. In addition, we demonstrate simulating Human Edits with ground truth summaries coming from existing training data – Imitation edits, along with the model-generated summaries obtained after the training, to reduce the need for expensive human-edit data. In our experiments, we extend human feedback exploration from general domain summarization to medical domain summarization. Our results demonstrate the effectiveness of SALT to improve the summary quality with Human and Imitation Edits.
Intentional Self-Harm Among US Veterans With Traumatic Brain Injury or Posttraumatic Stress Disorder: Retrospective Cohort Study From 2008 to 2017.
Rawat, B. P. S.; Reisman, J.; Pogoda, T. K.; Liu, W.; Rongali, S.; Aseltine Jr, R. H.; Chen, K.; Tsai, J.; Berlowitz, D.; Yu, H.; and Carlson, K. F.
JMIR Public Health and Surveillance, 9: e42803. July 2023.
@article{rawat_intentional_2023, title = {Intentional {Self}-{Harm} {Among} {US} {Veterans} {With} {Traumatic} {Brain} {Injury} or {Posttraumatic} {Stress} {Disorder}: {Retrospective} {Cohort} {Study} {From} 2008 to 2017}, volume = {9}, issn = {2369-2960}, shorttitle = {Intentional {Self}-{Harm} {Among} {US} {Veterans} {With} {Traumatic} {Brain} {Injury} or {Posttraumatic} {Stress} {Disorder}}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10407646/}, doi = {10.2196/42803}, abstract = {Background Veterans with a history of traumatic brain injury (TBI) and/or posttraumatic stress disorder (PTSD) may be at increased risk of suicide attempts and other forms of intentional self-harm as compared to veterans without TBI or PTSD. Objective Using administrative data from the US Veterans Health Administration (VHA), we studied associations between TBI and PTSD diagnoses, and subsequent diagnoses of intentional self-harm among US veterans who used VHA health care between 2008 and 2017. Methods All veterans with encounters or hospitalizations for intentional self-harm were assigned “index dates” corresponding to the date of the first related visit; among those without intentional self-harm, we randomly selected a date from among the veteran’s health care encounters to match the distribution of case index dates over the 10-year period. We then examined the prevalence of TBI and PTSD diagnoses within the 5-year period prior to veterans’ index dates. TBI, PTSD, and intentional self-harm were identified using International Classification of Diseases diagnosis and external cause of injury codes from inpatient and outpatient VHA encounters. We stratified analyses by veterans’ average yearly VHA utilization in the 5-year period before their index date (low, medium, or high). Variations in prevalence and odds of intentional self-harm diagnoses were compared by veterans’ prior TBI and PTSD diagnosis status (TBI only, PTSD only, and comorbid TBI/PTSD) for each VHA utilization stratum. Multivariable models adjusted for age, sex, race, ethnicity, marital status, Department of Veterans Affairs service-connection status, and Charlson Comorbidity Index scores. Results About 6.7 million veterans with at least two VHA visits in the 5-year period before their index dates were included in the analyses; 86,644 had at least one intentional self-harm diagnosis during the study period. During the periods prior to veterans’ index dates, 93,866 were diagnosed with TBI only; 892,420 with PTSD only; and 102,549 with comorbid TBI/PTSD. Across all three VHA utilization strata, the prevalence of intentional self-harm diagnoses was higher among veterans diagnosed with TBI, PTSD, or TBI/PTSD than among veterans with neither diagnosis. The observed difference was most pronounced among veterans in the high VHA utilization stratum. The prevalence of intentional self-harm was six times higher among those with comorbid TBI/PTSD (6778/58,295, 11.63\%) than among veterans with neither TBI nor PTSD (21,979/1,144,991, 1.92\%). Adjusted odds ratios suggested that, after accounting for potential confounders, veterans with TBI, PTSD, or comorbid TBI/PTSD had higher odds of self-harm compared to veterans without these diagnoses. Among veterans with high VHA utilization, those with comorbid TBI/PTSD were 4.26 (95\% CI 4.15-4.38) times more likely to receive diagnoses for intentional self-harm than veterans with neither diagnosis. This pattern was similar for veterans with low and medium VHA utilization. 
Conclusions Veterans with TBI and/or PTSD diagnoses, compared to those with neither diagnosis, were substantially more likely to be subsequently diagnosed with intentional self-harm between 2008 and 2017. These associations were most pronounced among veterans who used VHA health care most frequently. These findings suggest a need for suicide prevention efforts targeted at veterans with these diagnoses.}, urldate = {2023-09-13}, journal = {JMIR Public Health and Surveillance}, author = {Rawat, Bhanu Pratap Singh and Reisman, Joel and Pogoda, Terri K and Liu, Weisong and Rongali, Subendhu and Aseltine Jr, Robert H and Chen, Kun and Tsai, Jack and Berlowitz, Dan and Yu, Hong and Carlson, Kathleen F}, month = jul, year = {2023}, pmid = {37486751}, pmcid = {PMC10407646}, pages = {e42803}, }
Background Veterans with a history of traumatic brain injury (TBI) and/or posttraumatic stress disorder (PTSD) may be at increased risk of suicide attempts and other forms of intentional self-harm as compared to veterans without TBI or PTSD. Objective Using administrative data from the US Veterans Health Administration (VHA), we studied associations between TBI and PTSD diagnoses, and subsequent diagnoses of intentional self-harm among US veterans who used VHA health care between 2008 and 2017. Methods All veterans with encounters or hospitalizations for intentional self-harm were assigned “index dates” corresponding to the date of the first related visit; among those without intentional self-harm, we randomly selected a date from among the veteran’s health care encounters to match the distribution of case index dates over the 10-year period. We then examined the prevalence of TBI and PTSD diagnoses within the 5-year period prior to veterans’ index dates. TBI, PTSD, and intentional self-harm were identified using International Classification of Diseases diagnosis and external cause of injury codes from inpatient and outpatient VHA encounters. We stratified analyses by veterans’ average yearly VHA utilization in the 5-year period before their index date (low, medium, or high). Variations in prevalence and odds of intentional self-harm diagnoses were compared by veterans’ prior TBI and PTSD diagnosis status (TBI only, PTSD only, and comorbid TBI/PTSD) for each VHA utilization stratum. Multivariable models adjusted for age, sex, race, ethnicity, marital status, Department of Veterans Affairs service-connection status, and Charlson Comorbidity Index scores. Results About 6.7 million veterans with at least two VHA visits in the 5-year period before their index dates were included in the analyses; 86,644 had at least one intentional self-harm diagnosis during the study period. During the periods prior to veterans’ index dates, 93,866 were diagnosed with TBI only; 892,420 with PTSD only; and 102,549 with comorbid TBI/PTSD. Across all three VHA utilization strata, the prevalence of intentional self-harm diagnoses was higher among veterans diagnosed with TBI, PTSD, or TBI/PTSD than among veterans with neither diagnosis. The observed difference was most pronounced among veterans in the high VHA utilization stratum. The prevalence of intentional self-harm was six times higher among those with comorbid TBI/PTSD (6778/58,295, 11.63%) than among veterans with neither TBI nor PTSD (21,979/1,144,991, 1.92%). Adjusted odds ratios suggested that, after accounting for potential confounders, veterans with TBI, PTSD, or comorbid TBI/PTSD had higher odds of self-harm compared to veterans without these diagnoses. Among veterans with high VHA utilization, those with comorbid TBI/PTSD were 4.26 (95% CI 4.15-4.38) times more likely to receive diagnoses for intentional self-harm than veterans with neither diagnosis. This pattern was similar for veterans with low and medium VHA utilization. Conclusions Veterans with TBI and/or PTSD diagnoses, compared to those with neither diagnosis, were substantially more likely to be subsequently diagnosed with intentional self-harm between 2008 and 2017. These associations were most pronounced among veterans who used VHA health care most frequently. These findings suggest a need for suicide prevention efforts targeted at veterans with these diagnoses.
PaniniQA: Enhancing Patient Education Through Interactive Question Answering.
Cai, P.; Yao, Z.; Liu, F.; Wang, D.; Reilly, M.; Zhou, H.; Li, L.; Cao, Y.; Kapoor, A.; Bajracharya, A.; Berlowitz, D.; and Yu, H.
Transactions of the Association for Computational Linguistics. August 2023.
Equal contributions for the first two authors.
@article{cai_paniniqa_2023, title = {{PaniniQA}: {Enhancing} {Patient} {Education} {Through} {Interactive} {Question} {Answering}}, shorttitle = {{PaniniQA}}, url = {http://arxiv.org/abs/2308.03253}, abstract = {Patient portal allows discharged patients to access their personalized discharge instructions in electronic health records (EHRs). However, many patients have difficulty understanding or memorizing their discharge instructions. In this paper, we present PaniniQA, a patient-centric interactive question answering system designed to help patients understand their discharge instructions. PaniniQA first identifies important clinical content from patients' discharge instructions and then formulates patient-specific educational questions. In addition, PaniniQA is also equipped with answer verification functionality to provide timely feedback to correct patients' misunderstandings. Our comprehensive automatic and human evaluation results demonstrate our PaniniQA is capable of improving patients' mastery of their medical instructions through effective interactions}, urldate = {2023-08-08}, journal = {Transactions of the Association for Computational Linguistics}, author = {Cai, Pengshan and Yao, Zonghai and Liu, Fei and Wang, Dakuo and Reilly, Meghan and Zhou, Huixue and Li, Lingxi and Cao, Yi and Kapoor, Alok and Bajracharya, Adarsha and Berlowitz, Dan and Yu, Hong}, month = aug, year = {2023}, note = {Equal contributions for the first two authors.}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Patient portal allows discharged patients to access their personalized discharge instructions in electronic health records (EHRs). However, many patients have difficulty understanding or memorizing their discharge instructions. In this paper, we present PaniniQA, a patient-centric interactive question answering system designed to help patients understand their discharge instructions. PaniniQA first identifies important clinical content from patients' discharge instructions and then formulates patient-specific educational questions. In addition, PaniniQA is also equipped with answer verification functionality to provide timely feedback to correct patients' misunderstandings. Our comprehensive automatic and human evaluation results demonstrate our PaniniQA is capable of improving patients' mastery of their medical instructions through effective interactions
Revisiting the Architectures like Pointer Networks to Efficiently Improve the Next Word Distribution, Summarization Factuality, and Beyond.
Chang, H.; Yao, Z.; Gon, A.; Yu, H.; and McCallum, A.
July 2023.
ACL 2023, equal contribution from the first two authors.