Excellent! Next you can create a new website with this list, or embed it in an existing web page by copying & pasting any of the following snippets.
JavaScript (easiest):
<script src="https://bibbase.org/show?bib=http://fenway.cs.uml.edu/papers/pubs-all.bib&nocache=1&simplegroups=1&groupby=year&proxy=1&jsonp=1"></script>

PHP:
<?php
$contents = file_get_contents("https://bibbase.org/show?bib=http://fenway.cs.uml.edu/papers/pubs-all.bib&nocache=1&simplegroups=1&groupby=year&proxy=1");
print_r($contents);
?>

iFrame (not recommended):
<iframe src="https://bibbase.org/show?bib=http://fenway.cs.uml.edu/papers/pubs-all.bib&nocache=1&simplegroups=1&groupby=year&proxy=1"></iframe>
For more details see the documentation.
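For example, a minimal stand-alone page using the JavaScript snippet might look like the sketch below. Only the script tag and its URL come from the snippet above; the surrounding page structure (doctype, title, heading) is illustrative boilerplate, and BibBase is expected to render the year-grouped list where the script tag appears.

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Publications</title>
  </head>
  <body>
    <h1>Publications</h1>
    <!-- The BibBase jsonp=1 script should insert the rendered, year-grouped publication list at this position -->
    <script src="https://bibbase.org/show?bib=http://fenway.cs.uml.edu/papers/pubs-all.bib&nocache=1&simplegroups=1&groupby=year&proxy=1&jsonp=1"></script>
  </body>
</html>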
2024 (9)
LocalTweets to LocalHealth: A Mental Health Surveillance Framework Based on Twitter Data.
Deshpande, V.; Lee, M.; Yao, Z.; Zhang, Z.; Gibbons, J. B.; and Yu, H.
March 2024.
arXiv:2402.13452 [cs]
@misc{deshpande_localtweets_2024, title = {{LocalTweets} to {LocalHealth}: {A} {Mental} {Health} {Surveillance} {Framework} {Based} on {Twitter} {Data}}, shorttitle = {{LocalTweets} to {LocalHealth}}, url = {http://arxiv.org/abs/2402.13452}, abstract = {Prior research on Twitter (now X) data has provided positive evidence of its utility in developing supplementary health surveillance systems. In this study, we present a new framework to surveil public health, focusing on mental health (MH) outcomes. We hypothesize that locally posted tweets are indicative of local MH outcomes and collect tweets posted from 765 neighborhoods (census block groups) in the USA. We pair these tweets from each neighborhood with the corresponding MH outcome reported by the Center for Disease Control (CDC) to create a benchmark dataset, LocalTweets. With LocalTweets, we present the first population-level evaluation task for Twitter-based MH surveillance systems. We then develop an efficient and effective method, LocalHealth, for predicting MH outcomes based on LocalTweets. When used with GPT3.5, LocalHealth achieves the highest F1-score and accuracy of 0.7429 and 79.78{\textbackslash}\%, respectively, a 59{\textbackslash}\% improvement in F1-score over the GPT3.5 in zero-shot setting. We also utilize LocalHealth to extrapolate CDC's estimates to proxy unreported neighborhoods, achieving an F1-score of 0.7291. Our work suggests that Twitter data can be effectively leveraged to simulate neighborhood-level MH outcomes.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Deshpande, Vijeta and Lee, Minhwa and Yao, Zonghai and Zhang, Zihao and Gibbons, Jason Brian and Yu, Hong}, month = mar, year = {2024}, note = {arXiv:2402.13452 [cs]}, keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Social and Information Networks}, }
Prior research on Twitter (now X) data has provided positive evidence of its utility in developing supplementary health surveillance systems. In this study, we present a new framework to surveil public health, focusing on mental health (MH) outcomes. We hypothesize that locally posted tweets are indicative of local MH outcomes and collect tweets posted from 765 neighborhoods (census block groups) in the USA. We pair these tweets from each neighborhood with the corresponding MH outcome reported by the Center for Disease Control (CDC) to create a benchmark dataset, LocalTweets. With LocalTweets, we present the first population-level evaluation task for Twitter-based MH surveillance systems. We then develop an efficient and effective method, LocalHealth, for predicting MH outcomes based on LocalTweets. When used with GPT3.5, LocalHealth achieves the highest F1-score and accuracy of 0.7429 and 79.78\%, respectively, a 59\% improvement in F1-score over the GPT3.5 in zero-shot setting. We also utilize LocalHealth to extrapolate CDC's estimates to proxy unreported neighborhoods, achieving an F1-score of 0.7291. Our work suggests that Twitter data can be effectively leveraged to simulate neighborhood-level MH outcomes.
SYNFAC-EDIT: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization.
Mishra, P.; Yao, Z.; Vashisht, P.; Ouyang, F.; Wang, B.; Mody, V. D.; and Yu, H.
April 2024.
arXiv:2402.13919 [cs]
@misc{mishra_synfac-edit_2024, title = {{SYNFAC}-{EDIT}: {Synthetic} {Imitation} {Edit} {Feedback} for {Factual} {Alignment} in {Clinical} {Summarization}}, shorttitle = {{SYNFAC}-{EDIT}}, url = {http://arxiv.org/abs/2402.13919}, abstract = {Large Language Models (LLMs) such as GPT \& Llama have demonstrated significant achievements in summarization tasks but struggle with factual inaccuracies, a critical issue in clinical NLP applications where errors could lead to serious consequences. To counter the high costs and limited availability of expert-annotated data for factual alignment, this study introduces an innovative pipeline that utilizes {\textgreater}100B parameter GPT variants like GPT-3.5 \& GPT-4 to act as synthetic experts to generate high-quality synthetics feedback aimed at enhancing factual consistency in clinical note summarization. Our research primarily focuses on edit feedback generated by these synthetic feedback experts without additional human annotations, mirroring and optimizing the practical scenario in which medical professionals refine AI system outputs. Although such 100B+ parameter GPT variants have proven to demonstrate expertise in various clinical NLP tasks, such as the Medical Licensing Examination, there is scant research on their capacity to act as synthetic feedback experts and deliver expert-level edit feedback for improving the generation quality of weaker ({\textless}10B parameter) LLMs like GPT-2 (1.5B) \& Llama 2 (7B) in clinical domain. So in this work, we leverage 100B+ GPT variants to act as synthetic feedback experts offering expert-level edit feedback, that is used to reduce hallucinations and align weaker ({\textless}10B parameter) LLMs with medical facts using two distinct alignment algorithms (DPO \& SALT), endeavoring to narrow the divide between AI-generated content and factual accuracy. This highlights the substantial potential of LLM-based synthetic edits in enhancing the alignment of clinical factuality.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Mishra, Prakamya and Yao, Zonghai and Vashisht, Parth and Ouyang, Feiyun and Wang, Beining and Mody, Vidhi Dhaval and Yu, Hong}, month = apr, year = {2024}, note = {arXiv:2402.13919 [cs]}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Large Language Models (LLMs) such as GPT & Llama have demonstrated significant achievements in summarization tasks but struggle with factual inaccuracies, a critical issue in clinical NLP applications where errors could lead to serious consequences. To counter the high costs and limited availability of expert-annotated data for factual alignment, this study introduces an innovative pipeline that utilizes \textgreater100B parameter GPT variants like GPT-3.5 & GPT-4 to act as synthetic experts to generate high-quality synthetics feedback aimed at enhancing factual consistency in clinical note summarization. Our research primarily focuses on edit feedback generated by these synthetic feedback experts without additional human annotations, mirroring and optimizing the practical scenario in which medical professionals refine AI system outputs. Although such 100B+ parameter GPT variants have proven to demonstrate expertise in various clinical NLP tasks, such as the Medical Licensing Examination, there is scant research on their capacity to act as synthetic feedback experts and deliver expert-level edit feedback for improving the generation quality of weaker (\textless10B parameter) LLMs like GPT-2 (1.5B) & Llama 2 (7B) in clinical domain. So in this work, we leverage 100B+ GPT variants to act as synthetic feedback experts offering expert-level edit feedback, that is used to reduce hallucinations and align weaker (\textless10B parameter) LLMs with medical facts using two distinct alignment algorithms (DPO & SALT), endeavoring to narrow the divide between AI-generated content and factual accuracy. This highlights the substantial potential of LLM-based synthetic edits in enhancing the alignment of clinical factuality.
Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data.
Xu, X.; Yao, B.; Dong, Y.; Gabriel, S.; Yu, H.; Hendler, J.; Ghassemi, M.; Dey, A. K.; and Wang, D.
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(1): 1–32. March 2024.
@article{xu_mental-llm_2024, title = {Mental-{LLM}: {Leveraging} {Large} {Language} {Models} for {Mental} {Health} {Prediction} via {Online} {Text} {Data}}, volume = {8}, issn = {2474-9567}, shorttitle = {Mental-{LLM}}, url = {https://dl.acm.org/doi/10.1145/3643540}, doi = {10.1145/3643540}, abstract = {Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9\% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8\%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research.}, language = {en}, number = {1}, urldate = {2024-09-03}, journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies}, author = {Xu, Xuhai and Yao, Bingsheng and Dong, Yuanzhe and Gabriel, Saadia and Yu, Hong and Hendler, James and Ghassemi, Marzyeh and Dey, Anind K. and Wang, Dakuo}, month = mar, year = {2024}, pages = {1--32}, }
Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research.
UMass-BioNLP at MEDIQA-M3G 2024: DermPrompt – A Systematic Exploration of Prompt Engineering with GPT-4V for Dermatological Diagnosis.
Vashisht, P.; Lodha, A.; Maddipatla, M.; Yao, Z.; Mitra, A.; Yang, Z.; Wang, J.; Kwon, S.; and Yu, H.
May 2024.
arXiv:2404.17749 [cs]
@misc{vashisht_umass-bionlp_2024, title = {{UMass}-{BioNLP} at {MEDIQA}-{M3G} 2024: {DermPrompt} -- {A} {Systematic} {Exploration} of {Prompt} {Engineering} with {GPT}-{4V} for {Dermatological} {Diagnosis}}, shorttitle = {{UMass}-{BioNLP} at {MEDIQA}-{M3G} 2024}, url = {http://arxiv.org/abs/2404.17749}, abstract = {This paper presents our team's participation in the MEDIQA-ClinicalNLP2024 shared task B. We present a novel approach to diagnosing clinical dermatology cases by integrating large multimodal models, specifically leveraging the capabilities of GPT-4V under a retriever and a re-ranker framework. Our investigation reveals that GPT-4V, when used as a retrieval agent, can accurately retrieve the correct skin condition 85\% of the time using dermatological images and brief patient histories. Additionally, we empirically show that Naive Chain-of-Thought (CoT) works well for retrieval while Medical Guidelines Grounded CoT is required for accurate dermatological diagnosis. Further, we introduce a Multi-Agent Conversation (MAC) framework and show its superior performance and potential over the best CoT strategy. The experiments suggest that using naive CoT for retrieval and multi-agent conversation for critique-based diagnosis, GPT-4V can lead to an early and accurate diagnosis of dermatological conditions. The implications of this work extend to improving diagnostic workflows, supporting dermatological education, and enhancing patient care by providing a scalable, accessible, and accurate diagnostic tool.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Vashisht, Parth and Lodha, Abhilasha and Maddipatla, Mukta and Yao, Zonghai and Mitra, Avijit and Yang, Zhichao and Wang, Junda and Kwon, Sunjae and Yu, Hong}, month = may, year = {2024}, note = {arXiv:2404.17749 [cs]}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
This paper presents our team's participation in the MEDIQA-ClinicalNLP2024 shared task B. We present a novel approach to diagnosing clinical dermatology cases by integrating large multimodal models, specifically leveraging the capabilities of GPT-4V under a retriever and a re-ranker framework. Our investigation reveals that GPT-4V, when used as a retrieval agent, can accurately retrieve the correct skin condition 85% of the time using dermatological images and brief patient histories. Additionally, we empirically show that Naive Chain-of-Thought (CoT) works well for retrieval while Medical Guidelines Grounded CoT is required for accurate dermatological diagnosis. Further, we introduce a Multi-Agent Conversation (MAC) framework and show its superior performance and potential over the best CoT strategy. The experiments suggest that using naive CoT for retrieval and multi-agent conversation for critique-based diagnosis, GPT-4V can lead to an early and accurate diagnosis of dermatological conditions. The implications of this work extend to improving diagnostic workflows, supporting dermatological education, and enhancing patient care by providing a scalable, accessible, and accurate diagnostic tool.
Synth-SBDH: A Synthetic Dataset of Social and Behavioral Determinants of Health for Clinical Text.
Mitra, A.; Druhl, E.; Goodwin, R.; and Yu, H.
June 2024.
arXiv:2406.06056 [cs]
@misc{mitra_synth-sbdh_2024, title = {Synth-{SBDH}: {A} {Synthetic} {Dataset} of {Social} and {Behavioral} {Determinants} of {Health} for {Clinical} {Text}}, shorttitle = {Synth-{SBDH}}, url = {http://arxiv.org/abs/2406.06056}, abstract = {Social and behavioral determinants of health (SBDH) play a crucial role in health outcomes and are frequently documented in clinical text. Automatically extracting SBDH information from clinical text relies on publicly available good-quality datasets. However, existing SBDH datasets exhibit substantial limitations in their availability and coverage. In this study, we introduce Synth-SBDH, a novel synthetic dataset with detailed SBDH annotations, encompassing status, temporal information, and rationale across 15 SBDH categories. We showcase the utility of Synth-SBDH on three tasks using real-world clinical datasets from two distinct hospital settings, highlighting its versatility, generalizability, and distillation capabilities. Models trained on Synth-SBDH consistently outperform counterparts with no Synth-SBDH training, achieving up to 62.5\% macro-F improvements. Additionally, Synth-SBDH proves effective for rare SBDH categories and under-resource constraints. Human evaluation demonstrates a Human-LLM alignment of 71.06\% and uncovers areas for future refinements.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Mitra, Avijit and Druhl, Emily and Goodwin, Raelene and Yu, Hong}, month = jun, year = {2024}, note = {arXiv:2406.06056 [cs]}, keywords = {Computer Science - Computation and Language}, }
Social and behavioral determinants of health (SBDH) play a crucial role in health outcomes and are frequently documented in clinical text. Automatically extracting SBDH information from clinical text relies on publicly available good-quality datasets. However, existing SBDH datasets exhibit substantial limitations in their availability and coverage. In this study, we introduce Synth-SBDH, a novel synthetic dataset with detailed SBDH annotations, encompassing status, temporal information, and rationale across 15 SBDH categories. We showcase the utility of Synth-SBDH on three tasks using real-world clinical datasets from two distinct hospital settings, highlighting its versatility, generalizability, and distillation capabilities. Models trained on Synth-SBDH consistently outperform counterparts with no Synth-SBDH training, achieving up to 62.5% macro-F improvements. Additionally, Synth-SBDH proves effective for rare SBDH categories and under-resource constraints. Human evaluation demonstrates a Human-LLM alignment of 71.06% and uncovers areas for future refinements.
ReadCtrl: Personalizing text generation with readability-controlled instruction learning.
Tran, H.; Yao, Z.; Li, L.; and Yu, H.
June 2024.
arXiv:2406.09205 [cs]
@misc{tran_readctrl_2024, title = {{ReadCtrl}: {Personalizing} text generation with readability-controlled instruction learning}, shorttitle = {{ReadCtrl}}, url = {http://arxiv.org/abs/2406.09205}, abstract = {Content generation conditioning on users's readability is an important application for personalization. In an era of large language models (LLMs), readability-controlled text generation based on LLMs has become increasingly important. This paper introduces a novel methodology called "Readability-Controlled Instruction Learning (ReadCtrl)," which aims to instruction-tune LLMs to tailor users' readability levels. Unlike the traditional methods, which primarily focused on categorical readability adjustments typically classified as high, medium, and low or expert and layperson levels with limited success, ReadCtrl introduces a dynamic framework that enables LLMs to generate content at various (near continuous level) complexity levels, thereby enhancing their versatility across different applications. Our results show that the ReadCtrl-Mistral-7B models significantly outperformed strong baseline models such as GPT-4 and Claude-3, with a win rate of 52.1\%:35.7\% against GPT-4 in human evaluations. Furthermore, Read-Ctrl has shown significant improvements in automatic evaluations, as evidenced by better readability metrics (e.g., FOG, FKGL) and generation quality metrics (e.g., BLEU, SARI, SummaC-Factuality, UniEval-Consistency and Coherence). These results underscore Read-Ctrl's effectiveness and tenacity in producing high-quality, contextually appropriate outputs that closely align with targeted readability levels, marking a significant advancement in personalized content generation using LLMs.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Tran, Hieu and Yao, Zonghai and Li, Lingxi and Yu, Hong}, month = jun, year = {2024}, note = {arXiv:2406.09205 [cs]}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Content generation conditioning on users's readability is an important application for personalization. In an era of large language models (LLMs), readability-controlled text generation based on LLMs has become increasingly important. This paper introduces a novel methodology called "Readability-Controlled Instruction Learning (ReadCtrl)," which aims to instruction-tune LLMs to tailor users' readability levels. Unlike the traditional methods, which primarily focused on categorical readability adjustments typically classified as high, medium, and low or expert and layperson levels with limited success, ReadCtrl introduces a dynamic framework that enables LLMs to generate content at various (near continuous level) complexity levels, thereby enhancing their versatility across different applications. Our results show that the ReadCtrl-Mistral-7B models significantly outperformed strong baseline models such as GPT-4 and Claude-3, with a win rate of 52.1%:35.7% against GPT-4 in human evaluations. Furthermore, Read-Ctrl has shown significant improvements in automatic evaluations, as evidenced by better readability metrics (e.g., FOG, FKGL) and generation quality metrics (e.g., BLEU, SARI, SummaC-Factuality, UniEval-Consistency and Coherence). These results underscore Read-Ctrl's effectiveness and tenacity in producing high-quality, contextually appropriate outputs that closely align with targeted readability levels, marking a significant advancement in personalized content generation using LLMs.
A Psychology-based Unified Dynamic Framework for Curriculum Learning.
Meng, G.; Zeng, Q.; Lalor, J. P.; and Yu, H.
August 2024.
arXiv:2408.05326 [cs]
@misc{meng_psychology-based_2024, title = {A {Psychology}-based {Unified} {Dynamic} {Framework} for {Curriculum} {Learning}}, url = {http://arxiv.org/abs/2408.05326}, abstract = {Directly learning from examples of random difficulty levels is often challenging for both humans and machine learning models. A more effective strategy involves exposing learners to examples in a progressive order, from easy to difficult. Curriculum Learning (CL) has been proposed to implement this strategy in machine learning model training. However, two key challenges persist in CL framework design: defining the difficulty of training data and determining the appropriate amount of data to input at each training step. This paper presents a Psychology-based Unified Dynamic Framework for Curriculum Learning (PUDF), drawing inspiration from psychometrics. We quantify the difficulty of training data by applying Item Response Theory (IRT) to responses from Artificial Crowds (AC). This theory-driven IRT-AC approach leads to global (i.e., model-independent) and interpretable difficulty values. Leveraging IRT, we propose a Dynamic Data Selection via Model Ability Estimation (DDS-MAE) strategy to schedule the appropriate amount of data during model training. Since our difficulty labeling and model ability estimation are based on a consistent theory, namely IRT, their values are comparable within the same scope, potentially leading to a faster convergence compared to the other CL methods. Experimental results demonstrate that fine-tuning pre-trained language models with PUDF enhances their performance on the GLUE benchmark. Moreover, PUDF surpasses other state-of-the-art (SOTA) CL methods on the GLUE benchmark. We further explore the components of PUDF, namely the difficulty measurer (IRT-AC) and the training scheduler (DDS-MAE) qualitatively and quantitatively. Lastly, we conduct an ablation study to clarify which components of PUDF contribute to faster convergence and higher accuracy.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Meng, Guangyu and Zeng, Qingkai and Lalor, John P. and Yu, Hong}, month = aug, year = {2024}, note = {arXiv:2408.05326 [cs]}, keywords = {Computer Science - Computation and Language}, }
Directly learning from examples of random difficulty levels is often challenging for both humans and machine learning models. A more effective strategy involves exposing learners to examples in a progressive order, from easy to difficult. Curriculum Learning (CL) has been proposed to implement this strategy in machine learning model training. However, two key challenges persist in CL framework design: defining the difficulty of training data and determining the appropriate amount of data to input at each training step. This paper presents a Psychology-based Unified Dynamic Framework for Curriculum Learning (PUDF), drawing inspiration from psychometrics. We quantify the difficulty of training data by applying Item Response Theory (IRT) to responses from Artificial Crowds (AC). This theory-driven IRT-AC approach leads to global (i.e., model-independent) and interpretable difficulty values. Leveraging IRT, we propose a Dynamic Data Selection via Model Ability Estimation (DDS-MAE) strategy to schedule the appropriate amount of data during model training. Since our difficulty labeling and model ability estimation are based on a consistent theory, namely IRT, their values are comparable within the same scope, potentially leading to a faster convergence compared to the other CL methods. Experimental results demonstrate that fine-tuning pre-trained language models with PUDF enhances their performance on the GLUE benchmark. Moreover, PUDF surpasses other state-of-the-art (SOTA) CL methods on the GLUE benchmark. We further explore the components of PUDF, namely the difficulty measurer (IRT-AC) and the training scheduler (DDS-MAE) qualitatively and quantitatively. Lastly, we conduct an ablation study to clarify which components of PUDF contribute to faster convergence and higher accuracy.
Large Language Model-based Role-Playing for Personalized Medical Jargon Extraction.
Lim, J. H.; Kwon, S.; Yao, Z.; Lalor, J. P.; and Yu, H.
August 2024.
arXiv:2408.05555 [cs]
@misc{lim_large_2024, title = {Large {Language} {Model}-based {Role}-{Playing} for {Personalized} {Medical} {Jargon} {Extraction}}, url = {http://arxiv.org/abs/2408.05555}, abstract = {Previous studies reveal that Electronic Health Records (EHR), which have been widely adopted in the U.S. to allow patients to access their personal medical information, do not have high readability to patients due to the prevalence of medical jargon. Tailoring medical notes to individual comprehension by identifying jargon that is difficult for each person will enhance the utility of generative models. We present the first quantitative analysis to measure the impact of role-playing in LLM in medical term extraction. By comparing the results of Mechanical Turk workers over 20 sentences, our study demonstrates that LLM role-playing improves F1 scores in 95\% of cases across 14 different socio-demographic backgrounds. Furthermore, applying role-playing with in-context learning outperformed the previous state-of-the-art models. Our research showed that ChatGPT can improve traditional medical term extraction systems by utilizing role-play to deliver personalized patient education, a potential that previous models had not achieved.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Lim, Jung Hoon and Kwon, Sunjae and Yao, Zonghai and Lalor, John P. and Yu, Hong}, month = aug, year = {2024}, note = {arXiv:2408.05555 [cs]}, keywords = {Computer Science - Computation and Language}, }
Previous studies reveal that Electronic Health Records (EHR), which have been widely adopted in the U.S. to allow patients to access their personal medical information, do not have high readability to patients due to the prevalence of medical jargon. Tailoring medical notes to individual comprehension by identifying jargon that is difficult for each person will enhance the utility of generative models. We present the first quantitative analysis to measure the impact of role-playing in LLM in medical term extraction. By comparing the results of Mechanical Turk workers over 20 sentences, our study demonstrates that LLM role-playing improves F1 scores in 95% of cases across 14 different socio-demographic backgrounds. Furthermore, applying role-playing with in-context learning outperformed the previous state-of-the-art models. Our research showed that ChatGPT can improve traditional medical term extraction systems by utilizing role-play to deliver personalized patient education, a potential that previous models had not achieved.
ODD: A Benchmark Dataset for the Natural Language Processing based Opioid Related Aberrant Behavior Detection.
Kwon, S.; Wang, X.; Liu, W.; Druhl, E.; Sung, M. L.; Reisman, J. I.; Li, W.; Kerns, R. D.; Becker, W.; and Yu, H.
In June 2024. arXiv
Number: arXiv:2307.02591 arXiv:2307.02591 [cs]
@inproceedings{kwon_odd_2024, title = {{ODD}: {A} {Benchmark} {Dataset} for the {Natural} {Language} {Processing} based {Opioid} {Related} {Aberrant} {Behavior} {Detection}}, shorttitle = {{ODD}}, url = {http://arxiv.org/abs/2307.02591}, doi = {10.48550/arXiv.2307.02591}, abstract = {Opioid related aberrant behaviors (ORABs) present novel risk factors for opioid overdose. This paper introduces a novel biomedical natural language processing benchmark dataset named ODD, for ORAB Detection Dataset. ODD is an expert-annotated dataset designed to identify ORABs from patients' EHR notes and classify them into nine categories; 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3) Opioids, 4) Indication, 5) Diagnosed opioid dependency, 6) Benzodiazepines, 7) Medication Changes, 8) Central Nervous System-related, and 9) Social Determinants of Health. We explored two state-of-the-art natural language processing models (fine-tuning and prompt-tuning approaches) to identify ORAB. Experimental results show that the prompt-tuning models outperformed the fine-tuning models in most categories and the gains were especially higher among uncommon categories (Suggested Aberrant Behavior, Confirmed Aberrant Behaviors, Diagnosed Opioid Dependence, and Medication Change). Although the best model achieved the highest 88.17\% on macro average area under precision recall curve, uncommon classes still have a large room for performance improvement. ODD is publicly available.}, urldate = {2024-05-21}, publisher = {arXiv}, author = {Kwon, Sunjae and Wang, Xun and Liu, Weisong and Druhl, Emily and Sung, Minhee L. and Reisman, Joel I. and Li, Wenjun and Kerns, Robert D. and Becker, William and Yu, Hong}, month = jun, year = {2024}, note = {Number: arXiv:2307.02591 arXiv:2307.02591 [cs]}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Opioid related aberrant behaviors (ORABs) present novel risk factors for opioid overdose. This paper introduces a novel biomedical natural language processing benchmark dataset named ODD, for ORAB Detection Dataset. ODD is an expert-annotated dataset designed to identify ORABs from patients' EHR notes and classify them into nine categories; 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3) Opioids, 4) Indication, 5) Diagnosed opioid dependency, 6) Benzodiazepines, 7) Medication Changes, 8) Central Nervous System-related, and 9) Social Determinants of Health. We explored two state-of-the-art natural language processing models (fine-tuning and prompt-tuning approaches) to identify ORAB. Experimental results show that the prompt-tuning models outperformed the fine-tuning models in most categories and the gains were especially higher among uncommon categories (Suggested Aberrant Behavior, Confirmed Aberrant Behaviors, Diagnosed Opioid Dependence, and Medication Change). Although the best model achieved the highest 88.17% on macro average area under precision recall curve, uncommon classes still have a large room for performance improvement. ODD is publicly available.
2023 (22)
An Investigation of the Representation of Social Determinants of Health in the UMLS.
Rawat, B. P. S.; Keating, H.; Goodwin, R.; Druhl, E.; and Yu, H.
AMIA Annual Symposium Proceedings, 2022: 912–921. April 2023.
@article{rawat_investigation_2023, title = {An {Investigation} of the {Representation} of {Social} {Determinants} of {Health} in the {UMLS}}, volume = {2022}, issn = {1942-597X}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10148271/}, abstract = {Social Determinants of Health (SDOH) are the conditions in which people are born, live, work, and age. Unified Medical Language System (UMLS) incorporates SDOH concepts; but few have evaluated its coverage and quality. With 15,649 expert-annotated SDOH mentions from 3176 randomly selected electronic health record (EHR) notes, we found that 100\% SDOH mentions can be mapped to at least one UMLS concept, indicating a good coverage of SDOH. However, we discovered a few challenges for the UMLS’s representation of SDOH. Next, we developed a multi-step framework to identify SDOH concepts from UMLS, and a clinical BERT-based classification algorithm to assign each identified SDOH concept to one of the six general categories. Our multi-step framework extracted a total of 198, 677 SDOH concepts from the UMLS and the SDOH category classification system attained an accuracy of 91\%. We also built EASE: an open-source tool to Extract SDOH from EHRs.}, urldate = {2024-04-10}, journal = {AMIA Annual Symposium Proceedings}, author = {Rawat, Bhanu Pratap Singh and Keating, Heather and Goodwin, Raelene and Druhl, Emily and Yu, Hong}, month = apr, year = {2023}, pmid = {37128364}, pmcid = {PMC10148271}, pages = {912--921}, }
Social Determinants of Health (SDOH) are the conditions in which people are born, live, work, and age. Unified Medical Language System (UMLS) incorporates SDOH concepts; but few have evaluated its coverage and quality. With 15,649 expert-annotated SDOH mentions from 3176 randomly selected electronic health record (EHR) notes, we found that 100% SDOH mentions can be mapped to at least one UMLS concept, indicating a good coverage of SDOH. However, we discovered a few challenges for the UMLS’s representation of SDOH. Next, we developed a multi-step framework to identify SDOH concepts from UMLS, and a clinical BERT-based classification algorithm to assign each identified SDOH concept to one of the six general categories. Our multi-step framework extracted a total of 198, 677 SDOH concepts from the UMLS and the SDOH category classification system attained an accuracy of 91%. We also built EASE: an open-source tool to Extract SDOH from EHRs.
H4H: A Comprehensive Repository of Housing Resources for Homelessness.
Osebe, S.; Tsai, J.; and Hong, Y.
AMIA Summits on Translational Science Proceedings, 2023: 427–437. June 2023.
@article{osebe_h4h_2023, title = {{H4H}: {A} {Comprehensive} {Repository} of {Housing} {Resources} for {Homelessness}}, volume = {2023}, issn = {2153-4063}, shorttitle = {{H4H}}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10283121/}, abstract = {More than half a million people were experiencing homelessness in America on any given night in 2021, yet only around 50\% of them used shelters. To address unmet needs in homelessness, we report the creation of housing for homeless (H4H), the largest comprehensive repository of emergency shelters and other housing resources, from which we deployed state-of-the-art natural language processing approaches to extract information vital to individuals experiencing homelessness, including admission process, service provided, duration of stay, and eligibility. We frame information extraction as a question-answer task. Using 2,055 question-answer pairs for training and evaluation, the best performing system was a two-step classification and question-answering Roberta model with prompting, achieving a macro-average of 75.83 for F1 score. H4H and the annotated entries are publicly available as a benchmark dataset.}, urldate = {2024-04-10}, journal = {AMIA Summits on Translational Science Proceedings}, author = {Osebe, Samue and Tsai, Jack and Hong, Yu}, month = jun, year = {2023}, pmid = {37350907}, pmcid = {PMC10283121}, pages = {427--437}, }
More than half a million people were experiencing homelessness in America on any given night in 2021, yet only around 50% of them used shelters. To address unmet needs in homelessness, we report the creation of housing for homeless (H4H), the largest comprehensive repository of emergency shelters and other housing resources, from which we deployed state-of-the-art natural language processing approaches to extract information vital to individuals experiencing homelessness, including admission process, service provided, duration of stay, and eligibility. We frame information extraction as a question-answer task. Using 2,055 question-answer pairs for training and evaluation, the best performing system was a two-step classification and question-answering Roberta model with prompting, achieving a macro-average of 75.83 for F1 score. H4H and the annotated entries are publicly available as a benchmark dataset.
Multi-label Few-shot ICD Coding as Autoregressive Generation with Prompt.
Yang, Z.; Kwon, S.; Yao, Z.; and Yu, H.
Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, 37(4): 5366–5374. June 2023.
@article{yang_multi-label_2023, title = {Multi-label {Few}-shot {ICD} {Coding} as {Autoregressive} {Generation} with {Prompt}}, volume = {37}, issn = {2159-5399}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457101/}, doi = {10.1609/aaai.v37i4.25668}, abstract = {Automatic International Classification of Diseases (ICD) coding aims to assign multiple ICD codes to a medical note with an average of 3,000+ tokens. This task is challenging due to the high-dimensional space of multi-label assignment (155,000+ ICD code candidates) and the long-tail challenge - Many ICD codes are infrequently assigned yet infrequent ICD codes are important clinically. This study addresses the long-tail challenge by transforming this multi-label classification task into an autoregressive generation task. Specifically, we first introduce a novel pretraining objective to generate free text diagnoses and procedures using the SOAP structure, the medical logic physicians use for note documentation. Second, instead of directly predicting the high dimensional space of ICD codes, our model generates the lower dimension of text descriptions, which then infers ICD codes. Third, we designed a novel prompt template for multi-label classification. We evaluate our Generation with Prompt (GPsoap) model with the benchmark of all code assignment (MIMIC-III-full) and few shot ICD code assignment evaluation benchmark (MIMIC-III-few). Experiments on MIMIC-III-few show that our model performs with a marco F130.2, which substantially outperforms the previous MIMIC-III-full SOTA model (marco F1 4.3) and the model specifically designed for few/zero shot setting (marco F1 18.7). Finally, we design a novel ensemble learner, a cross-attention reranker with prompts, to integrate previous SOTA and our best few-shot coding predictions. Experiments on MIMIC-III-full show that our ensemble learner substantially improves both macro and micro F1, from 10.4 to 14.6 and from 58.2 to 59.1, respectively.}, number = {4}, urldate = {2024-04-10}, journal = {Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence}, author = {Yang, Zhichao and Kwon, Sunjae and Yao, Zonghai and Yu, Hong}, month = jun, year = {2023}, pmid = {37635946}, pmcid = {PMC10457101}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, pages = {5366--5374}, }
Automatic International Classification of Diseases (ICD) coding aims to assign multiple ICD codes to a medical note with an average of 3,000+ tokens. This task is challenging due to the high-dimensional space of multi-label assignment (155,000+ ICD code candidates) and the long-tail challenge - Many ICD codes are infrequently assigned yet infrequent ICD codes are important clinically. This study addresses the long-tail challenge by transforming this multi-label classification task into an autoregressive generation task. Specifically, we first introduce a novel pretraining objective to generate free text diagnoses and procedures using the SOAP structure, the medical logic physicians use for note documentation. Second, instead of directly predicting the high dimensional space of ICD codes, our model generates the lower dimension of text descriptions, which then infers ICD codes. Third, we designed a novel prompt template for multi-label classification. We evaluate our Generation with Prompt (GPsoap) model with the benchmark of all code assignment (MIMIC-III-full) and few shot ICD code assignment evaluation benchmark (MIMIC-III-few). Experiments on MIMIC-III-few show that our model performs with a marco F130.2, which substantially outperforms the previous MIMIC-III-full SOTA model (marco F1 4.3) and the model specifically designed for few/zero shot setting (marco F1 18.7). Finally, we design a novel ensemble learner, a cross-attention reranker with prompts, to integrate previous SOTA and our best few-shot coding predictions. Experiments on MIMIC-III-full show that our ensemble learner substantially improves both macro and micro F1, from 10.4 to 14.6 and from 58.2 to 59.1, respectively.
Evaluating the Efficacy of NoteAid on EHR Note Comprehension among US Veterans through Amazon Mechanical Turk.
Lalor, J. P.; Wu, H.; Mazor, K. M.; and Yu, H.
International journal of medical informatics, 172: 105006. April 2023.
@article{lalor_evaluating_2023, title = {Evaluating the {Efficacy} of {NoteAid} on {EHR} {Note} {Comprehension} among {US} {Veterans} through {Amazon} {Mechanical} {Turk}}, volume = {172}, issn = {1386-5056}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9992155/}, doi = {10.1016/j.ijmedinf.2023.105006}, urldate = {2024-04-10}, journal = {International journal of medical informatics}, author = {Lalor, John P. and Wu, Hao and Mazor, Kathleen M. and Yu, Hong}, month = apr, year = {2023}, pmid = {36780789}, pmcid = {PMC9992155}, keywords = {Electronic health records, Health information technology, Health literacy}, pages = {105006}, }
Associations Between Natural Language Processing–Enriched Social Determinants of Health and Suicide Death Among US Veterans.
Mitra, A.; Pradhan, R.; Melamed, R. D.; Chen, K.; Hoaglin, D. C.; Tucker, K. L.; Reisman, J. I.; Yang, Z.; Liu, W.; Tsai, J.; and Yu, H.
JAMA Network Open, 6(3). March 2023.
Publisher: American Medical Association
@article{mitra_associations_2023, title = {Associations {Between} {Natural} {Language} {Processing}–{Enriched} {Social} {Determinants} of {Health} and {Suicide} {Death} {Among} {US} {Veterans}}, volume = {6}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10018322/}, doi = {10.1001/jamanetworkopen.2023.3079}, abstract = {Are social determinants of health (SDOHs), extracted from both structured and unstructured clinical data, associated with an increased risk of suicide death among US veterans?In this case-control study of 8821 cases and 35 284 matched controls, ...}, language = {en}, number = {3}, urldate = {2024-04-10}, journal = {JAMA Network Open}, author = {Mitra, Avijit and Pradhan, Richeek and Melamed, Rachel D. and Chen, Kun and Hoaglin, David C. and Tucker, Katherine L. and Reisman, Joel I. and Yang, Zhichao and Liu, Weisong and Tsai, Jack and Yu, Hong}, month = mar, year = {2023}, pmid = {36920391}, note = {Publisher: American Medical Association}, }
Are social determinants of health (SDOHs), extracted from both structured and unstructured clinical data, associated with an increased risk of suicide death among US veterans?In this case-control study of 8821 cases and 35 284 matched controls, ...
Intentional Self-Harm Among US Veterans With Traumatic Brain Injury or Posttraumatic Stress Disorder: Retrospective Cohort Study From 2008 to 2017.
Rawat, B. P. S.; Reisman, J.; Pogoda, T. K; Liu, W.; Rongali, S.; Aseltine Jr, R. H; Chen, K.; Tsai, J.; Berlowitz, D.; Yu, H.; and Carlson, K. F
JMIR Public Health and Surveillance, 9: e42803. July 2023.
@article{rawat_intentional_2023, title = {Intentional {Self}-{Harm} {Among} {US} {Veterans} {With} {Traumatic} {Brain} {Injury} or {Posttraumatic} {Stress} {Disorder}: {Retrospective} {Cohort} {Study} {From} 2008 to 2017}, volume = {9}, issn = {2369-2960}, shorttitle = {Intentional {Self}-{Harm} {Among} {US} {Veterans} {With} {Traumatic} {Brain} {Injury} or {Posttraumatic} {Stress} {Disorder}}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10407646/}, doi = {10.2196/42803}, abstract = {Background Veterans with a history of traumatic brain injury (TBI) and/or posttraumatic stress disorder (PTSD) may be at increased risk of suicide attempts and other forms of intentional self-harm as compared to veterans without TBI or PTSD. Objective Using administrative data from the US Veterans Health Administration (VHA), we studied associations between TBI and PTSD diagnoses, and subsequent diagnoses of intentional self-harm among US veterans who used VHA health care between 2008 and 2017. Methods All veterans with encounters or hospitalizations for intentional self-harm were assigned “index dates” corresponding to the date of the first related visit; among those without intentional self-harm, we randomly selected a date from among the veteran’s health care encounters to match the distribution of case index dates over the 10-year period. We then examined the prevalence of TBI and PTSD diagnoses within the 5-year period prior to veterans’ index dates. TBI, PTSD, and intentional self-harm were identified using International Classification of Diseases diagnosis and external cause of injury codes from inpatient and outpatient VHA encounters. We stratified analyses by veterans’ average yearly VHA utilization in the 5-year period before their index date (low, medium, or high). Variations in prevalence and odds of intentional self-harm diagnoses were compared by veterans’ prior TBI and PTSD diagnosis status (TBI only, PTSD only, and comorbid TBI/PTSD) for each VHA utilization stratum. Multivariable models adjusted for age, sex, race, ethnicity, marital status, Department of Veterans Affairs service-connection status, and Charlson Comorbidity Index scores. Results About 6.7 million veterans with at least two VHA visits in the 5-year period before their index dates were included in the analyses; 86,644 had at least one intentional self-harm diagnosis during the study period. During the periods prior to veterans’ index dates, 93,866 were diagnosed with TBI only; 892,420 with PTSD only; and 102,549 with comorbid TBI/PTSD. Across all three VHA utilization strata, the prevalence of intentional self-harm diagnoses was higher among veterans diagnosed with TBI, PTSD, or TBI/PTSD than among veterans with neither diagnosis. The observed difference was most pronounced among veterans in the high VHA utilization stratum. The prevalence of intentional self-harm was six times higher among those with comorbid TBI/PTSD (6778/58,295, 11.63\%) than among veterans with neither TBI nor PTSD (21,979/1,144,991, 1.92\%). Adjusted odds ratios suggested that, after accounting for potential confounders, veterans with TBI, PTSD, or comorbid TBI/PTSD had higher odds of self-harm compared to veterans without these diagnoses. Among veterans with high VHA utilization, those with comorbid TBI/PTSD were 4.26 (95\% CI 4.15-4.38) times more likely to receive diagnoses for intentional self-harm than veterans with neither diagnosis. This pattern was similar for veterans with low and medium VHA utilization. 
Conclusions Veterans with TBI and/or PTSD diagnoses, compared to those with neither diagnosis, were substantially more likely to be subsequently diagnosed with intentional self-harm between 2008 and 2017. These associations were most pronounced among veterans who used VHA health care most frequently. These findings suggest a need for suicide prevention efforts targeted at veterans with these diagnoses.}, urldate = {2024-04-10}, journal = {JMIR Public Health and Surveillance}, author = {Rawat, Bhanu Pratap Singh and Reisman, Joel and Pogoda, Terri K and Liu, Weisong and Rongali, Subendhu and Aseltine Jr, Robert H and Chen, Kun and Tsai, Jack and Berlowitz, Dan and Yu, Hong and Carlson, Kathleen F}, month = jul, year = {2023}, pmid = {37486751}, pmcid = {PMC10407646}, pages = {e42803}, }
Background Veterans with a history of traumatic brain injury (TBI) and/or posttraumatic stress disorder (PTSD) may be at increased risk of suicide attempts and other forms of intentional self-harm as compared to veterans without TBI or PTSD. Objective Using administrative data from the US Veterans Health Administration (VHA), we studied associations between TBI and PTSD diagnoses, and subsequent diagnoses of intentional self-harm among US veterans who used VHA health care between 2008 and 2017. Methods All veterans with encounters or hospitalizations for intentional self-harm were assigned “index dates” corresponding to the date of the first related visit; among those without intentional self-harm, we randomly selected a date from among the veteran’s health care encounters to match the distribution of case index dates over the 10-year period. We then examined the prevalence of TBI and PTSD diagnoses within the 5-year period prior to veterans’ index dates. TBI, PTSD, and intentional self-harm were identified using International Classification of Diseases diagnosis and external cause of injury codes from inpatient and outpatient VHA encounters. We stratified analyses by veterans’ average yearly VHA utilization in the 5-year period before their index date (low, medium, or high). Variations in prevalence and odds of intentional self-harm diagnoses were compared by veterans’ prior TBI and PTSD diagnosis status (TBI only, PTSD only, and comorbid TBI/PTSD) for each VHA utilization stratum. Multivariable models adjusted for age, sex, race, ethnicity, marital status, Department of Veterans Affairs service-connection status, and Charlson Comorbidity Index scores. Results About 6.7 million veterans with at least two VHA visits in the 5-year period before their index dates were included in the analyses; 86,644 had at least one intentional self-harm diagnosis during the study period. During the periods prior to veterans’ index dates, 93,866 were diagnosed with TBI only; 892,420 with PTSD only; and 102,549 with comorbid TBI/PTSD. Across all three VHA utilization strata, the prevalence of intentional self-harm diagnoses was higher among veterans diagnosed with TBI, PTSD, or TBI/PTSD than among veterans with neither diagnosis. The observed difference was most pronounced among veterans in the high VHA utilization stratum. The prevalence of intentional self-harm was six times higher among those with comorbid TBI/PTSD (6778/58,295, 11.63%) than among veterans with neither TBI nor PTSD (21,979/1,144,991, 1.92%). Adjusted odds ratios suggested that, after accounting for potential confounders, veterans with TBI, PTSD, or comorbid TBI/PTSD had higher odds of self-harm compared to veterans without these diagnoses. Among veterans with high VHA utilization, those with comorbid TBI/PTSD were 4.26 (95% CI 4.15-4.38) times more likely to receive diagnoses for intentional self-harm than veterans with neither diagnosis. This pattern was similar for veterans with low and medium VHA utilization. Conclusions Veterans with TBI and/or PTSD diagnoses, compared to those with neither diagnosis, were substantially more likely to be subsequently diagnosed with intentional self-harm between 2008 and 2017. These associations were most pronounced among veterans who used VHA health care most frequently. These findings suggest a need for suicide prevention efforts targeted at veterans with these diagnoses.
Context Variance Evaluation of Pretrained Language Models for Prompt-based Biomedical Knowledge Probing.
Yao, Z.; Cao, Y.; Yang, Z.; and Yu, H.
AMIA Summits on Translational Science Proceedings, 2023: 592–601. June 2023.
@article{yao_context_2023, title = {Context {Variance} {Evaluation} of {Pretrained} {Language} {Models} for {Prompt}-based {Biomedical} {Knowledge} {Probing}}, volume = {2023}, issn = {2153-4063}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10283095/}, abstract = {Pretrained language models (PLMs) have motivated research on what kinds of knowledge these models learn. Fill-in-the-blanks problem (e.g., cloze tests) is a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses the Top-k accuracy metric to evaluate different PLMs’ knowledge. However, existing research has shown that such prompt-based knowledge probing methods can only probe a lower bound of knowledge. Many factors like prompt-based probing biases make the LAMA benchmark unreliable and unstable. This problem is more prominent in BioLAMA. The severe long-tailed distribution in vocabulary and large-N-M relation make the performance gap between LAMA and BioLAMA remain notable. To address these, we introduced context variance into the prompt generation and proposed a new rank-change-based evaluation metric. Different from the previous known-unknown evaluation criteria, we proposed the concept of ”Misunderstand” in LAMA for the first time. Through experiments on 12 PLMs, we showed that our context variance prompts and Understand-Confuse-Misunderstand (UCM) metric make BioLAMA more friendly to large-N-M relations and rare relations. We also conducted a set of control experiments to disentangle ”understand” from just ”read and copy”.}, urldate = {2023-11-14}, journal = {AMIA Summits on Translational Science Proceedings}, author = {Yao, Zonghai and Cao, Yi and Yang, Zhichao and Yu, Hong}, month = jun, year = {2023}, pmid = {37350903}, pmcid = {PMC10283095}, pages = {592--601}, }
Pretrained language models (PLMs) have motivated research on what kinds of knowledge these models learn. Fill-in-the-blanks problem (e.g., cloze tests) is a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses the Top-k accuracy metric to evaluate different PLMs’ knowledge. However, existing research has shown that such prompt-based knowledge probing methods can only probe a lower bound of knowledge. Many factors like prompt-based probing biases make the LAMA benchmark unreliable and unstable. This problem is more prominent in BioLAMA. The severe long-tailed distribution in vocabulary and large-N-M relation make the performance gap between LAMA and BioLAMA remain notable. To address these, we introduced context variance into the prompt generation and proposed a new rank-change-based evaluation metric. Different from the previous known-unknown evaluation criteria, we proposed the concept of ”Misunderstand” in LAMA for the first time. Through experiments on 12 PLMs, we showed that our context variance prompts and Understand-Confuse-Misunderstand (UCM) metric make BioLAMA more friendly to large-N-M relations and rare relations. We also conducted a set of control experiments to disentangle ”understand” from just ”read and copy”.
Extracting Biomedical Factual Knowledge Using Pretrained Language Model and Electronic Health Record Context.
Yao, Z.; Cao, Y.; Yang, Z.; Deshpande, V.; and Yu, H.
AMIA Annual Symposium Proceedings, 2022: 1188–1197. April 2023.
Paper link bibtex abstract
@article{yao_extracting_2023, title = {Extracting {Biomedical} {Factual} {Knowledge} {Using} {Pretrained} {Language} {Model} and {Electronic} {Health} {Record} {Context}}, volume = {2022}, issn = {1942-597X}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10148358/}, abstract = {Language Models (LMs) have performed well on biomedical natural language processing applications. In this study, we conducted some experiments to use prompt methods to extract knowledge from LMs as new knowledge Bases (LMs as KBs). However, prompting can only be used as a low bound for knowledge extraction, and perform particularly poorly on biomedical domain KBs. In order to make LMs as KBs more in line with the actual application scenarios of the biomedical domain, we specifically add EHR notes as context to the prompt to improve the low bound in the biomedical domain. We design and validate a series of experiments for our Dynamic-Context-BioLAMA task. Our experiments show that the knowledge possessed by those language models can distinguish the correct knowledge from the noise knowledge in the EHR notes, and such distinguishing ability can also be used as a new metric to evaluate the amount of knowledge possessed by the model.}, urldate = {2024-04-10}, journal = {AMIA Annual Symposium Proceedings}, author = {Yao, Zonghai and Cao, Yi and Yang, Zhichao and Deshpande, Vijeta and Yu, Hong}, month = apr, year = {2023}, pmid = {37128373}, pmcid = {PMC10148358}, pages = {1188--1197}, }
Language models (LMs) have performed well on biomedical natural language processing applications. In this study, we conducted experiments using prompt methods to extract knowledge from LMs as new knowledge bases (LMs as KBs). However, prompting provides only a lower bound for knowledge extraction and performs particularly poorly on biomedical-domain KBs. To bring LMs-as-KBs closer to the actual application scenarios of the biomedical domain, we add EHR notes as context to the prompts to raise this lower bound. We design and validate a series of experiments for our Dynamic-Context-BioLAMA task. Our experiments show that the knowledge possessed by these language models can distinguish correct knowledge from noisy knowledge in the EHR notes, and that this distinguishing ability can also be used as a new metric to evaluate the amount of knowledge possessed by the model.
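A minimal illustration of the idea of prepending an EHR note as prompt context, assuming a LAMA-style cloze probe; the note text and the cloze sentence below are hypothetical.

# Illustrative only: a plain cloze probe versus the same probe with an
# EHR note prepended as context.
ehr_note = ("Assessment: 67-year-old man with poorly controlled type 2 diabetes. "
            "Plan: continue metformin 1000 mg twice daily.")
cloze = "The patient is treated for type 2 diabetes with [MASK]."

plain_prompt = cloze                           # standard LAMA-style probe
contextual_prompt = f"{ehr_note}\n{cloze}"     # EHR note added as context

print(contextual_prompt)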
TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records.
Yang, Z.; Mitra, A.; Liu, W.; Berlowitz, D.; and Yu, H.
Nature Communications, 14: 7857. November 2023.
Paper doi link bibtex abstract
@article{yang_transformehr_2023, title = {{TransformEHR}: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records}, volume = {14}, issn = {2041-1723}, shorttitle = {{TransformEHR}}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10687211/}, doi = {10.1038/s41467-023-43715-z}, abstract = {Deep learning transformer-based models using longitudinal electronic health records (EHRs) have shown a great success in prediction of clinical diseases or outcomes. Pretraining on a large dataset can help such models map the input space better and boost their performance on relevant tasks through finetuning with limited data. In this study, we present TransformEHR, a generative encoder-decoder model with transformer that is pretrained using a new pretraining objective—predicting all diseases and outcomes of a patient at a future visit from previous visits. TransformEHR’s encoder-decoder framework, paired with the novel pretraining objective, helps it achieve the new state-of-the-art performance on multiple clinical prediction tasks. Comparing with the previous model, TransformEHR improves area under the precision–recall curve by 2\% (p {\textless} 0.001) for pancreatic cancer onset and by 24\% (p = 0.007) for intentional self-harm in patients with post-traumatic stress disorder. The high performance in predicting intentional self-harm shows the potential of TransformEHR in building effective clinical intervention systems. TransformEHR is also generalizable and can be easily finetuned for clinical prediction tasks with limited data., Using AI to predict disease can improve interventions slow down or prevent disease. Here, the authors show that generative AI models built on the framework of Transformer, the model that also empowers ChatGPT, can achieve state-of-the-art performance on disease predictions based on longitudinal electronic records.}, urldate = {2024-04-10}, journal = {Nature Communications}, author = {Yang, Zhichao and Mitra, Avijit and Liu, Weisong and Berlowitz, Dan and Yu, Hong}, month = nov, year = {2023}, pmid = {38030638}, pmcid = {PMC10687211}, keywords = {Computer science, Disease prevention, Experimental models of disease}, pages = {7857}, }
Deep learning transformer-based models using longitudinal electronic health records (EHRs) have shown great success in predicting clinical diseases and outcomes. Pretraining on a large dataset can help such models map the input space better and boost their performance on relevant tasks through finetuning with limited data. In this study, we present TransformEHR, a generative encoder-decoder transformer model that is pretrained using a new objective: predicting all diseases and outcomes of a patient at a future visit from previous visits. TransformEHR's encoder-decoder framework, paired with this novel pretraining objective, helps it achieve new state-of-the-art performance on multiple clinical prediction tasks. Compared with the previous model, TransformEHR improves the area under the precision-recall curve by 2% (p < 0.001) for pancreatic cancer onset and by 24% (p = 0.007) for intentional self-harm in patients with post-traumatic stress disorder. The high performance in predicting intentional self-harm shows the potential of TransformEHR in building effective clinical intervention systems. TransformEHR is also generalizable and can be easily finetuned for clinical prediction tasks with limited data. Using AI to predict disease can improve interventions to slow down or prevent disease; here, the authors show that generative AI models built on the Transformer framework, the model that also powers ChatGPT, can achieve state-of-the-art performance on disease predictions based on longitudinal electronic records.
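The pretraining objective described above can be made concrete with a small data-preparation sketch (not the authors' code; the patient visits and ICD codes below are synthetic): each training pair feeds the codes from a patient's previous visits to the encoder and asks the decoder to generate all codes at the next visit.

# Sketch of building (previous visits -> future visit) pretraining pairs.
from typing import List, Tuple

def make_pretraining_pairs(visits: List[List[str]]) -> List[Tuple[List[str], List[str]]]:
    """For each visit t >= 1, pair codes from visits before t with codes at visit t."""
    pairs = []
    for t in range(1, len(visits)):
        history = [code for visit in visits[:t] for code in visit]   # flattened prior visits
        target = visits[t]                                           # all codes at the future visit
        pairs.append((history, target))
    return pairs

# Synthetic patient: three visits with ICD-10 codes.
patient_visits = [["E11.9", "I10"], ["I10", "N18.3"], ["N18.4", "I50.9"]]
for history, target in make_pretraining_pairs(patient_visits):
    print("encoder input:", history, "-> decoder target:", target)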
NoteChat: A Dataset of Synthetic Doctor-Patient Conversations Conditioned on Clinical Notes.
Wang, J.; Yao, Z.; Yang, Z.; Zhou, H.; Li, R.; Wang, X.; Xu, Y.; and Yu, H.
October 2023.
Number: arXiv:2310.15959 arXiv:2310.15959 [cs]
Paper link bibtex abstract
@misc{wang_notechat_2023, title = {{NoteChat}: {A} {Dataset} of {Synthetic} {Doctor}-{Patient} {Conversations} {Conditioned} on {Clinical} {Notes}}, shorttitle = {{NoteChat}}, url = {http://arxiv.org/abs/2310.15959}, abstract = {The detailed clinical records drafted by doctors after each patient's visit are crucial for medical practitioners and researchers. Automating the creation of these notes with language models can reduce the workload of doctors. However, training such models can be difficult due to the limited public availability of conversations between patients and doctors. In this paper, we introduce NoteChat, a cooperative multi-agent framework leveraging Large Language Models (LLMs) for generating synthetic doctor-patient conversations conditioned on clinical notes. NoteChat consists of Planning, Roleplay, and Polish modules. We provide a comprehensive automatic and human evaluation of NoteChat, comparing it with state-of-the-art models, including OpenAI's ChatGPT and GPT-4. Results demonstrate that NoteChat facilitates high-quality synthetic doctor-patient conversations, underscoring the untapped potential of LLMs in healthcare. This work represents the first instance of multiple LLMs cooperating to complete a doctor-patient conversation conditioned on clinical notes, offering promising avenues for the intersection of AI and healthcare}, urldate = {2023-11-15}, publisher = {arXiv}, author = {Wang, Junda and Yao, Zonghai and Yang, Zhichao and Zhou, Huixue and Li, Rumeng and Wang, Xun and Xu, Yucheng and Yu, Hong}, month = oct, year = {2023}, note = {Number: arXiv:2310.15959 arXiv:2310.15959 [cs]}, keywords = {Computer Science - Computation and Language}, }
The detailed clinical records drafted by doctors after each patient's visit are crucial for medical practitioners and researchers. Automating the creation of these notes with language models can reduce the workload of doctors. However, training such models can be difficult due to the limited public availability of conversations between patients and doctors. In this paper, we introduce NoteChat, a cooperative multi-agent framework leveraging large language models (LLMs) to generate synthetic doctor-patient conversations conditioned on clinical notes. NoteChat consists of Planning, Roleplay, and Polish modules. We provide a comprehensive automatic and human evaluation of NoteChat, comparing it with state-of-the-art models, including OpenAI's ChatGPT and GPT-4. Results demonstrate that NoteChat facilitates high-quality synthetic doctor-patient conversations, underscoring the untapped potential of LLMs in healthcare. This work represents the first instance of multiple LLMs cooperating to complete a doctor-patient conversation conditioned on clinical notes, offering promising avenues for the intersection of AI and healthcare.
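A hypothetical sketch of the Planning, Roleplay, and Polish stages named in the abstract; call_llm is a placeholder for whatever chat-completion client is used, and all prompt wording is invented for illustration.

# Placeholder pipeline in the spirit of a Planning -> Roleplay -> Polish design.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")   # placeholder

def notechat_style_pipeline(clinical_note: str) -> str:
    # Planning: outline which facts from the note each speaker should raise.
    plan = call_llm(f"Outline a doctor-patient dialogue covering this note:\n{clinical_note}")
    # Roleplay: generate the conversation turn by turn from the outline.
    dialogue = call_llm(f"Write the dialogue following this outline:\n{plan}")
    # Polish: revise for coverage and naturalness against the source note.
    return call_llm(f"Revise the dialogue so it stays faithful to the note:\n{clinical_note}\n\n{dialogue}")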
Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations.
Yang, Z.; Yao, Z.; Tasmin, M.; Vashisht, P.; Jang, W. S.; Ouyang, F.; Wang, B.; Berlowitz, D.; and Yu, H.
November 2023.
Pages: 2023.10.26.23297629
Paper doi link bibtex abstract
@misc{yang_performance_2023, title = {Performance of {Multimodal} {GPT}-{4V} on {USMLE} with {Image}: {Potential} for {Imaging} {Diagnostic} {Support} with {Explanations}}, copyright = {© 2023, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0/}, shorttitle = {Performance of {Multimodal} {GPT}-{4V} on {USMLE} with {Image}}, url = {https://www.medrxiv.org/content/10.1101/2023.10.26.23297629v2}, doi = {10.1101/2023.10.26.23297629}, abstract = {Background Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. Past research, however, has not had the scale and accuracy for use in clinical decision making. The power of large language models (LLMs) may be changing this. In this study, we evaluated the performance and interpretability of Generative Pre-trained Transformer 4 Vision (GPT-4V), a multimodal LLM, on medical licensing examination questions with images. Methods We used three sets of multiple-choice questions with images from United States Medical Licensing Examination (USMLE), USMLE question bank for medical students (AMBOSS), and Diagnostic Radiology Qualifying Core Exam (DRQCE) to test GPT-4V’s accuracy and explanation quality. We compared GPT-4V with two other large language models, GPT-4 and ChatGPT, which cannot process images. We also assessed the preference and feedback of healthcare professionals on GPT-4V’s explanations. Results GPT-4V achieved high accuracies on USMLE (86.2\%), AMBOSS (62.0\%), and DRQCE (73.1\%), outperforming ChatGPT and GPT-4 by relative increase of 131.8\% and 64.5\% on average. GPT-4V was in the 70th - 80th percentile with AMBOSS users preparing for the exam. GPT-4V also passed the full USMLE exam with an accuracy of 90.7\%. GPT-4V’s explanations were preferred by healthcare professionals when it answered correctly, but they revealed several issues such as image misunderstanding, text hallucination, and reasoning error when it answered incorrectly. Conclusion GPT-4V showed promising results for medical licensing examination questions with images, suggesting its potential for clinical decision support. However, GPT-4V needs to improve its explanation quality and reliability for clinical use. 1-2 sentence description AI models offer potential for imaging diagnostic support tool, but their performance and interpretability are often unclear. Here, the authors show that GPT-4V, a large multimodal language model, can achieve high accuracy on medical licensing exams with images, but also reveal several issues in its explanation quality.}, language = {en}, urldate = {2023-11-14}, publisher = {medRxiv}, author = {Yang, Zhichao and Yao, Zonghai and Tasmin, Mahbuba and Vashisht, Parth and Jang, Won Seok and Ouyang, Feiyun and Wang, Beining and Berlowitz, Dan and Yu, Hong}, month = nov, year = {2023}, note = {Pages: 2023.10.26.23297629}, }
Background: Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. Past research, however, has not had the scale and accuracy needed for use in clinical decision making. The power of large language models (LLMs) may be changing this. In this study, we evaluated the performance and interpretability of Generative Pre-trained Transformer 4 Vision (GPT-4V), a multimodal LLM, on medical licensing examination questions with images.
Methods: We used three sets of multiple-choice questions with images from the United States Medical Licensing Examination (USMLE), a USMLE question bank for medical students (AMBOSS), and the Diagnostic Radiology Qualifying Core Exam (DRQCE) to test GPT-4V's accuracy and explanation quality. We compared GPT-4V with two other large language models, GPT-4 and ChatGPT, which cannot process images. We also assessed the preference and feedback of healthcare professionals on GPT-4V's explanations.
Results: GPT-4V achieved high accuracies on USMLE (86.2%), AMBOSS (62.0%), and DRQCE (73.1%), outperforming ChatGPT and GPT-4 by relative increases of 131.8% and 64.5% on average. GPT-4V was in the 70th-80th percentile among AMBOSS users preparing for the exam. GPT-4V also passed the full USMLE exam with an accuracy of 90.7%. GPT-4V's explanations were preferred by healthcare professionals when it answered correctly, but they revealed several issues such as image misunderstanding, text hallucination, and reasoning errors when it answered incorrectly.
Conclusion: GPT-4V showed promising results for medical licensing examination questions with images, suggesting its potential for clinical decision support. However, GPT-4V needs to improve its explanation quality and reliability for clinical use.
1-2 sentence description: AI models offer potential as imaging diagnostic support tools, but their performance and interpretability are often unclear. Here, the authors show that GPT-4V, a large multimodal language model, can achieve high accuracy on medical licensing exams with images, but they also reveal several issues in its explanation quality.
SELF-EXPLAIN: Teaching Large Language Models to Reason Complex Questions by Themselves.
Zhao, J.; Yao, Z.; Yang, Z.; and Yu, H.
November 2023.
Number: arXiv:2311.06985 arXiv:2311.06985 [cs] R0-FoMo: Workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models at NeurIPS 2023.
Paper link bibtex abstract
@misc{zhao_self-explain_2023, title = {{SELF}-{EXPLAIN}: {Teaching} {Large} {Language} {Models} to {Reason} {Complex} {Questions} by {Themselves}}, shorttitle = {{SELF}-{EXPLAIN}}, url = {http://arxiv.org/abs/2311.06985}, abstract = {Large language models (LLMs) can generate intermediate reasoning steps. To elicit the reliable reasoning, the common practice is to employ few-shot chain-of-thought prompting, where several in-context demonstrations for reasoning are prepended to the question. However, such chain-of-thought examples are expensive to craft, especially for professional domains, and can have high variance depending on human annotators. Therefore, this work investigates whether LLMs can teach themselves to reason without human-crafted demonstrations. We propose SELF-EXPLAIN to generate CoT examples by LLMs inspired by "encoding specificity" in human memory retrieval. We find using self-explanations makes LLMs more confident, more calibrated and less biased when answering complex questions. Moreover, we find prompting with self-explanations can even significantly outperform using human-crafted CoTs on several complex question answering dataset.}, urldate = {2023-11-14}, publisher = {arXiv}, author = {Zhao, Jiachen and Yao, Zonghai and Yang, Zhichao and Yu, Hong}, month = nov, year = {2023}, note = {Number: arXiv:2311.06985 arXiv:2311.06985 [cs] R0-FoMo: Workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models at NeurIPS 2023.}, keywords = {Computer Science - Computation and Language}, }
Large language models (LLMs) can generate intermediate reasoning steps. To elicit reliable reasoning, the common practice is to employ few-shot chain-of-thought (CoT) prompting, where several in-context demonstrations of reasoning are prepended to the question. However, such chain-of-thought examples are expensive to craft, especially for professional domains, and can have high variance depending on human annotators. Therefore, this work investigates whether LLMs can teach themselves to reason without human-crafted demonstrations. We propose SELF-EXPLAIN, which generates CoT examples with the LLM itself, inspired by "encoding specificity" in human memory retrieval. We find that using self-explanations makes LLMs more confident, better calibrated, and less biased when answering complex questions. Moreover, we find that prompting with self-explanations can even significantly outperform using human-crafted CoTs on several complex question answering datasets.
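The core idea, self-generated demonstrations replacing human-crafted chain-of-thought examples, can be sketched as follows. This is not the authors' implementation; call_llm is a placeholder client and the prompt wording is invented.

# Sketch: the model writes its own reasoning demonstrations, which are then
# prepended to the test question in place of human-crafted CoT examples.
from typing import List

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")   # placeholder

def self_explain_answer(demo_questions: List[str], test_question: str) -> str:
    demos = []
    for q in demo_questions:
        # Step 1: the model explains and answers a demonstration question by itself.
        explanation = call_llm(f"Question: {q}\nExplain your reasoning step by step, then answer.")
        demos.append(f"Question: {q}\n{explanation}")
    # Step 2: use the self-generated demonstrations as the few-shot context.
    prompt = "\n\n".join(demos) + f"\n\nQuestion: {test_question}\nAnswer:"
    return call_llm(prompt)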
EHRTutor: Enhancing Patient Understanding of Discharge Instructions.
Zhang, Z.; Yao, Z.; Zhou, H.; Ouyang, F.; and Yu, H.
October 2023.
To appear in NeurIPS'23 Workshop on Generative AI for Education (GAIED), December, New Orleans
Paper doi link bibtex abstract
@misc{zhang_ehrtutor_2023, title = {{EHRTutor}: {Enhancing} {Patient} {Understanding} of {Discharge} {Instructions}}, shorttitle = {{EHRTutor}}, url = {http://arxiv.org/abs/2310.19212}, doi = {10.48550/arXiv.2310.19212}, abstract = {Large language models have shown success as a tutor in education in various fields. Educating patients about their clinical visits plays a pivotal role in patients' adherence to their treatment plans post-discharge. This paper presents EHRTutor, an innovative multi-component framework leveraging the Large Language Model (LLM) for patient education through conversational question-answering. EHRTutor first formulates questions pertaining to the electronic health record discharge instructions. It then educates the patient through conversation by administering each question as a test. Finally, it generates a summary at the end of the conversation. Evaluation results using LLMs and domain experts have shown a clear preference for EHRTutor over the baseline. Moreover, EHRTutor also offers a framework for generating synthetic patient education dialogues that can be used for future in-house system training.}, urldate = {2023-11-01}, publisher = {arXiv}, author = {Zhang, Zihao and Yao, Zonghai and Zhou, Huixue and ouyang, Feiyun and Yu, Hong}, month = oct, year = {2023}, note = {To appear in NeurIPS'23 Workshop on Generative AI for Education (GAIED), December, New Orleans}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Large language models have shown success as tutors in various fields of education. Educating patients about their clinical visits plays a pivotal role in patients' adherence to their treatment plans post-discharge. This paper presents EHRTutor, an innovative multi-component framework leveraging a large language model (LLM) for patient education through conversational question answering. EHRTutor first formulates questions pertaining to the electronic health record discharge instructions. It then educates the patient through conversation by administering each question as a test. Finally, it generates a summary at the end of the conversation. Evaluation results using LLMs and domain experts have shown a clear preference for EHRTutor over the baseline. Moreover, EHRTutor also offers a framework for generating synthetic patient education dialogues that can be used for future in-house system training.
Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization.
Mishra, P.; Yao, Z.; Chen, S.; Wang, B.; Mittal, R.; and Yu, H.
October 2023.
NeurIPS 2023 Workshop SyntheticData4ML Accepted
Paper link bibtex abstract
@misc{mishra_synthetic_2023, title = {Synthetic {Imitation} {Edit} {Feedback} for {Factual} {Alignment} in {Clinical} {Summarization}}, url = {http://arxiv.org/abs/2310.20033}, abstract = {Large Language Models (LLMs) like the GPT and LLaMA families have demonstrated exceptional capabilities in capturing and condensing critical contextual information and achieving state-of-the-art performance in the summarization task. However, community concerns about these models' hallucination issues continue to rise. LLMs sometimes generate factually hallucinated summaries, which can be extremely harmful in the clinical domain NLP tasks (e.g., clinical note summarization), where factually incorrect statements can lead to critically erroneous diagnoses. Fine-tuning LLMs using human feedback has shown the promise of aligning LLMs to be factually consistent during generation, but such training procedure requires high-quality human-annotated data, which can be extremely expensive to get in the clinical domain. In this work, we propose a new pipeline using ChatGPT instead of human experts to generate high-quality feedback data for improving factual consistency in the clinical note summarization task. We focus specifically on edit feedback because recent work discusses the shortcomings of human alignment via preference feedback in complex situations (such as clinical NLP tasks that require extensive expert knowledge), as well as some advantages of collecting edit feedback from domain experts. In addition, although GPT has reached the expert level in many clinical NLP tasks (e.g., USMLE QA), there is not much previous work discussing whether GPT can generate expert-level edit feedback for LMs in the clinical note summarization task. We hope to fill this gap. Finally, our evaluations demonstrate the potential use of GPT edits in human alignment, especially from a factuality perspective.}, urldate = {2023-11-01}, publisher = {arXiv}, author = {Mishra, Prakamya and Yao, Zonghai and Chen, Shuwei and Wang, Beining and Mittal, Rohan and Yu, Hong}, month = oct, year = {2023}, note = {NeurIPS 2023 Workshop SyntheticData4ML Accepted}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Large Language Models (LLMs) like the GPT and LLaMA families have demonstrated exceptional capabilities in capturing and condensing critical contextual information and achieving state-of-the-art performance in the summarization task. However, community concerns about these models' hallucination issues continue to rise. LLMs sometimes generate factually hallucinated summaries, which can be extremely harmful in clinical-domain NLP tasks (e.g., clinical note summarization), where factually incorrect statements can lead to critically erroneous diagnoses. Fine-tuning LLMs using human feedback has shown promise for aligning LLMs to be factually consistent during generation, but such a training procedure requires high-quality human-annotated data, which can be extremely expensive to obtain in the clinical domain. In this work, we propose a new pipeline that uses ChatGPT instead of human experts to generate high-quality feedback data for improving factual consistency in the clinical note summarization task. We focus specifically on edit feedback because recent work discusses the shortcomings of human alignment via preference feedback in complex situations (such as clinical NLP tasks that require extensive expert knowledge), as well as some advantages of collecting edit feedback from domain experts. In addition, although GPT has reached expert level in many clinical NLP tasks (e.g., USMLE QA), there is little previous work discussing whether GPT can generate expert-level edit feedback for LMs in the clinical note summarization task. We hope to fill this gap. Finally, our evaluations demonstrate the potential use of GPT edits in human alignment, especially from a factuality perspective.
Improving Summarization with Human Edits.
Yao, Z.; Schloss, B. J.; and Selvaraj, S. P.
December 2023.
EMNLP 2023
Paper link bibtex abstract
@misc{yao_improving_2023, title = {Improving {Summarization} with {Human} {Edits}}, url = {http://arxiv.org/abs/2310.05857}, abstract = {Recent work has shown the promise of learning with human feedback paradigms to produce human-determined high-quality text. Existing works use human feedback to train large language models (LLMs) in general domain abstractive summarization and have obtained summary quality exceeding traditional likelihood training. In this paper, we focus on a less explored form of human feedback -- Human Edits. We propose Sequence Alignment (un)Likelihood Training (SALT), a novel technique to use both the human-edited and model-generated data together in the training loop. In addition, we demonstrate simulating Human Edits with ground truth summaries coming from existing training data -- Imitation edits, along with the model-generated summaries obtained after the training, to reduce the need for expensive human-edit data. In our experiments, we extend human feedback exploration from general domain summarization to medical domain summarization. Our results demonstrate the effectiveness of SALT to improve the summary quality with Human and Imitation Edits.}, urldate = {2023-10-10}, publisher = {arXiv}, author = {Yao, Zonghai and Schloss, Benjamin J. and Selvaraj, Sai P.}, month = dec, year = {2023}, note = {EMNLP 2023}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning}, }
Recent work has shown the promise of learning with human feedback paradigms to produce human-determined high-quality text. Existing works use human feedback to train large language models (LLMs) in general domain abstractive summarization and have obtained summary quality exceeding traditional likelihood training. In this paper, we focus on a less explored form of human feedback – Human Edits. We propose Sequence Alignment (un)Likelihood Training (SALT), a novel technique to use both the human-edited and model-generated data together in the training loop. In addition, we demonstrate simulating Human Edits with ground truth summaries coming from existing training data – Imitation edits, along with the model-generated summaries obtained after the training, to reduce the need for expensive human-edit data. In our experiments, we extend human feedback exploration from general domain summarization to medical domain summarization. Our results demonstrate the effectiveness of SALT to improve the summary quality with Human and Imitation Edits.
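The following is a conceptual sketch of how likelihood and unlikelihood terms can be combined in one loss, under the assumption that an alignment between the model-generated draft and the human edit marks each reference token as kept or removed. It illustrates the general (un)likelihood idea, not the paper's SALT implementation.

# Combine likelihood on editor-kept tokens with unlikelihood on editor-removed tokens.
import torch

def salt_style_loss(log_probs: torch.Tensor,
                    target_ids: torch.Tensor,
                    kept_mask: torch.Tensor,
                    removed_mask: torch.Tensor,
                    alpha: float = 1.0) -> torch.Tensor:
    """
    log_probs:    (seq_len, vocab) token log-probabilities from the model
    target_ids:   (seq_len,) token ids of the aligned reference sequence
    kept_mask:    1.0 where the human editor kept the token  -> likelihood term
    removed_mask: 1.0 where the editor removed/changed it    -> unlikelihood term
    """
    token_logp = log_probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)
    likelihood = -(kept_mask * token_logp).sum()
    # Unlikelihood: push down the probability of tokens the editor rejected.
    unlikelihood = -(removed_mask * torch.log1p(-token_logp.exp().clamp(max=1 - 1e-6))).sum()
    return likelihood + alpha * unlikelihood

# Toy usage: 4-token reference over a 6-word vocabulary.
logp = torch.log_softmax(torch.randn(4, 6), dim=-1)
targets = torch.tensor([1, 3, 2, 5])
kept = torch.tensor([1., 1., 0., 1.])
removed = torch.tensor([0., 0., 1., 0.])
print(salt_style_loss(logp, targets, kept, removed))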
PaniniQA: Enhancing Patient Education Through Interactive Question Answering.
Cai, P.; Yao, Z.; Liu, F.; Wang, D.; Reilly, M.; Zhou, H.; Li, L.; Cao, Y.; Kapoor, A.; Bajracharya, A.; Berlowitz, D.; and Yu, H.
Transactions of the Association for Computational Linguistics. August 2023.
Equal contributions for the first two authors.
Paper link bibtex abstract
@article{cai_paniniqa_2023, title = {{PaniniQA}: {Enhancing} {Patient} {Education} {Through} {Interactive} {Question} {Answering}}, shorttitle = {{PaniniQA}}, url = {http://arxiv.org/abs/2308.03253}, abstract = {Patient portal allows discharged patients to access their personalized discharge instructions in electronic health records (EHRs). However, many patients have difficulty understanding or memorizing their discharge instructions. In this paper, we present PaniniQA, a patient-centric interactive question answering system designed to help patients understand their discharge instructions. PaniniQA first identifies important clinical content from patients' discharge instructions and then formulates patient-specific educational questions. In addition, PaniniQA is also equipped with answer verification functionality to provide timely feedback to correct patients' misunderstandings. Our comprehensive automatic and human evaluation results demonstrate our PaniniQA is capable of improving patients' mastery of their medical instructions through effective interactions}, urldate = {2023-08-08}, journal = {Transactions of the Association for Computational Linguistics}, author = {Cai, Pengshan and Yao, Zonghai and Liu, Fei and Wang, Dakuo and Reilly, Meghan and Zhou, Huixue and Li, Lingxi and Cao, Yi and Kapoor, Alok and Bajracharya, Adarsha and Berlowitz, Dan and Yu, Hong}, month = aug, year = {2023}, note = {Equal contributions for the first two authors.}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Patient portals allow discharged patients to access their personalized discharge instructions in electronic health records (EHRs). However, many patients have difficulty understanding or memorizing their discharge instructions. In this paper, we present PaniniQA, a patient-centric interactive question answering system designed to help patients understand their discharge instructions. PaniniQA first identifies important clinical content from patients' discharge instructions and then formulates patient-specific educational questions. In addition, PaniniQA is equipped with answer verification functionality to provide timely feedback to correct patients' misunderstandings. Our comprehensive automatic and human evaluation results demonstrate that PaniniQA is capable of improving patients' mastery of their medical instructions through effective interactions.
Revisiting the Architectures like Pointer Networks to Efficiently Improve the Next Word Distribution, Summarization Factuality, and Beyond.
Chang, H.; Yao, Z.; Gon, A.; Yu, H.; and McCallum, A.
July 2023.
ACL 2023, equal contribution from the first two authors.
Paper link bibtex abstract
@misc{chang_revisiting_2023, address = {Canada}, title = {Revisiting the {Architectures} like {Pointer} {Networks} to {Efficiently} {Improve} the {Next} {Word} {Distribution}, {Summarization} {Factuality}, and {Beyond}}, url = {http://arxiv.org/abs/2305.12289}, abstract = {Is the output softmax layer, which is adopted by most language models (LMs), always the best way to compute the next word probability? Given so many attention layers in a modern transformer-based LM, are the pointer networks redundant nowadays? In this study, we discover that the answers to both questions are no. This is because the softmax bottleneck sometimes prevents the LMs from predicting the desired distribution and the pointer networks can be used to break the bottleneck efficiently. Based on the finding, we propose several softmax alternatives by simplifying the pointer networks and accelerating the word-by-word rerankers. In GPT-2, our proposals are significantly better and more efficient than mixture of softmax, a state-of-the-art softmax alternative. In summarization experiments, without significantly decreasing its training/testing speed, our best method based on T5-Small improves factCC score by 2 points in CNN/DM and XSUM dataset, and improves MAUVE scores by 30\% in BookSum paragraph-level dataset.}, urldate = {2023-05-23}, publisher = {arXiv}, author = {Chang, Haw-Shiuan and Yao, Zonghai and Gon, Alolika and Yu, Hong and McCallum, Andrew}, month = jul, year = {2023}, note = {ACL 2023, equal contribution from the first two authors.}, keywords = {Computer Science - Computation and Language}, }
Is the output softmax layer, which is adopted by most language models (LMs), always the best way to compute the next word probability? Given so many attention layers in a modern transformer-based LM, are pointer networks redundant nowadays? In this study, we discover that the answer to both questions is no. This is because the softmax bottleneck sometimes prevents the LMs from predicting the desired distribution, and pointer networks can be used to break the bottleneck efficiently. Based on this finding, we propose several softmax alternatives by simplifying the pointer networks and accelerating the word-by-word rerankers. In GPT-2, our proposals are significantly better and more efficient than mixture of softmax, a state-of-the-art softmax alternative. In summarization experiments, without significantly decreasing its training/testing speed, our best method based on T5-Small improves the factCC score by 2 points on the CNN/DM and XSUM datasets, and improves MAUVE scores by 30% on the BookSum paragraph-level dataset.
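For readers unfamiliar with pointer networks, the snippet below shows a generic pointer-style mixture for the next-word distribution: interpolating the softmax output with a copy distribution over context tokens via a gate. It illustrates the family of architectures the abstract refers to, not the paper's specific simplification.

# Generic pointer/copy mixture over the vocabulary distribution.
import torch
import torch.nn.functional as F

def pointer_mixture(vocab_logits: torch.Tensor,       # (vocab,) logits from the LM head
                    copy_scores: torch.Tensor,        # (ctx_len,) attention scores over context
                    context_ids: torch.Tensor,        # (ctx_len,) vocab ids of context tokens
                    gate: torch.Tensor) -> torch.Tensor:   # scalar in (0, 1)
    p_vocab = F.softmax(vocab_logits, dim=-1)
    p_copy_ctx = F.softmax(copy_scores, dim=-1)
    # Scatter the copy probabilities back onto the vocabulary.
    p_copy = torch.zeros_like(p_vocab).scatter_add_(0, context_ids, p_copy_ctx)
    return gate * p_vocab + (1 - gate) * p_copy       # final next-word distribution

# Toy usage with a 10-word vocabulary and a 4-token context.
dist = pointer_mixture(torch.randn(10), torch.randn(4),
                       torch.tensor([2, 5, 5, 7]), torch.tensor(0.7))
print(dist.sum())   # ~1.0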
Automated identification of eviction status from electronic health record notes.
Yao, Z.; Tsai, J.; Liu, W.; Levy, D. A.; Druhl, E.; Reisman, J. I.; and Yu, H.
Journal of the American Medical Informatics Association, ocad081. May 2023.
Paper doi link bibtex abstract
@article{yao_automated_2023, title = {Automated identification of eviction status from electronic health record notes}, issn = {1527-974X}, url = {https://doi.org/10.1093/jamia/ocad081}, doi = {10.1093/jamia/ocad081}, abstract = {Evictions are important social and behavioral determinants of health. Evictions are associated with a cascade of negative events that can lead to unemployment, housing insecurity/homelessness, long-term poverty, and mental health problems. In this study, we developed a natural language processing system to automatically detect eviction status from electronic health record (EHR) notes.We first defined eviction status (eviction presence and eviction period) and then annotated eviction status in 5000 EHR notes from the Veterans Health Administration (VHA). We developed a novel model, KIRESH, that has shown to substantially outperform other state-of-the-art models such as fine-tuning pretrained language models like BioBERT and Bio\_ClinicalBERT. Moreover, we designed a novel prompt to further improve the model performance by using the intrinsic connection between the 2 subtasks of eviction presence and period prediction. Finally, we used the Temperature Scaling-based Calibration on our KIRESH-Prompt method to avoid overconfidence issues arising from the imbalance dataset.KIRESH-Prompt substantially outperformed strong baseline models including fine-tuning the Bio\_ClinicalBERT model to achieve 0.74672 MCC, 0.71153 Macro-F1, and 0.83396 Micro-F1 in predicting eviction period and 0.66827 MCC, 0.62734 Macro-F1, and 0.7863 Micro-F1 in predicting eviction presence. We also conducted additional experiments on a benchmark social determinants of health (SBDH) dataset to demonstrate the generalizability of our methods.KIRESH-Prompt has substantially improved eviction status classification. We plan to deploy KIRESH-Prompt to the VHA EHRs as an eviction surveillance system to help address the US Veterans’ housing insecurity.}, urldate = {2023-05-19}, journal = {Journal of the American Medical Informatics Association}, author = {Yao, Zonghai and Tsai, Jack and Liu, Weisong and