You can create a new website with this list, or embed it in an existing web page by copying and pasting any of the following snippets: JavaScript (easiest), PHP, or iFrame (not recommended).

JavaScript (easiest):
<script src="https://bibbase.org/show?bib=http://fenway.cs.uml.edu/papers/pubs-all.bib&nocache=1&simplegroups=1&groupby=year&proxy=1&jsonp=1"></script>
PHP:

<?php
$contents = file_get_contents("https://bibbase.org/show?bib=http://fenway.cs.uml.edu/papers/pubs-all.bib&nocache=1&simplegroups=1&groupby=year&proxy=1");
print_r($contents);
?>
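The snippet above prints the fetched HTML as-is and produces nothing if the request fails. If you want an explicit fallback, a minimal sketch (assuming only standard PHP with allow_url_fopen enabled; the URL simply repeats the one above) could look like:

<?php
// Sketch: fetch the rendered list, with a fallback when BibBase is unreachable.
// file_get_contents() returns false (and emits a warning) on failure,
// so guard the result before printing; @ suppresses the warning.
$url = "https://bibbase.org/show?bib=http://fenway.cs.uml.edu/papers/pubs-all.bib"
     . "&nocache=1&simplegroups=1&groupby=year&proxy=1";
$contents = @file_get_contents($url);
if ($contents === false) {
    echo "<!-- publication list is temporarily unavailable -->";
} else {
    echo $contents;
}
?>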
iFrame (not recommended):

<iframe src="https://bibbase.org/show?bib=http://fenway.cs.uml.edu/papers/pubs-all.bib&nocache=1&simplegroups=1&groupby=year&proxy=1"></iframe>
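A bare iframe renders at the browser's small default size, which clips the list. Standard HTML attributes can size it to fit; the width, height, and title values below are placeholder choices, not BibBase requirements:

<!-- Sizing values are illustrative; adjust to your page layout. -->
<iframe src="https://bibbase.org/show?bib=http://fenway.cs.uml.edu/papers/pubs-all.bib&nocache=1&simplegroups=1&groupby=year&proxy=1"
        title="Publication list"
        width="100%" height="800"
        style="border: none;"></iframe>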
For more details, see the documentation.
2024 (9)
LocalTweets to LocalHealth: A Mental Health Surveillance Framework Based on Twitter Data.
Deshpande, V.; Lee, M.; Yao, Z.; Zhang, Z.; Gibbons, J. B.; and Yu, H.
March 2024.
arXiv:2402.13452 [cs]
@misc{deshpande_localtweets_2024, title = {{LocalTweets} to {LocalHealth}: {A} {Mental} {Health} {Surveillance} {Framework} {Based} on {Twitter} {Data}}, shorttitle = {{LocalTweets} to {LocalHealth}}, url = {http://arxiv.org/abs/2402.13452}, abstract = {Prior research on Twitter (now X) data has provided positive evidence of its utility in developing supplementary health surveillance systems. In this study, we present a new framework to surveil public health, focusing on mental health (MH) outcomes. We hypothesize that locally posted tweets are indicative of local MH outcomes and collect tweets posted from 765 neighborhoods (census block groups) in the USA. We pair these tweets from each neighborhood with the corresponding MH outcome reported by the Center for Disease Control (CDC) to create a benchmark dataset, LocalTweets. With LocalTweets, we present the first population-level evaluation task for Twitter-based MH surveillance systems. We then develop an efficient and effective method, LocalHealth, for predicting MH outcomes based on LocalTweets. When used with GPT3.5, LocalHealth achieves the highest F1-score and accuracy of 0.7429 and 79.78{\textbackslash}\%, respectively, a 59{\textbackslash}\% improvement in F1-score over the GPT3.5 in zero-shot setting. We also utilize LocalHealth to extrapolate CDC's estimates to proxy unreported neighborhoods, achieving an F1-score of 0.7291. Our work suggests that Twitter data can be effectively leveraged to simulate neighborhood-level MH outcomes.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Deshpande, Vijeta and Lee, Minhwa and Yao, Zonghai and Zhang, Zihao and Gibbons, Jason Brian and Yu, Hong}, month = mar, year = {2024}, note = {arXiv:2402.13452 [cs]}, keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Social and Information Networks}, }
Prior research on Twitter (now X) data has provided positive evidence of its utility in developing supplementary health surveillance systems. In this study, we present a new framework to surveil public health, focusing on mental health (MH) outcomes. We hypothesize that locally posted tweets are indicative of local MH outcomes and collect tweets posted from 765 neighborhoods (census block groups) in the USA. We pair these tweets from each neighborhood with the corresponding MH outcome reported by the Center for Disease Control (CDC) to create a benchmark dataset, LocalTweets. With LocalTweets, we present the first population-level evaluation task for Twitter-based MH surveillance systems. We then develop an efficient and effective method, LocalHealth, for predicting MH outcomes based on LocalTweets. When used with GPT3.5, LocalHealth achieves the highest F1-score and accuracy of 0.7429 and 79.78\%, respectively, a 59\% improvement in F1-score over the GPT3.5 in zero-shot setting. We also utilize LocalHealth to extrapolate CDC's estimates to proxy unreported neighborhoods, achieving an F1-score of 0.7291. Our work suggests that Twitter data can be effectively leveraged to simulate neighborhood-level MH outcomes.
SYNFAC-EDIT: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization.
Mishra, P.; Yao, Z.; Vashisht, P.; Ouyang, F.; Wang, B.; Mody, V. D.; and Yu, H.
April 2024.
arXiv:2402.13919 [cs]
@misc{mishra_synfac-edit_2024, title = {{SYNFAC}-{EDIT}: {Synthetic} {Imitation} {Edit} {Feedback} for {Factual} {Alignment} in {Clinical} {Summarization}}, shorttitle = {{SYNFAC}-{EDIT}}, url = {http://arxiv.org/abs/2402.13919}, abstract = {Large Language Models (LLMs) such as GPT \& Llama have demonstrated significant achievements in summarization tasks but struggle with factual inaccuracies, a critical issue in clinical NLP applications where errors could lead to serious consequences. To counter the high costs and limited availability of expert-annotated data for factual alignment, this study introduces an innovative pipeline that utilizes {\textgreater}100B parameter GPT variants like GPT-3.5 \& GPT-4 to act as synthetic experts to generate high-quality synthetics feedback aimed at enhancing factual consistency in clinical note summarization. Our research primarily focuses on edit feedback generated by these synthetic feedback experts without additional human annotations, mirroring and optimizing the practical scenario in which medical professionals refine AI system outputs. Although such 100B+ parameter GPT variants have proven to demonstrate expertise in various clinical NLP tasks, such as the Medical Licensing Examination, there is scant research on their capacity to act as synthetic feedback experts and deliver expert-level edit feedback for improving the generation quality of weaker ({\textless}10B parameter) LLMs like GPT-2 (1.5B) \& Llama 2 (7B) in clinical domain. So in this work, we leverage 100B+ GPT variants to act as synthetic feedback experts offering expert-level edit feedback, that is used to reduce hallucinations and align weaker ({\textless}10B parameter) LLMs with medical facts using two distinct alignment algorithms (DPO \& SALT), endeavoring to narrow the divide between AI-generated content and factual accuracy. This highlights the substantial potential of LLM-based synthetic edits in enhancing the alignment of clinical factuality.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Mishra, Prakamya and Yao, Zonghai and Vashisht, Parth and Ouyang, Feiyun and Wang, Beining and Mody, Vidhi Dhaval and Yu, Hong}, month = apr, year = {2024}, note = {arXiv:2402.13919 [cs]}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Large Language Models (LLMs) such as GPT & Llama have demonstrated significant achievements in summarization tasks but struggle with factual inaccuracies, a critical issue in clinical NLP applications where errors could lead to serious consequences. To counter the high costs and limited availability of expert-annotated data for factual alignment, this study introduces an innovative pipeline that utilizes >100B parameter GPT variants like GPT-3.5 & GPT-4 to act as synthetic experts to generate high-quality synthetic feedback aimed at enhancing factual consistency in clinical note summarization. Our research primarily focuses on edit feedback generated by these synthetic feedback experts without additional human annotations, mirroring and optimizing the practical scenario in which medical professionals refine AI system outputs. Although such 100B+ parameter GPT variants have proven to demonstrate expertise in various clinical NLP tasks, such as the Medical Licensing Examination, there is scant research on their capacity to act as synthetic feedback experts and deliver expert-level edit feedback for improving the generation quality of weaker (<10B parameter) LLMs like GPT-2 (1.5B) & Llama 2 (7B) in the clinical domain. In this work, we leverage 100B+ GPT variants to act as synthetic feedback experts offering expert-level edit feedback, which is used to reduce hallucinations and align weaker (<10B parameter) LLMs with medical facts using two distinct alignment algorithms (DPO & SALT), endeavoring to narrow the divide between AI-generated content and factual accuracy. This highlights the substantial potential of LLM-based synthetic edits in enhancing the alignment of clinical factuality.
Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data.
Xu, X.; Yao, B.; Dong, Y.; Gabriel, S.; Yu, H.; Hendler, J.; Ghassemi, M.; Dey, A. K.; and Wang, D.
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(1): 1–32. March 2024.
@article{xu_mental-llm_2024, title = {Mental-{LLM}: {Leveraging} {Large} {Language} {Models} for {Mental} {Health} {Prediction} via {Online} {Text} {Data}}, volume = {8}, issn = {2474-9567}, shorttitle = {Mental-{LLM}}, url = {https://dl.acm.org/doi/10.1145/3643540}, doi = {10.1145/3643540}, abstract = {Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9\% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8\%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research.}, language = {en}, number = {1}, urldate = {2024-09-03}, journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies}, author = {Xu, Xuhai and Yao, Bingsheng and Dong, Yuanzhe and Gabriel, Saadia and Yu, Hong and Hendler, James and Ghassemi, Marzyeh and Dey, Anind K. and Wang, Dakuo}, month = mar, year = {2024}, pages = {1--32}, }
Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research.
UMass-BioNLP at MEDIQA-M3G 2024: DermPrompt – A Systematic Exploration of Prompt Engineering with GPT-4V for Dermatological Diagnosis.
Vashisht, P.; Lodha, A.; Maddipatla, M.; Yao, Z.; Mitra, A.; Yang, Z.; Wang, J.; Kwon, S.; and Yu, H.
May 2024.
arXiv:2404.17749 [cs]
@misc{vashisht_umass-bionlp_2024, title = {{UMass}-{BioNLP} at {MEDIQA}-{M3G} 2024: {DermPrompt} -- {A} {Systematic} {Exploration} of {Prompt} {Engineering} with {GPT}-{4V} for {Dermatological} {Diagnosis}}, shorttitle = {{UMass}-{BioNLP} at {MEDIQA}-{M3G} 2024}, url = {http://arxiv.org/abs/2404.17749}, abstract = {This paper presents our team's participation in the MEDIQA-ClinicalNLP2024 shared task B. We present a novel approach to diagnosing clinical dermatology cases by integrating large multimodal models, specifically leveraging the capabilities of GPT-4V under a retriever and a re-ranker framework. Our investigation reveals that GPT-4V, when used as a retrieval agent, can accurately retrieve the correct skin condition 85\% of the time using dermatological images and brief patient histories. Additionally, we empirically show that Naive Chain-of-Thought (CoT) works well for retrieval while Medical Guidelines Grounded CoT is required for accurate dermatological diagnosis. Further, we introduce a Multi-Agent Conversation (MAC) framework and show its superior performance and potential over the best CoT strategy. The experiments suggest that using naive CoT for retrieval and multi-agent conversation for critique-based diagnosis, GPT-4V can lead to an early and accurate diagnosis of dermatological conditions. The implications of this work extend to improving diagnostic workflows, supporting dermatological education, and enhancing patient care by providing a scalable, accessible, and accurate diagnostic tool.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Vashisht, Parth and Lodha, Abhilasha and Maddipatla, Mukta and Yao, Zonghai and Mitra, Avijit and Yang, Zhichao and Wang, Junda and Kwon, Sunjae and Yu, Hong}, month = may, year = {2024}, note = {arXiv:2404.17749 [cs]}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
This paper presents our team's participation in the MEDIQA-ClinicalNLP2024 shared task B. We present a novel approach to diagnosing clinical dermatology cases by integrating large multimodal models, specifically leveraging the capabilities of GPT-4V under a retriever and a re-ranker framework. Our investigation reveals that GPT-4V, when used as a retrieval agent, can accurately retrieve the correct skin condition 85% of the time using dermatological images and brief patient histories. Additionally, we empirically show that Naive Chain-of-Thought (CoT) works well for retrieval while Medical Guidelines Grounded CoT is required for accurate dermatological diagnosis. Further, we introduce a Multi-Agent Conversation (MAC) framework and show its superior performance and potential over the best CoT strategy. The experiments suggest that using naive CoT for retrieval and multi-agent conversation for critique-based diagnosis, GPT-4V can lead to an early and accurate diagnosis of dermatological conditions. The implications of this work extend to improving diagnostic workflows, supporting dermatological education, and enhancing patient care by providing a scalable, accessible, and accurate diagnostic tool.
Synth-SBDH: A Synthetic Dataset of Social and Behavioral Determinants of Health for Clinical Text.
Mitra, A.; Druhl, E.; Goodwin, R.; and Yu, H.
June 2024.
arXiv:2406.06056 [cs]
@misc{mitra_synth-sbdh_2024, title = {Synth-{SBDH}: {A} {Synthetic} {Dataset} of {Social} and {Behavioral} {Determinants} of {Health} for {Clinical} {Text}}, shorttitle = {Synth-{SBDH}}, url = {http://arxiv.org/abs/2406.06056}, abstract = {Social and behavioral determinants of health (SBDH) play a crucial role in health outcomes and are frequently documented in clinical text. Automatically extracting SBDH information from clinical text relies on publicly available good-quality datasets. However, existing SBDH datasets exhibit substantial limitations in their availability and coverage. In this study, we introduce Synth-SBDH, a novel synthetic dataset with detailed SBDH annotations, encompassing status, temporal information, and rationale across 15 SBDH categories. We showcase the utility of Synth-SBDH on three tasks using real-world clinical datasets from two distinct hospital settings, highlighting its versatility, generalizability, and distillation capabilities. Models trained on Synth-SBDH consistently outperform counterparts with no Synth-SBDH training, achieving up to 62.5\% macro-F improvements. Additionally, Synth-SBDH proves effective for rare SBDH categories and under-resource constraints. Human evaluation demonstrates a Human-LLM alignment of 71.06\% and uncovers areas for future refinements.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Mitra, Avijit and Druhl, Emily and Goodwin, Raelene and Yu, Hong}, month = jun, year = {2024}, note = {arXiv:2406.06056 [cs]}, keywords = {Computer Science - Computation and Language}, }
Social and behavioral determinants of health (SBDH) play a crucial role in health outcomes and are frequently documented in clinical text. Automatically extracting SBDH information from clinical text relies on publicly available good-quality datasets. However, existing SBDH datasets exhibit substantial limitations in their availability and coverage. In this study, we introduce Synth-SBDH, a novel synthetic dataset with detailed SBDH annotations, encompassing status, temporal information, and rationale across 15 SBDH categories. We showcase the utility of Synth-SBDH on three tasks using real-world clinical datasets from two distinct hospital settings, highlighting its versatility, generalizability, and distillation capabilities. Models trained on Synth-SBDH consistently outperform counterparts with no Synth-SBDH training, achieving up to 62.5% macro-F improvements. Additionally, Synth-SBDH proves effective for rare SBDH categories and under-resource constraints. Human evaluation demonstrates a Human-LLM alignment of 71.06% and uncovers areas for future refinements.
ReadCtrl: Personalizing text generation with readability-controlled instruction learning.
Tran, H.; Yao, Z.; Li, L.; and Yu, H.
June 2024.
arXiv:2406.09205 [cs]
@misc{tran_readctrl_2024, title = {{ReadCtrl}: {Personalizing} text generation with readability-controlled instruction learning}, shorttitle = {{ReadCtrl}}, url = {http://arxiv.org/abs/2406.09205}, abstract = {Content generation conditioning on users's readability is an important application for personalization. In an era of large language models (LLMs), readability-controlled text generation based on LLMs has become increasingly important. This paper introduces a novel methodology called "Readability-Controlled Instruction Learning (ReadCtrl)," which aims to instruction-tune LLMs to tailor users' readability levels. Unlike the traditional methods, which primarily focused on categorical readability adjustments typically classified as high, medium, and low or expert and layperson levels with limited success, ReadCtrl introduces a dynamic framework that enables LLMs to generate content at various (near continuous level) complexity levels, thereby enhancing their versatility across different applications. Our results show that the ReadCtrl-Mistral-7B models significantly outperformed strong baseline models such as GPT-4 and Claude-3, with a win rate of 52.1\%:35.7\% against GPT-4 in human evaluations. Furthermore, Read-Ctrl has shown significant improvements in automatic evaluations, as evidenced by better readability metrics (e.g., FOG, FKGL) and generation quality metrics (e.g., BLEU, SARI, SummaC-Factuality, UniEval-Consistency and Coherence). These results underscore Read-Ctrl's effectiveness and tenacity in producing high-quality, contextually appropriate outputs that closely align with targeted readability levels, marking a significant advancement in personalized content generation using LLMs.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Tran, Hieu and Yao, Zonghai and Li, Lingxi and Yu, Hong}, month = jun, year = {2024}, note = {arXiv:2406.09205 [cs]}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Content generation conditioning on users' readability is an important application for personalization. In an era of large language models (LLMs), readability-controlled text generation based on LLMs has become increasingly important. This paper introduces a novel methodology called "Readability-Controlled Instruction Learning (ReadCtrl)," which aims to instruction-tune LLMs to tailor users' readability levels. Unlike the traditional methods, which primarily focused on categorical readability adjustments typically classified as high, medium, and low or expert and layperson levels with limited success, ReadCtrl introduces a dynamic framework that enables LLMs to generate content at various (near continuous level) complexity levels, thereby enhancing their versatility across different applications. Our results show that the ReadCtrl-Mistral-7B models significantly outperformed strong baseline models such as GPT-4 and Claude-3, with a win rate of 52.1%:35.7% against GPT-4 in human evaluations. Furthermore, ReadCtrl has shown significant improvements in automatic evaluations, as evidenced by better readability metrics (e.g., FOG, FKGL) and generation quality metrics (e.g., BLEU, SARI, SummaC-Factuality, UniEval-Consistency and Coherence). These results underscore ReadCtrl's effectiveness and tenacity in producing high-quality, contextually appropriate outputs that closely align with targeted readability levels, marking a significant advancement in personalized content generation using LLMs.
A Psychology-based Unified Dynamic Framework for Curriculum Learning.
Meng, G.; Zeng, Q.; Lalor, J. P.; and Yu, H.
August 2024.
arXiv:2408.05326 [cs]
@misc{meng_psychology-based_2024, title = {A {Psychology}-based {Unified} {Dynamic} {Framework} for {Curriculum} {Learning}}, url = {http://arxiv.org/abs/2408.05326}, abstract = {Directly learning from examples of random difficulty levels is often challenging for both humans and machine learning models. A more effective strategy involves exposing learners to examples in a progressive order, from easy to difficult. Curriculum Learning (CL) has been proposed to implement this strategy in machine learning model training. However, two key challenges persist in CL framework design: defining the difficulty of training data and determining the appropriate amount of data to input at each training step. This paper presents a Psychology-based Unified Dynamic Framework for Curriculum Learning (PUDF), drawing inspiration from psychometrics. We quantify the difficulty of training data by applying Item Response Theory (IRT) to responses from Artificial Crowds (AC). This theory-driven IRT-AC approach leads to global (i.e., model-independent) and interpretable difficulty values. Leveraging IRT, we propose a Dynamic Data Selection via Model Ability Estimation (DDS-MAE) strategy to schedule the appropriate amount of data during model training. Since our difficulty labeling and model ability estimation are based on a consistent theory, namely IRT, their values are comparable within the same scope, potentially leading to a faster convergence compared to the other CL methods. Experimental results demonstrate that fine-tuning pre-trained language models with PUDF enhances their performance on the GLUE benchmark. Moreover, PUDF surpasses other state-of-the-art (SOTA) CL methods on the GLUE benchmark. We further explore the components of PUDF, namely the difficulty measurer (IRT-AC) and the training scheduler (DDS-MAE) qualitatively and quantitatively. Lastly, we conduct an ablation study to clarify which components of PUDF contribute to faster convergence and higher accuracy.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Meng, Guangyu and Zeng, Qingkai and Lalor, John P. and Yu, Hong}, month = aug, year = {2024}, note = {arXiv:2408.05326 [cs]}, keywords = {Computer Science - Computation and Language}, }
Directly learning from examples of random difficulty levels is often challenging for both humans and machine learning models. A more effective strategy involves exposing learners to examples in a progressive order, from easy to difficult. Curriculum Learning (CL) has been proposed to implement this strategy in machine learning model training. However, two key challenges persist in CL framework design: defining the difficulty of training data and determining the appropriate amount of data to input at each training step. This paper presents a Psychology-based Unified Dynamic Framework for Curriculum Learning (PUDF), drawing inspiration from psychometrics. We quantify the difficulty of training data by applying Item Response Theory (IRT) to responses from Artificial Crowds (AC). This theory-driven IRT-AC approach leads to global (i.e., model-independent) and interpretable difficulty values. Leveraging IRT, we propose a Dynamic Data Selection via Model Ability Estimation (DDS-MAE) strategy to schedule the appropriate amount of data during model training. Since our difficulty labeling and model ability estimation are based on a consistent theory, namely IRT, their values are comparable within the same scope, potentially leading to a faster convergence compared to the other CL methods. Experimental results demonstrate that fine-tuning pre-trained language models with PUDF enhances their performance on the GLUE benchmark. Moreover, PUDF surpasses other state-of-the-art (SOTA) CL methods on the GLUE benchmark. We further explore the components of PUDF, namely the difficulty measurer (IRT-AC) and the training scheduler (DDS-MAE) qualitatively and quantitatively. Lastly, we conduct an ablation study to clarify which components of PUDF contribute to faster convergence and higher accuracy.
Large Language Model-based Role-Playing for Personalized Medical Jargon Extraction.
Lim, J. H.; Kwon, S.; Yao, Z.; Lalor, J. P.; and Yu, H.
August 2024.
arXiv:2408.05555 [cs]
@misc{lim_large_2024, title = {Large {Language} {Model}-based {Role}-{Playing} for {Personalized} {Medical} {Jargon} {Extraction}}, url = {http://arxiv.org/abs/2408.05555}, abstract = {Previous studies reveal that Electronic Health Records (EHR), which have been widely adopted in the U.S. to allow patients to access their personal medical information, do not have high readability to patients due to the prevalence of medical jargon. Tailoring medical notes to individual comprehension by identifying jargon that is difficult for each person will enhance the utility of generative models. We present the first quantitative analysis to measure the impact of role-playing in LLM in medical term extraction. By comparing the results of Mechanical Turk workers over 20 sentences, our study demonstrates that LLM role-playing improves F1 scores in 95\% of cases across 14 different socio-demographic backgrounds. Furthermore, applying role-playing with in-context learning outperformed the previous state-of-the-art models. Our research showed that ChatGPT can improve traditional medical term extraction systems by utilizing role-play to deliver personalized patient education, a potential that previous models had not achieved.}, urldate = {2024-09-03}, publisher = {arXiv}, author = {Lim, Jung Hoon and Kwon, Sunjae and Yao, Zonghai and Lalor, John P. and Yu, Hong}, month = aug, year = {2024}, note = {arXiv:2408.05555 [cs]}, keywords = {Computer Science - Computation and Language}, }
Previous studies reveal that Electronic Health Records (EHR), which have been widely adopted in the U.S. to allow patients to access their personal medical information, do not have high readability to patients due to the prevalence of medical jargon. Tailoring medical notes to individual comprehension by identifying jargon that is difficult for each person will enhance the utility of generative models. We present the first quantitative analysis to measure the impact of role-playing in LLM in medical term extraction. By comparing the results of Mechanical Turk workers over 20 sentences, our study demonstrates that LLM role-playing improves F1 scores in 95% of cases across 14 different socio-demographic backgrounds. Furthermore, applying role-playing with in-context learning outperformed the previous state-of-the-art models. Our research showed that ChatGPT can improve traditional medical term extraction systems by utilizing role-play to deliver personalized patient education, a potential that previous models had not achieved.
ODD: A Benchmark Dataset for the Natural Language Processing based Opioid Related Aberrant Behavior Detection.
Kwon, S.; Wang, X.; Liu, W.; Druhl, E.; Sung, M. L.; Reisman, J. I.; Li, W.; Kerns, R. D.; Becker, W.; and Yu, H.
In June 2024. arXiv.
arXiv:2307.02591 [cs]
@inproceedings{kwon_odd_2024, title = {{ODD}: {A} {Benchmark} {Dataset} for the {Natural} {Language} {Processing} based {Opioid} {Related} {Aberrant} {Behavior} {Detection}}, shorttitle = {{ODD}}, url = {http://arxiv.org/abs/2307.02591}, doi = {10.48550/arXiv.2307.02591}, abstract = {Opioid related aberrant behaviors (ORABs) present novel risk factors for opioid overdose. This paper introduces a novel biomedical natural language processing benchmark dataset named ODD, for ORAB Detection Dataset. ODD is an expert-annotated dataset designed to identify ORABs from patients' EHR notes and classify them into nine categories; 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3) Opioids, 4) Indication, 5) Diagnosed opioid dependency, 6) Benzodiazepines, 7) Medication Changes, 8) Central Nervous System-related, and 9) Social Determinants of Health. We explored two state-of-the-art natural language processing models (fine-tuning and prompt-tuning approaches) to identify ORAB. Experimental results show that the prompt-tuning models outperformed the fine-tuning models in most categories and the gains were especially higher among uncommon categories (Suggested Aberrant Behavior, Confirmed Aberrant Behaviors, Diagnosed Opioid Dependence, and Medication Change). Although the best model achieved the highest 88.17\% on macro average area under precision recall curve, uncommon classes still have a large room for performance improvement. ODD is publicly available.}, urldate = {2024-05-21}, publisher = {arXiv}, author = {Kwon, Sunjae and Wang, Xun and Liu, Weisong and Druhl, Emily and Sung, Minhee L. and Reisman, Joel I. and Li, Wenjun and Kerns, Robert D. and Becker, William and Yu, Hong}, month = jun, year = {2024}, note = {Number: arXiv:2307.02591 arXiv:2307.02591 [cs]}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Opioid related aberrant behaviors (ORABs) present novel risk factors for opioid overdose. This paper introduces a novel biomedical natural language processing benchmark dataset named ODD, for ORAB Detection Dataset. ODD is an expert-annotated dataset designed to identify ORABs from patients' EHR notes and classify them into nine categories; 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3) Opioids, 4) Indication, 5) Diagnosed opioid dependency, 6) Benzodiazepines, 7) Medication Changes, 8) Central Nervous System-related, and 9) Social Determinants of Health. We explored two state-of-the-art natural language processing models (fine-tuning and prompt-tuning approaches) to identify ORAB. Experimental results show that the prompt-tuning models outperformed the fine-tuning models in most categories and the gains were especially higher among uncommon categories (Suggested Aberrant Behavior, Confirmed Aberrant Behaviors, Diagnosed Opioid Dependence, and Medication Change). Although the best model achieved the highest 88.17% on macro average area under precision recall curve, uncommon classes still have a large room for performance improvement. ODD is publicly available.
2023 (22)
An Investigation of the Representation of Social Determinants of Health in the UMLS.
Rawat, B. P. S.; Keating, H.; Goodwin, R.; Druhl, E.; and Yu, H.
AMIA Annual Symposium Proceedings, 2022: 912–921. April 2023.
@article{rawat_investigation_2023, title = {An {Investigation} of the {Representation} of {Social} {Determinants} of {Health} in the {UMLS}}, volume = {2022}, issn = {1942-597X}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10148271/}, abstract = {Social Determinants of Health (SDOH) are the conditions in which people are born, live, work, and age. Unified Medical Language System (UMLS) incorporates SDOH concepts; but few have evaluated its coverage and quality. With 15,649 expert-annotated SDOH mentions from 3176 randomly selected electronic health record (EHR) notes, we found that 100\% SDOH mentions can be mapped to at least one UMLS concept, indicating a good coverage of SDOH. However, we discovered a few challenges for the UMLS’s representation of SDOH. Next, we developed a multi-step framework to identify SDOH concepts from UMLS, and a clinical BERT-based classification algorithm to assign each identified SDOH concept to one of the six general categories. Our multi-step framework extracted a total of 198, 677 SDOH concepts from the UMLS and the SDOH category classification system attained an accuracy of 91\%. We also built EASE: an open-source tool to Extract SDOH from EHRs.}, urldate = {2024-04-10}, journal = {AMIA Annual Symposium Proceedings}, author = {Rawat, Bhanu Pratap Singh and Keating, Heather and Goodwin, Raelene and Druhl, Emily and Yu, Hong}, month = apr, year = {2023}, pmid = {37128364}, pmcid = {PMC10148271}, pages = {912--921}, }
Social Determinants of Health (SDOH) are the conditions in which people are born, live, work, and age. The Unified Medical Language System (UMLS) incorporates SDOH concepts, but few have evaluated its coverage and quality. With 15,649 expert-annotated SDOH mentions from 3,176 randomly selected electronic health record (EHR) notes, we found that 100% of SDOH mentions can be mapped to at least one UMLS concept, indicating good coverage of SDOH. However, we discovered a few challenges for the UMLS’s representation of SDOH. Next, we developed a multi-step framework to identify SDOH concepts from the UMLS, and a clinical BERT-based classification algorithm to assign each identified SDOH concept to one of six general categories. Our multi-step framework extracted a total of 198,677 SDOH concepts from the UMLS, and the SDOH category classification system attained an accuracy of 91%. We also built EASE: an open-source tool to Extract SDOH from EHRs.
H4H: A Comprehensive Repository of Housing Resources for Homelessness.
Osebe, S.; Tsai, J.; and Yu, H.
AMIA Summits on Translational Science Proceedings, 2023: 427–437. June 2023.
@article{osebe_h4h_2023, title = {{H4H}: {A} {Comprehensive} {Repository} of {Housing} {Resources} for {Homelessness}}, volume = {2023}, issn = {2153-4063}, shorttitle = {{H4H}}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10283121/}, abstract = {More than half a million people were experiencing homelessness in America on any given night in 2021, yet only around 50\% of them used shelters. To address unmet needs in homelessness, we report the creation of housing for homeless (H4H), the largest comprehensive repository of emergency shelters and other housing resources, from which we deployed state-of-the-art natural language processing approaches to extract information vital to individuals experiencing homelessness, including admission process, service provided, duration of stay, and eligibility. We frame information extraction as a question-answer task. Using 2,055 question-answer pairs for training and evaluation, the best performing system was a two-step classification and question-answering Roberta model with prompting, achieving a macro-average of 75.83 for F1 score. H4H and the annotated entries are publicly available as a benchmark dataset.}, urldate = {2024-04-10}, journal = {AMIA Summits on Translational Science Proceedings}, author = {Osebe, Samue and Tsai, Jack and Hong, Yu}, month = jun, year = {2023}, pmid = {37350907}, pmcid = {PMC10283121}, pages = {427--437}, }
More than half a million people were experiencing homelessness in America on any given night in 2021, yet only around 50% of them used shelters. To address unmet needs in homelessness, we report the creation of housing for homeless (H4H), the largest comprehensive repository of emergency shelters and other housing resources, from which we deployed state-of-the-art natural language processing approaches to extract information vital to individuals experiencing homelessness, including admission process, service provided, duration of stay, and eligibility. We frame information extraction as a question-answer task. Using 2,055 question-answer pairs for training and evaluation, the best performing system was a two-step classification and question-answering Roberta model with prompting, achieving a macro-average of 75.83 for F1 score. H4H and the annotated entries are publicly available as a benchmark dataset.
Multi-label Few-shot ICD Coding as Autoregressive Generation with Prompt.
Yang, Z.; Kwon, S.; Yao, Z.; and Yu, H.
Proceedings of the AAAI Conference on Artificial Intelligence, 37(4): 5366–5374. June 2023.
@article{yang_multi-label_2023, title = {Multi-label {Few}-shot {ICD} {Coding} as {Autoregressive} {Generation} with {Prompt}}, volume = {37}, issn = {2159-5399}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457101/}, doi = {10.1609/aaai.v37i4.25668}, abstract = {Automatic International Classification of Diseases (ICD) coding aims to assign multiple ICD codes to a medical note with an average of 3,000+ tokens. This task is challenging due to the high-dimensional space of multi-label assignment (155,000+ ICD code candidates) and the long-tail challenge - Many ICD codes are infrequently assigned yet infrequent ICD codes are important clinically. This study addresses the long-tail challenge by transforming this multi-label classification task into an autoregressive generation task. Specifically, we first introduce a novel pretraining objective to generate free text diagnoses and procedures using the SOAP structure, the medical logic physicians use for note documentation. Second, instead of directly predicting the high dimensional space of ICD codes, our model generates the lower dimension of text descriptions, which then infers ICD codes. Third, we designed a novel prompt template for multi-label classification. We evaluate our Generation with Prompt (GPsoap) model with the benchmark of all code assignment (MIMIC-III-full) and few shot ICD code assignment evaluation benchmark (MIMIC-III-few). Experiments on MIMIC-III-few show that our model performs with a marco F130.2, which substantially outperforms the previous MIMIC-III-full SOTA model (marco F1 4.3) and the model specifically designed for few/zero shot setting (marco F1 18.7). Finally, we design a novel ensemble learner, a cross-attention reranker with prompts, to integrate previous SOTA and our best few-shot coding predictions. Experiments on MIMIC-III-full show that our ensemble learner substantially improves both macro and micro F1, from 10.4 to 14.6 and from 58.2 to 59.1, respectively.}, number = {4}, urldate = {2024-04-10}, journal = {Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence}, author = {Yang, Zhichao and Kwon, Sunjae and Yao, Zonghai and Yu, Hong}, month = jun, year = {2023}, pmid = {37635946}, pmcid = {PMC10457101}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, pages = {5366--5374}, }
Automatic International Classification of Diseases (ICD) coding aims to assign multiple ICD codes to a medical note with an average of 3,000+ tokens. This task is challenging due to the high-dimensional space of multi-label assignment (155,000+ ICD code candidates) and the long-tail challenge: many ICD codes are infrequently assigned, yet infrequent ICD codes are important clinically. This study addresses the long-tail challenge by transforming this multi-label classification task into an autoregressive generation task. Specifically, we first introduce a novel pretraining objective to generate free text diagnoses and procedures using the SOAP structure, the medical logic physicians use for note documentation. Second, instead of directly predicting the high dimensional space of ICD codes, our model generates the lower dimension of text descriptions, which then infers ICD codes. Third, we designed a novel prompt template for multi-label classification. We evaluate our Generation with Prompt (GPsoap) model with the benchmark of all code assignment (MIMIC-III-full) and the few-shot ICD code assignment evaluation benchmark (MIMIC-III-few). Experiments on MIMIC-III-few show that our model performs with a macro F1 of 30.2, which substantially outperforms the previous MIMIC-III-full SOTA model (macro F1 4.3) and the model specifically designed for the few/zero-shot setting (macro F1 18.7). Finally, we design a novel ensemble learner, a cross-attention reranker with prompts, to integrate the previous SOTA and our best few-shot coding predictions. Experiments on MIMIC-III-full show that our ensemble learner substantially improves both macro and micro F1, from 10.4 to 14.6 and from 58.2 to 59.1, respectively.
Evaluating the Efficacy of NoteAid on EHR Note Comprehension among US Veterans through Amazon Mechanical Turk.
Lalor, J. P.; Wu, H.; Mazor, K. M.; and Yu, H.
International journal of medical informatics, 172: 105006. April 2023.
@article{lalor_evaluating_2023, title = {Evaluating the {Efficacy} of {NoteAid} on {EHR} {Note} {Comprehension} among {US} {Veterans} through {Amazon} {Mechanical} {Turk}}, volume = {172}, issn = {1386-5056}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9992155/}, doi = {10.1016/j.ijmedinf.2023.105006}, urldate = {2024-04-10}, journal = {International journal of medical informatics}, author = {Lalor, John P. and Wu, Hao and Mazor, Kathleen M. and Yu, Hong}, month = apr, year = {2023}, pmid = {36780789}, pmcid = {PMC9992155}, keywords = {Electronic health records, Health information technology, Health literacy}, pages = {105006}, }
Associations Between Natural Language Processing–Enriched Social Determinants of Health and Suicide Death Among US Veterans.
Mitra, A.; Pradhan, R.; Melamed, R. D.; Chen, K.; Hoaglin, D. C.; Tucker, K. L.; Reisman, J. I.; Yang, Z.; Liu, W.; Tsai, J.; and Yu, H.
JAMA Network Open, 6(3). March 2023.
Publisher: American Medical Association
@article{mitra_associations_2023, title = {Associations {Between} {Natural} {Language} {Processing}–{Enriched} {Social} {Determinants} of {Health} and {Suicide} {Death} {Among} {US} {Veterans}}, volume = {6}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10018322/}, doi = {10.1001/jamanetworkopen.2023.3079}, abstract = {Are social determinants of health (SDOHs), extracted from both structured and unstructured clinical data, associated with an increased risk of suicide death among US veterans?In this case-control study of 8821 cases and 35 284 matched controls, ...}, language = {en}, number = {3}, urldate = {2024-04-10}, journal = {JAMA Network Open}, author = {Mitra, Avijit and Pradhan, Richeek and Melamed, Rachel D. and Chen, Kun and Hoaglin, David C. and Tucker, Katherine L. and Reisman, Joel I. and Yang, Zhichao and Liu, Weisong and Tsai, Jack and Yu, Hong}, month = mar, year = {2023}, pmid = {36920391}, note = {Publisher: American Medical Association}, }
Are social determinants of health (SDOHs), extracted from both structured and unstructured clinical data, associated with an increased risk of suicide death among US veterans?In this case-control study of 8821 cases and 35 284 matched controls, ...
Intentional Self-Harm Among US Veterans With Traumatic Brain Injury or Posttraumatic Stress Disorder: Retrospective Cohort Study From 2008 to 2017.
Rawat, B. P. S.; Reisman, J.; Pogoda, T. K.; Liu, W.; Rongali, S.; Aseltine Jr., R. H.; Chen, K.; Tsai, J.; Berlowitz, D.; Yu, H.; and Carlson, K. F.
JMIR Public Health and Surveillance, 9: e42803. July 2023.
@article{rawat_intentional_2023, title = {Intentional {Self}-{Harm} {Among} {US} {Veterans} {With} {Traumatic} {Brain} {Injury} or {Posttraumatic} {Stress} {Disorder}: {Retrospective} {Cohort} {Study} {From} 2008 to 2017}, volume = {9}, issn = {2369-2960}, shorttitle = {Intentional {Self}-{Harm} {Among} {US} {Veterans} {With} {Traumatic} {Brain} {Injury} or {Posttraumatic} {Stress} {Disorder}}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10407646/}, doi = {10.2196/42803}, abstract = {Background Veterans with a history of traumatic brain injury (TBI) and/or posttraumatic stress disorder (PTSD) may be at increased risk of suicide attempts and other forms of intentional self-harm as compared to veterans without TBI or PTSD. Objective Using administrative data from the US Veterans Health Administration (VHA), we studied associations between TBI and PTSD diagnoses, and subsequent diagnoses of intentional self-harm among US veterans who used VHA health care between 2008 and 2017. Methods All veterans with encounters or hospitalizations for intentional self-harm were assigned “index dates” corresponding to the date of the first related visit; among those without intentional self-harm, we randomly selected a date from among the veteran’s health care encounters to match the distribution of case index dates over the 10-year period. We then examined the prevalence of TBI and PTSD diagnoses within the 5-year period prior to veterans’ index dates. TBI, PTSD, and intentional self-harm were identified using International Classification of Diseases diagnosis and external cause of injury codes from inpatient and outpatient VHA encounters. We stratified analyses by veterans’ average yearly VHA utilization in the 5-year period before their index date (low, medium, or high). Variations in prevalence and odds of intentional self-harm diagnoses were compared by veterans’ prior TBI and PTSD diagnosis status (TBI only, PTSD only, and comorbid TBI/PTSD) for each VHA utilization stratum. Multivariable models adjusted for age, sex, race, ethnicity, marital status, Department of Veterans Affairs service-connection status, and Charlson Comorbidity Index scores. Results About 6.7 million veterans with at least two VHA visits in the 5-year period before their index dates were included in the analyses; 86,644 had at least one intentional self-harm diagnosis during the study period. During the periods prior to veterans’ index dates, 93,866 were diagnosed with TBI only; 892,420 with PTSD only; and 102,549 with comorbid TBI/PTSD. Across all three VHA utilization strata, the prevalence of intentional self-harm diagnoses was higher among veterans diagnosed with TBI, PTSD, or TBI/PTSD than among veterans with neither diagnosis. The observed difference was most pronounced among veterans in the high VHA utilization stratum. The prevalence of intentional self-harm was six times higher among those with comorbid TBI/PTSD (6778/58,295, 11.63\%) than among veterans with neither TBI nor PTSD (21,979/1,144,991, 1.92\%). Adjusted odds ratios suggested that, after accounting for potential confounders, veterans with TBI, PTSD, or comorbid TBI/PTSD had higher odds of self-harm compared to veterans without these diagnoses. Among veterans with high VHA utilization, those with comorbid TBI/PTSD were 4.26 (95\% CI 4.15-4.38) times more likely to receive diagnoses for intentional self-harm than veterans with neither diagnosis. This pattern was similar for veterans with low and medium VHA utilization. 
Conclusions Veterans with TBI and/or PTSD diagnoses, compared to those with neither diagnosis, were substantially more likely to be subsequently diagnosed with intentional self-harm between 2008 and 2017. These associations were most pronounced among veterans who used VHA health care most frequently. These findings suggest a need for suicide prevention efforts targeted at veterans with these diagnoses.}, urldate = {2024-04-10}, journal = {JMIR Public Health and Surveillance}, author = {Rawat, Bhanu Pratap Singh and Reisman, Joel and Pogoda, Terri K and Liu, Weisong and Rongali, Subendhu and Aseltine Jr, Robert H and Chen, Kun and Tsai, Jack and Berlowitz, Dan and Yu, Hong and Carlson, Kathleen F}, month = jul, year = {2023}, pmid = {37486751}, pmcid = {PMC10407646}, pages = {e42803}, }
Background: Veterans with a history of traumatic brain injury (TBI) and/or posttraumatic stress disorder (PTSD) may be at increased risk of suicide attempts and other forms of intentional self-harm as compared to veterans without TBI or PTSD.

Objective: Using administrative data from the US Veterans Health Administration (VHA), we studied associations between TBI and PTSD diagnoses, and subsequent diagnoses of intentional self-harm among US veterans who used VHA health care between 2008 and 2017.

Methods: All veterans with encounters or hospitalizations for intentional self-harm were assigned “index dates” corresponding to the date of the first related visit; among those without intentional self-harm, we randomly selected a date from among the veteran’s health care encounters to match the distribution of case index dates over the 10-year period. We then examined the prevalence of TBI and PTSD diagnoses within the 5-year period prior to veterans’ index dates. TBI, PTSD, and intentional self-harm were identified using International Classification of Diseases diagnosis and external cause of injury codes from inpatient and outpatient VHA encounters. We stratified analyses by veterans’ average yearly VHA utilization in the 5-year period before their index date (low, medium, or high). Variations in prevalence and odds of intentional self-harm diagnoses were compared by veterans’ prior TBI and PTSD diagnosis status (TBI only, PTSD only, and comorbid TBI/PTSD) for each VHA utilization stratum. Multivariable models adjusted for age, sex, race, ethnicity, marital status, Department of Veterans Affairs service-connection status, and Charlson Comorbidity Index scores.

Results: About 6.7 million veterans with at least two VHA visits in the 5-year period before their index dates were included in the analyses; 86,644 had at least one intentional self-harm diagnosis during the study period. During the periods prior to veterans’ index dates, 93,866 were diagnosed with TBI only; 892,420 with PTSD only; and 102,549 with comorbid TBI/PTSD. Across all three VHA utilization strata, the prevalence of intentional self-harm diagnoses was higher among veterans diagnosed with TBI, PTSD, or TBI/PTSD than among veterans with neither diagnosis. The observed difference was most pronounced among veterans in the high VHA utilization stratum. The prevalence of intentional self-harm was six times higher among those with comorbid TBI/PTSD (6778/58,295, 11.63%) than among veterans with neither TBI nor PTSD (21,979/1,144,991, 1.92%). Adjusted odds ratios suggested that, after accounting for potential confounders, veterans with TBI, PTSD, or comorbid TBI/PTSD had higher odds of self-harm compared to veterans without these diagnoses. Among veterans with high VHA utilization, those with comorbid TBI/PTSD were 4.26 (95% CI 4.15-4.38) times more likely to receive diagnoses for intentional self-harm than veterans with neither diagnosis. This pattern was similar for veterans with low and medium VHA utilization.

Conclusions: Veterans with TBI and/or PTSD diagnoses, compared to those with neither diagnosis, were substantially more likely to be subsequently diagnosed with intentional self-harm between 2008 and 2017. These associations were most pronounced among veterans who used VHA health care most frequently. These findings suggest a need for suicide prevention efforts targeted at veterans with these diagnoses.
Context Variance Evaluation of Pretrained Language Models for Prompt-based Biomedical Knowledge Probing.
Yao, Z.; Cao, Y.; Yang, Z.; and Yu, H.
AMIA Summits on Translational Science Proceedings, 2023: 592–601. June 2023.
@article{yao_context_2023, title = {Context {Variance} {Evaluation} of {Pretrained} {Language} {Models} for {Prompt}-based {Biomedical} {Knowledge} {Probing}}, volume = {2023}, issn = {2153-4063}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10283095/}, abstract = {Pretrained language models (PLMs) have motivated research on what kinds of knowledge these models learn. Fill-in-the-blanks problem (e.g., cloze tests) is a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses the Top-k accuracy metric to evaluate different PLMs’ knowledge. However, existing research has shown that such prompt-based knowledge probing methods can only probe a lower bound of knowledge. Many factors like prompt-based probing biases make the LAMA benchmark unreliable and unstable. This problem is more prominent in BioLAMA. The severe long-tailed distribution in vocabulary and large-N-M relation make the performance gap between LAMA and BioLAMA remain notable. To address these, we introduced context variance into the prompt generation and proposed a new rank-change-based evaluation metric. Different from the previous known-unknown evaluation criteria, we proposed the concept of ”Misunderstand” in LAMA for the first time. Through experiments on 12 PLMs, we showed that our context variance prompts and Understand-Confuse-Misunderstand (UCM) metric make BioLAMA more friendly to large-N-M relations and rare relations. We also conducted a set of control experiments to disentangle ”understand” from just ”read and copy”.}, urldate = {2023-11-14}, journal = {AMIA Summits on Translational Science Proceedings}, author = {Yao, Zonghai and Cao, Yi and Yang, Zhichao and Yu, Hong}, month = jun, year = {2023}, pmid = {37350903}, pmcid = {PMC10283095}, pages = {592--601}, }
Pretrained language models (PLMs) have motivated research on what kinds of knowledge these models learn. Fill-in-the-blanks problem (e.g., cloze tests) is a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses the Top-k accuracy metric to evaluate different PLMs’ knowledge. However, existing research has shown that such prompt-based knowledge probing methods can only probe a lower bound of knowledge. Many factors like prompt-based probing biases make the LAMA benchmark unreliable and unstable. This problem is more prominent in BioLAMA. The severe long-tailed distribution in vocabulary and large-N-M relation make the performance gap between LAMA and BioLAMA remain notable. To address these, we introduced context variance into the prompt generation and proposed a new rank-change-based evaluation metric. Different from the previous known-unknown evaluation criteria, we proposed the concept of “Misunderstand” in LAMA for the first time. Through experiments on 12 PLMs, we showed that our context variance prompts and Understand-Confuse-Misunderstand (UCM) metric make BioLAMA more friendly to large-N-M relations and rare relations. We also conducted a set of control experiments to disentangle “understand” from just “read and copy”.
Extracting Biomedical Factual Knowledge Using Pretrained Language Model and Electronic Health Record Context.
Yao, Z.; Cao, Y.; Yang, Z.; Deshpande, V.; and Yu, H.
AMIA Annual Symposium Proceedings, 2022: 1188–1197. April 2023.
Paper link bibtex abstract
@article{yao_extracting_2023, title = {Extracting {Biomedical} {Factual} {Knowledge} {Using} {Pretrained} {Language} {Model} and {Electronic} {Health} {Record} {Context}}, volume = {2022}, issn = {1942-597X}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10148358/}, abstract = {Language Models (LMs) have performed well on biomedical natural language processing applications. In this study, we conducted some experiments to use prompt methods to extract knowledge from LMs as new knowledge Bases (LMs as KBs). However, prompting can only be used as a low bound for knowledge extraction, and perform particularly poorly on biomedical domain KBs. In order to make LMs as KBs more in line with the actual application scenarios of the biomedical domain, we specifically add EHR notes as context to the prompt to improve the low bound in the biomedical domain. We design and validate a series of experiments for our Dynamic-Context-BioLAMA task. Our experiments show that the knowledge possessed by those language models can distinguish the correct knowledge from the noise knowledge in the EHR notes, and such distinguishing ability can also be used as a new metric to evaluate the amount of knowledge possessed by the model.}, urldate = {2024-04-10}, journal = {AMIA Annual Symposium Proceedings}, author = {Yao, Zonghai and Cao, Yi and Yang, Zhichao and Deshpande, Vijeta and Yu, Hong}, month = apr, year = {2023}, pmid = {37128373}, pmcid = {PMC10148358}, pages = {1188--1197}, }
Language models (LMs) have performed well in biomedical natural language processing applications. In this study, we conducted experiments using prompt methods to extract knowledge from LMs as new knowledge bases (LMs-as-KBs). However, prompting yields only a lower bound on extractable knowledge and performs particularly poorly on biomedical-domain KBs. To bring LMs-as-KBs closer to the actual application scenarios of the biomedical domain, we add EHR notes as context to the prompt to raise this lower bound. We design and validate a series of experiments for our Dynamic-Context-BioLAMA task. Our experiments show that the knowledge possessed by these language models can distinguish correct knowledge from noisy knowledge in the EHR notes, and that this distinguishing ability can also serve as a new metric for evaluating the amount of knowledge a model possesses.
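A minimal sketch of the prompt construction described above: a BioLAMA-style cloze query with an EHR note prepended as context. The template format and the example note are assumptions for illustration, not the paper's exact setup:

# Hypothetical Dynamic-Context-style prompt: EHR note as context + cloze query.
def build_prompt(ehr_note: str, subject: str, relation_template: str) -> str:
    # relation_template contains [X] for the subject and [MASK] for the object.
    cloze = relation_template.replace("[X]", subject)
    return f"Context: {ehr_note}\nFill in the blank: {cloze}"

note = "Patient reports chest pain; troponin elevated; started on aspirin."
print(build_prompt(note, "aspirin", "[X] is used to treat [MASK]."))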
TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records.
Yang, Z.; Mitra, A.; Liu, W.; Berlowitz, D.; and Yu, H.
Nature Communications, 14: 7857. November 2023.
Paper doi link bibtex abstract
@article{yang_transformehr_2023, title = {{TransformEHR}: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records}, volume = {14}, issn = {2041-1723}, shorttitle = {{TransformEHR}}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10687211/}, doi = {10.1038/s41467-023-43715-z}, abstract = {Deep learning transformer-based models using longitudinal electronic health records (EHRs) have shown a great success in prediction of clinical diseases or outcomes. Pretraining on a large dataset can help such models map the input space better and boost their performance on relevant tasks through finetuning with limited data. In this study, we present TransformEHR, a generative encoder-decoder model with transformer that is pretrained using a new pretraining objective—predicting all diseases and outcomes of a patient at a future visit from previous visits. TransformEHR’s encoder-decoder framework, paired with the novel pretraining objective, helps it achieve the new state-of-the-art performance on multiple clinical prediction tasks. Comparing with the previous model, TransformEHR improves area under the precision–recall curve by 2\% (p {\textless} 0.001) for pancreatic cancer onset and by 24\% (p = 0.007) for intentional self-harm in patients with post-traumatic stress disorder. The high performance in predicting intentional self-harm shows the potential of TransformEHR in building effective clinical intervention systems. TransformEHR is also generalizable and can be easily finetuned for clinical prediction tasks with limited data., Using AI to predict disease can improve interventions slow down or prevent disease. Here, the authors show that generative AI models built on the framework of Transformer, the model that also empowers ChatGPT, can achieve state-of-the-art performance on disease predictions based on longitudinal electronic records.}, urldate = {2024-04-10}, journal = {Nature Communications}, author = {Yang, Zhichao and Mitra, Avijit and Liu, Weisong and Berlowitz, Dan and Yu, Hong}, month = nov, year = {2023}, pmid = {38030638}, pmcid = {PMC10687211}, keywords = {Computer science, Disease prevention, Experimental models of disease}, pages = {7857}, }
Deep learning transformer-based models using longitudinal electronic health records (EHRs) have shown great success in predicting clinical diseases and outcomes. Pretraining on a large dataset can help such models map the input space better and boost their performance on relevant tasks through finetuning with limited data. In this study, we present TransformEHR, a generative encoder-decoder model with transformer that is pretrained using a new objective: predicting all diseases and outcomes of a patient at a future visit from previous visits. TransformEHR’s encoder-decoder framework, paired with this novel pretraining objective, helps it achieve new state-of-the-art performance on multiple clinical prediction tasks. Compared with the previous model, TransformEHR improves the area under the precision-recall curve by 2% (p < 0.001) for pancreatic cancer onset and by 24% (p = 0.007) for intentional self-harm in patients with post-traumatic stress disorder. The high performance in predicting intentional self-harm shows the potential of TransformEHR for building effective clinical intervention systems. TransformEHR is also generalizable and can be easily finetuned for clinical prediction tasks with limited data. Using AI to predict disease can improve interventions that slow or prevent disease; here, the authors show that generative AI models built on the Transformer framework, the model that also powers ChatGPT, can achieve state-of-the-art performance on disease prediction from longitudinal electronic health records.
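The pretraining objective is concrete enough to sketch: inputs are a patient's previous visits as ICD-code sequences, and the target is the complete code set of the next visit. A minimal data-preparation sketch with made-up codes and a hypothetical [VISIT_SEP] separator token; the actual tokenizer and model are not shown:

# Build (input, target) pairs for a next-visit code-prediction objective.
# Each visit is a list of ICD-10 codes; a patient history is a list of visits.
visits = [["I10", "E11.9"], ["I10", "N18.3"], ["E11.9", "Z79.4", "I10"]]

pairs = []
for t in range(1, len(visits)):
    history = [code for visit in visits[:t] for code in visit + ["[VISIT_SEP]"]]
    target = visits[t]                  # decoder generates ALL codes of visit t
    pairs.append((history, target))

for src, tgt in pairs:
    print(src, "->", tgt)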
NoteChat: A Dataset of Synthetic Doctor-Patient Conversations Conditioned on Clinical Notes.
Wang, J.; Yao, Z.; Yang, Z.; Zhou, H.; Li, R.; Wang, X.; Xu, Y.; and Yu, H.
October 2023.
arXiv:2310.15959 [cs]
Paper link bibtex abstract
@misc{wang_notechat_2023, title = {{NoteChat}: {A} {Dataset} of {Synthetic} {Doctor}-{Patient} {Conversations} {Conditioned} on {Clinical} {Notes}}, shorttitle = {{NoteChat}}, url = {http://arxiv.org/abs/2310.15959}, abstract = {The detailed clinical records drafted by doctors after each patient's visit are crucial for medical practitioners and researchers. Automating the creation of these notes with language models can reduce the workload of doctors. However, training such models can be difficult due to the limited public availability of conversations between patients and doctors. In this paper, we introduce NoteChat, a cooperative multi-agent framework leveraging Large Language Models (LLMs) for generating synthetic doctor-patient conversations conditioned on clinical notes. NoteChat consists of Planning, Roleplay, and Polish modules. We provide a comprehensive automatic and human evaluation of NoteChat, comparing it with state-of-the-art models, including OpenAI's ChatGPT and GPT-4. Results demonstrate that NoteChat facilitates high-quality synthetic doctor-patient conversations, underscoring the untapped potential of LLMs in healthcare. This work represents the first instance of multiple LLMs cooperating to complete a doctor-patient conversation conditioned on clinical notes, offering promising avenues for the intersection of AI and healthcare}, urldate = {2023-11-15}, publisher = {arXiv}, author = {Wang, Junda and Yao, Zonghai and Yang, Zhichao and Zhou, Huixue and Li, Rumeng and Wang, Xun and Xu, Yucheng and Yu, Hong}, month = oct, year = {2023}, note = {Number: arXiv:2310.15959 arXiv:2310.15959 [cs]}, keywords = {Computer Science - Computation and Language}, }
The detailed clinical records drafted by doctors after each patient's visit are crucial for medical practitioners and researchers. Automating the creation of these notes with language models can reduce the workload of doctors. However, training such models can be difficult due to the limited public availability of conversations between patients and doctors. In this paper, we introduce NoteChat, a cooperative multi-agent framework leveraging Large Language Models (LLMs) for generating synthetic doctor-patient conversations conditioned on clinical notes. NoteChat consists of Planning, Roleplay, and Polish modules. We provide a comprehensive automatic and human evaluation of NoteChat, comparing it with state-of-the-art models, including OpenAI's ChatGPT and GPT-4. Results demonstrate that NoteChat facilitates high-quality synthetic doctor-patient conversations, underscoring the untapped potential of LLMs in healthcare. This work represents the first instance of multiple LLMs cooperating to complete a doctor-patient conversation conditioned on clinical notes, offering promising avenues for the intersection of AI and healthcare.
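The Planning / Roleplay / Polish decomposition suggests a three-stage pipeline. The skeleton below uses llm() as a placeholder for any chat-completion call; the prompt wording and module boundaries are illustrative rather than the paper's:

# Skeleton of a plan -> roleplay -> polish pipeline over a clinical note.
def llm(prompt: str) -> str:
    # Placeholder: plug in any text-generation client here.
    raise NotImplementedError

def notechat_style(note: str) -> str:
    plan = llm(f"List the key facts in this clinical note as dialogue topics:\n{note}")
    draft = llm(
        "Write a doctor-patient conversation that covers these topics in order, "
        f"staying faithful to the note.\nTopics:\n{plan}\nNote:\n{note}"
    )
    return llm(f"Polish this conversation for naturalness without changing facts:\n{draft}")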
Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations.
Yang, Z.; Yao, Z.; Tasmin, M.; Vashisht, P.; Jang, W. S.; Ouyang, F.; Wang, B.; Berlowitz, D.; and Yu, H.
November 2023.
Pages: 2023.10.26.23297629
Paper doi link bibtex abstract
@misc{yang_performance_2023, title = {Performance of {Multimodal} {GPT}-{4V} on {USMLE} with {Image}: {Potential} for {Imaging} {Diagnostic} {Support} with {Explanations}}, copyright = {© 2023, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0/}, shorttitle = {Performance of {Multimodal} {GPT}-{4V} on {USMLE} with {Image}}, url = {https://www.medrxiv.org/content/10.1101/2023.10.26.23297629v2}, doi = {10.1101/2023.10.26.23297629}, abstract = {Background Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. Past research, however, has not had the scale and accuracy for use in clinical decision making. The power of large language models (LLMs) may be changing this. In this study, we evaluated the performance and interpretability of Generative Pre-trained Transformer 4 Vision (GPT-4V), a multimodal LLM, on medical licensing examination questions with images. Methods We used three sets of multiple-choice questions with images from United States Medical Licensing Examination (USMLE), USMLE question bank for medical students (AMBOSS), and Diagnostic Radiology Qualifying Core Exam (DRQCE) to test GPT-4V’s accuracy and explanation quality. We compared GPT-4V with two other large language models, GPT-4 and ChatGPT, which cannot process images. We also assessed the preference and feedback of healthcare professionals on GPT-4V’s explanations. Results GPT-4V achieved high accuracies on USMLE (86.2\%), AMBOSS (62.0\%), and DRQCE (73.1\%), outperforming ChatGPT and GPT-4 by relative increase of 131.8\% and 64.5\% on average. GPT-4V was in the 70th - 80th percentile with AMBOSS users preparing for the exam. GPT-4V also passed the full USMLE exam with an accuracy of 90.7\%. GPT-4V’s explanations were preferred by healthcare professionals when it answered correctly, but they revealed several issues such as image misunderstanding, text hallucination, and reasoning error when it answered incorrectly. Conclusion GPT-4V showed promising results for medical licensing examination questions with images, suggesting its potential for clinical decision support. However, GPT-4V needs to improve its explanation quality and reliability for clinical use. 1-2 sentence description AI models offer potential for imaging diagnostic support tool, but their performance and interpretability are often unclear. Here, the authors show that GPT-4V, a large multimodal language model, can achieve high accuracy on medical licensing exams with images, but also reveal several issues in its explanation quality.}, language = {en}, urldate = {2023-11-14}, publisher = {medRxiv}, author = {Yang, Zhichao and Yao, Zonghai and Tasmin, Mahbuba and Vashisht, Parth and Jang, Won Seok and Ouyang, Feiyun and Wang, Beining and Berlowitz, Dan and Yu, Hong}, month = nov, year = {2023}, note = {Pages: 2023.10.26.23297629}, }
Background: Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. Past research, however, has not had the scale and accuracy for use in clinical decision making. The power of large language models (LLMs) may be changing this. In this study, we evaluated the performance and interpretability of Generative Pre-trained Transformer 4 Vision (GPT-4V), a multimodal LLM, on medical licensing examination questions with images.
Methods: We used three sets of multiple-choice questions with images from the United States Medical Licensing Examination (USMLE), the USMLE question bank for medical students (AMBOSS), and the Diagnostic Radiology Qualifying Core Exam (DRQCE) to test GPT-4V’s accuracy and explanation quality. We compared GPT-4V with two other large language models, GPT-4 and ChatGPT, which cannot process images. We also assessed the preference and feedback of healthcare professionals on GPT-4V’s explanations.
Results: GPT-4V achieved high accuracies on USMLE (86.2%), AMBOSS (62.0%), and DRQCE (73.1%), outperforming ChatGPT and GPT-4 by average relative increases of 131.8% and 64.5%, respectively. GPT-4V was in the 70th-80th percentile among AMBOSS users preparing for the exam. GPT-4V also passed the full USMLE exam with an accuracy of 90.7%. GPT-4V’s explanations were preferred by healthcare professionals when it answered correctly, but they revealed several issues such as image misunderstanding, text hallucination, and reasoning error when it answered incorrectly.
Conclusion: GPT-4V showed promising results for medical licensing examination questions with images, suggesting its potential for clinical decision support. However, GPT-4V needs to improve its explanation quality and reliability for clinical use.
Summary: AI models offer potential as imaging diagnostic support tools, but their performance and interpretability are often unclear. Here, the authors show that GPT-4V, a large multimodal language model, can achieve high accuracy on medical licensing exams with images, but also reveal several issues in its explanation quality.
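For readers checking the "relative increase" figures, relative improvement is (new - old) / old. A one-line helper with a hypothetical baseline accuracy, since the per-benchmark baseline scores are not quoted above:

def relative_increase(new: float, old: float) -> float:
    # Relative improvement of `new` over `old`, as a fraction.
    return (new - old) / old

# Hypothetical example: 86.2% accuracy vs. an assumed baseline of 40.0%.
print(f"{relative_increase(86.2, 40.0):.1%}")   # 115.5%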
SELF-EXPLAIN: Teaching Large Language Models to Reason Complex Questions by Themselves.
Zhao, J.; Yao, Z.; Yang, Z.; and Yu, H.
November 2023.
arXiv:2311.06985 [cs]. R0-FoMo: Workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models at NeurIPS 2023.
Paper link bibtex abstract
@misc{zhao_self-explain_2023, title = {{SELF}-{EXPLAIN}: {Teaching} {Large} {Language} {Models} to {Reason} {Complex} {Questions} by {Themselves}}, shorttitle = {{SELF}-{EXPLAIN}}, url = {http://arxiv.org/abs/2311.06985}, abstract = {Large language models (LLMs) can generate intermediate reasoning steps. To elicit the reliable reasoning, the common practice is to employ few-shot chain-of-thought prompting, where several in-context demonstrations for reasoning are prepended to the question. However, such chain-of-thought examples are expensive to craft, especially for professional domains, and can have high variance depending on human annotators. Therefore, this work investigates whether LLMs can teach themselves to reason without human-crafted demonstrations. We propose SELF-EXPLAIN to generate CoT examples by LLMs inspired by "encoding specificity" in human memory retrieval. We find using self-explanations makes LLMs more confident, more calibrated and less biased when answering complex questions. Moreover, we find prompting with self-explanations can even significantly outperform using human-crafted CoTs on several complex question answering dataset.}, urldate = {2023-11-14}, publisher = {arXiv}, author = {Zhao, Jiachen and Yao, Zonghai and Yang, Zhichao and Yu, Hong}, month = nov, year = {2023}, note = {Number: arXiv:2311.06985 arXiv:2311.06985 [cs] R0-FoMo: Workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models at NeurIPS 2023.}, keywords = {Computer Science - Computation and Language}, }
Large language models (LLMs) can generate intermediate reasoning steps. To elicit reliable reasoning, the common practice is to employ few-shot chain-of-thought (CoT) prompting, where several in-context demonstrations of reasoning are prepended to the question. However, such chain-of-thought examples are expensive to craft, especially for professional domains, and can have high variance depending on human annotators. Therefore, this work investigates whether LLMs can teach themselves to reason without human-crafted demonstrations. We propose SELF-EXPLAIN to generate CoT examples by LLMs, inspired by "encoding specificity" in human memory retrieval. We find that using self-explanations makes LLMs more confident, better calibrated, and less biased when answering complex questions. Moreover, we find that prompting with self-explanations can even significantly outperform using human-crafted CoTs on several complex question answering datasets.
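Schematically, the method replaces human-written chain-of-thought demonstrations with ones the model writes for itself. A sketch under that reading, with llm() as a placeholder for any text-generation call and prompt wording invented for illustration:

def llm(prompt: str) -> str:
    # Placeholder: any text-generation backend.
    raise NotImplementedError

def self_explain_answer(demo_questions, test_question):
    # 1) Let the model explain a few questions itself (no human-written CoT).
    demos = []
    for q in demo_questions:
        explanation = llm(f"Question: {q}\nExplain step by step, then answer:")
        demos.append(f"Question: {q}\n{explanation}")
    # 2) Reuse the self-generated explanations as in-context demonstrations.
    context = "\n\n".join(demos)
    return llm(f"{context}\n\nQuestion: {test_question}\nExplain step by step, then answer:")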
EHRTutor: Enhancing Patient Understanding of Discharge Instructions.
Zhang, Z.; Yao, Z.; Zhou, H.; Ouyang, F.; and Yu, H.
October 2023.
To appear in NeurIPS'23 Workshop on Generative AI for Education (GAIED), December, New Orleans
Paper doi link bibtex abstract
@misc{zhang_ehrtutor_2023, title = {{EHRTutor}: {Enhancing} {Patient} {Understanding} of {Discharge} {Instructions}}, shorttitle = {{EHRTutor}}, url = {http://arxiv.org/abs/2310.19212}, doi = {10.48550/arXiv.2310.19212}, abstract = {Large language models have shown success as a tutor in education in various fields. Educating patients about their clinical visits plays a pivotal role in patients' adherence to their treatment plans post-discharge. This paper presents EHRTutor, an innovative multi-component framework leveraging the Large Language Model (LLM) for patient education through conversational question-answering. EHRTutor first formulates questions pertaining to the electronic health record discharge instructions. It then educates the patient through conversation by administering each question as a test. Finally, it generates a summary at the end of the conversation. Evaluation results using LLMs and domain experts have shown a clear preference for EHRTutor over the baseline. Moreover, EHRTutor also offers a framework for generating synthetic patient education dialogues that can be used for future in-house system training.}, urldate = {2023-11-01}, publisher = {arXiv}, author = {Zhang, Zihao and Yao, Zonghai and Zhou, Huixue and ouyang, Feiyun and Yu, Hong}, month = oct, year = {2023}, note = {To appear in NeurIPS'23 Workshop on Generative AI for Education (GAIED), December, New Orleans}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Large language models have shown success as tutors in various fields of education. Educating patients about their clinical visits plays a pivotal role in patients' adherence to their treatment plans post-discharge. This paper presents EHRTutor, an innovative multi-component framework leveraging large language models (LLMs) for patient education through conversational question-answering. EHRTutor first formulates questions pertaining to the electronic health record discharge instructions. It then educates the patient through conversation by administering each question as a test. Finally, it generates a summary at the end of the conversation. Evaluation results using LLMs and domain experts have shown a clear preference for EHRTutor over the baseline. Moreover, EHRTutor also offers a framework for generating synthetic patient education dialogues that can be used for future in-house system training.
Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization.
Mishra, P.; Yao, Z.; Chen, S.; Wang, B.; Mittal, R.; and Yu, H.
October 2023.
NeurIPS 2023 Workshop SyntheticData4ML Accepted
Paper link bibtex abstract
@misc{mishra_synthetic_2023, title = {Synthetic {Imitation} {Edit} {Feedback} for {Factual} {Alignment} in {Clinical} {Summarization}}, url = {http://arxiv.org/abs/2310.20033}, abstract = {Large Language Models (LLMs) like the GPT and LLaMA families have demonstrated exceptional capabilities in capturing and condensing critical contextual information and achieving state-of-the-art performance in the summarization task. However, community concerns about these models' hallucination issues continue to rise. LLMs sometimes generate factually hallucinated summaries, which can be extremely harmful in the clinical domain NLP tasks (e.g., clinical note summarization), where factually incorrect statements can lead to critically erroneous diagnoses. Fine-tuning LLMs using human feedback has shown the promise of aligning LLMs to be factually consistent during generation, but such training procedure requires high-quality human-annotated data, which can be extremely expensive to get in the clinical domain. In this work, we propose a new pipeline using ChatGPT instead of human experts to generate high-quality feedback data for improving factual consistency in the clinical note summarization task. We focus specifically on edit feedback because recent work discusses the shortcomings of human alignment via preference feedback in complex situations (such as clinical NLP tasks that require extensive expert knowledge), as well as some advantages of collecting edit feedback from domain experts. In addition, although GPT has reached the expert level in many clinical NLP tasks (e.g., USMLE QA), there is not much previous work discussing whether GPT can generate expert-level edit feedback for LMs in the clinical note summarization task. We hope to fill this gap. Finally, our evaluations demonstrate the potential use of GPT edits in human alignment, especially from a factuality perspective.}, urldate = {2023-11-01}, publisher = {arXiv}, author = {Mishra, Prakamya and Yao, Zonghai and Chen, Shuwei and Wang, Beining and Mittal, Rohan and Yu, Hong}, month = oct, year = {2023}, note = {NeurIPS 2023 Workshop SyntheticData4ML Accepted}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Large Language Models (LLMs) like the GPT and LLaMA families have demonstrated exceptional capabilities in capturing and condensing critical contextual information and achieving state-of-the-art performance in the summarization task. However, community concerns about these models' hallucination issues continue to rise. LLMs sometimes generate factually hallucinated summaries, which can be extremely harmful in clinical-domain NLP tasks (e.g., clinical note summarization), where factually incorrect statements can lead to critically erroneous diagnoses. Fine-tuning LLMs using human feedback has shown promise for aligning LLMs to be factually consistent during generation, but such a training procedure requires high-quality human-annotated data, which can be extremely expensive to obtain in the clinical domain. In this work, we propose a new pipeline that uses ChatGPT instead of human experts to generate high-quality feedback data for improving factual consistency in the clinical note summarization task. We focus specifically on edit feedback because recent work discusses the shortcomings of human alignment via preference feedback in complex situations (such as clinical NLP tasks that require extensive expert knowledge), as well as some advantages of collecting edit feedback from domain experts. In addition, although GPT has reached expert level in many clinical NLP tasks (e.g., USMLE QA), there is little previous work discussing whether GPT can generate expert-level edit feedback for LMs in the clinical note summarization task. We hope to fill this gap. Finally, our evaluations demonstrate the potential use of GPT edits in human alignment, especially from a factuality perspective.
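The data-generation step described above can be sketched in a few lines: have an LLM edit a model draft for factual consistency against the source note, and keep the (draft, edited) pair as alignment data. The prompt text and record format below are assumptions, and llm() is a placeholder for any chat-completion backend:

def llm(prompt: str) -> str:
    # Placeholder: any chat-completion backend.
    raise NotImplementedError

def make_edit_feedback(note: str, draft_summary: str) -> dict:
    edited = llm(
        "Edit the summary so every statement is supported by the clinical note. "
        "Change as little as possible.\n"
        f"Note:\n{note}\n\nSummary:\n{draft_summary}\n\nEdited summary:"
    )
    # (draft, edited) pairs can then supervise factual-alignment training.
    return {"note": note, "rejected": draft_summary, "chosen": edited}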
Improving Summarization with Human Edits.
Yao, Z.; Schloss, B. J.; and Selvaraj, S. P.
December 2023.
EMNLP 2023
Paper link bibtex abstract
@misc{yao_improving_2023, title = {Improving {Summarization} with {Human} {Edits}}, url = {http://arxiv.org/abs/2310.05857}, abstract = {Recent work has shown the promise of learning with human feedback paradigms to produce human-determined high-quality text. Existing works use human feedback to train large language models (LLMs) in general domain abstractive summarization and have obtained summary quality exceeding traditional likelihood training. In this paper, we focus on a less explored form of human feedback -- Human Edits. We propose Sequence Alignment (un)Likelihood Training (SALT), a novel technique to use both the human-edited and model-generated data together in the training loop. In addition, we demonstrate simulating Human Edits with ground truth summaries coming from existing training data -- Imitation edits, along with the model-generated summaries obtained after the training, to reduce the need for expensive human-edit data. In our experiments, we extend human feedback exploration from general domain summarization to medical domain summarization. Our results demonstrate the effectiveness of SALT to improve the summary quality with Human and Imitation Edits.}, urldate = {2023-10-10}, publisher = {arXiv}, author = {Yao, Zonghai and Schloss, Benjamin J. and Selvaraj, Sai P.}, month = dec, year = {2023}, note = {EMNLP 2023}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning}, }
Recent work has shown the promise of learning-with-human-feedback paradigms to produce human-determined high-quality text. Existing works use human feedback to train large language models (LLMs) in general-domain abstractive summarization and have obtained summary quality exceeding traditional likelihood training. In this paper, we focus on a less explored form of human feedback: Human Edits. We propose Sequence Alignment (un)Likelihood Training (SALT), a novel technique that uses both the human-edited and model-generated data together in the training loop. In addition, to reduce the need for expensive human-edit data, we demonstrate simulating Human Edits with ground-truth summaries from existing training data (Imitation Edits), along with the model-generated summaries obtained after training. In our experiments, we extend human feedback exploration from general-domain summarization to medical-domain summarization. Our results demonstrate the effectiveness of SALT in improving summary quality with Human and Imitation Edits.
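SALT combines two standard per-token terms: likelihood (-log p) on tokens the human edit retained and unlikelihood (-log(1 - p)) on tokens the edit removed. A toy numeric sketch under that reading; the paper's actual alignment procedure and weighting are not reproduced here:

import math

def salt_token_loss(p_token: float, keep: bool) -> float:
    # keep=True: token retained by the human edit -> likelihood term -log p.
    # keep=False: token the edit removed -> unlikelihood term -log(1 - p).
    return -math.log(p_token) if keep else -math.log(1.0 - p_token)

# Model probabilities for four tokens and alignment labels from a human edit:
probs = [0.9, 0.6, 0.8, 0.7]
keeps = [True, True, False, True]   # the third token was deleted by the editor
loss = sum(salt_token_loss(p, k) for p, k in zip(probs, keeps)) / len(probs)
print(f"mean SALT-style loss: {loss:.3f}")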
PaniniQA: Enhancing Patient Education Through Interactive Question Answering.
Cai, P.; Yao, Z.; Liu, F.; Wang, D.; Reilly, M.; Zhou, H.; Li, L.; Cao, Y.; Kapoor, A.; Bajracharya, A.; Berlowitz, D.; and Yu, H.
Transactions of the Association for Computational Linguistics. August 2023.
Equal contribution from the first two authors.
Paper link bibtex abstract
@article{cai_paniniqa_2023, title = {{PaniniQA}: {Enhancing} {Patient} {Education} {Through} {Interactive} {Question} {Answering}}, shorttitle = {{PaniniQA}}, url = {http://arxiv.org/abs/2308.03253}, abstract = {Patient portal allows discharged patients to access their personalized discharge instructions in electronic health records (EHRs). However, many patients have difficulty understanding or memorizing their discharge instructions. In this paper, we present PaniniQA, a patient-centric interactive question answering system designed to help patients understand their discharge instructions. PaniniQA first identifies important clinical content from patients' discharge instructions and then formulates patient-specific educational questions. In addition, PaniniQA is also equipped with answer verification functionality to provide timely feedback to correct patients' misunderstandings. Our comprehensive automatic and human evaluation results demonstrate our PaniniQA is capable of improving patients' mastery of their medical instructions through effective interactions}, urldate = {2023-08-08}, journal = {Transactions of the Association for Computational Linguistics}, author = {Cai, Pengshan and Yao, Zonghai and Liu, Fei and Wang, Dakuo and Reilly, Meghan and Zhou, Huixue and Li, Lingxi and Cao, Yi and Kapoor, Alok and Bajracharya, Adarsha and Berlowitz, Dan and Yu, Hong}, month = aug, year = {2023}, note = {Equal contributions for the first two authors.}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language}, }
Patient portals allow discharged patients to access their personalized discharge instructions in electronic health records (EHRs). However, many patients have difficulty understanding or memorizing their discharge instructions. In this paper, we present PaniniQA, a patient-centric interactive question answering system designed to help patients understand their discharge instructions. PaniniQA first identifies important clinical content from patients' discharge instructions and then formulates patient-specific educational questions. In addition, PaniniQA is also equipped with answer verification functionality to provide timely feedback to correct patients' misunderstandings. Our comprehensive automatic and human evaluation results demonstrate that PaniniQA is capable of improving patients' mastery of their medical instructions through effective interactions.
Revisiting the Architectures like Pointer Networks to Efficiently Improve the Next Word Distribution, Summarization Factuality, and Beyond.
Chang, H.; Yao, Z.; Gon, A.; Yu, H.; and McCallum, A.
July 2023.
ACL 2023, equal contribution from the first two authors.
Paper link bibtex abstract
@misc{chang_revisiting_2023, address = {Canada}, title = {Revisiting the {Architectures} like {Pointer} {Networks} to {Efficiently} {Improve} the {Next} {Word} {Distribution}, {Summarization} {Factuality}, and {Beyond}}, url = {http://arxiv.org/abs/2305.12289}, abstract = {Is the output softmax layer, which is adopted by most language models (LMs), always the best way to compute the next word probability? Given so many attention layers in a modern transformer-based LM, are the pointer networks redundant nowadays? In this study, we discover that the answers to both questions are no. This is because the softmax bottleneck sometimes prevents the LMs from predicting the desired distribution and the pointer networks can be used to break the bottleneck efficiently. Based on the finding, we propose several softmax alternatives by simplifying the pointer networks and accelerating the word-by-word rerankers. In GPT-2, our proposals are significantly better and more efficient than mixture of softmax, a state-of-the-art softmax alternative. In summarization experiments, without significantly decreasing its training/testing speed, our best method based on T5-Small improves factCC score by 2 points in CNN/DM and XSUM dataset, and improves MAUVE scores by 30\% in BookSum paragraph-level dataset.}, urldate = {2023-05-23}, publisher = {arXiv}, author = {Chang, Haw-Shiuan and Yao, Zonghai and Gon, Alolika and Yu, Hong and McCallum, Andrew}, month = jul, year = {2023}, note = {ACL 2023, equal contribution from the first two authors.}, keywords = {Computer Science - Computation and Language}, }
Is the output softmax layer, which is adopted by most language models (LMs), always the best way to compute the next-word probability? Given so many attention layers in a modern transformer-based LM, are pointer networks redundant nowadays? In this study, we discover that the answer to both questions is no. This is because the softmax bottleneck sometimes prevents the LMs from predicting the desired distribution, and pointer networks can be used to break the bottleneck efficiently. Based on this finding, we propose several softmax alternatives by simplifying the pointer networks and accelerating the word-by-word rerankers. In GPT-2, our proposals are significantly better and more efficient than mixture of softmax, a state-of-the-art softmax alternative. In summarization experiments, without significantly decreasing its training/testing speed, our best method based on T5-Small improves the FactCC score by 2 points on the CNN/DM and XSUM datasets, and improves MAUVE scores by 30% on the BookSum paragraph-level dataset.
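The generic pointer-mechanism alternative to a pure softmax is easy to state: mix the vocabulary softmax with an attention-derived copy distribution through a gate, p(w) = g * p_vocab(w) + (1 - g) * p_copy(w). The sketch below implements that standard pointer-generator form, not the paper's specific simplifications:

import numpy as np

def pointer_mixture(vocab_logits, attn, src_token_ids, vocab_size, gate):
    # p(w) = gate * softmax(vocab_logits)[w]
    #        + (1 - gate) * attention mass on source positions holding token w.
    p_vocab = np.exp(vocab_logits - vocab_logits.max())
    p_vocab /= p_vocab.sum()
    p_copy = np.zeros(vocab_size)
    np.add.at(p_copy, src_token_ids, attn)   # scatter attention onto token ids
    return gate * p_vocab + (1.0 - gate) * p_copy

p = pointer_mixture(np.array([1.0, 0.5, 0.1]), np.array([0.7, 0.3]),
                    np.array([2, 2]), vocab_size=3, gate=0.8)
print(p, p.sum())   # a valid distribution summing to 1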
Automated identification of eviction status from electronic health record notes.
Yao, Z.; Tsai, J.; Liu, W.; Levy, D. A.; Druhl, E.; Reisman, J. I.; and Yu, H.
Journal of the American Medical Informatics Association, ocad081. May 2023.
Paper doi link bibtex abstract
@article{yao_automated_2023, title = {Automated identification of eviction status from electronic health record notes}, issn = {1527-974X}, url = {https://doi.org/10.1093/jamia/ocad081}, doi = {10.1093/jamia/ocad081}, abstract = {Evictions are important social and behavioral determinants of health. Evictions are associated with a cascade of negative events that can lead to unemployment, housing insecurity/homelessness, long-term poverty, and mental health problems. In this study, we developed a natural language processing system to automatically detect eviction status from electronic health record (EHR) notes.We first defined eviction status (eviction presence and eviction period) and then annotated eviction status in 5000 EHR notes from the Veterans Health Administration (VHA). We developed a novel model, KIRESH, that has shown to substantially outperform other state-of-the-art models such as fine-tuning pretrained language models like BioBERT and Bio\_ClinicalBERT. Moreover, we designed a novel prompt to further improve the model performance by using the intrinsic connection between the 2 subtasks of eviction presence and period prediction. Finally, we used the Temperature Scaling-based Calibration on our KIRESH-Prompt method to avoid overconfidence issues arising from the imbalance dataset.KIRESH-Prompt substantially outperformed strong baseline models including fine-tuning the Bio\_ClinicalBERT model to achieve 0.74672 MCC, 0.71153 Macro-F1, and 0.83396 Micro-F1 in predicting eviction period and 0.66827 MCC, 0.62734 Macro-F1, and 0.7863 Micro-F1 in predicting eviction presence. We also conducted additional experiments on a benchmark social determinants of health (SBDH) dataset to demonstrate the generalizability of our methods.KIRESH-Prompt has substantially improved eviction status classification. We plan to deploy KIRESH-Prompt to the VHA EHRs as an eviction surveillance system to help address the US Veterans’ housing insecurity.}, urldate = {2023-05-19}, journal = {Journal of the American Medical Informatics Association}, author = {Yao, Zonghai and Tsai, Jack and Liu, Weisong and Levy, David A and Druhl, Emily and Reisman, Joel I and Yu, Hong}, month = may, year = {2023}, keywords = {Computer Science - Computation and Language}, pages = {ocad081}, }
Evictions are important social and behavioral determinants of health. Evictions are associated with a cascade of negative events that can lead to unemployment, housing insecurity/homelessness, long-term poverty, and mental health problems. In this study, we developed a natural language processing system to automatically detect eviction status from electronic health record (EHR) notes. We first defined eviction status (eviction presence and eviction period) and then annotated eviction status in 5000 EHR notes from the Veterans Health Administration (VHA). We developed a novel model, KIRESH, which has been shown to substantially outperform other state-of-the-art models, including fine-tuned pretrained language models like BioBERT and Bio_ClinicalBERT. Moreover, we designed a novel prompt to further improve model performance by using the intrinsic connection between the two subtasks of eviction presence and period prediction. Finally, we applied Temperature Scaling-based calibration to our KIRESH-Prompt method to avoid the overconfidence issues arising from the imbalanced dataset. KIRESH-Prompt substantially outperformed strong baseline models, including fine-tuning the Bio_ClinicalBERT model, achieving 0.74672 MCC, 0.71153 macro-F1, and 0.83396 micro-F1 in predicting eviction period and 0.66827 MCC, 0.62734 macro-F1, and 0.7863 micro-F1 in predicting eviction presence. We also conducted additional experiments on a benchmark social determinants of health (SBDH) dataset to demonstrate the generalizability of our methods. KIRESH-Prompt substantially improves eviction status classification, and we plan to deploy it to the VHA EHRs as an eviction surveillance system to help address US veterans' housing insecurity.
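Temperature Scaling, the calibration step named above, is a standard one-parameter post-hoc method: divide the logits by a temperature T fitted on held-out data before the softmax. A generic sketch with toy validation data; nothing here is specific to KIRESH-Prompt:

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, T):
    # Negative log-likelihood of the true labels at temperature T.
    probs = softmax(logits / T)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# Fit T by simple grid search on (toy) validation logits.
logits = np.array([[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]])
labels = np.array([0, 1])
T = min(np.linspace(0.5, 5.0, 46), key=lambda t: nll(logits, labels, t))
print(f"fitted temperature: {T:.1f}")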
Buprenorphine use and courses of care for opioid use disorder treatment within the Veterans Health Administration.
Gordon, A. J.; Saxon, A. J.; Kertesz, S.; Wyse, J. J.; Manhapra, A.; Lin, L. A.; Chen, W.; Hansen, J.; Pinnell, D.; Huynh, T.; Baylis, J. D.; Cunningham, F. E.; Ghitza, U. E.; Bart, G.; Yu, H.; and Sauer, B. C.
Drug and Alcohol Dependence, 248: 109902. July 2023.
Paper doi link bibtex abstract
@article{gordon_buprenorphine_2023, title = {Buprenorphine use and courses of care for opioid use disorder treatment within the {Veterans} {Health} {Administration}}, volume = {248}, issn = {0376-8716}, url = {https://www.sciencedirect.com/science/article/pii/S0376871623001400}, doi = {10.1016/j.drugalcdep.2023.109902}, abstract = {Background Retention of patients in buprenorphine medication treatment for opioid use disorder (B-MOUD) reduces harms associated with opioid use disorder (OUD). We sought to characterize the patients receiving B-MOUD and courses of B-MOUD in a large healthcare system. Methods We conducted a retrospective, open cohort study of patients with OUD who either did or did not receive B-MOUD courses within the Veterans Health Administration (VHA) from January 2006 through July 2019, using VHA clinical data. We compared patients receiving or not receiving B-MOUD, characterized B-MOUD courses (e.g., length and doses), and examined persistence, across patient characteristics, over time. We used analyses for normally or non-normally distributed continuous variables, categorical data, and persistence over time (Kaplan-Meier persistence curves). Results We identified 255,726 Veterans with OUD; 40,431 (15.8\%) had received 63,929 B-MOUD courses. Compared to patients with OUD without B-MOUD, patients with B-MOUD were younger, more often of white race, and had more co-morbidities. The frequency of new B-MOUD starts and prevalent B-MOUD patients ranged from 1550 and 1989 in 2007 to 8146 and 16,505 in 2018, respectively. The median duration of B-MOUD was 157 (IQR: 37–537) days for all courses and 33.8\% patients had more than one course. The average proportion days covered was 90\% (SD: 0.15), and the average prescribed daily dose was 13.44 (SD: 6.5). Conclusions Within a VHA B-MOUD cohort, courses increased more than 10-fold from 2006 to 2016 with nearly half of patients experiencing multiple courses. Patient demographics seem to dictate the length of courses.}, language = {en}, urldate = {2023-05-15}, journal = {Drug and Alcohol Dependence}, author = {Gordon, Adam J. and Saxon, Andrew J. and Kertesz, Stefan and Wyse, Jessica J. and Manhapra, Ajay and Lin, Lewei A. and Chen, Wei and Hansen, Jared and Pinnell, Derek and Huynh, Tina and Baylis, Jacob D. and Cunningham, Francesca E. and Ghitza, Udi E. and Bart, Gavin and Yu, Hong and Sauer, Brian C.}, month = jul, year = {2023}, keywords = {Buprenorphine, Opioid-Related Disorders}, pages = {109902}, }
Background: Retention of patients in buprenorphine medication treatment for opioid use disorder (B-MOUD) reduces harms associated with opioid use disorder (OUD). We sought to characterize the patients receiving B-MOUD and courses of B-MOUD in a large healthcare system.
Methods: We conducted a retrospective, open cohort study of patients with OUD who either did or did not receive B-MOUD courses within the Veterans Health Administration (VHA) from January 2006 through July 2019, using VHA clinical data. We compared patients receiving or not receiving B-MOUD, characterized B-MOUD courses (e.g., length and doses), and examined persistence across patient characteristics over time, using appropriate analyses for normally and non-normally distributed continuous variables, categorical data, and persistence over time (Kaplan-Meier persistence curves).
Results: We identified 255,726 veterans with OUD; 40,431 (15.8%) had received 63,929 B-MOUD courses. Compared to patients with OUD without B-MOUD, patients with B-MOUD were younger, more often of white race, and had more comorbidities. The frequency of new B-MOUD starts and prevalent B-MOUD patients ranged from 1550 and 1989 in 2007 to 8146 and 16,505 in 2018, respectively. The median duration of B-MOUD was 157 (IQR: 37-537) days for all courses, and 33.8% of patients had more than one course. The average proportion of days covered was 90% (SD: 0.15), and the average prescribed daily dose was 13.44 (SD: 6.5).
Conclusions: Within a VHA B-MOUD cohort, courses increased more than 10-fold from 2006 to 2016, with nearly half of patients experiencing multiple courses. Patient demographics seem to dictate the length of courses.
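Proportion of days covered (PDC), reported above as 90% on average, has a standard definition: unique days with medication on hand divided by days in the observation window. A minimal sketch under that definition, with made-up fill dates:

from datetime import date, timedelta

def pdc(fills, window_start: date, window_end: date) -> float:
    """fills: list of (fill_date, days_supply). Counts unique covered days."""
    covered = set()
    for fill_date, days_supply in fills:
        for d in range(days_supply):
            day = fill_date + timedelta(days=d)
            if window_start <= day <= window_end:
                covered.add(day)
    total_days = (window_end - window_start).days + 1
    return len(covered) / total_days

fills = [(date(2023, 1, 1), 30), (date(2023, 2, 5), 30)]
print(f"PDC: {pdc(fills, date(2023, 1, 1), date(2023, 3, 1)):.2f}")  # 0.92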
Vision Meets Definitions: Unsupervised Visual Word Sense Disambiguation Incorporating Gloss Information.
Kwon, S.; Garodia, R.; Lee, M.; Yang, Z.; and Yu, H.
In Toronto, Canada, July 2023.
ACL 2023
link bibtex
@inproceedings{kwon_vision_2023, address = {Toronto Canada}, title = {Vision {Meets} {Definitions}: {Unsupervised} {Visual} {Word} {Sense} {Disambiguation} {Incorporating} {Gloss} {Information}}, author = {Kwon, Sunjae and Garodia, Rishabh and Lee, Minhwa and Yang, Zhichao and Yu, Hong}, month = jul, year = {2023}, note = {ACL 2023}, }
Generating User-Engaging News Headlines.
Cai, P.; Song, K.; Cho, S.; Wang, H.; Wang, X.; Yu, H.; Liu, F.; and Yu, D.
In Toronto, Canada, July 2023.
ACL 2023
link bibtex
@inproceedings{cai_generating_2023, address = {Toronto, Canada}, title = {Generating {User}-{Engaging} {News} {Headlines}}, shorttitle = {{ACL} 2023}, author = {Cai, Pengshan and Song, Kaiqiang and Cho, Sangwoo and Wang, Hongwei and Wang, Xiaoyang and Yu, Hong and Liu, Fei and Yu, Dong}, month = jul, year = {2023}, note = {ACL 2023}, }
Web Information Extraction for Social Good: Food Pantry Answering As an Example.
Chen, H.; and Yu, H.
In Austin, TX, May 2023. ACM
The Web Conference 2023, Austin TX
doi link bibtex abstract
@inproceedings{chen_web_2023, address = {Austin, TX}, title = {Web {Information} {Extraction} for {Social} {Good}: {Food} {Pantry} {Answering} {As} an {Example}}, doi = {10.1145/3543507.3583880}, abstract = {Social Determinants of Health (SDH) have more influence on health outcome than clinical care or the physical environment, namely food insecurity, housing instability, and health literacy. Many researchers design applications as a bridge to connect between resource providers and the deprived population. In this study, we take food pantries as a solution to mitigate food insecurity as an example to illustrate an automatic system combining location-aware information retrieval, web information extraction and domain-specific answering. To acquire the latest knowledge, our proposed framework first retrieves pantry candidates based on geolocation of the user, and utilizes structural information from markup language to extract semantic chunks related to six common requests. We use BERT and RoBERTa as information extraction models and compare three different web page segmentation methods in the experiments.}, publisher = {ACM}, author = {Chen, Huan-Yuan and Yu, Hong}, month = may, year = {2023}, note = {The Web Conference 2023, Austin TX}, }
Social determinants of health (SDH), such as food insecurity, housing instability, and health literacy, have more influence on health outcomes than clinical care or the physical environment. Many researchers design applications as a bridge between resource providers and deprived populations. In this study, we take food pantries, a solution for mitigating food insecurity, as an example to illustrate an automatic system combining location-aware information retrieval, web information extraction, and domain-specific answering. To acquire the latest knowledge, our proposed framework first retrieves pantry candidates based on the user's geolocation, then utilizes structural information from the markup language to extract semantic chunks related to six common requests. We use BERT and RoBERTa as information extraction models and compare three different web page segmentation methods in the experiments.
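The location-aware retrieval step can be sketched independently of the extraction models: rank candidate pantries by great-circle (haversine) distance from the user's geolocation. The coordinates and pantry names below are made up:

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometers.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

pantries = [("Pantry A", 42.36, -71.06), ("Pantry B", 42.41, -71.11)]
user = (42.37, -71.10)
ranked = sorted(pantries, key=lambda p: haversine_km(user[0], user[1], p[1], p[2]))
print(ranked[0][0], "is closest")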
2022
(21)
ScAN: Suicide Attempt and Ideation Events Dataset.
Rawat, B. P. S.; Kovaly, S.; Pigeon, W. R.; and Yu, H.
Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022: 1029–1040. July 2022.
Paper doi link bibtex abstract
@article{rawat_scan_2022, title = {{ScAN}: {Suicide} {Attempt} and {Ideation} {Events} {Dataset}}, volume = {2022}, shorttitle = {{ScAN}}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9958515/}, doi = {10.18653/v1/2022.naacl-main.75}, abstract = {Suicide is an important public health concern and one of the leading causes of death worldwide. Suicidal behaviors, including suicide attempts (SA) and suicide ideations (SI), are leading risk factors for death by suicide. Information related to patients’ previous and current SA and SI are frequently documented in the electronic health record (EHR) notes. Accurate detection of such documentation may help improve surveillance and predictions of patients’ suicidal behaviors and alert medical professionals for suicide prevention efforts. In this study, we first built Suicide Attempt and Ideation Events (ScAN) dataset, a subset of the publicly available MIMIC III dataset spanning over 12k+ EHR notes with 19k+ annotated SA and SI events information. The annotations also contain attributes such as method of suicide attempt. We also provide a strong baseline model ScANER (Suicide Attempt and Ideation Events Retreiver), a multi-task RoBERTa-based model with a retrieval module to extract all the relevant suicidal behavioral evidences from EHR notes of an hospital-stay and, and a prediction module to identify the type of suicidal behavior (SA and SI) concluded during the patient’s stay at the hospital. ScANER achieved a macro-weighted F1-score of 0.83 for identifying suicidal behavioral evidences and a macro F1-score of 0.78 and 0.60 for classification of SA and SI for the patient’s hospital-stay, respectively. ScAN and ScANER are publicly available.}, urldate = {2024-04-10}, journal = {Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting}, author = {Rawat, Bhanu Pratap Singh and Kovaly, Samuel and Pigeon, Wilfred R. and Yu, Hong}, month = jul, year = {2022}, pmid = {36848299}, pmcid = {PMC9958515}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning}, pages = {1029--1040}, }
Suicide is an important public health concern and one of the leading causes of death worldwide. Suicidal behaviors, including suicide attempts (SA) and suicide ideation (SI), are leading risk factors for death by suicide. Information related to patients’ previous and current SA and SI is frequently documented in electronic health record (EHR) notes. Accurate detection of such documentation may help improve surveillance and prediction of patients’ suicidal behaviors and alert medical professionals for suicide prevention efforts. In this study, we first built the Suicide Attempt and Ideation Events (ScAN) dataset, a subset of the publicly available MIMIC III dataset spanning 12k+ EHR notes with 19k+ annotated SA and SI events. The annotations also contain attributes such as the method of suicide attempt. We also provide a strong baseline model, ScANER (Suicide Attempt and Ideation Events Retriever), a multi-task RoBERTa-based model with a retrieval module to extract all relevant suicidal behavioral evidence from the EHR notes of a hospital stay, and a prediction module to identify the type of suicidal behavior (SA and SI) concluded during the patient’s stay at the hospital. ScANER achieved a macro-weighted F1-score of 0.83 for identifying suicidal behavioral evidence, and macro F1-scores of 0.78 and 0.60 for classification of SA and SI for the patient’s hospital stay, respectively. ScAN and ScANER are publicly available.
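The macro F1-scores quoted above weight each class equally, which is what makes them informative for rare labels like suicide attempt. A minimal sketch of the metric itself (not of the ScANER model):

def macro_f1(y_true, y_pred):
    # Average per-class F1, each class weighted equally regardless of frequency.
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

print(macro_f1(["SA", "SI", "SA", "none"], ["SA", "SI", "none", "none"]))  # 0.778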
Knowledge Injected Prompt Based Fine-tuning for Multi-label Few-shot ICD Coding.
Yang, Z.; Wang, S.; Rawat, B. P. S.; Mitra, A.; and Yu, H.
Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2022: 1767. December 2022.
Publisher: NIH Public Access
Paper link bibtex abstract
@article{yang_knowledge_2022, title = {Knowledge {Injected} {Prompt} {Based} {Fine}-tuning for {Multi}-label {Few}-shot {ICD} {Coding}}, volume = {2022}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9958514/}, abstract = {Automatic International Classification of Diseases (ICD) coding aims to assign multiple ICD codes to a medical note with average length of 3,000+ tokens. This task is challenging due to a high-dimensional space of multi-label assignment (tens of thousands ...}, language = {en}, urldate = {2024-04-10}, journal = {Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing}, author = {Yang, Zhichao and Wang, Shufan and Rawat, Bhanu Pratap Singh and Mitra, Avijit and Yu, Hong}, month = dec, year = {2022}, pmid = {36848298}, note = {Publisher: NIH Public Access}, pages = {1767}, }
Automatic International Classification of Diseases (ICD) coding aims to assign multiple ICD codes to a medical note with an average length of 3,000+ tokens. This task is challenging due to a high-dimensional space of multi-label assignment (tens of thousands ...
Learning as Conversation: Dialogue Systems Reinforced for Information Acquisition.
Cai, P.; Wan, H.; Liu, F.; Yu, M.; Yu, H.; and Joshi, S.
In Carpuat, M.; de Marneffe, M.; and Meza Ruiz, I. V., editor(s), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4781–4796, Seattle, United States, July 2022. Association for Computational Linguistics
Paper doi link bibtex abstract
@inproceedings{cai_learning_2022, address = {Seattle, United States}, title = {Learning as {Conversation}: {Dialogue} {Systems} {Reinforced} for {Information} {Acquisition}}, shorttitle = {Learning as {Conversation}}, url = {https://aclanthology.org/2022.naacl-main.352}, doi = {10.18653/v1/2022.naacl-main.352}, abstract = {We propose novel AI-empowered chat bots for learning as conversation where a user does not read a passage but gains information and knowledge through conversation with a teacher bot. Our information acquisition-oriented dialogue system employs a novel adaptation of reinforced self-play so that the system can be transferred to various domains without in-domain dialogue data, and can carry out conversations both informative and attentive to users.}, urldate = {2023-11-14}, booktitle = {Proceedings of the 2022 {Conference} of the {North} {American} {Chapter} of the {Association} for {Computational} {Linguistics}: {Human} {Language} {Technologies}}, publisher = {Association for Computational Linguistics}, author = {Cai, Pengshan and Wan, Hui and Liu, Fei and Yu, Mo and Yu, Hong and Joshi, Sachindra}, editor = {Carpuat, Marine and de Marneffe, Marie-Catherine and Meza Ruiz, Ivan Vladimir}, month = jul, year = {2022}, pages = {4781--4796}, }
We propose novel AI-empowered chat bots for learning as conversation where a user does not read a passage but gains information and knowledge through conversation with a teacher bot. Our information acquisition-oriented dialogue system employs a novel adaptation of reinforced self-play so that the system can be transferred to various domains without in-domain dialogue data, and can carry out conversations both informative and attentive to users.
Generation of Patient After-Visit Summaries to Support Physicians.
Cai, P.; Liu, F.; Bajracharya, A.; Sills, J.; Kapoor, A.; Liu, W.; Berlowitz, D.; Levy, D.; Pradhan, R.; and Yu, H.
In Proceedings of the 29th International Conference on Computational Linguistics, pages 6234–6247, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics
Paper link bibtex abstract
@inproceedings{cai_generation_2022, address = {Gyeongju, Republic of Korea}, title = {Generation of {Patient} {After}-{Visit} {Summaries} to {Support} {Physicians}}, url = {https://aclanthology.org/2022.coling-1.544}, abstract = {An after-visit summary (AVS) is a summary note given to patients after their clinical visit. It recaps what happened during their clinical visit and guides patients' disease self-management. Studies have shown that a majority of patients found after-visit summaries useful. However, many physicians face excessive workloads and do not have time to write clear and informative summaries. In this paper, we study the problem of automatic generation of after-visit summaries and examine whether those summaries can convey the gist of clinical visits. We report our findings on a new clinical dataset that contains a large number of electronic health record (EHR) notes and their associated summaries. Our results suggest that generation of lay language after-visit summaries remains a challenging task. Crucially, we introduce a feedback mechanism that alerts physicians when an automatic summary fails to capture the important details of the clinical notes or when it contains hallucinated facts that are potentially detrimental to the summary quality. Automatic and human evaluation demonstrates the effectiveness of our approach in providing writing feedback and supporting physicians.}, urldate = {2022-12-18}, booktitle = {Proceedings of the 29th {International} {Conference} on {Computational} {Linguistics}}, publisher = {International Committee on Computational Linguistics}, author = {Cai, Pengshan and Liu, Fei and Bajracharya, Adarsha and Sills, Joe and Kapoor, Alok and Liu, Weisong and Berlowitz, Dan and Levy, David and Pradhan, Richeek and Yu, Hong}, month = oct, year = {2022}, pages = {6234--6247}, }
An after-visit summary (AVS) is a summary note given to patients after their clinical visit. It recaps what happened during their clinical visit and guides patients' disease self-management. Studies have shown that a majority of patients found after-visit summaries useful. However, many physicians face excessive workloads and do not have time to write clear and informative summaries. In this paper, we study the problem of automatic generation of after-visit summaries and examine whether those summaries can convey the gist of clinical visits. We report our findings on a new clinical dataset that contains a large number of electronic health record (EHR) notes and their associated summaries. Our results suggest that generation of lay language after-visit summaries remains a challenging task. Crucially, we introduce a feedback mechanism that alerts physicians when an automatic summary fails to capture the important details of the clinical notes or when it contains hallucinated facts that are potentially detrimental to the summary quality. Automatic and human evaluation demonstrates the effectiveness of our approach in providing writing feedback and supporting physicians.
Enhancing the prediction of disease outcomes using electronic health records and pretrained deep learning models.
Yang, Z.; Liu, W.; Berlowitz, D.; and Yu, H.
December 2022.
arXiv:2212.12067 [cs]
Paper doi link bibtex abstract
@misc{yang_enhancing_2022, title = {Enhancing the prediction of disease outcomes using electronic health records and pretrained deep learning models}, url = {http://arxiv.org/abs/2212.12067}, doi = {10.48550/arXiv.2212.12067}, abstract = {Question: Can an encoder-decoder architecture pretrained on a large dataset of longitudinal electronic health records improves patient outcome predictions? Findings: In this prognostic study of 6.8 million patients, our denoising sequence-to-sequence prediction model of multiple outcomes outperformed state-of-the-art models scuh pretrained BERT on a broad range of patient outcomes, including intentional self-harm and pancreatic cancer. Meaning: Deep bidirectional and autoregressive representation improves patient outcome prediction.}, urldate = {2023-02-19}, publisher = {arXiv}, author = {Yang, Zhichao and Liu, Weisong and Berlowitz, Dan and Yu, Hong}, month = dec, year = {2022}, note = {arXiv:2212.12067 [cs]}, keywords = {Computer Science - Artificial Intelligence, Computer Science - Computers and Society, Computer Science - Machine Learning}, }
Question: Can an encoder-decoder architecture pretrained on a large dataset of longitudinal electronic health records improve patient outcome predictions? Findings: In this prognostic study of 6.8 million patients, our denoising sequence-to-sequence prediction model of multiple outcomes outperformed state-of-the-art models such as pretrained BERT on a broad range of patient outcomes, including intentional self-harm and pancreatic cancer. Meaning: Deep bidirectional and autoregressive representation improves patient outcome prediction.
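As a rough illustration of the denoising sequence-to-sequence objective described above, the sketch below corrupts a text-encoded sequence of diagnosis codes and trains a generic BART model to reconstruct it. The model checkpoint, the ICD-10 codes, and the text encoding of visits are illustrative assumptions, not the paper's actual setup.

# Minimal sketch of denoising seq2seq pretraining over a patient's
# longitudinal code sequence: mask part of the sequence and train an
# encoder-decoder to reconstruct the original (illustrative only).
from transformers import BartForConditionalGeneration, BartTokenizerFast

tok = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

corrupted = "E11.9 I10 <mask> N18.3"  # visit codes with one span masked
original = "E11.9 I10 I50.9 N18.3"    # ground-truth sequence
batch = tok(corrupted, return_tensors="pt")
labels = tok(original, return_tensors="pt").input_ids
loss = model(**batch, labels=labels).loss  # reconstruction loss
loss.backward()  # one pretraining step (optimizer omitted)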
Geographic Disparities in Prevalence of Opioid Use Disorders in US Veterans.
Li, W.; Leon, C.; Liu, W.; Sung, M. L.; Kerns, R. D.; Becker, W. C.; and Yu, H.
In Boston MA, November 2022.
APHA 2022 Annual Meeting and Expo
link bibtex
@inproceedings{li_geographic_2022, address = {Boston MA}, title = {Geographic {Disparities} in {Prevalence} of {Opioid} {Use} {Disorders} in {US} {Veterans}}, author = {Li, Weijun and Leon, Casey and Liu, Weisong and Sung, Minhee L. and Kerns, Robert D. and Becker, William C. and Yu, Hong}, month = nov, year = {2022}, note = {APHA 2022 Annual Meeting and Expo}, }
Prevalence of Frailty and Associations with Oral Anticoagulant Prescribing in Atrial Fibrillation.
Sanghai, S. R.; Liu, W.; Wang, W.; Rongali, S.; Orkaby, A. R.; Saczynski, J. S.; Rose, A. J.; Kapoor, A.; Li, W.; Yu, H.; and McManus, D. D.
Journal of General Internal Medicine, 37(4): 730–736. March 2022.
Paper doi link bibtex abstract
@article{sanghai_prevalence_2022, title = {Prevalence of {Frailty} and {Associations} with {Oral} {Anticoagulant} {Prescribing} in {Atrial} {Fibrillation}}, volume = {37}, issn = {1525-1497}, url = {https://doi.org/10.1007/s11606-021-06834-1}, doi = {10.1007/s11606-021-06834-1}, abstract = {Frailty is often cited as a factor influencing oral anticoagulation (OAC) prescription in patients with non-valvular atrial fibrillation (NVAF). We sought to determine the prevalence of frailty and its association with OAC prescription in older veterans with NVAF.}, language = {en}, number = {4}, urldate = {2022-12-13}, journal = {Journal of General Internal Medicine}, author = {Sanghai, Saket R. and Liu, Weisong and Wang, Weijia and Rongali, Subendhu and Orkaby, Ariela R. and Saczynski, Jane S. and Rose, Adam J. and Kapoor, Alok and Li, Wenjun and Yu, Hong and McManus, David D.}, month = mar, year = {2022}, keywords = {atrial fibrillation, frailty, oral anticoagulation}, pages = {730--736}, }
Frailty is often cited as a factor influencing oral anticoagulation (OAC) prescription in patients with non-valvular atrial fibrillation (NVAF). We sought to determine the prevalence of frailty and its association with OAC prescription in older veterans with NVAF.
Using data science to improve outcomes for persons with opioid use disorder.
Hayes, C. J.; Cucciare, M. A.; Martin, B. C.; Hudson, T. J.; Bush, K.; Lo-Ciganic, W.; Yu, H.; Charron, E.; and Gordon, A. J.
Substance Abuse, 43(1): 956–963. 2022.
Paper doi link bibtex abstract
@article{hayes_using_2022, title = {Using data science to improve outcomes for persons with opioid use disorder}, volume = {43}, issn = {1547-0164}, url = {https://pubmed.ncbi.nlm.nih.gov/35420927/}, doi = {10.1080/08897077.2022.2060446}, abstract = {Medication treatment for opioid use disorder (MOUD) is an effective evidence-based therapy for decreasing opioid-related adverse outcomes. Effective strategies for retaining persons on MOUD, an essential step to improving outcomes, are needed as roughly half of all persons initiating MOUD discontinue within a year. Data science may be valuable and promising for improving MOUD retention by using "big data" (e.g., electronic health record data, claims data mobile/sensor data, social media data) and specific machine learning techniques (e.g., predictive modeling, natural language processing, reinforcement learning) to individualize patient care. Maximizing the utility of data science to improve MOUD retention requires a three-pronged approach: (1) increasing funding for data science research for OUD, (2) integrating data from multiple sources including treatment for OUD and general medical care as well as data not specific to medical care (e.g., mobile, sensor, and social media data), and (3) applying multiple data science approaches with integrated big data to provide insights and optimize advances in the OUD and overall addiction fields.}, language = {eng}, number = {1}, journal = {Substance Abuse}, author = {Hayes, Corey J. and Cucciare, Michael A. and Martin, Bradley C. and Hudson, Teresa J. and Bush, Keith and Lo-Ciganic, Weihsuan and Yu, Hong and Charron, Elizabeth and Gordon, Adam J.}, year = {2022}, pmid = {35420927 PMCID: PMC9705076}, keywords = {Opioid-related disorders, big data, machine learning}, pages = {956--963}, }
Medication treatment for opioid use disorder (MOUD) is an effective evidence-based therapy for decreasing opioid-related adverse outcomes. Effective strategies for retaining persons on MOUD, an essential step to improving outcomes, are needed as roughly half of all persons initiating MOUD discontinue within a year. Data science may be valuable and promising for improving MOUD retention by using "big data" (e.g., electronic health record data, claims data, mobile/sensor data, social media data) and specific machine learning techniques (e.g., predictive modeling, natural language processing, reinforcement learning) to individualize patient care. Maximizing the utility of data science to improve MOUD retention requires a three-pronged approach: (1) increasing funding for data science research for OUD, (2) integrating data from multiple sources including treatment for OUD and general medical care as well as data not specific to medical care (e.g., mobile, sensor, and social media data), and (3) applying multiple data science approaches with integrated big data to provide insights and optimize advances in the OUD and overall addiction fields.
An Investigation of Social Determinants of Health in UMLS.
Rawat, B. P. S.; and Yu, H.
In Houston TX USA, May 2022.
AMIA Clinical Informatics 2022
link bibtex
@inproceedings{rawat_investigation_2022, address = {Houston TX USA}, title = {An {Investigation} of {Social} {Determinants} of {Health} in {UMLS}}, author = {Rawat, Bhanu Pratap Singh and Yu, Hong}, month = may, year = {2022}, note = {AMIA Clinical Informatics 2022}, }
Generating Coherent Narratives with Subtopic Planning to Answer How-to Questions.
Cai, P.; Yu, M.; Liu, F.; and Yu, H.
In Abu Dhabi, December 2022.
The GEM Workshop at EMNLP 2022
link bibtex
@inproceedings{cai_generating_2022, address = {Abu Dhabi}, title = {Generating {Coherent} {Narratives} with {Subtopic} {Planning} to {Answer} {How}-to {Questions}}, author = {Cai, Pengshan and Yu, Mo and Liu, Fei and Yu, Hong}, month = dec, year = {2022}, note = {The GEM Workshop at EMNLP 2022}, }
UMass A&P: An Assessment and Plan Reasoning System of UMass in the 2022 N2C2 Challenge.
Kwon, S.; Yang, Z.; and Yu, H.
November 2022.
2022 n2c2 Workshop, Washington DC
link bibtex
@misc{kwon_umass_2022, address = {Washington DC USA}, title = {{UMass} {A}\&{P}: {An} {Assessment} and {Plan} {Reasoning} {System} of {UMass} in the 2022 {N2C2} {Challenge}}, author = {Kwon, Sunjae and Yang, Zhichao and Yu, Hong}, month = nov, year = {2022}, note = {2022 n2c2 Workshop, Washington DC}, }
Racial differences in receipt of medications for opioid use disorder before and during the COVID-19 pandemic in the Veterans Health Administration.
Sung, M. L.; Li, W.; León, C.; Reisman, J.; Liu, W.; Kerns, R. D.; Yu, H.; and Becker, W. C.
November 2022.
APHA 2022 Annual Meeting and Expo, Boston MA
link bibtex
@misc{sung_racial_2022, address = {Boston MA, USA}, title = {Racial differences in receipt of medications for opioid use disorder before and during the {COVID}-19 pandemic in the {Veterans} {Health} {Administration}}, author = {Sung, Minhee L. and Li, Wenjun and León, Casey and Reisman, Joel and Liu, Weisong and Kerns, Robert D. and Yu, Hong and Becker, William C.}, month = nov, year = {2022}, note = {APHA 2022 Annual Meeting and Expo, Boston MA}, }
Using Machine Learning to Predict Opioid Overdose Using Electronic Health Record.
Wang, X.; Li, R.; Druhl, E.; Li, W.; Sung, M. L.; Kerns, R. D.; Becker, W. C.; and Yu, H.
November 2022.
APHA 2022 Annual Meeting and Expo, Boston MA
link bibtex
@misc{wang_using_2022, address = {Boston MA, USA}, title = {Using {Machine} {Learning} to {Predict} {Opioid} {Overdose} {Using} {Electronic} {Health} {Record}}, author = {Wang, Xun and Li, Rumeng and Druhl, Emily and Li, Wenjun and Sung, Minhee L. and Kerns, Robert D. and Becker, William C. and Yu, Hong}, month = nov, year = {2022}, note = {APHA 2022 Annual Meeting and Expo, Boston MA}, }
Automatically Detecting Opioid-Related Aberrant Behaviors from Electronic Health Records.
Wang, X.; Li, R.; Lingeman, J. M.; Druhl, E.; Li, W.; Sung, M. L.; Kerns, R. D.; Becker, W. C.; and Yu, H.
November 2022.
APHA 2022 Annual Meeting and Expo, Boston MA
link bibtex
@misc{wang_automatically_2022, address = {Boston MA, USA}, title = {Automatically {Detecting} {Opioid}-{Related} {Aberrant} {Behaviors} from {Electronic} {Health} {Records}}, author = {Wang, Xun and Li, Rumeng and Lingeman, Jesse M. and Druhl, Emily and Li, Wenjun and Sung, Minhee L. and Kerns, Robert D. and Becker, William C. and Yu, Hong}, month = nov, year = {2022}, note = {APHA 2022 Annual Meeting and Expo, Boston MA}, }
Pretraining of Patient Representations On Structured Electronic Health Records for Patient Outcome Prediction: case study as self-harm screening tool.
Yang, Z.; and Yu, H.
In Washington DC USA, June 2022.
ARM2022
link bibtex
@inproceedings{yang_pretraining_2022, address = {Washington DC USA}, title = {Pretraining of {Patient} {Representations} {On} {Structured} {Electronic} {Health} {Records} for {Patient} {Outcome} {Prediction}: case study as self-harm screening tool}, shorttitle = {{ARM} 2022}, author = {Yang, Zhichao and Hong, Yu}, month = jun, year = {2022}, note = {ARM2022}, }
SBDH and Suicide: A Multi-Task Learning Framework for SBDH Detection in Electronic Health Records Using NLP.
Mitra, A.; Rawat, B. P. S.; Druhl, E. B.; Keating, H.; Goodwin, R.; Hu, W.; Liu, W.; Tsai, J.; Smelson, D. A.; and Yu, H.
In Washington DC USA, June 2022.
ARM 2022
link bibtex
@inproceedings{mitra_sbdh_2022, address = {Washington DC USA}, title = {{SBDH} and {Suicide}: {A} {Multi}-{Task} {Learning} {Framework} for {SBDH} {Detection} in {Electronic} {Health} {Records} {Using} {NLP}}, shorttitle = {{ARM} 2022}, author = {Mitra, Avijit and Rawat, Bhanu Pratap Singh and Druhl, Emily B. and Keating, Heather and Goodwin, Raelene and Hu, Wen and Liu, Weisong and Tsai, Jack and Smelson, David A. and Yu, Hong}, month = jun, year = {2022}, note = {ARM 2022}, }
Studying Association of Traumatic Brain Injury and Posttraumatic Stress Disorder Diagnoses with Hospitalized Self-Harm Among US Veterans, 2008-2017.
Rawat, B. P. S.; Reisman, J.; Rongali, S.; Liu, W.; Yu, H.; and Carlson, K.
In Washington DC USA, June 2022.
ARM 2022 (Poster)
link bibtex
@inproceedings{rawat_studying_2022, address = {Washington DC USA}, title = {Studying {Association} of {Traumatic} {Brain} {Injury} and {Posttraumatic} {Stress} {Disorder} {Diagnoses} with {Hospitalized} {Self}-{Harm} {Among} {US} {Veterans}, 2008-2017}, shorttitle = {{ARM} 2022}, author = {Rawat, Bhanu Pratap Singh and Reisman, Joel and Rongali, Subendhu and Liu, Weisong and Yu, Hong and Carlson, Kathleen}, month = jun, year = {2022}, note = {ARM 2022 (Poster)}, }
NLP and Annie App for Social Determinants of Health.
Mahapatra, S.; Chen, H.; Tsai, J.; and Yu, H.
In Houston TX USA, May 2022.
AMIA Clinical Informatics 2022
link bibtex
@inproceedings{mahapatra_nlp_2022, address = {Houston TX USA}, title = {{NLP} and {Annie} {App} for {Social} {Determinants} of {Health}}, author = {Mahapatra, Sneha and Chen, Huan-Yuan and Tsai, Jack and Yu, Hong}, month = may, year = {2022}, note = {AMIA Clinical Informatics 2022}, }
EASE: A Tool to Extract Social Determinants of Health from Electronic Health Records.
Rawat, B. P. S.; and Yu, H.
In Houston TX USA, May 2022.
AMIA Clinical Informatics 2022 (System Demo)
link bibtex
@inproceedings{rawat_ease_2022, address = {Houston TX USA}, title = {{EASE}: {A} {Tool} to {Extract} {Social} {Determinants} of {Health} from {Electronic} {Health} {Records}}, author = {Rawat, Bhanu Pratap Singh and Yu, Hong}, month = may, year = {2022}, note = {AMIA Clinical Informatics 2022 (System Demo)}, }
The association of prescribed long-acting versus short-acting opioids and mortality among older adults.
Sung, M.; Smirnova, J.; Li, W.; Liu, W.; Kerns, R. D.; Reisman, J. I.; Yu, H.; and Becker, W. C.
In Society of General Internal Medicine Annual National Meeting, Orlando, Florida, USA, April 2022.
link bibtex
@inproceedings{sung_association_2022, address = {Orlando, Florida, USA}, title = {The association of prescribed long-acting versus short-acting opioids and mortality among older adults}, booktitle = {Society of {General} {Internal} {Medicine} {Annual} {National} {Meeting}}, author = {Sung, Minhee and Smirnova, Jimin and Li, Wenjun and Liu, Weisong and Kerns, Robert D. and Reisman, Joel I. and Yu, Hong and Becker, William C.}, month = apr, year = {2022}, }
EHR Cohort Development Using Natural Language Processing For Identifying Symptoms Of Alzheimer's Disease.
Yu, H.; Mitra, A.; Keating, H.; Liu, W.; Hu, W.; Xia, W.; Morin, P.; Berlowitz, D. R.; Bray, M.; Monfared, A.; and Zhang, Q.
In Barcelona, Spain (Online), March 2022.
AD/PD 2022
link bibtex
@inproceedings{yu_ehr_2022, address = {Barcelona, Spain (Online)}, title = {{EHR} {Cohort} {Development} {Using} {Natural} {Language} {Processing} {For} {Identifying} {Symptoms} {Of} {Alzheimer}'s {Disease}}, shorttitle = {{AD}/{PD} 2022}, author = {Yu, Hong and Mitra, Avijit and Keating, Heather and Liu, Weisong and Hu, Wen and Xia, Weiming and Morin, Peter and Berlowitz, Dan R. and Bray, Margaret and Monfared, Amir and Zhang, Quanwu}, month = mar, year = {2022}, note = {AD/PD 2022}, }
2021
(9)
MIMIC-SBDH: A Dataset for Social and Behavioral Determinants of Health.
Ahsan, H.; Ohnuki, E.; Mitra, A.; and Yu, H.
Proceedings of Machine Learning Research, 149: 391–413. August 2021.
Paper link bibtex abstract
@article{ahsan_mimic-sbdh_2021, title = {{MIMIC}-{SBDH}: {A} {Dataset} for {Social} and {Behavioral} {Determinants} of {Health}}, volume = {149}, issn = {2640-3498}, shorttitle = {{MIMIC}-{SBDH}}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8734043/}, abstract = {Social and Behavioral Determinants of Health (SBDHs) are environmental and behavioral factors that have a profound impact on health and related outcomes. Given their importance, physicians document SBDHs of their patients in Electronic Health Records (EHRs). However, SBDHs are mostly documented in unstructured EHR notes. Determining the status of the SBDHs requires manually reviewing the notes which can be a tedious process. Therefore, there is a need to automate identifying the patients’ SBDH status in EHR notes. In this work, we created MIMIC-SBDH, the first publicly available dataset of EHR notes annotated for patients’ SBDH status. Specifically, we annotated 7,025 discharge summary notes for the status of 7 SBDHs as well as marked SBDH-related keywords. Using this annotated data for training and evaluation, we evaluated the performance of three machine learning models (Random Forest, XGBoost, and Bio-ClinicalBERT) on the task of identifying SBDH status in EHR notes. The performance ranged from the lowest 0.69 F1 score for Drug Use to the highest 0.96 F1 score for Community-Present. In addition to standard evaluation metrics such as the F1 score, we evaluated four capabilities that a model must possess to perform well on the task using the CheckList tool (). The results revealed several shortcomings of the models. Our results highlighted the need to perform more capability-centric evaluations in addition to standard metric comparisons.}, urldate = {2024-04-10}, journal = {Proceedings of machine learning research}, author = {Ahsan, Hiba and Ohnuki, Emmie and Mitra, Avijit and Yu, Hong}, month = aug, year = {2021}, pmid = {35005628}, pmcid = {PMC8734043}, pages = {391--413}, }
Social and Behavioral Determinants of Health (SBDHs) are environmental and behavioral factors that have a profound impact on health and related outcomes. Given their importance, physicians document SBDHs of their patients in Electronic Health Records (EHRs). However, SBDHs are mostly documented in unstructured EHR notes. Determining the status of the SBDHs requires manually reviewing the notes, which can be a tedious process. Therefore, there is a need to automate identifying the patients’ SBDH status in EHR notes. In this work, we created MIMIC-SBDH, the first publicly available dataset of EHR notes annotated for patients’ SBDH status. Specifically, we annotated 7,025 discharge summary notes for the status of 7 SBDHs as well as marked SBDH-related keywords. Using this annotated data for training and evaluation, we evaluated the performance of three machine learning models (Random Forest, XGBoost, and Bio-ClinicalBERT) on the task of identifying SBDH status in EHR notes. The performance ranged from the lowest 0.69 F1 score for Drug Use to the highest 0.96 F1 score for Community-Present. In addition to standard evaluation metrics such as the F1 score, we evaluated four capabilities that a model must possess to perform well on the task using the CheckList tool. The results revealed several shortcomings of the models. Our results highlighted the need to perform more capability-centric evaluations in addition to standard metric comparisons.
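To make the task concrete, here is a minimal sketch of the weakest kind of baseline the paper evaluates: a Random Forest over TF-IDF features predicting an SBDH status label from note text. The toy notes and label names are invented placeholders, not MIMIC data or the paper's label set.

# Toy baseline: TF-IDF features + Random Forest for SBDH status.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

notes = [
    "Patient reports stable housing and strong family support.",
    "Pt is homeless, staying in shelters intermittently.",
]
labels = ["community-present", "housing-insecure"]  # hypothetical labels

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    RandomForestClassifier(n_estimators=200, random_state=0))
clf.fit(notes, labels)
print(clf.predict(["No stable housing; lives in a shelter."]))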
Risk Factors Associated With Nonfatal Opioid Overdose Leading to Intensive Care Unit Admission: A Cross-sectional Study.
Mitra, A.; Ahsan, H.; Li, W.; Liu, W.; Kerns, R. D.; Tsai, J.; Becker, W.; Smelson, D. A.; and Yu, H.
JMIR medical informatics, 9(11): e32851. November 2021.
doi link bibtex abstract
@article{mitra_risk_2021, title = {Risk {Factors} {Associated} {With} {Nonfatal} {Opioid} {Overdose} {Leading} to {Intensive} {Care} {Unit} {Admission}: {A} {Cross}-sectional {Study}}, volume = {9}, issn = {2291-9694}, shorttitle = {Risk {Factors} {Associated} {With} {Nonfatal} {Opioid} {Overdose} {Leading} to {Intensive} {Care} {Unit} {Admission}}, doi = {10.2196/32851}, abstract = {BACKGROUND: Opioid overdose (OD) and related deaths have significantly increased in the United States over the last 2 decades. Existing studies have mostly focused on demographic and clinical risk factors in noncritical care settings. Social and behavioral determinants of health (SBDH) are infrequently coded in the electronic health record (EHR) and usually buried in unstructured EHR notes, reflecting possible gaps in clinical care and observational research. Therefore, SBDH often receive less attention despite being important risk factors for OD. Natural language processing (NLP) can alleviate this problem. OBJECTIVE: The objectives of this study were two-fold: First, we examined the usefulness of NLP for SBDH extraction from unstructured EHR text, and second, for intensive care unit (ICU) admissions, we investigated risk factors including SBDH for nonfatal OD. METHODS: We performed a cross-sectional analysis of admission data from the EHR of patients in the ICU of Beth Israel Deaconess Medical Center between 2001 and 2012. We used patient admission data and International Classification of Diseases, Ninth Revision (ICD-9) diagnoses to extract demographics, nonfatal OD, SBDH, and other clinical variables. In addition to obtaining SBDH information from the ICD codes, an NLP model was developed to extract 6 SBDH variables from EHR notes, namely, housing insecurity, unemployment, social isolation, alcohol use, smoking, and illicit drug use. We adopted a sequential forward selection process to select relevant clinical variables. Multivariable logistic regression analysis was used to evaluate the associations with nonfatal OD, and relative risks were quantified as covariate-adjusted odds ratios (aOR). RESULTS: The strongest association with nonfatal OD was found to be drug use disorder (aOR 8.17, 95\% CI 5.44-12.27), followed by bipolar disorder (aOR 2.69, 95\% CI 1.68-4.29). Among others, major depressive disorder (aOR 2.57, 95\% CI 1.12-5.88), being on a Medicaid health insurance program (aOR 2.26, 95\% CI 1.43-3.58), history of illicit drug use (aOR 2.09, 95\% CI 1.15-3.79), and current use of illicit drugs (aOR 2.06, 95\% CI 1.20-3.55) were strongly associated with increased risk of nonfatal OD. Conversely, Blacks (aOR 0.51, 95\% CI 0.28-0.94), older age groups (40-64 years: aOR 0.65, 95\% CI 0.44-0.96; {\textgreater}64 years: aOR 0.16, 95\% CI 0.08-0.34) and those with tobacco use disorder (aOR 0.53, 95\% CI 0.32-0.89) or alcohol use disorder (aOR 0.64, 95\% CI 0.42-1.00) had decreased risk of nonfatal OD. Moreover, 99.82\% of all SBDH information was identified by the NLP model, in contrast to only 0.18\% identified by the ICD codes. CONCLUSIONS: This is the first study to analyze the risk factors for nonfatal OD in an ICU setting using NLP-extracted SBDH from EHR notes. We found several risk factors associated with nonfatal OD including SBDH. SBDH are richly described in EHR notes, supporting the importance of integrating NLP-derived SBDH into OD risk assessment. More studies in ICU settings can help health care systems better understand and respond to the opioid epidemic.}, language = {eng}, number = {11}, journal = {JMIR medical informatics}, author = {Mitra, Avijit and Ahsan, Hiba and Li, Wenjun and Liu, Weisong and Kerns, Robert D. and Tsai, Jack and Becker, William and Smelson, David A. and Yu, Hong}, month = nov, year = {2021}, pmid = {34747714}, pmcid = {PMC8663596}, keywords = {electronic health records, intensive care unit, natural language processing, opioids, overdose, risk factors, social and behavioral determinants of health}, pages = {e32851}, }
BACKGROUND: Opioid overdose (OD) and related deaths have significantly increased in the United States over the last 2 decades. Existing studies have mostly focused on demographic and clinical risk factors in noncritical care settings. Social and behavioral determinants of health (SBDH) are infrequently coded in the electronic health record (EHR) and usually buried in unstructured EHR notes, reflecting possible gaps in clinical care and observational research. Therefore, SBDH often receive less attention despite being important risk factors for OD. Natural language processing (NLP) can alleviate this problem. OBJECTIVE: The objectives of this study were two-fold: First, we examined the usefulness of NLP for SBDH extraction from unstructured EHR text, and second, for intensive care unit (ICU) admissions, we investigated risk factors including SBDH for nonfatal OD. METHODS: We performed a cross-sectional analysis of admission data from the EHR of patients in the ICU of Beth Israel Deaconess Medical Center between 2001 and 2012. We used patient admission data and International Classification of Diseases, Ninth Revision (ICD-9) diagnoses to extract demographics, nonfatal OD, SBDH, and other clinical variables. In addition to obtaining SBDH information from the ICD codes, an NLP model was developed to extract 6 SBDH variables from EHR notes, namely, housing insecurity, unemployment, social isolation, alcohol use, smoking, and illicit drug use. We adopted a sequential forward selection process to select relevant clinical variables. Multivariable logistic regression analysis was used to evaluate the associations with nonfatal OD, and relative risks were quantified as covariate-adjusted odds ratios (aOR). RESULTS: The strongest association with nonfatal OD was found to be drug use disorder (aOR 8.17, 95% CI 5.44-12.27), followed by bipolar disorder (aOR 2.69, 95% CI 1.68-4.29). Among others, major depressive disorder (aOR 2.57, 95% CI 1.12-5.88), being on a Medicaid health insurance program (aOR 2.26, 95% CI 1.43-3.58), history of illicit drug use (aOR 2.09, 95% CI 1.15-3.79), and current use of illicit drugs (aOR 2.06, 95% CI 1.20-3.55) were strongly associated with increased risk of nonfatal OD. Conversely, Blacks (aOR 0.51, 95% CI 0.28-0.94), older age groups (40-64 years: aOR 0.65, 95% CI 0.44-0.96; >64 years: aOR 0.16, 95% CI 0.08-0.34) and those with tobacco use disorder (aOR 0.53, 95% CI 0.32-0.89) or alcohol use disorder (aOR 0.64, 95% CI 0.42-1.00) had decreased risk of nonfatal OD. Moreover, 99.82% of all SBDH information was identified by the NLP model, in contrast to only 0.18% identified by the ICD codes. CONCLUSIONS: This is the first study to analyze the risk factors for nonfatal OD in an ICU setting using NLP-extracted SBDH from EHR notes. We found several risk factors associated with nonfatal OD including SBDH. SBDH are richly described in EHR notes, supporting the importance of integrating NLP-derived SBDH into OD risk assessment. More studies in ICU settings can help health care systems better understand and respond to the opioid epidemic.
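The covariate-adjusted odds ratios (aOR) quoted above come from multivariable logistic regression; the sketch below shows the standard recipe on synthetic data: fit the model, then exponentiate the coefficients and their confidence intervals. All data and effect sizes here are fabricated for illustration.

# Synthetic illustration of covariate-adjusted odds ratios (aOR).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
X = rng.integers(0, 2, size=(n, 3)).astype(float)  # three binary risk factors
logit = -2.0 + 2.1 * X[:, 0] + 1.0 * X[:, 1] - 0.6 * X[:, 2]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
aor = np.exp(fit.params[1:])     # adjusted odds ratio per factor
ci = np.exp(fit.conf_int()[1:])  # 95% CIs on the odds-ratio scale
print(aor, ci, sep="\n")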
Relation Classification for Bleeding Events From Electronic Health Records Using Deep Learning Systems: An Empirical Study.
Mitra, A.; Rawat, B. P. S.; McManus, D. D.; and Yu, H.
JMIR medical informatics, 9(7): e27527. July 2021.
doi link bibtex abstract
@article{mitra_relation_2021, title = {Relation {Classification} for {Bleeding} {Events} {From} {Electronic} {Health} {Records} {Using} {Deep} {Learning} {Systems}: {An} {Empirical} {Study}}, volume = {9}, issn = {2291-9694}, shorttitle = {Relation {Classification} for {Bleeding} {Events} {From} {Electronic} {Health} {Records} {Using} {Deep} {Learning} {Systems}}, doi = {10.2196/27527}, abstract = {BACKGROUND: Accurate detection of bleeding events from electronic health records (EHRs) is crucial for identifying and characterizing different common and serious medical problems. To extract such information from EHRs, it is essential to identify the relations between bleeding events and related clinical entities (eg, bleeding anatomic sites and lab tests). With the advent of natural language processing (NLP) and deep learning (DL)-based techniques, many studies have focused on their applicability for various clinical applications. However, no prior work has utilized DL to extract relations between bleeding events and relevant entities. OBJECTIVE: In this study, we aimed to evaluate multiple DL systems on a novel EHR data set for bleeding event-related relation classification. METHODS: We first expert annotated a new data set of 1046 deidentified EHR notes for bleeding events and their attributes. On this data set, we evaluated three state-of-the-art DL architectures for the bleeding event relation classification task, namely, convolutional neural network (CNN), attention-guided graph convolutional network (AGGCN), and Bidirectional Encoder Representations from Transformers (BERT). We used three BERT-based models, namely, BERT pretrained on biomedical data (BioBERT), BioBERT pretrained on clinical text (Bio+Clinical BERT), and BioBERT pretrained on EHR notes (EhrBERT). RESULTS: Our experiments showed that the BERT-based models significantly outperformed the CNN and AGGCN models. Specifically, BioBERT achieved a macro F1 score of 0.842, outperforming both the AGGCN (macro F1 score, 0.828) and CNN models (macro F1 score, 0.763) by 1.4\% (P{\textless}.001) and 7.9\% (P{\textless}.001), respectively. CONCLUSIONS: In this comprehensive study, we explored and compared different DL systems to classify relations between bleeding events and other medical concepts. On our corpus, BERT-based models outperformed other DL models for identifying the relations of bleeding-related entities. In addition to pretrained contextualized word representation, BERT-based models benefited from the use of target entity representation over traditional sequence representation.}, language = {eng}, number = {7}, journal = {JMIR medical informatics}, author = {Mitra, Avijit and Rawat, Bhanu Pratap Singh and McManus, David D. and Yu, Hong}, month = jul, year = {2021}, pmid = {34255697}, pmcid = {PMC8285744}, keywords = {BERT, CNN, GCN, bleeding, electronic health records, relation classification}, pages = {e27527}, }
BACKGROUND: Accurate detection of bleeding events from electronic health records (EHRs) is crucial for identifying and characterizing different common and serious medical problems. To extract such information from EHRs, it is essential to identify the relations between bleeding events and related clinical entities (eg, bleeding anatomic sites and lab tests). With the advent of natural language processing (NLP) and deep learning (DL)-based techniques, many studies have focused on their applicability for various clinical applications. However, no prior work has utilized DL to extract relations between bleeding events and relevant entities. OBJECTIVE: In this study, we aimed to evaluate multiple DL systems on a novel EHR data set for bleeding event-related relation classification. METHODS: We first expert annotated a new data set of 1046 deidentified EHR notes for bleeding events and their attributes. On this data set, we evaluated three state-of-the-art DL architectures for the bleeding event relation classification task, namely, convolutional neural network (CNN), attention-guided graph convolutional network (AGGCN), and Bidirectional Encoder Representations from Transformers (BERT). We used three BERT-based models, namely, BERT pretrained on biomedical data (BioBERT), BioBERT pretrained on clinical text (Bio+Clinical BERT), and BioBERT pretrained on EHR notes (EhrBERT). RESULTS: Our experiments showed that the BERT-based models significantly outperformed the CNN and AGGCN models. Specifically, BioBERT achieved a macro F1 score of 0.842, outperforming both the AGGCN (macro F1 score, 0.828) and CNN models (macro F1 score, 0.763) by 1.4% (P<.001) and 7.9% (P<.001), respectively. CONCLUSIONS: In this comprehensive study, we explored and compared different DL systems to classify relations between bleeding events and other medical concepts. On our corpus, BERT-based models outperformed other DL models for identifying the relations of bleeding-related entities. In addition to pretrained contextualized word representation, BERT-based models benefited from the use of target entity representation over traditional sequence representation.
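A compressed sketch of the BERT-based setup this study compares: wrap the two candidate entities in marker tokens and score relation labels with a sequence classifier. The generic bert-base-uncased checkpoint, the [E1]/[E2] marker scheme, and the three-label set are stand-ins for the BioBERT variants and exact input format used in the paper.

# Relation classification sketch with entity markers (illustrative).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
tok.add_tokens(["[E1]", "[/E1]", "[E2]", "[/E2]"])
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # e.g. site / lab-test / no-relation
model.resize_token_embeddings(len(tok))

text = "[E1] GI bleed [/E1] suspected; [E2] hemoglobin [/E2] trending down."
logits = model(**tok(text, return_tensors="pt")).logits
print(logits.argmax(-1))  # predicted relation label (untrained here)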
Guideline-discordant dosing of direct-acting oral anticoagulants in the veterans health administration.
Rose, A. J.; Lee, J. S.; Berlowitz, D. R.; Liu, W.; Mitra, A.; and Yu, H.
BMC health services research, 21(1): 1351. December 2021.
doi link bibtex abstract
@article{rose_guideline-discordant_2021, title = {Guideline-discordant dosing of direct-acting oral anticoagulants in the veterans health administration}, volume = {21}, issn = {1472-6963}, doi = {10.1186/s12913-021-07397-x}, abstract = {BACKGROUND: Clear guidelines exist to guide the dosing of direct-acting oral anticoagulants (DOACs). It is not known how consistently these guidelines are followed in practice. METHODS: We studied patients from the Veterans Health Administration (VA) with non-valvular atrial fibrillation who received DOACs (dabigatran, rivaroxaban, apixaban) between 2010 and 2016. We used patient characteristics (age, creatinine, body mass) to identify which patients met guideline recommendations for low-dose therapy and which for full-dose therapy. We examined how often patient dosing was concordant with these recommendations. We examined variation in guideline-concordant dosing by site of care and over time. We examined patient-level predictors of guideline-concordant dosing using multivariable logistic models. RESULTS: A total of 73,672 patients who were prescribed DOACS were included. Of 5837 patients who were recommended to receive low-dose therapy, 1331 (23\%) received full-dose therapy instead. Of 67,935 patients recommended to receive full-dose therapy, 4079 (6\%) received low-dose therapy instead. Sites varied widely on guideline discordant dosing; on inappropriate low-dose therapy, sites varied from 0 to 15\%, while on inappropriate high-dose therapy, from 0 to 41\%. Guideline discordant therapy decreased by about 20\% in a relative sense over time, but its absolute numbers grew as DOAC therapy became more common. The most important patient-level predictors of receiving guideline-discordant therapy were older age and creatinine function being near the cutoff value. CONCLUSIONS: A substantial portion of DOAC prescriptions in the VA system are dosed contrary to clinical guidelines. This phenomenon varies widely across sites of care and has persisted over time.}, language = {eng}, number = {1}, journal = {BMC health services research}, author = {Rose, Adam J. and Lee, Jong Soo and Berlowitz, Dan R. and Liu, Weisong and Mitra, Avijit and Yu, Hong}, month = dec, year = {2021}, pmid = {34922546}, pmcid = {PMC8684634}, keywords = {Aged, Anticoagulants, Atrial Fibrillation, Atrial fibrillation, Dabigatran, Factor Xa Inhibitors, Humans, Medication therapy management, Quality of health care, Rivaroxaban, Veterans Health}, pages = {1351}, }
BACKGROUND: Clear guidelines exist to guide the dosing of direct-acting oral anticoagulants (DOACs). It is not known how consistently these guidelines are followed in practice. METHODS: We studied patients from the Veterans Health Administration (VA) with non-valvular atrial fibrillation who received DOACs (dabigatran, rivaroxaban, apixaban) between 2010 and 2016. We used patient characteristics (age, creatinine, body mass) to identify which patients met guideline recommendations for low-dose therapy and which for full-dose therapy. We examined how often patient dosing was concordant with these recommendations. We examined variation in guideline-concordant dosing by site of care and over time. We examined patient-level predictors of guideline-concordant dosing using multivariable logistic models. RESULTS: A total of 73,672 patients who were prescribed DOACS were included. Of 5837 patients who were recommended to receive low-dose therapy, 1331 (23%) received full-dose therapy instead. Of 67,935 patients recommended to receive full-dose therapy, 4079 (6%) received low-dose therapy instead. Sites varied widely on guideline discordant dosing; on inappropriate low-dose therapy, sites varied from 0 to 15%, while on inappropriate high-dose therapy, from 0 to 41%. Guideline discordant therapy decreased by about 20% in a relative sense over time, but its absolute numbers grew as DOAC therapy became more common. The most important patient-level predictors of receiving guideline-discordant therapy were older age and creatinine function being near the cutoff value. CONCLUSIONS: A substantial portion of DOAC prescriptions in the VA system are dosed contrary to clinical guidelines. This phenomenon varies widely across sites of care and has persisted over time.
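To make "guideline-concordant dosing" concrete: audits like this one encode the label criteria as rules and compare them with the prescribed dose. The sketch below uses the widely published apixaban dose-reduction criteria (any two of age >= 80 years, weight <= 60 kg, serum creatinine >= 1.5 mg/dL) purely for illustration; it is not the study's actual rule set or code.

# Rule sketch: flag apixaban prescriptions that do not match the dose
# the label criteria would recommend (illustrative cutoffs).
def recommended_apixaban_dose(age, weight_kg, creatinine_mg_dl):
    criteria_met = sum([age >= 80, weight_kg <= 60, creatinine_mg_dl >= 1.5])
    return "2.5 mg BID" if criteria_met >= 2 else "5 mg BID"

def is_discordant(prescribed, age, weight_kg, creatinine_mg_dl):
    return prescribed != recommended_apixaban_dose(age, weight_kg, creatinine_mg_dl)

print(is_discordant("5 mg BID", age=83, weight_kg=58, creatinine_mg_dl=1.6))  # True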
Evaluating the Effectiveness of NoteAid in a Community Hospital Setting: Randomized Trial of Electronic Health Record Note Comprehension Interventions With Patients.
Lalor, J. P.; Hu, W.; Tran, M.; Wu, H.; Mazor, K. M.; and Yu, H.
Journal of Medical Internet Research, 23(5): e26354. May 2021.
doi link bibtex abstract
@article{lalor_evaluating_2021, title = {Evaluating the {Effectiveness} of {NoteAid} in a {Community} {Hospital} {Setting}: {Randomized} {Trial} of {Electronic} {Health} {Record} {Note} {Comprehension} {Interventions} {With} {Patients}}, volume = {23}, issn = {1438-8871}, shorttitle = {Evaluating the {Effectiveness} of {NoteAid} in a {Community} {Hospital} {Setting}}, doi = {10.2196/26354}, abstract = {BACKGROUND: Interventions to define medical jargon have been shown to improve electronic health record (EHR) note comprehension among crowdsourced participants on Amazon Mechanical Turk (AMT). However, AMT participants may not be representative of the general population or patients who are most at-risk for low health literacy. OBJECTIVE: In this work, we assessed the efficacy of an intervention (NoteAid) for EHR note comprehension among participants in a community hospital setting. METHODS: Participants were recruited from Lowell General Hospital (LGH), a community hospital in Massachusetts, to take the ComprehENotes test, a web-based test of EHR note comprehension. Participants were randomly assigned to control (n=85) or intervention (n=89) groups to take the test without or with NoteAid, respectively. For comparison, we used a sample of 200 participants recruited from AMT to take the ComprehENotes test (100 in the control group and 100 in the intervention group). RESULTS: A total of 174 participants were recruited from LGH, and 200 participants were recruited from AMT. Participants in both intervention groups (community hospital and AMT) scored significantly higher than participants in the control groups (P{\textless}.001). The average score for the community hospital participants was significantly lower than the average score for the AMT participants (P{\textless}.001), consistent with the lower education levels in the community hospital sample. Education level had a significant effect on scores for the community hospital participants (P{\textless}.001). CONCLUSIONS: Use of NoteAid was associated with significantly improved EHR note comprehension in both community hospital and AMT samples. Our results demonstrate the generalizability of ComprehENotes as a test of EHR note comprehension and the effectiveness of NoteAid for improving EHR note comprehension.}, language = {eng}, number = {5}, journal = {Journal of Medical Internet Research}, author = {Lalor, John P. and Hu, Wen and Tran, Matthew and Wu, Hao and Mazor, Kathleen M. and Yu, Hong}, month = may, year = {2021}, pmid = {33983124}, pmcid = {PMC8160802}, keywords = {Comprehension, Electronic Health Records, Hospitals, Community, Humans, comprehension, crowdsourcing, efficacy, electronic health record, health literacy, information storage and retrieval, intervention, literacy, natural language processing, psychometrics}, pages = {e26354}, }
BACKGROUND: Interventions to define medical jargon have been shown to improve electronic health record (EHR) note comprehension among crowdsourced participants on Amazon Mechanical Turk (AMT). However, AMT participants may not be representative of the general population or patients who are most at-risk for low health literacy. OBJECTIVE: In this work, we assessed the efficacy of an intervention (NoteAid) for EHR note comprehension among participants in a community hospital setting. METHODS: Participants were recruited from Lowell General Hospital (LGH), a community hospital in Massachusetts, to take the ComprehENotes test, a web-based test of EHR note comprehension. Participants were randomly assigned to control (n=85) or intervention (n=89) groups to take the test without or with NoteAid, respectively. For comparison, we used a sample of 200 participants recruited from AMT to take the ComprehENotes test (100 in the control group and 100 in the intervention group). RESULTS: A total of 174 participants were recruited from LGH, and 200 participants were recruited from AMT. Participants in both intervention groups (community hospital and AMT) scored significantly higher than participants in the control groups (P<.001). The average score for the community hospital participants was significantly lower than the average score for the AMT participants (P<.001), consistent with the lower education levels in the community hospital sample. Education level had a significant effect on scores for the community hospital participants (P<.001). CONCLUSIONS: Use of NoteAid was associated with significantly improved EHR note comprehension in both community hospital and AMT samples. Our results demonstrate the generalizability of ComprehENotes as a test of EHR note comprehension and the effectiveness of NoteAid for improving EHR note comprehension.
SBDH and Suicide: A Multi-task Learning Framework for SBDH in Electronic Health Records.
Mitra, A.; Rawat, B. P. S.; Druhl, E.; Keating, H.; Goodwin, R.; Hu, W.; Liu, W.; Tsai, J.; Smelson, D. A.; and Yu, H.
In Online, October 2021.
SciNLP 2021
link bibtex
@inproceedings{mitra_sbdh_2021, address = {Online}, title = {{SBDH} and {Suicide}: {A} {Multi}-task {Learning} {Framework} for {SBDH} in {Electronic} {Health} {Records}}, shorttitle = {{SciNLP} 2021}, author = {Mitra, Avijit and Rawat, Bhanu Pratap Singh and Druhl, Emily and Keating, Heather and Goodwin, Raelene and Hu, Wen and Liu, Weisong and Tsai, Jack and Smelson, David A. and Yu, Hong}, month = oct, year = {2021}, note = {SciNLP 2021}, }
Membership Inference Attack Susceptibility of Clinical Language Models.
Jagannatha, A.; Rawat, B. P. S.; and Yu, H.
CoRR, abs/2104.08305. 2021.
arXiv: 2104.08305
Paper link bibtex abstract
@article{jagannatha_membership_2021, title = {Membership {Inference} {Attack} {Susceptibility} of {Clinical} {Language} {Models}}, volume = {abs/2104.08305}, url = {https://arxiv.org/abs/2104.08305}, abstract = {Deep Neural Network (DNN) models have been shown to have high empirical privacy leakages. Clinical language models (CLMs) trained on clinical data have been used to improve performance in biomedical natural language processing tasks. In this work, we investigate the risks of training-data leakage through white-box or black-box access to CLMs. We design and employ membership inference attacks to estimate the empirical privacy leaks for model architectures like BERT and GPT2. We show that membership inference attacks on CLMs lead to non-trivial privacy leakages of up to 7\%. Our results show that smaller models have lower empirical privacy leakages than larger ones, and masked LMs have lower leakages than auto-regressive LMs. We further show that differentially private CLMs can have improved model utility on clinical domain while ensuring low empirical privacy leakage. Lastly, we also study the effects of group-level membership inference and disease rarity on CLM privacy leakages.}, journal = {CoRR}, author = {Jagannatha, Abhyuday and Rawat, Bhanu Pratap Singh and Yu, Hong}, year = {2021}, note = {arXiv: 2104.08305}, }
Deep Neural Network (DNN) models have been shown to have high empirical privacy leakages. Clinical language models (CLMs) trained on clinical data have been used to improve performance in biomedical natural language processing tasks. In this work, we investigate the risks of training-data leakage through white-box or black-box access to CLMs. We design and employ membership inference attacks to estimate the empirical privacy leaks for model architectures like BERT and GPT2. We show that membership inference attacks on CLMs lead to non-trivial privacy leakages of up to 7%. Our results show that smaller models have lower empirical privacy leakages than larger ones, and masked LMs have lower leakages than auto-regressive LMs. We further show that differentially private CLMs can have improved model utility on the clinical domain while ensuring low empirical privacy leakage. Lastly, we also study the effects of group-level membership inference and disease rarity on CLM privacy leakages.
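The simplest attack in this family, and a reasonable mental model for the black-box setting studied above, thresholds the language model's loss on a candidate text: unusually low loss suggests the text was seen in training. The model checkpoint, threshold value, and candidate sentence below are illustrative assumptions, not the paper's attack.

# Loss-threshold membership inference sketch (illustrative).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_loss(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return lm(ids, labels=ids).loss.item()

THRESHOLD = 3.5  # would be calibrated on known member/non-member data
candidate = "Patient admitted with chest pain and shortness of breath."
print("predicted member:", lm_loss(candidate) < THRESHOLD)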
Improving Formality Style Transfer with Context-Aware Rule Injection.
Yao, Z.; and Yu, H.
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1561–1570, Online, August 2021. Association for Computational Linguistics
Paper doi link bibtex abstract
@inproceedings{yao_improving_2021, address = {Online}, title = {Improving {Formality} {Style} {Transfer} with {Context}-{Aware} {Rule} {Injection}}, url = {https://aclanthology.org/2021.acl-long.124}, doi = {10.18653/v1/2021.acl-long.124}, abstract = {Models pre-trained on large-scale regular text corpora often do not work well for user-generated data where the language styles differ significantly from the mainstream text. Here we present Context-Aware Rule Injection (CARI), an innovative method for formality style transfer (FST) by injecting multiple rules into an end-to-end BERT-based encoder and decoder model. CARI is able to learn to select optimal rules based on context. The intrinsic evaluation showed that CARI achieved the new highest performance on the FST benchmark dataset. Our extrinsic evaluation showed that CARI can greatly improve the regular pre-trained models' performance on several tweet sentiment analysis tasks. Our contributions are as follows: 1.We propose a new method, CARI, to integrate rules for pre-trained language models. CARI is context-aware and can trained end-to-end with the downstream NLP applications. 2.We have achieved new state-of-the-art results for FST on the benchmark GYAFC dataset. 3.We are the first to evaluate FST methods with extrinsic evaluation and specifically on sentiment classification tasks. We show that CARI outperformed existing rule-based FST approaches for sentiment classification.}, urldate = {2021-09-21}, booktitle = {Proceedings of the 59th {Annual} {Meeting} of the {Association} for {Computational} {Linguistics} and the 11th {International} {Joint} {Conference} on {Natural} {Language} {Processing} ({Volume} 1: {Long} {Papers})}, publisher = {Association for Computational Linguistics}, author = {Yao, Zonghai and Yu, Hong}, month = aug, year = {2021}, pages = {1561--1570}, }
Models pre-trained on large-scale regular text corpora often do not work well for user-generated data where the language styles differ significantly from the mainstream text. Here we present Context-Aware Rule Injection (CARI), an innovative method for formality style transfer (FST) by injecting multiple rules into an end-to-end BERT-based encoder and decoder model. CARI is able to learn to select optimal rules based on context. The intrinsic evaluation showed that CARI achieved the new highest performance on the FST benchmark dataset. Our extrinsic evaluation showed that CARI can greatly improve the regular pre-trained models' performance on several tweet sentiment analysis tasks. Our contributions are as follows: 1. We propose a new method, CARI, to integrate rules for pre-trained language models. CARI is context-aware and can be trained end-to-end with the downstream NLP applications. 2. We have achieved new state-of-the-art results for FST on the benchmark GYAFC dataset. 3. We are the first to evaluate FST methods with extrinsic evaluation and specifically on sentiment classification tasks. We show that CARI outperformed existing rule-based FST approaches for sentiment classification.
Epinoter: A Natural Language Processing Tool for Epidemiological Studies.
Liu, W.; Li, F.; Jin, Y.; Granillo, E.; Yarzebski, J.; Li, W.; and Yu, H.
In Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies, volume 5, pages 754–761, February 2021.
link bibtex
@inproceedings{liu_epinoter_2021, title = {Epinoter: {A} {Natural} {Language} {Processing} {Tool} for {Epidemiological} {Studies}.}, volume = {5}, booktitle = {Proceedings of the 14th {International} {Joint} {Conference} on {Biomedical} {Engineering} {Systems} and {Technologies}}, author = {Liu, Weisong and Li, Fei and Jin, Yonghao and Granillo, Edgard and Yarzebski, Jorge and Li, Wenjun and Yu, Hong}, month = feb, year = {2021}, pages = {754--761}, }
2020
(15)
Generating Accurate Electronic Health Assessment from Medical Graph.
Yang, Z.; and Yu, H.
In Cohn, T.; He, Y.; and Liu, Y., editor(s), Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3764–3773, Online, November 2020. Association for Computational Linguistics
Paper doi link bibtex abstract
@inproceedings{yang_generating_2020, address = {Online}, title = {Generating {Accurate} {Electronic} {Health} {Assessment} from {Medical} {Graph}}, url = {https://aclanthology.org/2020.findings-emnlp.336}, doi = {10.18653/v1/2020.findings-emnlp.336}, abstract = {One of the fundamental goals of artificial intelligence is to build computer-based expert systems. Inferring clinical diagnoses to generate a clinical assessment during a patient encounter is a crucial step towards building a medical diagnostic system. Previous works were mainly based on either medical domain-specific knowledge, or patients' prior diagnoses and clinical encounters. In this paper, we propose a novel model for automated clinical assessment generation (MCAG). MCAG is built on an innovative graph neural network, where rich clinical knowledge is incorporated into an end-to-end corpus-learning system. Our evaluation results against physician generated gold standard show that MCAG significantly improves the BLEU and rouge score compared with competitive baseline models. Further, physicians' evaluation showed that MCAG could generate high-quality assessments.}, urldate = {2023-11-15}, booktitle = {Findings of the {Association} for {Computational} {Linguistics}: {EMNLP} 2020}, publisher = {Association for Computational Linguistics}, author = {Yang, Zhichao and Yu, Hong}, editor = {Cohn, Trevor and He, Yulan and Liu, Yang}, month = nov, year = {2020}, pages = {3764--3773}, }
One of the fundamental goals of artificial intelligence is to build computer-based expert systems. Inferring clinical diagnoses to generate a clinical assessment during a patient encounter is a crucial step towards building a medical diagnostic system. Previous works were mainly based on either medical domain-specific knowledge, or patients' prior diagnoses and clinical encounters. In this paper, we propose a novel model for automated clinical assessment generation (MCAG). MCAG is built on an innovative graph neural network, where rich clinical knowledge is incorporated into an end-to-end corpus-learning system. Our evaluation results against a physician-generated gold standard show that MCAG significantly improves BLEU and ROUGE scores compared with competitive baseline models. Further, physicians' evaluation showed that MCAG could generate high-quality assessments.
Inferring ADR causality by predicting the Naranjo Score from Clinical Notes.
Rawat, B. P. S.; Jagannatha, A.; Liu, F.; and Yu, H.
In AMIA Fall Symposium, pages 1041–1049, 2020.
Paper link bibtex abstract
@inproceedings{rawat_inferring_2020, title = {Inferring {ADR} causality by predicting the {Naranjo} {Score} from {Clinical} {Notes}}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8075501/}, abstract = {Clinical judgment studies are an integral part of drug safety surveillance and pharmacovigilance frameworks. They help quantify the causal relationship between medication and its adverse drug reactions (ADRs). To conduct such studies, physicians need to review patients’ charts manually to answer Naranjo questionnaire1. In this paper, we propose a methodology to automatically infer causal relations from patients’ discharge summaries by combining the capabilities of deep learning and statistical learning models. We use Bidirectional Encoder Representations from Transformers (BERT)2 to extract relevant paragraphs for each Naranjo question and then use a statistical learning model such as logistic regression to predict the Naranjo score and the causal relation between the medication and an ADR. Our methodology achieves a macro-averaged f1-score of 0.50 and weighted f1-score of 0.63.}, booktitle = {{AMIA} {Fall} {Symposium}}, author = {Rawat, Bhanu Pratap Singh and Jagannatha, Abhyuday and Liu, Feifan and Yu, Hong}, year = {2020}, pmcid = {PMC8075501}, pmid = {33936480}, pages = {1041--1049}, }
Clinical judgment studies are an integral part of drug safety surveillance and pharmacovigilance frameworks. They help quantify the causal relationship between medication and its adverse drug reactions (ADRs). To conduct such studies, physicians need to review patients’ charts manually to answer the Naranjo questionnaire. In this paper, we propose a methodology to automatically infer causal relations from patients’ discharge summaries by combining the capabilities of deep learning and statistical learning models. We use Bidirectional Encoder Representations from Transformers (BERT) to extract relevant paragraphs for each Naranjo question and then use a statistical learning model such as logistic regression to predict the Naranjo score and the causal relation between the medication and an ADR. Our methodology achieves a macro-averaged f1-score of 0.50 and weighted f1-score of 0.63.
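The two-stage pipeline described above can be approximated in a few lines: retrieve the note paragraph most relevant to a Naranjo question, then feed a relevance feature to a statistical classifier. For self-containment this sketch substitutes TF-IDF cosine similarity for the paper's BERT retriever, and the paragraphs, question, and training labels are invented.

# Two-stage sketch: paragraph retrieval + logistic regression
# (TF-IDF retrieval substituted for the paper's BERT model).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = ["Rash resolved after vancomycin was discontinued.",
              "Family history notable for hypertension."]
question = "Did the adverse reaction improve when the drug was stopped?"

vec = TfidfVectorizer().fit(paragraphs + [question])
sims = cosine_similarity(vec.transform([question]), vec.transform(paragraphs))
best = sims.max()  # relevance feature for this Naranjo question

# Stage 2: map per-question relevance features to a causality label.
X_train = np.array([[0.9], [0.1], [0.7], [0.2]])  # synthetic features
y_train = np.array([1, 0, 1, 0])                  # 1 = probable ADR
clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[best]]))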
Calibrating Structured Output Predictors for Natural Language Processing.
Jagannatha, A.; and Yu, H.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pages 2078–2092, July 2020.
NIHMSID: NIHMS1661932
Paper doi link bibtex abstract
@inproceedings{jagannatha_calibrating_2020, title = {Calibrating {Structured} {Output} {Predictors} for {Natural} {Language} {Processing}.}, volume = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics}, url = {https://aclanthology.org/2020.acl-main.188}, doi = {10.18653/v1/2020.acl-main.188}, abstract = {We address the problem of calibrating prediction confidence for output entities of interest in natural language processing (NLP) applications. It is important that NLP applications such as named entity recognition and question answering produce calibrated confidence scores for their predictions, especially if the system is to be deployed in a safety-critical domain such as healthcare. However, the output space of such structured prediction models is often too large to adapt binary or multi-class calibration methods directly. In this study, we propose a general calibration scheme for output entities of interest in neural-network based structured prediction models. Our proposed method can be used with any binary class calibration scheme and a neural network model. Additionally, we show that our calibration method can also be used as an uncertainty-aware, entity-specific decoding step to improve the performance of the underlying model at no additional training cost or data requirements. We show that our method outperforms current calibration techniques for named-entity-recognition, part-of-speech and question answering. We also improve our model's performance from our decoding step across several tasks and benchmark datasets. Our method improves the calibration and model performance on out-of-domain test scenarios as well.}, booktitle = {2020 {Annual} {Conference} of the {Association} for {Computational} {Linguistics} ({ACL})}, author = {Jagannatha, Abhyuday and Yu, Hong}, month = jul, year = {2020}, pmcid = {PMC7890517}, pmid = {33612961}, note = {NIHMSID: NIHMS1661932}, pages = {2078--2092}, }
We address the problem of calibrating prediction confidence for output entities of interest in natural language processing (NLP) applications. It is important that NLP applications such as named entity recognition and question answering produce calibrated confidence scores for their predictions, especially if the system is to be deployed in a safety-critical domain such as healthcare. However, the output space of such structured prediction models is often too large to adapt binary or multi-class calibration methods directly. In this study, we propose a general calibration scheme for output entities of interest in neural-network based structured prediction models. Our proposed method can be used with any binary class calibration scheme and a neural network model. Additionally, we show that our calibration method can also be used as an uncertainty-aware, entity-specific decoding step to improve the performance of the underlying model at no additional training cost or data requirements. We show that our method outperforms current calibration techniques for named-entity-recognition, part-of-speech and question answering. We also improve our model's performance from our decoding step across several tasks and benchmark datasets. Our method improves the calibration and model performance on out-of-domain test scenarios as well.
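A minimal sketch of the general recipe described here: treat each predicted entity on held-out data as a binary event (correct or not), then fit any off-the-shelf binary calibrator on the model's raw confidences. The isotonic calibrator and the toy numbers are assumptions.

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Held-out raw confidences for predicted entities, and whether each
# prediction turned out to be correct (toy values).
conf = np.array([0.97, 0.91, 0.85, 0.60, 0.55, 0.30])
correct = np.array([1, 1, 0, 1, 0, 0])

calibrator = IsotonicRegression(out_of_bounds="clip").fit(conf, correct)

# Calibrated probabilities at test time; pruning entities whose calibrated
# confidence falls below a threshold mirrors the paper's uncertainty-aware
# decoding idea.
print(calibrator.predict(np.array([0.9, 0.4])))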
Conversational machine comprehension: a literature review.
Gupta, S.; Rawat, B. P. S.; and Yu, H.
arXiv preprint arXiv:2006.00671, 2739–2753. December 2020.
COLING 2020
Paper doi link bibtex abstract
@article{gupta_conversational_2020, title = {Conversational machine comprehension: a literature review}, shorttitle = {Conversational machine comprehension}, url = {https://aclanthology.org/2020.coling-main.247}, doi = {10.18653/v1/2020.coling-main.247}, abstract = {Conversational Machine Comprehension (CMC), a research track in conversational AI, expects the machine to understand an open-domain natural language text and thereafter engage in a multi-turn conversation to answer questions related to the text. While most of the research in Machine Reading Comprehension (MRC) revolves around single-turn question answering (QA), multi-turn CMC has recently gained prominence, thanks to the advancement in natural language understanding via neural language models such as BERT and the introduction of large-scale conversational datasets such as CoQA and QuAC. The rise in interest has, however, led to a flurry of concurrent publications, each with a different yet structurally similar modeling approach and an inconsistent view of the surrounding literature. With the volume of model submissions to conversational datasets increasing every year, there exists a need to consolidate the scattered knowledge in this domain to streamline future research. This literature review attempts at providing a holistic overview of CMC with an emphasis on the common trends across recently published models, specifically in their approach to tackling conversational history. The review synthesizes a generic framework for CMC models while highlighting the differences in recent approaches and intends to serve as a compendium of CMC for future researchers.}, journal = {arXiv preprint arXiv:2006.00671}, author = {Gupta, Somil and Rawat, Bhanu Pratap Singh and Yu, Hong}, month = dec, year = {2020}, note = {COLING 2020}, pages = {2739--2753}, }
Conversational Machine Comprehension (CMC), a research track in conversational AI, expects the machine to understand an open-domain natural language text and thereafter engage in a multi-turn conversation to answer questions related to the text. While most of the research in Machine Reading Comprehension (MRC) revolves around single-turn question answering (QA), multi-turn CMC has recently gained prominence, thanks to the advancement in natural language understanding via neural language models such as BERT and the introduction of large-scale conversational datasets such as CoQA and QuAC. The rise in interest has, however, led to a flurry of concurrent publications, each with a different yet structurally similar modeling approach and an inconsistent view of the surrounding literature. With the volume of model submissions to conversational datasets increasing every year, there exists a need to consolidate the scattered knowledge in this domain to streamline future research. This literature review attempts at providing a holistic overview of CMC with an emphasis on the common trends across recently published models, specifically in their approach to tackling conversational history. The review synthesizes a generic framework for CMC models while highlighting the differences in recent approaches and intends to serve as a compendium of CMC for future researchers.
Bleeding Entity Recognition in Electronic Health Records: A Comprehensive Analysis of End-to-End Systems.
Mitra, A.; Rawat, B. P. S.; McManus, D.; Kapoor, A.; and Yu, H.
In AMIA Annual Symposium Proceedings, pages 860–869, 2020.
Paper link bibtex abstract
@inproceedings{mitra_bleeding_2020, title = {Bleeding {Entity} {Recognition} in {Electronic} {Health} {Records}: {A} {Comprehensive} {Analysis} of {End}-to-{End} {Systems}}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8075442/}, abstract = {A bleeding event is a common adverse drug reaction amongst patients on anticoagulation and factors critically into a clinician's decision to prescribe or continue anticoagulation for atrial fibrillation. However, bleeding events are not uniformly captured in the administrative data of electronic health records (EHR). As manual review is prohibitively expensive, we investigate the effectiveness of various natural language processing (NLP) methods for automatic extraction of bleeding events. Using our expert-annotated 1,079 de-identified EHR notes, we evaluated state-of-the-art NLP models such as biLSTM-CRF with language modeling, and different BERT variants for six entity types. On our dataset, the biLSTM-CRF surpassed other models resulting in a macro F1-score of 0.75 whereas the performance difference is negligible for sentence and document-level predictions with the best macro F1-scores of 0.84 and 0.96, respectively. Our error analyses suggest that the models' incorrect predictions can be attributed to variability in entity spans, memorization, and missing negation signals.}, booktitle = {{AMIA} {Annu} {Symp} {Proc}}, author = {Mitra, Avijit and Rawat, Bhanu Pratap Singh and McManus, David and Kapoor, Alok and Yu, Hong}, year = {2020}, pmid = {33936461 PMCID: PMC8075442}, pages = {860--869}, }
A bleeding event is a common adverse drug reaction amongst patients on anticoagulation and factors critically into a clinician's decision to prescribe or continue anticoagulation for atrial fibrillation. However, bleeding events are not uniformly captured in the administrative data of electronic health records (EHR). As manual review is prohibitively expensive, we investigate the effectiveness of various natural language processing (NLP) methods for automatic extraction of bleeding events. Using our expert-annotated 1,079 de-identified EHR notes, we evaluated state-of-the-art NLP models such as biLSTM-CRF with language modeling, and different BERT variants for six entity types. On our dataset, the biLSTM-CRF surpassed other models resulting in a macro F1-score of 0.75 whereas the performance difference is negligible for sentence and document-level predictions with the best macro F1-scores of 0.84 and 0.96, respectively. Our error analyses suggest that the models' incorrect predictions can be attributed to variability in entity spans, memorization, and missing negation signals.
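The three evaluation granularities compared above (entity, sentence, document) relate by simple aggregation: token-level output from any NER model can be rolled up. A small sketch under that assumption, with invented tags:

# Roll token-level BIO tags up to the sentence- and document-level labels
# the paper also evaluates.
def mentions_bleeding(tags):
    return any(tag.endswith("BLEEDING") for tag in tags)

note = [["O", "B-BLEEDING", "I-BLEEDING", "O"], ["O", "O", "O"]]
sentence_labels = [mentions_bleeding(tags) for tags in note]   # [True, False]
document_label = any(sentence_labels)                          # True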
Neural Multi-Task Learning for Adverse Drug Reaction Extraction.
Liu, F.; Zheng, X.; Yu, H.; and Tjia, J.
AMIA Annual Symposium Proceedings, 2020: 756–762. 2020.
Paper link bibtex abstract
@article{liu_neural_2020, title = {Neural {Multi}-{Task} {Learning} for {Adverse} {Drug} {Reaction} {Extraction}}, volume = {2020}, issn = {1942-597X}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8075418/pdf/110_3417286.pdf}, abstract = {A reliable and searchable knowledge database of adverse drug reactions (ADRs) is highly important and valuable for improving patient safety at the point of care. In this paper, we proposed a neural multi-task learning system, NeuroADR, to extract ADRs as well as relevant modifiers from free-text drug labels. Specifically, the NeuroADR system exploited a hierarchical multi-task learning (HMTL) framework to perform named entity recognition (NER) and relation extraction (RE) jointly, where interactions among the learned deep encoder representations from different subtasks are explored. Different from the conventional HMTL approach, NeuroADR adopted a novel task decomposition strategy to generate auxiliary subtasks for more inter-task interactions and integrated a new label encoding schema for better handling discontinuous entities. Experimental results demonstrate the effectiveness of the proposed system.}, language = {eng}, journal = {AMIA ... Annual Symposium proceedings. AMIA Symposium}, author = {Liu, Feifan and Zheng, Xiaoyu and Yu, Hong and Tjia, Jennifer}, year = {2020}, pmid = {33936450}, pmcid = {PMC8075418}, keywords = {Data Mining, Databases, Factual, Deep Learning, Drug-Related Side Effects and Adverse Reactions, Humans, Machine Learning}, pages = {756--762}, }
A reliable and searchable knowledge database of adverse drug reactions (ADRs) is highly important and valuable for improving patient safety at the point of care. In this paper, we proposed a neural multi-task learning system, NeuroADR, to extract ADRs as well as relevant modifiers from free-text drug labels. Specifically, the NeuroADR system exploited a hierarchical multi-task learning (HMTL) framework to perform named entity recognition (NER) and relation extraction (RE) jointly, where interactions among the learned deep encoder representations from different subtasks are explored. Different from the conventional HMTL approach, NeuroADR adopted a novel task decomposition strategy to generate auxiliary subtasks for more inter-task interactions and integrated a new label encoding schema for better handling discontinuous entities. Experimental results demonstrate the effectiveness of the proposed system.
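A bare-bones sketch of the joint NER-plus-relation setup that hierarchical multi-task learning builds on: one shared encoder feeding two task heads. The dimensions, the biLSTM encoder, and the pairwise relation scoring are assumptions; the paper's task decomposition and label-encoding schema are not shown.

import torch
import torch.nn as nn

class JointNerRe(nn.Module):
    def __init__(self, vocab=5000, dim=128, n_tags=9, n_rels=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.ner_head = nn.Linear(2 * dim, n_tags)   # per-token tag logits
        self.re_head = nn.Linear(4 * dim, n_rels)    # per entity-pair logits

    def forward(self, tokens, pair):
        h, _ = self.enc(self.emb(tokens))            # shared representation
        i, j = pair                                  # candidate entity positions
        pair_repr = torch.cat([h[:, i], h[:, j]], dim=-1)
        return self.ner_head(h), self.re_head(pair_repr)

ner_logits, re_logits = JointNerRe()(torch.randint(0, 5000, (1, 12)), pair=(2, 7))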
BENTO: A Visual Platform for Building Clinical NLP Pipelines Based on CodaLab.
Jin, Y.; Li, F.; and Yu, H.
In 2020 Annual Conference of the Association for Computational Linguistics (ACL), pages 95–100, July 2020.
NIHMSID: NIHMS1644629
Paper doi link bibtex abstract
@inproceedings{jin_bento_2020, title = {{BENTO}: {A} {Visual} {Platform} for {Building} {Clinical} {NLP} {Pipelines} {Based} on {CodaLab}.}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7679080/}, doi = {10.18653/v1/2020.acl-demos.13}, abstract = {CodaLab is an open-source web-based platform for collaborative computational research. Although CodaLab has gained popularity in the research community, its interface has limited support for creating reusable tools that can be easily applied to new datasets and composed into pipelines. In clinical domain, natural language processing (NLP) on medical notes generally involves multiple steps, like tokenization, named entity recognition, etc. Since these steps require different tools which are usually scattered in different publications, it is not easy for researchers to use them to process their own datasets. In this paper, we present BENTO, a workflow management platform with a graphic user interface (GUI) that is built on top of CodaLab, to facilitate the process of building clinical NLP pipelines. BENTO comes with a number of clinical NLP tools that have been pre-trained using medical notes and expert annotations and can be readily used for various clinical NLP tasks. It also allows researchers and developers to create their custom tools (e.g., pre-trained NLP models) and use them in a controlled and reproducible way. In addition, the GUI interface enables researchers with limited computer background to compose tools into NLP pipelines and then apply the pipelines on their own datasets in a "what you see is what you get" (WYSIWYG) way. Although BENTO is designed for clinical NLP applications, the underlying architecture is flexible to be tailored to any other domains.}, booktitle = {2020 {Annual} {Conference} of the {Association} for {Computational} {Linguistics} ({ACL})}, author = {Jin, Yonghao and Li, Fei and Yu, Hong}, month = jul, year = {2020}, pmcid = {PMC7679080}, pmid = {33223604}, note = {NIHMSID: NIHMS1644629}, pages = {95--100}, }
CodaLab is an open-source web-based platform for collaborative computational research. Although CodaLab has gained popularity in the research community, its interface has limited support for creating reusable tools that can be easily applied to new datasets and composed into pipelines. In clinical domain, natural language processing (NLP) on medical notes generally involves multiple steps, like tokenization, named entity recognition, etc. Since these steps require different tools which are usually scattered in different publications, it is not easy for researchers to use them to process their own datasets. In this paper, we present BENTO, a workflow management platform with a graphic user interface (GUI) that is built on top of CodaLab, to facilitate the process of building clinical NLP pipelines. BENTO comes with a number of clinical NLP tools that have been pre-trained using medical notes and expert annotations and can be readily used for various clinical NLP tasks. It also allows researchers and developers to create their custom tools (e.g., pre-trained NLP models) and use them in a controlled and reproducible way. In addition, the GUI interface enables researchers with limited computer background to compose tools into NLP pipelines and then apply the pipelines on their own datasets in a "what you see is what you get" (WYSIWYG) way. Although BENTO is designed for clinical NLP applications, the underlying architecture is flexible to be tailored to any other domains.
ICD Coding from Clinical Text Using Multi-Filter Residual Convolutional Neural Network.
Li, F.; and Yu, H.
In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), pages 8180–8187, New York City, New York, February 2020.
doi link bibtex
@inproceedings{li_icd_2020, address = {New York City, New York}, title = {{ICD} {Coding} from {Clinical} {Text} {Using} {Multi}-{Filter} {Residual} {Convolutional} {Neural} {Network}.}, shorttitle = {{AAAI} 2020}, doi = {10.1609/AAAI.V34I05.6331}, booktitle = {The {Thirty}-{Fourth} {AAAI} {Conference} on {Artificial} {Intelligence} ({AAAI}-20)}, author = {Li, Fei and Yu, Hong}, month = feb, year = {2020}, keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning}, pages = {8180--8187}, }
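The architecture named in this title is easy to picture as parallel convolutions of different widths plus a residual connection. A sketch under assumed dimensions (not the published hyperparameters), with max-pooling standing in for the paper's label-wise attention:

import torch
import torch.nn as nn

class MultiFilterResCNN(nn.Module):
    def __init__(self, dim=100, widths=(3, 5, 9), n_codes=50):
        super().__init__()
        # One convolution per filter width captures patterns of different lengths.
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, w, padding=w // 2) for w in widths]
        )
        self.proj = nn.Linear(len(widths) * dim, dim)
        self.out = nn.Linear(dim, n_codes)            # one logit per ICD code

    def forward(self, x):                             # x: (batch, seq_len, dim)
        h = x.transpose(1, 2)                         # Conv1d wants channels first
        feats = torch.cat([torch.relu(c(h)) for c in self.convs], dim=1)
        feats = self.proj(feats.transpose(1, 2)) + x  # residual connection
        return self.out(feats.max(dim=1).values)      # pool over tokens

logits = MultiFilterResCNN()(torch.randn(2, 200, 100))   # shape (2, 50)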
Improved Pretraining for Domain-specific Contextual Embedding Models.
Rongali, S.; Jagannatha, A.; Rawat, B. P. S.; and Yu, H.
CoRR, abs/2004.02288. 2020.
arXiv: 2004.02288
Paper link bibtex
@article{rongali_improved_2020, title = {Improved {Pretraining} for {Domain}-specific {Contextual} {Embedding} {Models}}, volume = {abs/2004.02288}, url = {https://arxiv.org/abs/2004.02288}, journal = {CoRR}, author = {Rongali, Subendhu and Jagannatha, Abhyuday and Rawat, Bhanu Pratap Singh and Yu, Hong}, year = {2020}, note = {arXiv: 2004.02288}, }
Neural data-to-text generation with dynamic content planning.
Chen, K.; Li, F.; Hu, B.; Peng, W.; Chen, Q.; Yu, H.; and Xiang, Y.
Knowledge-Based Systems, 106610. November 2020.
Paper doi link bibtex abstract
@article{chen_neural_2020, title = {Neural data-to-text generation with dynamic content planning}, issn = {0950-7051}, url = {http://www.sciencedirect.com/science/article/pii/S0950705120307395}, doi = {10.1016/j.knosys.2020.106610}, abstract = {Neural data-to-text generation models have achieved significant advancement in recent years. However, these models have two shortcomings: the generated texts tend to miss some vital information, and they often generate descriptions that are not consistent with the structured input data. To alleviate these problems, we propose a Neural data-to-text generation model with Dynamic content Planning, named NDP 2 2This work was completed in cooperation with Baidu Inc.for abbreviation. The NDP can utilize the previously generated text to dynamically select the appropriate entry from the given structured data. We further design a reconstruction mechanism with a novel objective function that can reconstruct the whole entry of the used data sequentially from the hidden states of the decoder, which aids the accuracy of the generated text. Empirical results show that the NDP achieves superior performance over the state-of-the-art on ROTOWIRE and NBAZHN datasets, in terms of relation generation (RG), content selection (CS), content ordering (CO) and BLEU metrics. The human evaluation result shows that the texts generated by the proposed NDP are better than the corresponding ones generated by NCP in most of time. And using the proposed reconstruction mechanism, the fidelity of the generated text can be further improved significantly.}, language = {en}, urldate = {2020-12-29}, journal = {Knowledge-Based Systems}, author = {Chen, Kai and Li, Fayuan and Hu, Baotian and Peng, Weihua and Chen, Qingcai and Yu, Hong and Xiang, Yang}, month = nov, year = {2020}, keywords = {Data-to-text, Dynamic content planning, Reconstruction mechanism}, pages = {106610}, }
Neural data-to-text generation models have achieved significant advancement in recent years. However, these models have two shortcomings: the generated texts tend to miss some vital information, and they often generate descriptions that are not consistent with the structured input data. To alleviate these problems, we propose a Neural data-to-text generation model with Dynamic content Planning, named NDP for abbreviation (this work was completed in cooperation with Baidu Inc.). The NDP can utilize the previously generated text to dynamically select the appropriate entry from the given structured data. We further design a reconstruction mechanism with a novel objective function that can reconstruct the whole entry of the used data sequentially from the hidden states of the decoder, which aids the accuracy of the generated text. Empirical results show that the NDP achieves superior performance over the state-of-the-art on ROTOWIRE and NBAZHN datasets, in terms of relation generation (RG), content selection (CS), content ordering (CO) and BLEU metrics. The human evaluation result shows that the texts generated by the proposed NDP are better than the corresponding ones generated by NCP in most of time. And using the proposed reconstruction mechanism, the fidelity of the generated text can be further improved significantly.
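The dynamic selection step described above can be pictured as scoring every table entry against the current decoder state at each generation step. A toy sketch; the scorer, the sizes, and the soft selection are assumptions, and the paper's planner and reconstruction loss are not shown.

import torch
import torch.nn as nn

dim = 64
entries = torch.randn(10, dim)      # encoded entries of a structured table
dec_state = torch.randn(1, dim)     # decoder state after the text generated so far
score = nn.Linear(2 * dim, 1)       # learns which entry fits the next step

pairs = torch.cat([entries, dec_state.expand(10, dim)], dim=-1)
weights = torch.softmax(score(pairs).squeeze(-1), dim=0)
selected = weights @ entries        # soft selection of the next entry to verbalize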
Generating Medical Assessments Using a Neural Network Model: Algorithm Development and Validation.
Hu, B.; Bajracharya, A.; and Yu, H.
JMIR Medical Informatics, 8(1): e14971. 2020.
Publisher: JMIR Publications Inc., Toronto, Canada
Paper doi link bibtex abstract
@article{hu_generating_2020, title = {Generating {Medical} {Assessments} {Using} a {Neural} {Network} {Model}: {Algorithm} {Development} and {Validation}}, volume = {8}, copyright = {Unless stated otherwise, all articles are open-access distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work (}, shorttitle = {Generating {Medical} {Assessments} {Using} a {Neural} {Network} {Model}}, url = {https://medinform.jmir.org/2020/1/e14971/}, doi = {10.2196/14971}, abstract = {Background: Since its inception, artificial intelligence has aimed to use computers to help make clinical diagnoses. Evidence-based medical reasoning is important for patient care. Inferring clinical diagnoses is a crucial step during the patient encounter. Previous works mainly used expert systems or machine learning–based methods to predict the International Classification of Diseases - Clinical Modification codes based on electronic health records. We report an alternative approach: inference of clinical diagnoses from patients’ reported symptoms and physicians’ clinical observations. Objective: We aimed to report a natural language processing system for generating medical assessments based on patient information described in the electronic health record (EHR) notes. Methods: We processed EHR notes into the Subjective, Objective, Assessment, and Plan sections. We trained a neural network model for medical assessment generation (N2MAG). Our N2MAG is an innovative deep neural model that uses the Subjective and Objective sections of an EHR note to automatically generate an “expert-like” assessment of the patient. N2MAG can be trained in an end-to-end fashion and does not require feature engineering and external knowledge resources. Results: We evaluated N2MAG and the baseline models both quantitatively and qualitatively. Evaluated by both the Recall-Oriented Understudy for Gisting Evaluation metrics and domain experts, our results show that N2MAG outperformed the existing state-of-the-art baseline models. Conclusions: N2MAG could generate a medical assessment from the Subject and Objective section descriptions in EHR notes. Future work will assess its potential for providing clinical decision support. [JMIR Med Inform 2020;8(1):e14971]}, language = {en}, number = {1}, urldate = {2020-04-07}, journal = {JMIR Medical Informatics}, author = {Hu, Baotian and Bajracharya, Adarsha and Yu, Hong}, year = {2020}, pmid = {31939742 PMCID: PMC7006435}, note = {Company: JMIR Medical Informatics Distributor: JMIR Medical Informatics Institution: JMIR Medical Informatics Label: JMIR Medical Informatics Publisher: JMIR Publications Inc., Toronto, Canada}, pages = {e14971}, }
Background: Since its inception, artificial intelligence has aimed to use computers to help make clinical diagnoses. Evidence-based medical reasoning is important for patient care. Inferring clinical diagnoses is a crucial step during the patient encounter. Previous works mainly used expert systems or machine learning–based methods to predict the International Classification of Diseases - Clinical Modification codes based on electronic health records. We report an alternative approach: inference of clinical diagnoses from patients’ reported symptoms and physicians’ clinical observations. Objective: We aimed to report a natural language processing system for generating medical assessments based on patient information described in the electronic health record (EHR) notes. Methods: We processed EHR notes into the Subjective, Objective, Assessment, and Plan sections. We trained a neural network model for medical assessment generation (N2MAG). Our N2MAG is an innovative deep neural model that uses the Subjective and Objective sections of an EHR note to automatically generate an “expert-like” assessment of the patient. N2MAG can be trained in an end-to-end fashion and does not require feature engineering and external knowledge resources. Results: We evaluated N2MAG and the baseline models both quantitatively and qualitatively. Evaluated by both the Recall-Oriented Understudy for Gisting Evaluation metrics and domain experts, our results show that N2MAG outperformed the existing state-of-the-art baseline models. Conclusions: N2MAG could generate a medical assessment from the Subject and Objective section descriptions in EHR notes. Future work will assess its potential for providing clinical decision support. [JMIR Med Inform 2020;8(1):e14971]
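N2MAG itself is an end-to-end model trained on EHR notes; as a rough stand-in for the Subjective/Objective-to-Assessment mapping it learns, any pretrained encoder-decoder can be prompted the same way. The checkpoint and note layout below are assumptions, not the paper's system.

from transformers import pipeline

generator = pipeline("summarization", model="t5-small")
note = ("Subjective: patient reports increased thirst and fatigue. "
        "Objective: BP 150/95, HbA1c 9.2, BMI 31.")
print(generator(note, max_length=60)[0]["summary_text"])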
Dynamic Data Selection for Curriculum Learning via Ability Estimation.
Lalor, J. P.; and Yu, H.
In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 545–555, Online, November 2020. Association for Computational Linguistics
Paper link bibtex abstract
@inproceedings{lalor_dynamic_2020, address = {Online}, title = {Dynamic {Data} {Selection} for {Curriculum} {Learning} via {Ability} {Estimation}}, url = {https://www.aclweb.org/anthology/2020.findings-emnlp.48}, abstract = {Curriculum learning methods typically rely on heuristics to estimate the difficulty of training examples or the ability of the model. In this work, we propose replacing difficulty heuristics with learned difficulty parameters. We also propose Dynamic Data selection for Curriculum Learning via Ability Estimation (DDaCLAE), a strategy that probes model ability at each training epoch to select the best training examples at that point. We show that models using learned difficulty and/or ability outperform heuristic-based curriculum learning models on the GLUE classification tasks.}, urldate = {2020-11-29}, booktitle = {Findings of the {Association} for {Computational} {Linguistics}: {EMNLP} 2020}, publisher = {Association for Computational Linguistics}, author = {Lalor, John P. and Yu, Hong}, month = nov, year = {2020}, pmid = {33381774 PMCID: PMC7771727}, pages = {545--555}, }
Curriculum learning methods typically rely on heuristics to estimate the difficulty of training examples or the ability of the model. In this work, we propose replacing difficulty heuristics with learned difficulty parameters. We also propose Dynamic Data selection for Curriculum Learning via Ability Estimation (DDaCLAE), a strategy that probes model ability at each training epoch to select the best training examples at that point. We show that models using learned difficulty and/or ability outperform heuristic-based curriculum learning models on the GLUE classification tasks.
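The selection rule at the heart of DDaCLAE is compact: keep a training example at the current epoch only if its latent difficulty does not exceed the model's estimated ability. A sketch with invented values; in the paper both quantities come from an item response theory model.

import numpy as np

def select(difficulty, ability):
    # Train on examples no harder than the model's current ability.
    return np.where(difficulty <= ability)[0]

difficulty = np.array([-1.2, -0.3, 0.4, 1.8, 2.5])   # fit offline via IRT
for ability in (-0.5, 0.5, 2.0):                     # probed each epoch
    print(f"ability={ability}: train on examples {select(difficulty, ability)}")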
Neural Data-to-Text Generation with Dynamic Content Planning.
Chen, K.; Li, F.; Hu, B.; Peng, W.; Chen, Q.; and Yu, H.
arXiv:2004.07426 [cs]. April 2020.
arXiv: 2004.07426
Paper link bibtex abstract
@article{chen_neural_2020-1, title = {Neural {Data}-to-{Text} {Generation} with {Dynamic} {Content} {Planning}}, url = {http://arxiv.org/abs/2004.07426}, abstract = {Neural data-to-text generation models have achieved significant advancement in recent years. However, these models have two shortcomings: the generated texts tend to miss some vital information, and they often generate descriptions that are not consistent with the structured input data. To alleviate these problems, we propose a Neural data-to-text generation model with Dynamic content Planning, named NDP for abbreviation. The NDP can utilize the previously generated text to dynamically select the appropriate entry from the given structured data. We further design a reconstruction mechanism with a novel objective function that can reconstruct the whole entry of the used data sequentially from the hidden states of the decoder, which aids the accuracy of the generated text. Empirical results show that the NDP achieves superior performance over the state-of-the-art on ROTOWIRE dataset, in terms of relation generation (RG), content selection (CS), content ordering (CO) and BLEU metrics. The human evaluation result shows that the texts generated by the proposed NDP are better than the corresponding ones generated by NCP in most of time. And using the proposed reconstruction mechanism, the fidelity of the generated text can be further improved significantly.}, urldate = {2020-12-29}, journal = {arXiv:2004.07426 [cs]}, author = {Chen, Kai and Li, Fayuan and Hu, Baotian and Peng, Weihua and Chen, Qingcai and Yu, Hong}, month = apr, year = {2020}, note = {arXiv: 2004.07426}, keywords = {Computer Science - Computation and Language}, }
Neural data-to-text generation models have achieved significant advancement in recent years. However, these models have two shortcomings: the generated texts tend to miss some vital information, and they often generate descriptions that are not consistent with the structured input data. To alleviate these problems, we propose a Neural data-to-text generation model with Dynamic content Planning, named NDP for abbreviation. The NDP can utilize the previously generated text to dynamically select the appropriate entry from the given structured data. We further design a reconstruction mechanism with a novel objective function that can reconstruct the whole entry of the used data sequentially from the hidden states of the decoder, which aids the accuracy of the generated text. Empirical results show that the NDP achieves superior performance over the state-of-the-art on ROTOWIRE dataset, in terms of relation generation (RG), content selection (CS), content ordering (CO) and BLEU metrics. The human evaluation result shows that the texts generated by the proposed NDP are better than the corresponding ones generated by NCP in most of time. And using the proposed reconstruction mechanism, the fidelity of the generated text can be further improved significantly.
BENTO: A Visual Platform for Building Clinical NLP Pipelines Based on CodaLab.
Jin, Y.; Li, F.; and Yu, H.
In AMIA Fall Symposium, 2020.
link bibtex
@inproceedings{jin_bento_2020, title = {{BENTO}: {A} {Visual} {Platform} for {Building} {Clinical} {NLP} {Pipelines} {Based} on {CodaLab}.}, booktitle = {{AMIA} {Fall} {Symposium}}, author = {Jin, Y and Li, F and Yu, H}, year = {2020}, }
Learning Latent Space Representations to Predict Patient Outcomes: Model Development and Validation.
Rongali, S.; Rose, A. J.; McManus, D. D.; Bajracharya, A. S.; Kapoor, A.; Granillo, E.; and Yu, H.
Journal of Medical Internet Research, 22(3): e16374. 2020.
Publisher: JMIR Publications Inc., Toronto, Canada
Paper doi link bibtex abstract
@article{rongali_learning_2020, title = {Learning {Latent} {Space} {Representations} to {Predict} {Patient} {Outcomes}: {Model} {Development} and {Validation}}, volume = {22}, shorttitle = {Learning {Latent} {Space} {Representations} to {Predict} {Patient} {Outcomes}}, url = {https://www.jmir.org/2020/3/e16374/}, doi = {10.2196/16374}, abstract = {Background: Scalable and accurate health outcome prediction using electronic health record (EHR) data has gained much attention in research recently. Previous machine learning models mostly ignore relations between different types of clinical data (ie, laboratory components, International Classification of Diseases codes, and medications). Objective: This study aimed to model such relations and build predictive models using the EHR data from intensive care units. We developed innovative neural network models and compared them with the widely used logistic regression model and other state-of-the-art neural network models to predict the patient’s mortality using their longitudinal EHR data. Methods: We built a set of neural network models that we collectively called as long short-term memory (LSTM) outcome prediction using comprehensive feature relations or in short, CLOUT. Our CLOUT models use a correlational neural network model to identify a latent space representation between different types of discrete clinical features during a patient’s encounter and integrate the latent representation into an LSTM-based predictive model framework. In addition, we designed an ablation experiment to identify risk factors from our CLOUT models. Using physicians’ input as the gold standard, we compared the risk factors identified by both CLOUT and logistic regression models. Results: Experiments on the Medical Information Mart for Intensive Care-III dataset (selected patient population: 7537) show that CLOUT (area under the receiver operating characteristic curve=0.89) has surpassed logistic regression (0.82) and other baseline NN models (\<0.86). In addition, physicians’ agreement with the CLOUT-derived risk factor rankings was statistically significantly higher than the agreement with the logistic regression model. Conclusions: Our results support the applicability of CLOUT for real-world clinical use in identifying patients at high risk of mortality. Trial Registration: [J Med Internet Res 2020;22(3):e16374]}, language = {en}, number = {3}, urldate = {2020-04-07}, journal = {Journal of Medical Internet Research}, author = {Rongali, Subendhu and Rose, Adam J. and McManus, David D. and Bajracharya, Adarsha S. and Kapoor, Alok and Granillo, Edgard and Yu, Hong}, year = {2020}, pmid = {32202503 PMCID: PMC7136840}, note = {Company: Journal of Medical Internet Research Distributor: Journal of Medical Internet Research Institution: Journal of Medical Internet Research Label: Journal of Medical Internet Research Publisher: JMIR Publications Inc., Toronto, Canada}, pages = {e16374}, }
Background: Scalable and accurate health outcome prediction using electronic health record (EHR) data has gained much attention in research recently. Previous machine learning models mostly ignore relations between different types of clinical data (ie, laboratory components, International Classification of Diseases codes, and medications). Objective: This study aimed to model such relations and build predictive models using the EHR data from intensive care units. We developed innovative neural network models and compared them with the widely used logistic regression model and other state-of-the-art neural network models to predict the patient’s mortality using their longitudinal EHR data. Methods: We built a set of neural network models that we collectively called as long short-term memory (LSTM) outcome prediction using comprehensive feature relations or in short, CLOUT. Our CLOUT models use a correlational neural network model to identify a latent space representation between different types of discrete clinical features during a patient’s encounter and integrate the latent representation into an LSTM-based predictive model framework. In addition, we designed an ablation experiment to identify risk factors from our CLOUT models. Using physicians’ input as the gold standard, we compared the risk factors identified by both CLOUT and logistic regression models. Results: Experiments on the Medical Information Mart for Intensive Care-III dataset (selected patient population: 7537) show that CLOUT (area under the receiver operating characteristic curve=0.89) has surpassed logistic regression (0.82) and other baseline NN models (<0.86). In addition, physicians’ agreement with the CLOUT-derived risk factor rankings was statistically significantly higher than the agreement with the logistic regression model. Conclusions: Our results support the applicability of CLOUT for real-world clinical use in identifying patients at high risk of mortality. Trial Registration: [J Med Internet Res 2020;22(3):e16374]
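The shape of CLOUT described here (different clinical feature types projected into one latent space, then a recurrent mortality head) can be sketched as follows. The feature counts and the simple additive fusion are assumptions; the published model learns the shared representation with a correlational neural network.

import torch
import torch.nn as nn

class CloutSketch(nn.Module):
    def __init__(self, n_labs=200, n_meds=150, dim=64):
        super().__init__()
        self.lab_proj = nn.Linear(n_labs, dim)   # project each feature type
        self.med_proj = nn.Linear(n_meds, dim)   # into a shared latent space
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, labs, meds):               # (batch, time, features)
        z = self.lab_proj(labs) + self.med_proj(meds)
        h, _ = self.lstm(z)
        return torch.sigmoid(self.head(h[:, -1]))  # mortality risk

risk = CloutSketch()(torch.randn(4, 24, 200), torch.randn(4, 24, 150))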
2019
(11)
Improving electronic health record note comprehension with NoteAid: randomized trial of electronic health record note comprehension interventions with crowdsourced workers.
Lalor, J. P.; Woolf, B.; and Yu, H.
Journal of Medical Internet Research, 21(1): e10793. 2019.
Paper doi link bibtex abstract
@article{lalor_improving_2019, title = {Improving electronic health record note comprehension with {NoteAid}: randomized trial of electronic health record note comprehension interventions with crowdsourced workers}, volume = {21}, copyright = {Unless stated otherwise, all articles are open-access distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work (}, shorttitle = {Improving electronic health record note comprehension with noteaid}, url = {https://www.jmir.org/2019/1/e10793/}, doi = {10.2196/jmir.10793}, abstract = {Background: Patient portals are becoming more common, and with them, the ability of patients to access their personal electronic health records (EHRs). EHRs, in particular the free-text EHR notes, often contain medical jargon and terms that are difficult for laypersons to understand. There are many Web-based resources for learning more about particular diseases or conditions, including systems that directly link to lay definitions or educational materials for medical concepts. Objective: Our goal is to determine whether use of one such tool, NoteAid, leads to higher EHR note comprehension ability. We use a new EHR note comprehension assessment tool instead of patient self-reported scores. Methods: In this work, we compare a passive, self-service educational resource (MedlinePlus) with an active resource (NoteAid) where definitions are provided to the user for medical concepts that the system identifies. We use Amazon Mechanical Turk (AMT) to recruit individuals to complete ComprehENotes, a new test of EHR note comprehension. Results: Mean scores for individuals with access to NoteAid are significantly higher than the mean baseline scores, both for raw scores (P=.008) and estimated ability (P=.02). Conclusions: In our experiments, we show that the active intervention leads to significantly higher scores on the comprehension test as compared with a baseline group with no resources provided. In contrast, there is no significant difference between the group that was provided with the passive intervention and the baseline group. Finally, we analyze the demographics of the individuals who participated in our AMT task and show differences between groups that align with the current understanding of health literacy between populations. This is the first work to show improvements in comprehension using tools such as NoteAid as measured by an EHR note comprehension assessment tool as opposed to patient self-reported scores. [J Med Internet Res 2019;21(1):e10793]}, language = {en}, number = {1}, urldate = {2019-01-31}, journal = {Journal of Medical Internet Research}, author = {Lalor, John P. and Woolf, Beverly and Yu, Hong}, year = {2019}, pmid = {30664453 PMCID: 6351990}, pages = {e10793}, }
Background: Patient portals are becoming more common, and with them, the ability of patients to access their personal electronic health records (EHRs). EHRs, in particular the free-text EHR notes, often contain medical jargon and terms that are difficult for laypersons to understand. There are many Web-based resources for learning more about particular diseases or conditions, including systems that directly link to lay definitions or educational materials for medical concepts. Objective: Our goal is to determine whether use of one such tool, NoteAid, leads to higher EHR note comprehension ability. We use a new EHR note comprehension assessment tool instead of patient self-reported scores. Methods: In this work, we compare a passive, self-service educational resource (MedlinePlus) with an active resource (NoteAid) where definitions are provided to the user for medical concepts that the system identifies. We use Amazon Mechanical Turk (AMT) to recruit individuals to complete ComprehENotes, a new test of EHR note comprehension. Results: Mean scores for individuals with access to NoteAid are significantly higher than the mean baseline scores, both for raw scores (P=.008) and estimated ability (P=.02). Conclusions: In our experiments, we show that the active intervention leads to significantly higher scores on the comprehension test as compared with a baseline group with no resources provided. In contrast, there is no significant difference between the group that was provided with the passive intervention and the baseline group. Finally, we analyze the demographics of the individuals who participated in our AMT task and show differences between groups that align with the current understanding of health literacy between populations. This is the first work to show improvements in comprehension using tools such as NoteAid as measured by an EHR note comprehension assessment tool as opposed to patient self-reported scores. [J Med Internet Res 2019;21(1):e10793]
Fine-Tuning Bidirectional Encoder Representations From Transformers (BERT)–Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study.
Li, F.; Jin, Y.; Liu, W.; Rawat, B. P. S.; Cai, P.; and Yu, H.
JMIR Medical Informatics, 7(3): e14830. September 2019.
Paper doi link bibtex
@article{li_fine-tuning_2019, title = {Fine-{Tuning} {Bidirectional} {Encoder} {Representations} {From} {Transformers} ({BERT})–{Based} {Models} on {Large}-{Scale} {Electronic} {Health} {Record} {Notes}: {An} {Empirical} {Study}}, volume = {7}, issn = {2291-9694}, shorttitle = {Fine-{Tuning} {Bidirectional} {Encoder} {Representations} {From} {Transformers} ({BERT})–{Based} {Models} on {Large}-{Scale} {Electronic} {Health} {Record} {Notes}}, url = {http://medinform.jmir.org/2019/3/e14830/}, doi = {10.2196/14830}, language = {en}, number = {3}, urldate = {2019-10-07}, journal = {JMIR Medical Informatics}, author = {Li, Fei and Jin, Yonghao and Liu, Weisong and Rawat, Bhanu Pratap Singh and Cai, Pengshan and Yu, Hong}, month = sep, year = {2019}, pmid = {31516126 PMCID: PMC6746103}, pages = {e14830}, }
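The study compares BERT variants fine-tuned on EHR notes; the generic fine-tuning recipe it builds on looks like this with current libraries. The checkpoint, the example task, and the label count are placeholders.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
batch = tok(["Patient denies chest pain."], return_tensors="pt")
print(model(**batch).logits)   # fine-tune with any standard training loop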
Detecting Hypoglycemia Incidents Reported in Patients’ Secure Messages: Using Cost-Sensitive Learning and Oversampling to Reduce Data Imbalance.
Chen, J.; Lalor, J.; Liu, W.; Druhl, E.; Granillo, E.; Vimalananda, V. G.; and Yu, H.
Journal of Medical Internet Research, 21(3). March 2019.
Paper doi link bibtex abstract
@article{chen_detecting_2019, title = {Detecting {Hypoglycemia} {Incidents} {Reported} in {Patients}’ {Secure} {Messages}: {Using} {Cost}-{Sensitive} {Learning} and {Oversampling} to {Reduce} {Data} {Imbalance}}, volume = {21}, issn = {1439-4456}, shorttitle = {Detecting {Hypoglycemia} {Incidents} {Reported} in {Patients}’ {Secure} {Messages}}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6431826/}, doi = {10.2196/11990}, abstract = {Background Improper dosing of medications such as insulin can cause hypoglycemic episodes, which may lead to severe morbidity or even death. Although secure messaging was designed for exchanging nonurgent messages, patients sometimes report hypoglycemia events through secure messaging. Detecting these patient-reported adverse events may help alert clinical teams and enable early corrective actions to improve patient safety. Objective We aimed to develop a natural language processing system, called HypoDetect (Hypoglycemia Detector), to automatically identify hypoglycemia incidents reported in patients’ secure messages. Methods An expert in public health annotated 3000 secure message threads between patients with diabetes and US Department of Veterans Affairs clinical teams as containing patient-reported hypoglycemia incidents or not. A physician independently annotated 100 threads randomly selected from this dataset to determine interannotator agreement. We used this dataset to develop and evaluate HypoDetect. HypoDetect incorporates 3 machine learning algorithms widely used for text classification: linear support vector machines, random forest, and logistic regression. We explored different learning features, including new knowledge-driven features. Because only 114 (3.80\%) messages were annotated as positive, we investigated cost-sensitive learning and oversampling methods to mitigate the challenge of imbalanced data. Results The interannotator agreement was Cohen kappa=.976. Using cross-validation, logistic regression with cost-sensitive learning achieved the best performance (area under the receiver operating characteristic curve=0.954, sensitivity=0.693, specificity 0.974, F1 score=0.590). Cost-sensitive learning and the ensembled synthetic minority oversampling technique improved the sensitivity of the baseline systems substantially (by 0.123 to 0.728 absolute gains). Our results show that a variety of features contributed to the best performance of HypoDetect. Conclusions Despite the challenge of data imbalance, HypoDetect achieved promising results for the task of detecting hypoglycemia incidents from secure messages. The system has a great potential to facilitate early detection and treatment of hypoglycemia.}, number = {3}, urldate = {2019-12-29}, journal = {Journal of Medical Internet Research}, author = {Chen, Jinying and Lalor, John and Liu, Weisong and Druhl, Emily and Granillo, Edgard and Vimalananda, Varsha G and Yu, Hong}, month = mar, year = {2019}, pmid = {30855231 PMCID: PMC6431826}, }
Background Improper dosing of medications such as insulin can cause hypoglycemic episodes, which may lead to severe morbidity or even death. Although secure messaging was designed for exchanging nonurgent messages, patients sometimes report hypoglycemia events through secure messaging. Detecting these patient-reported adverse events may help alert clinical teams and enable early corrective actions to improve patient safety. Objective We aimed to develop a natural language processing system, called HypoDetect (Hypoglycemia Detector), to automatically identify hypoglycemia incidents reported in patients’ secure messages. Methods An expert in public health annotated 3000 secure message threads between patients with diabetes and US Department of Veterans Affairs clinical teams as containing patient-reported hypoglycemia incidents or not. A physician independently annotated 100 threads randomly selected from this dataset to determine interannotator agreement. We used this dataset to develop and evaluate HypoDetect. HypoDetect incorporates 3 machine learning algorithms widely used for text classification: linear support vector machines, random forest, and logistic regression. We explored different learning features, including new knowledge-driven features. Because only 114 (3.80%) messages were annotated as positive, we investigated cost-sensitive learning and oversampling methods to mitigate the challenge of imbalanced data. Results The interannotator agreement was Cohen kappa=.976. Using cross-validation, logistic regression with cost-sensitive learning achieved the best performance (area under the receiver operating characteristic curve=0.954, sensitivity=0.693, specificity 0.974, F1 score=0.590). Cost-sensitive learning and the ensembled synthetic minority oversampling technique improved the sensitivity of the baseline systems substantially (by 0.123 to 0.728 absolute gains). Our results show that a variety of features contributed to the best performance of HypoDetect. Conclusions Despite the challenge of data imbalance, HypoDetect achieved promising results for the task of detecting hypoglycemia incidents from secure messages. The system has a great potential to facilitate early detection and treatment of hypoglycemia.
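Cost-sensitive learning as used here amounts to weighting the rare positive class more heavily in the classifier's loss; oversampling instead replicates or synthesizes minority examples before training. A sketch of the first option with invented data and weights:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))               # stand-in message features
y = (rng.random(1000) < 0.04).astype(int)    # ~4% positives, as in the paper's data

# Upweighting the positive class trades specificity for sensitivity.
clf = LogisticRegression(class_weight={0: 1.0, 1: 25.0}, max_iter=1000)
clf.fit(X, y)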
Automatic Detection of Hypoglycemic Events From the Electronic Health Record Notes of Diabetes Patients: Empirical Study.
Jin, Y.; Li, F.; Vimalananda, V. G.; and Yu, H.
JMIR Medical Informatics, 7(4): e14340. 2019.
Paper doi link bibtex abstract
@article{jin_automatic_2019, title = {Automatic {Detection} of {Hypoglycemic} {Events} {From} the {Electronic} {Health} {Record} {Notes} of {Diabetes} {Patients}: {Empirical} {Study}}, volume = {7}, copyright = {Unless stated otherwise, all articles are open-access distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work (}, shorttitle = {Automatic {Detection} of {Hypoglycemic} {Events} {From} the {Electronic} {Health} {Record} {Notes} of {Diabetes} {Patients}}, url = {https://medinform.jmir.org/2019/4/e14340/}, doi = {10.2196/14340}, abstract = {Background: Hypoglycemic events are common and potentially dangerous conditions among patients being treated for diabetes. Automatic detection of such events could improve patient care and is valuable in population studies. Electronic health records (EHRs) are valuable resources for the detection of such events. Objective: In this study, we aim to develop a deep-learning–based natural language processing (NLP) system to automatically detect hypoglycemic events from EHR notes. Our model is called the High-Performing System for Automatically Detecting Hypoglycemic Events (HYPE). Methods: Domain experts reviewed 500 EHR notes of diabetes patients to determine whether each sentence contained a hypoglycemic event or not. We used this annotated corpus to train and evaluate HYPE, the high-performance NLP system for hypoglycemia detection. We built and evaluated both a classical machine learning model (ie, support vector machines [SVMs]) and state-of-the-art neural network models. Results: We found that neural network models outperformed the SVM model. The convolutional neural network (CNN) model yielded the highest performance in a 10-fold cross-validation setting: mean precision=0.96 (SD 0.03), mean recall=0.86 (SD 0.03), and mean F1=0.91 (SD 0.03). Conclusions: Despite the challenges posed by small and highly imbalanced data, our CNN-based HYPE system still achieved a high performance for hypoglycemia detection. HYPE can be used for EHR-based hypoglycemia surveillance and population studies in diabetes patients. [JMIR Med Inform 2019;7(4):e14340]}, language = {en}, number = {4}, urldate = {2019-11-10}, journal = {JMIR Medical Informatics}, author = {Jin, Yonghao and Li, Fei and Vimalananda, Varsha G. and Yu, Hong}, year = {2019}, pmid = {31702562 PMCID: PMC6913754}, keywords = {adverse events, convolutional neural networks, hypoglycemia, natural language processing}, pages = {e14340}, }
Background: Hypoglycemic events are common and potentially dangerous conditions among patients being treated for diabetes. Automatic detection of such events could improve patient care and is valuable in population studies. Electronic health records (EHRs) are valuable resources for the detection of such events. Objective: In this study, we aim to develop a deep-learning–based natural language processing (NLP) system to automatically detect hypoglycemic events from EHR notes. Our model is called the High-Performing System for Automatically Detecting Hypoglycemic Events (HYPE). Methods: Domain experts reviewed 500 EHR notes of diabetes patients to determine whether each sentence contained a hypoglycemic event or not. We used this annotated corpus to train and evaluate HYPE, the high-performance NLP system for hypoglycemia detection. We built and evaluated both a classical machine learning model (ie, support vector machines [SVMs]) and state-of-the-art neural network models. Results: We found that neural network models outperformed the SVM model. The convolutional neural network (CNN) model yielded the highest performance in a 10-fold cross-validation setting: mean precision=0.96 (SD 0.03), mean recall=0.86 (SD 0.03), and mean F1=0.91 (SD 0.03). Conclusions: Despite the challenges posed by small and highly imbalanced data, our CNN-based HYPE system still achieved a high performance for hypoglycemia detection. HYPE can be used for EHR-based hypoglycemia surveillance and population studies in diabetes patients. [JMIR Med Inform 2019;7(4):e14340]
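The evaluation protocol quoted above (10-fold cross-validation with mean and SD per metric) is straightforward to reproduce for any sentence classifier; the random features and the logistic regression below are stand-ins for the paper's CNN.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(1)
X, y = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)   # toy data

scores = cross_validate(LogisticRegression(max_iter=1000), X, y,
                        cv=10, scoring=("precision", "recall", "f1"))
for name in ("test_precision", "test_recall", "test_f1"):
    print(name, round(scores[name].mean(), 2), round(scores[name].std(), 2))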
Learning to detect and understand drug discontinuation events from clinical narratives.
Liu, F.; Pradhan, R.; Druhl, E.; Freund, E.; Liu, W.; Sauer, B. C.; Cunningham, F.; Gordon, A. J.; Peters, C. B.; and Yu, H.
Journal of the American Medical Informatics Association, 26(10): 943–951. October 2019.
Paper doi link bibtex abstract
@article{liu_learning_2019, title = {Learning to detect and understand drug discontinuation events from clinical narratives}, volume = {26}, url = {https://academic.oup.com/jamia/article/26/10/943/5481540}, doi = {10.1093/jamia/ocz048}, abstract = {AbstractObjective. Identifying drug discontinuation (DDC) events and understanding their reasons are important for medication management and drug safety survei}, language = {en}, number = {10}, urldate = {2019-12-29}, journal = {Journal of the American Medical Informatics Association}, author = {Liu, Feifan and Pradhan, Richeek and Druhl, Emily and Freund, Elaine and Liu, Weisong and Sauer, Brian C. and Cunningham, Fran and Gordon, Adam J. and Peters, Celena B. and Yu, Hong}, month = oct, year = {2019}, pmid = {31034028 PMCID: PMC6748801}, pages = {943--951}, }
Objective: Identifying drug discontinuation (DDC) events and understanding their reasons are important for medication management and drug safety surveillance.
Overview of the First Natural Language Processing Challenge for Extracting Medication, Indication, and Adverse Drug Events from Electronic Health Record Notes (MADE 1.0).
Jagannatha, A.; Liu, F.; Liu, W.; and Yu, H.
Drug Safety, 42(1): 99–111. January 2019.
doi link bibtex abstract
@article{jagannatha_overview_2019, title = {Overview of the {First} {Natural} {Language} {Processing} {Challenge} for {Extracting} {Medication}, {Indication}, and {Adverse} {Drug} {Events} from {Electronic} {Health} {Record} {Notes} ({MADE} 1.0)}, issn = {1179-1942}, doi = {10.1007/s40264-018-0762-z}, abstract = {INTRODUCTION: This work describes the Medication and Adverse Drug Events from Electronic Health Records (MADE 1.0) corpus and provides an overview of the MADE 1.0 2018 challenge for extracting medication, indication, and adverse drug events (ADEs) from electronic health record (EHR) notes. OBJECTIVE: The goal of MADE is to provide a set of common evaluation tasks to assess the state of the art for natural language processing (NLP) systems applied to EHRs supporting drug safety surveillance and pharmacovigilance. We also provide benchmarks on the MADE dataset using the system submissions received in the MADE 2018 challenge. METHODS: The MADE 1.0 challenge has released an expert-annotated cohort of medication and ADE information comprising 1089 fully de-identified longitudinal EHR notes from 21 randomly selected patients with cancer at the University of Massachusetts Memorial Hospital. Using this cohort as a benchmark, the MADE 1.0 challenge designed three shared NLP tasks. The named entity recognition (NER) task identifies medications and their attributes (dosage, route, duration, and frequency), indications, ADEs, and severity. The relation identification (RI) task identifies relations between the named entities: medication-indication, medication-ADE, and attribute relations. The third shared task (NER-RI) evaluates NLP models that perform the NER and RI tasks jointly. In total, 11 teams from four countries participated in at least one of the three shared tasks, and 41 system submissions were received in total. RESULTS: The best systems F1 scores for NER, RI, and NER-RI were 0.82, 0.86, and 0.61, respectively. Ensemble classifiers using the team submissions improved the performance further, with an F1 score of 0.85, 0.87, and 0.66 for the three tasks, respectively. CONCLUSION: MADE results show that recent progress in NLP has led to remarkable improvements in NER and RI tasks for the clinical domain. However, some room for improvement remains, particularly in the NER-RI task.}, language = {eng}, number = {1}, journal = {Drug Safety}, author = {Jagannatha, Abhyuday and Liu, Feifan and Liu, Weisong and Yu, Hong}, month = jan, year = {2019}, pmid = {30649735 PMCID: PMC6860017}, pages = {99--111}, }
INTRODUCTION: This work describes the Medication and Adverse Drug Events from Electronic Health Records (MADE 1.0) corpus and provides an overview of the MADE 1.0 2018 challenge for extracting medication, indication, and adverse drug events (ADEs) from electronic health record (EHR) notes. OBJECTIVE: The goal of MADE is to provide a set of common evaluation tasks to assess the state of the art for natural language processing (NLP) systems applied to EHRs supporting drug safety surveillance and pharmacovigilance. We also provide benchmarks on the MADE dataset using the system submissions received in the MADE 2018 challenge. METHODS: The MADE 1.0 challenge has released an expert-annotated cohort of medication and ADE information comprising 1089 fully de-identified longitudinal EHR notes from 21 randomly selected patients with cancer at the University of Massachusetts Memorial Hospital. Using this cohort as a benchmark, the MADE 1.0 challenge designed three shared NLP tasks. The named entity recognition (NER) task identifies medications and their attributes (dosage, route, duration, and frequency), indications, ADEs, and severity. The relation identification (RI) task identifies relations between the named entities: medication-indication, medication-ADE, and attribute relations. The third shared task (NER-RI) evaluates NLP models that perform the NER and RI tasks jointly. In total, 11 teams from four countries participated in at least one of the three shared tasks, and 41 system submissions were received in total. RESULTS: The best systems F1 scores for NER, RI, and NER-RI were 0.82, 0.86, and 0.61, respectively. Ensemble classifiers using the team submissions improved the performance further, with an F1 score of 0.85, 0.87, and 0.66 for the three tasks, respectively. CONCLUSION: MADE results show that recent progress in NLP has led to remarkable improvements in NER and RI tasks for the clinical domain. However, some room for improvement remains, particularly in the NER-RI task.
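The ensemble numbers reported here come from combining team submissions; the simplest version is majority voting over predicted entities, sketched below with invented systems and spans.

from collections import Counter

submissions = [                       # predicted (span, type) sets per system
    {("12-18", "ADE"), ("40-52", "Drug")},
    {("12-18", "ADE")},
    {("12-18", "ADE"), ("60-66", "Dosage")},
]
votes = Counter(e for system in submissions for e in system)
ensemble = {e for e, n in votes.items() if n > len(submissions) / 2}
print(ensemble)                       # {('12-18', 'ADE')}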
Naranjo Question Answering using End-to-End Multi-task Learning Model.
Rawat, B. P.; Li, F.; and Yu, H.
25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2547–2555. 2019.
doi link bibtex abstract
@article{rawat_naranjo_2019, title = {Naranjo {Question} {Answering} using {End}-to-{End} {Multi}-task {Learning} {Model}}, doi = {10.1145/3292500.3330770}, abstract = {In the clinical domain, it is important to understand whether an adverse drug reaction (ADR) is caused by a particular medication. Clinical judgement studies help judge the causal relation between a medication and its ADRs. In this study, we present the first attempt to automatically infer the causality between a drug and an ADR from electronic health records (EHRs) by answering the Naranjo questionnaire, the validated clinical question answering set used by domain experts for ADR causality assessment. Using physicians’ annotation as the gold standard, our proposed joint model, which uses multi-task learning to predict the answers of a subset of the Naranjo questionnaire, significantly outperforms the baseline pipeline model with a good margin, achieving a macro-weighted f-score between 0.3652 – 0.5271 and micro-weighted f-score between 0.9523 – 0.9918.}, journal = {25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)}, author = {Rawat, Bhanu P and Li, Fei and Yu, Hong}, year = {2019}, pmid = {31799022 NIHMSID: NIHMS1058295 PMCID:PMC6887102}, pages = {2547--2555}, }
In the clinical domain, it is important to understand whether an adverse drug reaction (ADR) is caused by a particular medication. Clinical judgement studies help judge the causal relation between a medication and its ADRs. In this study, we present the first attempt to automatically infer the causality between a drug and an ADR from electronic health records (EHRs) by answering the Naranjo questionnaire, the validated clinical question answering set used by domain experts for ADR causality assessment. Using physicians’ annotation as the gold standard, our proposed joint model, which uses multi-task learning to predict the answers of a subset of the Naranjo questionnaire, significantly outperforms the baseline pipeline model with a good margin, achieving a macro-weighted f-score between 0.3652 – 0.5271 and micro-weighted f-score between 0.9523 – 0.9918.
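The macro- and micro-weighted f-scores reported above average per-class scores in different ways: macro treats every class equally, micro is dominated by frequent classes. A minimal illustration with scikit-learn, using hypothetical Naranjo-style answer labels rather than the study's data:
# Macro vs. micro F1 on the same toy predictions.
from sklearn.metrics import f1_score

y_true = ["yes", "no", "no", "unknown", "no", "yes"]
y_pred = ["yes", "no", "no", "no",      "no", "no"]

print(f1_score(y_true, y_pred, average="macro"))  # per-class F1, then mean
print(f1_score(y_true, y_pred, average="micro"))  # pooled over all decisions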
A neural abstractive summarization model guided with topic sentences.
Chen, C.; Hu, B.; Chen, Q.; and Yu, H.
In ICONIP, 2019.
link bibtex
@inproceedings{chen_neural_2019, title = {A neural abstractive summarization model guided with topic sentences}, booktitle = {{ICONIP}}, author = {Chen, Chen and Hu, Baotian and Chen, Qingcai and Yu, Hong}, year = {2019}, }
An investigation of single-domain and multidomain medication and adverse drug event relation extraction from electronic health record notes using advanced deep learning models.
Li, F.; and Yu, H.
Journal of the American Medical Informatics Association, 26(7): 646–654. July 2019.
Paper doi link bibtex abstract
@article{li_investigation_2019, title = {An investigation of single-domain and multidomain medication and adverse drug event relation extraction from electronic health record notes using advanced deep learning models}, volume = {26}, url = {https://academic.oup.com/jamia/article/26/7/646/5426087}, doi = {10.1093/jamia/ocz018}, abstract = {AbstractObjective. We aim to evaluate the effectiveness of advanced deep learning models (eg, capsule network [CapNet], adversarial training [ADV]) for single-}, language = {en}, number = {7}, urldate = {2019-12-09}, journal = {Journal of the American Medical Informatics Association}, author = {Li, Fei and Yu, Hong}, month = jul, year = {2019}, pages = {646--654}, }
Objective: We aim to evaluate the effectiveness of advanced deep learning models (eg, capsule network [CapNet], adversarial training [ADV]) for single- […]
Anticoagulant prescribing for non-valvular atrial fibrillation in the Veterans Health Administration.
Rose, A. J.; Goldberg, R.; McManus, D. D.; Kapoor, A.; Wang, V.; Liu, W.; and Yu, H.
Journal of the American Heart Association. 2019.
doi link bibtex abstract
@article{rose_anticoagulant_2019, title = {Anticoagulant prescribing for non-valvular atrial fibrillation in the {Veterans} {Health} {Administration}}, doi = {10.1161/JAHA.119.012646}, abstract = {Background Direct acting oral anticoagulants (DOACs) theoretically could contribute to addressing underuse of anticoagulation in non-valvular atrial fibrillation (NVAF). Few studies have examined this prospect, however. The potential of DOACs to address underuse of anticoagulation in NVAF could be magnified within a healthcare system that sharply limits patients' exposure to out-of-pocket copayments, such as the Veterans Health Administration (VA). Methods and Results We used a clinical data set of all patients with NVAF treated within VA from 2007 to 2016 (n=987 373). We examined how the proportion of patients receiving any anticoagulation, and which agent was prescribed, changed over time. When first approved for VA use in 2011, DOACs constituted a tiny proportion of all prescriptions for anticoagulants (2\%); by 2016, this proportion had increased to 45\% of all prescriptions and 67\% of new prescriptions. Patient characteristics associated with receiving a DOAC, rather than warfarin, included white race, better kidney function, fewer comorbid conditions overall, and no history of stroke or bleeding. In 2007, before the introduction of DOACs, 56\% of VA patients with NVAF were receiving anticoagulation; this dipped to 44\% in 2012 just after the introduction of DOACs and had risen back to 51\% by 2016. Conclusions These results do not suggest that the availability of DOACs has led to an increased proportion of patients with NVAF receiving anticoagulation, even in the context of a healthcare system that sharply limits patients' exposure to out-of-pocket copayments.}, journal = {Journal of the American Heart Association}, author = {Rose, AJ and Goldberg, R and McManus, DD and Kapoor, A and Wang, V and Liu, W and Yu, H}, year = {2019}, pmid = {31441364 PMCID:PMC6755851}, }
Background Direct acting oral anticoagulants (DOACs) theoretically could contribute to addressing underuse of anticoagulation in non-valvular atrial fibrillation (NVAF). Few studies have examined this prospect, however. The potential of DOACs to address underuse of anticoagulation in NVAF could be magnified within a healthcare system that sharply limits patients' exposure to out-of-pocket copayments, such as the Veterans Health Administration (VA). Methods and Results We used a clinical data set of all patients with NVAF treated within VA from 2007 to 2016 (n=987 373). We examined how the proportion of patients receiving any anticoagulation, and which agent was prescribed, changed over time. When first approved for VA use in 2011, DOACs constituted a tiny proportion of all prescriptions for anticoagulants (2%); by 2016, this proportion had increased to 45% of all prescriptions and 67% of new prescriptions. Patient characteristics associated with receiving a DOAC, rather than warfarin, included white race, better kidney function, fewer comorbid conditions overall, and no history of stroke or bleeding. In 2007, before the introduction of DOACs, 56% of VA patients with NVAF were receiving anticoagulation; this dipped to 44% in 2012 just after the introduction of DOACs and had risen back to 51% by 2016. Conclusions These results do not suggest that the availability of DOACs has led to an increased proportion of patients with NVAF receiving anticoagulation, even in the context of a healthcare system that sharply limits patients' exposure to out-of-pocket copayments.
Learning Latent Parameters without Human Response Patterns: Item Response Theory with Artificial Crowds.
Lalor, J. P.; Wu, H.; and Yu, H.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4240–4250, Hong Kong, China, November 2019. Association for Computational Linguistics
NIHMSID: NIHMS1059054
Paper doi link bibtex abstract
@inproceedings{lalor_learning_2019, address = {Hong Kong, China}, title = {Learning {Latent} {Parameters} without {Human} {Response} {Patterns}: {Item} {Response} {Theory} with {Artificial} {Crowds}}, shorttitle = {Learning {Latent} {Parameters} without {Human} {Response} {Patterns}}, url = {https://www.aclweb.org/anthology/D19-1434}, doi = {10.18653/v1/D19-1434}, abstract = {Incorporating Item Response Theory (IRT) into NLP tasks can provide valuable information about model performance and behavior. Traditionally, IRT models are learned using human response pattern (RP) data, presenting a significant bottleneck for large data sets like those required for training deep neural networks (DNNs). In this work we propose learning IRT models using RPs generated from artificial crowds of DNN models. We demonstrate the effectiveness of learning IRT models using DNN-generated data through quantitative and qualitative analyses for two NLP tasks. Parameters learned from human and machine RPs for natural language inference and sentiment analysis exhibit medium to large positive correlations. We demonstrate a use-case for latent difficulty item parameters, namely training set filtering, and show that using difficulty to sample training data outperforms baseline methods. Finally, we highlight cases where human expectation about item difficulty does not match difficulty as estimated from the machine RPs.}, urldate = {2019-11-11}, booktitle = {Proceedings of the 2019 {Conference} on {Empirical} {Methods} in {Natural} {Language} {Processing} and the 9th {International} {Joint} {Conference} on {Natural} {Language} {Processing} ({EMNLP}-{IJCNLP})}, publisher = {Association for Computational Linguistics}, author = {Lalor, John P. and Wu, Hao and Yu, Hong}, month = nov, year = {2019}, pmcid = {PMC6892593}, pmid = {31803865}, note = {NIHMSID: NIHMS1059054}, pages = {4240--4250}, }
Incorporating Item Response Theory (IRT) into NLP tasks can provide valuable information about model performance and behavior. Traditionally, IRT models are learned using human response pattern (RP) data, presenting a significant bottleneck for large data sets like those required for training deep neural networks (DNNs). In this work we propose learning IRT models using RPs generated from artificial crowds of DNN models. We demonstrate the effectiveness of learning IRT models using DNN-generated data through quantitative and qualitative analyses for two NLP tasks. Parameters learned from human and machine RPs for natural language inference and sentiment analysis exhibit medium to large positive correlations. We demonstrate a use-case for latent difficulty item parameters, namely training set filtering, and show that using difficulty to sample training data outperforms baseline methods. Finally, we highlight cases where human expectation about item difficulty does not match difficulty as estimated from the machine RPs.
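As a rough sketch of the underlying idea of learning IRT parameters from response patterns, the following NumPy code fits a one-parameter (Rasch) model to simulated 0/1 responses by gradient ascent. The paper fits richer models with dedicated tooling, so this is only illustrative; all sizes and values are made up.
# Fit a Rasch model to a simulated response-pattern matrix.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_items = 200, 30
true_theta = rng.normal(size=n_subjects)           # abilities
true_b = rng.normal(size=n_items)                  # difficulties
p = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :])))
responses = rng.binomial(1, p)                     # simulated response patterns

theta = np.zeros(n_subjects)
b = np.zeros(n_items)
lr = 0.05
for _ in range(500):
    pred = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    resid = responses - pred   # per-cell gradient of the Bernoulli log-likelihood
    theta += lr * resid.sum(axis=1) / n_items
    b -= lr * resid.sum(axis=0) / n_subjects
print(np.corrcoef(b, true_b)[0, 1])               # recovered difficulties correlate with truth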
2018
(13)
Detecting Opioid-Related Aberrant Behavior using Natural Language Processing.
Lingeman, J. M.; Wang, P.; Becker, W.; and Yu, H.
AMIA Annual Symposium Proceedings, 2017: 1179–1185. April 2018.
Paper link bibtex abstract
@article{lingeman_detecting_2018, title = {Detecting {Opioid}-{Related} {Aberrant} {Behavior} using {Natural} {Language} {Processing}}, volume = {2017}, issn = {1942-597X}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5977697/}, abstract = {The United States is in the midst of a prescription opioid epidemic, with the number of yearly opioid-related overdose deaths increasing almost fourfold since 2000. To more effectively prevent unintentional opioid overdoses, the medical profession requires robust surveillance tools that can effectively identify at-risk patients. Drug-related aberrant behaviors observed in the clinical context may be important indicators of patients at risk for or actively abusing opioids. In this paper, we describe a natural language processing (NLP) method for automatic surveillance of aberrant behavior in medical notes relying only on the text of the notes. This allows for a robust and generalizable system that can be used for high volume analysis of electronic medical records for potential predictors of opioid abuse.}, urldate = {2024-04-10}, journal = {AMIA Annual Symposium Proceedings}, author = {Lingeman, Jesse M. and Wang, Priscilla and Becker, William and Yu, Hong}, month = apr, year = {2018}, pmid = {29854186}, pmcid = {PMC5977697}, pages = {1179--1185}, }
The United States is in the midst of a prescription opioid epidemic, with the number of yearly opioid-related overdose deaths increasing almost fourfold since 2000. To more effectively prevent unintentional opioid overdoses, the medical profession requires robust surveillance tools that can effectively identify at-risk patients. Drug-related aberrant behaviors observed in the clinical context may be important indicators of patients at risk for or actively abusing opioids. In this paper, we describe a natural language processing (NLP) method for automatic surveillance of aberrant behavior in medical notes relying only on the text of the notes. This allows for a robust and generalizable system that can be used for high volume analysis of electronic medical records for potential predictors of opioid abuse.
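A text-only surveillance classifier of the kind described can be sketched in a few lines of scikit-learn; the notes, labels, and feature choices below are fabricated placeholders, not the system or data from the paper.
# Toy text-only classifier over note snippets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = [
    "patient requested early refill of oxycodone",
    "reports pain well controlled on current regimen",
    "lost prescription again, asking for replacement script",
    "no aberrant behavior noted at this visit",
]
labels = [1, 0, 1, 0]   # 1 = aberrant-behavior mention (toy labels)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(notes, labels)
print(clf.predict(["asking for early refill"]))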
A natural language processing system that links medical terms in electronic health record notes to lay definitions: system development using physician reviews.
Chen, J.; Druhl, E.; Polepalli Ramesh, B.; Houston, T. K.; Brandt, C. A.; Zulman, D. M.; Vimalananda, V. G.; Malkani, S.; and Yu, H.
Journal of Medical Internet Research, 20(1): e26. January 2018.
doi link bibtex abstract
@article{chen_natural_2018, title = {A natural language processing system that links medical terms in electronic health record notes to lay definitions: system development using physician reviews}, volume = {20}, issn = {1438-8871}, shorttitle = {A natural language processing system that links medical terms in electronic health record notes to lay definitions}, doi = {10.2196/jmir.8669}, abstract = {BACKGROUND: Many health care systems now allow patients to access their electronic health record (EHR) notes online through patient portals. Medical jargon in EHR notes can confuse patients, which may interfere with potential benefits of patient access to EHR notes. OBJECTIVE: The aim of this study was to develop and evaluate the usability and content quality of NoteAid, a Web-based natural language processing system that links medical terms in EHR notes to lay definitions, that is, definitions easily understood by lay people. METHODS: NoteAid incorporates two core components: CoDeMed, a lexical resource of lay definitions for medical terms, and MedLink, a computational unit that links medical terms to lay definitions. We developed innovative computational methods, including an adapted distant supervision algorithm to prioritize medical terms important for EHR comprehension to facilitate the effort of building CoDeMed. Ten physician domain experts evaluated the user interface and content quality of NoteAid. The evaluation protocol included a cognitive walkthrough session and a postsession questionnaire. Physician feedback sessions were audio-recorded. We used standard content analysis methods to analyze qualitative data from these sessions. RESULTS: Physician feedback was mixed. Positive feedback on NoteAid included (1) Easy to use, (2) Good visual display, (3) Satisfactory system speed, and (4) Adequate lay definitions. Opportunities for improvement arising from evaluation sessions and feedback included (1) improving the display of definitions for partially matched terms, (2) including more medical terms in CoDeMed, (3) improving the handling of terms whose definitions vary depending on different contexts, and (4) standardizing the scope of definitions for medicines. On the basis of these results, we have improved NoteAid's user interface and a number of definitions, and added 4502 more definitions in CoDeMed. CONCLUSIONS: Physician evaluation yielded useful feedback for content validation and refinement of this innovative tool that has the potential to improve patient EHR comprehension and experience using patient portals. Future ongoing work will develop algorithms to handle ambiguous medical terms and test and evaluate NoteAid with patients.}, language = {eng}, number = {1}, journal = {Journal of Medical Internet Research}, author = {Chen, Jinying and Druhl, Emily and Polepalli Ramesh, Balaji and Houston, Thomas K. and Brandt, Cynthia A. and Zulman, Donna M. and Vimalananda, Varsha G. and Malkani, Samir and Yu, Hong}, month = jan, year = {2018}, pmid = {29358159 PMCID: PMC5799720}, keywords = {computer software, consumer health informatics, electronic health records, natural language processing, usability testing}, pages = {e26}, }
BACKGROUND: Many health care systems now allow patients to access their electronic health record (EHR) notes online through patient portals. Medical jargon in EHR notes can confuse patients, which may interfere with potential benefits of patient access to EHR notes. OBJECTIVE: The aim of this study was to develop and evaluate the usability and content quality of NoteAid, a Web-based natural language processing system that links medical terms in EHR notes to lay definitions, that is, definitions easily understood by lay people. METHODS: NoteAid incorporates two core components: CoDeMed, a lexical resource of lay definitions for medical terms, and MedLink, a computational unit that links medical terms to lay definitions. We developed innovative computational methods, including an adapted distant supervision algorithm to prioritize medical terms important for EHR comprehension to facilitate the effort of building CoDeMed. Ten physician domain experts evaluated the user interface and content quality of NoteAid. The evaluation protocol included a cognitive walkthrough session and a postsession questionnaire. Physician feedback sessions were audio-recorded. We used standard content analysis methods to analyze qualitative data from these sessions. RESULTS: Physician feedback was mixed. Positive feedback on NoteAid included (1) Easy to use, (2) Good visual display, (3) Satisfactory system speed, and (4) Adequate lay definitions. Opportunities for improvement arising from evaluation sessions and feedback included (1) improving the display of definitions for partially matched terms, (2) including more medical terms in CoDeMed, (3) improving the handling of terms whose definitions vary depending on different contexts, and (4) standardizing the scope of definitions for medicines. On the basis of these results, we have improved NoteAid's user interface and a number of definitions, and added 4502 more definitions in CoDeMed. CONCLUSIONS: Physician evaluation yielded useful feedback for content validation and refinement of this innovative tool that has the potential to improve patient EHR comprehension and experience using patient portals. Future ongoing work will develop algorithms to handle ambiguous medical terms and test and evaluate NoteAid with patients.
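The core linking step (MedLink mapping note text to CoDeMed definitions) can be illustrated, in a much reduced form, as greedy longest-match lookup against a lay-definition dictionary; the entries below are hypothetical stand-ins for CoDeMed content, not the actual resource.
# Minimal dictionary-based term-to-lay-definition linker.
lay_definitions = {
    "hypertension": "high blood pressure",
    "myocardial infarction": "heart attack",
    "edema": "swelling caused by fluid",
}

def link_terms(text, lexicon):
    """Return (term, definition) pairs found in text, longest terms first."""
    found = []
    lowered = text.lower()
    for term in sorted(lexicon, key=len, reverse=True):
        if term in lowered:
            found.append((term, lexicon[term]))
    return found

note = "History of hypertension and prior myocardial infarction."
print(link_terms(note, lay_definitions))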
Clinical Relation Extraction Toward Drug Safety Surveillance Using Electronic Health Record Narratives: Classical Learning Versus Deep Learning.
Munkhdalai, T.; Liu, F.; and Yu, H.
JMIR public health and surveillance, 4(2): e29. April 2018.
doi link bibtex abstract
@article{munkhdalai_clinical_2018, title = {Clinical {Relation} {Extraction} {Toward} {Drug} {Safety} {Surveillance} {Using} {Electronic} {Health} {Record} {Narratives}: {Classical} {Learning} {Versus} {Deep} {Learning}}, volume = {4}, issn = {2369-2960}, shorttitle = {Clinical {Relation} {Extraction} {Toward} {Drug} {Safety} {Surveillance} {Using} {Electronic} {Health} {Record} {Narratives}}, doi = {10.2196/publichealth.9361}, abstract = {BACKGROUND: Medication and adverse drug event (ADE) information extracted from electronic health record (EHR) notes can be a rich resource for drug safety surveillance. Existing observational studies have mainly relied on structured EHR data to obtain ADE information; however, ADEs are often buried in the EHR narratives and not recorded in structured data. OBJECTIVE: To unlock ADE-related information from EHR narratives, there is a need to extract relevant entities and identify relations among them. In this study, we focus on relation identification. This study aimed to evaluate natural language processing and machine learning approaches using the expert-annotated medical entities and relations in the context of drug safety surveillance, and investigate how different learning approaches perform under different configurations. METHODS: We have manually annotated 791 EHR notes with 9 named entities (eg, medication, indication, severity, and ADEs) and 7 different types of relations (eg, medication-dosage, medication-ADE, and severity-ADE). Then, we explored 3 supervised machine learning systems for relation identification: (1) a support vector machines (SVM) system, (2) an end-to-end deep neural network system, and (3) a supervised descriptive rule induction baseline system. For the neural network system, we exploited the state-of-the-art recurrent neural network (RNN) and attention models. We report the performance by macro-averaged precision, recall, and F1-score across the relation types. RESULTS: Our results show that the SVM model achieved the best average F1-score of 89.1\% on test data, outperforming the long short-term memory (LSTM) model with attention (F1-score of 65.72\%) as well as the rule induction baseline system (F1-score of 7.47\%) by a large margin. The bidirectional LSTM model with attention achieved the best performance among different RNN models. With the inclusion of additional features in the LSTM model, its performance can be boosted to an average F1-score of 77.35\%. CONCLUSIONS: It shows that classical learning models (SVM) remains advantageous over deep learning models (RNN variants) for clinical relation identification, especially for long-distance intersentential relations. However, RNNs demonstrate a great potential of significant improvement if more training data become available. Our work is an important step toward mining EHRs to improve the efficacy of drug safety surveillance. Most importantly, the annotated data used in this study will be made publicly available, which will further promote drug safety research in the community.}, language = {eng}, number = {2}, journal = {JMIR public health and surveillance}, author = {Munkhdalai, Tsendsuren and Liu, Feifan and Yu, Hong}, month = apr, year = {2018}, pmid = {29695376 PMCID: PMC5943628}, keywords = {drug-related side effects and adverse reactions, electronic health records, medical informatics applications, natural language processing, neural networks}, pages = {e29}, }
BACKGROUND: Medication and adverse drug event (ADE) information extracted from electronic health record (EHR) notes can be a rich resource for drug safety surveillance. Existing observational studies have mainly relied on structured EHR data to obtain ADE information; however, ADEs are often buried in the EHR narratives and not recorded in structured data. OBJECTIVE: To unlock ADE-related information from EHR narratives, there is a need to extract relevant entities and identify relations among them. In this study, we focus on relation identification. This study aimed to evaluate natural language processing and machine learning approaches using the expert-annotated medical entities and relations in the context of drug safety surveillance, and investigate how different learning approaches perform under different configurations. METHODS: We have manually annotated 791 EHR notes with 9 named entities (eg, medication, indication, severity, and ADEs) and 7 different types of relations (eg, medication-dosage, medication-ADE, and severity-ADE). Then, we explored 3 supervised machine learning systems for relation identification: (1) a support vector machines (SVM) system, (2) an end-to-end deep neural network system, and (3) a supervised descriptive rule induction baseline system. For the neural network system, we exploited the state-of-the-art recurrent neural network (RNN) and attention models. We report the performance by macro-averaged precision, recall, and F1-score across the relation types. RESULTS: Our results show that the SVM model achieved the best average F1-score of 89.1% on test data, outperforming the long short-term memory (LSTM) model with attention (F1-score of 65.72%) as well as the rule induction baseline system (F1-score of 7.47%) by a large margin. The bidirectional LSTM model with attention achieved the best performance among different RNN models. With the inclusion of additional features in the LSTM model, its performance can be boosted to an average F1-score of 77.35%. CONCLUSIONS: It shows that classical learning models (SVM) remains advantageous over deep learning models (RNN variants) for clinical relation identification, especially for long-distance intersentential relations. However, RNNs demonstrate a great potential of significant improvement if more training data become available. Our work is an important step toward mining EHRs to improve the efficacy of drug safety surveillance. Most importantly, the annotated data used in this study will be made publicly available, which will further promote drug safety research in the community.
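A feature-based SVM relation classifier of the winning kind can be sketched with scikit-learn; the feature names, instances, and relation labels below are hypothetical and far simpler than the study's feature set.
# Toy SVM relation classifier over dictionary features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

instances = [
    {"e1_type": "Medication", "e2_type": "ADE", "token_distance": 4},
    {"e1_type": "Medication", "e2_type": "Dosage", "token_distance": 1},
    {"e1_type": "Medication", "e2_type": "ADE", "token_distance": 40},
]
labels = ["medication-ADE", "medication-dosage", "no-relation"]

model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit(instances, labels)
print(model.predict([{"e1_type": "Medication", "e2_type": "Dosage",
                      "token_distance": 2}]))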
A hybrid Neural Network Model for Joint Prediction of Presence and Period Assertions of Medical Events in Clinical Notes.
Li, R.; Jagannatha, A. N.; and Yu, H.
AMIA Annual Symposium Proceedings, 2017: 1149–1158. April 2018.
Paper link bibtex abstract
@article{rumeng_hybrid_2018, title = {A hybrid {Neural} {Network} {Model} for {Joint} {Prediction} of {Presence} and {Period} {Assertions} of {Medical} {Events} in {Clinical} {Notes}}, volume = {2017}, issn = {1942-597X}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5977733/}, abstract = {In this paper, we propose a novel neural network architecture for clinical text mining. We formulate this hybrid neural network model (HNN), composed of recurrent neural network and deep residual network, to jointly predict the presence and period assertion values associated with medical events in clinical texts. We evaluate the effectiveness of our model on a corpus of expert-annotated longitudinal Electronic Health Records (EHR) notes from Cancer patients. Our experiments show that HNN improves the joint assertion classification accuracy as compared to conventional baselines.}, urldate = {2018-10-01}, journal = {AMIA Annual Symposium Proceedings}, author = {Rumeng, Li and Abhyuday N, Jagannatha and Hong, Yu}, month = apr, year = {2018}, pmid = {29854183}, pmcid = {PMC5977733}, pages = {1149--1158}, }
In this paper, we propose a novel neural network architecture for clinical text mining. We formulate this hybrid neural network model (HNN), composed of recurrent neural network and deep residual network, to jointly predict the presence and period assertion values associated with medical events in clinical texts. We evaluate the effectiveness of our model on a corpus of expert-annotated longitudinal Electronic Health Records (EHR) notes from Cancer patients. Our experiments show that HNN improves the joint assertion classification accuracy as compared to conventional baselines.
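The joint-prediction idea, one shared encoder feeding two assertion heads trained on a summed loss, can be sketched in PyTorch; this strips the paper's RNN-plus-residual-network hybrid down to a bare skeleton, and all dimensions and labels are made up.
# Shared encoder with two classification heads, trained jointly.
import torch
import torch.nn as nn

class JointAssertionModel(nn.Module):
    def __init__(self, vocab_size, n_presence, n_period, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.presence_head = nn.Linear(2 * dim, n_presence)
        self.period_head = nn.Linear(2 * dim, n_period)

    def forward(self, tokens):
        hidden, _ = self.encoder(self.embed(tokens))
        pooled = hidden.mean(dim=1)                # sentence representation
        return self.presence_head(pooled), self.period_head(pooled)

model = JointAssertionModel(vocab_size=1000, n_presence=3, n_period=3)
tokens = torch.randint(0, 1000, (2, 12))          # batch of 2 toy sentences
presence_logits, period_logits = model(tokens)
loss = (nn.functional.cross_entropy(presence_logits, torch.tensor([0, 2]))
        + nn.functional.cross_entropy(period_logits, torch.tensor([1, 1])))
loss.backward()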
Assessing Readability of Medical Documents: A Ranking Approach.
Zheng, J.; and Yu, H.
JMIR Medical Informatics. March 2018.
doi link bibtex abstract
@article{zheng_assessing_2018, title = {Assessing {Readability} of {Medical} {Documents}: {A} {Ranking} {Approach}}, doi = {10.2196/medinform.8611}, abstract = {BACKGROUND: The use of electronic health record (EHR) systems with patient engagement capabilities, including viewing, downloading, and transmitting health information, has recently grown tremendously. However, using these resources to engage patients in managing their own health remains challenging due to the complex and technical nature of the EHR narratives. OBJECTIVE: Our objective was to develop a machine learning-based system to assess readability levels of complex documents such as EHR notes. METHODS: We collected difficulty ratings of EHR notes and Wikipedia articles using crowdsourcing from 90 readers. We built a supervised model to assess readability based on relative orders of text difficulty using both surface text features and word embeddings. We evaluated system performance using the Kendall coefficient of concordance against human ratings. RESULTS: Our system achieved significantly higher concordance (.734) with human annotators than did a baseline using the Flesch-Kincaid Grade Level, a widely adopted readability formula (.531). The improvement was also consistent across different disease topics. This method's concordance with an individual human user's ratings was also higher than the concordance between different human annotators (.658). CONCLUSIONS: We explored methods to automatically assess the readability levels of clinical narratives. Our ranking-based system using simple textual features and easy-to-learn word embeddings outperformed a widely used readability formula. Our ranking-based method can predict relative difficulties of medical documents. It is not constrained to a predefined set of readability levels, a common design in many machine learning-based systems. Furthermore, the feature set does not rely on complex processing of the documents. One potential application of our readability ranking is personalization, allowing patients to better accommodate their own background knowledge.}, journal = {JMIR Medical Informatics}, author = {Zheng, JP and Yu, H}, month = mar, year = {2018}, pmid = {29572199 PMCID: PMC5889493}, }
BACKGROUND: The use of electronic health record (EHR) systems with patient engagement capabilities, including viewing, downloading, and transmitting health information, has recently grown tremendously. However, using these resources to engage patients in managing their own health remains challenging due to the complex and technical nature of the EHR narratives. OBJECTIVE: Our objective was to develop a machine learning-based system to assess readability levels of complex documents such as EHR notes. METHODS: We collected difficulty ratings of EHR notes and Wikipedia articles using crowdsourcing from 90 readers. We built a supervised model to assess readability based on relative orders of text difficulty using both surface text features and word embeddings. We evaluated system performance using the Kendall coefficient of concordance against human ratings. RESULTS: Our system achieved significantly higher concordance (.734) with human annotators than did a baseline using the Flesch-Kincaid Grade Level, a widely adopted readability formula (.531). The improvement was also consistent across different disease topics. This method's concordance with an individual human user's ratings was also higher than the concordance between different human annotators (.658). CONCLUSIONS: We explored methods to automatically assess the readability levels of clinical narratives. Our ranking-based system using simple textual features and easy-to-learn word embeddings outperformed a widely used readability formula. Our ranking-based method can predict relative difficulties of medical documents. It is not constrained to a predefined set of readability levels, a common design in many machine learning-based systems. Furthermore, the feature set does not rely on complex processing of the documents. One potential application of our readability ranking is personalization, allowing patients to better accommodate their own background knowledge.
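The evaluation above relies on the Kendall coefficient of concordance (W); here is a direct NumPy implementation of the standard no-ties formula, applied to a toy rank matrix (not the study's ratings).
# Kendall's W for m raters ranking n items (no tied ranks).
import numpy as np

def kendalls_w(ranks):
    """ranks: (m raters, n items) matrix of rank values 1..n per rater."""
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

ranks = np.array([[1, 2, 3, 4],
                  [1, 3, 2, 4],
                  [2, 1, 3, 4]])
print(kendalls_w(ranks))   # 1.0 would mean perfect agreement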
Understanding Deep Learning Performance through an Examination of Test Set Difficulty: A Psychometric Case Study.
Lalor, J.; Wu, H.; Munkhdalai, T.; and Yu, H.
In EMNLP, 2018.
Paper doi link bibtex abstract
@inproceedings{lalor_understanding_2018, title = {Understanding {Deep} {Learning} {Performance} through an {Examination} of {Test} {Set} {Difficulty}: {A} {Psychometric} {Case} {Study}}, url = {https://arxiv.org/abs/1702.04811v3}, doi = {10.18653/v1/D18-1500}, abstract = {Interpreting the performance of deep learning models beyond test set accuracy is challenging. Characteristics of individual data points are often not considered during evaluation, and each data point is treated equally. We examine the impact of a test set question's difficulty to determine if there is a relationship between difficulty and performance. We model difficulty using well-studied psychometric methods on human response patterns. Experiments on Natural Language Inference (NLI) and Sentiment Analysis (SA) show that the likelihood of answering a question correctly is impacted by the question's difficulty. As DNNs are trained with more data, easy examples are learned more quickly than hard examples.}, booktitle = {{EMNLP}}, author = {Lalor, John and Wu, Hao and Munkhdalai, Tsendsuren and Yu, Hong}, year = {2018}, }
Interpreting the performance of deep learning models beyond test set accuracy is challenging. Characteristics of individual data points are often not considered during evaluation, and each data point is treated equally. We examine the impact of a test set question's difficulty to determine if there is a relationship between difficulty and performance. We model difficulty using well-studied psychometric methods on human response patterns. Experiments on Natural Language Inference (NLI) and Sentiment Analysis (SA) show that the likelihood of answering a question correctly is impacted by the question's difficulty. As DNNs are trained with more data, easy examples are learned more quickly than hard examples.
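A compact sketch of the analysis pattern: define an item characteristic curve (here the three-parameter logistic) and correlate item difficulty with whether a model answered correctly. All values below are fabricated toy data, not the paper's measurements.
# 3PL item characteristic curve plus a difficulty/correctness correlation.
import numpy as np
from scipy.stats import spearmanr

def p_correct(theta, a, b, c):
    """3PL: guessing floor c plus a discrimination-a sigmoid around difficulty b."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

difficulty = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
model_correct = np.array([1, 1, 1, 0, 0])        # harder items missed
print(spearmanr(difficulty, model_correct))       # expect a negative correlation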
Soft Label Memorization-Generalization for Natural Language Inference.
Lalor, J.; Wu, H.; and Yu, H.
In 2018.
Paper link bibtex abstract
@inproceedings{lalor_soft_2018, title = {Soft {Label} {Memorization}-{Generalization} for {Natural} {Language} {Inference}.}, url = {https://arxiv.org/abs/1702.08563v3}, abstract = {Often when multiple labels are obtained for a training example it is assumed that there is an element of noise that must be accounted for. It has been shown that this disagreement can be considered signal instead of noise. In this work we investigate using soft labels for training data to improve generalization in machine learning models. However, using soft labels for training Deep Neural Networks (DNNs) is not practical due to the costs involved in obtaining multiple labels for large data sets. We propose soft label memorization-generalization (SLMG), a fine-tuning approach to using soft labels for training DNNs. We assume that differences in labels provided by human annotators represent ambiguity about the true label instead of noise. Experiments with SLMG demonstrate improved generalization performance on the Natural Language Inference (NLI) task. Our experiments show that by injecting a small percentage of soft label training data (0.03\% of training set size) we can improve generalization performance over several baselines.}, author = {Lalor, John and Wu, Hao and Yu, Hong}, year = {2018}, }
Often when multiple labels are obtained for a training example it is assumed that there is an element of noise that must be accounted for. It has been shown that this disagreement can be considered signal instead of noise. In this work we investigate using soft labels for training data to improve generalization in machine learning models. However, using soft labels for training Deep Neural Networks (DNNs) is not practical due to the costs involved in obtaining multiple labels for large data sets. We propose soft label memorization-generalization (SLMG), a fine-tuning approach to using soft labels for training DNNs. We assume that differences in labels provided by human annotators represent ambiguity about the true label instead of noise. Experiments with SLMG demonstrate improved generalization performance on the Natural Language Inference (NLI) task. Our experiments show that by injecting a small percentage of soft label training data (0.03% of training set size) we can improve generalization performance over several baselines.
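The ingredient SLMG adds is a cross-entropy against a full label distribution rather than a one-hot target; a minimal PyTorch version of that loss, with a toy three-class soft label standing in for an annotator distribution:
# Soft-label cross-entropy: -sum(target_dist * log_softmax(logits)).
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]], requires_grad=True)
soft_target = torch.tensor([[0.6, 0.3, 0.1]])     # annotator label distribution

loss = -(soft_target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
loss.backward()
print(loss.item())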
Sentence Simplification with Memory-Augmented Neural Networks.
Vu, T.; Hu, B.; Munkhdalai, T.; and Yu, H.
In North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018.
doi link bibtex abstract
@inproceedings{vu_sentence_2018, title = {Sentence {Simplification} with {Memory}-{Augmented} {Neural} {Networks}}, doi = {10.18653/v1/N18-2013}, abstract = {Sentence simplification aims to simplify the content and structure of complex sentences, and thus make them easier to interpret for human readers, and easier to process for downstream NLP applications. Recent advances in neural machine translation have paved the way for novel approaches to the task. In this paper, we adapt an architecture with augmented memory capacities called Neural Semantic Encoders (Munkhdalai and Yu, 2017) for sentence simplification. Our experiments demonstrate the effectiveness of our approach on different simplification datasets, both in terms of automatic evaluation measures and human judgments.}, booktitle = {North {American} {Chapter} of the {Association} for {Computational} {Linguistics}: {Human} {Language} {Technologies}}, author = {Vu, Tu and Hu, Baotian and Munkhdalai, Tsendsuren and Yu, Hong}, year = {2018}, }
Sentence simplification aims to simplify the content and structure of complex sentences, and thus make them easier to interpret for human readers, and easier to process for downstream NLP applications. Recent advances in neural machine translation have paved the way for novel approaches to the task. In this paper, we adapt an architecture with augmented memory capacities called Neural Semantic Encoders (Munkhdalai and Yu, 2017) for sentence simplification. Our experiments demonstrate the effectiveness of our approach on different simplification datasets, both in terms of automatic evaluation measures and human judgments.
Recent Trends In Oral Anticoagulant Use and Post-Discharge Complications Among Atrial Fibrillation Patients With Acute Myocardial Infarction.
Amartya Kundu; Kevin O'Day; Darleen M. Lessard; Joel M. Gore; Steven A. Lubitz; Hong Yu; Mohammed W. Akhter; Daniel Z. Fisher; Robert M. Hayward Jr.; Nils Henninger; Jane S. Saczynski; Allan J. Walkey; Alok Kapoor; Jorge Yarzebski; Robert J. Goldberg; and David D. McManus
In Journal of Atrial Fibrillation, 2018.
doi link bibtex abstract
@inproceedings{amartya_kundu_recent_2018, title = {Recent {Trends} {In} {Oral} {Anticoagulant} {Use} and {Post}-{Discharge} {Complications} {Among} {Atrial} {Fibrillation} {Patients} {With} {Acute} {Myocardial} {Infarction}}, doi = {10.4022/jafib.1749}, abstract = {BACKGROUND: Atrial fibrillation (AF) is a common complication of acute myocardial infarction (AMI). The CHA2DS2-VASc and CHADS2 risk scores are used to identify patients with AF at risk for stroke and to guide oral anticoagulant (OAC) use, including in patients with AMI. However, the epidemiology of AF, further stratified according to patients' risk of stroke, has not been well characterized among those hospitalized for AMI. METHODS: We examined trends in the frequency of AF, rates of discharge OAC use, and post-discharge outcomes among 6,627 residents of the Worcester, Massachusetts area who survived hospitalization for AMI at 11 medical centers between 1997 and 2011. RESULTS: A total of 1,050 AMI patients had AF (16\%) and the majority (91\%) had a CHA2DS2-VASc score {\textgreater}2. AF rates were highest among patients in the highest stroke risk group. In comparison to patients without AF, patients with AMI and AF in the highest stroke risk category had higher rates of post-discharge complications, including higher 30-day re-hospitalization [27\% vs. 17\%], 30-day post-discharge death [10\% vs. 5\%], and 1-year post-discharge death [46\% vs. 18\%] (p {\textless} 0.001 for all). Notably, fewer than half of guideline-eligible AF patients received an OAC prescription at discharge. Usage rates for other evidence-based therapies, such as statins and beta-blockers, lagged in comparison to AMI patients free from AF. CONCLUSIONS: Our findings highlight the need to enhance efforts towards stroke prevention among AMI survivors with AF.}, publisher = {Journal of Atrial Fibrillation}, author = {{Amartya Kundu} and {Kevin O'Day} and {Darleen M. Lessard} and {Joel M. Gore} and {Steven A. Lubitz} and {Hong Yu} and {Mohammed W. Akhter} and {Daniel Z. Fisher} and {Robert M. Hayward Jr.} and {Nils Henninger} and {Jane S. Saczynski} and {Allan J. Walkey} and {Alok Kapoor} and {Jorge Yarzebski} and {Robert J. Goldberg} and {David D. McManus}}, year = {2018}, pmid = {29988239 PMCID: PMC6006973}, }
BACKGROUND: Atrial fibrillation (AF) is a common complication of acute myocardial infarction (AMI). The CHA2DS2-VASc and CHADS2 risk scores are used to identify patients with AF at risk for stroke and to guide oral anticoagulant (OAC) use, including in patients with AMI. However, the epidemiology of AF, further stratified according to patients' risk of stroke, has not been well characterized among those hospitalized for AMI. METHODS: We examined trends in the frequency of AF, rates of discharge OAC use, and post-discharge outcomes among 6,627 residents of the Worcester, Massachusetts area who survived hospitalization for AMI at 11 medical centers between 1997 and 2011. RESULTS: A total of 1,050 AMI patients had AF (16%) and the majority (91%) had a CHA2DS2-VASc score >2. AF rates were highest among patients in the highest stroke risk group. In comparison to patients without AF, patients with AMI and AF in the highest stroke risk category had higher rates of post-discharge complications, including higher 30-day re-hospitalization [27% vs. 17%], 30-day post-discharge death [10% vs. 5%], and 1-year post-discharge death [46% vs. 18%] (p < 0.001 for all). Notably, fewer than half of guideline-eligible AF patients received an OAC prescription at discharge. Usage rates for other evidence-based therapies, such as statins and beta-blockers, lagged in comparison to AMI patients free from AF. CONCLUSIONS: Our findings highlight the need to enhance efforts towards stroke prevention among AMI survivors with AF.
ComprehENotes: An Instrument to Assess Patient Reading Comprehension of Electronic Health Record Notes: Development and Validation.
Lalor, J.; Wu, H.; Chen, L.; Mazor, K.; and Yu, H.
The Journal of Medical Internet Research. April 2018.
doi link bibtex abstract
@article{lalor_comprehenotes:_2018, title = {{ComprehENotes}: {An} {Instrument} to {Assess} {Patient} {Reading} {Comprehension} of {Electronic} {Health} {Record} {Notes}: {Development} and {Validation}}, doi = {10.2196/jmir.9380}, abstract = {BACKGROUND: Patient portals are widely adopted in the United States and allow millions of patients access to their electronic health records (EHRs), including their EHR clinical notes. A patient's ability to understand the information in the EHR is dependent on their overall health literacy. Although many tests of health literacy exist, none specifically focuses on EHR note comprehension. OBJECTIVE: The aim of this paper was to develop an instrument to assess patients' EHR note comprehension. METHODS: We identified 6 common diseases or conditions (heart failure, diabetes, cancer, hypertension, chronic obstructive pulmonary disease, and liver failure) and selected 5 representative EHR notes for each disease or condition. One note that did not contain natural language text was removed. Questions were generated from these notes using Sentence Verification Technique and were analyzed using item response theory (IRT) to identify a set of questions that represent a good test of ability for EHR note comprehension. RESULTS: Using Sentence Verification Technique, 154 questions were generated from the 29 EHR notes initially obtained. Of these, 83 were manually selected for inclusion in the Amazon Mechanical Turk crowdsourcing tasks and 55 were ultimately retained following IRT analysis. A follow-up validation with a second Amazon Mechanical Turk task and IRT analysis confirmed that the 55 questions test a latent ability dimension for EHR note comprehension. A short test of 14 items was created along with the 55-item test. CONCLUSIONS: We developed ComprehENotes, an instrument for assessing EHR note comprehension from existing EHR notes, gathered responses using crowdsourcing, and used IRT to analyze those responses, thus resulting in a set of questions to measure EHR note comprehension. Crowdsourced responses from Amazon Mechanical Turk can be used to estimate item parameters and select a subset of items for inclusion in the test set using IRT. The final set of questions is the first test of EHR note comprehension.}, journal = {The Journal of Medical Internet Research}, author = {Lalor, J and Wu, H and Chen, L and Mazor, K and Yu, H}, month = apr, year = {2018}, pmid = {29695372 PMCID: PMC5943623}, }
BACKGROUND: Patient portals are widely adopted in the United States and allow millions of patients access to their electronic health records (EHRs), including their EHR clinical notes. A patient's ability to understand the information in the EHR is dependent on their overall health literacy. Although many tests of health literacy exist, none specifically focuses on EHR note comprehension. OBJECTIVE: The aim of this paper was to develop an instrument to assess patients' EHR note comprehension. METHODS: We identified 6 common diseases or conditions (heart failure, diabetes, cancer, hypertension, chronic obstructive pulmonary disease, and liver failure) and selected 5 representative EHR notes for each disease or condition. One note that did not contain natural language text was removed. Questions were generated from these notes using Sentence Verification Technique and were analyzed using item response theory (IRT) to identify a set of questions that represent a good test of ability for EHR note comprehension. RESULTS: Using Sentence Verification Technique, 154 questions were generated from the 29 EHR notes initially obtained. Of these, 83 were manually selected for inclusion in the Amazon Mechanical Turk crowdsourcing tasks and 55 were ultimately retained following IRT analysis. A follow-up validation with a second Amazon Mechanical Turk task and IRT analysis confirmed that the 55 questions test a latent ability dimension for EHR note comprehension. A short test of 14 items was created along with the 55-item test. CONCLUSIONS: We developed ComprehENotes, an instrument for assessing EHR note comprehension from existing EHR notes, gathered responses using crowdsourcing, and used IRT to analyze those responses, thus resulting in a set of questions to measure EHR note comprehension. Crowdsourced responses from Amazon Mechanical Turk can be used to estimate item parameters and select a subset of items for inclusion in the test set using IRT. The final set of questions is the first test of EHR note comprehension.
Detecting Hypoglycemia Incidence from Patients’ Secure Messages.
Chen, J.; and Yu, H.
In 2018.
link bibtex
@inproceedings{chen_detecting_2018, title = {Detecting {Hypoglycemia} {Incidence} from {Patients}’ {Secure} {Messages}}, author = {Chen, J and Yu, H}, year = {2018}, }
Extraction of Information Related to Adverse Drug Events from Electronic Health Record Notes: Design of an End-to-End Model Based on Deep Learning.
Li, F.; Liu, W.; and Yu, H.
JMIR medical informatics, 6(4): e12159. November 2018.
doi link bibtex abstract
@article{li_extraction_2018, title = {Extraction of {Information} {Related} to {Adverse} {Drug} {Events} from {Electronic} {Health} {Record} {Notes}: {Design} of an {End}-to-{End} {Model} {Based} on {Deep} {Learning}}, volume = {6}, issn = {2291-9694}, shorttitle = {Extraction of {Information} {Related} to {Adverse} {Drug} {Events} from {Electronic} {Health} {Record} {Notes}}, doi = {10.2196/12159}, abstract = {BACKGROUND: Pharmacovigilance and drug-safety surveillance are crucial for monitoring adverse drug events (ADEs), but the main ADE-reporting systems such as Food and Drug Administration Adverse Event Reporting System face challenges such as underreporting. Therefore, as complementary surveillance, data on ADEs are extracted from electronic health record (EHR) notes via natural language processing (NLP). As NLP develops, many up-to-date machine-learning techniques are introduced in this field, such as deep learning and multi-task learning (MTL). However, only a few studies have focused on employing such techniques to extract ADEs. OBJECTIVE: We aimed to design a deep learning model for extracting ADEs and related information such as medications and indications. Since extraction of ADE-related information includes two steps-named entity recognition and relation extraction-our second objective was to improve the deep learning model using multi-task learning between the two steps. METHODS: We employed the dataset from the Medication, Indication and Adverse Drug Events (MADE) 1.0 challenge to train and test our models. This dataset consists of 1089 EHR notes of cancer patients and includes 9 entity types such as Medication, Indication, and ADE and 7 types of relations between these entities. To extract information from the dataset, we proposed a deep-learning model that uses a bidirectional long short-term memory (BiLSTM) conditional random field network to recognize entities and a BiLSTM-Attention network to extract relations. To further improve the deep-learning model, we employed three typical MTL methods, namely, hard parameter sharing, parameter regularization, and task relation learning, to build three MTL models, called HardMTL, RegMTL, and LearnMTL, respectively. RESULTS: Since extraction of ADE-related information is a two-step task, the result of the second step (ie, relation extraction) was used to compare all models. We used microaveraged precision, recall, and F1 as evaluation metrics. Our deep learning model achieved state-of-the-art results (F1=65.9\%), which is significantly higher than that (F1=61.7\%) of the best system in the MADE1.0 challenge. HardMTL further improved the F1 by 0.8\%, boosting the F1 to 66.7\%, whereas RegMTL and LearnMTL failed to boost the performance. CONCLUSIONS: Deep learning models can significantly improve the performance of ADE-related information extraction. MTL may be effective for named entity recognition and relation extraction, but it depends on the methods, data, and other factors. Our results can facilitate research on ADE detection, NLP, and machine learning.}, language = {eng}, number = {4}, journal = {JMIR medical informatics}, author = {Li, Fei and Liu, Weisong and Yu, Hong}, month = nov, year = {2018}, pmid = {30478023 PMCID: PMC6288593}, keywords = {adverse drug event, deep learning, multi-task learning, named entity recognition, natural language processing, relation extraction}, pages = {e12159}, }
BACKGROUND: Pharmacovigilance and drug-safety surveillance are crucial for monitoring adverse drug events (ADEs), but the main ADE-reporting systems such as Food and Drug Administration Adverse Event Reporting System face challenges such as underreporting. Therefore, as complementary surveillance, data on ADEs are extracted from electronic health record (EHR) notes via natural language processing (NLP). As NLP develops, many up-to-date machine-learning techniques are introduced in this field, such as deep learning and multi-task learning (MTL). However, only a few studies have focused on employing such techniques to extract ADEs. OBJECTIVE: We aimed to design a deep learning model for extracting ADEs and related information such as medications and indications. Since extraction of ADE-related information includes two steps-named entity recognition and relation extraction-our second objective was to improve the deep learning model using multi-task learning between the two steps. METHODS: We employed the dataset from the Medication, Indication and Adverse Drug Events (MADE) 1.0 challenge to train and test our models. This dataset consists of 1089 EHR notes of cancer patients and includes 9 entity types such as Medication, Indication, and ADE and 7 types of relations between these entities. To extract information from the dataset, we proposed a deep-learning model that uses a bidirectional long short-term memory (BiLSTM) conditional random field network to recognize entities and a BiLSTM-Attention network to extract relations. To further improve the deep-learning model, we employed three typical MTL methods, namely, hard parameter sharing, parameter regularization, and task relation learning, to build three MTL models, called HardMTL, RegMTL, and LearnMTL, respectively. RESULTS: Since extraction of ADE-related information is a two-step task, the result of the second step (ie, relation extraction) was used to compare all models. We used microaveraged precision, recall, and F1 as evaluation metrics. Our deep learning model achieved state-of-the-art results (F1=65.9%), which is significantly higher than that (F1=61.7%) of the best system in the MADE1.0 challenge. HardMTL further improved the F1 by 0.8%, boosting the F1 to 66.7%, whereas RegMTL and LearnMTL failed to boost the performance. CONCLUSIONS: Deep learning models can significantly improve the performance of ADE-related information extraction. MTL may be effective for named entity recognition and relation extraction, but it depends on the methods, data, and other factors. Our results can facilitate research on ADE detection, NLP, and machine learning.
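Of the three MTL variants named above, parameter regularization is the simplest to show compactly: two task models train separately while an L2 penalty pulls their corresponding parameters together. A hypothetical PyTorch sketch of just that penalty term, not the paper's full BiLSTM-CRF and BiLSTM-Attention system:
# L2 penalty tying the parameters of two task-specific encoders.
import torch

def shared_param_penalty(model_a, model_b, weight=1e-3):
    """L2 distance between same-named parameters of two task models."""
    penalty = 0.0
    params_b = dict(model_b.named_parameters())
    for name, param_a in model_a.named_parameters():
        if name in params_b:
            penalty = penalty + ((param_a - params_b[name]) ** 2).sum()
    return weight * penalty

enc_a = torch.nn.LSTM(32, 32, batch_first=True)   # NER-task encoder
enc_b = torch.nn.LSTM(32, 32, batch_first=True)   # RE-task encoder
print(shared_param_penalty(enc_a, enc_b).item())  # add this to both task losses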
Reference Standard Development to Train Natural Language Processing Algorithms to Detect Problematic Buprenorphine-Naloxone Therapy.
Celena B Peters; Fran Cunningham; Adam Gordon; Hong Yu; Cedric Salone; Jessica Zacher; Ronald Carico; Jianwei Leng; Nikolh Durley; Weisong Liu; Chao-Chin Lu; Emily Druhl; Feifan Liu; and Brian C Sauer
In VA Pharmacy Informatics Conference 2018, 2018.
Paper link bibtex
@inproceedings{celena_b_peters_reference_2018, title = {Reference {Standard} {Development} to {Train} {Natural} {Language} {Processing} {Algorithms} to {Detect} {Problematic} {Buprenorphine}-{Naloxone} {Therapy}}, url = {https://vapharmacytraining.remote-learner.net/mod/resource/view.php?id=13218}, booktitle = {{VA} {Pharmacy} {Informatics} {Conference} 2018}, author = {{Celena B Peters} and {Fran Cunningham} and {Adam Gordon} and {Hong Yu} and {Cedric Salone} and {Jessica Zacher} and {Ronald Carico} and {Jianwei Leng} and {Nikolh Durley} and {Weisong Liu} and {Chao-Chin Lu} and {Emily Druhl} and {Feifan Liu} and {Brian C Sauer}}, year = {2018}, }
2017
(9)
Ranking medical terms to support expansion of lay language resources for patient comprehension of electronic health record notes: adapted distant supervision approach.
Chen, J.; Jagannatha, A. N.; Fodeh, S. J.; and Yu, H.
JMIR medical informatics, 5(4): e42. October 2017.
doi link bibtex abstract
@article{chen_ranking_2017, title = {Ranking medical terms to support expansion of lay language resources for patient comprehension of electronic health record notes: adapted distant supervision approach}, volume = {5}, issn = {2291-9694}, shorttitle = {Ranking medical terms to support expansion of lay language resources for patient comprehension of electronic health record notes}, doi = {10.2196/medinform.8531}, abstract = {BACKGROUND: Medical terms are a major obstacle for patients to comprehend their electronic health record (EHR) notes. Clinical natural language processing (NLP) systems that link EHR terms to lay terms or definitions allow patients to easily access helpful information when reading through their EHR notes, and have shown to improve patient EHR comprehension. However, high-quality lay language resources for EHR terms are very limited in the public domain. Because expanding and curating such a resource is a costly process, it is beneficial and even necessary to identify terms important for patient EHR comprehension first. OBJECTIVE: We aimed to develop an NLP system, called adapted distant supervision (ADS), to rank candidate terms mined from EHR corpora. We will give EHR terms ranked as high by ADS a higher priority for lay language annotation-that is, creating lay definitions for these terms. METHODS: Adapted distant supervision uses distant supervision from consumer health vocabulary and transfer learning to adapt itself to solve the problem of ranking EHR terms in the target domain. We investigated 2 state-of-the-art transfer learning algorithms (ie, feature space augmentation and supervised distant supervision) and designed 5 types of learning features, including distributed word representations learned from large EHR data for ADS. For evaluating ADS, we asked domain experts to annotate 6038 candidate terms as important or nonimportant for EHR comprehension. We then randomly divided these data into the target-domain training data (1000 examples) and the evaluation data (5038 examples). We compared ADS with 2 strong baselines, including standard supervised learning, on the evaluation data. RESULTS: The ADS system using feature space augmentation achieved the best average precision, 0.850, on the evaluation set when using 1000 target-domain training examples. The ADS system using supervised distant supervision achieved the best average precision, 0.819, on the evaluation set when using only 100 target-domain training examples. The 2 ADS systems both performed significantly better than the baseline systems (P{\textless}.001 for all measures and all conditions). Using a rich set of learning features contributed to ADS's performance substantially. CONCLUSIONS: ADS can effectively rank terms mined from EHRs. Transfer learning improved ADS's performance even with a small number of target-domain training examples. EHR terms prioritized by ADS were used to expand a lay language resource that supports patient EHR comprehension. The top 10,000 EHR terms ranked by ADS are available upon request.}, language = {eng}, number = {4}, journal = {JMIR medical informatics}, author = {Chen, Jinying and Jagannatha, Abhyuday N. and Fodeh, Samah J. and Yu, Hong}, month = oct, year = {2017}, pmid = {29089288}, pmcid = {PMC5686421}, keywords = {Information extraction, electronic health records, lexical entry selection, natural language processing, transfer learning}, pages = {e42}, }
BACKGROUND: Medical terms are a major obstacle for patients to comprehend their electronic health record (EHR) notes. Clinical natural language processing (NLP) systems that link EHR terms to lay terms or definitions allow patients to easily access helpful information when reading through their EHR notes, and have been shown to improve patient EHR comprehension. However, high-quality lay language resources for EHR terms are very limited in the public domain. Because expanding and curating such a resource is a costly process, it is beneficial and even necessary to identify terms important for patient EHR comprehension first. OBJECTIVE: We aimed to develop an NLP system, called adapted distant supervision (ADS), to rank candidate terms mined from EHR corpora. We will give EHR terms ranked as high by ADS a higher priority for lay language annotation, that is, creating lay definitions for these terms. METHODS: Adapted distant supervision uses distant supervision from consumer health vocabulary and transfer learning to adapt itself to solve the problem of ranking EHR terms in the target domain. We investigated 2 state-of-the-art transfer learning algorithms (ie, feature space augmentation and supervised distant supervision) and designed 5 types of learning features, including distributed word representations learned from large EHR data for ADS. For evaluating ADS, we asked domain experts to annotate 6038 candidate terms as important or nonimportant for EHR comprehension. We then randomly divided these data into the target-domain training data (1000 examples) and the evaluation data (5038 examples). We compared ADS with 2 strong baselines, including standard supervised learning, on the evaluation data. RESULTS: The ADS system using feature space augmentation achieved the best average precision, 0.850, on the evaluation set when using 1000 target-domain training examples. The ADS system using supervised distant supervision achieved the best average precision, 0.819, on the evaluation set when using only 100 target-domain training examples. The 2 ADS systems both performed significantly better than the baseline systems (P<.001 for all measures and all conditions). Using a rich set of learning features contributed to ADS's performance substantially. CONCLUSIONS: ADS can effectively rank terms mined from EHRs. Transfer learning improved ADS's performance even with a small number of target-domain training examples. EHR terms prioritized by ADS were used to expand a lay language resource that supports patient EHR comprehension. The top 10,000 EHR terms ranked by ADS are available upon request.
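For readers unfamiliar with the feature space augmentation mentioned above, it follows the common three-copy domain-adaptation recipe; a minimal Python sketch with invented toy term features (not the authors' code):

import numpy as np

def augment(x, domain):
    # Map a feature vector to [shared, source-only, target-only] copies,
    # so a linear learner can share or specialize each feature per domain.
    zeros = np.zeros_like(x)
    if domain == "source":
        return np.concatenate([x, x, zeros])
    return np.concatenate([x, zeros, x])  # target domain

x_src = np.array([0.2, 1.0, 0.0])  # hypothetical features of one EHR term
x_tgt = np.array([0.7, 0.0, 1.0])
print(augment(x_src, "source"))    # 9-dimensional augmented vector
print(augment(x_tgt, "target"))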
Meta Networks.
Munkhdalai, T.; and Yu, H.
In ICML, volume 70, pages 2554–2563, Sydney, Australia, August 2017.
link bibtex abstract
@inproceedings{munkhdalai_meta_2017, address = {Sydney, Australia}, title = {Meta {Networks}}, volume = {70}, abstract = {Neural networks have been successfully applied in applications with a large amount of labeled data. However, the task of rapid generalization on new concepts with small training data while preserving performances on previously learned ones still presents a significant challenge to neural network models. In this work, we introduce a novel meta learning method, Meta Networks (MetaNet), that learns a meta-level knowledge across tasks and shifts its inductive biases via fast parameterization for rapid generalization. When evaluated on Omniglot and Mini-ImageNet benchmarks, our MetaNet models achieve a near human-level performance and outperform the baseline approaches by up to 6\% accuracy. We demonstrate several appealing properties of MetaNet relating to generalization and continual learning.}, booktitle = {{ICML}}, author = {Munkhdalai, Tsendsuren and Yu, Hong}, month = aug, year = {2017}, pmid = {31106300; PMCID: PMC6519722}, pages = {2554--2563}, }
Neural networks have been successfully applied in applications with a large amount of labeled data. However, the task of rapid generalization on new concepts with small training data while preserving performances on previously learned ones still presents a significant challenge to neural network models. In this work, we introduce a novel meta learning method, Meta Networks (MetaNet), that learns a meta-level knowledge across tasks and shifts its inductive biases via fast parameterization for rapid generalization. When evaluated on Omniglot and Mini-ImageNet benchmarks, our MetaNet models achieve a near human-level performance and outperform the baseline approaches by up to 6% accuracy. We demonstrate several appealing properties of MetaNet relating to generalization and continual learning.
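As a rough illustration of the fast parameterization idea (a toy linear meta learner with invented shapes, not the MetaNet architecture):

import numpy as np

rng = np.random.default_rng(0)
W_slow = rng.normal(size=(2, 4)) * 0.1  # slowly learned base weights
meta = rng.normal(size=(8, 8)) * 0.1    # meta learner: maps gradients to fast weights

def predict(x, W_fast):
    # The base learner combines slow weights with example-conditioned fast weights.
    return (W_slow + W_fast) @ x

x, y = rng.normal(size=4), np.array([1.0, 0.0])
err = predict(x, np.zeros_like(W_slow)) - y   # error signal from one support example
grad = np.outer(err, x)                       # loss gradient w.r.t. the slow weights
W_fast = (meta @ grad.ravel()).reshape(W_slow.shape)  # rapid per-example parameterization
print(predict(x, W_fast))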
Neural Semantic Encoders.
Munkhdalai, T; and Yu, H.
In European Chapter of the Association for Computational Linguistics 2017 (EACL), volume 1, pages 397–407, April 2017.
Paper link bibtex abstract
@inproceedings{munkhdalai_neural_2017, title = {Neural {Semantic} {Encoders}}, volume = {1}, url = {https://arxiv.org/pdf/1607.04315v2.pdf}, abstract = {We present a memory augmented neural network for natural language understanding: Neural Semantic Encoders. NSE is equipped with a novel memory update rule and has a variable sized encoding memory that evolves over time and maintains the understanding of input sequences through read\vphantom{\{}\}, compose and write operations. NSE can also access multiple and shared memories. In this paper, we demonstrated the effectiveness and the flexibility of NSE on five different natural language tasks: natural language inference, question answering, sentence classification, document sentiment analysis and machine translation where NSE achieved state-of-the-art performance when evaluated on publically available benchmarks. For example, our shared-memory model showed an encouraging result on neural machine translation, improving an attention-based baseline by approximately 1.0 BLEU.}, booktitle = {European {Chapter} of the {Association} for {Computational} {Linguistics} 2017 ({EACL})}, author = {Munkhdalai, T and Yu, Hong}, month = apr, year = {2017}, pmid = {29081578 PMCID: PMC5657452}, pages = {397--407}, }
We present a memory augmented neural network for natural language understanding: Neural Semantic Encoders. NSE is equipped with a novel memory update rule and has a variable sized encoding memory that evolves over time and maintains the understanding of input sequences through read, compose and write operations. NSE can also access multiple and shared memories. In this paper, we demonstrated the effectiveness and the flexibility of NSE on five different natural language tasks: natural language inference, question answering, sentence classification, document sentiment analysis and machine translation where NSE achieved state-of-the-art performance when evaluated on publicly available benchmarks. For example, our shared-memory model showed an encouraging result on neural machine translation, improving an attention-based baseline by approximately 1.0 BLEU.
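A minimal sketch of the read, compose, and write loop (toy NumPy version; the paper's write is a soft update over all slots, simplified here to the most-attended slot):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nse_step(x_t, memory, W):
    scores = memory @ x_t                          # read: attention over memory slots
    alpha = softmax(scores)
    m_t = alpha @ memory                           # retrieved memory summary
    c_t = np.tanh(W @ np.concatenate([x_t, m_t]))  # compose input with memory
    memory = memory.copy()
    memory[int(alpha.argmax())] = c_t              # write: update the attended slot
    return c_t, memory

rng = np.random.default_rng(1)
n, d = 5, 8                          # sequence length, embedding size
X = rng.normal(size=(n, d))          # token embeddings
memory = X.copy()                    # variable-size memory, one slot per token
W = rng.normal(size=(d, 2 * d)) * 0.1
for t in range(n):
    out, memory = nse_step(X[t], memory, W)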
CIFT: Crowd-Informed Fine-Tuning to Improve Machine Learning Ability.
Lalor, J; Wu, H; and Yu, H
In February 2017.
link bibtex abstract
@inproceedings{lalor_cift:_2017, title = {{CIFT}: {Crowd}-{Informed} {Fine}-{Tuning} to {Improve} {Machine} {Learning} {Ability}.}, abstract = {tem Response Theory (IRT) allows for measuring ability of Machine Learning models as compared to a human population. However, it is difficult to create a large dataset to train the ability of deep neural network models (DNNs). We propose Crowd-Informed Fine-Tuning (CIFT) as a new training process, where a pre-trained model is fine-tuned with a specialized supplemental training set obtained via IRT model-fitting on a large set of crowdsourced response patterns. With CIFT we can leverage the specialized set of data obtained through IRT to inform parameter tuning in DNNs. We experiment with two loss functions in CIFT to represent (i) memorization of fine-tuning items and (ii) learning a probability distribution over potential labels that is similar to the crowdsourced distribution over labels to simulate crowd knowledge. Our results show that CIFT improves ability for a state-of-the-art DNN model for Recognizing Textual Entailment (RTE) tasks and is generalizable to a large-scale RTE test set.}, author = {Lalor, J and Wu, H and Yu, H}, month = feb, year = {2017}, }
Item Response Theory (IRT) allows for measuring the ability of machine learning models as compared to a human population. However, it is difficult to create a large dataset to train the ability of deep neural network models (DNNs). We propose Crowd-Informed Fine-Tuning (CIFT) as a new training process, where a pre-trained model is fine-tuned with a specialized supplemental training set obtained via IRT model-fitting on a large set of crowdsourced response patterns. With CIFT we can leverage the specialized set of data obtained through IRT to inform parameter tuning in DNNs. We experiment with two loss functions in CIFT to represent (i) memorization of fine-tuning items and (ii) learning a probability distribution over potential labels that is similar to the crowdsourced distribution over labels to simulate crowd knowledge. Our results show that CIFT improves ability for a state-of-the-art DNN model for Recognizing Textual Entailment (RTE) tasks and is generalizable to a large-scale RTE test set.
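The two loss functions can be written down compactly; a sketch, assuming a 3-way RTE label set and an invented crowd distribution:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cift_losses(logits, crowd_dist, gold):
    # (i) memorization: cross-entropy against the gold label;
    # (ii) crowd matching: KL(crowd || model) over the label distribution.
    p = softmax(logits)
    memorization = -np.log(p[gold])
    kl = np.sum(crowd_dist * np.log(crowd_dist / p))
    return memorization, kl

logits = np.array([2.0, 0.5, -1.0])  # model scores: entail/neutral/contradict
crowd = np.array([0.6, 0.3, 0.1])    # hypothetical crowdsourced response pattern
print(cift_losses(logits, crowd, gold=0))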
Assessing Electronic Health Record Readability.
Zheng, J; and Yu, H
In 2017.
link bibtex
@inproceedings{zheng_assessing_2017, title = {Assessing {Electronic} {Health} {Record} {Readability}.}, author = {Zheng, J and Yu, H}, year = {2017}, }
Reasoning with memory augmented neural networks for language comprehension.
Munkhdalai, T.; and Yu, H.
5th International Conference on Learning Representations (ICLR). 2017.
Paper link bibtex abstract
@article{munkhdalai_reasoning_2017, title = {Reasoning with memory augmented neural networks for language comprehension.}, url = {https://arxiv.org/abs/1610.06454}, abstract = {Hypothesis testing is an important cognitive process that supports human reasoning. In this paper, we introduce a computational hypothesis testing approach based on memory augmented neural networks. Our approach involves a hypothesis testing loop that reconsiders and progressively refines a previously formed hypothesis in order to generate new hypotheses to test. We apply the proposed approach to language comprehension task by using Neural Semantic Encoders (NSE). Our NSE models achieve the state-of-the-art results showing an absolute improvement of 1.2\% to 2.6\% accuracy over previous results obtained by single and ensemble systems on standard machine comprehension benchmarks such as the Children's Book Test (CBT) and Who-Did-What (WDW) news article datasets.}, urldate = {2017-06-02}, journal = {5th International Conference on Learning Representations (ICLR)}, author = {Munkhdalai, Tsendsuren and Yu, Hong}, year = {2017}, }
Hypothesis testing is an important cognitive process that supports human reasoning. In this paper, we introduce a computational hypothesis testing approach based on memory augmented neural networks. Our approach involves a hypothesis testing loop that reconsiders and progressively refines a previously formed hypothesis in order to generate new hypotheses to test. We apply the proposed approach to the language comprehension task by using Neural Semantic Encoders (NSE). Our NSE models achieve the state-of-the-art results showing an absolute improvement of 1.2% to 2.6% accuracy over previous results obtained by single and ensemble systems on standard machine comprehension benchmarks such as the Children's Book Test (CBT) and Who-Did-What (WDW) news article datasets.
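The refinement loop can be caricatured in a few lines (a toy attention-based update, not the paper's NSE-based model):

import numpy as np

def refine(hypothesis, evidence, steps=3):
    # Each pass re-reads the evidence in light of the current hypothesis
    # and folds the attended evidence back in, sharpening the hypothesis.
    for _ in range(steps):
        scores = evidence @ hypothesis
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()
        hypothesis = np.tanh(hypothesis + alpha @ evidence)
    return hypothesis

rng = np.random.default_rng(2)
print(refine(rng.normal(size=8), rng.normal(size=(10, 8))))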
Readability Formulas and User Perceptions of Electronic Health Records Difficulty: A Corpus Study.
Zheng, J.; and Yu, H.
Journal of Medical Internet Research, 19(3): e59. 2017.
Paper doi link bibtex abstract
@article{zheng_readability_2017, title = {Readability {Formulas} and {User} {Perceptions} of {Electronic} {Health} {Records} {Difficulty}: {A} {Corpus} {Study}}, volume = {19}, copyright = {Unless stated otherwise, all articles are open-access distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work (}, shorttitle = {Readability {Formulas} and {User} {Perceptions} of {Electronic} {Health} {Records} {Difficulty}}, url = {https://www.jmir.org/2017/3/e59/}, doi = {10.2196/jmir.6962}, abstract = {Background: Electronic health records (EHRs) are a rich resource for developing applications to engage patients and foster patient activation, thus holding a strong potential to enhance patient-centered care. Studies have shown that providing patients with access to their own EHR notes may improve the understanding of their own clinical conditions and treatments, leading to improved health care outcomes. However, the highly technical language in EHR notes impedes patients’ comprehension. Numerous studies have evaluated the difficulty of health-related text using readability formulas such as Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), and Gunning-Fog Index (GFI). They conclude that the materials are often written at a grade level higher than common recommendations. Objective: The objective of our study was to explore the relationship between the aforementioned readability formulas and the laypeople’s perceived difficulty on 2 genres of text: general health information and EHR notes. We also validated the formulas’ appropriateness and generalizability on predicting difficulty levels of highly complex technical documents. Methods: We collected 140 Wikipedia articles on diabetes and 242 EHR notes with diabetes International Classification of Diseases, Ninth Revision code. We recruited 15 Amazon Mechanical Turk (AMT) users to rate difficulty levels of the documents. Correlations between laypeople’s perceived difficulty levels and readability formula scores were measured, and their difference was tested. We also compared word usage and the impact of medical concepts of the 2 genres of text. Results: The distributions of both readability formulas’ scores (P{\textless}.001) and laypeople’s perceptions (P=.002) on the 2 genres were different. Correlations of readability predictions and laypeople’s perceptions were weak. Furthermore, despite being graded at similar levels, documents of different genres were still perceived with different difficulty (P{\textless}.001). Word usage in the 2 related genres still differed significantly (P{\textless}.001). Conclusions: Our findings suggested that the readability formulas’ predictions did not align with perceived difficulty in either text genre. The widely used readability formulas were highly correlated with each other but did not show adequate correlation with readers’ perceived difficulty. Therefore, they were not appropriate to assess the readability of EHR notes. [J Med Internet Res 2017;19(3):e59]}, language = {en}, number = {3}, urldate = {2017-03-06}, journal = {Journal of Medical Internet Research}, author = {Zheng, Jiaping and Yu, Hong}, year = {2017}, pmid = {28254738 PMCID: PMC5355629}, pages = {e59}, }
Background: Electronic health records (EHRs) are a rich resource for developing applications to engage patients and foster patient activation, thus holding a strong potential to enhance patient-centered care. Studies have shown that providing patients with access to their own EHR notes may improve the understanding of their own clinical conditions and treatments, leading to improved health care outcomes. However, the highly technical language in EHR notes impedes patients’ comprehension. Numerous studies have evaluated the difficulty of health-related text using readability formulas such as Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), and Gunning-Fog Index (GFI). They conclude that the materials are often written at a grade level higher than common recommendations. Objective: The objective of our study was to explore the relationship between the aforementioned readability formulas and the laypeople’s perceived difficulty on 2 genres of text: general health information and EHR notes. We also validated the formulas’ appropriateness and generalizability on predicting difficulty levels of highly complex technical documents. Methods: We collected 140 Wikipedia articles on diabetes and 242 EHR notes with diabetes International Classification of Diseases, Ninth Revision code. We recruited 15 Amazon Mechanical Turk (AMT) users to rate difficulty levels of the documents. Correlations between laypeople’s perceived difficulty levels and readability formula scores were measured, and their difference was tested. We also compared word usage and the impact of medical concepts of the 2 genres of text. Results: The distributions of both readability formulas’ scores (P<.001) and laypeople’s perceptions (P=.002) on the 2 genres were different. Correlations of readability predictions and laypeople’s perceptions were weak. Furthermore, despite being graded at similar levels, documents of different genres were still perceived with different difficulty (P<.001). Word usage in the 2 related genres still differed significantly (P<.001). Conclusions: Our findings suggested that the readability formulas’ predictions did not align with perceived difficulty in either text genre. The widely used readability formulas were highly correlated with each other but did not show adequate correlation with readers’ perceived difficulty. Therefore, they were not appropriate to assess the readability of EHR notes. [J Med Internet Res 2017;19(3):e59]
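The three formulas named above are simple functions of surface counts; for reference, their standard published coefficients, evaluated on counts from a hypothetical 100-word note:

def fkgl(words, sentences, syllables):
    # Flesch-Kincaid Grade Level
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def smog(polysyllables, sentences):
    # Simple Measure of Gobbledygook, normalized to 30 sentences
    return 1.0430 * (polysyllables * 30 / sentences) ** 0.5 + 3.1291

def gfi(words, sentences, complex_words):
    # Gunning-Fog Index
    return 0.4 * (words / sentences + 100 * complex_words / words)

print(fkgl(100, 8, 180), smog(22, 8), gfi(100, 8, 22))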
Neural Tree Indexers for Text Understanding.
Munkhdalai, T.; and Yu, H.
In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 11–21, Valencia, Spain, April 2017. Association for Computational Linguistics
Paper link bibtex abstract
@inproceedings{munkhdalai_neural_2017-1, address = {Valencia, Spain}, title = {Neural {Tree} {Indexers} for {Text} {Understanding}}, url = {http://www.aclweb.org/anthology/E17-1002}, abstract = {Recurrent neural networks (RNNs) process input text sequentially and model the conditional transition between word tokens. In contrast, the advantages of recursive networks include that they explicitly model the compositionality and the recursive structure of natural language. However, the current recursive architecture is limited by its dependence on syntactic tree. In this paper, we introduce a robust syntactic parsing-independent tree structured model, Neural Tree Indexers (NTI) that provides a middle ground between the sequential RNNs and the syntactic treebased recursive models. NTI constructs a full n-ary tree by processing the input text with its node function in a bottom-up fashion. Attention mechanism can then be applied to both structure and node function. We implemented and evaluated a binary tree model of NTI, showing the model achieved the state-of-the-art performance on three different NLP tasks: natural language inference, answer sentence selection, and sentence classification, outperforming state-of-the-art recurrent and recursive neural networks.}, urldate = {2017-04-02}, booktitle = {Proceedings of the 15th {Conference} of the {European} {Chapter} of the {Association} for {Computational} {Linguistics}: {Volume} 1, {Long} {Papers}}, publisher = {Association for Computational Linguistics}, author = {Munkhdalai, Tsendsuren and Yu, Hong}, month = apr, year = {2017}, pages = {11--21}, }
Recurrent neural networks (RNNs) process input text sequentially and model the conditional transition between word tokens. In contrast, the advantages of recursive networks include that they explicitly model the compositionality and the recursive structure of natural language. However, the current recursive architecture is limited by its dependence on a syntactic tree. In this paper, we introduce a robust syntactic parsing-independent tree structured model, Neural Tree Indexers (NTI), that provides a middle ground between the sequential RNNs and the syntactic tree-based recursive models. NTI constructs a full n-ary tree by processing the input text with its node function in a bottom-up fashion. An attention mechanism can then be applied to both structure and node function. We implemented and evaluated a binary tree model of NTI, showing the model achieved the state-of-the-art performance on three different NLP tasks: natural language inference, answer sentence selection, and sentence classification, outperforming state-of-the-art recurrent and recursive neural networks.
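A toy bottom-up pass over a full binary tree (no attention, invented dimensions; unpaired nodes are carried up a level):

import numpy as np

def compose(left, right, W):
    # Node function: one nonlinear layer over the two children.
    return np.tanh(W @ np.concatenate([left, right]))

def nti_encode(leaves, W):
    level = list(leaves)
    while len(level) > 1:
        nxt = [compose(level[i], level[i + 1], W)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # carry the unpaired node upward
        level = nxt
    return level[0]                 # root vector encodes the sentence

rng = np.random.default_rng(3)
d = 6
W = rng.normal(size=(d, 2 * d)) * 0.1
tokens = rng.normal(size=(5, d))    # five token embeddings
print(nti_encode(tokens, W))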
Generating a Test of Electronic Health Record Narrative Comprehension with Item Response Theory.
Lalor, J; Wu, H; Chen, L; Mazor, K; and Yu, H
In November 2017.
link bibtex abstract
@inproceedings{lalor_generating_2017, title = {Generating a {Test} of {Electronic} {Health} {Record} {Narrative} {Comprehension} with {Item} {Response} {Theory}.}, abstract = {In this work, we report the development of a new instrument to test patients' ability to comprehend EHR notes. Our instrument comprises of a test set of question and answer pairs that are based on the semantic content of EHR notes and selected using the psychometrics method Item Response Theory.}, author = {Lalor, J and Wu, H and Chen, L and Mazor, K and Yu, H}, month = nov, year = {2017}, }
In this work, we report the development of a new instrument to test patients' ability to comprehend EHR notes. Our instrument comprises a test set of question and answer pairs that are based on the semantic content of EHR notes and selected using the psychometric method Item Response Theory.
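Under the commonly used two-parameter logistic IRT model, item selection is driven by each question's fitted discrimination and difficulty; a sketch with invented parameters:

import numpy as np

def p_correct(theta, a, b):
    # 2PL IRT: probability an examinee of ability theta answers correctly,
    # given item discrimination a and difficulty b.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]  # hypothetical (a, b) per question
theta = 0.3                                    # one examinee's ability
for a, b in items:
    print(round(p_correct(theta, a, b), 3))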
2016
(5)
Structured prediction models for RNN based sequence labeling in clinical text.
Jagannatha, A. N.; and Yu, H.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2016, pages 856–865, November 2016.
link bibtex abstract
@inproceedings{jagannatha_structured_2016, title = {Structured prediction models for {RNN} based sequence labeling in clinical text}, volume = {2016}, abstract = {Sequence labeling is a widely used method for named entity recognition and information extraction from unstructured natural language data. In clinical domain one major application of sequence labeling involves extraction of medical entities such as medication, indication, and side-effects from Electronic Health Record narratives. Sequence labeling in this domain, presents its own set of challenges and objectives. In this work we experimented with various CRF based structured learning models with Recurrent Neural Networks. We extend the previously studied LSTM-CRF models with explicit modeling of pairwise potentials. We also propose an approximate version of skip-chain CRF inference with RNN potentials. We use these methodologies for structured prediction in order to improve the exact phrase detection of various medical entities.}, language = {eng}, booktitle = {Proceedings of the {Conference} on {Empirical} {Methods} in {Natural} {Language} {Processing}}, author = {Jagannatha, Abhyuday N. and Yu, Hong}, month = nov, year = {2016}, pmid = {28004040 PMCID: PMC5167535}, keywords = {Computer Science - Computation and Language}, pages = {856--865}, }
Sequence labeling is a widely used method for named entity recognition and information extraction from unstructured natural language data. In the clinical domain, one major application of sequence labeling involves extraction of medical entities such as medication, indication, and side-effects from Electronic Health Record narratives. Sequence labeling in this domain presents its own set of challenges and objectives. In this work we experimented with various CRF based structured learning models with Recurrent Neural Networks. We extend the previously studied LSTM-CRF models with explicit modeling of pairwise potentials. We also propose an approximate version of skip-chain CRF inference with RNN potentials. We use these methodologies for structured prediction in order to improve the exact phrase detection of various medical entities.
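At decoding time, a linear-chain CRF over RNN emissions reduces to Viterbi search; a compact sketch with a toy tag set and random scores:

import numpy as np

def viterbi(emissions, transitions):
    # emissions[t, y]: RNN score for tag y at step t (log space);
    # transitions[i, j]: pairwise potential for tag i -> tag j.
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(4)
tags = ["O", "B-MED", "I-MED"]   # toy medication tag set
best = viterbi(rng.normal(size=(4, 3)), rng.normal(size=(3, 3)))
print([tags[y] for y in best])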
RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism.
Choi, E.; Bahadori, M. T.; Sun, J.; Kulas, J.; Schuetz, A.; and Stewart, W.
In Advances in Neural Information Processing Systems, pages 3504–3512, 2016.
Paper link bibtex
@inproceedings{choi_retain:_2016, title = {{RETAIN}: {An} {Interpretable} {Predictive} {Model} for {Healthcare} using {Reverse} {Time} {Attention} {Mechanism}}, shorttitle = {{RETAIN}}, url = {http://papers.nips.cc/paper/6321-retain-an-interpretable-predictive-model-for-healthcare-using-reverse-time-attention-mechanism}, urldate = {2017-01-12}, booktitle = {Advances in {Neural} {Information} {Processing} {Systems}}, author = {Choi, Edward and Bahadori, Mohammad Taha and Sun, Jimeng and Kulas, Joshua and Schuetz, Andy and Stewart, Walter}, year = {2016}, pages = {3504--3512}, }
Learning to Rank Scientific Documents from the Crowd.
Lingeman, J. M; and Yu, H.
arXiv:1611.01400. November 2016.
Paper link bibtex abstract
@article{lingeman_learning_2016, title = {Learning to {Rank} {Scientific} {Documents} from the {Crowd}}, url = {https://arxiv.org/pdf/1611.01400v1.pdf}, abstract = {Finding related published articles is an important task in any science, but with the explosion of new work in the biomedical domain it has become especially challenging. Most existing methodologies use text similarity metrics to identify whether two articles are related or not. However biomedical knowledge discovery is hypothesis-driven. The most related articles may not be ones with the highest text similarities. In this study, we first develop an innovative crowd-sourcing approach to build an expert-annotated document-ranking corpus. Using this corpus as the gold standard, we then evaluate the approaches of using text similarity to rank the relatedness of articles. Finally, we develop and evaluate a new supervised model to automatically rank related scientific articles. Our results show that authors' ranking differ significantly from rankings by text-similarity-based models. By training a learning-to-rank model on a subset of the annotated corpus, we found the best supervised learning-to-rank model (SVM-Rank) significantly surpassed state-of-the-art baseline systems.}, journal = {arXiv:1611.01400}, author = {Lingeman, Jesse M and Yu, Hong}, month = nov, year = {2016}, }
Finding related published articles is an important task in any science, but with the explosion of new work in the biomedical domain it has become especially challenging. Most existing methodologies use text similarity metrics to identify whether two articles are related or not. However biomedical knowledge discovery is hypothesis-driven. The most related articles may not be ones with the highest text similarities. In this study, we first develop an innovative crowd-sourcing approach to build an expert-annotated document-ranking corpus. Using this corpus as the gold standard, we then evaluate the approaches of using text similarity to rank the relatedness of articles. Finally, we develop and evaluate a new supervised model to automatically rank related scientific articles. Our results show that authors' rankings differ significantly from rankings by text-similarity-based models. By training a learning-to-rank model on a subset of the annotated corpus, we found the best supervised learning-to-rank model (SVM-Rank) significantly surpassed state-of-the-art baseline systems.
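Pairwise learning to rank of the SVM-Rank flavor trains on preference pairs; a bare-bones SGD sketch with random stand-in features:

import numpy as np

def hinge_grad(w, x_hi, x_lo, margin=1.0):
    # The article the expert ranked higher should outscore the lower one
    # by at least the margin; otherwise push the weights apart.
    diff = x_hi - x_lo
    return -diff if w @ diff < margin else np.zeros_like(w)

rng = np.random.default_rng(5)
w = np.zeros(5)
pairs = [(rng.normal(size=5), rng.normal(size=5)) for _ in range(200)]
for x_hi, x_lo in pairs:   # plain SGD over preference pairs
    w -= 0.1 * hinge_grad(w, x_hi, x_lo)
print(w)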
Learning for Biomedical Information Extraction: Methodological Review of Recent Advances.
Liu, F.; Chen, J.; Jagannatha, A.; and Yu, H.
arXiv:1606.07993. June 2016.
Paper link bibtex abstract
@article{liu_learning_2016, title = {Learning for {Biomedical} {Information} {Extraction}: {Methodological} {Review} of {Recent} {Advances}}, url = {https://arxiv.org/ftp/arxiv/papers/1606/1606.07993.pdf}, abstract = {Biomedical information extraction (BioIE) is important to many applications, including clinical decision support, integrative biology, and pharmacovigilance, and therefore it has been an active research. Unlike existing reviews covering a holistic view on BioIE, this review focuses on mainly recent advances in learning based approaches, by systematically summarizing them into different aspects of methodological development. In addition, we dive into open information extraction and deep learning, two emerging and influential techniques and envision next generation of BioIE.}, journal = {arXiv:1606.07993}, author = {Liu, Feifan and Chen, Jinying and Jagannatha, Abhyuday and Yu, Hong}, month = jun, year = {2016}, }
Biomedical information extraction (BioIE) is important to many applications, including clinical decision support, integrative biology, and pharmacovigilance, and therefore it has been an active research area. Unlike existing reviews covering a holistic view on BioIE, this review focuses mainly on recent advances in learning-based approaches, by systematically summarizing them into different aspects of methodological development. In addition, we dive into open information extraction and deep learning, two emerging and influential techniques, and envision the next generation of BioIE.
Citation Analysis with Neural Attention Models.
Munkhdalai, M; Lalor, J; and Yu, H
In Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis (LOUHI), pages 69–77, Austin, TX, November 2016. Association for Computational Linguistics
Paper doi link bibtex
@inproceedings{munkhdalai_citation_2016, address = {Austin, TX}, title = {Citation {Analysis} with {Neural} {Attention} {Models}}, url = {http://www.aclweb.org/anthology/W/W16/W16-6109.pdf}, doi = {10.18653/v1/W16-6109}, booktitle = {Proceedings of the {Seventh} {International} {Workshop} on {Health} {Text} {Mining} and {Information} {Analysis} ({LOUHI}) ,}, publisher = {Association for Computational Linguistics}, author = {Munkhdalai, M and Lalor, J and Yu, H}, month = nov, year = {2016}, pages = {69--77}, }
2015
(3)
Translating Electronic Health Record Notes from English to Spanish: A Preliminary Study.
Liu, W.; Cai, S.; Balaji, R.; Chiriboga, G.; Knight, K.; and Yu, H.
In ACL-IJCNLP, page 134, Beijing, China, July 2015.
Paper doi link bibtex
@inproceedings{liu_translating_2015, address = {Bei Jing, China}, title = {Translating {Electronic} {Health} {Record} {Notes} from {English} to {Spanish}: {A} {Preliminary} {Study}}, url = {http://aclweb.org/anthology/W/W15/W15-3816.pdf}, doi = {10.18653/v1/W15-3816}, booktitle = {{ACL}-{IJCNLP}}, author = {Liu, Weisong and Cai, Shu and Balaji, Ramesh and Chiriboga, German and Knight, Kevin and Yu, Hong}, month = jul, year = {2015}, pages = {134}, }
Figure-Associated Text Summarization and Evaluation.
Polepalli Ramesh, B.; Sethi, R. J.; and Yu, H.
PLOS ONE, 10(2): e0115671. February 2015.
Paper doi link bibtex
@article{polepalli_ramesh_figure-associated_2015, title = {Figure-{Associated} {Text} {Summarization} and {Evaluation}}, volume = {10}, issn = {1932-6203}, url = {http://dx.plos.org/10.1371/journal.pone.0115671}, doi = {10.1371/journal.pone.0115671}, language = {en}, number = {2}, urldate = {2015-02-26}, journal = {PLOS ONE}, author = {Polepalli Ramesh, Balaji and Sethi, Ricky J. and Yu, Hong}, editor = {Sarkar, Indra Neil}, month = feb, year = {2015}, pmid = {25643357 PMCID: PMC4313946}, pages = {e0115671}, }
DeTEXT: A Database for Evaluating Text Extraction from Biomedical Literature Figures.
Yin, X.; Yang, C.; Pei, W.; Man, H.; Zhang, J.; Learned-Miller, E.; and Yu, H.
PLoS ONE, 10(5). May 2015.
Paper doi link bibtex abstract
@article{yin_detext:_2015, title = {{DeTEXT}: {A} {Database} for {Evaluating} {Text} {Extraction} from {Biomedical} {Literature} {Figures}}, volume = {10}, issn = {1932-6203}, shorttitle = {{DeTEXT}}, url = {http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4423993/}, doi = {10.1371/journal.pone.0126200}, abstract = {Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in the task of mining figure information. A high-quality ground truth standard can greatly facilitate the development of an automated system. This article describes DeTEXT: A database for evaluating text extraction from biomedical literature figures. It is the first publicly available, human-annotated, high quality, and large-scale figure-text dataset with 288 full-text articles, 500 biomedical figures, and 9308 text regions. This article describes how figures were selected from open-access full-text biomedical articles and how annotation guidelines and annotation tools were developed. We also discuss the inter-annotator agreement and the reliability of the annotations. We summarize the statistics of the DeTEXT data and make available evaluation protocols for DeTEXT. Finally we lay out challenges we observed in the automated detection and recognition of figure text and discuss research directions in this area. DeTEXT is publicly available for downloading at http://prir.ustb.edu.cn/DeTEXT/.}, number = {5}, urldate = {2015-06-03}, journal = {PLoS ONE}, author = {Yin, Xu-Cheng and Yang, Chun and Pei, Wei-Yi and Man, Haixia and Zhang, Jun and Learned-Miller, Erik and Yu, Hong}, month = may, year = {2015}, pmid = {25951377 PMCID: PMC4423993}, }
Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in the task of mining figure information. A high-quality ground truth standard can greatly facilitate the development of an automated system. This article describes DeTEXT: A database for evaluating text extraction from biomedical literature figures. It is the first publicly available, human-annotated, high quality, and large-scale figure-text dataset with 288 full-text articles, 500 biomedical figures, and 9308 text regions. This article describes how figures were selected from open-access full-text biomedical articles and how annotation guidelines and annotation tools were developed. We also discuss the inter-annotator agreement and the reliability of the annotations. We summarize the statistics of the DeTEXT data and make available evaluation protocols for DeTEXT. Finally we lay out challenges we observed in the automated detection and recognition of figure text and discuss research directions in this area. DeTEXT is publicly available for downloading at http://prir.ustb.edu.cn/DeTEXT/.
2014
(3)
Learning to Rank Figures within a Biomedical Article.
Liu, F.; and Yu, H.
PLoS ONE, 9(3): e61567. March 2014.
Paper doi link bibtex abstract
@article{liu_learning_2014, title = {Learning to {Rank} {Figures} within a {Biomedical} {Article}}, volume = {9}, issn = {1932-6203}, url = {http://dx.plos.org/10.1371/journal.pone.0061567}, doi = {10.1371/journal.pone.0061567}, abstract = {Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. This ever-increasing sheer volume has made it difficult for scientists to effectively and accurately access figures of their interest, the process of which is crucial for validating research facts and for formulating or testing novel research hypotheses. Current figure search applications can't fully meet this challenge as the "bag of figures" assumption doesn't take into account the relationship among figures. In our previous study, hundreds of biomedical researchers have annotated articles in which they serve as corresponding authors. They ranked each figure in their paper based on a figure's importance at their discretion, referred to as "figure ranking". Using this collection of annotated data, we investigated computational approaches to automatically rank figures. We exploited and extended the state-of-the-art listwise learning-to-rank algorithms and developed a new supervised-learning model BioFigRank. The cross-validation results show that BioFigRank yielded the best performance compared with other state-of-the-art computational models, and the greedy feature selection can further boost the ranking performance significantly. Furthermore, we carry out the evaluation by comparing BioFigRank with three-level competitive domain-specific human experts: (1) First Author, (2) Non-Author-In-Domain-Expert who is not the author nor co-author of an article but who works in the same field of the corresponding author of the article, and (3) Non-Author-Out-Domain-Expert who is not the author nor co-author of an article and who may or may not work in the same field of the corresponding author of an article. Our results show that BioFigRank outperforms Non-Author-Out-Domain-Expert and performs as well as Non-Author-In-Domain-Expert. Although BioFigRank underperforms First Author, since most biomedical researchers are either in- or out-domain-experts for an article, we conclude that BioFigRank represents an artificial intelligence system that offers expert-level intelligence to help biomedical researchers to navigate increasingly proliferated big data efficiently.}, language = {en}, number = {3}, urldate = {2015-02-26}, journal = {PLoS ONE}, author = {Liu, Feifan and Yu, Hong}, editor = {Preis, Tobias}, month = mar, year = {2014}, pmid = {24625719 PMCID: PMC3953065}, pages = {e61567}, }
Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. This ever-increasing sheer volume has made it difficult for scientists to effectively and accurately access figures of their interest, the process of which is crucial for validating research facts and for formulating or testing novel research hypotheses. Current figure search applications can't fully meet this challenge as the "bag of figures" assumption doesn't take into account the relationship among figures. In our previous study, hundreds of biomedical researchers have annotated articles in which they serve as corresponding authors. They ranked each figure in their paper based on a figure's importance at their discretion, referred to as "figure ranking". Using this collection of annotated data, we investigated computational approaches to automatically rank figures. We exploited and extended the state-of-the-art listwise learning-to-rank algorithms and developed a new supervised-learning model BioFigRank. The cross-validation results show that BioFigRank yielded the best performance compared with other state-of-the-art computational models, and the greedy feature selection can further boost the ranking performance significantly. Furthermore, we carry out the evaluation by comparing BioFigRank with three-level competitive domain-specific human experts: (1) First Author, (2) Non-Author-In-Domain-Expert who is not the author nor co-author of an article but who works in the same field of the corresponding author of the article, and (3) Non-Author-Out-Domain-Expert who is not the author nor co-author of an article and who may or may not work in the same field of the corresponding author of an article. Our results show that BioFigRank outperforms Non-Author-Out-Domain-Expert and performs as well as Non-Author-In-Domain-Expert. Although BioFigRank underperforms First Author, since most biomedical researchers are either in- or out-domain-experts for an article, we conclude that BioFigRank represents an artificial intelligence system that offers expert-level intelligence to help biomedical researchers to navigate increasingly proliferated big data efficiently.
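Listwise learning to rank, as extended here, scores the whole figure list at once; a sketch of a ListNet-style top-one loss, mapping gold ranks to a target distribution (a simplification, not the published BioFigRank objective):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def listwise_loss(scores, gold_ranks):
    # Cross-entropy between the top-one probabilities implied by the
    # gold figure ranks (rank 1 = most important) and by model scores.
    target = softmax(-np.asarray(gold_ranks, dtype=float))
    pred = softmax(np.asarray(scores, dtype=float))
    return -np.sum(target * np.log(pred))

print(listwise_loss([1.4, 0.2, 0.9], [1, 3, 2]))  # 3 figures in one article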
Computational Approaches for Predicting Biomedical Research Collaborations.
Zhang, Q.; and Yu, H.
PLoS ONE, 9(11): e111795. November 2014.
Paper doi link bibtex abstract
@article{zhang_computational_2014, title = {Computational {Approaches} for {Predicting} {Biomedical} {Research} {Collaborations}}, volume = {9}, issn = {1932-6203}, url = {http://dx.plos.org/10.1371/journal.pone.0111795}, doi = {10.1371/journal.pone.0111795}, abstract = {Biomedical research is increasingly collaborative, and successful collaborations often produce high impact work. Computational approaches can be developed for automatically predicting biomedical research collaborations. Previous works of collaboration prediction mainly explored the topological structures of research collaboration networks, leaving out rich semantic information from the publications themselves. In this paper, we propose supervised machine learning approaches to predict research collaborations in the biomedical field. We explored both the semantic features extracted from author research interest profile and the author network topological features. We found that the most informative semantic features for author collaborations are related to research interest, including similarity of out-citing citations, similarity of abstracts. Of the four supervised machine learning models (naïve Bayes, naïve Bayes multinomial, SVMs, and logistic regression), the best performing model is logistic regression with an ROC ranging from 0.766 to 0.980 on different datasets. To our knowledge we are the first to study in depth how research interest and productivities can be used for collaboration prediction. Our approach is computationally efficient, scalable and yet simple to implement. The datasets of this study are available at https://github.com/qingzhanggithub/medline-collaboration-datasets.}, language = {en}, number = {11}, urldate = {2015-02-26}, journal = {PLoS ONE}, author = {Zhang, Qing and Yu, Hong}, editor = {Smalheiser, Neil R.}, month = nov, year = {2014}, pmid = {25375164 PMCID: PMC4222920}, pages = {e111795}, }
Biomedical research is increasingly collaborative, and successful collaborations often produce high impact work. Computational approaches can be developed for automatically predicting biomedical research collaborations. Previous works of collaboration prediction mainly explored the topological structures of research collaboration networks, leaving out rich semantic information from the publications themselves. In this paper, we propose supervised machine learning approaches to predict research collaborations in the biomedical field. We explored both the semantic features extracted from author research interest profile and the author network topological features. We found that the most informative semantic features for author collaborations are related to research interest, including similarity of out-citing citations, similarity of abstracts. Of the four supervised machine learning models (naïve Bayes, naïve Bayes multinomial, SVMs, and logistic regression), the best performing model is logistic regression with an ROC ranging from 0.766 to 0.980 on different datasets. To our knowledge we are the first to study in depth how research interest and productivities can be used for collaboration prediction. Our approach is computationally efficient, scalable and yet simple to implement. The datasets of this study are available at https://github.com/qingzhanggithub/medline-collaboration-datasets.
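The prediction task itself is plain supervised classification over author-pair features; a sketch with synthetic stand-ins for the semantic and topological features:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
# Columns: abstract similarity, out-citing citation similarity,
# shared co-authors, inverse network distance (all invented here).
X = rng.random(size=(500, 4))
y = (X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=500) > 1.0).astype(int)
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:3])[:, 1])   # predicted collaboration probabilities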
Automatically Recognizing Medication and Adverse Event Information From Food and Drug Administration’s Adverse Event Reporting System Narratives.
Polepalli Ramesh, B.; Belknap, S. M; Li, Z.; Frid, N.; West, D. P; and Yu, H.
JMIR Medical Informatics, 2(1): e10. June 2014.
Paper doi link bibtex
@article{polepalli_ramesh_automatically_2014, title = {Automatically {Recognizing} {Medication} and {Adverse} {Event} {Information} {From} {Food} and {Drug} {Administration}’s {Adverse} {Event} {Reporting} {System} {Narratives}}, volume = {2}, issn = {2291-9694}, url = {http://medinform.jmir.org/2014/1/e10/}, doi = {10.2196/medinform.3022}, language = {en}, number = {1}, urldate = {2015-05-02}, journal = {JMIR Medical Informatics}, author = {Polepalli Ramesh, Balaji and Belknap, Steven M and Li, Zuofeng and Frid, Nadya and West, Dennis P and Yu, Hong}, month = jun, year = {2014}, pmid = {25600332}, pmcid = {PMC4288072}, pages = {e10}, }
2013
(2)
Systems for Improving Electronic Health Record Note Comprehension.
Polepalli Ramesh, B.; and Yu, H.
In ACM SIGIR Workshop on Health Search & Discovery, 2013.
Paper link bibtex abstract
@inproceedings{polepalli_ramesh_systems_2013, title = {Systems for {Improving} {Electronic} {Health} {Record} {Note} {Comprehension}}, url = {https://research.nuance.com/wp-content/uploads/2014/12/Systems-for-Improving-Electronic-Health-Record-Note-Comprehension.pdf}, abstract = {Allowing patients access to their physicians’ notes has the potential to enhance their understanding of disease and improve medication adherence and healthcare outcomes. However, a recent study involving over ten thousand patients showed that allowing patients to read their electronic health record (EHR) notes caused confusion, especially for the vulnerable (e.g., lower literacy, lower income) groups. This finding is not surprising as EHR notes contain medical jargon that may be difficult for patients to comprehend. To improve patients’ EHR note comprehension, we are developing a biomedical natural language processing system called NoteAid (http://clinicalnotesaid.org), which translates medical jargon into consumer-oriented lay language. The current NoteAid implementations link EHR medical terms to their definitions and other related educational material. Our evaluation has shown that all NoteAid implementations improve self-rated EHR note comprehension by 23\% to 40\% of lay people.}, booktitle = {{ACM} {SIGIR} {Workshop} on {Health} {Search} \& {Discovery}}, author = {Polepalli Ramesh, Balaji and Yu, Hong}, year = {2013}, }
Allowing patients access to their physicians’ notes has the potential to enhance their understanding of disease and improve medication adherence and healthcare outcomes. However, a recent study involving over ten thousand patients showed that allowing patients to read their electronic health record (EHR) notes caused confusion, especially for the vulnerable (e.g., lower literacy, lower income) groups. This finding is not surprising as EHR notes contain medical jargon that may be difficult for patients to comprehend. To improve patients’ EHR note comprehension, we are developing a biomedical natural language processing system called NoteAid (http://clinicalnotesaid.org), which translates medical jargon into consumer-oriented lay language. The current NoteAid implementations link EHR medical terms to their definitions and other related educational material. Our evaluation has shown that all NoteAid implementations improve self-rated EHR note comprehension by 23% to 40% of lay people.
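At its simplest, the jargon-linking step is a lexicon lookup; a toy version with invented entries (NoteAid's actual resource and matching are far richer):

LAY = {
    "hypertension": "high blood pressure",
    "dyspnea": "shortness of breath",
}

def annotate(note):
    # Whitespace-token lookup; a real linker also handles multi-word terms.
    out = []
    for tok in note.split():
        word = tok.strip(".,;").lower()
        out.append(f"{tok} ({LAY[word]})" if word in LAY else tok)
    return " ".join(out)

print(annotate("Patient reports dyspnea; history of hypertension."))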
CiteGraph: A Citation Network System for MEDLINE Articles and Analysis.
Zhang, Q.; and Yu, H.
Studies in Health Technology and Informatics, 832–836. 2013.
Paper doi link bibtex abstract
@article{qing_citegraph:_2013, title = {{CiteGraph}: {A} {Citation} {Network} {System} for {MEDLINE} {Articles} and {Analysis}}, copyright = {©2013 © IMIA and IOS Press.}, issn = {0926-9630}, shorttitle = {{CiteGraph}}, url = {http://www.medra.org/servlet/aliasResolver?alias=iospressISSNISBN&issn=0926-9630&volume=192&spage=832}, doi = {10.3233/978-1-61499-289-9-832}, abstract = {This paper details the development and implementation of CiteGraph, a system for constructing large-scale citation and co-authorship networks from full-text biomedical articles. CiteGraph represents articles and authors by uniquely identified nodes, and connects those nodes through citation and co-authorship relations. CiteGraph network encompasses over 1.65 million full-text articles and 6.35 million citations by 1.37 million unique authors from the Elsevier full-text articles. Our evaluation shows 98\% 99\% F1-score for mapping a citation to the corresponding article and identifying MEDLINE articles. We further analyzed the characteristics of CiteGraph and found that they are consistent with assumptions made using small-scale bibliometric analysis. We also developed several novel network-based methods for analyzing publication, citation and collaboration patterns. This is the first work to develop a completely automated system for the creation of a large-scale citation network in the biomedical domain, and also to introduce novel findings in researcher publication histories. CiteGraph can be a useful resource to both the biomedical community, and bibliometric research.}, urldate = {2016-11-30}, journal = {Studies in Health Technology and Informatics}, author = {Qing, Zhang and Hong, Yu}, year = {2013}, pmid = {23920674}, pages = {832--836}, }
This paper details the development and implementation of CiteGraph, a system for constructing large-scale citation and co-authorship networks from full-text biomedical articles. CiteGraph represents articles and authors by uniquely identified nodes, and connects those nodes through citation and co-authorship relations. The CiteGraph network encompasses over 1.65 million full-text articles and 6.35 million citations by 1.37 million unique authors from the Elsevier full-text articles. Our evaluation shows 98%–99% F1-scores for mapping a citation to the corresponding article and identifying MEDLINE articles. We further analyzed the characteristics of CiteGraph and found that they are consistent with assumptions made using small-scale bibliometric analysis. We also developed several novel network-based methods for analyzing publication, citation and collaboration patterns. This is the first work to develop a completely automated system for the creation of a large-scale citation network in the biomedical domain, and also to introduce novel findings in researcher publication histories. CiteGraph can be a useful resource to both the biomedical community and bibliometric research.
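Such a network drops naturally into a graph library; a small sketch with invented article IDs (using networkx, not the authors' implementation):

import networkx as nx

cites = nx.DiGraph()
cites.add_edges_from([("pmid:1", "pmid:2"), ("pmid:1", "pmid:3"),
                      ("pmid:4", "pmid:2"), ("pmid:3", "pmid:2")])
print(cites.in_degree("pmid:2"))       # times cited
print(nx.pagerank(cites)["pmid:2"])    # network-based importance

coauth = nx.Graph()                     # co-authorship as an undirected graph
coauth.add_edges_from([("Zhang Q", "Yu H"), ("Yu H", "Liu F")])
print(list(nx.common_neighbors(coauth, "Zhang Q", "Liu F")))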
2012
(1)
2011
(7)
AskHERMES: An online question answering system for complex clinical questions.
Cao, Y.; Liu, F.; Simpson, P.; Antieau, L.; Bennett, A.; Cimino, J. J; Ely, J.; and Yu, H.
Journal of Biomedical Informatics, 44(2): 277–288. April 2011.
Paper doi link bibtex abstract
@article{cao_askhermes_2011, title = {{AskHERMES}: {An} online question answering system for complex clinical questions}, volume = {44}, issn = {1532-0480}, shorttitle = {{AskHERMES}}, url = {http://www.ncbi.nlm.nih.gov/pubmed/21256977}, doi = {10.1016/j.jbi.2011.01.004}, abstract = {{\textless}AbstractText Label="OBJECTIVE" NlmCategory="OBJECTIVE"{\textgreater}Clinical questions are often long and complex and take many forms. We have built a clinical question answering system named AskHERMES to perform robust semantic analysis on complex clinical questions and output question-focused extractive summaries as answers.{\textless}/AbstractText{\textgreater} {\textless}AbstractText Label="DESIGN" NlmCategory="METHODS"{\textgreater}This paper describes the system architecture and a preliminary evaluation of AskHERMES, which implements innovative approaches in question analysis, summarization, and answer presentation. Five types of resources were indexed in this system: MEDLINE abstracts, PubMed Central full-text articles, eMedicine documents, clinical guidelines and Wikipedia articles.{\textless}/AbstractText{\textgreater} {\textless}AbstractText Label="MEASUREMENT" NlmCategory="METHODS"{\textgreater}We compared the AskHERMES system with Google (Google and Google Scholar) and UpToDate and asked physicians to score the three systems by ease of use, quality of answer, time spent, and overall performance.{\textless}/AbstractText{\textgreater} {\textless}AbstractText Label="RESULTS" NlmCategory="RESULTS"{\textgreater}AskHERMES allows physicians to enter a question in a natural way with minimal query formulation and allows physicians to efficiently navigate among all the answer sentences to quickly meet their information needs. In contrast, physicians need to formulate queries to search for information in Google and UpToDate. The development of the AskHERMES system is still at an early stage, and the knowledge resource is limited compared with Google or UpToDate. Nevertheless, the evaluation results show that AskHERMES' performance is comparable to the other systems. In particular, when answering complex clinical questions, it demonstrates the potential to outperform both Google and UpToDate systems.{\textless}/AbstractText{\textgreater} {\textless}AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS"{\textgreater}AskHERMES, available at http://www.AskHERMES.org, has the potential to help physicians practice evidence-based medicine and improve the quality of patient care.{\textless}/AbstractText{\textgreater}}, number = {2}, urldate = {2011-03-25}, journal = {Journal of Biomedical Informatics}, author = {Cao, Yonggang and Liu, Feifan and Simpson, Pippa and Antieau, Lamont and Bennett, Andrew and Cimino, James J and Ely, John and Yu, Hong}, month = apr, year = {2011}, pmid = {21256977 PMCID: PMC3433744}, keywords = {Algorithms, Clinical Medicine, Databases, Factual, Information Storage and Retrieval, Online Systems, Software, expert systems, natural language processing}, pages = {277--288}, }
OBJECTIVE: Clinical questions are often long and complex and take many forms. We have built a clinical question answering system named AskHERMES to perform robust semantic analysis on complex clinical questions and output question-focused extractive summaries as answers. DESIGN: This paper describes the system architecture and a preliminary evaluation of AskHERMES, which implements innovative approaches in question analysis, summarization, and answer presentation. Five types of resources were indexed in this system: MEDLINE abstracts, PubMed Central full-text articles, eMedicine documents, clinical guidelines and Wikipedia articles. MEASUREMENT: We compared the AskHERMES system with Google (Google and Google Scholar) and UpToDate and asked physicians to score the three systems by ease of use, quality of answer, time spent, and overall performance. RESULTS: AskHERMES allows physicians to enter a question in a natural way with minimal query formulation and allows physicians to efficiently navigate among all the answer sentences to quickly meet their information needs. In contrast, physicians need to formulate queries to search for information in Google and UpToDate. The development of the AskHERMES system is still at an early stage, and the knowledge resource is limited compared with Google or UpToDate. Nevertheless, the evaluation results show that AskHERMES' performance is comparable to the other systems. In particular, when answering complex clinical questions, it demonstrates the potential to outperform both Google and UpToDate systems. CONCLUSIONS: AskHERMES, available at http://www.AskHERMES.org, has the potential to help physicians practice evidence-based medicine and improve the quality of patient care.
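Question-focused extractive answering can be caricatured as sentence scoring against the query; a deliberately tiny sketch (the deployed system's semantic analysis is far more involved):

def overlap(question, sentence):
    # Unigram overlap between the question and a candidate answer sentence.
    q, s = set(question.lower().split()), set(sentence.lower().split())
    return len(q & s) / max(len(q), 1)

candidates = [
    "Metformin is first-line therapy for type 2 diabetes.",
    "The clinic opens at nine.",
]
question = "What is first-line therapy for type 2 diabetes?"
print(max(candidates, key=lambda s: overlap(question, s)))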
BioN∅T: A searchable database of biomedical negated sentences.
Agarwal, S.; Yu, H.; and Kohane, I.
BMC Bioinformatics, 12(1): 420. 2011.
Paper doi link bibtex
@article{agarwal_biont_2011, title = {{BioN}∅{T}: {A} searchable database of biomedical negated sentences}, volume = {12}, issn = {1471-2105}, shorttitle = {{BioN}∅{T}}, url = {http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-420}, doi = {10.1186/1471-2105-12-420}, language = {en}, number = {1}, urldate = {2016-11-30}, journal = {BMC Bioinformatics}, author = {Agarwal, Shashank and Yu, Hong and Kohane, Issac}, year = {2011}, pmid = {22032181 PMCID: PMC3225379}, pages = {420}, }
Toward automated consumer question answering: Automatically separating consumer questions from professional questions in the healthcare domain.
Liu, F.; Antieau, L. D.; and Yu, H.
Journal of Biomedical Informatics, 44(6): 1032–1038. December 2011.
Paper doi link bibtex abstract
@article{liu_toward_2011, title = {Toward automated consumer question answering: {Automatically} separating consumer questions from professional questions in the healthcare domain}, volume = {44}, issn = {15320464}, shorttitle = {Toward automated consumer question answering}, url = {http://linkinghub.elsevier.com/retrieve/pii/S1532046411001353}, doi = {10.1016/j.jbi.2011.08.008}, abstract = {OBJECTIVE: Both healthcare professionals and healthcare consumers have information needs that can be met through the use of computers, specifically via medical question answering systems. However, the information needs of both groups are different in terms of literacy levels and technical expertise, and an effective question answering system must be able to account for these differences if it is to formulate the most relevant responses for users from each group. In this paper, we propose that a first step toward answering the queries of different users is automatically classifying questions according to whether they were asked by healthcare professionals or consumers. DESIGN: We obtained two sets of consumer questions ({\textasciitilde}10,000 questions in total) from Yahoo answers. The professional questions consist of two question collections: 4654 point-of-care questions (denoted as PointCare) obtained from interviews of a group of family doctors following patient visits and 5378 questions from physician practices through professional online services (denoted as OnlinePractice). With more than 20,000 questions combined, we developed supervised machine-learning models for automatic classification between consumer questions and professional questions. To evaluate the robustness of our models, we tested the model that was trained on the Consumer-PointCare dataset on the Consumer-OnlinePractice dataset. We evaluated both linguistic features and statistical features and examined how the characteristics in two different types of professional questions (PointCare vs. OnlinePractice) may affect the classification performance. We explored information gain for feature reduction and the back-off linguistic category features. RESULTS: The 10-fold cross-validation results showed the best F1-measure of 0.936 and 0.946 on Consumer-PointCare and Consumer-OnlinePractice respectively, and the best F1-measure of 0.891 when testing the Consumer-PointCare model on the Consumer-OnlinePractice dataset. CONCLUSION: Healthcare consumer questions posted at Yahoo online communities can be reliably classified from professional questions posted by point-of-care clinicians and online physicians. The supervised machine-learning models are robust for this task. Our study will significantly benefit further development in automated consumer question answering.}, language = {en}, number = {6}, urldate = {2016-11-30}, journal = {Journal of Biomedical Informatics}, author = {Liu, Feifan and Antieau, Lamont D. and Yu, Hong}, month = dec, year = {2011}, pmid = {21856442 PMCID: PMC3226885}, keywords = {Artificial Intelligence, Consumer Participation, Databases, Factual, Delivery of Health Care, Humans, Information Dissemination, Information Storage and Retrieval, Internet, Point-of-Care Systems, Semantics, natural language processing}, pages = {1032--1038}, }
OBJECTIVE: Both healthcare professionals and healthcare consumers have information needs that can be met through the use of computers, specifically via medical question answering systems. However, the information needs of both groups are different in terms of literacy levels and technical expertise, and an effective question answering system must be able to account for these differences if it is to formulate the most relevant responses for users from each group. In this paper, we propose that a first step toward answering the queries of different users is automatically classifying questions according to whether they were asked by healthcare professionals or consumers. DESIGN: We obtained two sets of consumer questions (~10,000 questions in total) from Yahoo answers. The professional questions consist of two question collections: 4654 point-of-care questions (denoted as PointCare) obtained from interviews of a group of family doctors following patient visits and 5378 questions from physician practices through professional online services (denoted as OnlinePractice). With more than 20,000 questions combined, we developed supervised machine-learning models for automatic classification between consumer questions and professional questions. To evaluate the robustness of our models, we tested the model that was trained on the Consumer-PointCare dataset on the Consumer-OnlinePractice dataset. We evaluated both linguistic features and statistical features and examined how the characteristics in two different types of professional questions (PointCare vs. OnlinePractice) may affect the classification performance. We explored information gain for feature reduction and the back-off linguistic category features. RESULTS: The 10-fold cross-validation results showed the best F1-measure of 0.936 and 0.946 on Consumer-PointCare and Consumer-OnlinePractice respectively, and the best F1-measure of 0.891 when testing the Consumer-PointCare model on the Consumer-OnlinePractice dataset. CONCLUSION: Healthcare consumer questions posted at Yahoo online communities can be reliably classified from professional questions posted by point-of-care clinicians and online physicians. The supervised machine-learning models are robust for this task. Our study will significantly benefit further development in automated consumer question answering.
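As a rough illustration of the classification task, the sketch below trains a bag-of-words model to separate consumer questions from professional ones. It is a hedged stand-in: the paper's models used richer linguistic and statistical features, and the toy questions are invented. With realistic data, the 10-fold cross-validated F1 reported above would be computed with sklearn.model_selection.cross_val_score(clf, questions, labels, cv=10, scoring="f1").

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented examples; 1 = professional (point-of-care), 0 = consumer.
questions = [
    "what is the recommended enoxaparin dose for dvt prophylaxis",
    "should we order a ct angiogram to rule out pulmonary embolism",
    "is it safe to take tylenol while pregnant",
    "why does my knee hurt when climbing stairs",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(questions, labels)
print(clf.predict(["is tylenol safe for children"]))  # likely 0 (consumer)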
Simple and efficient machine learning frameworks for identifying protein-protein interaction relevant articles and experimental methods used to study the interactions.
Agarwal, S.; Liu, F.; and Yu, H.
BMC Bioinformatics, 12(Suppl 8): S10. 2011.
Paper doi link bibtex abstract
@article{agarwal_simple_2011, title = {Simple and efficient machine learning frameworks for identifying protein-protein interaction relevant articles and experimental methods used to study the interactions}, volume = {12}, issn = {1471-2105}, url = {http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-S8-S10}, doi = {10.1186/1471-2105-12-S8-S10}, abstract = {BACKGROUND: Protein-protein interaction (PPI) is an important biomedical phenomenon. Automatically detecting PPI-relevant articles and identifying methods that are used to study PPI are important text mining tasks. In this study, we have explored domain independent features to develop two open source machine learning frameworks. One performs binary classification to determine whether the given article is PPI relevant or not, named "Simple Classifier", and the other one maps the PPI relevant articles with corresponding interaction method nodes in a standardized PSI-MI (Proteomics Standards Initiative-Molecular Interactions) ontology, named "OntoNorm". RESULTS: We evaluated our system in the context of BioCreative challenge competition using the standardized data set. Our systems are amongst the top systems reported by the organizers, attaining 60.8\% F1-score for identifying relevant documents, and 52.3\% F1-score for mapping articles to interaction method ontology. CONCLUSION: Our results show that domain-independent machine learning frameworks can perform competitively well at the tasks of detecting PPI relevant articles and identifying the methods that were used to study the interaction in such articles.}, language = {en}, number = {Suppl 8}, urldate = {2016-11-30}, journal = {BMC Bioinformatics}, author = {Agarwal, Shashank and Liu, Feifan and Yu, Hong}, year = {2011}, pmid = {22151701 PMCID: PMC3269933}, pages = {S10}, }
BACKGROUND: Protein-protein interaction (PPI) is an important biomedical phenomenon. Automatically detecting PPI-relevant articles and identifying methods that are used to study PPI are important text mining tasks. In this study, we have explored domain-independent features to develop two open-source machine learning frameworks. One performs binary classification to determine whether the given article is PPI relevant or not, named "Simple Classifier", and the other one maps the PPI-relevant articles to corresponding interaction method nodes in a standardized PSI-MI (Proteomics Standards Initiative-Molecular Interactions) ontology, named "OntoNorm". RESULTS: We evaluated our system in the context of the BioCreative challenge competition using the standardized data set. Our systems are amongst the top systems reported by the organizers, attaining 60.8% F1-score for identifying relevant documents, and 52.3% F1-score for mapping articles to the interaction method ontology. CONCLUSION: Our results show that domain-independent machine learning frameworks can perform competitively well at the tasks of detecting PPI-relevant articles and identifying the methods that were used to study the interaction in such articles.
Parsing citations in biomedical articles using conditional random fields.
Zhang, Q.; Cao, Y.; and Yu, H.
Computers in Biology and Medicine, 41(4): 190–194. April 2011.
Paper doi link bibtex abstract
@article{zhang_parsing_2011, title = {Parsing citations in biomedical articles using conditional random fields}, volume = {41}, issn = {00104825}, url = {http://linkinghub.elsevier.com/retrieve/pii/S0010482511000291}, doi = {10.1016/j.compbiomed.2011.02.005}, abstract = {Citations are used ubiquitously in biomedical full-text articles and play an important role for representing both the rhetorical structure and the semantic content of the articles. As a result, text mining systems will significantly benefit from a tool that automatically extracts the content of a citation. In this study, we applied the supervised machine-learning algorithms Conditional Random Fields (CRFs) to automatically parse a citation into its fields (e.g., Author, Title, Journal, and Year). With a subset of html format open-access PubMed Central articles, we report an overall 97.95\% F1-score. The citation parser can be accessed at: http://www.cs.uwm.edu/∼qing/projects/cithit/index.html.}, language = {en}, number = {4}, urldate = {2016-11-30}, journal = {Computers in Biology and Medicine}, author = {Zhang, Qing and Cao, Yong-Gang and Yu, Hong}, month = apr, year = {2011}, pmid = {21419403 PMCID: PMC3086470}, pages = {190--194}, }
Citations are used ubiquitously in biomedical full-text articles and play an important role in representing both the rhetorical structure and the semantic content of the articles. As a result, text mining systems will benefit significantly from a tool that automatically extracts the content of a citation. In this study, we applied the supervised machine-learning algorithm conditional random fields (CRFs) to automatically parse a citation into its fields (e.g., Author, Title, Journal, and Year). With a subset of HTML-format open-access PubMed Central articles, we report an overall 97.95% F1-score. The citation parser can be accessed at: http://www.cs.uwm.edu/~qing/projects/cithit/index.html.
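A toy version of CRF-based citation field tagging, using the sklearn-crfsuite package as a stand-in CRF implementation (an assumption; the paper predates this library). The single training citation and the feature set are invented, and a real parser would train on thousands of labeled citations.

import sklearn_crfsuite

def token_features(tokens, i):
    # Minimal per-token features; a production parser would use far more.
    t = tokens[i]
    return {"lower": t.lower(), "is_digit": t.isdigit(),
            "is_title": t.istitle(), "relative_pos": i / len(tokens)}

tokens = ["Zhang", "Q.", "Parsing", "citations", "Comput.", "Biol.", "Med.", "2011"]
tags = ["Author", "Author", "Title", "Title", "Journal", "Journal", "Journal", "Year"]

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [tags]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])  # should recover the training tags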
Figure Text Extraction in Biomedical Literature.
Kim, D.; and Yu, H.
PLoS ONE, 6(1): e15338. January 2011.
Paper doi link bibtex
@article{kim_figure_2011, title = {Figure {Text} {Extraction} in {Biomedical} {Literature}}, volume = {6}, issn = {1932-6203}, url = {http://dx.plos.org/10.1371/journal.pone.0015338}, doi = {10.1371/journal.pone.0015338}, language = {en}, number = {1}, urldate = {2016-11-30}, journal = {PLoS ONE}, author = {Kim, Daehyun and Yu, Hong}, editor = {Uversky, Vladimir N.}, month = jan, year = {2011}, pmid = {21249186 PMCID: PMC3020938}, pages = {e15338}, }
Automatic figure classification in bioscience literature.
Kim, D.; Ramesh, B. P.; and Yu, H.
Journal of Biomedical Informatics, 44(5): 848–858. October 2011.
Paper doi link bibtex
@article{kim_automatic_2011, title = {Automatic figure classification in bioscience literature}, volume = {44}, issn = {15320464}, url = {http://linkinghub.elsevier.com/retrieve/pii/S1532046411000943}, doi = {10.1016/j.jbi.2011.05.003}, language = {en}, number = {5}, urldate = {2016-11-30}, journal = {Journal of Biomedical Informatics}, author = {Kim, Daehyun and Ramesh, Balaji Polepalli and Yu, Hong}, month = oct, year = {2011}, pmid = {21645638 PMCID: PMC3176927}, pages = {848--858}, }
2010
(4)
Lancet: a high precision medication event extraction system for clinical text.
Li, Z.; Liu, F.; Antieau, L.; Cao, Y.; and Yu, H.
Journal of the American Medical Informatics Association: JAMIA, 17(5): 563–567. October 2010.
Paper doi link bibtex abstract
@article{li_lancet_2010, title = {Lancet: a high precision medication event extraction system for clinical text}, volume = {17}, issn = {1527-974X}, shorttitle = {Lancet}, url = {http://www.ncbi.nlm.nih.gov/pubmed/20819865}, doi = {10.1136/jamia.2010.004077}, abstract = {OBJECTIVE: This paper presents Lancet, a supervised machine-learning system that automatically extracts medication events consisting of medication names and information pertaining to their prescribed use (dosage, mode, frequency, duration and reason) from lists or narrative text in medical discharge summaries. DESIGN: Lancet incorporates three supervised machine-learning models: a conditional random fields model for tagging individual medication names and associated fields, an AdaBoost model with decision stump algorithm for determining which medication names and fields belong to a single medication event, and a support vector machines disambiguation model for identifying the context style (narrative or list). MEASUREMENTS: The authors, from the University of Wisconsin-Milwaukee, participated in the third i2b2 shared-task for challenges in natural language processing for clinical data: medication extraction challenge. With the performance metrics provided by the i2b2 challenge, the micro F1 (precision/recall) scores are reported for both the horizontal and vertical level. RESULTS: Among the top 10 teams, Lancet achieved the highest precision at 90.4\% with an overall F1 score of 76.4\% (horizontal system level with exact match), a gain of 11.2\% and 12\%, respectively, compared with the rule-based baseline system jMerki. By combining the two systems, the hybrid system further increased the F1 score by 3.4\% from 76.4\% to 79.0\%. CONCLUSIONS: Supervised machine-learning systems with minimal external knowledge resources can achieve a high precision with a competitive overall F1 score.Lancet based on this learning framework does not rely on expensive manually curated rules. The system is available online at http://code.google.com/p/lancet/.}, number = {5}, urldate = {2010-09-21}, journal = {Journal of the American Medical Informatics Association: JAMIA}, author = {Li, Zuofeng and Liu, Feifan and Antieau, Lamont and Cao, Yonggang and Yu, Hong}, month = oct, year = {2010}, pmid = {20819865 PMCID: PMC2995682}, pages = {563--567}, }
OBJECTIVE: This paper presents Lancet, a supervised machine-learning system that automatically extracts medication events consisting of medication names and information pertaining to their prescribed use (dosage, mode, frequency, duration and reason) from lists or narrative text in medical discharge summaries. DESIGN: Lancet incorporates three supervised machine-learning models: a conditional random fields model for tagging individual medication names and associated fields, an AdaBoost model with the decision stump algorithm for determining which medication names and fields belong to a single medication event, and a support vector machines disambiguation model for identifying the context style (narrative or list). MEASUREMENTS: The authors, from the University of Wisconsin-Milwaukee, participated in the third i2b2 shared-task for challenges in natural language processing for clinical data: medication extraction challenge. With the performance metrics provided by the i2b2 challenge, the micro F1 (precision/recall) scores are reported at both the horizontal and vertical levels. RESULTS: Among the top 10 teams, Lancet achieved the highest precision at 90.4% with an overall F1 score of 76.4% (horizontal system level with exact match), a gain of 11.2% and 12%, respectively, compared with the rule-based baseline system jMerki. By combining the two systems, the hybrid system further increased the F1 score by 3.4% from 76.4% to 79.0%. CONCLUSIONS: Supervised machine-learning systems with minimal external knowledge resources can achieve a high precision with a competitive overall F1 score. Lancet, based on this learning framework, does not rely on expensive manually curated rules. The system is available online at http://code.google.com/p/lancet/.
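The horizontal-level scores above reduce to precision, recall, and their harmonic mean pooled over all extracted medication events. A small helper makes the relationship concrete; the counts below are invented, chosen only so the output lands near the precision and F1 figures quoted in the abstract.

def micro_prf(tp, fp, fn):
    # Pool true positives, false positives and false negatives over all
    # events, then compute precision, recall and their harmonic mean (F1).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 90.4% precision, ~66% recall, F1 around 0.76.
print(micro_prf(tp=904, fp=96, fn=460))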
Identifying discourse connectives in biomedical text.
Ramesh, B. P.; and Yu, H.
AMIA ... Annual Symposium proceedings. AMIA Symposium, 2010: 657–661. November 2010.
Paper link bibtex abstract
@article{ramesh_identifying_2010, title = {Identifying discourse connectives in biomedical text}, volume = {2010}, issn = {1942-597X}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041460/}, abstract = {Discourse connectives are words or phrases that connect or relate two coherent sentences or phrases and indicate the presence of discourse relations. Automatic recognition of discourse connectives may benefit many natural language processing applications. In this pilot study, we report the development of the supervised machine-learning classifiers with conditional random fields (CRFs) for automatically identifying discourse connectives in full-text biomedical articles. Our first classifier was trained on the open-domain 1 million token Penn Discourse Tree Bank (PDTB). We performed cross validation on biomedical articles (approximately 100K word tokens) that we annotated. The results show that the classifier trained on PDTB data attained a 0.55 F1-score for identifying discourse connectives in biomedical text, while the cross-validation results in the biomedical text attained a 0.69 F1-score, a much better performance despite a much smaller training size. Our preliminary analysis suggests the existence of domain-specific features, and we speculate that domain-adaption approaches may further improve performance.}, language = {ENG}, journal = {AMIA ... Annual Symposium proceedings. AMIA Symposium}, author = {Ramesh, Balaji Polepalli and Yu, Hong}, month = nov, year = {2010}, pmid = {21347060 PMCID: PMC3041460}, keywords = {Algorithms, Artificial Intelligence, Databases, Factual, Humans, Pilot Projects, Supervised Machine Learning, natural language processing}, pages = {657--661}, }
Discourse connectives are words or phrases that connect or relate two coherent sentences or phrases and indicate the presence of discourse relations. Automatic recognition of discourse connectives may benefit many natural language processing applications. In this pilot study, we report the development of supervised machine-learning classifiers with conditional random fields (CRFs) for automatically identifying discourse connectives in full-text biomedical articles. Our first classifier was trained on the open-domain 1-million-token Penn Discourse Treebank (PDTB). We performed cross-validation on biomedical articles (approximately 100K word tokens) that we annotated. The results show that the classifier trained on PDTB data attained a 0.55 F1-score for identifying discourse connectives in biomedical text, while the cross-validation results in the biomedical text attained a 0.69 F1-score, a much better performance despite a much smaller training size. Our preliminary analysis suggests the existence of domain-specific features, and we speculate that domain-adaptation approaches may further improve performance.
Biomedical negation scope detection with conditional random fields.
Agarwal, S.; and Yu, H.
Journal of the American Medical Informatics Association: JAMIA, 17(6): 696–701. November 2010.
PMID: 20962133 PMCID: PMC3000754
Paper doi link bibtex abstract
@article{agarwal_biomedical_2010, title = {Biomedical negation scope detection with conditional random fields}, volume = {17}, issn = {1527-974X}, url = {http://www.ncbi.nlm.nih.gov/pubmed/20962133}, doi = {10.1136/jamia.2010.003228}, abstract = {OBJECTIVE: Negation is a linguistic phenomenon that marks the absence of an entity or event. Negated events are frequently reported in both biological literature and clinical notes. Text mining applications benefit from the detection of negation and its scope. However, due to the complexity of language, identifying the scope of negation in a sentence is not a trivial task. DESIGN: Conditional random fields (CRF), a supervised machine-learning algorithm, were used to train models to detect negation cue phrases and their scope in both biological literature and clinical notes. The models were trained on the publicly available BioScope corpus. MEASUREMENT: The performance of the CRF models was evaluated on identifying the negation cue phrases and their scope by calculating recall, precision and F1-score. The models were compared with four competitive baseline systems. RESULTS: The best CRF-based model performed statistically better than all baseline systems and NegEx, achieving an F1-score of 98\% and 95\% on detecting negation cue phrases and their scope in clinical notes, and an F1-score of 97\% and 85\% on detecting negation cue phrases and their scope in biological literature. CONCLUSIONS: This approach is robust, as it can identify negation scope in both biological and clinical text. To benefit text mining applications, the system is publicly available as a Java API and as an online application at http://negscope.askhermes.org.}, number = {6}, urldate = {2011-03-25}, journal = {Journal of the American Medical Informatics Association: JAMIA}, author = {Agarwal, Shashank and Yu, Hong}, month = nov, year = {2010}, note = {PMID: 20962133 PMCID: PMC3000754}, keywords = {Humans, natural language processing}, pages = {696--701}, }
OBJECTIVE: Negation is a linguistic phenomenon that marks the absence of an entity or event. Negated events are frequently reported in both biological literature and clinical notes. Text mining applications benefit from the detection of negation and its scope. However, due to the complexity of language, identifying the scope of negation in a sentence is not a trivial task. DESIGN: Conditional random fields (CRF), a supervised machine-learning algorithm, were used to train models to detect negation cue phrases and their scope in both biological literature and clinical notes. The models were trained on the publicly available BioScope corpus. MEASUREMENT: The performance of the CRF models was evaluated on identifying the negation cue phrases and their scope by calculating recall, precision and F1-score. The models were compared with four competitive baseline systems. RESULTS: The best CRF-based model performed statistically better than all baseline systems and NegEx, achieving an F1-score of 98% and 95% on detecting negation cue phrases and their scope in clinical notes, and an F1-score of 97% and 85% on detecting negation cue phrases and their scope in biological literature. CONCLUSIONS: This approach is robust, as it can identify negation scope in both biological and clinical text. To benefit text mining applications, the system is publicly available as a Java API and as an online application at http://negscope.askhermes.org.
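For contrast with the CRF approach, a deliberately simplified NegEx-flavored baseline (NegEx is one of the baselines mentioned above) can be written in a few lines: a negation cue opens a scope that runs to the end of the sentence or to a terminator word. The cue and terminator lists here are heavily abbreviated and invented for illustration.

import re

CUES = re.compile(r"\b(no|not|denies|without|absence of)\b", re.I)
TERMINATORS = re.compile(r"\b(but|however|except)\b", re.I)

def negation_scope(sentence):
    # Return the negated span of a sentence, or None if no cue fires.
    cue = CUES.search(sentence)
    if not cue:
        return None
    rest = sentence[cue.end():]
    stop = TERMINATORS.search(rest)
    return rest[: stop.start()].strip() if stop else rest.strip()

print(negation_scope("Patient denies chest pain but reports dyspnea."))  # 'chest pain'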
Automatic Figure Ranking and User Interfacing for Intelligent Figure Search.
Yu, H.; Liu, F.; and Ramesh, B. P.
PLoS ONE, 5(10): e12983. October 2010.
Paper doi link bibtex
@article{yu_automatic_2010, title = {Automatic {Figure} {Ranking} and {User} {Interfacing} for {Intelligent} {Figure} {Search}}, volume = {5}, issn = {1932-6203}, url = {http://dx.plos.org/10.1371/journal.pone.0012983}, doi = {10.1371/journal.pone.0012983}, language = {en}, number = {10}, urldate = {2016-11-30}, journal = {PLoS ONE}, author = {Yu, Hong and Liu, Feifan and Ramesh, Balaji Polepalli}, editor = {Wong, Kelvin Kian Loong}, month = oct, year = {2010}, pmid = {20949102 PMCID: PMC2951344}, keywords = {balaji-ramesh, bioinformatics, feifan-liu, hong-yu}, pages = {e12983}, }
2009
(5)
Using the Weighted Keyword Model to Improve Information Retrieval for Answering Biomedical Questions.
Yu, H.; and Cao, Y.
Summit on translational bioinformatics, 2009: 143. 2009.
link bibtex abstract
@article{yu_using_2009, title = {Using the {Weighted} {Keyword} {Model} to {Improve} {Information} {Retrieval} for {Answering} {Biomedical} {Questions}}, volume = {2009}, abstract = {Physicians ask many complex questions during the patient encounter. Information retrieval systems that can provide immediate and relevant answers to these questions can be invaluable aids to the practice of evidence-based medicine. In this study, we first automatically identify topic keywords from ad hoc clinical questions with a Condition Random Field model that is trained over thousands of manually annotated clinical questions. We then report on a linear model that assigns query weights based on their automatically identified semantic roles: topic keywords, domain specific terms, and their synonyms. Our evaluation shows that this weighted keyword model improves information retrieval from the Text Retrieval Conference Genomics track data.}, journal = {Summit on translational bioinformatics}, author = {Yu, Hong and Cao, Yong-Gang}, year = {2009}, pmid = {21347188 PMCID: PMC3041568}, pages = {143}, }
Physicians ask many complex questions during the patient encounter. Information retrieval systems that can provide immediate and relevant answers to these questions can be invaluable aids to the practice of evidence-based medicine. In this study, we first automatically identify topic keywords from ad hoc clinical questions with a Conditional Random Field model that is trained over thousands of manually annotated clinical questions. We then report on a linear model that assigns query weights based on their automatically identified semantic roles: topic keywords, domain-specific terms, and their synonyms. Our evaluation shows that this weighted keyword model improves information retrieval on the Text Retrieval Conference Genomics track data.
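The linear weighting idea can be sketched as follows: each query term contributes to a document's retrieval score in proportion to its semantic role. The roles and weights below are invented placeholders; the paper assigns roles automatically and fits the weights on annotated questions.

# Hypothetical role weights; the paper learns these with a linear model.
ROLE_WEIGHTS = {"topic": 3.0, "domain": 2.0, "synonym": 1.0}

def weighted_score(document_tokens, weighted_query):
    # weighted_query maps term -> role; matched terms add their role weight.
    doc = set(document_tokens)
    return sum(ROLE_WEIGHTS[role] for term, role in weighted_query.items() if term in doc)

query = {"metformin": "topic", "dose": "domain", "glucophage": "synonym"}
print(weighted_score("initial metformin dose in adults".split(), query))  # 5.0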
Investigating and annotating the role of citation in biomedical full-text articles.
Yu, H.; Agarwal, S.; and Frid, N.
In Bioinformatics and Biomedicine Workshop, pages 308–313, November 2009. IEEE
Paper doi link bibtex abstract
@inproceedings{yu_investigating_2009, title = {Investigating and annotating the role of citation in biomedical full-text articles}, isbn = {978-1-4244-5121-0}, url = {http://ieeexplore.ieee.org/document/5332080/}, doi = {10.1109/BIBMW.2009.5332080}, abstract = {Citations are ubiquitous in scientific articles and play important roles for representing the semantic content of a full-text biomedical article. In this work, we manually examined full-text biomedical articles to analyze the semantic content of citations in full-text biomedical articles. After developing a citation relation schema and annotation guideline, our pilot annotation results show an overall agreement of 0.71, and here we report on the research challenges and the lessons we've learned while trying to overcome them. Our work is a first step toward automatic citation classification in full-text biomedical articles, which may contribute to many text mining tasks, including information retrieval, extraction, summarization, and question answering.}, urldate = {2016-11-30}, booktitle = {Bioinformatics and {Biomedicine} {Workshop}}, publisher = {IEEE}, author = {Yu, Hong and Agarwal, Shashank and Frid, Nadya}, month = nov, year = {2009}, pmid = {21170175 PMCID: PMC3003334}, pages = {308--313}, }
Citations are ubiquitous in scientific articles and play important roles in representing the semantic content of a full-text biomedical article. In this work, we manually examined full-text biomedical articles to analyze the semantic content of their citations. After developing a citation relation schema and annotation guideline, our pilot annotation results show an overall agreement of 0.71, and here we report on the research challenges and the lessons we've learned while trying to overcome them. Our work is a first step toward automatic citation classification in full-text biomedical articles, which may contribute to many text mining tasks, including information retrieval, extraction, summarization, and question answering.
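The 0.71 figure is inter-annotator agreement over the citation relation schema. As a hedged illustration of how such a figure can be computed (the abstract does not specify the exact statistic, and the annotations below are invented), here are observed agreement and Cohen's kappa for two annotators:

from collections import Counter

def agreement_stats(labels_a, labels_b):
    # Observed agreement: fraction of items labeled identically.
    n = len(labels_a)
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance both pick the same label independently.
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in ca)
    return po, (po - pe) / (1 - pe)  # observed agreement, Cohen's kappa

a = ["Background", "Method", "Method", "Comparison", "Background", "Method", "Background", "Method"]
b = ["Background", "Method", "Comparison", "Comparison", "Background", "Method", "Method", "Method"]
print(agreement_stats(a, b))  # (0.75, 0.6)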
Evaluating the weighted-keyword model to improve clinical question answering.
Cao, Y.; Ely, J.; and Yu, H.
In Bioinformatics and Biomedicine Workshop, pages 331–335, November 2009. IEEE
INSPEC Accession Number: 10975550
Paper doi link bibtex abstract
@inproceedings{cao_evaluating_2009, title = {Evaluating the weighted-keyword model to improve clinical question answering}, isbn = {978-1-4244-5121-0}, url = {http://ieeexplore.ieee.org/document/5332084/}, doi = {10.1109/BIBMW.2009.5332084}, abstract = {Physicians ask many complex questions during their encounters with patients. Question answering systems provide immediate and direct answers to ad hoc clinical questions, and because these systems might aid in the practice of evidence-based medicine, we are developing the clinical question answering system, AskHERMES, to generate answers to such questions. In this study, we report the evaluation of a new weighted-keyword model for improving our question answering system. As part of this development, a physician manually examined AskHERMES' answers to 20 ad hoc clinical questions created with and without the weighted-keyword model. The results show that the weighted-keyword model improves quality in question answering. AskHERMES can be accessed at http://www.AskHERMES.org.}, urldate = {2016-11-30}, booktitle = {Bioinformatics and {Biomedicine} {Workshop}}, publisher = {IEEE}, author = {Cao, Yong-Gang and Ely, John and Yu, Hong}, month = nov, year = {2009}, note = {INSPEC Accession Number: 10975550}, pages = {331--335}, }
Physicians ask many complex questions during their encounters with patients. Question answering systems provide immediate and direct answers to ad hoc clinical questions, and because these systems might aid in the practice of evidence-based medicine, we are developing the clinical question answering system, AskHERMES, to generate answers to such questions. In this study, we report the evaluation of a new weighted-keyword model for improving our question answering system. As part of this development, a physician manually examined AskHERMES' answers to 20 ad hoc clinical questions created with and without the weighted-keyword model. The results show that the weighted-keyword model improves quality in question answering. AskHERMES can be accessed at http://www.AskHERMES.org.
Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion.
Agarwal, S.; and Yu, H.
Bioinformatics, 25(23): 3174–3180. December 2009.
Paper doi link bibtex
@article{agarwal_automatically_2009, title = {Automatically classifying sentences in full-text biomedical articles into {Introduction}, {Methods}, {Results} and {Discussion}}, volume = {25}, issn = {1367-4803, 1460-2059}, url = {https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btp548}, doi = {10.1093/bioinformatics/btp548}, language = {en}, number = {23}, urldate = {2016-11-30}, journal = {Bioinformatics}, author = {Agarwal, S. and Yu, H.}, month = dec, year = {2009}, pmid = {21347163}, pmcid = {PMC3041564}, pages = {3174--3180}, }
Are figure legends sufficient? Evaluating the contribution of associated text to biomedical figure comprehension.
Yu, H.; Agarwal, S.; Johnston, M.; and Cohen, A.
Journal of Biomedical Discovery and Collaboration, 4(1): 1. 2009.
Paper doi link bibtex abstract
@article{yu_are_2009, title = {Are figure legends sufficient? {Evaluating} the contribution of associated text to biomedical figure comprehension}, volume = {4}, issn = {1747-5333}, shorttitle = {Are figure legends sufficient?}, url = {http://www.j-biomed-discovery.com/content/4/1/1}, doi = {10.1186/1747-5333-4-1}, abstract = {BACKGROUND:Biomedical scientists need to access figures to validate research facts and to formulate or to test novel research hypotheses. However, figures are difficult to comprehend without associated text (e.g., figure legend and other reference text). We are developing automated systems to extract the relevant explanatory information along with figures extracted from full text articles. Such systems could be very useful in improving figure retrieval and in reducing the workload of biomedical scientists, who otherwise have to retrieve and read the entire full-text journal article to determine which figures are relevant to their research. As a crucial step, we studied the importance of associated text in biomedical figure comprehension.METHODS:Twenty subjects evaluated three figure-text combinations: figure+legend, figure+legend+title+abstract, and figure+full-text. Using a Likert scale, each subject scored each figure+text according to the extent to which the subject thought he/she understood the meaning of the figure and the confidence in providing the assigned score. Additionally, each subject entered a free text summary for each figure-text. We identified missing information using indicator words present within the text summaries. Both the Likert scores and the missing information were statistically analyzed for differences among the figure-text types. We also evaluated the quality of text summaries with the text-summarization evaluation method the ROUGE score.RESULTS:Our results showed statistically significant differences in figure comprehension when varying levels of text were provided. When the full-text article is not available, presenting just the figure+legend left biomedical researchers lacking 39-68\% of the information about a figure as compared to having complete figure comprehension; adding the title and abstract improved the situation, but still left biomedical researchers missing 30\% of the information. When the full-text article is available, figure comprehension increased to 86-97\%; this indicates that researchers felt that only 3-14\% of the necessary information for full figure comprehension was missing when full text was available to them. Clearly there is information in the abstract and in the full text that biomedical scientists deem important for understanding the figures that appear in full-text biomedical articles.CONCLUSION:We conclude that the texts that appear in full-text biomedical articles are useful for understanding the meaning of a figure, and an effective figure-mining system needs to unlock the information beyond figure legend. Our work provides important guidance to the figure mining systems that extract information only from figure and figure legend.}, number = {1}, urldate = {2009-03-03}, journal = {Journal of Biomedical Discovery and Collaboration}, author = {Yu, Hong and Agarwal, Shashank and Johnston, Mark and Cohen, Aaron}, year = {2009}, pmid = {19126221 PMCID: PMC2631451}, pages = {1}, }
BACKGROUND: Biomedical scientists need to access figures to validate research facts and to formulate or to test novel research hypotheses. However, figures are difficult to comprehend without associated text (e.g., figure legend and other reference text). We are developing automated systems to extract the relevant explanatory information along with figures extracted from full-text articles. Such systems could be very useful in improving figure retrieval and in reducing the workload of biomedical scientists, who otherwise have to retrieve and read the entire full-text journal article to determine which figures are relevant to their research. As a crucial step, we studied the importance of associated text in biomedical figure comprehension. METHODS: Twenty subjects evaluated three figure-text combinations: figure+legend, figure+legend+title+abstract, and figure+full-text. Using a Likert scale, each subject scored each figure-text combination according to the extent to which the subject thought he/she understood the meaning of the figure and the confidence in providing the assigned score. Additionally, each subject entered a free-text summary for each figure-text combination. We identified missing information using indicator words present within the text summaries. Both the Likert scores and the missing information were statistically analyzed for differences among the figure-text types. We also evaluated the quality of the text summaries with the text-summarization evaluation method ROUGE. RESULTS: Our results showed statistically significant differences in figure comprehension when varying levels of text were provided. When the full-text article is not available, presenting just the figure+legend left biomedical researchers lacking 39-68% of the information about a figure as compared to having complete figure comprehension; adding the title and abstract improved the situation, but still left biomedical researchers missing 30% of the information. When the full-text article is available, figure comprehension increased to 86-97%; this indicates that researchers felt that only 3-14% of the necessary information for full figure comprehension was missing when full text was available to them. Clearly there is information in the abstract and in the full text that biomedical scientists deem important for understanding the figures that appear in full-text biomedical articles. CONCLUSION: We conclude that the texts that appear in full-text biomedical articles are useful for understanding the meaning of a figure, and an effective figure-mining system needs to unlock the information beyond the figure legend. Our work provides important guidance to figure-mining systems that extract information only from the figure and figure legend.
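The ROUGE evaluation mentioned in METHODS can be reproduced in miniature with the rouge-score package (an assumption; the study used its own ROUGE setup), comparing a subject's free-text summary against a reference description. The two sentences are invented.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "the figure shows that protein expression increases after treatment"
summary = "protein expression increases after drug treatment"
print(scorer.score(reference, summary))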
2008
(1)
Translating biology: text mining tools that work.
Cohen, K B.; Yu, H.; Bourne, P. E; and Hirschman, L.
In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, volume 13, pages 551, 2008.
NIHMSID: NIHMS92147
Paper link bibtex
@inproceedings{cohen_translating_2008, title = {Translating biology: text mining tools that work}, volume = {13}, url = {http://psb.stanford.edu/psb-online/proceedings/psb08/textmining.pdf}, booktitle = {Pacific {Symposium} on {Biocomputing}. {Pacific} {Symposium} on {Biocomputing}}, author = {Cohen, K Bretonnel and Yu, Hong and Bourne, Philip E and Hirschman, Lynette}, year = {2008}, pmcid = {PMC2934913}, pmid = {20827444}, note = {NIHMSID: NIHMS92147}, pages = {551}, }
2007
(3)
Development, implementation, and a cognitive evaluation of a definitional question answering system for physicians.
Yu, H.; Lee, M.; Kaufman, D.; Ely, J.; Osheroff, J. A.; Hripcsak, G.; and Cimino, J.
Journal of Biomedical Informatics, 40(3): 236–251. June 2007.
Paper doi link bibtex
@article{yu_development_2007, title = {Development, implementation, and a cognitive evaluation of a definitional question answering system for physicians}, volume = {40}, issn = {15320464}, url = {http://linkinghub.elsevier.com/retrieve/pii/S1532046407000202}, doi = {10.1016/j.jbi.2007.03.002}, language = {en}, number = {3}, urldate = {2016-11-30}, journal = {Journal of Biomedical Informatics}, author = {Yu, Hong and Lee, Minsuk and Kaufman, David and Ely, John and Osheroff, Jerome A. and Hripcsak, George and Cimino, James}, month = jun, year = {2007}, pmid = {17462961}, keywords = {Algorithms Attitude of Health Personnel Attitude to Computers *Cognition Databases, Bibliographic Databases, Factual *Decision Support Techniques Humans Information Storage and Retrieval Information Systems Internet Logical Observation Identifiers Names and Codes Online Systems *Physicians PubMed Research Design Software}, pages = {236--251}, }
Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles.
Yu, H.; Kim, W.; Hatzivassiloglou, V.; and Wilbur, W. J.
Journal of Biomedical Informatics, 40(2): 150–159. April 2007.
Paper doi link bibtex abstract
@article{yu_using_2007, title = {Using {MEDLINE} as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles}, volume = {40}, issn = {15320464}, url = {http://linkinghub.elsevier.com/retrieve/pii/S1532046406000621}, doi = {10.1016/j.jbi.2006.06.001}, abstract = {Biomedical abbreviations and acronyms are widely used in biomedical literature. Since many of them represent important content in biomedical literature, information retrieval and extraction benefits from identifying the meanings of those terms. On the other hand, many abbreviations and acronyms are ambiguous, it would be important to map them to their full forms, which ultimately represent the meanings of the abbreviations. In this study, we present a semi-supervised method that applies MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. We first automatically generated from the MEDLINE abstracts a dictionary of abbreviation-full pairs based on a rule-based system that maps abbreviations to full forms when full forms are defined in the abstracts. We then trained on the MEDLINE abstracts and predicted the full forms of abbreviations in full-text journal articles by applying supervised machine-learning algorithms in a semi-supervised fashion. We report up to 92\% prediction precision and up to 91\% coverage.}, language = {en}, number = {2}, urldate = {2016-11-30}, journal = {Journal of Biomedical Informatics}, author = {Yu, Hong and Kim, Won and Hatzivassiloglou, Vasileios and Wilbur, W. John}, month = apr, year = {2007}, pmid = {16843731}, keywords = {*Artificial Intelligence Database Management Systems Information Storage and Retrieval/*methods *Medline *Natural Language Processing Pattern Recognition, Automated/*methods *Periodicals *Terminology}, pages = {150--159}, }
Biomedical abbreviations and acronyms are widely used in biomedical literature. Since many of them represent important content in biomedical literature, information retrieval and extraction benefit from identifying the meanings of those terms. On the other hand, many abbreviations and acronyms are ambiguous, so it is important to map them to their full forms, which ultimately represent the meanings of the abbreviations. In this study, we present a semi-supervised method that applies MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. We first automatically generated from the MEDLINE abstracts a dictionary of abbreviation-full form pairs based on a rule-based system that maps abbreviations to full forms when the full forms are defined in the abstracts. We then trained on the MEDLINE abstracts and predicted the full forms of abbreviations in full-text journal articles by applying supervised machine-learning algorithms in a semi-supervised fashion. We report up to 92% prediction precision and up to 91% coverage.
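The dictionary-building step, pairing abbreviations with the full forms defined alongside them, can be approximated with a first-letter matching rule. This sketch is far cruder than the paper's rule-based system and handles only the simplest "full form (ABBR)" pattern; the example sentence is invented.

import re

def find_pairs(text):
    # Match a parenthesized uppercase token and check that the preceding
    # words' initials spell it out.
    pairs = []
    for m in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = m.group(1)
        words = text[: m.start()].split()[-len(abbr):]
        if len(words) == len(abbr) and all(
            w[0].upper() == c for w, c in zip(words, abbr)
        ):
            pairs.append((" ".join(words), abbr))
    return pairs

print(find_pairs("We study the unified medical language system (UMLS) here."))
# [('unified medical language system', 'UMLS')]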
The efficacy and safety of apixaban, an oral, direct factor Xa inhibitor, as thromboprophylaxis in patients following total knee replacement.
Lassen, M. R.; Davidson, B. L.; Gallus, A.; Pineo, G.; Ansell, J.; and Deitchman, D.
Journal of Thrombosis and Haemostasis, 5(12): 2368–2375. December 2007.
Paper doi link bibtex abstract
@article{lassen_efficacy_2007, title = {The efficacy and safety of apixaban, an oral, direct factor {Xa} inhibitor, as thromboprophylaxis in patients following total knee replacement}, volume = {5}, issn = {15387933, 15387836}, url = {http://doi.wiley.com/10.1111/j.1538-7836.2007.02764.x}, doi = {10.1111/j.1538-7836.2007.02764.x}, abstract = {BACKGROUND: Heparins and warfarin are currently used as venous thromboembolism (VTE) prophylaxis in surgery. Inhibition of factor (F) Xa provides a specific mechanism of anticoagulation and the potential for an improved benefit-risk profile. OBJECTIVES: To evaluate the safety and efficacy of apixaban, a potent, direct, oral inhibitor of FXa, in patients following total knee replacement (TKR), and to investigate dose-response relationships. PATIENTS/METHODS: A total of 1238 patients were randomized to one of six double-blind apixaban doses [5, 10 or 20 mg day(-1) administered as a single (q.d.) or a twice-daily divided dose (b.i.d.)], enoxaparin (30 mg b.i.d.) or open-label warfarin (titrated to an International Normalized Ratio of 1.8-3.0). Treatment lasted 10-14 days, commencing 12-24 h after surgery with apixaban or enoxaparin, and on the evening of surgery with warfarin. The primary efficacy outcome was a composite of VTE (mandatory venography) and all-cause mortality during treatment. The primary safety outcome was major bleeding. RESULTS: A total of 1217 patients were eligible for safety and 856 patients for efficacy analysis. All apixaban groups had lower primary efficacy event rates than either comparator. The primary outcome rate decreased with increasing apixaban dose (P = 0.09 with q.d./b.i.d. regimens combined, P = 0.19 for q.d. and P = 0.13 for b.i.d. dosing).A significant dose-related increase in the incidence of total adjudicated bleeding events was noted in the q.d. (P = 0.01) and b.i.d. (P = 0.02) apixaban groups; there was no difference between q.d. and b.i.d. regimens. CONCLUSIONS: Apixaban in doses of 2.5 mg b.i.d. or 5 mg q.d. has a promising benefit-risk profile compared with the current standards of care following TKR.}, language = {en}, number = {12}, urldate = {2016-11-30}, journal = {Journal of Thrombosis and Haemostasis}, author = {Lassen, M. R. and Davidson, B. L. and Gallus, A. and Pineo, G. and Ansell, J. and Deitchman, D.}, month = dec, year = {2007}, pmid = {17868430}, pages = {2368--2375}, }
BACKGROUND: Heparins and warfarin are currently used as venous thromboembolism (VTE) prophylaxis in surgery. Inhibition of factor (F) Xa provides a specific mechanism of anticoagulation and the potential for an improved benefit-risk profile. OBJECTIVES: To evaluate the safety and efficacy of apixaban, a potent, direct, oral inhibitor of FXa, in patients following total knee replacement (TKR), and to investigate dose-response relationships. PATIENTS/METHODS: A total of 1238 patients were randomized to one of six double-blind apixaban doses [5, 10 or 20 mg day(-1) administered as a single (q.d.) or a twice-daily divided dose (b.i.d.)], enoxaparin (30 mg b.i.d.) or open-label warfarin (titrated to an International Normalized Ratio of 1.8-3.0). Treatment lasted 10-14 days, commencing 12-24 h after surgery with apixaban or enoxaparin, and on the evening of surgery with warfarin. The primary efficacy outcome was a composite of VTE (mandatory venography) and all-cause mortality during treatment. The primary safety outcome was major bleeding. RESULTS: A total of 1217 patients were eligible for safety and 856 patients for efficacy analysis. All apixaban groups had lower primary efficacy event rates than either comparator. The primary outcome rate decreased with increasing apixaban dose (P = 0.09 with q.d./b.i.d. regimens combined, P = 0.19 for q.d. and P = 0.13 for b.i.d. dosing). A significant dose-related increase in the incidence of total adjudicated bleeding events was noted in the q.d. (P = 0.01) and b.i.d. (P = 0.02) apixaban groups; there was no difference between q.d. and b.i.d. regimens. CONCLUSIONS: Apixaban in doses of 2.5 mg b.i.d. or 5 mg q.d. has a promising benefit-risk profile compared with the current standards of care following TKR.
2006
(1)
The semantics of a definiendum constrains both the lexical semantics and the lexicosyntactic patterns in the definiens.
Yu, H.; and Wei, Y.
In Proceedings of the BioNLP Workshop on Linking Natural Language Processing and Biology at HLT-NAACL, pages 1–8, New York, USA, 2006.
Paper link bibtex
@inproceedings{yu_semantics_2006, address = {New York, USA}, title = {The semantics of a definiendum constrains both the lexical semantics and the lexicosyntactic patterns in the definiens}, url = {https://dl.acm.org/citation.cfm?id=1567621}, booktitle = {Proceedings of the {BioNLP} {Workshop} on {Linking} {Natural} {Language} {Processing} and {Biology} at {HLT}-{NAACL}}, author = {Yu, H. and Wei, Y.}, year = {2006}, pages = {1--8}, }
2004
(1)
Using MEDLINE as a knowledge source for disambiguating abbreviations in full-text biomedical journal articles.
Yu, H.; Kim, W.; Hatzivassiloglou, V.; and Wilbur, W. J.
In Computer-Based Medical Systems, 2004. CBMS 2004. Proceedings. 17th IEEE Symposium on, pages 27–32, June 2004. IEEE
doi link bibtex abstract
@inproceedings{yu_using_2004, title = {Using {MEDLINE} as a knowledge source for disambiguating abbreviations in full-text biomedical journal articles}, isbn = {0-7695-2104-5}, doi = {10.1109/CBMS.2004.1311686}, abstract = {Biomedical abbreviations and acronyms are widely used in biomedical literature. Since many abbreviations represent important content in biomedical literature, information retrieval and extraction benefits from identifying the meanings of biomedical abbreviations. Since many abbreviations are ambiguous, it would be important to map abbreviations to their full forms, which ultimately represent the meanings of the abbreviations. In this study, we present a novel unsupervised method that applies MEDLINE records as a knowledge source for disambiguating abbreviations in full-text biomedical journal articles. We first automatically generated from MEDLINE records a knowledge source or dictionary of abbreviation-full pairs. We then trained on MEDLINE records and predicted the full forms of abbreviations in full-text journal articles by applying supervised machine-learning algorithms in an unsupervised fashion. We report up to 92\% prediction precision and up to 91\% coverage.}, booktitle = {Computer-{Based} {Medical} {Systems}, 2004. {CBMS} 2004. {Proceedings}. 17th {IEEE} {Symposium} on}, publisher = {IEEE}, author = {Yu, Hong and Kim, Won and Hatzivassiloglou, Vasileios and John Wilbur, W}, month = jun, year = {2004}, pages = {27--32}, }
Biomedical abbreviations and acronyms are widely used in biomedical literature. Since many abbreviations represent important content in biomedical literature, information retrieval and extraction benefits from identifying the meanings of biomedical abbreviations. Since many abbreviations are ambiguous, it would be important to map abbreviations to their full forms, which ultimately represent the meanings of the abbreviations. In this study, we present a novel unsupervised method that applies MEDLINE records as a knowledge source for disambiguating abbreviations in full-text biomedical journal articles. We first automatically generated from MEDLINE records a knowledge source or dictionary of abbreviation-full pairs. We then trained on MEDLINE records and predicted the full forms of abbreviations in full-text journal articles by applying supervised machine-learning algorithms in an unsupervised fashion. We report up to 92% prediction precision and up to 91% coverage.
2003
(1)
Extracting synonymous gene and protein terms from biological literature.
Yu, H.; and Agichtein, E.
Bioinformatics, 19(Suppl 1): i340–i349. July 2003.
Paper doi link bibtex abstract
@article{yu_extracting_2003, title = {Extracting synonymous gene and protein terms from biological literature}, volume = {19}, issn = {1367-4803, 1460-2059}, url = {https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btg1047}, doi = {10.1093/bioinformatics/btg1047}, abstract = {MOTIVATION: Genes and proteins are often associated with multiple names. More names are added as new functional or structural information is discovered. Because authors can use any one of the known names for a gene or protein, information retrieval and extraction would benefit from identifying the gene and protein terms that are synonyms of the same substance. RESULTS: We have explored four complementary approaches for extracting gene and protein synonyms from text, namely the unsupervised, partially supervised, and supervised machine-learning techniques, as well as the manual knowledge-based approach. We report results of a large scale evaluation of these alternatives over an archive of biological journal articles. Our evaluation shows that our extraction techniques could be a valuable supplement to resources such as SWISSPROT, as our systems were able to capture gene and protein synonyms not listed in the SWISSPROT database.}, language = {en}, number = {Suppl 1}, urldate = {2016-11-30}, journal = {Bioinformatics}, author = {Yu, H. and Agichtein, E.}, month = jul, year = {2003}, pmid = {12855479}, keywords = {Abstracting and Indexing/*methods/standards Acetaminophen Algorithms Biology/methods/standards Computational Biology/methods/standards Database Management Systems *Databases, Bibliographic Documentation *Genes Information Storage and Retrieval/methods/standards *Natural Language Processing *Periodicals *Proteins Research Support, Controlled, Non-P.H.S. *Terminology Vocabulary, U.S. Gov't}, pages = {i340--i349}, }
MOTIVATION: Genes and proteins are often associated with multiple names. More names are added as new functional or structural information is discovered. Because authors can use any one of the known names for a gene or protein, information retrieval and extraction would benefit from identifying the gene and protein terms that are synonyms of the same substance. RESULTS: We have explored four complementary approaches for extracting gene and protein synonyms from text, namely the unsupervised, partially supervised, and supervised machine-learning techniques, as well as the manual knowledge-based approach. We report results of a large scale evaluation of these alternatives over an archive of biological journal articles. Our evaluation shows that our extraction techniques could be a valuable supplement to resources such as SWISSPROT, as our systems were able to capture gene and protein synonyms not listed in the SWISSPROT database.
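To give a flavor of the unsupervised side, pattern-seeded extractors often start from surface cues that signal synonymy. The single pattern and sentence below are invented illustrations, far simpler than the paper's four approaches.

import re

# One of many surface patterns that can seed synonym extraction.
ALSO_KNOWN_AS = re.compile(r"([A-Za-z0-9-]+) \(also known as ([^)]+)\)")

def extract_synonyms(text):
    return ALSO_KNOWN_AS.findall(text)

print(extract_synonyms("The tumor suppressor p53 (also known as TP53) regulates the cell cycle."))
# [('p53', 'TP53')]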
2001
(1)
Knowledge-based disambiguation of abbreviations.
Yu, H.
In Proceedings of the AMIA Symposium, pages 1067, 2001.
Paper link bibtex
@inproceedings{yu_knowledge-based_2001, title = {Knowledge-based disambiguation of abbreviations}, url = {https://pantherfile.uwm.edu/hongyu/www/files/articles/D010001419.pdf}, booktitle = {Proceedings of the {AMIA} {Symposium}}, author = {Yu, Hong}, year = {2001}, pmcid = {PMC2243340}, pages = {1067}, }
2000
(1)
A large scale, cross-disease family health history data set.
Yu, H.; and Hripcsak, G.
Proceedings of the AMIA Symposium, 1162. 2000.
PMC2243911
Paper link bibtex
@article{yu_large_2000, title = {A large scale, cross-disease family health history data set}, url = {https://pantherfile.uwm.edu/hongyu/www/files/articles/D200262.pdf}, journal = {Proceedings of the AMIA Symposium}, author = {Yu, Hong and Hripcsak, George}, year = {2000}, note = {PMC2243911}, pages = {1162}, }
1999
(1)
Representing genomic knowledge in the UMLS semantic network.
Yu, H.; Friedman, C.; Rzhetsky, A.; and Kra, P.
Proceedings of the AMIA Symposium, 181. 1999.
Paper link bibtex abstract 1 download
@article{yu_representing_1999, title = {Representing genomic knowledge in the {UMLS} semantic network.}, url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2232882/}, abstract = {Genomics research has a significant impact on the understanding and treatment of human hereditary diseases, and biomedical literature concerning the genome project is becoming more and more important for clinicians. The Unified Medical Language System (UMLS) is designed to facilitate the retrieval and integration of information from multiple machine-readable biomedical information resources. This paper describes our efforts to integrate concepts important to genomics research with the UMLS semantic network. We found that the UMLS contains over 30 semantic types and most of the semantic relations that are essential for representing the underlying genomic knowledge. In addition, we observed that the organization of the network was appropriate for representing the hierarchical organization of the concepts. Because some of the concepts critical to the genomic domain were found to be missing, we propose to extend the network by adding six new semantic types and sixteen new semantic relations.}, journal = {Proceedings of the AMIA Symposium}, author = {Yu, Hong and Friedman, Carol and Rzhetsky, Andrey and Kra, Pauline}, year = {1999}, pmid = {10566345}, pmcid = {PMC2232882}, keywords = {*Genome, Human, Human, Semantics, *Unified Medical Language System, Vocabulary, Controlled}, pages = {181}, }
Genomics research has a significant impact on the understanding and treatment of human hereditary diseases, and biomedical literature concerning the genome project is becoming more and more important for clinicians. The Unified Medical Language System (UMLS) is designed to facilitate the retrieval and integration of information from multiple machine-readable biomedical information resources. This paper describes our efforts to integrate concepts important to genomics research with the UMLS semantic network. We found that the UMLS contains over 30 semantic types and most of the semantic relations that are essential for representing the underlying genomic knowledge. In addition, we observed that the organization of the network was appropriate for representing the hierarchical organization of the concepts. Because some of the concepts critical to the genomic domain were found to be missing, we propose to extend the network by adding six new semantic types and sixteen new semantic relations.
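As a rough illustration of the structure being extended, a semantic network can be modeled as an "isa" hierarchy over semantic types plus typed relations between types. This is a minimal Python sketch, not the UMLS distribution format; the "has_sequence" relation name is a hypothetical stand-in for the kind of genomic extension the paper proposes:

# "isa" links between semantic types (real UMLS type names used for flavor).
ISA = {
    "Gene or Genome": "Fully Formed Anatomical Structure",
    "Nucleotide Sequence": "Molecular Sequence",
}

# Typed relations between semantic types; "has_sequence" is hypothetical.
RELATIONS = {
    ("Gene or Genome", "Nucleotide Sequence"): "has_sequence",
}

def ancestors(semantic_type):
    """Walk the isa hierarchy upward from a semantic type."""
    chain = []
    while semantic_type in ISA:
        semantic_type = ISA[semantic_type]
        chain.append(semantic_type)
    return chain

print(ancestors("Gene or Genome"))
# -> ['Fully Formed Anatomical Structure']

In this representation, adding a semantic type is a new isa entry and adding a semantic relation is a new typed edge, which is the shape of the six types and sixteen relations the authors propose.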
1988
(1)
Sensitivity and Specificity of Three Methods of Detecting Adverse Drug Reactions.
Berry, L. L.; Segal, R.; Sherrin, T. P.; and Fudge, K. A.
American Journal of Hospital Pharmacy, 45(7): 1534–1539. July 1988.
Paper doi link bibtex abstract
@article{berry_sensitivity_1988, title = {Sensitivity and {Specificity} of {Three} {Methods} of {Detecting} {Adverse} {Drug} {Reactions}}, volume = {45}, issn = {0002-9289}, url = {https://academic.oup.com/ajhp/article/45/7/1534/5180571}, doi = {10.1093/ajhp/45.7.1534}, abstract = {The sensitivity and specificity of three methods of detecting adverse drug reactions (ADRs) were determined. Minimal use of a voluntary ADR reporting}, language = {en}, number = {7}, urldate = {2020-06-30}, journal = {American Journal of Hospital Pharmacy}, author = {Berry, Laura L. and Segal, Richard and Sherrin, Thomas P. and Fudge, Kathy A.}, month = jul, year = {1988}, pages = {1534--1539}, }
The sensitivity and specificity of three methods of detecting adverse drug reactions (ADRs) were determined. Minimal use of a voluntary ADR reporting
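For readers skimming the metrics, sensitivity and specificity are the standard confusion-matrix rates. These definitions are general, not taken from the paper, and the counts below are made up purely for illustration:

def sensitivity(tp, fn):
    """True positive rate: detected ADRs / all actual ADRs."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: correctly cleared cases / all non-ADR cases."""
    return tn / (tn + fp)

# Hypothetical counts for one detection method.
print(sensitivity(tp=40, fn=10))   # -> 0.8
print(specificity(tn=90, fp=10))   # -> 0.9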
Under review
(2)
Disparities in receipt of medications for opioid use disorder before and during the COVID-19 pandemic in the U.S. Veterans Health Administration.
Sung, M. L.; Leon, C.; Reisman, J. I.; Gordon, K. S.; Kerns, R. D.; Li, W.; Liu, W.; Mitra, A.; Yu, H.; and Becker, W. C.
Substance Use & Addiction Journal.
Revision in review; Dr. Sung and Dr. Leon are co-first authors.
link bibtex
@article{sung_disparities_nodate, title = {Disparities in receipt of medications for opioid use disorder before and during the {COVID}-19 pandemic in the {U}.{S}. {Veterans} {Health} {Administration}}, journal = {Substance Use \& Addiction Journal}, author = {Sung, Minhee L. and Leon, Casey and Reisman, Joel I. and Gordon, Kirsha S. and Kerns, Robert D. and Li, Wenjun and Liu, Weisong and Mitra, Avijit and Yu, Hong and Becker, William C.}, note = {Revision in review; Dr. Sung and Dr. Leon are co-first authors.}, }
Occurrence of opioid related neurocognitive symptoms associated with long-term opioid therapy.
Leon, C.; Sung, M. L.; Reisman, J. I.; Liu, W.; Kerns, R. D.; Gordon, K. S.; Mitra, A.; Kwon, S.; Yu, H.; Becker, W. C.; and Li, W.
Clinical Journal of Pain.
In review.
link bibtex
@article{leon_occurrence_nodate, title = {Occurrence of opioid related neurocognitive symptoms associated with long-term opioid therapy}, journal = {Clinical Journal of Pain}, author = {Leon, Casey and Sung, Minhee L. and Reisman, Joel I. and Liu, Weisong and Kerns, Robert D. and Gordon, Kirsha S. and Mitra, Avijit and Kwon, Sunjae and Yu, Hong and Becker, William C. and Li, Wenjun}, note = {In review.}, }