Healthcare LLM refers to a specialized large language model designed to optimize healthcare workflows through explainable reasoning and safe data generation. It assists with complex tasks like diagnostics, treatment planning, and patient management across multiple data modalities, such as text, images, and structured data.

The Healthcare Large Multimodal Models Project emphasizes transparency and safety, ensuring healthcare professionals can trust AI’s decisions. This will lead to improved efficiency, reduced errors, and enhanced patient outcomes across healthcare systems.

Posts

Analyzing Coreference and Bridging in Product Reviews

Product reviews may have complex discourse including coreference and bridging relations to a main product, competing products, and interacting products. Current approaches to aspect-based sentiment analysis (ABSA) and opinion summarization largely ignore this complexity. On the other hand, existing systems for coreference and bridging were trained in a different domain. We collect mention type annotations relevant to coreference and bridging for 498 product reviews. Using these annotations, we show that a state-of-the-art factuality score fails to catch coreference errors in product reviews, and that a state-of-the-art coreference system trained on OntoNotes does not perform nearly as well on product mentions. As our dataset grows, we expect it to help ABSA and opinion summarization systems to avoid entity reference errors.

Fast Few-shot Debugging for NLU Test Suites

We study few-shot debugging of transformer based natural language understanding models, using recently popularized test suites to not just diagnose but correct a problem. Given a few debugging examples of a certain phenomenon, and a held-out test set of the same phenomenon, we aim to maximize accuracy on the phenomenon at a minimal cost of accuracy on the original test set. We examine several methods that are faster than full epoch retraining. We introduce a new fast method, which samples a few in-danger examples from the original training set. Compared to fast methods using parameter distance constraints or Kullback-Leibler divergence, we achieve superior original accuracy for comparable debugging accuracy.

Team Papelo at FEVEROUS: Multi-hop Evidence Pursuit

We develop a system for the FEVEROUS fact extraction and verification task that ranks an initial set of potential evidence and then pursues missing evidence in subsequent hops by trying to generate it, with a “next hop prediction module” whose output is matched against page elements in a predicted article. Seeking evidence with the next hop prediction module continues to improve FEVEROUS score for up to seven hops. Label classification is trained on possibly incomplete extracted evidence chains, utilizing hints that facilitate numerical comparison. The system achieves .281 FEVEROUS score and .658 label accuracy on the development set, and finishes in second place with .259 FEVEROUS score and .576 label accuracy on the test set.

Retrieval, Analogy, and Composition: A framework for Compositional Generalization in Image Captioning

Image captioning systems are expected to have the ability to combine individual concepts when describing scenes with concept combinations that are not observed during training. In spite of significant progress in image captioning with the help of the autoregressive generation framework, current approaches fail to generalize well to novel concept combinations. We propose a new framework that revolves around probing several similar image caption training instances (retrieval), performing analogical reasoning over relevant entities in retrieved prototypes (analogy), and enhancing the generation process with reasoning outcomes (composition). Our method augments the generation model by referring to the neighboring instances in the training set to produce novel concept combinations in generated captions. We perform experiments on the widely used image captioning benchmarks. The proposed models achieve substantial improvement over the compared baselines on both composition-related evaluation metrics and conventional image captioning metrics.

Prediction of Non-Muscle Invasive Bladder Cancer Recurrence using Machine Learning of Quantitative Nuclear Features

Non-muscle invasive bladder cancer (NMIBC) generally has a good prognosis, however, recurrence after transurethral resection (TUR), the standard primary treatment, is a major problem. Clinical management after TUR has been based on risk classification using clinicopathological factors, but these classifications are not complete. In this study, we attempted to predict early recurrence of NMIBC based on machine learning of quantitative morphological features. In general, structural, cellular, and nuclear atypia are evaluated to determine cancer atypia. However, since it is difficult to accurately quantify structural atypia from TUR specimens, in this study, we used only nuclear atypia and analyzed it using feature extraction followed by classification using Support Vector Machine and Random Forest machine learning algorithms. For the analysis, 125 patients diagnosed with NMIBC were used, data from 95 patients were randomly selected for the training set, and data from 30 patients were randomly selected for the test set. The results showed that the support vector machine-based model predicted recurrence within 2 years after TUR with a probability of 90% and the random forest-based model with probability of 86.7%. In the future, the system can be used to objectively predict NMIBC recurrence after TUR.

Overcoming Poor Word Embeddings with Word Definitions

Modern natural language understanding models depend on pretrained subword embeddings, but applications may need to reason about words that were never or rarely seen during pretraining. We show that examples that depend critically on a rarer word are more challenging for natural language inference models. Then we explore how a model could learn to use definitions, provided in natural text, to overcome this handicap. Our model’s understanding of a definition is usually weaker than a well-modeled word embedding, but it recovers most of the performance gap from using a completely untrained word.

Prediction of Early Recurrence of Hepatocellular Carcinoma after Resection using Digital Pathology Images Assessed by Machine Learning

Hepatocellular carcinoma (HCC) is a representative primary liver cancer caused by long-term and repetitive liver injury. Surgical resection is generally selected as the radical cure treatment. Because the early recurrence of HCC after resection is associated with low overall survival, the prediction of recurrence after resection is clinically important. However, the pathological characteristics of the early recurrence of HCC have not yet been elucidated. We attempted to predict the early recurrence of HCC after resection based on digital pathologic images of hematoxylin and eosin-stained specimens and machine learning applying a support vector machine (SVM). The 158 HCC patients meeting the Milan criteria who underwent surgical resection were included in this study. The patients were categorized into three groups: Group I, patients with HCC recurrence within 1 year after resection (16 for training and 23 for test), Group II, patients with HCC recurrence between 1 and 2 years after resection (22 and 28), and Group III, patients with no HCC recurrence within 4 years after resection (31 and 38). The SVM-based prediction method separated the three groups with 89.9% (80/89) accuracy. Prediction of Groups I was consistent for all cases, while Group II was predicted to be Group III in one case, and Group III was predicted to be Group II in 8 cases. The use of digital pathology and machine learning could be used for highly accurate prediction of HCC recurrence after surgical resection, especially that for early recurrence. Currently, in most cases after HCC resection, regular blood tests and diagnostic imaging are used for follow-up observation, however, the use of digital pathology coupled with machine learning offers potential as a method for objective postoprative follow-up observation.

Improving Disentangled Text Representation Learning with Information Theoretical Guidance

Learning disentangled representations of natural language is essential for many NLP tasks, e.g., conditional text generation, style transfer, personalized dialogue systems, etc. Similar problems have been studied extensively for other forms of data, such as images and videos. However, the discrete nature of natural language makes the disentangling of textual representations more challenging (e.g., the manipulation over the data space cannot be easily achieved). Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text, without any supervision on semantics. A new mutual information upper bound is derived and leveraged to measure dependence between style and content. By minimizing this upper bound, the proposed method induces style and content embeddings into two independent low-dimensional spaces. Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation in terms of content and style preservation.

Generating Followup Questions for Interpretable Multi hop Question Answering

We propose a framework for answering open domain multi hop questions in which partial information is read and used to generate followup questions, to finally be answered by a pretrained single hop answer extractor. This framework makes each hop interpretable, and makes the retrieval associated with later hops as flexible and specific as for the first hop. As a first instantiation of this framework, we train a pointer generator network to predict followup questions based on the question and partial information. This provides a novel application of a neural question generation network, which is applied to give weak ground truth single hop followup questions based on the final answers and their supporting facts. Learning to generate followup questions that select the relevant answer spans against downstream supporting facts, while avoiding distracting premises, poses an exciting semantic challenge for text generation. We present an evaluation using the two hop bridge questions of HotpotQA

Contextual Grounding of Natural Language Phrases in Images

In this paper, we introduce a contextual grounding approach that captures the context in corresponding text entities and image regions to improve the grounding accuracy. Specifically, the proposed architecture accepts pre-trained text token embeddings and image object features from an off-the-shelf object detector as input. Additional encoding to capture the positional and spatial information can be added to enhance the feature quality. There are separate text and image branches facilitating respective architectural refinements for different modalities. The text branch is pre-trained on a large-scale masked language modeling task while the image branch is trained from scratch. Next, the model learns the contextual representations of the text tokens and image objects through layers of high-order interaction respectively. The final grounding head ranks the correspondence between the textual and visual representations through cross-modal interaction. In the evaluation, we show that our model achieves the state-of-the-art grounding accuracy of 71.36% over the Flickr30K Entities dataset. No additional pre-training is necessary to deliver competitive results compared with related work that often requires task-agnostic and task-specific pre-training on cross-modal datasets. The implementation is publicly available at https://gitlab.com/necla-ml/Grounding