christopher malon Archives | NEC Labs America

Christopher Malon is a Senior Researcher in the Machine Learning Department at NEC Laboratories America. He received his Bachelor’s degree in Mathematics from the University of Chicago and holds a PhD in Mathematics from the Massachusetts Institute of Technology. He focuses on building more trustworthy AI systems, leading projects in fact verification, opinion summarization, and hallucination avoidance in generative AI. In AI for healthcare, he contributed to NEC’s research and development of the world’s first automatic cancer screening system based on digital biopsy images, and now leads efforts at NEC on trustworthy medical report generation for multiple modalities. A key theme in his fact verification and medical report generation research is recognizing missing or incomplete information, to pinpoint which details need additional consideration by an AI agent or a human. With a strong track record in interdisciplinary collaboration, Dr. Malon’s research is helping NEC translate machine learning innovations into trustworthy AI partners to improve productivity in healthcare and other domains.

Posts

On Synthesizing Data for Context Attribution in Question Answering

April 7, 2025/in Publications/by NEC Labs America

Question Answering (QA) accounts for a significant portion of LLM usage “in the wild”. However, LLMs sometimes produce false or misleading responses, also known as “hallucinations”. Therefore, grounding the generated answers in contextually provided information — i.e., providing evidence for the generated text — is paramount for LLMs’ trustworthiness. Providing this information is the task of context attribution. In this paper, we systematically study LLM-based approaches for this task, namely we investigate (i) zero-shot inference, (ii) LLM ensembling, and (iii) fine-tuning of small LMs on synthetic data generated by larger LLMs. Our key contribution is SynQA: a novel generative strategy for synthesizing context attribution data. Given selected context sentences, an LLM generates QA pairs that are supported by these sentences. This leverages LLMs’ natural strengths in text generation while ensuring clear attribution paths in the synthetic training data. We show that the attribution data synthesized via SynQA is highly effective for fine-tuning small LMs for context attribution in different QA tasks and domains. Finally, with a user study, we validate the usefulness of small LMs (fine-tuned on synthetic data from SynQA) in context attribution for QA.

NEC Labs America Attends the 39th Annual AAAI Conference on Artificial Intelligence #AAAI25

February 27, 2025/in Events/by NEC Labs America

Our NEC Lab America team attended the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025. The purpose of the AAAI conference series was to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. Our team presented technical papers, led special tracks, delivered talks on key topics, participated in workshops, conducted tutorials, and showcased research in poster sessions. The team greeted visitors at Booth #208 and was there Thursday through Saturday.

Reducing Hallucinations of Medical Multimodal Large Language Models with Visual Retrieval-Augmented Generation

February 25, 2025/in Publications/by NEC Labs America

Multimodal Large Language Models (MLLMs) have shown impressive performance in vision and text tasks. However, hallucination remains a major challenge, especially in fields like healthcare where details are critical. In this work, we show how MLLMs may be enhanced to support Visual RAG (V-RAG), a retrieval-augmented generation framework that incorporates both text and visual data from retrieved images. On the MIMIC-CXR chest X-ray report generation and Multicare medical image caption generation datasets, we show that Visual RAG improves the accuracy of entity probing, which asks whether a medical entities is grounded by an image. We show that the improvements extend both to frequent and rare entities, the latter of which may have less positive training data. Downstream, we apply V-RAG with entity probing to correct hallucinations and generate more clinically accurate X-ray reports, obtaining a higher RadGraph-F1 score.

Multi-hop Evidence Pursuit Meets the Web: Team Papelo at FEVER 2024

November 15, 2024/in Publications/by NEC Labs America

Separating disinformation from fact on the web has long challenged both the search and the reasoning powers of humans. We show that the reasoning power of large language models (LLMs) and the retrieval power of modern search engines can be combined to automate this process and explainably verify claims. We integrate LLMs and search under a multi-hop evidence pursuit strategy. This strategy generates an initial question based on an input claim using a sequence to sequence model, searches and formulates an answer to the question, and iteratively generates follow-up questions to pursue the evidence that is missing using an LLM. We demonstrate our system on the FEVER 2024 (AVeriTeC) shared task. Compared to a strategy of generating all the questions at once, our method obtains .045 higher label accuracy and .155 higher AVeriTeC score (evaluating the adequacy of the evidence). Through ablations, we show the importance of various design choices, such as the question generation method, medium-sized context, reasoning with one document at a time, adding metadata, paraphrasing, reducing the problem to two classes, and reconsidering the final verdict. Our submitted system achieves .510 AVeriTeC score on the dev set and .477 AVeriTec score on the test set.

Exploring the Role of Reasoning Structures for Constructing Proofs in Multi-Step Natural Language Reasoning with Large Language Models

November 12, 2024/in Publications/by NEC Labs America

When performing complex multi-step reasoning tasks, the ability of Large Language Models (LLMs) to derive structured intermediate proof steps is important for ensuring that the models truly perform the desired reasoning and for improving models’ explainability. This paper is centered around a focused study: whether the current state-of-the-art generalist LLMs can leverage the structures in a few examples to better construct the proof structures with in-context learning. Our study specifically focuses on structure-aware demonstration and structure-aware pruning. We demonstrate that they both help improve performance. A detailed analysis is provided to help understand the results.

Introducing the Trustworthy Generative AI Project: Pioneering the Future of Compositional Generation and Reasoning

August 19, 2024/in News/by NEC Labs America

We are thrilled to announce the launch of our latest research initiative, the Trustworthy Generative AI Project. This ambitious project is set to revolutionize how we interact with multimodal content by developing cutting-edge generative models capable of compositional generation and reasoning across text, images, reports, and even 3D videos.

Self-Consistent Decoding for More Factual Open Responses

February 29, 2024/in Publications/by NEC Labs America

Self-consistency has emerged as a powerful method for improving the accuracy of short answers generated by large language models. As previously defined, it only concerns the accuracy of a final answer parsed from generated text. In this work, we extend the idea to open response generation, by integrating voting into the decoding method. Each output sentence is selected from among multiple samples, conditioning on the previous selections, based on a simple token overlap score. We compare this Sample & Select method to greedy decoding, beam search, nucleus sampling, and the recently introduced hallucination avoiding decoders of DoLa, P-CRR, and S-CRR. We show that Sample & Select improves factuality by a 30% relative margin against these decoders in NLI-based evaluation on the subsets of CNN/DM and XSum used in the FRANK benchmark, while maintaining comparable ROUGE-1 F1 scores against reference summaries. We collect human verifications of the generated summaries, confirming the factual superiority of our method.

Automatically Evaluating Opinion Prevalence in Opinion Summarization

August 7, 2023/in Publications/by NEC Labs America

When faced with a large number of product reviews, it is not clear that a human can remember all of them and weight opinions representatively to write a good reference summary. Wepropose an automatic metric to test the prevalence of the opinions that a summary expresses, based on counting the number of reviews that are consistent with each statement in the summary, while discrediting trivial or redundant statements. To formulate this opinion prevalence metric, we consider several existing methods to score the factual consistency of a summary statement with respect to each individual source review. On a corpus of Amazon product reviews, we gather multiple human judgments of the opinion consistency, to determinewhich automatic metric best expresses consistency in product reviews. Using the resulting opinion prevalence metric, we show that a human authored summary has only slightly betteropinion prevalence than randomly selected extracts from the source reviews, and previous extractive and abstractive unsupervised opinion summarization methods perform worse thanhumans. We demonstrate room for improvement with a greedy construction of extractive summaries with twice the opinion prevalence achieved by humans. Finally, we show that pre-processing source reviews by simplification can raise the opinion prevalence achieved by existing abstractive opinion summarization systems to the level of human performance

KGxBoard: Explainable and Interactive Leaderboard for Evaluation of Knowledge Graph Completion Models

December 7, 2022/in Publications/by NEC Labs America

Knowledge Graphs (KGs) store information in the form of (head, predicate, tail)-triples. To augment KGs with new knowledge, researchers proposed models for KG Completion (KGC) tasks such as link prediction, i.e., answering (h, p, ?) or (?, p, t) queries. Such models are usually evaluated with averaged metrics on a held-out test set. While useful for tracking progress, averaged single-score metrics cannotreveal what exactly a model has learned — or failed to learn. To address this issue, we propose KGxBoard: an interactive framework for performing fine-grained evaluation on meaningful subsets of the data, each of which tests individual and interpretable capabilities of a KGC model. In our experiments, we highlight the findings that we discovered with the use of KGxBoard, which would have been impossible to detect with standard averaged single-score metrics.

Analyzing Coreference and Bridging in Product Reviews

October 12, 2022/in Publications/by NEC Labs America

Product reviews may have complex discourse including coreference and bridging relations to a main product, competing products, and interacting products. Current approaches to aspect-based sentiment analysis (ABSA) and opinion summarization largely ignore this complexity. On the other hand, existing systems for coreference and bridging were trained in a different domain. We collect mention type annotations relevant to coreference and bridging for 498 product reviews. Using these annotations, we show that a state-of-the-art factuality score fails to catch coreference errors in product reviews, and that a state-of-the-art coreference system trained on OntoNotes does not perform nearly as well on product mentions. As our dataset grows, we expect it to help ABSA and opinion summarization systems to avoid entity reference errors.