Tag Archive for: christopher malon

Christopher Malon is a Senior Researcher in the Machine Learning Department at NEC Laboratories America. He received his Bachelor’s degree in Mathematics from the University of Chicago and holds a PhD in Mathematics from the Massachusetts Institute of Technology.

He focuses on building more trustworthy AI systems, leading projects in fact verification, opinion summarization, and hallucination avoidance in generative AI. In AI for healthcare, he contributed to NEC’s research and development of the world’s first automatic cancer screening system based on digital biopsy images, and now leads efforts at NEC on trustworthy medical report generation for multiple modalities. A key theme in his fact verification and medical report generation research is recognizing missing or incomplete information, to pinpoint which details need additional consideration by an AI agent or a human.

With a strong track record in interdisciplinary collaboration, Dr. Malon’s research is helping NEC translate machine learning innovations into trustworthy AI partners to improve productivity in healthcare and other domains.

Posts

EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation

December 20, 2025/in Publications/by NEC Labs America

Radiology report generation requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. Although recent innovations, particularly multimodal large language models, have shown improved performance, their supervised fine-tuning (SFT) objective is not explicitly aligned with clinical efficacy. In this work, we introduce EditGRPO, a mixed-policy reinforcement learning algorithm designed specifically to optimize the generation through clinically motivated rewards. EditGRPO integrates on-policy exploration with off-policy guidance by injecting sentence-level detailed corrections during training rollouts. This mixed-policy approach addresses the exploration dilemma and sampling efficiency issues typically encountered in RL. Applied to a Qwen2.5-VL-3B, EditGRPO outperforms both SFT and vanilla GRPO baselines, achieving an average improvement of 3.4% in clinical metrics across four major datasets. Notably, EditGRPO also demonstrates superior out-of-domain generalization, with an average performance gain of5.9% on unseen datasets.

DiscussLLM: Teaching Large Language Models When to Speak

August 25, 2025/in Publications/by NEC Labs America

Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text, yet they largely operate as reactive agents, responding only when directly prompted. This passivity creates an “awareness gap,” limiting their potential as truly collaborative partners in dynamic human discussions. We introduce , a framework designed to bridge this gap by training models to proactively decide not just to say, but critically, to speak. Our primary contribution is a scalable two-stage data generation pipeline that synthesizes a large-scale dataset of realistic multi-turn human discussions. Each discussion is annotated with one of five intervention types (e.g., Factual Correction, Concept Definition) and contains an explicit conversational trigger where an AI intervention adds value. By training models to predict a special silent token when no intervention is needed, they learn to remain quiet until a helpful contribution can be made. We explore two architectural baselines: an integrated end-to-end model and a decoupled classifier-generator system optimized for low-latency inference. We evaluate these models on their ability to accurately time interventions and generate helpful responses, paving the way for more situationally aware and proactive conversational AI.

On Synthesizing Data for Context Attribution in Question Answering

August 1, 2025/in Publications/by NEC Labs America

Question Answering (QA) accounts for a significantportion of LLM usage “in the wild”.However, LLMs sometimes produce false ormisleading responses, also known as hallucinations.Therefore, grounding the generatedanswers in contextually provided information—i.e., providing evidence for the generated text—is paramount for LLMs’ trustworthiness. Providingthis information is the task of context attribution.In this paper, we systematically studyLLM-based approaches for this task, namelywe investigate (i) zero-shot inference, (ii) LLMensembling, and (iii) fine-tuning of small LMson synthetic data generated by larger LLMs.Our key contribution is SYNQA: a novel generativestrategy for synthesizing context attributiondata. Given selected context sentences, anLLM generates QA pairs that are supported bythese sentences. This leverages LLMs’ naturalstrengths in text generation while ensuring clearattribution paths in the synthetic training data.We show that the attribution data synthesizedvia SYNQA is highly effective for fine-tuningsmall LMs for context attribution in differentQA tasks and domains. Finally, with a userstudy, we validate the usefulness of small, efficientLMs (fine-tuned on synthetic data fromSYNQA) in context attribution for QA.

NEC Labs America Attends the 39th Annual AAAI Conference on Artificial Intelligence #AAAI25

February 27, 2025/in Events/by NEC Labs America

Our NEC Lab America team attended the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025. The purpose of the AAAI conference series was to promote research in AI and foster scientific exchange between researchers, practitioners, scientists, students, and engineers.

Reducing Hallucinations of Medical Multimodal Large Language Models with Visual Retrieval-Augmented Generation

February 25, 2025/in Publications/by NEC Labs America

Multimodal Large Language Models (MLLMs) have shown impressive performance in vision and text tasks. However, hallucination remains a major challenge, especially in fields like healthcare where details are critical. In this work, we show how MLLMs may be enhanced to support Visual RAG (V-RAG), a retrieval-augmented generation framework that incorporates both text and visual data from retrieved images. On the MIMIC-CXR chest X-ray report generation and Multicare medical image caption generation datasets, we show that Visual RAG improves the accuracy of entity probing, which asks whether a medical entities is grounded by an image. We show that the improvements extend both to frequent and rare entities, the latter of which may have less positive training data. Downstream, we apply V-RAG with entity probing to correct hallucinations and generate more clinically accurate X-ray reports, obtaining a higher RadGraph-F1 score.

Multi-hop Evidence Pursuit Meets the Web: Team Papelo at FEVER 2024

November 15, 2024/in Publications/by NEC Labs America

Separating disinformation from fact on the web has long challenged both the search and the reasoning powers of humans. We show that the reasoning power of large language models (LLMs) and the retrieval power of modern search engines can be combined to automate this process and explainably verify claims. We integrate LLMs and search under a multi-hop evidence pursuit strategy. This strategy generates an initial question based on an input claim using a sequence to sequence model, searches and formulates an answer to the question, and iteratively generates follow-up questions to pursue the evidence that is missing using an LLM. We demonstrate our system on the FEVER 2024 (AVeriTeC) shared task. Compared to a strategy of generating all the questions at once, our method obtains .045 higher label accuracy and .155 higher AVeriTeC score (evaluating the adequacy of the evidence). Through ablations, we show the importance of various design choices, such as the question generation method, medium-sized context, reasoning with one document at a time, adding metadata, paraphrasing, reducing the problem to two classes, and reconsidering the final verdict. Our submitted system achieves .510 AVeriTeC score on the dev set and .477 AVeriTec score on the test set.

Exploring the Role of Reasoning Structures for Constructing Proofs in Multi-Step Natural Language Reasoning with Large Language Models

November 12, 2024/in Publications/by NEC Labs America

When performing complex multi-step reasoning tasks, the ability of Large Language Models (LLMs) to derive structured intermediate proof steps is important for ensuring that the models truly perform the desired reasoning and for improving models’ explainability. This paper is centered around a focused study: whether the current state-of-the-art generalist LLMs can leverage the structures in a few examples to better construct the proof structures with in-context learning. Our study specifically focuses on structure-aware demonstration and structure-aware pruning. We demonstrate that they both help improve performance. A detailed analysis is provided to help understand the results.

Introducing the Trustworthy Generative AI Project: Pioneering the Future of Compositional Generation and Reasoning

August 19, 2024/in News/by NEC Labs America

We are thrilled to announce the launch of our latest research initiative, the Trustworthy Generative AI Project. This ambitious project is set to revolutionize how we interact with multimodal content by developing cutting-edge generative models capable of compositional generation and reasoning across text, images, reports, and even 3D videos.

Self-Consistent Decoding for More Factual Open Responses

February 29, 2024/in Publications/by NEC Labs America

Self-consistency has emerged as a powerful method for improving the accuracy of short answers generated by large language models. As previously defined, it only concerns the accuracy of a final answer parsed from generated text. In this work, we extend the idea to open response generation, by integrating voting into the decoding method. Each output sentence is selected from among multiple samples, conditioning on the previous selections, based on a simple token overlap score. We compare this Sample & Select method to greedy decoding, beam search, nucleus sampling, and the recently introduced hallucination avoiding decoders of DoLa, P-CRR, and S-CRR. We show that Sample & Select improves factuality by a 30% relative margin against these decoders in NLI-based evaluation on the subsets of CNN/DM and XSum used in the FRANK benchmark, while maintaining comparable ROUGE-1 F1 scores against reference summaries. We collect human verifications of the generated summaries, confirming the factual superiority of our method.

Automatically Evaluating Opinion Prevalence in Opinion Summarization

August 7, 2023/in Publications/by NEC Labs America

When faced with a large number of product reviews, it is not clear that a human can remember all of them and weight opinions representatively to write a good reference summary. Wepropose an automatic metric to test the prevalence of the opinions that a summary expresses, based on counting the number of reviews that are consistent with each statement in the summary, while discrediting trivial or redundant statements. To formulate this opinion prevalence metric, we consider several existing methods to score the factual consistency of a summary statement with respect to each individual source review. On a corpus of Amazon product reviews, we gather multiple human judgments of the opinion consistency, to determinewhich automatic metric best expresses consistency in product reviews. Using the resulting opinion prevalence metric, we show that a human authored summary has only slightly betteropinion prevalence than randomly selected extracts from the source reviews, and previous extractive and abstractive unsupervised opinion summarization methods perform worse thanhumans. We demonstrate room for improvement with a greedy construction of extractive summaries with twice the opinion prevalence achieved by humans. Finally, we show that pre-processing source reviews by simplification can raise the opinion prevalence achieved by existing abstractive opinion summarization systems to the level of human performance

Posts

DiscussLLM: Teaching Large Language Models When to Speak

On Synthesizing Data for Context Attribution in Question Answering

NEC Labs America Attends the 39th Annual AAAI Conference on Artificial Intelligence #AAAI25

Reducing Hallucinations of Medical Multimodal Large Language Models with Visual Retrieval-Augmented Generation

Multi-hop Evidence Pursuit Meets the Web: Team Papelo at FEVER 2024

Exploring the Role of Reasoning Structures for Constructing Proofs in Multi-Step Natural Language Reasoning with Large Language Models

Introducing the Trustworthy Generative AI Project: Pioneering the Future of Compositional Generation and Reasoning

Self-Consistent Decoding for More Factual Open Responses

Automatically Evaluating Opinion Prevalence in Opinion Summarization

Contact Us

About Us

Our Pages

Recent Publications

Events

News