Not all supervision signals are equal: Beta-KD learns to reweight them during knowledge transfer.
Machine learning is quietly shifting from learning from data alone to learning from both data and large teacher models. Beta-KD asks whether uncertainty can tell a multimodal LLM when to trust the teacher versus the data.
For decades, machine learning was framed around a simple idea: models learn from data. A quiet paradigm shift is now underway. Smaller AI models increasingly learn from both data and large teacher models, whose outputs and internal representations can guide compact models toward stronger performance. This teacher-student process, known as knowledge distillation, is becoming central to efficient and specialized AI.
But the new paradigm introduces a new challenge: not every signal deserves the same level of trust.
In multimodal AI, a student model may learn from ground-truth labels, a teacher’s probability distribution, and feature-level alignment between teacher and student representations. These signals can disagree. Some data examples are noisy. Some teacher outputs are uncertain. Some teacher signals may be too hard or too misleading for a much smaller student to copy.
A new paper from researchers at NEC Laboratories America, the University at Buffalo, SUNY, and Rutgers University addresses this challenge. The paper, "Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models," introduces Beta-KD, a framework for training compact multimodal models without blindly trusting a larger teacher. The work is authored by Jingchen Sun (University at Buffalo, SUNY), Shaobo Han (NEC Laboratories America, Inc.), Deep Patel (NEC Laboratories America, Inc.), Wataru Kohno (NEC Laboratories America, Inc., and NEC Corporation), Can Jin (Rutgers University), and Changyou Chen (University at Buffalo, SUNY). The paper will be presented by Jingchen Sun, Shaobo Han, and Deep Patel at CVPR 2026 next week in Denver.
How Beta-KD Works
Beta-KD formulates knowledge distillation as Bayesian inference with uncertainty-aware weighting. Instead of assigning fixed manual weights to each training objective, it learns how strongly each supervision signal should shape the student during knowledge transfer.
At the center of the method is beta, a precision-like weight that reflects how much the student should trust a teacher signal. A larger beta pushes the student to align more closely with the teacher. A smaller beta softens that teacher signal and gives more room for the data objective. This turns distillation from a fixed recipe into an adaptive trust mechanism.
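To make the role of beta concrete, here is a minimal sketch in Python. It assumes a common form of uncertainty-based loss weighting, beta * loss - log(beta), which may differ from the exact objective in the paper. The function names and the loss values are hypothetical and chosen only for illustration.

```python
import math

def weighted_objective(losses, log_betas):
    """Combine per-channel losses with learnable precision-like weights.

    Assumed form (not taken from the paper): sum_i beta_i * L_i - log(beta_i),
    with beta_i = exp(log_beta_i) so that every weight stays positive.
    """
    total = 0.0
    for loss, log_beta in zip(losses, log_betas):
        beta = math.exp(log_beta)
        # The -log(beta) term stops the optimizer from driving beta to zero.
        total += beta * loss - log_beta
    return total

def best_beta(loss):
    """Closed-form minimizer of beta * loss - log(beta) for a fixed loss."""
    return 1.0 / loss

# Hypothetical loss values for three supervision channels.
channel_losses = {"data": 0.5, "output_kd": 2.0, "feature_kd": 4.0}
betas = {name: best_beta(value) for name, value in channel_losses.items()}
```

Under this assumed objective, a channel whose loss stays high ends up with a small beta, so it pulls less on the student, while a channel the student can fit well earns a larger beta. In practice the weights would be learned by gradient descent alongside the student rather than set in closed form.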
Beta-KD supports two forms of uncertainty. Task-level weighting learns shared weights for supervision channels, such as data loss, output-level distillation, and feature-level distillation. Instance-level weighting goes further by predicting a sample-specific weight, so noisy, ambiguous, or more informative examples can be treated differently during training.
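Instance-level weighting can be pictured with a small sketch like the one below. It is an illustration under stated assumptions, not the authors' implementation: the predictor here is a tiny two-layer network written in plain Python, the softplus output is one possible way to keep beta positive, and the feature vectors, layer sizes, and loss values are all made up.

```python
import math
import random

def softplus(x):
    """Smooth positive mapping, written to stay stable for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def make_weight_net(in_dim, hidden_dim, seed=0):
    """Randomly initialise a hypothetical two-layer weight predictor."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
          for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
    b2 = 0.0
    return w1, b1, w2, b2

def predict_beta(features, net):
    """Map one sample's features to a positive, sample-specific beta."""
    w1, b1, w2, b2 = net
    # Hidden layer with ReLU activation.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    # Scalar output passed through softplus so that beta > 0.
    return softplus(sum(w * h for w, h in zip(w2, hidden)) + b2)

def instance_weighted_loss(kd_losses, batch_features, net):
    """Batch mean of beta_n * loss_n - log(beta_n) (assumed form)."""
    total = 0.0
    for loss, features in zip(kd_losses, batch_features):
        beta = predict_beta(features, net)
        total += beta * loss - math.log(beta)
    return total / len(kd_losses)

# Hypothetical features and distillation losses for a batch of two samples.
net = make_weight_net(in_dim=4, hidden_dim=8)
batch_features = [[0.2, -1.0, 0.5, 0.0], [1.5, 0.3, -0.7, 2.0]]
batch_betas = [predict_beta(f, net) for f in batch_features]
loss = instance_weighted_loss([0.8, 2.5], batch_features, net)
```

The difference from task-level weighting is that beta is now a function of each sample, so two examples in the same batch can receive different amounts of trust in the teacher signal.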
The approach is designed to be practical. The task-level version adds only a few learnable scalar parameters. The instance-level version uses a lightweight two-layer network whose parameters amount to roughly 0.03 percent of the 1.67B-parameter student backbone, with negligible memory and training-time overhead.
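As a quick back-of-the-envelope check, the 0.03 percent figure can be turned into an absolute parameter count. The snippet below uses only the two numbers quoted above. The result is an estimate, since the percentage is described as approximate.

```python
backbone_params = 1.67e9          # student backbone size quoted above
overhead_fraction = 0.03 / 100    # "roughly 0.03 percent"

# Estimated size of the instance-level weighting network.
extra_params = backbone_params * overhead_fraction
print(f"about {extra_params:,.0f} extra parameters")
```

That works out to roughly half a million additional parameters, which is small next to a backbone of well over a billion.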
Results
The team evaluated Beta-KD on multimodal visual question-answering benchmarks, where a model must understand both an image and a natural-language question. They distilled a 1.7B-parameter MobileVLM V2 student from a 7B-parameter MobileVLM V2 teacher and tested the student across benchmarks, including ScienceQA, GQA, TextVQA, POPE, MMBench, and MME.
The gains were consistent. On ScienceQA, instance-level Beta-KD improved VQA accuracy by up to 4.7 percentage points over a fixed-weight baseline. Across six benchmarks, Beta-KD outperformed the strongest existing distillation setup by an average of 2.0 points. Training analysis also showed faster convergence, smoother optimization, and closer alignment between teacher and student behavior.
Why This Matters Beyond Research
Knowledge distillation matters for two reasons: efficiency and specialization. It can compress the capabilities of a large general-purpose teacher into a smaller student model and help adapt that student for downstream real-world applications. That is especially important for mobile devices, embedded systems, robots, cameras, medical devices, and industrial monitoring platforms, where latency, bandwidth, privacy, memory, and energy constraints often make large cloud-only models impractical.
Beta-KD tackles one of the bottlenecks in making capable multimodal models deployable: how to transfer knowledge from a powerful teacher into a compact student without blindly trusting every teacher signal. By learning how to reweight supervision during knowledge transfer, Beta-KD makes distillation more automatic, more robust, and more practical for real-world model compression.
Code for Beta-KD is publicly available at https://github.com/Jingchensun/beta-kd.
About The Authors
Shaobo Han is a Senior Researcher in the Optical Networking and Sensing Department at NEC Laboratories America in Princeton, NJ. He received his Ph.D. in Electrical and Computer Engineering and his M.S. in Statistical Science from Duke University, where his research focused on probabilistic modeling, transfer learning, and structured variational inference. He also earned an M.Eng. degree in Signal and Information Processing from the University of Chinese Academy of Sciences. At NEC, Dr. Han has been prototyping and delivering advanced algorithmic solutions for real-world applications of sensing AI.
Deep Patel is a Senior Associate Researcher in the Machine Learning Department at NEC Laboratories America in Princeton, NJ. He earned his Bachelor of Science (BS) in Computer Science from Towson University. At NEC, Deep contributes to platforms for intelligent visual analytics, visual search, and vision-language interaction, helping develop video-based reasoning models that operate in real time across multi-camera systems. His work includes optimizing neural architectures for embedded systems and designing scalable inference pipelines for video AI applications.
Wataru Kohno was a Researcher in the Optical Networking and Sensing Department at NEC Laboratories America and is now at NEC Corporation in Tokyo. He earned his Ph.D. in Physics from Hokkaido University in Japan, where he built a strong foundation in the physics and engineering principles underlying optical communications. At NEC Laboratories America, he has contributed to multiple cutting-edge projects that expand the capabilities of distributed acoustic sensing. His recent work includes the development of advanced vibrometry techniques, recognition systems that adapt fiber sensing for downstream applications, and AI-enhanced methods for real-time detection in critical infrastructure such as power grids.
Publication-to-Blog Post Series
Our Publication-to-Blog Post Series highlights the real-world impact of our latest research, translating complex innovations into practical applications. From AI and machine learning to optical networking and intelligent systems, we showcase how our work goes beyond theory to address real-world challenges. Explore how cutting-edge research at NEC Laboratories America is driving measurable outcomes across industries.







