Not all supervision signals are equal: Beta-KD learns to reweight them during knowledge transfer.

Machine learning is quietly shifting from learning from data alone to learning from both data and large teacher models. Beta-KD asks whether uncertainty can tell a multimodal LLM when to trust the teacher versus the data.

Training Small AI Models Without Blindly Trusting Big Teacher Models

For decades, machine learning was framed around a simple idea: models learn from data. A quiet paradigm shift is now underway. Smaller AI models increasingly learn from both data and large teacher models, whose outputs and internal representations can guide compact models toward stronger performance. This teacher-student process, known as knowledge distillation, is becoming central to efficient and specialized AI.

But the new paradigm introduces a new challenge: not every signal deserves the same level of trust.

In multimodal AI, a student model may learn from ground-truth labels, a teacher’s probability distribution, and feature-level alignment between teacher and student representations. These signals can disagree. Some data examples are noisy. Some teacher outputs are uncertain. Some teacher signals may be too hard or too misleading for a much smaller student to copy.

A new paper from researchers at NEC Laboratories America, the University at Buffalo, SUNY, and Rutgers University addresses this challenge. The paper, “Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models,” introduces Beta-KD, a framework for training compact multimodal models without blindly trusting a larger teacher. The work is authored by Jingchen Sun, University at Buffalo, SUNY; Shaobo Han, NEC Laboratories America, Inc.; Deep Patel, NEC Laboratories America, Inc.; Wataru Kohno, NEC Laboratories America, Inc., and NEC Corporation; Can Jin, Rutgers University; and Changyou Chen, University at Buffalo, SUNY. The paper will be presented by Jingchen Sun, Shaobo Han, and Deep Patel at CVPR 2026 next week in Denver.

How Beta-KD Works

Beta-KD formulates knowledge distillation as Bayesian inference with uncertainty-aware weighting. Instead of assigning fixed manual weights to each training objective, it learns how strongly each supervision signal should shape the student during knowledge transfer.

At the center of the method is beta, a precision-like weight that reflects how much the student should trust a teacher signal. A larger beta pushes the student to align more closely with the teacher. A smaller beta softens that teacher signal and gives more room for the data objective. This turns distillation from a fixed recipe into an adaptive trust mechanism.
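
To make the role of beta concrete, here is a minimal sketch in plain Python. It assumes a common precision-weighted form, beta * loss - log(beta), borrowed from the broader uncertainty-weighting literature. This post does not state Beta-KD's exact objective, so the formula, the function names, and the loss values below are illustrative assumptions rather than the paper's method.

```python
import math


def precision_weighted_loss(loss: float, beta: float) -> float:
    """Assumed form: beta scales the loss; -log(beta) keeps beta from collapsing to 0."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return beta * loss - math.log(beta)


def best_beta(loss: float) -> float:
    """Closed-form minimizer of beta * loss - log(beta) over beta > 0, i.e. 1 / loss."""
    return 1.0 / loss


# Hypothetical distillation losses for a signal the student can match well
# and for one it cannot (values invented for illustration).
reliable_kd_loss = 0.2
unreliable_kd_loss = 2.0

beta_reliable = best_beta(reliable_kd_loss)
beta_unreliable = best_beta(unreliable_kd_loss)

print(f"beta for reliable signal:   {beta_reliable:.2f}")
print(f"beta for unreliable signal: {beta_unreliable:.2f}")
```

Under this assumed form, a signal that stays hard to match ends up with a smaller beta, which is one way a learned weight can soften an unreliable teacher signal instead of forcing the student to copy it.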

Beta-KD supports two forms of uncertainty. Task-level weighting learns shared weights for supervision channels, such as data loss, output-level distillation, and feature-level distillation. Instance-level weighting goes further by predicting a sample-specific weight, so noisy, ambiguous, or more informative examples can be treated differently during training.
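
The difference between the two forms can be sketched in a few lines of plain Python. Everything here is hypothetical: the channel names, the weight values, the `InstanceBetaHead` class, and its layer sizes are placeholders chosen for illustration, and the sketch omits training and any regularization on the weights. It only shows where a shared weight versus a per-sample weight would enter the loss.

```python
import math
import random


def softplus(x: float) -> float:
    """Smooth map from any real number to a strictly positive value."""
    return math.log1p(math.exp(x))


# Task-level: one shared weight per supervision channel (hypothetical values;
# in practice these would be learnable scalars).
task_betas = {"data": 1.0, "output_kd": 0.5, "feature_kd": 0.25}


def task_level_total(losses: dict[str, float]) -> float:
    """Combine per-channel losses using the shared channel weights."""
    return sum(task_betas[name] * value for name, value in losses.items())


class InstanceBetaHead:
    """Two-layer network mapping one sample's feature vector to a positive beta."""

    def __init__(self, in_dim: int, hidden_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def __call__(self, features: list[float]) -> float:
        # Hidden layer with ReLU, then a scalar output passed through softplus.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return softplus(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)


# Instance-level: each sample gets its own weight on its distillation loss.
head = InstanceBetaHead(in_dim=4, hidden_dim=8)
batch_features = [[0.1, 0.2, 0.3, 0.4], [1.0, -1.0, 0.5, 0.0]]
batch_kd_losses = [0.7, 1.9]  # hypothetical per-sample distillation losses

instance_betas = [head(f) for f in batch_features]
weighted_kd = [b * l for b, l in zip(instance_betas, batch_kd_losses)]

total = task_level_total({"data": 1.0, "output_kd": 2.0, "feature_kd": 4.0})
```

The softplus output is one simple way to keep a predicted weight positive. Whether Beta-KD uses this particular parameterization is not stated in this post.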

The approach is designed to be practical. The task-level version adds only a few learnable scalar parameters. The instance-level version uses a lightweight two-layer network whose parameters amount to roughly 0.03 percent of the 1.67B-parameter student backbone, with negligible memory and training-time overhead.
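
As a quick sanity check on that overhead figure, the arithmetic can be worked out directly. The backbone size and the percentage come from the text above. The layer dimensions in the second half are hypothetical, picked only to show that a two-layer network of plausible size lands in the same range. They are not the dimensions used in the paper.

```python
STUDENT_PARAMS = 1_670_000_000   # 1.67B-parameter student backbone (from the text)
OVERHEAD_PER_TEN_THOUSAND = 3    # 0.03 percent is 3 parts per 10,000

# Integer arithmetic avoids floating-point rounding.
extra_params = STUDENT_PARAMS * OVERHEAD_PER_TEN_THOUSAND // 10_000
print(f"{extra_params:,} extra parameters")


def two_layer_mlp_params(in_dim: int, hidden_dim: int, out_dim: int = 1) -> int:
    """Weights plus biases for a two-layer fully connected network."""
    first_layer = in_dim * hidden_dim + hidden_dim
    second_layer = hidden_dim * out_dim + out_dim
    return first_layer + second_layer


# Hypothetical sizes: a 2048-dimensional input and a 240-unit hidden layer.
example_params = two_layer_mlp_params(in_dim=2048, hidden_dim=240)
print(f"{example_params:,} parameters in the example network")
```

So 0.03 percent of 1.67B is about half a million parameters, which is small next to the backbone itself.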

Results

The team evaluated Beta-KD on multimodal visual question-answering benchmarks, where a model must understand both an image and a natural-language question. They distilled a 1.7B-parameter MobileVLM V2 student from a 7B-parameter MobileVLM V2 teacher and tested the student across benchmarks, including ScienceQA, GQA, TextVQA, POPE, MMBench, and MME.

The gains were consistent. On ScienceQA, instance-level Beta-KD improved VQA accuracy by up to 4.7 percentage points over a fixed-weight baseline. Across six benchmarks, Beta-KD improved the strongest existing distillation setup by an average of 2.0 points. Training analysis also showed faster convergence, smoother optimization, and closer alignment between teacher and student behavior.

Why This Matters Beyond Research

Knowledge distillation matters for two reasons: efficiency and specialization. It can compress the capabilities of a large general-purpose teacher into a smaller student model and help adapt that student for downstream real-world applications. That is especially important for mobile devices, embedded systems, robots, cameras, medical devices, and industrial monitoring platforms, where latency, bandwidth, privacy, memory, and energy constraints often make large cloud-only models impractical.

Beta-KD tackles one of the bottlenecks in making capable multimodal models deployable: how to transfer knowledge from a powerful teacher into a compact student without blindly trusting every teacher signal. By learning how to reweight supervision during knowledge transfer, Beta-KD makes distillation more automatic, more robust, and more practical for real-world model compression.

Code for Beta-KD is publicly available at https://github.com/Jingchensun/beta-kd.
