Training Small AI Models Without Blindly Trusting Big Teacher Models

May 27, 2026 | by NEC Labs America | in News | tags: bayesian inference, beta kd, can jin, changyou chen, cvpr 2026, deep patel, edge ai, efficient ai, jingchen sun, knowledge distillation, machine learning, ml publication, model compression, multimodal llms, optical networking sensing, pub to news series, rutgers university, shaobo han, small ai models, uncertainty aware learning, university at buffalo, vision language models, wataru kohno

Not all supervision signals are equal: Beta-KD learns to reweight them during knowledge transfer.

Machine learning is quietly shifting from learning from data alone to learning from both data and large teacher models. Beta-KD asks whether uncertainty can tell a multimodal LLM when to trust the teacher versus the data.

For decades, machine learning was framed around a simple idea: models learn from data. A quiet paradigm shift is now underway. Smaller AI models increasingly learn from both data and large teacher models, whose outputs and internal representations can guide compact models toward stronger performance. This teacher-student process, known as knowledge distillation, is becoming central to efficient and specialized AI.

But the new paradigm introduces a new challenge: not every signal deserves the same level of trust.

In multimodal AI, a student model may learn from ground-truth labels, a teacher’s probability distribution, and feature-level alignment between teacher and student representations. These signals can disagree. Some data examples are noisy. Some teacher outputs are uncertain. Some teacher signals may be too hard or too misleading for a much smaller student to copy.

A new paper from researchers in the Machine Learning department at NEC Laboratories America, in collaboration with researchers at the University at Buffalo, SUNY, and Rutgers University, addresses this challenge. "Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models" introduces Beta-KD, a framework for training compact multimodal models without blindly trusting a larger teacher. The work is authored by Jingchen Sun, University at Buffalo, SUNY; Shaobo Han, NEC Laboratories America, Inc.; Deep Patel, NEC Laboratories America, Inc.; Wataru Kohno, NEC Laboratories America, Inc., and NEC Corporation; Can Jin, Rutgers University; and Changyou Chen, University at Buffalo, SUNY. The paper will be presented by Jingchen Sun, Shaobo Han, and Deep Patel at CVPR 2026 next week in Denver.

How Beta-KD Works

Beta-KD formulates knowledge distillation as Bayesian inference with uncertainty-aware weighting. Instead of assigning fixed manual weights to each training objective, it learns how strongly each supervision signal should shape the student during knowledge transfer.

At the center of the method is beta, a precision-like weight that reflects how much the student should trust a teacher signal. A larger beta pushes the student to align more closely with the teacher. A smaller beta softens that teacher signal and gives more room for the data objective. This turns distillation from a fixed recipe into an adaptive trust mechanism.
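To make the role of beta concrete, here is a minimal Python sketch. It assumes a generic precision-weighted objective of the kind used in uncertainty-based loss weighting: the teacher term is scaled by beta, and a log penalty keeps beta from shrinking to zero. The function names, the exact loss form, and the toy loss values are illustrative assumptions, not the formulation from the paper.

```python
import math

def weighted_objective(data_loss, kd_loss, log_beta):
    """Data loss plus a precision-weighted distillation term.

    beta = exp(log_beta) scales the teacher term, and the -log_beta
    penalty stops beta from collapsing to zero. This form is an
    assumption for illustration, not the paper's exact objective.
    """
    beta = math.exp(log_beta)
    return data_loss + beta * kd_loss - log_beta

def fit_log_beta(kd_loss, steps=500, lr=0.05):
    """Plain gradient descent on log_beta for a fixed distillation loss."""
    log_beta = 0.0
    for _ in range(steps):
        # Derivative of (beta * kd_loss - log_beta) with respect to log_beta.
        grad = math.exp(log_beta) * kd_loss - 1.0
        log_beta -= lr * grad
    return log_beta

# Toy values: a teacher signal the student matches well vs. one it matches poorly.
beta_easy = math.exp(fit_log_beta(kd_loss=0.5))
beta_hard = math.exp(fit_log_beta(kd_loss=4.0))
print(f"beta for low distillation loss:  {beta_easy:.3f}")
print(f"beta for high distillation loss: {beta_hard:.3f}")
```

In this toy setting the learned beta settles near the inverse of the distillation loss, so a teacher signal the student can match well ends up with a larger weight than one it cannot.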

Beta-KD supports two forms of uncertainty. Task-level weighting learns shared weights for supervision channels, such as data loss, output-level distillation, and feature-level distillation. Instance-level weighting goes further by predicting a sample-specific weight, so noisy, ambiguous, or more informative examples can be treated differently during training.
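The difference between the two weighting modes can be sketched in a few lines of Python. This is a hedged illustration, not the released implementation: the channel names, the loss form, the `InstanceWeightNet` class, and the toy feature vectors are all hypothetical, and the sketch only computes the losses without training anything.

```python
import math
import random

def softplus(x):
    """Smooth map from any real number to a positive value."""
    return math.log1p(math.exp(x))

def task_level_loss(losses, log_beta):
    """One shared weight per supervision channel (hypothetical form)."""
    return sum(math.exp(log_beta[name]) * loss - log_beta[name]
               for name, loss in losses.items())

class InstanceWeightNet:
    """Tiny two-layer network mapping per-sample features to a positive beta."""

    def __init__(self, in_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def __call__(self, x):
        # ReLU hidden layer, then softplus so the predicted beta is positive.
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return softplus(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)

def instance_level_loss(kd_losses, features, net):
    """Average distillation loss with a separate beta for each sample."""
    terms = []
    for loss, x in zip(kd_losses, features):
        beta = net(x)
        terms.append(beta * loss - math.log(beta))
    return sum(terms) / len(terms)

# Task level: three channels, each with one learnable log-weight.
channel_losses = {"data": 1.0, "logits": 2.0, "features": 0.5}
task_log_beta = {name: 0.0 for name in channel_losses}
task_total = task_level_loss(channel_losses, task_log_beta)

# Instance level: one beta per sample, predicted from toy feature vectors.
net = InstanceWeightNet(in_dim=3, hidden_dim=4)
sample_features = [[0.2, -1.0, 0.5], [1.5, 0.3, -0.7]]
sample_betas = [net(x) for x in sample_features]
instance_total = instance_level_loss([0.8, 2.5], sample_features, net)
print(task_total, sample_betas, instance_total)
```

The task-level variant needs only one scalar per channel, while the instance-level variant trades a small network for the ability to give each training example its own weight.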

The approach is designed to be practical. The task-level version adds only a few learnable scalar parameters. The instance-level version uses a lightweight two-layer network whose parameter count is roughly 0.03 percent of the 1.67B-parameter student backbone, with negligible memory and training-time overhead.
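As a rough back-of-the-envelope check, the quoted percentage can be turned into an absolute parameter count. The snippet below uses only the two figures stated above; the memory estimate assumes 32-bit floats, which is an assumption of this sketch rather than a detail from the paper.

```python
student_params = 1.67e9          # student backbone size quoted above
overhead_fraction = 0.03 / 100   # 0.03 percent, as quoted above

extra_params = student_params * overhead_fraction

# Assumption: each parameter is stored as a 4-byte float32.
extra_megabytes = extra_params * 4 / 1e6

print(f"extra parameters: about {extra_params:,.0f}")
print(f"extra memory at float32: about {extra_megabytes:.1f} MB")
```

That works out to roughly half a million additional parameters, which is small next to a backbone with well over a billion.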

Results

The team evaluated Beta-KD on multimodal visual question-answering benchmarks, where a model must understand both an image and a natural-language question. They distilled a 1.7B-parameter MobileVLM V2 student from a 7B-parameter MobileVLM V2 teacher and tested the student across benchmarks, including ScienceQA, GQA, TextVQA, POPE, MMBench, and MME.

The gains were consistent. On ScienceQA, instance-level Beta-KD improved VQA accuracy by up to 4.7 percentage points over a fixed-weight baseline. Across six benchmarks, Beta-KD improved the strongest existing distillation setup by an average of 2.0 points. Training analysis also showed faster convergence, smoother optimization, and closer alignment between teacher and student behavior.

Why This Matters Beyond Research

Knowledge distillation matters for two reasons: efficiency and specialization. It can compress the capabilities of a large general-purpose teacher into a smaller student model and help adapt that student for downstream real-world applications. That is especially important for mobile devices, embedded systems, robots, cameras, medical devices, and industrial monitoring platforms, where latency, bandwidth, privacy, memory, and energy constraints often make large cloud-only models impractical.

Beta-KD tackles one of the bottlenecks in making capable multimodal models deployable: how to transfer knowledge from a powerful teacher into a compact student without blindly trusting every teacher signal. By learning how to reweight supervision during knowledge transfer, Beta-KD makes distillation more automatic, more robust, and more practical for real-world model compression.

Code for Beta-KD is publicly available at https://github.com/Jingchensun/beta-kd.

About The Authors

Shaobo Han is a Senior Researcher in the Optical Networking and Sensing Department at NEC Laboratories America in Princeton, NJ. He received his Ph.D. in Electrical and Computer Engineering and his M.S. in Statistical Science from Duke University, where his research focused on probabilistic modeling, transfer learning, and structured variational inference. He also earned an M.Eng. degree in Signal and Information Processing from the University of Chinese Academy of Sciences. At NEC, Dr. Han has been prototyping and delivering advanced algorithmic solutions for real-world applications of sensing AI.

Deep Patel is a Senior Associate Researcher in the Machine Learning Department at NEC Laboratories America in Princeton, NJ. He earned his Bachelor of Science (BS) in Computer Science from Towson University. At NEC, Deep contributes to platforms for intelligent visual analytics, visual search, and vision-language interaction, helping develop video-based reasoning models that operate in real time across multi-camera systems. His work includes optimizing neural architectures for embedded systems and designing scalable inference pipelines for video AI applications.

Wataru Kohno was a Researcher in the Optical Networking and Sensing Department at NEC Laboratories America and is now at NEC Corporation in Tokyo. He earned his Ph.D. in Physics from Hokkaido University in Japan, where he built a strong foundation in the physics and engineering principles underlying optical communications. At NEC Laboratories America, he contributed to multiple cutting-edge projects that expand the capabilities of distributed acoustic sensing. His recent work includes the development of advanced vibrometry techniques, recognition systems that adapt fiber sensing for downstream applications, and AI-enhanced methods for real-time detection in critical infrastructure such as power grids.

Publication to News Series

Our Publication-to-News Series highlights the real-world impact of our latest research, translating complex innovations into practical applications. From AI and machine learning to optical networking and intelligent systems, we showcase how our work goes beyond theory to address real-world challenges. Explore how cutting-edge research at NEC Laboratories America is driving measurable outcomes across industries.
