Vision-Language Models (VLM) are a class of AI models that integrate both visual and textual information to understand and generate content. These models combine techniques from computer vision and natural language processing to analyze images and text together, enabling a deeper understanding of visual content in the context of language.

Posts

AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving

Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However, objects encountered on the road exhibit a long-tailed distribution, with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of continuously curating and annotating data with significant human effort. We propose to leverage recent advances in vision-language and large language models to design an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios. This process operates iteratively, allowing for continuous self-improvement of the model. We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method’s superior performance at a reduced cost.