MEDIA ANALYTICS
Agentic LLMs for AI Orchestration
We develop agentic LLMs that solve complex workflows by orchestrating a combination of computer vision, logic and compute modules. Given a natural language task specification, our LLM generates a plan to accomplish the task using available tools. The plan is represented as a Python program synthesized to invoke the available tools, which can be any module that can be called programmatically.
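For illustration, a minimal sketch of this plan-as-program pattern is shown below; the tool registry, the synthesize_plan stub and the example tools are hypothetical placeholders, not our production interface.

# Minimal sketch of plan-as-program orchestration; all names are hypothetical.
# A registry maps tool names to callables that the synthesized plan may invoke.
TOOL_REGISTRY = {
    "detect_objects": lambda image_path: ["car", "pedestrian"],   # stand-in for a vision module
    "count": lambda items: len(items),                            # stand-in for a compute module
}

def synthesize_plan(task: str) -> str:
    """Stand-in for the LLM call that turns a natural language task into a Python program."""
    # A real system would prompt an LLM with the task and the available tool signatures.
    return (
        "objects = detect_objects(image_path)\n"
        "result = count(objects)\n"
    )

def execute_plan(task: str, **inputs):
    plan = synthesize_plan(task)
    scope = dict(TOOL_REGISTRY, **inputs)
    exec(plan, scope)          # the plan only sees registered tools and the task inputs
    return scope.get("result")

print(execute_plan("How many objects are in the image?", image_path="frame.png"))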
Autonomous Driving
While autonomous cars are rapidly becoming a reality, it remains a challenge to scalably deploy them across geographies and conditions. Our full-stack autonomy solutions include perception, prediction, planning, simulation and devops, leveraging the latest advances in generative AI, neural rendering, large language models, diffusion models and transformers.
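As a rough sketch of how such a modular stack composes per sensor frame (all interfaces and values below are illustrative placeholders, not our deployed system):

# Hypothetical sketch of per-frame composition of the autonomy stack.
from dataclasses import dataclass, field

@dataclass
class SceneState:
    objects: list = field(default_factory=list)       # perception output
    trajectories: list = field(default_factory=list)  # prediction output
    plan: list = field(default_factory=list)          # planned ego waypoints

def perceive(sensor_frame, state: SceneState) -> SceneState:
    state.objects = [{"id": 0, "class": "vehicle", "xyz": (12.0, 3.5, 0.0)}]  # stand-in detector
    return state

def predict(state: SceneState, horizon_s: float = 6.0) -> SceneState:
    state.trajectories = [[(12.0 + t, 3.5, 0.0) for t in range(int(horizon_s))]
                          for _ in state.objects]      # stand-in constant-velocity rollout
    return state

def plan(state: SceneState) -> SceneState:
    state.plan = [(t * 2.0, 0.0, 0.0) for t in range(6)]  # stand-in ego plan
    return state

state = plan(predict(perceive(sensor_frame=None, state=SceneState())))
print(len(state.objects), len(state.trajectories), len(state.plan))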
Foundational Vision-Language Models
Our foundational models enable ubiquitous usage of computer vision across scenarios, applications and user preferences. By combining very large-scale computer vision and natural language datasets with innovations in visual instruction following, our foundational models yield deeper domain-specific insights, at lower data center costs, and with fewer hallucinations.
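A minimal sketch of visual instruction following against such a model might look as follows; the FoundationalVLM class and its query interface are hypothetical stand-ins, not a released API.

# Hypothetical sketch of visual instruction following; the interface is a placeholder.
from dataclasses import dataclass

@dataclass
class VLMResponse:
    text: str
    confidence: float

class FoundationalVLM:
    """Stand-in for a vision-language model tuned with visual instruction following."""
    def query(self, image_path: str, instruction: str) -> VLMResponse:
        # A real model would encode the image, condition on the instruction,
        # and decode a grounded answer; here we return a canned response.
        return VLMResponse(text="Two forklifts are active near the loading bay.", confidence=0.87)

vlm = FoundationalVLM()
answer = vlm.query("warehouse.jpg", "Describe any safety-relevant activity in this scene.")
print(answer.text, answer.confidence)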
Neural Rendering and Diffusion for Simulation
Our simulation framework utilizes advances in neural rendering, diffusion models and large language models to automatically transform drive data into a full 3D sensor simulation testbed with unmatched photorealism. We offer language-based control to generate safety-critical scenarios such as collisions, traffic rule violations and other unsafe behaviors, to improve the perception and planning abilities of autonomous vehicles.
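As an illustrative sketch of language-based scenario control (the generate_scenario interface and the Scenario fields below are hypothetical placeholders, not our simulation API):

# Hypothetical sketch of language-conditioned scenario generation.
from dataclasses import dataclass

@dataclass
class Scenario:
    description: str
    actors: list
    events: list

def generate_scenario(drive_log: str, prompt: str) -> Scenario:
    """Stand-in for a pipeline that reconstructs a drive log with neural rendering and
    edits it according to LLM-parsed constraints before re-rendering sensor data."""
    # A real system would parse the prompt into scene edits (e.g., an inserted cut-in vehicle)
    # and synthesize the corresponding camera and lidar streams.
    return Scenario(
        description=prompt,
        actors=["ego", "cut_in_vehicle"],
        events=[{"t": 3.2, "type": "cut_in", "gap_m": 5.0}],
    )

scn = generate_scenario("log_0421.bin", "Insert an aggressive cut-in 5 meters ahead of the ego vehicle.")
print(scn.actors, scn.events)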
Open Vocabulary Perception
Perception methods such as object detection and image segmentation are basic building blocks of most computer vision applications. We develop open vocabulary perception methods that combine the power of vision and language to provide rich descriptions of objects in scenes, including their attributes, behaviors, relations and interactions.
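A toy sketch of the underlying idea, matching free-form text queries against region features in a shared embedding space, is shown below; the encoders here are trivial stand-ins for learned models.

# Hypothetical sketch of open-vocabulary matching between text queries and image regions.
import math

def embed_text(query: str) -> list:
    # Stand-in text encoder: hash characters into a tiny vector.
    return [sum(ord(c) for c in query) % 97 / 97.0, len(query) / 32.0]

def embed_region(region: dict) -> list:
    return region["feature"]  # stand-in region feature from a detector backbone

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

regions = [{"box": (10, 20, 80, 120), "feature": [0.4, 0.9]},
           {"box": (200, 40, 260, 90), "feature": [0.9, 0.1]}]
queries = ["a pedestrian carrying an umbrella", "a parked delivery van"]

for q in queries:
    scores = [cosine(embed_text(q), embed_region(r)) for r in regions]
    best = max(range(len(regions)), key=lambda i: scores[i])
    print(q, "->", regions[best]["box"], round(scores[best], 3))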
Prediction and Planning
We are pioneers in the development of generative models that predict long-horizon future trajectories of dynamic objects, with probabilistic outcomes that account for the diverse future actions consistent with the same past. Our methods, including DESIRE, SMART and DAC, achieve diversity, scene consistency, constant-time inference and multimodality that adheres to lane geometries and driving rules.
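For intuition, a toy sketch of multimodal prediction, sampling several probabilistic futures conditioned on the same past, is given below; the constant-velocity generator is a placeholder, not DESIRE, SMART or DAC.

# Hypothetical sketch of sampling K trajectory modes for one agent.
import random

def predict_futures(past_xy, k=3, horizon=8, dt=0.5):
    """Return k (probability, trajectory) samples conditioned on the same past."""
    vx = (past_xy[-1][0] - past_xy[0][0]) / (dt * (len(past_xy) - 1))
    vy = (past_xy[-1][1] - past_xy[0][1]) / (dt * (len(past_xy) - 1))
    modes = []
    for _ in range(k):
        turn = random.uniform(-0.2, 0.2)           # latent intent, e.g., lane keep vs. drift
        x, y, traj = past_xy[-1][0], past_xy[-1][1], []
        for t in range(horizon):
            x += vx * dt
            y += (vy + turn * t) * dt
            traj.append((round(x, 2), round(y, 2)))
        modes.append((1.0 / k, traj))
    return modes

past = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.2), (3.0, 0.3)]
for prob, traj in predict_futures(past):
    print(prob, traj[:3], "...")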
Multimodal LLMs for AI DevOps
Safety-critical applications must account for all scenarios, including rare situations that pose high risks despite being under-observed in everyday operation. Applications like autonomous driving require extensive data collection, data curation, model training and verification, which are prohibitively expensive and pose barriers to new entrants in the space.
3D Perception
We have pioneered the development of learned bird's-eye view representations for road scenes, which form a basis for image-based 3D perception in applications like autonomous driving. Our techniques for 3D localization of objects achieve high accuracy for object position, orientation and part locations with just a monocular camera, using novel geometric and learned priors.
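A small sketch of one such geometric prior, back-projecting the bottom of a 2D detection onto a flat ground plane using known camera intrinsics and mounting height, is shown below; the intrinsics and pixel values are illustrative only.

# Hypothetical sketch of a ground-plane prior for monocular 3D localization.
import numpy as np

K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])          # example pinhole intrinsics
camera_height = 1.5                       # meters above a flat ground plane

def box_bottom_to_ground(u: float, v: float) -> np.ndarray:
    """Intersect the ray through pixel (u, v) with the ground plane Y = camera_height."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray direction in camera coordinates
    scale = camera_height / ray[1]                   # ray[1] > 0 for pixels below the horizon
    return ray * scale                               # 3D point (X, Y, Z) in the camera frame

# Bottom-center of a detected vehicle box, in pixels.
print(box_bottom_to_ground(u=1100.0, v=620.0))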
Robustness and Fairness
Modern applications of computer vision demand robustness across scenarios as well as social acceptability. For example, object detection must work across daytime and low-light conditions, and face recognition should produce accurate outputs across ethnicities. To address these challenges, we develop universal representation learning methods that go beyond the limitations of expensive, high-quality labeled data to utilize large-scale and diverse unlabeled data.
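As a simplified sketch of mixing labeled and unlabeled data during training (the model, augmentations and consistency weighting below are illustrative, not our published method):

# Hypothetical sketch of a semi-supervised step: supervised loss on labeled data plus a
# consistency loss that lets large unlabeled collections shape the representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def weak_aug(x):   # stand-in augmentations
    return x + 0.01 * torch.randn_like(x)

def strong_aug(x):
    return x + 0.10 * torch.randn_like(x)

labeled_x, labeled_y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
unlabeled_x = torch.randn(32, 3, 32, 32)

supervised = F.cross_entropy(model(weak_aug(labeled_x)), labeled_y)

# Consistency: predictions on a weakly augmented view supervise a strongly augmented view.
with torch.no_grad():
    targets = F.softmax(model(weak_aug(unlabeled_x)), dim=1)
consistency = F.mse_loss(F.softmax(model(strong_aug(unlabeled_x)), dim=1), targets)

loss = supervised + 1.0 * consistency
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(supervised), float(consistency))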
Robust and Unbiased Face Recognition
Our face recognition methods achieve high accuracy on competitive public benchmarks through the use of universal representation learning techniques that leverage very large-scale datasets, with robustness to variations such as occlusions, blur, lighting or accessories. We develop methods in long-tail recognition that account for the low sample diversity of most identities in face recognition datasets.
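One common ingredient for long-tailed identity distributions, class-balanced sampling, is sketched below purely for illustration; the toy dataset and sampler are placeholders rather than our training pipeline.

# Hypothetical sketch of class-balanced sampling over a long-tailed identity distribution.
import random
from collections import defaultdict

# Toy long-tailed dataset: identity -> list of sample indices.
samples_by_identity = defaultdict(list)
long_tail = {"id_0": 500, "id_1": 120, "id_2": 8, "id_3": 3, "id_4": 1}
idx = 0
for identity, count in long_tail.items():
    for _ in range(count):
        samples_by_identity[identity].append(idx)
        idx += 1

def balanced_batch(batch_size=16):
    """Sample identities uniformly, then one image per chosen identity, so rare
    identities are seen as often as frequent ones."""
    identities = random.choices(list(samples_by_identity), k=batch_size)
    return [random.choice(samples_by_identity[i]) for i in identities]

print(balanced_batch())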
Privacy-Aware and Federated Learning
Privacy impacts every stakeholder in the AI solution ecosystem, including consumers, operators, solution providers and regulators. This is especially true for applications such as healthcare, safety and finance which require collecting and analyzing highly sensitive data. We develop AI solutions to assure customers that private information is not leaked at any stage of the data lifecycle.
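A minimal sketch of a federated averaging (FedAvg-style) round, in which only model parameters leave each site while raw data stays local, is shown below; the local update rule and toy data are placeholders.

# Hypothetical FedAvg-style sketch: clients train locally, the server aggregates parameters.
import numpy as np

def local_update(weights: np.ndarray, local_data: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """Stand-in local training: one gradient-like step toward the local data mean."""
    return weights - lr * (weights - local_data.mean(axis=0))

def federated_round(global_weights: np.ndarray, client_datasets: list) -> np.ndarray:
    updates, sizes = [], []
    for data in client_datasets:
        updates.append(local_update(global_weights.copy(), data))
        sizes.append(len(data))
    # Weighted average of client models; only parameters are shared with the server.
    return np.average(np.stack(updates), axis=0, weights=np.array(sizes, dtype=float))

rng = np.random.default_rng(0)
clients = [rng.normal(loc=c, size=(50, 4)) for c in (0.0, 1.0, 2.0)]
w = np.zeros(4)
for _ in range(10):
    w = federated_round(w, clients)
print(np.round(w, 3))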
Privacy-Aware Cameras
Besides privacy-aware learning, we also develop methods for privacy-aware sensing. In particular, we develop novel computational cameras that allow computer vision analysis even in sensitive environments like hospitals or smart homes. Our key innovation is a camera that removes private information at the point of capture. Our adversarial training approach achieves high accuracy and high privacy simultaneously through learned phase masks inserted in the focal plane of the camera.
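For intuition, the adversarial objective can be sketched as below; the element-wise mask stands in for the learned optics (it is not a physical phase-mask simulation), and both networks are toy placeholders.

# Hypothetical sketch of adversarial training for privacy-aware capture.
import torch
import torch.nn as nn
import torch.nn.functional as F

mask = nn.Parameter(torch.ones(1, 1, 16, 16))                   # stand-in for learned optics
task_net = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16, 2))   # e.g., fall detection
adversary = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16, 2))  # e.g., identity recovery

opt_cam = torch.optim.Adam([mask] + list(task_net.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)

x = torch.rand(32, 1, 16, 16)
y_task = torch.randint(0, 2, (32,))
y_private = torch.randint(0, 2, (32,))

for _ in range(100):
    encoded = x * mask                                  # simulated privacy-preserving capture
    # Camera and task network: keep the task accurate while confusing the adversary.
    loss_cam = F.cross_entropy(task_net(encoded), y_task) \
               - F.cross_entropy(adversary(encoded), y_private)
    opt_cam.zero_grad()
    loss_cam.backward()
    opt_cam.step()
    # Adversary: try to recover the private attribute from the encoded image.
    loss_adv = F.cross_entropy(adversary((x * mask).detach()), y_private)
    opt_adv.zero_grad()
    loss_adv.backward()
    opt_adv.step()
print(float(loss_cam), float(loss_adv))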
Dynamic Multi-Task Architectures
Multi-task learning commonly encounters competition for resources among tasks when model capacity is limited. We develop neural architectures that allow control over the relative importance of tasks and the total compute cost at inference time. Our controllable multi-task networks dynamically adjust architecture and weights to match desired task preferences as well as resource constraints.
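A simplified sketch of preference- and budget-conditioned execution is given below; the gating rule and per-head costs are illustrative placeholders rather than our architecture.

# Hypothetical sketch of running only the task heads that fit a compute budget.
import torch
import torch.nn as nn

class ControllableMultiTaskNet(nn.Module):
    def __init__(self, tasks=("segmentation", "depth", "normals")):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
        self.heads = nn.ModuleDict({t: nn.Linear(64, 16) for t in tasks})
        self.head_cost = {t: 1.0 for t in tasks}          # stand-in per-head compute cost

    def forward(self, x, preferences: dict, budget: float):
        """Run heads in order of preference, skipping those that exceed the budget."""
        features = self.backbone(x)
        outputs, spent = {}, 0.0
        for task in sorted(preferences, key=preferences.get, reverse=True):
            if spent + self.head_cost[task] > budget:
                continue                                   # drop low-priority heads when over budget
            outputs[task] = self.heads[task](features)
            spent += self.head_cost[task]
        return outputs

net = ControllableMultiTaskNet()
out = net(torch.randn(2, 64), {"segmentation": 0.6, "depth": 0.3, "normals": 0.1}, budget=2.0)
print(sorted(out))   # only the two highest-preference tasks run under this budget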
Embodied AI
We develop embodied agents for robotics applications that require exploration, navigation and transport in complex scenes. Our modular hierarchical transport policy builds a topological graph of the scene to perform exploration, then combines motion planning algorithms, which reach point goals within explored locations, with object navigation policies, which move towards semantic targets at unknown locations.
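A toy sketch of this hierarchical dispatch is shown below; the topological graph, the breadth-first planner and the object-navigation fallback are placeholders for the learned components.

# Hypothetical sketch of hierarchical dispatch between point-goal planning and object navigation.
from collections import deque

topo_graph = {                      # nodes discovered during exploration, with observed objects
    "hall": {"edges": ["kitchen", "office"], "objects": ["plant"]},
    "kitchen": {"edges": ["hall"], "objects": ["mug"]},
    "office": {"edges": ["hall"], "objects": []},
}

def point_goal_path(start: str, goal: str):
    """Breadth-first search over the topological graph (stand-in for motion planning)."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in topo_graph[path[-1]]["edges"]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def transport(target_object: str, start: str = "hall"):
    for node, info in topo_graph.items():
        if target_object in info["objects"]:
            return ("point_goal", point_goal_path(start, node))   # known location: plan to it
    return ("object_nav", target_object)                          # unknown: semantic search policy

print(transport("mug"))      # ('point_goal', ['hall', 'kitchen'])
print(transport("stapler"))  # ('object_nav', 'stapler')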