NEC Labs Blue Logo Square

Srimat T. Chakradhar

Department Head

Integrated Systems

Posts

Deep Video Codec Control

Deep Video Codec Control Lossy video compression is commonly used when transmitting and storing video data. Unified video codecs (e.g., H.264 or H.265) remain the emph(Unknown sysvar: (de facto)) standard, despite the availability of advanced (neural) compression approaches. Transmitting videos in the face of dynamic network bandwidth conditions requires video codecs to adapt to vastly different compression strengths. Rate control modules augment the codec’s compression such that bandwidth constraints are satisfied and video distortion is minimized. While, both standard video codes and their rate control modules are developed to minimize video distortion w.r.t. human quality assessment, preserving the downstream performance of deep vision models is not considered. In this paper, we present the first end-to-end learnable deep video codec control considering both bandwidth constraints and downstream vision performance, while not breaking existing standardization. We demonstrate for two common vision tasks (semantic segmentation and optical flow estimation) and on two different datasets that our deep codec control better preserves downstream performance than using 2-pass average bit rate control while meeting dynamic bandwidth constraints and adhering to standardizations.

Semantic Multi-Resolution Communications

Semantic Multi-Resolution Communications Deep learning based joint source-channel coding (JSCC) has demonstrated significant advancements in data reconstruction compared to separate source-channel coding (SSCC). This superiority arises from the suboptimality of SSCC when dealing with finite block-length data. Moreover, SSCC falls short in reconstructing data in a multi-user and/or multi-resolution fashion, as it only tries to satisfy the worst channel and/or the highest quality data. To overcome these limitations, we propose a novel deep learning multi-resolution JSCC framework inspired by the concept of multi-task learning (MTL). This proposed framework excels at encoding data for different resolutions through hierarchical layers and effectively decodes it by leveraging both current and past layers of encoded data. Moreover, this framework holds great potential for semantic communication, where the objective extends beyond data reconstruction to preserving specific semantic attributes throughout the communication process. These semantic features could be crucial elements such as class labels, essential for classification tasks, or other key attributes that require preservation. Within this framework, each level of encoded data can be carefully designed to retain specific data semantics. As a result, the precision of a semantic classifier can be progressively enhanced across successive layers, emphasizing the preservation of targeted semantics throughout the encoding and decoding stages. We conduct experiments on MNIST and CIFAR10 dataset. The experiment with both datasets illustrates that our proposed method is capable of surpassing the SSCC method in reconstructing data with different resolutions, enabling the extraction of semantic features with heightened confidence in successive layers. This capability is particularly advantageous for prioritizing and preserving more crucial semantic features within the datasets.

Retrospective : A Dynamically Configurable Coprocessor For Convolutional Neural Networks

Retrospective : A dynamically configurable coprocessor for convolutional neural networks In 2008, parallel computing posed significant challenges due to the complexities of parallel programming and the bottlenecks associated with efficient parallel execution. Inspired by the remarkable scalability achieved by networking and storage systems in handling extensive packet traffic and persistent data respectively by leveraging best-effort service, we proposed a new and fundamentally different approach of best-effort computing.Having observed that a broad spectrum of existing and emerging computing workloads were from applications that had an inherent forgiving nature [2], [5], we proposed best effort computing. The new approach resulted in disproportionate gains in power, energy and latency, while improving performance. While contemplating the concept of best-effort computing [2], we noticed the resurgence of convolutional neural networks, which generated approximate but acceptable outcomes for numerous recognition, mining, and synthesis tasks. The lead author of this retrospective had previously conducted research on neural networks for his doctoral dissertation over a decade ago, and the reemergence of neural networks proved both surprising and exciting. Recognizing the connection between best-effort computing and convolutional neural networks, in 2008 we embarked on developing a programmable and dynamically reconfigurable convolutional neural network capable of performing best effort computing for various machine learning tasks that inherently allow for multiple acceptable answers. This combination of our thoughts on best-effort computing and the gradual evolution of convolutional neural networks (deep neural networks emerged much later) culminated in our 2010 ISCA work on dynamically reconfigurable convolutional neural networks.

AnB: Application-In-A-Box To Rapidly Deploy and Self-Optimize 5G Apps

AnB: Application-in-a-Box to rapidly deploy and self-optimize 5G apps We present Application in a Box (AnB) product concept aimed at simplifying the deployment and operation of remote 5G applications. AnB comes pre-configured with all necessary hardware and software components, including sensors like cameras, hardware and software components for a local 5G wireless network, and 5G-ready apps. Enterprises can easily download additional apps from an App Store. Setting up a 5G infrastructure and running applications on it is a significant challenge, but AnB is designed to make it fast, convenient, and easy, even for those without extensive knowledge of software, computers, wireless networks, or AI-based analytics. With AnB, customers only need to open the box, set up the sensors, turn on the 5G networking and edge computing devices, and start running their applications. Our system software automatically deploys and optimizes the pipeline of microservices in the application on a tiered computing infrastructure that includes device, edge, and cloud computing. Dynamic resource management, placement of critical tasks for low-latency response, and dynamic network bandwidth allocation for efficient 5G network usage are all automatically orchestrated. AnB offers cost savings, simplified setup and management, and increased reliability and security. We’ve implemented several real-world applications, such as collision prediction at busy traffic light intersections and remote construction site monitoring using video analytics. With AnB, deployment and optimization effort can be reduced from several months to just a few minutes. This is the first-of-its-kind approach to easing deployment effort and automating self-optimization of the application during system operation.

FactionFormer: Context-Driven Collaborative Vision Transformer Models for Edge Intelligence

FactionFormer: Context-Driven Collaborative Vision Transformer Models for Edge Intelligence Edge Intelligence has received attention in the recent times for its potential towards improving responsiveness, reducing the cost of data transmission, enhancing security and privacy, and enabling autonomous decisions by edge devices. However, edge devices lack the power and compute resources necessary to execute most Al models. In this paper, we present FactionFormer, a novel method to deploy resource-intensive deep-learning models, such as vision transformers (ViT), on resource-constrained edge devices. Our method is based on a key observation: edge devices are often deployed in settings where they encounter only a subset of the classes that the resource intensive Al model is trained to classify, and this subset changes across deployments. Therefore, we automatically identify this subset as a faction, devise on-the fly a bespoke resource-efficient ViT called a modelette for the faction and set up an efficient processing pipeline consisting of a modelette on the device, a wireless network such as 5G, and the resource-intensive ViT model on an edge server, all of which work collaboratively to do the inference. For several ViT models pre-trained on benchmark datasets, FactionFormer’s modelettes are up to 4× smaller than the corresponding baseline models in terms of the number of parameters, and they can infer up to 2.5× faster than the baseline setup where every input is processed by the resource-intensive ViT on the edge server. Our work is the first of its kind to propose a device-edge collaborative inference framework where bespoke deep learning models for the device are automatically devised on-the-fly for most frequently encountered subset of classes.

Elixir: A System To Enhance Data Quality For Multiple Analytics On A Video Stream

Elixir: A system to enhance data quality for multiple analytics on a video stream IoT sensors, especially video cameras, are ubiquitously deployed around the world to perform a variety of computer vision tasks in several verticals including retail, health- care, safety and security, transportation, manufacturing, etc. To amortize their high deployment effort and cost, it is desirable to perform multiple video analytics tasks, which we refer to as Analytical Units (AUs), off the video feed coming out of every camera. As AUs typically use deep learning-based AI/ML models, their performance depend on the quality of the input video, and recent work has shown that dynamically adjusting the camera setting exposed by popular network cameras can help improve the quality of the video feed and hence the AU accuracy, in a single AU setting. In this paper, we first show that in a multi-AU setting, changing the camera setting has disproportionate impact on different AUs performance. In particular, the optimal setting for one AU may severely degrade the performance for another AU, and further the impact on different AUs varies as the environmental condition changes. We then present Elixir, a system to enhance the video stream quality for multiple analytics on a video stream. Elixir leverages Multi-Objective Reinforcement Learning (MORL), where the RL agent caters to the objectives from different AUs and adjusts the camera setting to simultaneously enhance the performance of all AUs. To define the multiple objectives in MORL, we develop new AU-specific quality estimator values for each individual AU. We evaluate Elixir through real-world experiments on a testbed with three cameras deployed next to each other (overlooking a large enterprise parking lot) running Elixir and two baseline approaches, respectively. Elixir correctly detects 7.1% (22,068) and 5.0% (15,731) more cars, 94% (551) and 72% (478) more faces, and 670.4% (4975) and 158.6% (3507) more persons than the default-setting and time-sharing approaches, respectively. It also detects 115 license plates, far more than the time-sharing approach (7) and the default setting (0).

StreetAware: A High-Resolution Synchronized Multimodal Urban Scene Dataset

StreetAware: A High-Resolution Synchronized Multimodal Urban Scene Dataset Access to high-quality data is an important barrier in the digital analysis of urban settings, including applications within computer vision and urban design. Diverse forms of data collected from sensors in areas of high activity in the urban environment, particularly at street intersections, are valuable resources for researchers interpreting the dynamics between vehicles, pedestrians, and the built environment. In this paper, we present a high-resolution audio, video, and LiDAR dataset of three urban intersections in Brooklyn, New York, totaling almost 8 unique hours. The data were collected with custom Reconfigurable Environmental Intelligence Platform (REIP) sensors that were designed with the ability to accurately synchronize multiple video and audio inputs. The resulting data are novel in that they are inclusively multimodal, multi-angular, high-resolution, and synchronized. We demonstrate four ways the data could be utilized — (1) to discover and locate occluded objects using multiple sensors and modalities, (2) to associate audio events with their respective visual representations using both video and audio modes, (3) to track the amount of each type of object in a scene over time, and (4) to measure pedestrian speed using multiple synchronized camera views. In addition to these use cases, our data are available for other researchers to carry out analyses related to applying machine learning to understanding the urban environment (in which existing datasets may be inadequate), such as pedestrian-vehicle interaction modeling and pedestrian attribute recognition. Such analyses can help inform decisions made in the context of urban sensing and smart cities, including accessibility-aware urban design and Vision Zero initiatives.

Content-aware auto-scaling of stream processing applications on container orchestration platforms

Content-aware auto-scaling of stream processing applications on container orchestration platforms Modern applications are designed as an interacting set of microservices, and these applications are typically deployed on container orchestration platforms like Kubernetes. Several attractive features in Kubernetes make it a popular choice for deploying applications, and automatic scaling is one such feature. The default horizontal scaling technique in Kubernetes is the Horizontal Pod Autoscaler (HPA). It scales each microservice independently while ignoring the interactions among the microservices in an application. In this paper, we show that ignoring such interactions by HPA leads to inefficient scaling, and the optimal scaling of different microservices in the application varies as the stream content changes. To automatically adapt to variations in stream content, we present a novel system called DataX AutoScaler that leverages knowledge of the entire stream processing application pipeline to efficiently auto-scale different microservices by taking into account their complex interactions. Through experiments on real-world video analytics applications, such as face recognition and pose classification, we show that DataX AutoScaler adapts to variations in stream content and achieves up to 43% improvement in overall application performance compared to a baseline system that uses HPA.

DyCo: Dynamic, Contextualized AI Models

DyCo: Dynamic, Contextualized AI Models Devices with limited computing resources use smaller AI models to achieve low-latency inferencing. However, model accuracy is typically much lower than the accuracy of a bigger model that is trained and deployed in places where the computing resources are relatively abundant. We describe DyCo, a novel system that ensures privacy of stream data and dynamically improves the accuracy of small models used in devices. Unlike knowledge distillation or federated learning, DyCo treats AI models as black boxes. DyCo uses a semi-supervised approach to leverage existing training frameworks and network model architectures to periodically train contextualized, smaller models for resource-constrained devices. DyCo uses a bigger, highly accurate model in the edge-cloud to auto-label data received from each sensor stream. Training in the edge-cloud (as opposed to the public cloud) ensures data privacy, and bespoke models for thousands of live data streams can be designed in parallel by using multiple edge-clouds. DyCo uses the auto-labeled data to periodically re-train, stream-specific, bespoke small models. To reduce the periodic training costs, DyCo uses different policies that are based on stride, accuracy, and confidence information.We evaluate our system, and the contextualized models, by using two object detection models for vehicles and people, and two datasets (a public benchmark and another real-world proprietary dataset). Our results show that DyCo increases the mAP accuracy measure of small models by an average of 16.3% (and up to 20%) for the public benchmark and an average of 19.0% (and up to 64.9%) for the real-world dataset. DyCo also decreases the training costs for contextualized models by more than an order of magnitude.

APT: Adaptive Perceptual quality based camera Tuning using reinforcement learning

APT: Adaptive Perceptual quality based camera Tuning using reinforcement learning Cameras are increasingly being deployed in cities, enterprises and roads world-wide to enable many applications in public safety, intelligent transportation, retail, healthcare and manufacturing. Often, after initial deployment of the cameras, the environmental conditions and the scenes around these cameras change, and our experiments show that these changes can adversely impact the accuracy of insights from video analytics. This is because the camera parameter settings, though optimal at deployment time, are not the best settings for good-quality video capture as the environmental conditions and scenes around a camera change during operation. Capturing poor-quality video adversely affects the accuracy of analytics. To mitigate the loss in accuracy of insights, we propose a novel, reinforcement-learning based system APT that dynamically, and remotely (over 5G networks), tunes the camera parameters, to ensure a high-quality video capture, which mitigates any loss in accuracy of video analytics. As a result, such tuning restores the accuracy of insights when environmental conditions or scene content change. APT uses reinforcement learning, with no-reference perceptual quality estimation as the reward function. We conducted extensive real-world experiments, where we simultaneously deployed two cameras side-by-side overlooking an enterprise parking lot (one camera only has manufacturer-suggested default setting, while the other camera is dynamically tuned by APT during operation). Our experiments demonstrated that due to dynamic tuning by APT, the analytics insights are consistently better at all times of the day: the accuracy of object detection video analytics application was improved on average by ∼ 42%. Since our reward function is independent of any analytics task, APT can be readily used for different video analytics tasks.