Y. Charlie Hu has worked at NEC Laboratories America, Inc.

Posts

Elixir: A System To Enhance Data Quality For Multiple Analytics On A Video Stream

IoT sensors, especially video cameras, are ubiquitously deployed around the world to perform a variety of computer vision tasks in several verticals including retail, healthcare, safety and security, transportation, manufacturing, etc. To amortize their high deployment effort and cost, it is desirable to perform multiple video analytics tasks, which we refer to as Analytical Units (AUs), on the video feed coming out of every camera. As AUs typically use deep learning-based AI/ML models, their performance depends on the quality of the input video, and recent work has shown that, in a single-AU setting, dynamically adjusting the camera settings exposed by popular network cameras can improve the quality of the video feed and hence the AU accuracy. In this paper, we first show that in a multi-AU setting, changing the camera setting has a disproportionate impact on the performance of different AUs. In particular, the optimal setting for one AU may severely degrade the performance of another AU, and the impact on different AUs varies as the environmental conditions change. We then present Elixir, a system that enhances video stream quality for multiple analytics on a video stream. Elixir leverages Multi-Objective Reinforcement Learning (MORL), in which the RL agent caters to the objectives of the different AUs and adjusts the camera settings to simultaneously enhance the performance of all AUs. To define the multiple objectives in MORL, we develop new AU-specific quality estimators for each individual AU. We evaluate Elixir through real-world experiments on a testbed with three cameras deployed next to each other (overlooking a large enterprise parking lot), running Elixir and two baseline approaches, respectively. Elixir correctly detects 7.1% (22,068) and 5.0% (15,731) more cars, 94% (551) and 72% (478) more faces, and 670.4% (4975) and 158.6% (3507) more persons than the default-setting and time-sharing approaches, respectively. It also detects 115 license plates, far more than the time-sharing approach (7) and the default setting (0).
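
To make the multi-objective idea concrete, here is a minimal sketch of an agent that nudges camera settings and scores each candidate setting with one quality estimate per AU, combined via linear scalarization. This is not Elixir's actual implementation: the AU names, the placeholder quality estimators, the bandit-style value update, and the scalarization weights are all illustrative assumptions.

```python
# Sketch only: per-AU quality estimates scalarized into a single reward,
# driving epsilon-greedy adjustments of camera settings.
import random

PARAMS = ["brightness", "contrast", "sharpness"]                  # tunable settings
ACTIONS = [(p, d) for p in PARAMS for d in (-5, +5)] + [(None, 0)]  # nudge or hold

def au_quality_estimates(frame, setting):
    """Placeholder for the AU-specific quality estimators (one value per AU).
    In Elixir these are learned estimators; here we fake them."""
    return {"car_detection": random.random(),
            "face_detection": random.random(),
            "person_detection": random.random()}

def scalarize(q_vector, weights):
    """One simple MORL strategy: weighted sum of the per-AU objectives."""
    return sum(weights[au] * q for au, q in q_vector.items())

def tune_step(setting, q_table, epsilon=0.1, alpha=0.5, weights=None):
    weights = weights or {au: 1.0 for au in
                          ("car_detection", "face_detection", "person_detection")}
    state = tuple(sorted(setting.items()))
    # epsilon-greedy choice over camera-setting adjustments
    if random.random() < epsilon or state not in q_table:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: q_table[state].get(a, 0.0))
    param, delta = action
    if param is not None:
        setting[param] = max(0, min(100, setting[param] + delta))
    # score the resulting frame for every AU, then collapse to one reward
    reward = scalarize(au_quality_estimates(frame=None, setting=setting), weights)
    q_table.setdefault(state, {})
    old = q_table[state].get(action, 0.0)
    q_table[state][action] = old + alpha * (reward - old)   # bandit-style update
    return setting

setting, q_table = {"brightness": 50, "contrast": 50, "sharpness": 50}, {}
for _ in range(100):
    setting = tune_step(setting, q_table)
print("tuned setting:", setting)
```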

APT: Adaptive Perceptual quality based camera Tuning using reinforcement learning

Cameras are increasingly being deployed in cities, enterprises, and roads worldwide to enable many applications in public safety, intelligent transportation, retail, healthcare, and manufacturing. Often, after the initial deployment of the cameras, the environmental conditions and the scenes around these cameras change, and our experiments show that these changes can adversely impact the accuracy of insights from video analytics. This is because the camera parameter settings, though optimal at deployment time, are no longer the best settings for good-quality video capture as the environmental conditions and scenes around a camera change during operation, and capturing poor-quality video adversely affects the accuracy of analytics. To mitigate this loss in accuracy, we propose APT, a novel reinforcement learning-based system that dynamically and remotely (over 5G networks) tunes the camera parameters to ensure high-quality video capture, thereby restoring the accuracy of insights when environmental conditions or scene content change. APT uses reinforcement learning with no-reference perceptual quality estimation as the reward function. We conducted extensive real-world experiments in which we deployed two cameras side by side overlooking an enterprise parking lot (one camera used only the manufacturer-suggested default settings, while the other was dynamically tuned by APT during operation). Our experiments demonstrate that, owing to dynamic tuning by APT, the analytics insights are consistently better at all times of the day: the accuracy of an object detection video analytics application improved on average by ∼42%. Since our reward function is independent of any analytics task, APT can be readily used for different video analytics tasks.
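
The key design point is that the reward looks only at the frame, never at detections or labels. The sketch below illustrates that kind of analytics-independent, no-reference quality score; the actual estimator APT uses is not shown here, and this proxy built from exposure, contrast, and gradient-based sharpness is purely an illustrative assumption.

```python
# Sketch only: a no-reference perceptual quality score usable as an RL reward.
import numpy as np

def no_reference_quality(frame: np.ndarray) -> float:
    """Score a grayscale frame in [0, 1] without any ground truth or task model."""
    f = frame.astype(np.float64) / 255.0
    exposure = 1.0 - abs(f.mean() - 0.5) * 2.0               # penalize over/under-exposure
    contrast = min(f.std() / 0.25, 1.0)                       # reward spread of intensities
    gy, gx = np.gradient(f)
    sharpness = min(np.mean(np.hypot(gx, gy)) / 0.1, 1.0)     # reward edge energy
    return (exposure + contrast + sharpness) / 3.0

# Because the reward never inspects detections, the same tuning loop can sit
# in front of any downstream analytics task.
frame = (np.random.rand(480, 640) * 255).astype(np.uint8)
print("reward:", round(no_reference_quality(frame), 3))
```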

Enhancing Video Analytics Accuracy via Real-time Automated Camera Parameter Tuning

In Video Analytics Pipelines (VAP), Analytics Units (AUs) such as object detection and face recognition running on remote servers critically rely on surveillance cameras to capture high-quality video streams in order to achieve high accuracy. Modern IP cameras come with a large number of camera parameters that directly affect the quality of the captured video stream. While a few such parameters, e.g., exposure, focus, and white balance, are automatically adjusted by the camera internally, the remaining ones are not. We denote such camera parameters as non-automated (NAUTO) parameters. In this paper, we first show that environmental condition changes can have a significant adverse effect on the accuracy of insights from the AUs, but that such adverse impact can potentially be mitigated by dynamically adjusting NAUTO camera parameters in response to changes in environmental conditions. We then present CamTuner, to our knowledge the first framework that dynamically adapts NAUTO camera parameters to optimize the accuracy of AUs in a VAP in response to adverse changes in environmental conditions. CamTuner is based on SARSA reinforcement learning and incorporates two novel components: a lightweight analytics quality estimator and a virtual camera that drastically speeds up offline RL training. Our controlled experiments and real-world VAP deployment show that, compared to a VAP using the default camera settings, CamTuner enhances VAP accuracy by detecting 15.9% additional persons and 2.6%–4.2% additional cars (without any false positives) in a large enterprise parking lot, and 9.7% additional cars in a 5G smart traffic intersection scenario, which enables a new use case of accurate and reliable automatic vehicle collision prediction (AVCP). CamTuner opens doors to new ways of significantly enhancing video analytics accuracy beyond the incremental improvements obtained from refining deep-learning models.
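
For readers unfamiliar with SARSA, the following is a minimal tabular sketch of the on-policy update CamTuner builds on. The states, actions, and reward here are simplified stand-ins; the real system works over discretized NAUTO parameter settings and uses a learned analytics quality estimator (and a virtual camera for offline training) rather than the placeholders below.

```python
# Sketch only: tabular SARSA with an epsilon-greedy policy.
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
ACTIONS = ["brightness+", "brightness-", "contrast+", "contrast-", "noop"]
Q = defaultdict(float)  # Q[(state, action)]

def policy(state):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_step(state, action, reward, next_state):
    """On-policy update: the next action a' is chosen by the same policy."""
    next_action = policy(next_state)
    td_target = reward + GAMMA * Q[(next_state, next_action)]
    Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])
    return next_action

# Toy rollout: states are coarse scene-brightness buckets; the reward stands in
# for the analytics quality estimator's score after applying the action.
state, action = "dim", policy("dim")
for _ in range(50):
    reward = random.random()
    next_state = random.choice(["dim", "normal", "bright"])
    action = sarsa_step(state, action, reward, next_state)
    state = next_state
```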

Why is the video analytics accuracy fluctuating, and what can we do about it?

It is common practice to think of a video as a sequence of images (frames) and to re-use deep neural network models that are trained only on images for similar analytics tasks on videos. In this paper, we show that this “leap of faith”, that deep learning models which work well on images will also work well on videos, is actually flawed. We show that even when a video camera is viewing a scene that is not changing in any human-perceptible way, and we control for external factors like video compression and environment (lighting), the accuracy of a video analytics application fluctuates noticeably. These fluctuations occur because successive frames produced by the video camera may look similar visually but are perceived quite differently by the video analytics applications. We observe that the root cause of these fluctuations is the dynamic camera parameter changes that a video camera makes automatically in order to capture and produce a visually pleasing video. The camera inadvertently acts as an “unintentional adversary” because these slight changes in image pixel values across consecutive frames, as we show, have a noticeably adverse impact on the accuracy of insights from video analytics tasks that re-use image-trained deep learning models. To address this inadvertent adversarial effect of the camera, we explore the use of transfer learning techniques to improve learning in video analytics tasks through the transfer of knowledge from image analytics tasks. Our experiments with a number of different cameras and a variety of video analytics tasks show that this adversarial effect can be noticeably offset by quickly re-training the deep learning models using transfer learning. In particular, we show that our newly trained Yolov5 model reduces fluctuation in object detection across frames, which leads to better tracking of objects (∼40% fewer tracking mistakes). Our paper also provides new directions and techniques to mitigate the camera’s adversarial effect on deep learning models used for video analytics applications.
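
The transfer-learning recipe motivated here, keep the image-trained backbone and quickly fine-tune only the task head on frames from the target camera, can be sketched as follows. The model, data loading, and hyperparameters below are illustrative assumptions, not the authors' exact training setup (the paper re-trains a Yolov5 detector).

```python
# Sketch only: freeze an image-trained backbone and fine-tune the head
# on frames captured by the deployed camera.
import torch
import torch.nn as nn

class Detector(nn.Module):
    """Stand-in for an image-trained model (e.g., a YOLO-style detector)."""
    def __init__(self, num_classes=80):
        super().__init__()
        self.backbone = nn.Sequential(               # pretend these weights were
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # trained on images
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32, num_classes)       # task-specific layer to adapt

    def forward(self, x):
        return self.head(self.backbone(x))

model = Detector()
for p in model.backbone.parameters():                # freeze image-trained layers
    p.requires_grad = False
optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Fine-tune briefly on frames from the actual camera (random tensors here).
frames = torch.rand(8, 3, 64, 64)
labels = torch.randint(0, 80, (8,))
for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(frames), labels)
    loss.backward()
    optimizer.step()
```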

AQuA: Analytical Quality Assessment for Optimizing Video Analytics Systems

Millions of cameras at the edge are being deployed to power a variety of deep learning applications. However, the frames captured by these cameras are not always pristine: they can be distorted due to lighting issues, sensor noise, compression, etc. Such distortions not only deteriorate visual quality but also impact the accuracy of the deep learning applications that process such video streams. In this work, we introduce AQuA, which protects application accuracy against such distorted frames by scoring the level of distortion in the frames. It takes into account the analytical quality of frames, not the visual quality, by learning a novel metric, the classifier opinion score, and uses a lightweight, CNN-based, object-independent feature extractor. AQuA accurately scores the distortion levels of frames and generalizes to multiple different deep learning applications. When used to filter poor-quality frames at the edge, it reduces high-confidence errors for analytics applications by 17%. Through filtering, and due to its low overhead (14 ms), AQuA can also reduce computation time and average bandwidth usage by 25%.
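
The filtering step itself is simple once a quality score exists; the sketch below shows how an AQuA-style score could gate frames at the edge. The scorer here is a dummy placeholder, AQuA itself uses a lightweight CNN feature extractor trained to predict the classifier opinion score.

```python
# Sketch only: drop frames whose predicted analytical quality is too low,
# so only frames worth analyzing reach the (costlier) analytics models.
import random
from typing import Callable, Iterable, List

def filter_frames(frames: Iterable, score: Callable, threshold: float = 0.5) -> List:
    kept = []
    for frame in frames:
        if score(frame) >= threshold:
            kept.append(frame)
        # else: skip analytics for this frame, saving compute and bandwidth
    return kept

# Usage with a dummy scorer standing in for the learned quality model.
frames = [f"frame_{i}" for i in range(10)]
kept = filter_frames(frames, score=lambda f: random.random())
print(f"analyzing {len(kept)} of {len(frames)} frames")
```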

CamTuner: Reinforcement Learning based System for Camera Parameter Tuning to enhance Analytics

Video analytics systems critically rely on video cameras to capture high-quality video frames in order to achieve high analytics accuracy. Although modern video cameras often expose tens of configurable parameter settings that can be set by end users, surveillance camera deployments today typically use a fixed set of parameter settings because the end users lack the skill or understanding to reconfigure them. In this paper, we first show that in a typical surveillance camera deployment, environmental condition changes can significantly affect the accuracy of analytics units such as person detection, face detection, and face recognition, and that such adverse impact can be mitigated by dynamically adjusting camera settings. We then propose CAMTUNER, a framework that can be easily applied to an existing video analytics pipeline (VAP) to enable automatic and dynamic adaptation of complex camera settings to changing environmental conditions, and to autonomously optimize the accuracy of analytics units (AUs) in the VAP. CAMTUNER is based on SARSA reinforcement learning (RL) and incorporates two novel components: a lightweight analytics quality estimator and a virtual camera. CAMTUNER is implemented in a system with AXIS surveillance cameras and several VAPs (with various AUs) that processed day-long customer videos captured at airport entrances. Our evaluations show that CAMTUNER can adapt quickly to changing environments. We compared CAMTUNER with two alternative approaches: one in which static camera settings were used, and a strawman approach in which camera settings were manually changed every hour (based on human perception of quality). We observed that for the face detection and person detection AUs, CAMTUNER achieves up to 13.8% and 9.2% higher accuracy, respectively, compared to the better of the two approaches (an average improvement of 8% for both AUs).
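
The virtual camera component deserves a brief illustration: the idea is to approximate the effect of a camera's settings by transforming an already captured frame, so the RL agent can explore candidate settings offline instead of waiting on the physical camera. The sketch below uses Pillow's ImageEnhance transforms as the approximation; the mapping from setting values to enhancement factors is an assumption for illustration, not CAMTUNER's calibrated model.

```python
# Sketch only: emulate camera parameter settings in software for offline training.
from PIL import Image, ImageEnhance

def virtual_camera(frame: Image.Image, setting: dict) -> Image.Image:
    """Apply software approximations of camera settings on a 0-100 scale."""
    out = frame
    for name, enhancer in (("brightness", ImageEnhance.Brightness),
                           ("contrast", ImageEnhance.Contrast),
                           ("sharpness", ImageEnhance.Sharpness),
                           ("color", ImageEnhance.Color)):
        if name in setting:
            factor = setting[name] / 50.0    # 50 -> unchanged (factor 1.0); assumed mapping
            out = enhancer(out).enhance(factor)
    return out

# Offline RL training can replay one recorded video under many candidate
# settings, instead of re-capturing the scene with the physical camera.
frame = Image.new("RGB", (640, 480), color=(120, 120, 120))
tuned = virtual_camera(frame, {"brightness": 60, "contrast": 45, "sharpness": 70})
```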