Deep Learning-Based Real-Time Rate Control for Live Streaming on Wireless Networks

Providing wireless users with high-quality video content has become increasingly important. However, ensuring consistent video quality poses challenges due to variable encodedbitrate caused by dynamic video content and fluctuating channel bitrate caused by wireless fading effects. Suboptimal selection of encoder parameters can lead to video quality loss due to underutilized bandwidth or the introduction of video artifacts due to packet loss. To address this, a real-time deep learning-based H.264 controller is proposed. This controller leverages instantaneous channel quality data driven from the physical layer, along with the video chunk, to dynamically estimate the optimal encoder parameters with a negligible delay in real-time. The objective is to maintain an encoded video bitrate slightly below the available channel bitrate. Experimental results, conducted on both QCIF dataset and a diverse selection of random videos from public datasets, validate the effectiveness of the approach. Remarkably, improvements of 10-20 dB in PSNR with respect to the state-of-the art adaptive bitrate video streaming is achieved, with an average packet drop rate as low as 0.002.

Strategic Preys Make Acute Predators: Enhancing Camouflaged Object Detectors by Generating Camouflaged Objects

Camouflaged object detection (COD) is the challenging task of identifying camouflaged objects visually blended into surroundings. Albeit achieving remarkable success, existing COD detectors still struggle to obtain precise results in some challenging cases. To handle this problem, we draw inspiration from the prey-vs-predator game that leads preys to develop better camouflage and predators to acquire more acute vision systems and develop algorithms from both the prey side and the predator side. On the prey side, we propose an adversarial trainingframework, Camouflageator, which introduces an auxiliary generator to generate more camouflaged objects that are harder for a COD method to detect. Camouflageator trains the generator and detector in an adversarial way such that the enhanced auxiliary generator helps produce a stronger detector. On the predator side, we introduce a novel COD method, called Internal Coherence and Edge Guidance (ICEG), which introduces a camouflaged feature coherence module to excavate the internal coherence of camouflaged objects, striving to obtain morecomplete segmentation results. Additionally, ICEG proposes a novel edge-guided separated calibration module to remove false predictions to avoid obtaining ambiguous boundaries. Extensive experiments show that ICEG outperforms existing COD detectors and Camouflageator is flexible to improve various COD detectors, including ICEG, which brings state-of-the-art COD performance.

CLAP: Cost and Latency-Aware Placement of Microservices on the Computing Continuum

For microservices-based real-time stream processing applications, computing at the edge delivers fast responses for low workloads, but as workload increases, the response time starts to slow down due to limited compute capacity. Abundant compute capacity in the cloud delivers fast responses even for higher workloads but incurs very high cost of operation. For applications which can tolerate latencies up to a certain limit, using either of them has one or the other drawback and for different applications and edge infrastructures, it is non-trivial to decide when to use only edge resources and when to leverage cloud resources. In this paper, we propose CLAP, which dynamically understands the relationship between workload and application latency, and automatically adjusts placement of microservices across edge and cloud computing continuum, with the goal of jointly reducing latency as well as cost of running microservices based streaming applications. CLAP leverages Reinforcement Learning (RL) technique to learn the optimal placement for a given workload and based on the learnings, adjusts placement of microservices as the application workload changes. We conduct experiments with real-world video analytics applications and show that CLAP adapts placement of microservices in response to varying workloads and achieves low latency for applications in a cost-efficient manner. Particularly, we show that for two real world video analytics applications i.e. human attributes and face recognition, CLAP is able to reduce average cost (across 4 days at different locations) by 47% and 58% for human attributes detection and face recognition application, respectively, while consistently maintaining latency below the tolerable limit.

Deep Learning Gain and Tilt Adaptive Digital Twin Modeling of Optical Line Systems for Accurate OSNR Predictions

We propose a deep learning algorithm trained on varied spectral loads and EDFA working points to generate a digital twin of an optical line system able to optimize line control and to enhance OSNR predictions.

Transformer-Aided Semantic Communications

The transformer structure employed in large language models (LLMs), as a specialized category of deep neural networks (DNNs) featuring attention mechanisms, stands out for their ability to identify and highlight the most relevant aspects of input data. Such a capability is particularly beneficial in addressing a variety of communication challenges, notably in the realm of semantic communication where proper encoding of the relevant data is critical especially in systems with limited bandwidth. In this work, we employ vision transformers specifically for the purpose of compression and compact representation of the input image, with the goal of preserving semantic information throughout the transmission process. Through the use of the attention mechanism inherent in transformers, we create an attention mask. This mask effectively prioritizes critical segments of images for transmission, ensuring that the reconstruction phase focuses on key objects highlighted by the mask. Our methodology significantly improves the quality of semantic communication and optimizes bandwidth usage by encoding different parts of the data in accordance with their semantic information content, thus enhancing overall efficiency. We evaluate the effectiveness of our proposed framework using the TinyImageNet dataset, focusing on both reconstruction quality and accuracy. Our evaluation results demonstrate that our framework successfully preserves semantic information, even when only a fraction of the encoded data is transmitted, according to the intended compression rates.

Local and Global Optimization Methods for Optical Line Control Based on Quality of Transmission

The ever-increasing demand for data traffic in recent decades has pushed network operators to give importance to the aspect of infrastructure control to facilitate its scalability and maximize its capacity. A generic lightpath (LP) is deployed starting from a traffic request between a given pair of nodes in a network. LPs are operated in the network based on an estimate of the quality of transmission (QoT), which is derived from the physical layer characteristics of a selected route. Regardless of the model used to estimate QoT, it is necessary to calibrate themodel to maximize its accuracy and define minimum design margins. The model calibration process depends significantly on the type of data that can be collected in the field (i.e., type of metric, resolution) and therefore on the available monitoring devices. In this work, a systematic evaluation of the QoT estimation is carried out on a multi-span erbium-doped-fiber-amplified optical line system (OLS) using in the first case only total power monitors and in the second experimentally emulating optical channel monitors (OCMs). Given the type of monitoring devices available, three different physical models are calibrated, and six optimization methods are used to define the optimal configuration of the target gain and tilt parameters of the optical amplifiers, jointly optimizing the working point of all amplifiers (global approach) or proceeding span by span (local approach). Subsequently, the OLS was set in each configuration obtained, and the generalized signal-to-noise ratio (GSNR) profile was measured at the end.

Instantaneous Perception of Moving Objects in 3D

The perception of 3D motion of surrounding traffic participants is crucial for driving safety. While existing works primarily focus on general large motions, we contend that the instantaneous detection and quantification of subtle motions is equally important as they indicate the nuances in driving behavior that may be safety critical, such as behaviors near a stop sign of parking positions. We delve into this under-explored task, examining its unique challenges and developing our solution, accompanied by a carefully designed benchmark. Specifically, due to the lack of correspondences between consecutive frames of sparse Lidar point clouds, static objects might appear to be moving – the so-called swimming effect. This intertwines with the true object motion, thereby posing ambiguity in accurate estimation, especially for subtle motions. To address this, we propose to leverage local occupancy completion of object point clouds to densify the shape cue, and mitigate the impact of swimming artifacts. The occupancy completion is learned in an end-to-end fashion together with the detection of moving objects and the estimation of their motion, instantaneously as soon as objects start to move. Extensive experiments demonstrate superior performance compared to standard 3D motion estimation approaches, particularly highlighting our method’s specialized treatment of subtle motions.

LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes

Photorealistic simulation plays a crucial role in applications such as autonomous driving, where advances in neural radiance fields (NeRFs) may allow better scalability through the automatic creation of digital 3D assets. However, reconstruction quality suffers on street scenes due to largely collinear camera motions and sparser samplings at higher speeds. On the other hand, the application often demands rendering from camera views that deviate from the inputs to accurately simulate behaviors like lane changes. In this paper, we propose several insights that allow a better utilization of Lidar data to improve NeRF quality on street scenes. First, our framework learns a geometric scene representation from Lidar, which are fused with the implicit grid-based representation for radiance decoding, thereby supplying strongergeometric information offered by explicit point cloud. Second, we put forth a robust occlusion-aware depth supervision scheme, which allows utilizing densified Lidar points by accumulation. Third, we generate augmented training views from Lidar points for further improvement. Our insights translate to largely improved novel view synthesis under real driving scenes.

iRAG: An Incremental Retrieval Augmented Generation System for Videos

Retrieval augmented generation (RAG) systems combine the strengths of language generation and information retrieval to power many real-world applications like chatbots. Use of RAG for combined understanding of multimodal data such as text, images and videos is appealing but two critical limitations exist: one-time, upfront capture of all content in large multimodal data as text descriptions entails high processing times, and not all information in the rich multimodal data is typically in the text descriptions. Since the user queries are not known apriori, developing a system for multimodal to text conversion and interactive querying of multimodal data is challenging.To address these limitations, we propose iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of large corpus of multimodal data. Unlike traditional RAG, iRAG quickly indexes large repositories of multimodal data, and in the incremental workflow, it uses the index to opportunistically extract more details from select portions of the multimodal data to retrieve context relevant to an interactive user query. Such an incremental workflow avoids long multimodal to text conversion times, overcomes information loss issues by doing on-demand query-specific extraction of details in multimodal data, and ensures high quality of responses to interactive user queries that are often not known apriori. To the best of our knowledge, iRAG is the first system to augment RAG with an incremental workflow to support efficient interactive querying of large, real-world multimodal data. Experimental results on real-world long videos demonstrate 23x to 25x faster video to text ingestion, while ensuring that quality of responses to interactive user queries is comparable to responses from a traditional RAG where all video data is converted to text upfront before any querying.

Radio-Frequency Linear Analysis and Optimization of Silicon Photonic Neural Networks

Broadband analog signal processors utilizing silicon photonics have demonstrated a significant impact in numerous application spaces, offering unprecedented bandwidths, dynamic range, and tunability. In the past decade, microwave photonic techniques have been applied to neuromorphic processing, resulting in the development of novel photonic neural network architectures. Neuromorphic photonic systems can enable machine learning capabilities at extreme bandwidths and speeds. Herein, low-quality factor microring resonators are implemented to demonstrate broadband optical weighting. In addition, silicon photonic neural network architectures are critically evaluated, simulated, and optimized from a radio-frequency performance perspective. This analysis highlights the linear front-end of the photonic neural network, the effects of linear and nonlinear loss within silicon waveguides, and the impact of electrical preamplification.