Deep Learning-Based Real-Time Rate Control for Live Streaming on Wireless Networks

Providing wireless users with high-quality video content has become increasingly important. However, ensuring consistent video quality poses challenges due to variable encodedbitrate caused by dynamic video content and fluctuating channel bitrate caused by wireless fading effects. Suboptimal selection of encoder parameters can lead to video quality loss due to underutilized bandwidth or the introduction of video artifacts due to packet loss. To address this, a real-time deep learning-based H.264 controller is proposed. This controller leverages instantaneous channel quality data driven from the physical layer, along with the video chunk, to dynamically estimate the optimal encoder parameters with a negligible delay in real-time. The objective is to maintain an encoded video bitrate slightly below the available channel bitrate. Experimental results, conducted on both QCIF dataset and a diverse selection of random videos from public datasets, validate the effectiveness of the approach. Remarkably, improvements of 10-20 dB in PSNR with respect to the state-of-the art adaptive bitrate video streaming is achieved, with an average packet drop rate as low as 0.002.

Strategic Preys Make Acute Predators: Enhancing Camouflaged Object Detectors by Generating Camouflaged Objects

Camouflaged object detection (COD) is the challenging task of identifying camouflaged objects visually blended into surroundings. Albeit achieving remarkable success, existing COD detectors still struggle to obtain precise results in some challenging cases. To handle this problem, we draw inspiration from the prey-vs-predator game that leads preys to develop better camouflage and predators to acquire more acute vision systems and develop algorithms from both the prey side and the predator side. On the prey side, we propose an adversarial trainingframework, Camouflageator, which introduces an auxiliary generator to generate more camouflaged objects that are harder for a COD method to detect. Camouflageator trains the generator and detector in an adversarial way such that the enhanced auxiliary generator helps produce a stronger detector. On the predator side, we introduce a novel COD method, called Internal Coherence and Edge Guidance (ICEG), which introduces a camouflaged feature coherence module to excavate the internal coherence of camouflaged objects, striving to obtain morecomplete segmentation results. Additionally, ICEG proposes a novel edge-guided separated calibration module to remove false predictions to avoid obtaining ambiguous boundaries. Extensive experiments show that ICEG outperforms existing COD detectors and Camouflageator is flexible to improve various COD detectors, including ICEG, which brings state-of-the-art COD performance.

CLAP: Cost and Latency-Aware Placement of Microservices on the Computing Continuum

For microservices-based real-time stream processing applications, computing at the edge delivers fast responses for low workloads, but as workload increases, the response time starts to slow down due to limited compute capacity. Abundant compute capacity in the cloud delivers fast responses even for higher workloads but incurs very high cost of operation. For applications which can tolerate latencies up to a certain limit, using either of them has one or the other drawback and for different applications and edge infrastructures, it is non-trivial to decide when to use only edge resources and when to leverage cloud resources. In this paper, we propose CLAP, which dynamically understands the relationship between workload and application latency, and automatically adjusts placement of microservices across edge and cloud computing continuum, with the goal of jointly reducing latency as well as cost of running microservices based streaming applications. CLAP leverages Reinforcement Learning (RL) technique to learn the optimal placement for a given workload and based on the learnings, adjusts placement of microservices as the application workload changes. We conduct experiments with real-world video analytics applications and show that CLAP adapts placement of microservices in response to varying workloads and achieves low latency for applications in a cost-efficient manner. Particularly, we show that for two real world video analytics applications i.e. human attributes and face recognition, CLAP is able to reduce average cost (across 4 days at different locations) by 47% and 58% for human attributes detection and face recognition application, respectively, while consistently maintaining latency below the tolerable limit.

Deep Learning Gain and Tilt Adaptive Digital Twin Modeling of Optical Line Systems for Accurate OSNR Predictions

We propose a deep learning algorithm trained on varied spectral loads and EDFA working points to generate a digital twin of an optical line system able to optimize line control and to enhance OSNR predictions.

Transformer-Aided Semantic Communications

The transformer structure employed in large language models (LLMs), as a specialized category of deep neural networks (DNNs) featuring attention mechanisms, stands out for their ability to identify and highlight the most relevant aspects of input data. Such a capability is particularly beneficial in addressing a variety of communication challenges, notably in the realm of semantic communication where proper encoding of the relevant data is critical especially in systems with limited bandwidth. In this work, we employ vision transformers specifically for the purpose of compression and compact representation of the input image, with the goal of preserving semantic information throughout the transmission process. Through the use of the attention mechanism inherent in transformers, we create an attention mask. This mask effectively prioritizes critical segments of images for transmission, ensuring that the reconstruction phase focuses on key objects highlighted by the mask. Our methodology significantly improves the quality of semantic communication and optimizes bandwidth usage by encoding different parts of the data in accordance with their semantic information content, thus enhancing overall efficiency. We evaluate the effectiveness of our proposed framework using the TinyImageNet dataset, focusing on both reconstruction quality and accuracy. Our evaluation results demonstrate that our framework successfully preserves semantic information, even when only a fraction of the encoded data is transmitted, according to the intended compression rates.

Local and Global Optimization Methods for Optical Line Control Based on Quality of Transmission

The ever-increasing demand for data traffic in recent decades has pushed network operators to give importance to the aspect of infrastructure control to facilitate its scalability and maximize its capacity. A generic lightpath (LP) is deployed starting from a traffic request between a given pair of nodes in a network. LPs are operated in the network based on an estimate of the quality of transmission (QoT), which is derived from the physical layer characteristics of a selected route. Regardless of the model used to estimate QoT, it is necessary to calibrate themodel to maximize its accuracy and define minimum design margins. The model calibration process depends significantly on the type of data that can be collected in the field (i.e., type of metric, resolution) and therefore on the available monitoring devices. In this work, a systematic evaluation of the QoT estimation is carried out on a multi-span erbium-doped-fiber-amplified optical line system (OLS) using in the first case only total power monitors and in the second experimentally emulating optical channel monitors (OCMs). Given the type of monitoring devices available, three different physical models are calibrated, and six optimization methods are used to define the optimal configuration of the target gain and tilt parameters of the optical amplifiers, jointly optimizing the working point of all amplifiers (global approach) or proceeding span by span (local approach). Subsequently, the OLS was set in each configuration obtained, and the generalized signal-to-noise ratio (GSNR) profile was measured at the end.

iRAG: An Incremental Retrieval Augmented Generation System for Videos

Retrieval augmented generation (RAG) systems combine the strengths of language generation and information retrieval to power many real-world applications like chatbots. Use of RAG for combined understanding of multimodal data such as text, images and videos is appealing but two critical limitations exist: one-time, upfront capture of all content in large multimodal data as text descriptions entails high processing times, and not all information in the rich multimodal data is typically in the text descriptions. Since the user queries are not known apriori, developing a system for multimodal to text conversion and interactive querying of multimodal data is challenging.To address these limitations, we propose iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of large corpus of multimodal data. Unlike traditional RAG, iRAG quickly indexes large repositories of multimodal data, and in the incremental workflow, it uses the index to opportunistically extract more details from select portions of the multimodal data to retrieve context relevant to an interactive user query. Such an incremental workflow avoids long multimodal to text conversion times, overcomes information loss issues by doing on-demand query-specific extraction of details in multimodal data, and ensures high quality of responses to interactive user queries that are often not known apriori. To the best of our knowledge, iRAG is the first system to augment RAG with an incremental workflow to support efficient interactive querying of large, real-world multimodal data. Experimental results on real-world long videos demonstrate 23x to 25x faster video to text ingestion, while ensuring that quality of responses to interactive user queries is comparable to responses from a traditional RAG where all video data is converted to text upfront before any querying.

Radio-Frequency Linear Analysis and Optimization of Silicon Photonic Neural Networks

Broadband analog signal processors utilizing silicon photonics have demonstrated a significant impact in numerous application spaces, offering unprecedented bandwidths, dynamic range, and tunability. In the past decade, microwave photonic techniques have been applied to neuromorphic processing, resulting in the development of novel photonic neural network architectures. Neuromorphic photonic systems can enable machine learning capabilities at extreme bandwidths and speeds. Herein, low-quality factor microring resonators are implemented to demonstrate broadband optical weighting. In addition, silicon photonic neural network architectures are critically evaluated, simulated, and optimized from a radio-frequency performance perspective. This analysis highlights the linear front-end of the photonic neural network, the effects of linear and nonlinear loss within silicon waveguides, and the impact of electrical preamplification.

Efficient Transformer Encoders for Mask2Former-style Models

Vision transformer based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of specific segmentation tasks, their use of computational resources can be taxing on deployed devices. One way to overcome this challenge is by adapting the computation level to the specific needs of the input image rather than the current one size-fits-all approach. To this end, we introduce ECO-M2F or EffiCient TransfOrmer Encoders for Mask2Former-style models. Noting that the encoder module of M2F-style models incur high resource-intensive computations, ECO-M2F provides a strategy to self-select the number of hidden layers in the encoder, conditioned on the input image. To enable this self-selection ability for providing a balance between performance and computational efficiency, we present a three-step recipe. The first step is to train the parent architecture to enable early exiting from the encoder. The second step is to create a derived dataset of the ideal number of encoder layers required for each training example. The third step is to use the aforementioned derived dataset to train a gating network that predicts the number of encoder layers to be used, conditioned on input image. Additionally, to change the computational-accuracy trade off, only steps two and three need to be repeated which significantly reduces retraining time. Experiments on the public datasets show that the proposed approach reduces expected encoder computational cost while maintaining performance, adapts to various user compute resources, is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.

Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation

A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions. With efficiency being a high priority for scaling such models, we observed that the state-of-the-art method Mask2Former uses >50% of its compute only on the transformer encoder. This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer. With this observation, we propose a strategy termed PROgressive Token Length SCALing for Efficient transformer encoders (PRO-SCALE) that can be plugged-in to the Mask2Former style segmentation architectures to significantly reduce the computational cost. The underlying principle of PRO-SCALE is: progressively scale the length of the tokens with the layers of the encoder. This allows PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance (?52% GFLOPs reduction with no drop in performance on COCO dataset). We validate our frame work on multiple public benchmarks.