Northeastern University is a global research university recognized for its cooperative education model and leadership in network science, cybersecurity, and AI, integrating experiential learning with interdisciplinary research. NEC Labs America and Northeastern University collaborate on scalable graph learning, misinformation detection, and secure federated graph analytics. Read about our latest news and collaborative publications with Northeastern University below.

Posts

Foundational Vision-LLM for AI Linkage and Orchestration

We propose a vision-LLM framework for automating the development and deployment of computer vision solutions for pre-defined or custom-defined tasks. A foundational layer contains a code-LLM AI orchestrator, self-trained with reinforcement learning, that writes Python code for a novel user-defined task given the APIs, documentation, and usage notes of existing task-specific AI models. Zero-shot abilities in specific domains are obtained through foundational vision-language models trained at low compute expense by leveraging existing computer vision models and datasets. An engine layer comprises several task-specific vision-language engines that can be used compositionally. An application-specific layer improves performance in customer-specific scenarios through novel LLM-guided data augmentation and question decomposition, alongside standard fine-tuning tools. We demonstrate a range of applications including visual AI assistance, visual conversation, law enforcement, mobility, medical image reasoning, and remote sensing.
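As a rough illustration of the orchestration idea, the sketch below shows how a code-LLM might be prompted with the documentation of registered vision engines and how the Python it emits could be executed against them. The engine registry, function names, and the stand-in LLM are all hypothetical, not the framework's actual API.

```python
# Minimal sketch (hypothetical names throughout): an orchestrator that prompts
# a code-LLM with the documentation of registered vision engines and executes
# the Python it returns. Real engines and models are stubbed out.

from typing import Callable, Dict

# --- Engine layer: task-specific vision-language engines (stubs) ---
def detect_objects(image_path: str) -> list:
    """Return a list of (label, box) pairs found in the image."""
    return [("car", (10, 20, 110, 220))]          # stub result

def caption_image(image_path: str) -> str:
    """Return a one-sentence natural-language caption."""
    return "a car parked on a street"             # stub result

ENGINES: Dict[str, Callable] = {
    "detect_objects": detect_objects,
    "caption_image": caption_image,
}

def build_prompt(task: str) -> str:
    """Assemble the orchestrator prompt from the task and the engine docs."""
    docs = "\n".join(f"{name}: {fn.__doc__}" for name, fn in ENGINES.items())
    return f"Available APIs:\n{docs}\n\nWrite Python for this task:\n{task}"

def orchestrate(task: str, code_llm: Callable[[str], str]):
    """Ask the code-LLM for a program and run it against the engine registry.
    A real system would sandbox this exec() call."""
    code = code_llm(build_prompt(task))
    scope = dict(ENGINES)                # expose engines to the generated code
    exec(code, scope)                    # program stores its output in `result`
    return scope.get("result")

# Stand-in for a real code-LLM: returns a fixed program for demonstration.
fake_llm = lambda prompt: "result = {'caption': caption_image('street.jpg')}"
print(orchestrate("Describe the scene in street.jpg", fake_llm))
```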

Exploring Question Decomposition for Zero-Shot VQA

Visual question answering (VQA) has traditionally been treated as a single-step task in which every question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently developed large vision-language models to use human-written decompositions and to produce their own decompositions of visual questions, finding they are capable of learning both tasks from demonstrations alone. However, we show that naive application of model-written decompositions can hurt performance. We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors, and validate its effectiveness on eight VQA tasks across three domains, showing consistent gains in accuracy, including improvements of more than 20% on medical VQA datasets, and boosting the zero-shot performance of BLIP-2 above chance on a VQA reformulation of the challenging Winoground task. Project site: https://zaidkhan.me/decomposition-0shot-vqa/
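A minimal sketch of what selective decomposition could look like in practice: answer directly first, and only when the model's confidence is low, decompose the question, answer the sub-questions, and re-ask with that evidence in the prompt. The confidence threshold, prompt format, and callable interfaces below are assumptions for illustration, not the paper's exact recipe.

```python
# Selective decomposition sketch (all model calls stubbed; threshold and
# prompt formats are assumptions, not the paper's exact recipe).

from typing import Callable, List, Tuple

def selective_vqa(
    image,
    question: str,
    answer: Callable[[object, str], Tuple[str, float]],   # -> (answer, confidence)
    decompose: Callable[[object, str], List[str]],        # -> sub-questions
    threshold: float = 0.5,
) -> str:
    # 1) Try answering directly.
    direct, conf = answer(image, question)
    if conf >= threshold:
        return direct                       # confident: keep the direct answer

    # 2) Low confidence: second-guess by decomposing the question.
    context = []
    for sq in decompose(image, question):
        sa, _ = answer(image, sq)           # answer each simpler sub-question
        context.append(f"Q: {sq} A: {sa}")

    # 3) Re-ask the original question with sub-question evidence in the prompt.
    augmented = "\n".join(context) + f"\nQ: {question}"
    revised, _ = answer(image, augmented)
    return revised

# Toy demo with stub models: long questions get low confidence and trigger
# decomposition, short sub-questions are answered confidently.
stub_answer = lambda img, q: ("yes", 0.3 if len(q) > 40 else 0.9)
stub_decompose = lambda img, q: ["Is there a dog?", "Is the dog running?"]
print(selective_vqa(None, "Is the animal in the picture moving quickly?",
                    stub_answer, stub_decompose))
```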

Improving Cross-Domain Detection with Self-Supervised Learning

Cross-Domain Detection (XDD) aims to train a domain-adaptive object detector using unlabeled images from a target domain and labeled images from a source domain. Existing approaches achieve this either by aligning the feature maps or the region proposals of the two domains, or by transferring the style of source images to that of target images. Rather than proposing another method along these lines, we introduce a new framework complementary to existing methods. Our framework unifies several popular Self-Supervised Learning (SSL) techniques (e.g., rotation angle prediction, strong/weak data augmentation, mean teacher modeling) and adapts them to the XDD task. Our basic idea is to leverage the unsupervised nature of these SSL techniques and apply them simultaneously across domains (source and target) and models (student and teacher). The SSL techniques can thus serve as shared bridges that facilitate knowledge transfer between domains. More importantly, because these techniques are applied independently in each domain, they are complementary to existing domain alignment techniques that rely on interactions between domains (e.g., adversarial alignment). We perform extensive analyses of these SSL techniques and show that they significantly improve the performance of existing methods. In addition, we reach comparable or even better performance than state-of-the-art methods when integrating our framework with an older, well-established method.
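For concreteness, here is a minimal PyTorch sketch of two of the SSL ingredients named above: a mean-teacher consistency loss with weak/strong augmentation, and the EMA teacher update. The stand-in model, the "augmentations", the loss, and the momentum value are placeholder assumptions rather than the paper's configuration.

```python
# Mean-teacher sketch on unlabeled images: the student sees a strong
# augmentation and must match the teacher's prediction on a weak one; the
# teacher is an exponential moving average (EMA) of the student.

import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, momentum: float = 0.999):
    """Teacher weights track an exponential moving average of student weights."""
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def consistency_step(student, teacher, images, weak_aug, strong_aug):
    """Consistency loss: student on strong aug vs. teacher on weak aug."""
    with torch.no_grad():
        target = teacher(weak_aug(images))          # stable pseudo-targets
    pred = student(strong_aug(images))
    return F.mse_loss(pred, target)

# Toy usage with a stand-in model and identity/noise "augmentations".
student = torch.nn.Linear(8, 4)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                         # teacher is not trained directly

images = torch.randn(16, 8)                         # stand-in for unlabeled images
loss = consistency_step(student, teacher, images,
                        weak_aug=lambda x: x,
                        strong_aug=lambda x: x + 0.1 * torch.randn_like(x))
loss.backward()
ema_update(teacher, student)
```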

Inductive and Unsupervised Representation Learning on Graph Structured Objects

Inductive and unsupervised graph learning is a critical technique for predictive or information-retrieval tasks where label information is difficult to obtain. Making graph learning both inductive and unsupervised is challenging, as learning processes guided by reconstruction-error-based loss functions inevitably demand graph similarity evaluation, which is usually computationally intractable. In this paper, we propose a general framework, SEED (Sampling, Encoding, and Embedding Distributions), for inductive and unsupervised representation learning on graph-structured objects. Instead of directly confronting the computational challenges raised by graph similarity evaluation, given an input graph the SEED framework samples a number of subgraphs whose reconstruction errors can be efficiently evaluated, encodes the subgraph samples into a collection of subgraph vectors, and employs the embedding of the subgraph-vector distribution as the output vector representation for the input graph. Through theoretical analysis, we demonstrate a close connection between SEED and graph isomorphism. On public benchmark datasets, our empirical study suggests the proposed SEED framework achieves up to a 10% improvement over competitive baseline methods.
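The three-stage pipeline can be pictured with the short sketch below: sample subgraphs, encode each into a vector, and embed the resulting distribution. The random-walk sampler, the degree-histogram encoder, and the mean embedding are simplified stand-ins (SEED uses learned encoders whose reconstruction errors are efficiently computable), included only to make the stages concrete.

```python
# SEED-style pipeline sketch: Sampling -> Encoding -> Embedding Distributions.
# All three components here are simplified stand-ins for the learned ones.

import random
import numpy as np

def sample_walk_subgraph(adj: dict, length: int) -> list:
    """Sampling: take a subgraph as the node sequence of one random walk."""
    node = random.choice(list(adj))
    walk = [node]
    for _ in range(length - 1):
        node = random.choice(adj[node])
        walk.append(node)
    return walk

def encode_subgraph(adj: dict, walk: list, dim: int = 8) -> np.ndarray:
    """Encoding (stand-in): a degree histogram of the walk's nodes."""
    vec = np.zeros(dim)
    for node in walk:
        vec[min(len(adj[node]), dim - 1)] += 1.0
    return vec / len(walk)

def seed_embedding(adj: dict, n_samples: int = 100, length: int = 5) -> np.ndarray:
    """Embedding distributions: the mean of the subgraph vectors is the
    simplest distribution embedding and serves as the graph representation."""
    vecs = [encode_subgraph(adj, sample_walk_subgraph(adj, length))
            for _ in range(n_samples)]
    return np.mean(vecs, axis=0)

# Toy graph: a 4-cycle, given as an adjacency dict.
g = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(seed_embedding(g))
```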