Hans Peter Graf NEC Labs America

Hans Peter Graf

Senior Advisor

Machine Learning


COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality

Group Activity Recognition detects the activity collectively performed by a group of actors, which requires compositional reasoning of actors and objects. We approach the task by modeling the video as tokens that represent the multi-scale semantic concepts in the video. We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally. In addition, prior works suffer from scene biases with privacy and ethical concerns. We only use the keypoint modality which reduces scene biases and prevents acquiring detailed visual data that may contain private or biased information of users. We improve the multiscale representations in COMPOSER by clustering the intermediate scale representations, while maintaining consistent cluster assignments between scales. Finally, we use techniques such as auxiliary prediction and data augmentations tailored to the keypoint signals to aid model training. We demonstrate the model’s strength and interpretability on two widely-used datasets (Volleyball and Collective Activity). COMPOSER achieves up to +5.4% improvement with just the keypoint modality (Code is available at https://github.com/hongluzhou/composer.).

Prediction of Non-Muscle Invasive Bladder Cancer Recurrence using Machine Learning of Quantitative Nuclear Features

Non-muscle invasive bladder cancer (NMIBC) generally has a good prognosis, however, recurrence after transurethral resection (TUR), the standard primary treatment, is a major problem. Clinical management after TUR has been based on risk classification using clinicopathological factors, but these classifications are not complete. In this study, we attempted to predict early recurrence of NMIBC based on machine learning of quantitative morphological features. In general, structural, cellular, and nuclear atypia are evaluated to determine cancer atypia. However, since it is difficult to accurately quantify structural atypia from TUR specimens, in this study, we used only nuclear atypia and analyzed it using feature extraction followed by classification using Support Vector Machine and Random Forest machine learning algorithms. For the analysis, 125 patients diagnosed with NMIBC were used, data from 95 patients were randomly selected for the training set, and data from 30 patients were randomly selected for the test set. The results showed that the support vector machine-based model predicted recurrence within 2 years after TUR with a probability of 90% and the random forest-based model with probability of 86.7%. In the future, the system can be used to objectively predict NMIBC recurrence after TUR.

Learning Higher-order Object Interactions for Keypoint-based Video Understanding

Action recognition is an important problem that requires identifying actions in video by learning complex interactions across scene actors and objects. However, modern deep-learning based networks often require significant computation and may capture scene context using various modalities that further increases compute costs. Efficient methods such as those used for AR/VR often only use human-keypoint information but suffer from a loss of scene context that hurts accuracy. In this paper, we describe an action-localization method, KeyNet, that uses only the keypoint data for tracking and action recognition. Specifically, KeyNet introduces the use of object based keypoint information to capture context in the scene. Our method illustrates how to build a structured intermediate representation that allows modeling higher-order interactions in the scene from object and human keypoints without using any RGB information. We find that KeyNet is able to track and classify human actions at just 5 FPS. More importantly, we demonstrate that object keypoints can be modeled to recover any loss in context from using keypoint information over AVA action and Kinetics datasets.

Hopper: Multi-hop Transformer for Spatio-Temporal Reasoning

This paper considers the problem of spatiotemporal object-centric reasoning in videos. Central to our approach is the notion of object permanence, i.e., the ability to reason about the location of objects as they move through the video while being occluded, contained or carried by other objects. Existing deep learning based approaches often suffer from spatiotemporal biases when applied to video reasoning problems. We propose Hopper, which uses a Multi-hop Transformer for reasoning object permanence in videos. Given a video and a localization query, Hopper reasons over image and object tracks to automatically hop over critical frames in an iterative fashion to predict the final position of the object of interest. We demonstrate the effectiveness of using a contrastive loss to reduce spatiotemporal biases. We evaluate over CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy using just 1 FPS by hopping through just a few critical frames. We also demonstrate Hopper can perform long-term reasoning by building a CATER-h dataset that requires multi-step reasoning to localize objects of interest correctly.

A Multi-Scale Conditional Deep Model for Tumor Cell Ratio Counting

We propose a method to accurately obtain the ratio of tumor cells over an entire histological slide. We use deep fully convolutional neural network models trained to detect and classify cells on images of H&E-stained tissue sections. Pathologists’ labels consisting of exhaustive nuclei locations and tumor regions were used to trained the model in a supervised fashion. We show that combining two models, each working at a different magnification allows the system to capture both cell-level details and surrounding context to enable successful detection and classification of cells as either tumor-cell or normal-cell. Indeed, by conditioning the classification of a single cell on a multi-scale context information, our models mimic the process used by pathologists who assess cell neoplasticity and tumor extent at different microscope magnifications. The ratio of tumor cells can then be readily obtained by counting the number of cells in each class. To analyze an entire slide, we split it into multiple tiles that can be processed in parallel. The overall tumor cell ratio can then be aggregated. We perform experiments on a dataset of 100 slides with lung tumor specimens from both resection and tissue micro-array (TMA). We train fully-convolutional models using heavy data augmentation and batch normalization. On an unseen test set, we obtain an average mean absolute error on predicting the tumor cell ratio of less than 6%, which is significantly better than the human average of 20% and is key in properly selecting tissue samples for recent genetic panel tests geared at prescribing targeted cancer drugs. We perform ablation studies to show the importance of training two models at different magnifications and to justify the choice of some parameters, such as the size of the receptive field.

Prediction of Early Recurrence of Hepatocellular Carcinoma after Resection using Digital Pathology Images Assessed by Machine Learning

Hepatocellular carcinoma (HCC) is a representative primary liver cancer caused by long-term and repetitive liver injury. Surgical resection is generally selected as the radical cure treatment. Because the early recurrence of HCC after resection is associated with low overall survival, the prediction of recurrence after resection is clinically important. However, the pathological characteristics of the early recurrence of HCC have not yet been elucidated. We attempted to predict the early recurrence of HCC after resection based on digital pathologic images of hematoxylin and eosin-stained specimens and machine learning applying a support vector machine (SVM). The 158 HCC patients meeting the Milan criteria who underwent surgical resection were included in this study. The patients were categorized into three groups: Group I, patients with HCC recurrence within 1 year after resection (16 for training and 23 for test), Group II, patients with HCC recurrence between 1 and 2 years after resection (22 and 28), and Group III, patients with no HCC recurrence within 4 years after resection (31 and 38). The SVM-based prediction method separated the three groups with 89.9% (80/89) accuracy. Prediction of Groups I was consistent for all cases, while Group II was predicted to be Group III in one case, and Group III was predicted to be Group II in 8 cases. The use of digital pathology and machine learning could be used for highly accurate prediction of HCC recurrence after surgical resection, especially that for early recurrence. Currently, in most cases after HCC resection, regular blood tests and diagnostic imaging are used for follow-up observation, however, the use of digital pathology coupled with machine learning offers potential as a method for objective postoprative follow-up observation.

Tripping Through Time: Efficient Localization of Activities in Videos

Localizing moments in untrimmed videos via language queries is a new and interesting task that requires the ability to accurately ground language into video. Previous works have approached this task by processing the entire video, often more than once, to localize relevant activities. In the real world applications of this approach, such as video surveillance, efficiency is a key system requirement. In this paper, we present TripNet, an end-to-end system that uses a gated attention architecture to model fine-grained textual and visual representations in order to align text and video content. Furthermore, TripNet uses reinforcement learning to efficiently localize relevant activity clips in long videos, by learning how to intelligently skip around the video. It extracts visual features for few frames to perform activity classification. In our evaluation over Charades-STA [14], ActivityNet Captions [26] and the TACoS dataset [36], we find that TripNet achieves high accuracy and saves processing time by only looking at 32-41% of the entire video.

15 Keypoints Is All You Need

Pose-tracking is an important problem that requires identifying unique human pose-instances and matching them temporally across different frames in a video. However, existing pose-tracking methods are unable to accurately model temporal relationships and require significant computation, often computing the tracks offline. We present an efficient multi-person pose-tracking method, KeyTrack that only relies on keypoint information without using any RGB or optical flow to locate and track human keypoints in real-time. KeyTrack is a top-down approach that learns spatio-temporal pose relationships by modeling the multi-person pose-tracking problem as a novel Pose Entailment task using a Transformer-based architecture. Furthermore, KeyTrack uses a novel, parameter-free, keypoint refinement technique that improves the keypoint estimates used by the Transformers. We achieved state-of-the-art results on PoseTrack’17 and PoseTrack’18 benchmarks while using only a fraction of the computation used by most other methods for computing the tracking information.

S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation

We propose a sequential variational autoencoder to learn disentangled representations of sequential data (e.g., videos and audios) under self-supervision. Specifically, we exploit the benefits of some readily accessible supervision signals from input data itself or some off-the-shelf functional models and accordingly design auxiliary tasks for our model to utilize these signals. With the supervision of the signals, our model can easily disentangle the representation of an input sequence into static factors and dynamic factors (i.e., time-invariant and time-varying parts). Comprehensive experiments across videos and audios verify the effectiveness of our model on representation disentanglement and generation of sequential data, and demonstrate that, our model with self-supervision performs comparable to, if not better than, the fully-supervised model with ground truth labels, and outperforms state-of-the-art unsupervised models by a large margin.

On Novel Object Recognition: A Unified Framework for Discriminability and Adaptability

The rich and accessible labeled data fueled the revolutionary successes of deep learning in object recognition. However, recognizing objects of novel classes with limited supervision information provided, i.e., Novel Object Recognition (NOR), remains a challenging task. We identify in this paper two key factors for the success of NOR that previous approaches fail to simultaneously guarantee. The first is producing discriminative feature representations for images of novel classes, and the second is generating a flexible classifier readily adapted to novel classes provided with limited supervision signals. To secure both key factors, we propose a framework which decouples a deep classification model into a feature extraction module and a classification module. We learn the former to ensure feature discriminability with a standard multi-class classification task by fully utilizing the competing information among all classes within a training set, and learn the latter to secure adaptability by training a meta-learner network which generates classifier weights whenever provided with minimal supervision information of target classes. Extensive experiments on common benchmark datasets in the settings of both zero-shot and few-shot learning demonstrate our method achieves state-of-the-art performance.