Faster Inference refers to a model's ability to make predictions or decisions more quickly. Inference is the phase of a model's operation in which it processes input data and produces output, such as predictions or classifications. To achieve faster inference, various optimization techniques are employed, including model quantization, hardware acceleration (such as specialized AI processors), model pruning, and architecture modifications. The goal is to strike a balance between the accuracy of the model and the speed at which it can process input data and produce reliable results.
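As a concrete illustration of one of these techniques, the following is a minimal sketch of post-training dynamic quantization with PyTorch; the toy model and layer choices are assumptions for demonstration only, not drawn from the text above.

```python
# A minimal sketch of post-training dynamic quantization, one common route
# to faster inference. The toy model below is a hypothetical stand-in for
# a trained network.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()

# Convert Linear layers to int8 weights; activations are quantized
# dynamically at run time, trading a little accuracy for lower latency.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # same interface, faster CPU inference
```

Pruning and architecture changes follow the same principle: reduce the work done per input while keeping accuracy within an acceptable margin.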

Posts

Adaptive Memory Networks

We present Adaptive Memory Networks (AMN), which process input-question pairs to dynamically construct a network architecture optimized for lower inference times on Question Answering (QA) tasks. AMN processes the input story to extract entities and stores them in memory banks. Starting from a single bank, as the number of input entities increases, AMN learns to create new banks as the entropy in a single bank becomes too high. Hence, after processing an input-question(s) pair, the resulting network represents a hierarchical structure in which entities are stored in different banks, distanced by question relevance. At inference, one or a few banks are used, creating a tradeoff between accuracy and performance. AMN is enabled by dynamic networks, which allow input-dependent network creation and efficient dynamic mini-batching, as well as by our novel bank controller, which allows learning discrete decision making with high accuracy. In our results, we demonstrate that AMN learns to create variable-depth networks depending on task complexity and reduces inference times for QA tasks.
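To make the entropy-driven bank creation more concrete, here is a hedged, illustrative sketch of that idea: entity vectors are appended to the current bank, and a new bank is opened when an entropy measure over the bank grows too high. The entropy proxy, the threshold, and the data layout are assumptions for illustration only and are not the authors' implementation.

```python
# Illustrative sketch (not the AMN implementation): split entities into
# memory banks, opening a new bank whenever a simple entropy proxy over
# the current bank exceeds a threshold.
import torch


def bank_entropy(bank: torch.Tensor) -> float:
    """Entropy of a softmax over mean absolute activation per entity (a proxy)."""
    scores = torch.softmax(bank.abs().mean(dim=1), dim=0)
    return float(-(scores * scores.clamp_min(1e-12).log()).sum())


def store_entities(entities: torch.Tensor, threshold: float = 1.5):
    """Distribute entity vectors across banks, creating a new bank when entropy is high."""
    banks = [[]]
    for e in entities:
        banks[-1].append(e)
        current = torch.stack(banks[-1])
        if len(banks[-1]) > 1 and bank_entropy(current) > threshold:
            banks.append([])  # entropy too high: start a fresh bank
    return [torch.stack(b) for b in banks if b]


entities = torch.randn(12, 32)  # 12 entity embeddings of dimension 32
for i, b in enumerate(store_entities(entities)):
    print(f"bank {i}: {tuple(b.shape)}")
```

At inference, consulting only the most question-relevant bank (or a few of them) is what yields the accuracy-versus-speed tradeoff the abstract describes.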