Contextual bandits are a class of machine learning problems that extend the multi-armed bandit setting with contextual information. An agent makes a sequence of decisions, each one a “pull” of a bandit arm, observing a context before each decision and receiving a reward only for the arm it chose. The goal is to learn a policy that maps contexts to arms and maximizes cumulative reward over time. The framework suits scenarios where decisions must be made sequentially with only partial information about the environment.
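
The decision loop described above can be illustrated with a minimal sketch. It assumes discrete contexts, binary rewards, and an epsilon-greedy exploration rule; the class name `EpsilonGreedyContextualBandit`, the `simulate` helper, and the reward probabilities are all hypothetical choices made for illustration, not part of any particular library or paper.

```python
import random
from collections import defaultdict


class EpsilonGreedyContextualBandit:
    """Tabular epsilon-greedy learner: one running reward mean per (context, arm)."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # Per-context pull counts and mean-reward estimates, created lazily.
        self.counts = defaultdict(lambda: [0] * n_arms)
        self.means = defaultdict(lambda: [0.0] * n_arms)

    def select_arm(self, context):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)
        means = self.means[context]
        return max(range(self.n_arms), key=lambda a: means[a])

    def update(self, context, arm, reward):
        # Only the pulled arm's estimate changes (partial feedback).
        self.counts[context][arm] += 1
        n = self.counts[context][arm]
        self.means[context][arm] += (reward - self.means[context][arm]) / n


def simulate(n_rounds=5000, seed=0):
    # Hypothetical environment: the best arm depends on the context.
    reward_prob = {"A": [0.8, 0.2], "B": [0.1, 0.9]}
    env_rng = random.Random(seed + 1)
    agent = EpsilonGreedyContextualBandit(n_arms=2, epsilon=0.1, seed=seed)
    total = 0.0
    for _ in range(n_rounds):
        context = env_rng.choice(["A", "B"])
        arm = agent.select_arm(context)
        reward = 1.0 if env_rng.random() < reward_prob[context][arm] else 0.0
        agent.update(context, arm, reward)
        total += reward
    return agent, total / n_rounds


agent, avg_reward = simulate()
```

After the run, `agent.means["A"]` and `agent.means["B"]` hold the learned reward estimates for each context, and `avg_reward` is the average reward per round. A tabular learner like this only works when contexts repeat; with continuous or high-dimensional contexts, a parametric model per arm would normally replace the lookup table.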


Hierarchical Imitation Learning with Contextual Bandits for Dynamic Treatment Regimes

Imitation learning has proven effective at mimicking experts’ behaviors from their demonstrations without access to explicit reward signals. Meanwhile, complex tasks, e.g., dynamic treatment regimes for patients with comorbidities, often exhibit significant variability in expert demonstrations that span multiple sub-tasks. In these cases, it can be difficult for a single flat policy to handle tasks with hierarchical structure. In this paper, we propose a hierarchical imitation learning model, HIL, that jointly learns latent high-level policies and sub-policies (for individual sub-tasks) from expert demonstrations without prior knowledge. First, HIL learns sub-policies by imitating expert trajectories, using sub-task switching guidance from the high-level policies. Second, HIL collects feedback from its sub-policies to optimize the high-level policies, which are modeled as a contextual multi-armed bandit that selects the best sub-policy at each time step based on contextual information derived from the demonstrations. Compared with state-of-the-art baselines on real-world medical data, HIL improves the likelihood of patient survival and provides better dynamic treatment regimes by exploiting the hierarchical structure in expert demonstrations.