Policy gradients refer to a class of optimization algorithms used to train models, known as policies, to make decisions that maximize the cumulative reward an agent receives in a given environment. These methods are crucial for training agents in scenarios where the optimal strategy is not immediately apparent and must be learned through trial and error.

Common algorithms include REINFORCE, Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO).
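As a concrete illustration, the simplest of these, REINFORCE, can be sketched on a toy multi-armed bandit. Everything below (the reward means, learning rate, baseline) is our own illustrative setup, not a reference implementation: the policy is a softmax over per-arm preferences, and each sampled reward nudges the parameters along the score function, scaled by an advantage.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_MEANS = np.array([0.2, 0.5, 0.9])  # hidden reward means; arm 2 is best

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce(episodes=2000, lr=0.1):
    theta = np.zeros(3)   # policy parameters: one preference per arm
    baseline = 0.0        # running average reward, reduces gradient variance
    for _ in range(episodes):
        probs = softmax(theta)
        a = rng.choice(3, p=probs)           # sample an action from the policy
        r = rng.normal(TRUE_MEANS[a], 0.1)   # observe a noisy reward
        baseline += 0.01 * (r - baseline)
        # For a softmax policy, grad log pi(a) = one_hot(a) - probs
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta += lr * (r - baseline) * grad_log_pi  # policy-gradient ascent step
    return theta

theta = reinforce()
print(int(np.argmax(theta)))  # the learned policy should prefer the best arm
```

TRPO and PPO refine this basic update by constraining (or clipping) how far each step can move the policy, which stabilizes training on larger problems.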


Learning To Simulate

Simulation is a useful tool in situations where training data for machine learning models is costly to annotate or even hard to acquire. In this work, we propose a reinforcement learning-based method for automatically adjusting the parameters of any (non-differentiable) simulator, thereby controlling the distribution of synthesized data in order to maximize the accuracy of a model trained on that data. In contrast to prior art that hand-crafts these simulation parameters or adjusts only parts of the available parameters, our approach fully controls the simulator with the actual underlying goal of maximizing accuracy, rather than mimicking the real data distribution or randomly generating a large volume of data. We find that our approach (i) quickly converges to the optimal simulation parameters in controlled experiments and (ii) can indeed discover good sets of parameters for an image rendering simulator in actual computer vision applications.
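The outer loop described above can be sketched as a toy bilevel optimization. The setup below is our own simplification, not the paper's method: a black-box "simulator" with one parameter synthesizes labeled 1-D data, a simple model is trained on that data, and its accuracy on held-out real data serves as the reward for a REINFORCE-style update of the simulation parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Real" data: two classes centered at -1 and +2 (illustrative stand-in).
REAL_X = np.concatenate([rng.normal(-1, 1, 200), rng.normal(2, 1, 200)])
REAL_Y = np.concatenate([np.zeros(200), np.ones(200)])

def simulate(sep, n=200):
    """Non-differentiable simulator: `sep` controls class separation."""
    x = np.concatenate([rng.normal(0, 1, n), rng.normal(sep, 1, n)])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return x, y

def train_and_eval(sep):
    """Train a threshold 'model' on synthetic data; reward = real-data accuracy."""
    x, y = simulate(sep)
    thresh = (x[y == 0].mean() + x[y == 1].mean()) / 2
    return ((REAL_X > thresh) == REAL_Y).mean()

# Gaussian policy over the simulation parameter, updated by the score function.
mu, sigma, lr = 0.5, 0.5, 0.5
for _ in range(200):
    sep = rng.normal(mu, sigma)        # sample a simulator configuration
    reward = train_and_eval(sep)       # inner loop: train model, measure accuracy
    baseline = train_and_eval(mu)      # simple baseline to reduce variance
    mu += lr * (reward - baseline) * (sep - mu) / sigma**2

acc = train_and_eval(mu)
```

Because the simulator is treated as a black box and only the downstream accuracy is observed, the same loop applies unchanged to far more complex simulators, such as the image renderers used in the paper's experiments.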