Abstract
Reinforcement learning has been widely applied to sequential decision-making problems in real-world fields such as recommendation and e-learning. Many real applications involve multiple policies, latent mixture environments, and offline learning, and this combination poses a new challenge for reinforcement learning. To address this challenge, this paper proposes a reinforcement learning approach called offline multi-policy gradient for latent mixture environments. The proposed method takes as its objective the expected return of a trajectory with respect to the joint distribution of trajectory and model, and adopts a multi-policy search algorithm based on expectation maximization to find the optimal policies. We also prove that the off-policy techniques of importance sampling and the advantage function can be used in offline multi-policy learning with fixed historical trajectories. The effectiveness of our approach is demonstrated by experiments on both synthetic and real datasets.
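To make the role of importance sampling on fixed historical trajectories concrete, the sketch below estimates the expected return of a target policy from trajectories logged under a different behavior policy. It is a minimal illustration of ordinary trajectory-level importance sampling for a single policy, not the paper's algorithm: it omits the latent mixture model, the expectation-maximization step, and the advantage function. The function name `importance_weighted_return`, the `(state, action, reward)` trajectory layout, and the two probability callbacks are assumptions made for this example.

```python
def importance_weighted_return(trajectories, target_prob, behavior_prob, gamma=1.0):
    """Estimate the expected return of a target policy from logged trajectories.

    trajectories: list of trajectories, each a list of (state, action, reward).
    target_prob(state, action): action probability under the policy being evaluated.
    behavior_prob(state, action): action probability under the policy that
        generated the logged data (assumed known and nonzero on logged actions).
    All names and the data layout are hypothetical, chosen for illustration.
    """
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0      # product of per-step probability ratios
        ret = 0.0         # discounted return of this trajectory
        discount = 1.0
        for state, action, reward in trajectory:
            weight *= target_prob(state, action) / behavior_prob(state, action)
            ret += discount * reward
            discount *= gamma
        total += weight * ret
    return total / len(trajectories)


# Two one-step trajectories logged under a uniform policy over actions {0, 1}.
logged = [[(0, 1, 1.0)], [(0, 0, 0.0)]]
uniform = lambda state, action: 0.5
always_one = lambda state, action: 1.0 if action == 1 else 0.0
print(importance_weighted_return(logged, always_one, uniform))
```

In this toy case the target policy always picks action 1, so the trajectory that took action 0 receives weight zero and the one that took action 1 receives weight 2; the logged data is reweighted rather than recollected, which is the property offline learning relies on.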
Li, X., Zhang, X., Wang, L., & Yu, G. (2021). Offline Multi-Policy Gradient for Latent Mixture Environments. IEEE Access, 9, 801–812. https://doi.org/10.1109/ACCESS.2020.3045300