Offline Multi-Policy Gradient for Latent Mixture Environments

Xiaoguang Li; Xin Zhang; Lixin Wang; Ge Yu

Journal Article (Open Access)

IEEE Access (2021) 9 801-812

DOI: 10.1109/ACCESS.2020.3045300

Abstract

Reinforcement learning has been widely applied to sequential decision-making problems in real-world domains such as recommendation and e-learning. Many real applications combine three features that together pose a new challenge for reinforcement learning: multiple policies, latent mixture environments, and offline learning. To address this challenge, this paper proposes a reinforcement learning approach called offline multi-policy gradient for latent mixture environments. The proposed method takes as its objective the expected return of a trajectory with respect to the joint distribution of trajectory and model, and adopts a multi-policy search algorithm based on expectation maximization to find the optimal policies. We also prove that the off-policy techniques of importance sampling and the advantage function can be used in offline multi-policy learning with fixed historical trajectories. The effectiveness of the approach is demonstrated by experiments on both synthetic and real datasets.

Cite

Li, X., Zhang, X., Wang, L., & Yu, G. (2021). Offline Multi-Policy Gradient for Latent Mixture Environments. IEEE Access, 9, 801–812. https://doi.org/10.1109/ACCESS.2020.3045300
