Offline Multi-Policy Gradient for Latent Mixture Environments

This article is free to access.

Abstract

Reinforcement learning has been widely applied to sequential decision-making problems in real-world domains such as recommendation and e-learning. Many real applications combine three features (multiple policies, latent mixture environments, and offline learning) that together pose a new challenge for reinforcement learning. To address this challenge, the paper proposes a reinforcement learning approach called offline multi-policy gradient for latent mixture environments. The proposed method takes as its objective the expected return of a trajectory with respect to the joint distribution of trajectory and model, and adopts a multi-policy search algorithm based on expectation maximization to find the optimal policies. We also prove that the off-policy techniques of importance sampling and the advantage function can be used in offline multi-policy learning with fixed historical trajectories. The effectiveness of the approach is demonstrated by experiments on both synthetic and real datasets.

Cite

Li, X., Zhang, X., Wang, L., & Yu, G. (2021). Offline Multi-Policy Gradient for Latent Mixture Environments. IEEE Access, 9, 801–812. https://doi.org/10.1109/ACCESS.2020.3045300
