Abstract
Policy optimization on high-dimensional continuous control tasks is difficult because of the large variance of policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator, which incorporates the Rao-Blackwell theorem (RB) and Control Variates (CV) into a unified framework to reduce this variance. To invoke RB, our proposed algorithm (POSA) learns the underlying factorization structure of the action space from second-order advantage information. POSA captures this quadratic information explicitly and efficiently by utilizing a wide & deep architecture. Empirical studies show that our approach improves performance on high-dimensional synthetic settings and on OpenAI Gym's MuJoCo continuous control tasks.
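The paper's ASDG estimator is not reproduced here, but the control-variate idea it builds on can be sketched in a few lines. The toy below is a hedged illustration only: a 1-D Gaussian policy, an assumed quadratic reward, and a constant baseline stand in for the learned, action-subspace-dependent quantities in POSA. It shows that subtracting a baseline leaves the REINFORCE-style gradient estimate unbiased while shrinking its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (illustrative assumptions, not the paper's setup):
# policy pi(a) = N(mu, 1), reward r(a) = -(a - 3)^2.
# Score-function gradient w.r.t. mu: g = (r(a) - b) * d/dmu log pi(a)
#                                      = (r(a) - b) * (a - mu).
mu, n = 0.0, 100_000
a = rng.normal(mu, 1.0, size=n)
r = -(a - 3.0) ** 2

g_plain = r * (a - mu)        # raw estimator, no control variate
b = r.mean()                  # constant baseline: the simplest control variate
g_cv = (r - b) * (a - mu)     # baseline-corrected estimator

# The baseline multiplies the zero-mean score, so both estimators are
# unbiased (true gradient at mu=0 is 6), but the variance drops sharply.
print(f"mean  plain {g_plain.mean():+.3f}   cv {g_cv.mean():+.3f}")
print(f"var   plain {g_plain.var():.1f}   cv {g_cv.var():.1f}")
```

ASDG goes further than this constant baseline: it Rao-Blackwellizes over learned action subspaces and uses second-order advantage information to build the control variate, which this sketch does not attempt.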
Citation
Li, J., Wang, B., & Zhang, S. (2018). Policy optimization with second-order advantage information. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 2018-July, pp. 5038–5044). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2018/699