Trust Region Policy Optimization (TRPO) | Reading Deep Learning Papers in the Original #8


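This series collects my reading notes on papers, mainly from deep learning research.
The aim is to read well-known, epoch-making papers closely and write up my impressions.
(I was busy when I read this paper and could not go through it properly, so for now this post covers only the Abstract. I will add details later if needed.)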

Continuing the reinforcement learning thread from DQN in #7, #8 covers TRPO (Trust Region Policy Optimization).

[1502.05477] Trust Region Policy Optimization

The paper's table of contents is as follows.

0. Abstract
1. Introduction
2. Preliminaries
3. Monotonic Improvement Guarantee for General Stochastic Policies
4. Optimization of Parameterized Policies
5. Sample-Based Estimation of the Objective and Constraint
6. Practical Algorithm
7. Connections with Prior Work
8. Experiments
9. Discussion

0. Abstract

The Abstract summarizes the gist of the paper, so I will read it closely one sentence at a time.

We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO).

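Paraphrase: "We propose an iterative procedure for optimizing policies that guarantees monotonic improvement. By applying several approximations to this theoretically justified procedure, we develop a practical algorithm named TRPO."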

This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks.

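Paraphrase: "The algorithm resembles natural policy gradient methods (not merely conventional policy gradients) and is effective for optimizing large nonlinear policies such as neural networks."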

Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input.

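Paraphrase: "In the authors' experiments, TRPO performs robustly across a wide range of tasks: learning swimming, hopping, and walking gaits for simulated robots, and playing Atari games with screen images as input."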

Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.

Paraphrase: "Although its approximations depart from the theory, TRPO tends to improve monotonically, with little hyperparameter tuning."
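
To make the "trust region" in the name concrete: the paper's practical algorithm maximizes an importance-sampled surrogate objective while keeping the average KL divergence between the old and new policies below a bound δ. The sketch below illustrates only that check, for small categorical policies. It is not the authors' implementation: the function names (`surrogate_and_kl`, `accept_step`), the list-based data layout, and the default `delta=0.01` are assumptions made for this example, and the advantage estimates are taken as given.

```python
import math


def surrogate_and_kl(old_probs, new_probs, actions, advantages):
    """Return (surrogate objective, mean KL(old || new)) over sampled states.

    old_probs / new_probs: one action-probability list per sampled state.
    actions: index of the action taken in each state.
    advantages: advantage estimate for each (state, action) sample.
    """
    n = len(actions)
    surrogate = 0.0
    mean_kl = 0.0
    for p_old, p_new, a, adv in zip(old_probs, new_probs, actions, advantages):
        # Importance-sampling ratio pi_new(a|s) / pi_old(a|s) times the advantage.
        surrogate += (p_new[a] / p_old[a]) * adv
        # KL divergence between the two action distributions at this state.
        mean_kl += sum(
            po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0
        )
    return surrogate / n, mean_kl / n


def accept_step(old_probs, new_probs, actions, advantages, delta=0.01):
    """Accept a candidate policy only if it stays inside the trust region
    (mean KL <= delta) and improves the surrogate over the old policy."""
    surrogate, mean_kl = surrogate_and_kl(old_probs, new_probs, actions, advantages)
    # With new == old every ratio is 1, so the surrogate is the mean advantage.
    baseline = sum(advantages) / len(advantages)
    return mean_kl <= delta and surrogate > baseline


if __name__ == "__main__":
    old = [[0.5, 0.5], [0.5, 0.5]]
    new = [[0.6, 0.4], [0.5, 0.5]]
    acts, advs = [0, 1], [1.0, -1.0]
    print(surrogate_and_kl(old, new, acts, advs))
    print(accept_step(old, new, acts, advs, delta=0.01))
    print(accept_step(old, new, acts, advs, delta=0.02))
```

In this toy data the candidate policy improves the surrogate but moves slightly too far for δ = 0.01, so it is rejected there and accepted under the looser δ = 0.02. A full implementation would also need a way to propose candidate steps, which this sketch leaves out.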

1. Introduction
2. Preliminaries
3. Monotonic Improvement Guarantee for General Stochastic Policies
4. Optimization of Parameterized Policies
5. Sample-Based Estimation of the Objective and Constraint
6. Practical Algorithm
7. Connections with Prior Work
8. Experiments
9. Discussion