Study Notes: Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016
From this chapter onward the book moves into the part called Looking Deeper, which covers the connections between reinforcement learning and psychology and neuroscience, as well as applications and case studies, and frontiers.
Working through those chapters would mostly be a matter of translation, which is not my strength, so my notes end here.
For reference, here is the book's list of algorithms, organized by chapter (minimal Python sketches of a few of them follow after the list):

Chapter 2:
- A simple bandit algorithm (see sketch below)

Chapter 4:
- Iterative policy evaluation (see sketch below)
- Policy iteration (using iterative policy evaluation)
- Value iteration

Chapter 5:
- First-visit MC policy evaluation (returns V ≈ vπ)
- Monte Carlo ES (Exploring Starts)
- On-policy first-visit MC control (for ε-soft policies)
- Incremental off-policy every-visit MC policy evaluation
- Off-policy every-visit MC control (returns π ≈ π*)

Chapter 6:
- Tabular TD(0) for estimating vπ
- Sarsa: An on-policy TD control algorithm
- Q-learning: An off-policy TD control algorithm (see sketch below)
- Double Q-learning

Chapter 7:
- n-step TD for estimating V ≈ vπ
- n-step Sarsa for estimating Q ≈ q*, or Q ≈ qπ for a given π
- Off-policy n-step Sarsa for estimating Q ≈ q*, or Q ≈ qπ for a given π
- n-step Tree Backup for estimating Q ≈ q*, or Q ≈ qπ for a given π
- Off-policy n-step Q(σ) for estimating Q ≈ q*, or Q ≈ qπ for a given π

Chapter 8:
- Random-sample one-step tabular Q-planning
- Tabular Dyna-Q (see sketch below)
- Prioritized sweeping for a deterministic environment

Chapter 9:
- Gradient Monte Carlo Algorithm for Approximating v̂ ≈ vπ
- Semi-gradient TD(0) for estimating v̂ ≈ vπ
- n-step semi-gradient TD for estimating v̂ ≈ vπ
- LSTD for estimating v̂ ≈ vπ (O(n²) version)

Chapter 10:
- Episodic Semi-gradient Sarsa for Control
- Episodic semi-gradient n-step Sarsa for estimating q̂ ≈ q*, or q̂ ≈ qπ
- Differential Semi-gradient Sarsa for Control
- Differential semi-gradient n-step Sarsa for estimating q̂ ≈ q*, or q̂ ≈ qπ

Chapter 12:
- Semi-gradient TD(λ) for estimating v̂ ≈ vπ
- True Online TD(λ) for estimating w⊤x ≈ vπ

Chapter 13:
- REINFORCE, A Monte-Carlo Policy-Gradient Method (episodic) (see sketch below)
- REINFORCE with Baseline (episodic)
- One-step Actor-Critic (episodic)
- Actor-Critic with Eligibility Traces (episodic)
- Actor-Critic with Eligibility Traces (continuing)
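To make the list a bit more concrete, here is a minimal sketch of the Chapter 2 simple bandit algorithm: ε-greedy action selection with incremental sample-average value estimates. The 10-armed Gaussian testbed setup and all parameter values are my own assumptions for illustration, not code from the book.

```python
import random

def run_bandit(k=10, steps=1000, epsilon=0.1):
    # One run on a k-armed testbed: true action values drawn from N(0, 1).
    q_true = [random.gauss(0, 1) for _ in range(k)]
    Q = [0.0] * k        # sample-average value estimates
    N = [0] * k          # action counts
    avg_reward = 0.0
    for t in range(1, steps + 1):
        if random.random() < epsilon:
            a = random.randrange(k)                    # explore
        else:
            a = max(range(k), key=lambda i: Q[i])      # exploit
        r = random.gauss(q_true[a], 1)                 # noisy reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                      # incremental mean
        avg_reward += (r - avg_reward) / t
    return avg_reward

print(run_bandit())
```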
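Next, a minimal sketch of iterative policy evaluation (Chapter 4): sweep the states, backing up v(s) as the expected one-step return under a fixed policy, until the sweeps stop changing the value function. The five-state corridor MDP and the equiprobable random policy are assumptions invented for this example.

```python
# Toy corridor MDP (assumed): states 0..4, actions -1/+1,
# reward +1 on entering the terminal state 4.
n_states, gamma, tolerance = 5, 0.9, 1e-8
V = [0.0] * n_states     # V[4] is terminal and stays 0

def step(s, a):
    s2 = min(max(s + a, 0), n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

while True:
    delta = 0.0
    for s in range(n_states - 1):            # sweep non-terminal states
        v_old = V[s]
        # expected one-step backup under the equiprobable random policy
        V[s] = sum(0.5 * (r + gamma * V[s2])
                   for s2, r in (step(s, a) for a in (-1, +1)))
        delta = max(delta, abs(v_old - V[s]))
    if delta < tolerance:                    # stop once sweeps converge
        break

print([round(v, 3) for v in V])
```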
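A minimal sketch of tabular Q-learning (Chapter 6): an off-policy TD control method whose update target bootstraps from max over the next state's action values, regardless of which action the ε-greedy behavior policy actually takes. The toy corridor environment is the same assumption as above.

```python
import random

# Same assumed corridor MDP: states 0..4, actions -1/+1, +1 on reaching 4.
n_states, actions = 5, (-1, +1)
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step(s, a):
    s2 = min(max(s + a, 0), n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0), s2 == n_states - 1

def greedy(s):
    # break ties randomly so early episodes are not stuck on one action
    return max(actions, key=lambda a: (Q[(s, a)], random.random()))

for episode in range(200):
    s, done = 0, False
    while not done:
        a = random.choice(actions) if random.random() < epsilon else greedy(s)
        s2, r, done = step(s, a)
        # off-policy target: bootstrap from the greedy value of s2
        best_next = 0.0 if done else max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

print({s: greedy(s) for s in range(n_states - 1)})  # learned policy: all +1
```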
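A minimal sketch of Tabular Dyna-Q (Chapter 8): every real step feeds both a direct Q-learning update and a learned deterministic model, and the model is then replayed for n extra simulated planning updates. Environment and parameter values are again assumed for illustration.

```python
import random

n_states, actions = 5, (-1, +1)              # assumed corridor MDP again
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
model = {}                                   # (s, a) -> (r, s')
alpha, gamma, epsilon, n_planning = 0.1, 0.9, 0.1, 10

def step(s, a):
    s2 = min(max(s + a, 0), n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

def q_update(s, a, r, s2):
    # Q at the terminal state is never updated, so it bootstraps as 0
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

s = 0
for t in range(2000):
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda a: (Q[(s, a)], random.random()))
    s2, r = step(s, a)
    q_update(s, a, r, s2)                    # direct RL from real experience
    model[(s, a)] = (r, s2)                  # learn a deterministic model
    for _ in range(n_planning):              # planning: replay model samples
        ps, pa = random.choice(list(model))
        pr, ps2 = model[(ps, pa)]
        q_update(ps, pa, pr, ps2)
    s = 0 if s2 == n_states - 1 else s2      # restart an episode at the goal

print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)})
```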
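Finally, a minimal sketch of REINFORCE (Chapter 13), the Monte-Carlo policy-gradient update θ ← θ + α G ∇ln π(A|θ), here with a softmax policy over action preferences. To stay self-contained it treats a two-armed bandit as one-step episodes; that simplification and the reward probabilities are mine, not the book's example.

```python
import math, random

theta = [0.0, 0.0]                 # softmax action preferences
alpha = 0.1
p_success = [0.2, 0.8]             # assumed true reward probabilities

def policy():
    ex = [math.exp(t) for t in theta]
    z = sum(ex)
    return [e / z for e in ex]

for episode in range(5000):
    probs = policy()
    a = random.choices((0, 1), weights=probs)[0]
    G = 1.0 if random.random() < p_success[a] else 0.0   # episode return
    # REINFORCE update: theta += alpha * G * grad ln pi(a);
    # for a softmax, grad_b ln pi(a) = 1[b == a] - pi(b)
    for b in range(2):
        theta[b] += alpha * G * ((1.0 if b == a else 0.0) - probs[b])

print([round(p, 3) for p in policy()])   # most probability ends on arm 1
```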