Dimitri P. Bertsekas is a tenured professor at MIT, a member of the US National Academy of Engineering, a guest professor at the Research Center for Complex and Networked Systems at Tsinghua University, and an internationally renowned author in electrical engineering and computer science, with more than a dozen best-selling textbooks and monographs, including Nonlinear Programming, Network Optimization, and Convex Optimization. The purpose of this book is to consider large and challenging multistage decision problems, which can in principle be solved by dynamic programming and optimal control, but whose exact solution is computationally intractable. The book discusses solution methods that rely on approximation to produce suboptimal policies with adequate performance. These methods are collectively known as reinforcement learning, and are also referred to as approximate dynamic programming and neuro-dynamic programming.
The subject of the book arises from the interplay of ideas from optimal control and artificial intelligence. One of the aims of the book is to explore the common boundary between these two fields and to build a bridge that is accessible to professionals with a background in either field.
Contents:
1 Exact Dynamic Programming
1.1 Deterministic Dynamic Programming 2
1.1.1 Deterministic Problems 2
1.1.2 The Dynamic Programming Algorithm 7
1.1.3 Approximation in Value Space 12
1.2 Stochastic Dynamic Programming 14
1.3 Examples, Variations, and Simplifications 18
1.3.1 Deterministic Shortest Path Problems 19
1.3.2 Discrete Deterministic Optimization 21
1.3.3 Problems with a Termination State 25
1.3.4 Forecasts 26
1.3.5 Problems with Uncontrollable State Components 29
1.3.6 Partial State Information and Belief States 34
1.3.7 Linear Quadratic Optimal Control 38
1.3.8 Systems with Unknown Parameters - Adaptive Control 40
1.4 Reinforcement Learning and Optimal Control - Some Terminology 43
1.5 Notes and Sources 45
2 Approximation in Value Space
2.1 Approximation Approaches in Reinforcement Learning 50
2.1.1 General Issues of Approximation in Value Space 54
2.1.2 Off-Line and On-Line Methods 56
2.1.3 Model-Based Simplification of the Lookahead Minimization 57
2.1.4 Model-Free Off-Line Q-Factor Approximation 58
2.1.5 Approximation in Policy Space on Top of Approximation in Value Space 61
2.1.6 When is Approximation in Value Space Effective? 62
2.2 Multistep Lookahead 64
2.2.1 Multistep Lookahead and Rolling Horizon 65
2.2.2 Multistep Lookahead and Deterministic Problems 67
2.3 Problem Approximation 69
2.3.1 Enforced Decomposition 69
2.3.2 Probabilistic Approximation - Certainty Equivalent Control 76
2.4 Rollout and the Policy Improvement Principle 83
2.4.1 On-Line Rollout for Deterministic Discrete Optimization 84
2.4.2 Stochastic Rollout and Monte Carlo Tree Search 95
2.4.3 Rollout with an Expert 104
2.5 On-Line Rollout for Deterministic Infinite-Spaces Problems - Optimization Heuristics 106
2.5.1 Model Predictive Control 108
2.5.2 Target Tubes and the Constrained Controllability Condition 115
2.5.3 Variants of Model Predictive Control 118
2.6 Notes and Sources 120
3 Parametric Approximation
3.1 Approximation Architectures 126
3.1.1 Linear and Nonlinear Feature-Based Architectures 126
3.1.2 Training of Linear and Nonlinear Architectures 134
3.1.3 Incremental Gradient and Newton Methods 135
3.2 Neural Networks 149
3.2.1 Training of Neural Networks 153
3.2.2 Multilayer and Deep Neural Networks 157
3.3 Sequential Dynamic Programming Approximation 161
3.4 Q-Factor Parametric Approximation 162
3.5 Parametric Approximation in Policy Space by Classification 165
3.6 Notes and Sources 171
4 Infinite Horizon Dynamic Programming
4.1 An Overview of Infinite Horizon Problems 174
4.2 Stochastic Shortest Path Problems 177
4.3 Discounted Problems 187
4.4 Semi-Markov Discounted Problems 192
4.5 Asynchronous Distributed Value Iteration 197
4.6 Policy Iteration 200
4.6.1 Exact Policy Iteration 200
4.6.2 Optimistic and Multistep Lookahead Policy Iteration 205
4.6.3 Policy Iteration for Q-factors 208
4.7 Notes and Sources 209
4.8 Appendix: Mathematical Analysis 211
4.8.1 Proofs for Stochastic Shortest Path Problems 212
4.8.2 Proofs for Discounted Problems 217
4.8.3 Convergence of Exact and Optimistic Policy Iteration 218
5 Infinite Horizon Reinforcement Learning
5.1 Approximation in Value Space - Performance Bounds 222
5.1.1 Limited Lookahead 224
5.1.2 Rollout and Approximate Policy Improvement 227
5.1.3 Approximate Policy Iteration 232
5.2 Fitted Value Iteration 235
5.3 Simulation-Based Policy Iteration with Parametric Approximation 239
5.3.1 Self-Learning and Actor-Critic Methods 239
5.3.2 Model-Based Variant of a Critic-Only Method 241
5.3.3 Model-Free Variant of a Critic-Only Method 243
5.3.4 Implementation Issues of Parametric Policy Iteration 246
5.3.5 Convergence Issues of Parametric Policy Iteration - Oscillations 249
5.4 Q-Learning 253
5.4.1 Optimistic Policy Iteration with Parametric Q-Factor Approximation - SARSA and DQN 255
5.5 Additional Methods - Temporal Differences 256
5.6 Exact and Approximate Linear Programming 267
5.7 Approximation in Policy Space 270
5.7.1 Training by Cost Optimization - Policy Gradient, Cross-Entropy, and Random Search Methods 276
5.7.2 Expert-Based Supervised Learning 286
5.7.3 Approximate Policy Iteration, Rollout, and Approximation in Policy Space 288
5.8 Notes and Sources 293
5.9 Appendix: Mathematical Analysis 298
5.9.1 Performance Bounds for Multistep Lookahead 299
5.9.2 Performance Bounds for Rollout 301
5.9.3 Performance Bounds for Approximate Policy Iteration 304
6 Aggregation
6.1 Aggregation with Representative States 308
6.1.1 Continuous State and Control Space Discretization 314
6.1.2 Continuous State Space - POMDP Discretization 315
6.2 Aggregation with Representative Features 317
6.2.1 Hard Aggregation and Error Bounds 320
6.2.2 Aggregation Using Features 322
6.3 Methods for Solving the Aggregate Problem 328
6.3.1 Simulation-Based Policy Iteration 328
6.3.2 Simulation-Based Value Iteration 331
6.4 Feature-Based Aggregation with a Neural Network 332
6.5 Biased Aggregation 334
6.6 Notes and Sources 337
6.7 Appendix: Mathematical Analysis 340
References 345
Index 369
Excerpt:
Turning to the succor of modern computing machines, let us
renounce all analytic tools.
Richard Bellman [Bel57]
From a teleological point of view the particular numerical solution
of any particular set of equations is of far less importance than
the understanding of the nature of the solution.
Richard Bellman [Bel57]
In this book we consider large and challenging multistage decision problems,
which can be solved in principle by dynamic programming (DP for short),
but their exact solution is computationally intractable. We discuss solution
methods that rely on approximations to produce suboptimal policies with
adequate performance. These methods are collectively known by several
essentially equivalent names: reinforcement learning, approximate dynamic
programming, and neuro-dynamic programming. We will use primarily the
most popular name: reinforcement learning.
Our subject has benefited greatly from the interplay of ideas from
optimal control and from artificial intelligence. One of the aims of the
book is to explore the common boundary between these two fields and to
form a bridge that is accessible by workers with background in either field.
Another aim is to organize coherently the broad mosaic of methods that
have proved successful in practice while having a solid theoretical and/or
logical foundation. This may help researchers and practitioners to find
their way through the maze of competing ideas that constitute the current
state of the art.
There are two general approaches for DP-based suboptimal control.
The first is approximation in value space, where we approximate in some
way the optimal cost-to-go function with some other function. The major
alternative to approximation in value space is approximation in policy
space, whereby we select the policy by using optimization over a suitably
restricted class of policies, usually a parametric family of some form. In
some schemes these two types of approximation may be combined, aiming
to capitalize on the advantages of both. Generally, approximation in value
space is tied more closely to the central DP ideas of value and policy iteration
than approximation in policy space, which relies on gradient-like
descent, a more broadly applicable optimization mechanism.
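As a rough illustration of this distinction, the two approaches can be sketched in generic DP notation (an editorial sketch, not a formula quoted from the book; x denotes the state, u the control, w the random disturbance, f the system function, g the stage cost, alpha a discount factor, U(x) the control constraint set, and mu_theta a policy from a parametric family):

    % Approximation in value space: one-step lookahead using a cost approximation \tilde J
    \tilde\mu(x) \in \arg\min_{u \in U(x)} E\big\{\, g(x,u,w) + \alpha\, \tilde J\big(f(x,u,w)\big) \,\big\}

    % Approximation in policy space: optimize the cost of a parametrized policy \mu_\theta
    \min_{\theta}\; J_{\mu_\theta}(x_0)

Multistep lookahead versions of the first scheme optimize over several stages before applying \tilde J.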
While we provide a substantial treatment of approximation in policy
space, most of the book is focused on approximation in value space. Here,
the control at each state is obtained by optimization of the cost over a
limited horizon, plus an approximation of the optimal future cost. The
latter cost, which we generally denote by J̃, is a function of the state where
we may be. It may be computed by a variety of methods, possibly involving
simulation and/or some given or separately derived heuristic/suboptimal
policy. The use of simulation often allows for implementations that do not
require a mathematical model, a major idea that has allowed the use of DP
beyond its classical boundaries.
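To make the limited-horizon optimization concrete, here is a minimal Python sketch of simulation-based one-step lookahead (an illustrative sketch only, not code from the book; controls, sample_next, and J_tilde are hypothetical user-supplied functions giving the admissible controls, a simulator of one transition, and the cost approximation, respectively):

# Illustrative sketch of one-step lookahead with a simulated model (not code from the book).
# Hypothetical user-supplied pieces:
#   controls(x)       -> finite collection of admissible controls at state x
#   sample_next(x, u) -> (next_state, stage_cost), one simulated transition
#   J_tilde(x)        -> approximation of the optimal cost-to-go at state x
def one_step_lookahead(x, controls, sample_next, J_tilde, num_samples=20, alpha=1.0):
    """Return the control minimizing sampled stage cost plus approximate future cost."""
    best_u, best_q = None, float("inf")
    for u in controls(x):
        total = 0.0
        for _ in range(num_samples):            # Monte Carlo estimate of the Q-factor of (x, u)
            x_next, g = sample_next(x, u)
            total += g + alpha * J_tilde(x_next)
        q = total / num_samples
        if q < best_q:
            best_u, best_q = u, q
    return best_u

With an exact model the sampling loop would be replaced by an exact expected value, and with multistep lookahead the single stage would be replaced by an optimization over several stages.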
We discuss selectively four types of methods for obtaining J̃:
(a) Problem approximation: Here J̃ is the optimal cost function of a related
simpler problem, which is solved by exact DP. Certainty equivalent
control and enforced decomposition schemes are discussed in
some detail.
(b) Rollout and model predictive control: Here J̃ is the cost function of
some known heuristic policy. The needed cost values to implement a
rollout policy are often calculated by simulation. While this method
applies to stochastic problems, the reliance on simulation favors deterministic
problems, including challenging combinatorial problems
for which heuristics may be readily implemented. Rollout may also
be combined with adaptive simulation and Monte Carlo tree search,
which have proved very effective in the context of games such as
backgammon, chess, Go, and others (an illustrative rollout sketch appears after this excerpt).
Model predictive control was originally developed for continuous-space
optimal control problems that involve some goal state, e.g.,
the origin in a classical control context. It can be viewed as a specialized
rollout method that is based on a suboptimal optimization for
reaching a goal state.
(c) Parametric cost approximation: Here J̃ is chosen from within a parametric
class of functions, including neural networks, with the parameters
optimized or trained by using state-cost sample pairs and
some type of incremental least squares/regression algorithm. Approximate
policy iteration and its variants are covered in some detail,
including several actor and critic schemes. These involve policy evaluation
with simulation-based training methods, and policy improve
......
Dimitri P. Bertsekas
June 2019
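The following is a minimal Python sketch of the rollout idea of item (b) above, specialized to a deterministic discrete optimization problem (an illustrative sketch only, not code from the book; successors, is_terminal, and base_heuristic are hypothetical user-supplied helpers):

# Illustrative rollout sketch for a deterministic discrete problem (not code from the book).
# Hypothetical user-supplied helpers:
#   successors(x)     -> list of (control, next_state, stage_cost) triples available at state x
#   is_terminal(x)    -> True when x is a complete solution
#   base_heuristic(x) -> cost of completing a solution from x with some known heuristic policy
def rollout(x0, successors, is_terminal, base_heuristic):
    """Greedy one-step lookahead using the base heuristic's completion cost as the approximation."""
    x, total_cost, path = x0, 0.0, []
    while not is_terminal(x):
        u_best, x_best, g_best, q_best = None, None, None, float("inf")
        for u, x_next, g in successors(x):
            q = g + base_heuristic(x_next)      # stage cost plus heuristic cost-to-go
            if q < q_best:
                u_best, x_best, g_best, q_best = u, x_next, g, q
        path.append(u_best)
        total_cost += g_best
        x = x_best
    return path, total_cost

Under mild conditions the resulting rollout policy performs at least as well as its base heuristic; this cost improvement property is the policy improvement principle referenced in Section 2.4 of the contents above.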