1
Special Topics in Learning
Prof. Reinaldo Bianchi
Centro Universitário da FEI
2012
2
Objective of this Lecture
Reinforcement Learning:
– Eligibility Traces.
– Generalization and Function Approximation.
Today's lecture: chapters 7 and 8 of Sutton & Barto.
3
Generalization and Function Approximation
Chapter 8 of Sutton & Barto.
4
Objectives
Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part.
Overview of function approximation (FA) methods and how they can be adapted to RL.
Not as in-depth as in the book (Bianchi's comment).
5
Value Prediction with Function Approximation
As usual: Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function $V^\pi$.
In earlier chapters, value functions were stored in lookup tables.
6
Adapt Supervised Learning Algorithms
Inputs → Supervised Learning System → Outputs
Training info = desired (target) outputs
Error = (target output − actual output)
Training example = {input, target output}
7
Backups as Training Examples
As a training example:
input: a description of $s_t$; target output: $v_t$, the backed-up value for $s_t$.
8
Any FA Method?
In principle, yes:
– artificial neural networks
– decision trees
– multivariate regression methods
– etc.
But RL has some special requirements:
– usually want to learn while interacting
– ability to handle nonstationarity
– other?
9
Gradient Descent Methods
Assume $V_t$ is a (sufficiently smooth) differentiable function of a parameter vector $\vec\theta_t = \bigl(\theta_t(1), \theta_t(2), \ldots, \theta_t(n)\bigr)^{\top}$ (the transpose makes it a column vector).
10
Performance Measures
A common and simple one is the mean-squared error (MSE) over a distribution P:
$$\mathrm{MSE}(\vec\theta_t) = \sum_{s \in S} P(s)\,\bigl[V^\pi(s) - V_t(s)\bigr]^2$$
Let us assume that P is always the distribution of states at which backups are done.
The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
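As a small illustration (the function name and array layout are assumptions, not from the slides), the MSE above can be computed directly once P, $V^\pi$, and the approximate values are known:

```python
import numpy as np

def mse(p, v_pi, v_approx):
    """MSE(theta) = sum_s P(s) * (V_pi(s) - V_theta(s))^2.

    p:        array of state probabilities (the backup distribution P)
    v_pi:     array of true values V_pi(s), one entry per state
    v_approx: array of approximate values V_theta(s), same layout
    """
    return float(p @ (v_pi - v_approx) ** 2)
```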
11
Gradient Descent
Iteratively move down the gradient:
$$\vec\theta_{t+1} = \vec\theta_t - \alpha\,\nabla_{\vec\theta_t} \mathrm{MSE}(\vec\theta_t)$$
12
Gradient Descent Cont.
For the MSE given above and using the chain rule:
$$\vec\theta_{t+1} = \vec\theta_t + \alpha \sum_{s \in S} P(s)\,\bigl[V^\pi(s) - V_t(s)\bigr]\,\nabla_{\vec\theta_t} V_t(s)$$
13
Gradient Descent Cont.
Use just the sample gradient instead:
$$\vec\theta_{t+1} = \vec\theta_t + \alpha\,\bigl[V^\pi(s_t) - V_t(s_t)\bigr]\,\nabla_{\vec\theta_t} V_t(s_t)$$
Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if α decreases appropriately with t.
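A minimal Python sketch of this sample-gradient update (names and interfaces are assumptions; the Monte Carlo return stands in for the target $V^\pi(s_t)$):

```python
import numpy as np

def gradient_mc_prediction(samples, features, n_params, alpha=0.01):
    """theta += alpha * (target - V(s)) * grad V(s), one sample at a time.

    samples:  iterable of (state, return) pairs; the return G_t plays the
              role of the target V_pi(s_t).
    features: maps a state to its feature vector phi(s); for a linear
              approximator V(s) = theta . phi(s), grad_theta V(s) = phi(s).
    """
    theta = np.zeros(n_params)
    for state, target in samples:
        phi = features(state)
        theta += alpha * (target - theta @ phi) * phi
    return theta
```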
14
But We Don't have these Targets
Use an estimate $v_t$ in place of the unknown $V^\pi(s_t)$; if $v_t$ is unbiased (e.g., the Monte Carlo return $R_t$), convergence to a local minimum is still guaranteed.
15
What about TD(λ) Targets?
For $\lambda < 1$, the λ-return $R_t^\lambda$ is a biased (bootstrapping) estimate of $V^\pi(s_t)$, so the guarantee above does not directly apply.
16
On-Line Gradient-Descent TD(λ)
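A sketch of on-line gradient-descent TD(λ) for the linear case, under an assumed env.reset()/env.step() interface (not from the slides):

```python
import numpy as np

def online_td_lambda(env, policy, features, n_features,
                     alpha=0.05, gamma=1.0, lam=0.8, episodes=100):
    """On-line gradient-descent TD(lambda) for linear value prediction.

    features(s) returns the dense feature vector phi(s); for a linear
    approximator V(s) = theta . phi(s), grad_theta V(s) = phi(s), so the
    eligibility-trace update is e <- gamma * lam * e + phi(s).
    """
    theta = np.zeros(n_features)
    for _ in range(episodes):
        e = np.zeros(n_features)
        s = env.reset()
        done = False
        while not done:
            s2, r, done = env.step(policy(s))
            phi = features(s)
            v = theta @ phi
            v2 = 0.0 if done else theta @ features(s2)
            delta = r + gamma * v2 - v          # TD error
            e = gamma * lam * e + phi           # accumulating trace
            theta += alpha * delta * e
            s = s2
    return theta
```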
17
Linear Methods
Represent the value function as a linear function of the parameter vector, with a feature vector $\vec\phi_s$ per state:
$$V_t(s) = \vec\theta_t^{\,\top}\vec\phi_s = \sum_{i=1}^{n} \theta_t(i)\,\phi_s(i)$$
18
Nice Properties of Linear FA Methods
The gradient is very simple: $\nabla_{\vec\theta_t} V_t(s) = \vec\phi_s$
For MSE, the error surface is simple: a quadratic surface with a single minimum.
Linear gradient-descent TD(λ) converges:
– Step size decreases appropriately
– On-line sampling (states sampled from the on-policy distribution)
– Converges to a parameter vector $\vec\theta_\infty$ with property:
$$\mathrm{MSE}(\vec\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\,\min_{\vec\theta}\mathrm{MSE}(\vec\theta)$$
19
Most widely used linear methods
– Coarse Coding
– Tile Coding (CMAC)
– Radial Basis Functions
– Kanerva Coding
20
Coarse Coding
Generalization from state X to state Y depends on the number of their features whose receptive fields overlap.
21
Coarse Coding
Generalization in linear function-approximation methods is determined by the sizes and shapes of the features' receptive fields. All three of these cases have roughly the same number and density of features.
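A minimal sketch of coarse coding for a 1-D state space with interval receptive fields (all names and sizes here are illustrative assumptions):

```python
import numpy as np

def coarse_features(s, centers, width):
    """Binary coarse-coding features for a 1-D state: feature i is active
    when s falls inside the receptive field (interval) around centers[i].
    Wider fields mean more overlap and broader generalization."""
    return (np.abs(s - centers) <= width / 2).astype(float)

centers = np.linspace(0.0, 1.0, 20)   # 20 overlapping receptive fields
phi_x = coarse_features(0.42, centers, width=0.3)
phi_y = coarse_features(0.48, centers, width=0.3)
# Generalization from x to y grows with the number of shared active features:
print(int(phi_x @ phi_y), "features are active for both states")
```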
22
Coarse Coding
23
Learning and Coarse Coding
Example of feature width's strong effect on initial generalization (first row) and weak effect on asymptotic accuracy (last row).
24
Tile Coding
– Binary feature for each tile
– Number of features present at any one time is constant
– Binary features mean the weighted sum is easy to compute
– Easy to compute indices of the features present
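A minimal sketch of tile coding for a 2-D state in [0, 1)², with uniformly offset tilings (the offset scheme and sizes are assumptions):

```python
import numpy as np

def tile_indices(x, y, n_tilings=5, tiles_per_dim=9):
    """Indices of the active tiles for a 2-D state in [0, 1)^2.

    Each tiling is a uniform grid shifted by a different fraction of a
    tile width, so exactly n_tilings binary features are active at once.
    """
    indices = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)   # per-tiling shift
        col = int((x + offset) * tiles_per_dim) % tiles_per_dim
        row = int((y + offset) * tiles_per_dim) % tiles_per_dim
        indices.append(t * tiles_per_dim ** 2 + row * tiles_per_dim + col)
    return indices

# The weighted sum is easy: V(s) is just the sum of the active tiles' weights.
theta = np.zeros(5 * 9 * 9)
active = tile_indices(0.3, 0.7)
value = theta[active].sum()
```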
25
Tile Coding
26
Exemplo: Simulated Soccer
How does the agent decide what to do with the ball?
Complexities:
– Continuous inputs
– High dimensionality
From the paper: Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia).
27
Problems
State space explodes exponentially with dimensionality
Current methods of managing state-space explosion lack automation
RL does not scale well to problems with the complexities of simulated soccer…
28
Quantization
Divide the state space into regions of interest
– Tile Coding (Sutton & Barto, 1998)
No automated method for regions:
– granularity
– heterogeneity
– location
Prefer a learned abstraction of the state space
29
Kohonen Networks
Clustering algorithm
Data driven
[Figure: example prototype states, labeled "Agent near opponent goal", "Teammate near opponent goal", "No nearby opponents".]
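A minimal sketch of a 1-D Kohonen (self-organizing) map, as a generic illustration of the clustering step; it is not claimed to match White and Brogan's exact configuration:

```python
import numpy as np

def train_kohonen(data, n_units=50, epochs=10, lr=0.5, radius=5.0):
    """Minimal 1-D Kohonen (self-organizing) map.

    Each unit's prototype vector is pulled toward the samples for which
    it, or a close neighbor on the unit grid, is the nearest prototype.
    """
    rng = np.random.default_rng(0)
    units = rng.uniform(data.min(), data.max(), size=(n_units, data.shape[1]))
    for _ in range(epochs):
        for x in rng.permutation(data):
            winner = np.argmin(np.linalg.norm(units - x, axis=1))
            grid_dist = np.abs(np.arange(n_units) - winner)
            h = np.exp(-(grid_dist / radius) ** 2)      # neighborhood kernel
            units += lr * h[:, None] * (x - units)
        lr *= 0.9                                       # anneal learning rate
        radius *= 0.9                                   # shrink neighborhood
    return units

def quantize(x, units):
    """Map a continuous state to the index of its nearest prototype."""
    return int(np.argmin(np.linalg.norm(units - x, axis=1)))
```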
30
State Space Reduction
90 continuous-valued inputs describe the state of a soccer game
– Naïve discretization: 2^90 states
– Filter out unnecessary inputs: still 2^18 states
– Clustering algorithm: only 5000 states
• Big Win!!!
31
Two Pass Algorithm
Pass 1:
– Use Kohonen Network and a large training set to learn the state space
Pass 2:
– Use Reinforcement Learning to learn utilities for states (SARSA)
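A sketch of how the two passes could fit together, reusing the hypothetical train_kohonen and quantize helpers from the sketch above (data, sizes, and action set are placeholders, not from the paper):

```python
import numpy as np

# Pass 1: learn a quantization of the state space from logged game states.
logged_states = np.random.rand(10000, 90)          # placeholder for real logs
units = train_kohonen(logged_states, n_units=100)  # paper used ~5000 states

# Pass 2: ordinary tabular SARSA over the quantized state indices.
n_actions = 8                                      # placeholder action set
Q = np.zeros((len(units), n_actions))

def sarsa_update(s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """One SARSA backup on the quantized state space."""
    i, j = quantize(s, units), quantize(s2, units)
    Q[i, a] += alpha * (r + gamma * Q[j, a2] - Q[i, a])
```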
32
Fragility of Learned Actions
What happens to attacker’s utility if goalie crosses dotted line?
33
Results
Evaluate three systems:
– Control: random action selection
– SARSA
– Forcing Function
Evaluation criteria:
– Goals scored
– Time of possession
34
Cumulative Score
[Chart: SARSA vs. Random Policy. Cumulative goals scored vs. games played (1 to ~919), comparing the Learning Team against the Random Team.]
35
Team with Forcing Functions
[Chart: SARSA with Forcing Function vs. Random Policy. Cumulative score vs. games played, comparing the Learning Team with Forcing Functions against the Random Team.]
36
Can you beat the "curse of dimensionality"?
Can you keep the number of features from going up exponentially with the dimension?
"Lazy learning" schemes:
– Remember all the data
– To get a new value, find the nearest neighbors and interpolate
– e.g., locally weighted regression
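A minimal sketch of such a lazy, nearest-neighbor value estimate with inverse-distance weighting (a crude stand-in for locally weighted regression; all names are assumptions):

```python
import numpy as np

def knn_value(query, states, values, k=5):
    """Lazy value estimate: keep all observed (state, value) pairs and, at
    query time, interpolate among the k nearest neighbors with
    inverse-distance weights."""
    d = np.linalg.norm(states - query, axis=1)
    nearest = np.argsort(d)[:k]
    w = 1.0 / (d[nearest] + 1e-8)      # closer neighbors weigh more
    return float(w @ values[nearest] / w.sum())
```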
37
Can you beat the "curse of dimensionality"?
Function complexity, not dimensionality, is the problem.
Kanerva coding:
– Select a bunch of binary prototypes
– Use Hamming distance as the distance measure
– Dimensionality is no longer a problem, only complexity
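A minimal sketch of Kanerva coding (prototype count, threshold, and dimensions are illustrative assumptions):

```python
import numpy as np

def kanerva_features(s, prototypes, threshold=2):
    """Kanerva coding: binary feature i is active when the binary state s
    lies within a Hamming-distance threshold of prototype i. The number
    of prototypes controls complexity, independently of dimensionality."""
    hamming = (prototypes != s).sum(axis=1)
    return (hamming <= threshold).astype(float)

rng = np.random.default_rng(0)
prototypes = rng.integers(0, 2, size=(500, 90))    # 500 prototypes, 90 bits
phi = kanerva_features(rng.integers(0, 2, size=90), prototypes)
```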
38
Algorithms using Function Approximators
We now extend value-prediction methods using function approximation to control methods, following the pattern of GPI.
First we extend the state-value prediction methods to action-value prediction methods, then we combine them with policy-improvement and action-selection techniques.
As usual, the problem of ensuring exploration is solved by pursuing either an on-policy or an off-policy approach.
39
Control with FA
Learning state-action values: training examples of the form $\{(s_t, a_t),\, v_t\}$, where $v_t$ is the target (e.g., a backed-up return).
The general gradient-descent rule:
$$\vec\theta_{t+1} = \vec\theta_t + \alpha\,\bigl[v_t - Q_t(s_t, a_t)\bigr]\,\nabla_{\vec\theta_t} Q_t(s_t, a_t)$$
40
Control with FA
Gradient-descent Sarsa(λ) (backward view):
$$\vec\theta_{t+1} = \vec\theta_t + \alpha\,\delta_t\,\vec e_t$$
where
$$\delta_t = r_{t+1} + \gamma\,Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t), \qquad \vec e_t = \gamma\lambda\,\vec e_{t-1} + \nabla_{\vec\theta_t} Q_t(s_t, a_t)$$
41
Linear Gradient Descent Sarsa(λ)
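A sketch of linear gradient-descent Sarsa(λ) with binary features and accumulating traces, under an assumed env.reset()/env.step() interface (not the book's exact pseudocode):

```python
import numpy as np

def linear_sarsa_lambda(env, features, n_features, n_actions,
                        alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1,
                        episodes=100):
    """features(s) returns the indices of the active binary features, so
    Q(s, a) is just a sum of weights and grad_theta Q(s, a) is binary."""
    theta = np.zeros((n_actions, n_features))

    def q(s, a):
        return theta[a, features(s)].sum()

    def policy(s):  # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        e = np.zeros_like(theta)          # eligibility traces
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            delta = r - q(s, a)
            e[a, features(s)] += 1.0      # accumulating traces
            if done:
                theta += alpha * delta * e
            else:
                a2 = policy(s2)
                delta += gamma * q(s2, a2)
                theta += alpha * delta * e
                e *= gamma * lam
                s, a = s2, a2
    return theta
```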
42
Linear Gradient Descent Q(λ)
43
Mountain-Car Task
44
Mountain-Car Results
The effect of α, λ, and the kind of traces on early performance on the mountain-car task. This study used five 9×9 tilings.
45
Summary
Generalization
Adapting supervised-learning function approximation methods
Gradient-descent methods
Linear gradient-descent methods:
– Radial basis functions
– Tile coding
– Kanerva coding
46
Summary
Nonlinear gradient-descent methods? Backpropagation?
Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction
47
Conclusion
48
Conclusion
We saw two important methods in today's lecture:
– Eligibility traces, which perform a temporal generalization of what is learned.
– Function approximators, which generalize the learned value function.
Both generalize the learning.
49
The End.