1
Special Topics in Learning
Prof. Reinaldo Bianchi
Centro Universitário da FEI
2012
2
Objective of this Lecture
Reinforcement Learning:
– Eligibility Traces.
– Generalization and Function Approximation.
Today's lecture: chapters 7 and 8 of Sutton & Barto.
3
Generalization and Function Approximation
Chapter 8 of Sutton & Barto.
4
Objectives
Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part.
Overview of function approximation (FA) methods and how they can be adapted to RL.
Not as in-depth as in the book (Bianchi's comment).
5
Value Prediction with Function Approximation
As usual: Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function $V^\pi$.
In earlier chapters, value functions were stored in lookup tables.
6
Adapt Supervised Learning Algorithms
Inputs → Supervised Learning System → Outputs
Training info = desired (target) outputs
Error = (target output − actual output)
Training example = {input, target output}
7
Backups as Training Examples
As a training example:
input: a description of $s_t$; target output: $v_t$, the backed-up value for $s_t$.
8
Any FA Method?
In principle, yes:
– artificial neural networks
– decision trees
– multivariate regression methods
– etc.
But RL has some special requirements:
– usually want to learn while interacting
– ability to handle nonstationarity
– other?
9
Gradient Descent Methods
Assume $V_t$ is a (sufficiently smooth) differentiable function of a parameter vector $\vec\theta_t = \bigl(\theta_t(1), \theta_t(2), \ldots, \theta_t(n)\bigr)^{\top}$ (the transpose makes it a column vector).
10
Performance Measures
A common and simple one is the mean-squared error (MSE) over a distribution P:
$$\mathrm{MSE}(\vec\theta_t) = \sum_{s \in S} P(s)\,\bigl[V^\pi(s) - V_t(s)\bigr]^2$$
Let us assume that P is always the distribution of states at which backups are done.
The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
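As a small illustration (the function name and array layout are assumptions, not from the slides), the MSE above can be computed directly once P, $V^\pi$, and the approximate values are known:

```python
import numpy as np

def mse(p, v_pi, v_approx):
    """MSE(theta) = sum_s P(s) * (V_pi(s) - V_theta(s))^2.

    p:        array of state probabilities (the backup distribution P)
    v_pi:     array of true values V_pi(s), one entry per state
    v_approx: array of approximate values V_theta(s), same layout
    """
    return float(p @ (v_pi - v_approx) ** 2)
```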
11
Gradient Descent
Iteratively move down the gradient:
$$\vec\theta_{t+1} = \vec\theta_t - \alpha\,\nabla_{\vec\theta_t} \mathrm{MSE}(\vec\theta_t)$$
12
Gradient Descent Cont.
For the MSE given above and using the chain rule:
$$\vec\theta_{t+1} = \vec\theta_t + \alpha \sum_{s \in S} P(s)\,\bigl[V^\pi(s) - V_t(s)\bigr]\,\nabla_{\vec\theta_t} V_t(s)$$
13
Gradient Descent Cont.
Use just the sample gradient instead:
$$\vec\theta_{t+1} = \vec\theta_t + \alpha\,\bigl[V^\pi(s_t) - V_t(s_t)\bigr]\,\nabla_{\vec\theta_t} V_t(s_t)$$
Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if α decreases appropriately with t.
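A minimal Python sketch of this sample-gradient update (names and interfaces are assumptions; the Monte Carlo return stands in for the target $V^\pi(s_t)$):

```python
import numpy as np

def gradient_mc_prediction(samples, features, n_params, alpha=0.01):
    """theta += alpha * (target - V(s)) * grad V(s), one sample at a time.

    samples:  iterable of (state, return) pairs; the return G_t plays the
              role of the target V_pi(s_t).
    features: maps a state to its feature vector phi(s); for a linear
              approximator V(s) = theta . phi(s), grad_theta V(s) = phi(s).
    """
    theta = np.zeros(n_params)
    for state, target in samples:
        phi = features(state)
        theta += alpha * (target - theta @ phi) * phi
    return theta
```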
14
But We Don't have these Targets
Use an estimate $v_t$ in place of the unknown $V^\pi(s_t)$; if $v_t$ is unbiased (e.g., the Monte Carlo return $R_t$), convergence to a local minimum is still guaranteed.
15
What about TD(λ) Targets?
For $\lambda < 1$, the λ-return $R_t^\lambda$ is a biased (bootstrapping) estimate of $V^\pi(s_t)$, so the guarantee above does not directly apply.
16
On-Line Gradient-Descent TD(λ)
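A sketch of on-line gradient-descent TD(λ) for the linear case, under an assumed env.reset()/env.step() interface (not from the slides):

```python
import numpy as np

def online_td_lambda(env, policy, features, n_features,
                     alpha=0.05, gamma=1.0, lam=0.8, episodes=100):
    """On-line gradient-descent TD(lambda) for linear value prediction.

    features(s) returns the dense feature vector phi(s); for a linear
    approximator V(s) = theta . phi(s), grad_theta V(s) = phi(s), so the
    eligibility-trace update is e <- gamma * lam * e + phi(s).
    """
    theta = np.zeros(n_features)
    for _ in range(episodes):
        e = np.zeros(n_features)
        s = env.reset()
        done = False
        while not done:
            s2, r, done = env.step(policy(s))
            phi = features(s)
            v = theta @ phi
            v2 = 0.0 if done else theta @ features(s2)
            delta = r + gamma * v2 - v          # TD error
            e = gamma * lam * e + phi           # accumulating trace
            theta += alpha * delta * e
            s = s2
    return theta
```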
17
Linear Methods
Represent the value function as a linear function of the parameter vector, with a feature vector $\vec\phi_s$ per state:
$$V_t(s) = \vec\theta_t^{\,\top}\vec\phi_s = \sum_{i=1}^{n} \theta_t(i)\,\phi_s(i)$$
18
Nice Properties of Linear FA Methods
The gradient is very simple: $\nabla_{\vec\theta_t} V_t(s) = \vec\phi_s$
For MSE, the error surface is simple: a quadratic surface with a single minimum.
Linear gradient-descent TD(λ) converges:
– Step size decreases appropriately
– On-line sampling (states sampled from the on-policy distribution)
– Converges to a parameter vector $\vec\theta_\infty$ with property:
$$\mathrm{MSE}(\vec\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\,\min_{\vec\theta}\mathrm{MSE}(\vec\theta)$$
19
Most widely used linear methods
– Coarse Coding
– Tile Coding (CMAC)
– Radial Basis Functions
– Kanerva Coding
20
Coarse Coding
Generalization from state X to state Y depends on the number of their features whose receptive fields overlap.
21
Coarse Coding
Generalization in linear function-approximation methods is determined by the sizes and shapes of the features' receptive fields. All three of these cases have roughly the same number and density of features.
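A minimal sketch of coarse coding for a 1-D state space with interval receptive fields (all names and sizes here are illustrative assumptions):

```python
import numpy as np

def coarse_features(s, centers, width):
    """Binary coarse-coding features for a 1-D state: feature i is active
    when s falls inside the receptive field (interval) around centers[i].
    Wider fields mean more overlap and broader generalization."""
    return (np.abs(s - centers) <= width / 2).astype(float)

centers = np.linspace(0.0, 1.0, 20)   # 20 overlapping receptive fields
phi_x = coarse_features(0.42, centers, width=0.3)
phi_y = coarse_features(0.48, centers, width=0.3)
# Generalization from x to y grows with the number of shared active features:
print(int(phi_x @ phi_y), "features are active for both states")
```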
22
Coarse Coding
23
Learning and Coarse Coding
Example of feature width's strong effect on initial generalization (first row) and weak effect on asymptotic accuracy (last row).
24
Tile Coding
– Binary feature for each tile
– Number of features present at any one time is constant
– Binary features mean the weighted sum is easy to compute
– Easy to compute indices of the features present
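A minimal sketch of tile coding for a 2-D state in [0, 1)², with uniformly offset tilings (the offset scheme and sizes are assumptions):

```python
import numpy as np

def tile_indices(x, y, n_tilings=5, tiles_per_dim=9):
    """Indices of the active tiles for a 2-D state in [0, 1)^2.

    Each tiling is a uniform grid shifted by a different fraction of a
    tile width, so exactly n_tilings binary features are active at once.
    """
    indices = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)   # per-tiling shift
        col = int((x + offset) * tiles_per_dim) % tiles_per_dim
        row = int((y + offset) * tiles_per_dim) % tiles_per_dim
        indices.append(t * tiles_per_dim ** 2 + row * tiles_per_dim + col)
    return indices

# The weighted sum is easy: V(s) is just the sum of the active tiles' weights.
theta = np.zeros(5 * 9 * 9)
active = tile_indices(0.3, 0.7)
value = theta[active].sum()
```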
25
Tile Coding
26
Exemplo: Simulated Soccer
How does the agent decide what to do with the ball?
Complexities:
– Continuous inputs
– High dimensionality
From the paper: Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia).
27
Problems
State space explodes exponentially with dimensionality
Current methods of managing state-space explosion lack automation
RL does not scale well to problems with the complexities of simulated soccer…
28
Quantization
Divide the state space into regions of interest
– Tile Coding (Sutton & Barto, 1998)
No automated method for regions:
– granularity
– heterogeneity
– location
Prefer a learned abstraction of the state space
29
Kohonen Networks
Clustering algorithm
Data driven
[Figure: example prototype states, labeled "Agent near opponent goal", "Teammate near opponent goal", "No nearby opponents".]
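A minimal sketch of a 1-D Kohonen (self-organizing) map, as a generic illustration of the clustering step; it is not claimed to match White and Brogan's exact configuration:

```python
import numpy as np

def train_kohonen(data, n_units=50, epochs=10, lr=0.5, radius=5.0):
    """Minimal 1-D Kohonen (self-organizing) map.

    Each unit's prototype vector is pulled toward the samples for which
    it, or a close neighbor on the unit grid, is the nearest prototype.
    """
    rng = np.random.default_rng(0)
    units = rng.uniform(data.min(), data.max(), size=(n_units, data.shape[1]))
    for _ in range(epochs):
        for x in rng.permutation(data):
            winner = np.argmin(np.linalg.norm(units - x, axis=1))
            grid_dist = np.abs(np.arange(n_units) - winner)
            h = np.exp(-(grid_dist / radius) ** 2)      # neighborhood kernel
            units += lr * h[:, None] * (x - units)
        lr *= 0.9                                       # anneal learning rate
        radius *= 0.9                                   # shrink neighborhood
    return units

def quantize(x, units):
    """Map a continuous state to the index of its nearest prototype."""
    return int(np.argmin(np.linalg.norm(units - x, axis=1)))
```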
30
State Space Reduction
90 continuous-valued inputs describe the state of a soccer game
– Naïve discretization: 2^90 states
– Filter out unnecessary inputs: still 2^18 states
– Clustering algorithm: only 5000 states
• Big Win!!!
31
Two Pass Algorithm
Pass 1:
– Use Kohonen Network and a large training set to learn the state space
Pass 2:
– Use Reinforcement Learning to learn utilities for states (SARSA)
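A sketch of how the two passes could fit together, reusing the hypothetical train_kohonen and quantize helpers from the sketch above (data, sizes, and action set are placeholders, not from the paper):

```python
import numpy as np

# Pass 1: learn a quantization of the state space from logged game states.
logged_states = np.random.rand(10000, 90)          # placeholder for real logs
units = train_kohonen(logged_states, n_units=100)  # paper used ~5000 states

# Pass 2: ordinary tabular SARSA over the quantized state indices.
n_actions = 8                                      # placeholder action set
Q = np.zeros((len(units), n_actions))

def sarsa_update(s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """One SARSA backup on the quantized state space."""
    i, j = quantize(s, units), quantize(s2, units)
    Q[i, a] += alpha * (r + gamma * Q[j, a2] - Q[i, a])
```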
32
Fragility of Learned Actions
What happens to attacker’s utility if goalie crosses dotted line?
33
Results
Evaluate three systems:
– Control: random action selection
– SARSA
– Forcing Function
Evaluation criteria:
– Goals scored
– Time of possession
34
Cumulative Score
[Chart: SARSA vs. Random Policy. Cumulative goals scored vs. games played (1 to ~919), comparing the Learning Team against the Random Team.]
35
Team with Forcing Functions
[Chart: SARSA with Forcing Function vs. Random Policy. Cumulative score vs. games played, comparing the Learning Team with Forcing Functions against the Random Team.]
36
Can you beat the "curse of dimensionality"?
Can you keep the number of features from going up exponentially with the dimension?
"Lazy learning" schemes:
– Remember all the data
– To get a new value, find the nearest neighbors and interpolate
– e.g., locally weighted regression
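A minimal sketch of such a lazy, nearest-neighbor value estimate with inverse-distance weighting (a crude stand-in for locally weighted regression; all names are assumptions):

```python
import numpy as np

def knn_value(query, states, values, k=5):
    """Lazy value estimate: keep all observed (state, value) pairs and, at
    query time, interpolate among the k nearest neighbors with
    inverse-distance weights."""
    d = np.linalg.norm(states - query, axis=1)
    nearest = np.argsort(d)[:k]
    w = 1.0 / (d[nearest] + 1e-8)      # closer neighbors weigh more
    return float(w @ values[nearest] / w.sum())
```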
37
Can you beat the "curse of dimensionality"?
Function complexity, not dimensionality, is the problem.
Kanerva coding:
– Select a bunch of binary prototypes
– Use Hamming distance as the distance measure
– Dimensionality is no longer a problem, only complexity
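A minimal sketch of Kanerva coding (prototype count, threshold, and dimensions are illustrative assumptions):

```python
import numpy as np

def kanerva_features(s, prototypes, threshold=2):
    """Kanerva coding: binary feature i is active when the binary state s
    lies within a Hamming-distance threshold of prototype i. The number
    of prototypes controls complexity, independently of dimensionality."""
    hamming = (prototypes != s).sum(axis=1)
    return (hamming <= threshold).astype(float)

rng = np.random.default_rng(0)
prototypes = rng.integers(0, 2, size=(500, 90))    # 500 prototypes, 90 bits
phi = kanerva_features(rng.integers(0, 2, size=90), prototypes)
```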
38
Algorithms using Function Approximators
We now extend value-prediction methods using function approximation to control methods, following the pattern of GPI.
First we extend the state-value prediction methods to action-value prediction methods, then we combine them with policy-improvement and action-selection techniques.
As usual, the problem of ensuring exploration is solved by pursuing either an on-policy or an off-policy approach.
39
Control with FA
Learning state-action values: training examples of the form $\{(s_t, a_t),\, v_t\}$, where $v_t$ is the target (e.g., a backed-up return).
The general gradient-descent rule:
$$\vec\theta_{t+1} = \vec\theta_t + \alpha\,\bigl[v_t - Q_t(s_t, a_t)\bigr]\,\nabla_{\vec\theta_t} Q_t(s_t, a_t)$$
40
Control with FA
Gradient-descent Sarsa(λ) (backward view):
$$\vec\theta_{t+1} = \vec\theta_t + \alpha\,\delta_t\,\vec e_t$$
where
$$\delta_t = r_{t+1} + \gamma\,Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t), \qquad \vec e_t = \gamma\lambda\,\vec e_{t-1} + \nabla_{\vec\theta_t} Q_t(s_t, a_t)$$
41
Linear Gradient Descent Sarsa(λ)
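A sketch of linear gradient-descent Sarsa(λ) with binary features and accumulating traces, under an assumed env.reset()/env.step() interface (not the book's exact pseudocode):

```python
import numpy as np

def linear_sarsa_lambda(env, features, n_features, n_actions,
                        alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1,
                        episodes=100):
    """features(s) returns the indices of the active binary features, so
    Q(s, a) is just a sum of weights and grad_theta Q(s, a) is binary."""
    theta = np.zeros((n_actions, n_features))

    def q(s, a):
        return theta[a, features(s)].sum()

    def policy(s):  # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        e = np.zeros_like(theta)          # eligibility traces
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            delta = r - q(s, a)
            e[a, features(s)] += 1.0      # accumulating traces
            if done:
                theta += alpha * delta * e
            else:
                a2 = policy(s2)
                delta += gamma * q(s2, a2)
                theta += alpha * delta * e
                e *= gamma * lam
                s, a = s2, a2
    return theta
```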
42
Linear Gradient Descent Q(λ)
43
Mountain-Car Task
44
Mountain-Car Results
The effect of α, λ, and the kind of traces on early performance on the mountain-car task. This study used five 9×9 tilings.
45
Summary
Generalization
Adapting supervised-learning function approximation methods
Gradient-descent methods
Linear gradient-descent methods:
– Radial basis functions
– Tile coding
– Kanerva coding
46
Summary
Nonlinear gradient-descent methods? Backpropagation?
Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction
47
Conclusion
48
Conclusion
We saw two important methods in today's lecture:
– Eligibility traces, which perform a temporal generalization of what is learned.
– Function approximators, which generalize the learned value function.
Both generalize the learning.
49
The End.