Lesson 4: Decision Trees

Aprendizagem Computacional Gladys Castillo, UA Bayesian Networks Classifiers Part I – Naive Bayes


DESCRIPTION

The objective of this lesson is to introduce the most popular algorithms for decision tree induction (ID3, C4.5), based on a greedy search strategy. More information is given on the course site: https://sites.google.com/site/gladyscjaprendizagem/program/decision-trees

TRANSCRIPT

Page 1: Lesson 4: Decision Trees

Aprendizagem Computacional Gladys Castillo, UA

Bayesian Networks Classifiers

Part I – Naive Bayes

Page 2: Lesson 4: Decision Trees

The Supervised Classification Problem

A classifier is a function f: X → C that assigns a class label c ∈ C = {c1, …, cm} to objects described by a set of attributes X = {X1, X2, …, Xn}.

Given: a dataset D with N labeled examples <x(i), c(i)> of <X, C>.
Build: a classifier, a hypothesis hC: X → C, that can correctly predict the class labels of new objects.

Learning Phase: a supervised learning algorithm takes the dataset D = {<x(1), c(1)>, <x(2), c(2)>, …, <x(N), c(N)>} as input and outputs the hypothesis hC.

Classification Phase: the class attached to a new example x(N+1) is c(N+1) = hC(x(N+1)) ∈ C. Input: the attribute values of x(N+1); output: the class of x(N+1).

[Figure: the training dataset D as a table of attribute values X1, …, Xn and class C, and a scatter plot over attributes X1 and X2 separating "give credit" examples (x) from "don't give credit" examples (o).]

Page 3: Lesson 4: Decision Trees

Statistical Classifiers

Statistical classifiers treat the attributes X = {X1, X2, …, Xn} and the class C as random variables. A random variable is defined by its probability density function f(x).

[Figure: probability density function of a random variable and a few observations.]

They give the probability P(cj | x) that x belongs to a particular class, rather than a simple classification: instead of having the map X → C, we have X → P(C | X). The class c* attached to an example is the class with the largest P(cj | x).

Page 4: Lesson 4: Decision Trees

Bayesian Classifiers

"Bayesian" because the class c* attached to an example x is determined by Bayes' theorem. From P(X, C) = P(X | C) P(C) and P(X, C) = P(C | X) P(X) we obtain

P(C | X) = P(C) P(X | C) / P(X)

Bayes' theorem is the main tool in Bayesian inference: we can combine the prior distribution and the likelihood of the observed data in order to derive the posterior distribution.

Page 5: Lesson 4: Decision Trees

Bayes Theorem: Example

Given: a doctor knows that meningitis causes a stiff neck 50% of the time. The prior probability of any patient having meningitis is 1/50,000. The prior probability of any patient having a stiff neck is 1/20.

If a patient has a stiff neck, what is the probability he/she has meningitis?

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

Here P(S | M) is the likelihood, P(M) the prior, and P(M | S) the posterior: posterior ∝ prior × likelihood, exactly as in P(C | X) = P(C) P(X | C) / P(X).

Before observing the data, our prior beliefs can be expressed in a prior probability distribution that represents the knowledge we have about the unknown features. After observing the data, our revised beliefs are captured by a posterior distribution over the unknown features.
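A quick sanity check of this arithmetic, as a minimal Python sketch (the probabilities are the ones stated above):

```python
# Bayes' theorem for the meningitis example: P(M|S) = P(S|M) * P(M) / P(S)
p_s_given_m = 0.5        # likelihood: P(stiff neck | meningitis)
p_m = 1 / 50_000         # prior: P(meningitis)
p_s = 1 / 20             # evidence: P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002
```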

Page 6: Lesson 4: Decision Trees

Bayesian Classifier

How to determine P(cj | x) for each class cj? By Bayes' theorem:

P(cj | x) = P(cj) P(x | cj) / P(x)

P(x) can be ignored because it is the same for all the classes (a normalization constant).

Maximum a posteriori (MAP) classification: classify x into the class which has the largest posterior probability.

Bayesian classification rule:

hBayes(x) = c* = argmax_{j=1,…,m} P(cj) P(x | cj)

Page 7: Lesson 4: Decision Trees

Naïve Bayes (NB) Classifier
Duda and Hart (1973); Langley (1992)

"Bayes" because the class c* attached to an example x is determined by Bayes' theorem:

hBayes(x) = c* = argmax_{j=1,…,m} P(cj) P(x | cj)

When the attribute space is high dimensional, direct estimation of P(x | cj) is hard unless we introduce some assumptions.

"Naïve" because of its very naïve independence assumption: all the attributes are conditionally independent given the class. Under this assumption, P(x | cj) can be decomposed into a product of n terms, one term for each attribute, giving the NB classification rule:

hNB(x) = c* = argmax_{j=1,…,m} P(cj) ∏_{i=1,…,n} P(Xi = xi | cj)

Page 8: Lesson 4: Decision Trees

Naïve Bayes (NB): Learning Phase (Statistical Parameter Estimation)

Given a training dataset D of N labeled examples (assuming complete data):

1. Estimate P(cj) for each class cj:

   P̂(cj) = Nj / N

2. Estimate P(Xi = xk | cj) for each value xk of the attribute Xi and for each class cj.

   Xi discrete:

   P̂(Xi = xk | cj) = Nijk / Nj

   Xi continuous, two options:
   - The attribute is discretized and then treated as a discrete attribute.
   - A Normal distribution is usually assumed:

     P(Xi = xk | cj) = g(xk; μij, σij), with g(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))

     The mean μij and the standard deviation σij are estimated from D.

Here Nj is the number of examples of the class cj, and Nijk is the number of examples of the class cj having the value xk for the attribute Xi.
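As an illustration, a minimal Python sketch of this learning phase (counting for discrete attributes, sample mean/standard deviation for continuous ones); the list-of-pairs data layout is an assumption of the sketch, not something fixed by the slides:

```python
import math
from collections import Counter, defaultdict

def nb_learn(examples):
    """examples: list of (attribute_dict, class_label) pairs, complete data.
    Returns the priors P^(cj) = Nj/N and, for each (attribute, class) pair,
    either relative frequencies Nijk/Nj (discrete attribute) or the fitted
    Normal parameters (mean, standard deviation) (continuous attribute)."""
    n = len(examples)
    class_counts = Counter(c for _, c in examples)
    priors = {c: nc / n for c, nc in class_counts.items()}

    values = defaultdict(list)            # (attribute, class) -> observed values
    for x, c in examples:
        for attr, v in x.items():
            values[(attr, c)].append(v)

    params = {}
    for (attr, c), vs in values.items():
        if isinstance(vs[0], str):        # discrete: relative frequencies Nijk / Nj
            counts = Counter(vs)
            params[(attr, c)] = {v: k / len(vs) for v, k in counts.items()}
        else:                             # continuous: sample mean and standard
            mu = sum(vs) / len(vs)        # deviation (needs >= 2 values per class)
            var = sum((v - mu) ** 2 for v in vs) / (len(vs) - 1)
            params[(attr, c)] = (mu, math.sqrt(var))
    return priors, params
```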

Page 9: Lesson 4: Decision Trees

Continuous Attributes: Normal or Gaussian Distribution

2. Estimate P(Xi = xk | cj) for a value of the attribute Xi and for each class cj.

For real attributes a Normal distribution is usually assumed:

Xi | cj ~ N(μij, σ²ij), where the mean μij and the standard deviation σij are estimated from D, and

P(Xi = xk | cj) = g(xk; μij, σij), with g(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))

The probability density function f(x) is symmetrical around its mean.

[Figure: the probability density function of N(0, 2).]

For a variable X ~ N(74, 36), the density at the value 66 is given by: f(66) = g(66; 74, 6) = 0.0273
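The Gaussian density above is straightforward to evaluate; a small Python check of this value:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density g(x; mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# X ~ N(74, 36), i.e. mu = 74, sigma = 6
print(round(gaussian_pdf(66, 74, 6), 4))  # 0.0273
```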

Page 10: Lesson 4: Decision Trees

Naïve Bayes: Probability Estimates

Binary classification problem (example from John & Langley, 1995). Two classes: + (positive) and − (negative). Two attributes: X1, discrete, which takes the values a and b; X2, continuous.

Training dataset D:

Class  X1  X2
+      a   1.0
+      b   1.2
+      a   3.0
−      b   4.4
−      b   4.5

1. Estimate P(cj) for each class cj:

P̂(+) = 3/5   P̂(−) = 2/5

2. Estimate P(X1 = xk | cj) for each value of X1 and each class cj; for the continuous X2, fit a Normal density per class:

P̂(X1 = a | +) = 2/3   P̂(X1 = b | +) = 1/3
P̂(X1 = a | −) = 0/2   P̂(X1 = b | −) = 2/2

P(X2 = x | +) = g(x; 1.73, 1.10), with μ2+ = 1.73, σ2+ = 1.10
P(X2 = x | −) = g(x; 4.45, 0.07), with μ2− = 4.45, σ2− = 0.07
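Running the nb_learn sketch from the earlier slide on this five-example dataset reproduces these estimates (nb_learn and its data layout are assumptions of that sketch, not part of the original slides):

```python
D = [({"X1": "a", "X2": 1.0}, "+"),
     ({"X1": "b", "X2": 1.2}, "+"),
     ({"X1": "a", "X2": 3.0}, "+"),
     ({"X1": "b", "X2": 4.4}, "-"),
     ({"X1": "b", "X2": 4.5}, "-")]

priors, params = nb_learn(D)
print(priors)                # {'+': 0.6, '-': 0.4}
print(params[("X1", "+")])   # {'a': 0.666..., 'b': 0.333...}
print(params[("X2", "+")])   # (1.733..., 1.101...)  -> mu = 1.73, sigma = 1.10
print(params[("X2", "-")])   # (4.45, 0.0707...)     -> mu = 4.45, sigma = 0.07
# A value never seen with a class, like X1 = a with class '-', is simply
# absent from the counts, i.e. it gets probability 0 (see Laplace correction later).
```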

Page 11: Lesson 4: Decision Trees

Probability Estimates: Discrete Attributes

Training dataset (adapted from © Tan, Steinbach, Kumar, Introduction to Data Mining):

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Estimates: P̂(cj) = Nj / N for each class, and P̂(Xi = xk | cj) = Nijk / Nj for each attribute value and class. To compute Nijk we count the number of examples of the class cj having the value xk for the attribute Xi.

For each class: P(No) = 7/10, P(Yes) = 3/10

Examples for attribute values and classes:
P(Status = Married | No) = 4/7
P(Refund = Yes | Yes) = 0

Page 12: Lesson 4: Decision Trees

Probability Estimates: Continuous Attributes

For each pair attribute-class (Xi, cj), fit a Normal density on the same training dataset (adapted from © Tan, Steinbach, Kumar, Introduction to Data Mining):

P(Xi = xk | cj) = g(xk; μij, σij) = (1 / (√(2π) σij)) e^(−(xk−μij)² / (2σ²ij))

Example for (Income, Class = No): the sample mean is 110 and the sample standard deviation is 54.54, so

P(Income = 120 | No) = (1 / (√(2π) × 54.54)) e^(−(120−110)² / (2 × 2975)) = 0.0072
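A quick check of this number in Python, fitting the Normal to the seven Class=No income values from the table:

```python
import math

incomes_no = [125, 100, 70, 120, 60, 220, 75]   # Taxable Income (in K) of Class=No rows
mu = sum(incomes_no) / len(incomes_no)                                 # 110.0
var = sum((v - mu) ** 2 for v in incomes_no) / (len(incomes_no) - 1)   # 2975.0
sigma = math.sqrt(var)                                                 # 54.54...

p = math.exp(-(120 - mu) ** 2 / (2 * var)) / (math.sqrt(2 * math.pi) * sigma)
print(round(p, 4))  # 0.0072
```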

Page 13: Lesson 4: Decision Trees

The Balance-scale Problem

Dataset from the UCI repository; it was generated to model psychological experimental results.

Each example has 4 numerical attributes: the left weight (Left_W), the left distance (Left_D), the right weight (Right_W), and the right distance (Right_D).

Each example is classified into 3 classes describing the balance scale: tip to the right (Right), tip to the left (Left), or balanced (Balanced).

3 target rules:
If Left_D × Left_W > Right_D × Right_W, it tips to the left.
If Left_D × Left_W < Right_D × Right_W, it tips to the right.
If Left_D × Left_W = Right_D × Right_W, it is balanced.

Page 14: Lesson 4: Decision Trees

The Balance-scale Problem

Balance-Scale DataSet (adapted from © João Gama's slides "Aprendizagem Bayesiana"):

Left_W  Left_D  Right_W  Right_D  Class
3       4       6        2        Balanced
2       5       3        2        Left
1       5       4        2        Right
…       …       …        …        …

Discretization is applied: each attribute is mapped to 5 intervals.

Page 15: Lesson 4: Decision Trees

The Balance-scale Problem: Learning Phase

Build contingency tables (one per attribute) and the class counters:

Attribute: Left_W
Class     I1  I2  I3  I4  I5
Left      14  42  61  71  72
Balanced  10   8   8  10   9
Right     86  66  49  34  25

Attribute: Left_D
Class     I1  I2  I3  I4  I5
Left      16  38  59  70  77
Balanced   8  10   9  10   8
Right     90  57  49  37  27

Attribute: Right_W
Class     I1  I2  I3  I4  I5
Left      87  63  49  33  28
Balanced   8  10  10   9   8
Right     16  37  58  70  79

Attribute: Right_D
Class     I1  I2  I3  I4  I5
Left      91  65  44  35  25
Balanced   8  10   8  10   9
Right     17  37  57  67  82

Class counters: Left = 260, Balanced = 45, Right = 260; 565 examples in total.

Assuming complete data, the computation of all the required estimates requires a simple scan through the data, an operation of time complexity O(N n), where N is the number of training examples and n the number of attributes.
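A sketch of this single scan in Python (attribute values are assumed to already be the interval labels after discretization):

```python
from collections import Counter, defaultdict

def build_tables(examples):
    """One pass over the data, O(N * n): class counters N_j and
    contingency counters N_ijk for every (attribute, value, class)."""
    class_counters = Counter()
    contingency = defaultdict(Counter)   # attribute -> Counter[(value, class)]
    for x, c in examples:                # x: dict attribute -> interval label
        class_counters[c] += 1
        for attr, v in x.items():
            contingency[attr][(v, c)] += 1
    return class_counters, contingency
```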

Page 16: Lesson 4: Decision Trees

The Balance-scale Problem: Classification Phase

How does NB classify the example x = (Left_W = 1, Left_D = 5, Right_W = 4, Right_D = 2)?

We need to estimate the posterior probabilities P(cj | x) for each class. The class counters and contingency tables are used to compute them:

P(cj | x) ∝ P(cj) × P(Left_W = 1 | cj) × P(Left_D = 5 | cj) × P(Right_W = 4 | cj) × P(Right_D = 2 | cj), cj ∈ {Left, Balanced, Right}

P(Left | x) = 0.277796   P(Balanced | x) = 0.135227   P(Right | x) = 0.586978

The class for this example is the one with the largest posterior probability: Class = Right.

hNB(x) = c* = argmax_{j=1,…,m} P(cj) ∏_{i=1,…,n} P(Xi = xi | cj)
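A sketch of this computation in Python, reading the counts for x from the contingency tables above. It picks Right, as on the slide, although the exact decimals it produces can differ slightly from the slide's, which may include smoothing or rounding not shown here:

```python
priors = {"Left": 260 / 565, "Balanced": 45 / 565, "Right": 260 / 565}
n_j = {"Left": 260, "Balanced": 45, "Right": 260}
# Counts N_ijk for the attribute values of x, read from the contingency tables
counts = {
    ("Left_W", 1):  {"Left": 14, "Balanced": 10, "Right": 86},
    ("Left_D", 5):  {"Left": 77, "Balanced": 8,  "Right": 27},
    ("Right_W", 4): {"Left": 33, "Balanced": 9,  "Right": 70},
    ("Right_D", 2): {"Left": 65, "Balanced": 10, "Right": 37},
}

scores = {}
for c in priors:
    s = priors[c]
    for cell in counts.values():
        s *= cell[c] / n_j[c]            # P(Xi = xi | cj) = N_ijk / N_j
    scores[c] = s

z = sum(scores.values())                 # normalize to get P(cj | x)
posteriors = {c: s / z for c, s in scores.items()}
print(max(posteriors, key=posteriors.get), posteriors)   # Right has the largest posterior
```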

Page 17: Lesson 4: Decision Trees

Iris Dataset

Dataset from the UCI repository, due to the statistician Ronald A. Fisher.

Three flower types (classes): Setosa, Virginica, Versicolour.

Four continuous attributes: sepal width and length, petal width and length.

[Photo: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]

Page 18: Lesson 4: Decision Trees


Iris Dataset

Page 19: Lesson 4: Decision Trees


Iris Dataset

Page 20: Lesson 4: Decision Trees

Scatter Plot Array of Iris Attributes

The attributes petal width and petal length provide a moderate separation of the Iris species.

Page 21: Lesson 4: Decision Trees

Naïve Bayes, Iris Dataset – Continuous Attributes

Normal probability density functions for the attribute PetalWidth: P(PetalWidth | Setosa), P(PetalWidth | Versicolor), P(PetalWidth | Virginica).

Class Iris-setosa: mean 0.244, standard deviation 0.107
Class Iris-versicolor: mean 1.326, standard deviation 0.198
Class Iris-virginica: mean 2.026, standard deviation 0.275

Page 22: Lesson 4: Decision Trees

Naïve Bayes, Iris Dataset – Continuous Attributes: Model

Class Iris-setosa (0.327):
  sepallength: mean 5.006, standard deviation 0.352
  sepalwidth: mean 3.418, standard deviation 0.381
  petallength: mean 1.464, standard deviation 0.174
  petalwidth: mean 0.244, standard deviation 0.107

Class Iris-versicolor (0.327):
  sepallength: mean 5.936, standard deviation 0.516
  sepalwidth: mean 2.770, standard deviation 0.314
  petallength: mean 4.260, standard deviation 0.470
  petalwidth: mean 1.326, standard deviation 0.198

Class Iris-virginica (0.327):
  sepallength: mean 6.588, standard deviation 0.636
  sepalwidth: mean 2.974, standard deviation 0.322
  petallength: mean 5.552, standard deviation 0.552
  petalwidth: mean 2.026, standard deviation 0.275

Page 23: Lesson 4: Decision Trees

Naïve Bayes, Iris Dataset – Continuous Attributes: Classification Phase

[Screenshot: per-example posterior probabilities; the predicted class is the one with the maximum value.]

Page 24: Lesson 4: Decision Trees

Naïve Bayes, Iris Dataset – Continuous Attributes

How does NB classify an example x with sepalLength = 5, sepalWidth = 3, petalLength = 2, petalWidth = 2? Estimate P(cj | x) for each class:

P(cj | x) ∝ P(cj) × P(sepalLength = 5 | cj) × P(sepalWidth = 3 | cj) × P(petalLength = 2 | cj) × P(petalWidth = 2 | cj), cj ∈ {setosa, versicolor, virginica}

P(setosa | x) = 0   P(versicolor | x) = 0.995   P(virginica | x) = 0.005

The class for this example is the one with the largest posterior probability: Class = versicolor.

Page 25: Lesson 4: Decision Trees

Naïve Bayes, Iris Dataset – Continuous Attributes

Estimate P(cj | x) for the versicolor class, using the model parameters of Class Iris-versicolor (prior 0.327; sepallength: mean 5.936, standard deviation 0.516; sepalwidth: mean 2.770, standard deviation 0.314; petallength: mean 4.260, standard deviation 0.470; petalwidth: mean 1.326, standard deviation 0.198):

P(versicolor | x) = P(versicolor) × P(sepalLength = 5 | versicolor) × P(sepalWidth = 3 | versicolor) × P(petalLength = 2 | versicolor) × P(petalWidth = 2 | versicolor)
                  = 0.327 × g(5; 5.936, 0.516) × g(3; 2.770, 0.314) × g(2; 4.260, 0.470) × g(2; 1.326, 0.198)
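Evaluating the three class scores this way and normalizing reproduces the posteriors of the previous slide; a self-contained Python sketch using the model parameters listed above:

```python
import math

def g(x, mu, sigma):
    """Normal density g(x; mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# (prior, [(mean, std) for sepallength, sepalwidth, petallength, petalwidth])
model = {
    "setosa":     (0.327, [(5.006, 0.352), (3.418, 0.381), (1.464, 0.174), (0.244, 0.107)]),
    "versicolor": (0.327, [(5.936, 0.516), (2.770, 0.314), (4.260, 0.470), (1.326, 0.198)]),
    "virginica":  (0.327, [(6.588, 0.636), (2.974, 0.322), (5.552, 0.552), (2.026, 0.275)]),
}
x = [5, 3, 2, 2]

scores = {}
for c, (prior, params) in model.items():
    s = prior
    for xi, (mu, sigma) in zip(x, params):
        s *= g(xi, mu, sigma)            # P(Xi = xi | cj) from the class-conditional Normal
    scores[c] = s

z = sum(scores.values())                 # normalize to get P(cj | x)
print({c: round(s / z, 3) for c, s in scores.items()})
# {'setosa': 0.0, 'versicolor': 0.995, 'virginica': 0.005}
```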

Page 26: Lesson 4: Decision Trees

Naïve Bayes, Continuous Attributes – Discretization

For continuous attributes, an alternative to fitting a Normal density is to discretize them first.

[Screenshot: the Iris attributes after BinDiscretization, bins = 3.]

Page 27: Lesson 4: Decision Trees

Naïve Bayes, Continuous Attributes – Discretization: Model

Class Iris-setosa (0.327):
  sepallength: range1 0.940, range2 0.060, range3 0.000
  sepalwidth: range1 0.020, range2 0.720, range3 0.260
  petallength: range1 1.000, range2 0.000, range3 0.000
  petalwidth: range1 1.000, range2 0.000, range3 0.000

Class Iris-versicolor (0.327):
  sepallength: range1 0.220, range2 0.720, range3 0.060
  sepalwidth: range1 0.460, range2 0.540, range3 0.000
  petallength: range1 0.000, range2 0.960, range3 0.040
  petalwidth: range1 0.000, range2 0.980, range3 0.020

Class Iris-virginica (0.327):
  sepallength: range1 0.020, range2 0.640, range3 0.340
  sepalwidth: range1 0.380, range2 0.580, range3 0.040
  petallength: range1 0.000, range2 0.120, range3 0.880
  petalwidth: range1 0.000, range2 0.100, range3 0.900

Page 28: Lesson 4: Decision Trees

Naïve Bayes, Continuous Attributes – Discretization

From this model we can build a conditional probability table (CPT) for each attribute:

Class probabilities:
Setosa 0.327   Versicolor 0.327   Virginica 0.327

Attribute Sepallength:
            Range1  Range2  Range3
Setosa       0.940   0.060   0.000
Versicolor   0.220   0.720   0.060
Virginica    0.020   0.640   0.340

Attribute Sepalwidth:
            Range1  Range2  Range3
Setosa       0.020   0.720   0.260
Versicolor   0.460   0.540   0.000
Virginica    0.380   0.580   0.040

Page 29: Lesson 4: Decision Trees

Naïve Bayes, Continuous Attributes – Discretization

Attribute Petallength:
            Range1  Range2  Range3
Setosa       1.000   0.000   0.000
Versicolor   0.000   0.960   0.040
Virginica    0.000   0.100   0.900

Attribute Petalwidth:
            Range1  Range2  Range3
Setosa       1.000   0.000   0.000
Versicolor   0.000   0.980   0.020
Virginica    0.000   0.100   0.900

Page 30: Lesson 4: Decision Trees

Naïve Bayes, Iris Dataset – Discretized Attributes: Classification (Implementation) Phase

[Screenshot: the discretized examples and their per-class posterior values; the predicted class is the one with the maximum value.]

Page 31: Lesson 4: Decision Trees

Naïve Bayes: Classification Phase

How to classify this example?

sepallength  sepalwidth  petallength  petalwidth
5            3           2            2

After discretization:

sepallength  sepalwidth  petallength  petalwidth
r1           r2          r1           r3

We need to compute the conditional posterior probabilities for each class:

P(setosa | x) = P(setosa) × P(sepalLength = r1 | setosa) × P(sepalWidth = r2 | setosa) × P(petalLength = r1 | setosa) × P(petalWidth = r3 | setosa)

P(versicolor | x) = P(versicolor) × P(sepalLength = r1 | versicolor) × P(sepalWidth = r2 | versicolor) × P(petalLength = r1 | versicolor) × P(petalWidth = r3 | versicolor)

P(virginica | x) = P(virginica) × P(sepalLength = r1 | virginica) × P(sepalWidth = r2 | virginica) × P(petalLength = r1 | virginica) × P(petalWidth = r3 | virginica)

Page 32: Lesson 4: Decision Trees

Naïve Bayes, Continuous Attributes – Discretization

Using the class probabilities and the CPTs above, for x = (sepallength = r1, sepalwidth = r2, petallength = r1, petalwidth = r3):

P(setosa | x) = P(setosa) × P(sepalLength = r1 | setosa) × P(sepalWidth = r2 | setosa) × P(petalLength = r1 | setosa) × P(petalWidth = r3 | setosa)
P(setosa | x) = 0.327 × 0.940 × 0.720 × 1.000 × 0.000 = 0

Page 33: Lesson 4: Decision Trees

Naïve Bayes, Continuous Attributes – Discretization

P(versicolor | x) = P(versicolor) × P(sepalLength = r1 | versicolor) × P(sepalWidth = r2 | versicolor) × P(petalLength = r1 | versicolor) × P(petalWidth = r3 | versicolor)
P(versicolor | x) = 0.327 × 0.220 × 0.540 × 0.000 × 0.020 = 0

Page 34: Lesson 4: Decision Trees

Naïve Bayes, Continuous Attributes – Discretization

P(virginica | x) = P(virginica) × P(sepalLength = r1 | virginica) × P(sepalWidth = r2 | virginica) × P(petalLength = r1 | virginica) × P(petalWidth = r3 | virginica)
P(virginica | x) = 0.327 × 0.020 × 0.580 × 0.000 × 0.900 = 0

If all the class probabilities are zero, we cannot determine the class for this example.

Page 35: Lesson 4: Decision Trees

Naïve Bayes: Laplace Correction

To avoid zero probabilities due to zero counters (for example, P(sepalLength = range3 | setosa) = 0.000 in the Sepallength CPT), we can implement the Laplace correction. To calculate the conditional probabilities, instead of using the estimate

P̂(Xi = xk | cj) = Nijk / Nj

we use the Laplace correction:

P̂(Xi = xk | cj) = (Nijk + 1) / (Nj + k)

where Nijk is the number of examples in D such that Xi = xk and C = cj, Nj is the number of examples in D of class cj, and k is the number of possible values of Xi.
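A one-line sketch of the corrected estimate in Python, with illustrative counts:

```python
def laplace_estimate(n_ijk, n_j, k):
    """Laplace-corrected P^(Xi = xk | cj) = (N_ijk + 1) / (N_j + k)."""
    return (n_ijk + 1) / (n_j + k)

# A value never seen with a class (N_ijk = 0) no longer gets probability 0:
print(laplace_estimate(0, 50, 3))   # 0.0188... instead of 0.0
```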

Page 36: Lesson 4: Decision Trees

Bayesian SPAM Filters: Binary Classification Problem

Spambase dataset from the UCI repository.

57 continuous attributes: some word frequencies, some character frequencies, and the frequency of capital letters.

Two classes: 0 – is not SPAM, 1 – is SPAM.

Page 37: Lesson 4: Decision Trees

Bayesian SPAM Filters: Implementation in RapidMiner

The process chains a discretization method, a feature subset selection method, learning, testing, and an evaluation method.

In this dataset there are no missing values; otherwise we would first need to use a method to replace the missing values.

Page 38: Lesson 4: Decision Trees

Bayesian SPAM Filters: Confusion Matrix

Concept learning problem: is an e-mail SPAM?

True Positive (TP) = number of examples classified as positive which truly are positive.
False Positive (FP) = number of examples classified as positive which are negative.
False Negative (FN) = number of examples classified as negative which are positive.

SPAM precision – percentage of the e-mails classified as SPAM which truly are SPAM:

precision = TP / (TP + FP)

SPAM recall (sensitivity, or true positive rate, TPR) – percentage of the e-mails that are SPAM which are classified as SPAM:

recall = TPR = TP / (TP + FN)
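These two measures are easy to compute from the confusion-matrix counts; a minimal sketch with hypothetical counts (the numbers are made up for illustration, not results from Spambase):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall (TPR) = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical confusion-matrix counts for the SPAM class
p, r = precision_recall(tp=380, fp=40, fn=70)
print(f"precision={p:.3f} recall={r:.3f}")   # precision=0.905 recall=0.844
```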

Page 39: Lesson 4: Decision Trees

Experimental Results

Three settings were compared: continuous attributes, discretization after FSS, and discretization before FSS.

Legend – order of the pre-processing operators used in each case:
1. Feature Selection (CFS)
2. Genetic Algorithm (CFS)
3. Wrapper
4. Feature Selection (CFS), then Minimal Entropy Discretization
5. Genetic Algorithm (CFS), then Minimal Entropy Discretization
6. Feature Selection (CFS), then Frequency Discretization
7. Wrapper, then Minimal Entropy Discretization
8. Minimal Entropy Discretization, then Feature Selection (CFS)
9. Minimal Entropy Discretization, then Genetic Algorithm (CFS)
10. Minimal Entropy Discretization, then Wrapper

[Chart: experimental results for the ten pre-processing pipelines.]

Page 40: Lesson 4: Decision Trees

Naïve Bayes Performance

NB is one of the simplest and most effective classifiers. NB has a very strong, unrealistic independence assumption: all the attributes are conditionally independent given the value of the class.

[Chart: bias-variance decomposition of the test error of Naive Bayes on the Nursery dataset, for 500 to 12500 training examples; the error (0% to 18%) is split into bias and variance components.]

In practice the independence assumption is violated → HIGH BIAS, which can lead to poor classification.

However, NB is efficient due to its good variance management: fewer parameters → LOW VARIANCE.

Page 41: Lesson 4: Decision Trees

Improving Naïve Bayes

Reducing the bias of the parameter estimates, by improving the probability estimates computed from the data.

Reducing the bias resulting from the modeling error, by relaxing the attribute independence assumption; one natural extension: Bayesian Network Classifiers.

Relevant works:
Webb and Pazzani (1998), "Adjusted probability naive Bayesian induction", in LNCS v. 1502
J. Gama (2001, 2003), "Iterative Bayes", in Theoretical Computer Science, v. 292
Friedman, Geiger and Goldszmidt (1998), "Bayesian Network Classifiers", in Machine Learning, 29
Pazzani (1995), "Searching for attribute dependencies in Bayesian Network Classifiers", in Proc. of the 5th Workshop on Artificial Intelligence and Statistics
Keogh and Pazzani (1999), "Learning augmented Bayesian classifiers…", in Theoretical Computer Science, v. 292