CRM114 Team KNN and Hyperspace Spam Sorting

Sorting Spam with K-Nearest Neighbor and Hyperspace Classifiers

William Yerazunis (1), Fidelis Assis (2), Christian Siefkes (3), Shalendra Chhabra (1,4)

1: Mitsubishi Electric Research Labs, Cambridge MA
2: Empresa Brasileira de Telecomunicações Embratel, Rio de Janeiro, RJ Brazil
3: Database and Information Systems Group, Freie Universität Berlin, Berlin-Brandenburg Graduate School in Distributed Information Systems
4: Computer Science and Engineering, University of California, Riverside CA

Bayesian is Great. Why Worry?

● Typical spam filters are linear classifiers
  – Consider the “checkerboard” problem
● Markovian requires the nonlinear features to be textually “near” each other
  – can’t be sure that will work forever because spammers are clever.
● Winnow is just a different weighting + different chain rule

Bayesian is Great. Why Worry?

● Bayesian is only a linear classifier
  – Consider the “checkerboard” problem
● Markovian requires the nonlinear features to be textually “near” each other
  – can’t be sure of that; spammers are clever
● Winnow is just a different weighting
● KNNs are a very different kind of classifier

Typical Linear Separation

[Figure, shown on three consecutive slides]

Nonlinear Decision Surfaces

Nonlinear decision surfaces require tremendous amounts of data.

Nonlinear Decision and KNN / Hyperspace

Nonlinear decision surfaces require tremendous amounts of data.

KNNs have been around

● Earliest found reference: E. Fix and J. Hodges, Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties
● In 1951!
● Interesting theorem: Cover and Hart (1967)
  – With unlimited training data, the error rate of a nearest-neighbor classifier is at most twice the error rate of the optimal Bayesian filter
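
For reference, the two-class form of the Cover and Hart result is usually written as below. The symbols are notation chosen here, not taken from the slides: $R^{*}$ is the error rate of the optimal Bayes classifier and $R_{NN}$ is the error rate of the single-nearest-neighbor rule in the limit of unlimited training data.

\[
R^{*} \;\le\; R_{NN} \;\le\; 2R^{*}\,(1 - R^{*}) \;\le\; 2R^{*}
\]

So the "factor of 2" is a bound on error rate, and it only holds asymptotically. As a worked number: if the best possible filter errs on 1% of messages ($R^{*} = 0.01$), the bound gives $R_{NN} \le 2 \times 0.01 \times 0.99 = 0.0198$, i.e. just under 2%.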

KNNs in one slide!

● Start with a bunch of known things and one unknown thing.
● Find the K known things most similar to the unknown thing.
● Count how many of the K known things are in each class.
● The unknown thing is of the same class as the majority of the K known things.
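
The four steps can be sketched in a few lines of Python. This is only an illustration, not the CRM114 code: messages are assumed to be sets of features, "most similar" is assumed to mean "shares the most features", and the function name `knn_classify` and the toy messages are made up for the example.

```python
from collections import Counter

def knn_classify(known, unknown, k=3):
    """Equal-vote KNN. known: list of (feature_set, label); unknown: feature_set."""
    # Find the K known things most similar to the unknown thing.
    # Similarity here is the count of shared features (an assumption).
    ranked = sorted(known, key=lambda item: len(item[0] & unknown), reverse=True)
    # Count how many of the K known things are in each class.
    votes = Counter(label for _, label in ranked[:k])
    # The unknown thing gets the majority class.
    return votes.most_common(1)[0][0]

known = [
    ({"cheap", "pills", "now"}, "spam"),
    ({"cheap", "watches", "now"}, "spam"),
    ({"meeting", "agenda", "monday"}, "good"),
    ({"lunch", "monday", "noon"}, "good"),
    ({"project", "agenda", "review"}, "good"),
]
unknown = {"cheap", "pills", "now"}
print(knn_classify(known, unknown, k=3))
print(knn_classify(known, unknown, k=5))
```

On this toy data K=3 picks the two spam messages plus one good message and answers "spam", while K=5 lets the three good messages outvote the two spam ones and answers "good".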

Issues with Standard KNNs

● How big is the neighborhood K?
● How do you weight your neighbors?
  – Equal-vote?
  – Some falloff in weight?
  – Nearby interaction – the Parzen window?
● How do you train?
  – Everything? That gets big...
  – And SLOW.

Issues with Standard KNNs

● How big is the neighborhood?
  – We will test with 3, 7, 21 and |corpus|
● How do we weight the neighbors?
  – We will try equal-weighting, similarity, Euclidean distance, and combinations thereof.
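
A rough sketch of how the weighting choices could be swapped in and out of one voting routine. Everything named here is hypothetical: messages are assumed to be sets of features, the squared Euclidean distance between two messages is assumed to be the number of features present in one but not the other, and the distance is floored at 1 to avoid dividing by zero.

```python
import math
from collections import defaultdict

def differing(a, b):
    """Features in one message but not the other (assumed squared Euclidean distance)."""
    return len(a ^ b)

def weight_equal(a, b):
    return 1.0

def weight_similarity(a, b):
    return float(len(a & b))

def weight_inverse_distance(a, b):
    # Floor at 1 so identical messages do not divide by zero (an assumption).
    return 1.0 / math.sqrt(max(differing(a, b), 1))

def weighted_vote(known, unknown, weight_fn, k=None):
    """Sum weight_fn over the k nearest known messages; k=None means K = |corpus|."""
    nearest = sorted(known, key=lambda item: differing(item[0], unknown))
    if k is not None:
        nearest = nearest[:k]
    totals = defaultdict(float)
    for features, label in nearest:
        totals[label] += weight_fn(features, unknown)
    return max(totals, key=totals.get)

known = [
    ({"cheap", "pills", "now"}, "spam"),
    ({"cheap", "watches", "now"}, "spam"),
    ({"meeting", "agenda", "monday"}, "good"),
    ({"lunch", "monday", "noon"}, "good"),
    ({"project", "agenda", "review"}, "good"),
]
unknown = {"cheap", "pills", "now"}
for fn in (weight_equal, weight_similarity, weight_inverse_distance):
    print(fn.__name__, weighted_vote(known, unknown, fn))
```

With K = |corpus|, equal weighting just reports the larger class in the toy corpus ("good"), while either falloff lets the two close spam messages dominate.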

Issues with Standard KNNs

● How do we train?
  – To compare with a good Markov classifier we need to use TOE – Train Only Errors
  – This is good in that it really speeds up classification and keeps the database small.
  – This is bad in that it violates the Cover and Hart assumptions, so the quality limit theorem no longer applies
  – BUT – we will train multiple passes to see if an asymptote appears.
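
Train Only Errors with repeated passes might look like the sketch below. The `ToyNearestNeighbor` class is a stand-in invented for this example (it is not the CRM114 classifier); any object with `classify` and `train` methods would do.

```python
class ToyNearestNeighbor:
    """Hypothetical 1-nearest-neighbor store, used only to exercise the loop."""

    def __init__(self):
        self.examples = []

    def classify(self, features):
        if not self.examples:
            return None  # nothing learned yet, so every answer is an error
        best = max(self.examples, key=lambda ex: len(ex[0] & features))
        return best[1]

    def train(self, features, label):
        self.examples.append((features, label))


def train_only_errors(classifier, messages, passes=5):
    """Test each message first; train it only if it was misclassified.

    Returns how many messages were trained in each pass, so a run of
    zeros shows the classifier has stopped changing.
    """
    trained_per_pass = []
    for _ in range(passes):
        trained = 0
        for features, label in messages:
            if classifier.classify(features) != label:
                classifier.train(features, label)
                trained += 1
        trained_per_pass.append(trained)
    return trained_per_pass


messages = [
    ({"cheap", "pills"}, "spam"),
    ({"cheap", "watches"}, "spam"),
    ({"meeting", "agenda"}, "good"),
    ({"agenda", "review"}, "good"),
]
clf = ToyNearestNeighbor()
print(train_only_errors(clf, messages, passes=3))
print(len(clf.examples))
```

On the toy data only two of the four messages ever get stored, which is the "keeps the database small" effect in miniature.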

Issues with Standard KNNs

● We found the “bad” KNNs mimic Cover and Hart behavior: they insert basically everything into a bloated database, sometimes more than once!
● The more accurate KNNs inserted fewer examples into their database.

How do we compare KNNs?

● Use the TREC 2005 SA dataset.
● 10-fold validation – train on 90%, test on 10%, repeat for each successive 10% (but remember to clear memory!)
● Run 5 passes (find the asymptote)
● Compare it versus the OSB Markovian tested at TREC 2005.
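
The 10-fold split can be sketched as follows. The function name `ten_fold_splits` is made up here, and the messages are assumed to be an in-memory list already in the order they should be sliced.

```python
def ten_fold_splits(messages, folds=10):
    """Yield (train, test) pairs; each test set is one successive slice of the corpus."""
    n = len(messages)
    for i in range(folds):
        start = i * n // folds
        end = (i + 1) * n // folds
        test = messages[start:end]
        train = messages[:start] + messages[end:]
        yield train, test


corpus = list(range(20))  # stand-in for 20 messages
for train, test in ten_fold_splits(corpus):
    # "Clear memory": build a brand-new classifier here for every fold,
    # so nothing learned from an earlier fold leaks into this one.
    print(len(train), test)
```

Each fold would then get its own 5 training passes over `train` before being scored on `test`.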

What do we use as features?

● Use the OSB feature set. This combines nearby words to make short phrases; the phrases are what are matched.
● Example: “this is an example” yields:
  “this is”
  “this <skip> an”
  “this <skip> <skip> example”

These features are the measurements we classify against.
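
A minimal sketch of generating such phrases. Several things are assumptions for illustration: tokens are split on whitespace, the window is 5 tokens wide, and phrases are emitted starting from every word, whereas the slide lists only the ones that start at "this". The function name `osb_features` is made up.

```python
def osb_features(text, window=5):
    """Pair each word with each later word inside the window, marking skipped words."""
    tokens = text.split()
    features = []
    for i, first in enumerate(tokens):
        for gap in range(1, window):
            j = i + gap
            if j >= len(tokens):
                break
            skips = ["<skip>"] * (gap - 1)
            features.append(" ".join([first, *skips, tokens[j]]))
    return features


for feature in osb_features("this is an example"):
    print(feature)
```

The first three lines printed are the three phrases from the slide; the remaining ones are the phrases anchored on "is" and "an".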

Test 1: Equal Weight Voting KNN with K = 3, 7, and 21

[Figure: three panels, one per K, plotting Good, Spam, and Wt. Total over Pass 1 to Pass 5; vertical axis from 70 to 100]

Asymptotic accuracy: 93%, 93%, and 94% (good acc: 98%, spam acc 80% for K = 3 and 7; 96% and 90% for K = 21)

Time: ~50-75 milliseconds/message

Test 2: Weight by Hamming^(-1/2)

KNN with K = 7 and 21

[Figure: two panels, one per K, plotting Good, Spam, and Wt. Total over Pass 1 to Pass 5; vertical axis from 70 to 100]

Asymptotic accuracy: 94% and 92% (good acc: 98%, spam acc 85% for K = 7; 98% and 79% for K = 21)

Time: ~60 milliseconds/message

Test 3: Weight by Hamming^(-1/2)

KNN with K = |corpus|

[Figure: Good, Spam, and Wt. Total over Pass 1 to Pass 5; vertical axis from 70 to 100]

Asymptotic accuracy: 97.8%
Good accuracy: 98.2%
Spam accuracy: 96.9%

Time: 32 msec/message

Test 4: Weight by N-dimensional radiation model (a.k.a. “Hyperspace”)

\[
\mathrm{Weight}_{\text{document } i} = \frac{\mathrm{similarity}^{\,d}}{(\text{Euclidean distance})^{2}} \qquad \text{for } d = 1, 2, 3, \ldots
\]

\[
\mathrm{similarity} = \bigl|\{\, \text{feature}_i : \text{feature}_i \in \text{known text} \,\wedge\, \text{feature}_i \in \text{unknown text} \,\}\bigr|
\]
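
The weight on this slide can be tried out with a short sketch. This is not the CRM114 implementation. It assumes messages are sets of features, takes the squared Euclidean distance to be the number of features found in one message but not the other, and floors that distance at 1 so identical messages do not divide by zero. The names `hyperspace_weight` and `hyperspace_classify` are made up.

```python
from collections import defaultdict

def hyperspace_weight(known, unknown, d=1):
    """similarity**d divided by squared distance, for one known document."""
    similarity = len(known & unknown)        # features present in both texts
    distance_sq = len(known ^ unknown)       # assumed squared Euclidean distance
    return similarity ** d / max(distance_sq, 1)  # floor at 1 is an assumption

def hyperspace_classify(corpus, unknown, d=1):
    """K = |corpus|: every known document contributes its weight to its class."""
    totals = defaultdict(float)
    for features, label in corpus:
        totals[label] += hyperspace_weight(features, unknown, d)
    return max(totals, key=totals.get)

corpus = [
    ({"cheap", "pills", "now"}, "spam"),
    ({"cheap", "watches", "now"}, "spam"),
    ({"meeting", "agenda", "monday"}, "good"),
    ({"lunch", "monday", "noon"}, "good"),
    ({"project", "agenda", "review"}, "good"),
]
unknown = {"cheap", "pills", "monday"}
print(hyperspace_classify(corpus, unknown, d=1))
```

Like light from a source, a document's influence falls off with the square of its distance, so a few close matches outweigh many distant ones; raising d rewards documents that share many features.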

Test 4: Hyperspace weight, K = |corpus|, d = 1, 2, 3

[Figure: three panels, one per d, plotting Good, Spam, and Wt. Total over Pass 1 to Pass 5; vertical axis from 95 to 100]

Asymptotic accuracy: 99.3%
Good accuracy: 99.64%, 99.66% and 99.59%
Spam accuracy: 98.7%, 98.4%, 98.5%

Time: 32, 22, and 22 milliseconds/message

Test 5: Compare vs. Markov OSB (thin threshold)

[Figure: Good, Spam, and Wt. Total over Pass 1 to Pass 5; vertical axis from 95 to 100]

Asymptotic accuracy: 99.1%
Good accuracy: 99.6%, Spam accuracy: 97.9%

Time: 31 msec/message

Test 6: Compare vs. Markov OSB (thick threshold = 10.0 pR)

● Thick Threshold means:
  – Test it first
  – If it is wrong, train it.
  – If it was right, but only by less than the threshold thickness, train it anyway!
● 10.0 pR units is roughly the range between 10% and 90% certainty.
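
The thick-threshold rule can be written as a single decision function. This is a sketch under stated assumptions: `score` is a hypothetical signed confidence where positive means "spam" and the magnitude is the margin in the same units as the threshold. It does not reproduce how CRM114 computes pR.

```python
def needs_training(score, is_spam, threshold=10.0):
    """Decide whether a message that has just been tested should also be trained."""
    predicted_spam = score > 0
    if predicted_spam != is_spam:
        return True                 # wrong: train it
    return abs(score) < threshold   # right, but inside the thick threshold: train anyway


print(needs_training(-3.0, is_spam=True))    # misclassified
print(needs_training(4.0, is_spam=True))     # right, but only barely
print(needs_training(25.0, is_spam=True))    # right with room to spare
```

Setting `threshold=0.0` reduces this to plain Train Only Errors, which is the thin-threshold case.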

Test 6: Compare vs. Markov OSB (thick threshold = 10.0 pR)

[Figure: Good, Spam, and Wt. Total over Pass 1 to Pass 5; vertical axis from 95 to 100]

Asymptotic accuracy: 99.5%
Good accuracy: 99.6%, Spam accuracy: 99.3%

Time: 19 msec/message

Conclusions:

● Small-K KNNs are not very good for sorting spam.
● K=|corpus| KNNs with distance weighting are reasonable.
● K=|corpus| KNNs with hyperspace weighting are pretty good.
● But thick-threshold trained Markovs seem to be more accurate, especially in single-pass training.

Thank you! Questions?

Full source is available at http://crm114.sourceforge.net (licensed under the GPL)