Sorting Spam with K-Nearest Neighbor and Hyperspace Classifiers
William Yerazunis, Fidelis Assis, Christian Siefkes, Shalendra Chhabra (the CRM114 team)


Posted on 16-Dec-2015


TRANSCRIPT

Slide 1: Sorting Spam with K-Nearest Neighbor and Hyperspace Classifiers
William Yerazunis (1), Fidelis Assis (2), Christian Siefkes (3), Shalendra Chhabra (1,4)
1: Mitsubishi Electric Research Labs, Cambridge, MA
2: Empresa Brasileira de Telecomunicações (Embratel), Rio de Janeiro, RJ, Brazil
3: Database and Information Systems Group, Freie Universität Berlin, Berlin-Brandenburg Graduate School in Distributed Information Systems
4: Computer Science and Engineering, University of California, Riverside, CA

Slides 2-3: Bayesian is Great. Why Worry?
- Typical spam filters, Bayesian included, are only linear classifiers. Consider the checkerboard problem.
- Markovian classification requires the nonlinear features to be textually near each other; we can't be sure that will hold forever, because spammers are clever.
- Winnow is just a different weighting plus a different chain rule.
- KNNs are a very different kind of classifier.

Slides 4-6: Typical Linear Separation (figures)

Slides 7-8: Nonlinear Decision Surfaces, and KNN / Hyperspace
- Nonlinear decision surfaces require tremendous amounts of data.

Slides 9-11: KNNs Have Been Around
- Earliest found reference: E. Fix and J. Hodges, "Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties." In 1951!
- Interesting theorem (Cover and Hart, 1967): KNNs are within a factor of 2 in accuracy of the optimal Bayesian filter.

Slide 12: KNNs in One Slide!
- Start with a bunch of known things and one unknown thing.
- Find the K known things most similar to the unknown thing.
- Count how many of the K known things are in each class.
- The unknown thing is of the same class as the majority of the K known things.

Slide 13: Issues with Standard KNNs
- How big is the neighborhood K?
- How do you weight your neighbors? Equal vote? Some falloff in weight? Nearby interaction (the Parzen window)?
- How do you train? Everything? That gets big... and SLOW.

Slide 14: Issues with Standard KNNs
- How big is the neighborhood? We will test with K = 3, 7, 21, and |corpus|.
- How do we weight the neighbors? We will try equal weighting, similarity, Euclidean distance, and combinations thereof.

Slide 15: Issues with Standard KNNs
- How do we train? To compare with a good Markov classifier we need to use TOE (Train Only Errors).
- This is good in that it really speeds up classification and keeps the database small.
- This is bad in that it violates the Cover and Hart assumptions, so the quality-limit theorem no longer applies. BUT we will train multiple passes to see if an asymptote appears.

Slide 16: Issues with Standard KNNs
- We found that the bad KNNs mimic Cover and Hart behavior: they insert basically everything into a bloated database, sometimes more than once!
- The more accurate KNNs inserted fewer examples into their database.

Slide 17: How Do We Compare KNNs?
- Use the TREC 2005 SA dataset.
- 10-fold validation: train on 90%, test on 10%, repeat for each successive 10% (but remember to clear memory!).
- Run 5 passes (find the asymptote).
- Compare versus the OSB Markovian tested at TREC 2005.

Slide 18: What Do We Use as Features?
- Use the OSB feature set. This combines nearby words to make short phrases; the phrases are what are matched.
- Example: "this is an example" yields "this is", "this an", "this example".
- These features are the measurements we classify against.

Slide 19: Test 1: Equal-Weight Voting KNN with K = 3, 7, and 21
- Asymptotic accuracy: 93%, 93%, and 94% (good accuracy 98%, spam accuracy 80% for K = 3 and 7; 96% and 90% for K = 21).
- Time: ~50-75 milliseconds/message.

Slide 20: Test 2: Weight by Hamming^(-1/2), KNN with K = 7 and 21
- Asymptotic accuracy: 94% and 92% (good accuracy 98%, spam accuracy 85% for K = 7; 98% and 79% for K = 21).
- Time: ~60 milliseconds/message.

Slide 21: Test 3: Weight by Hamming^(-1/2), K = |corpus|
- Asymptotic accuracy: 97.8% (good accuracy 98.2%, spam accuracy 96.9%).
- Time: 32 msec/message.

Slides 22-23: Test 4: Weight by N-Dimensional Radiation Model (a.k.a. Hyperspace)
- Hyperspace weight, K = |corpus|, d = 1, 2, 3.
- Asymptotic accuracy: 99.3% (good accuracy 99.64%, 99.66%, and 99.59%; spam accuracy 98.7%, 98.4%, and 98.5%).
- Time: 32, 22, and 22 milliseconds/message.

Slide 24: Test 5: Compare vs. Markov OSB (Thin Threshold)
- Asymptotic accuracy: 99.1% (good accuracy 99.6%, spam accuracy 97.9%).
- Time: 31 msec/message.

Slide 25: Test 6: Compare vs. Markov OSB (Thick Threshold = 10.0 pR)
- Thick threshold means: test it first. If it is wrong, train it. If it was right, but only by less than the threshold thickness, train it anyway!
- 10.0 pR units is roughly the range between 10% and 90% certainty.

Slide 26: Test 6 Results
- Asymptotic accuracy: 99.5% (good accuracy 99.6%, spam accuracy 99.3%).
- Time: 19 msec/message.

Slides 27-30: Conclusions
- Small-K KNNs are not very good for sorting spam.
- K = |corpus| KNNs with distance weighting are reasonable.
- K = |corpus| KNNs with hyperspace weighting are pretty good.
- But thick-threshold-trained Markovs seem to be more accurate, especially in single-pass training.

Slide 31: Thank You! Questions?
- Full source is available at http://crm114.sourceforge.net (licensed under the GPL).
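The "KNNs in one slide" recipe can be sketched in a few lines of Python. This is a minimal illustration, not the CRM114 implementation; treating each document as a set of features and using the symmetric-difference count as the Hamming-style distance is our assumption:

```python
from collections import Counter

def hamming(a, b):
    """Distance between two documents, each represented as a set of
    features: the count of features present in one but not the other."""
    return len(a ^ b)

def knn_classify(unknown, known, k, distance=hamming):
    """Majority vote of the k known examples most similar to `unknown`.
    `known` is a list of (feature_set, label) pairs."""
    nearest = sorted(known, key=lambda pair: distance(unknown, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For example, with three labeled documents and K = 3, the unknown document takes the class of the majority of its neighbors.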
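The evaluation protocol (train on 90%, test on each successive 10%, clearing memory between folds) is a standard 10-fold split over the corpus, which might be sketched as:

```python
def ten_fold(corpus):
    """Yield (train, test) pairs: each fold tests on a successive 10%
    of the corpus and trains on the remaining 90%.  Clearing classifier
    state between folds is left to the caller."""
    n = len(corpus)
    for f in range(10):
        lo, hi = f * n // 10, (f + 1) * n // 10
        yield corpus[:lo] + corpus[hi:], corpus[lo:hi]
```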
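The OSB example ("this is an example" yields "this is", "this an", "this example") suggests pairing each word with the few words that follow it. The sketch below reproduces the slide's example; note that real CRM114 OSB features also encode the skip distance inside each phrase, a detail omitted here, and the window size of 3 is chosen only to match the example:

```python
def osb_features(text, window=3):
    """Pair each word with each of the next `window` words, producing the
    short sparse phrases used as classification features."""
    words = text.split()
    feats = []
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            feats.append(f"{w} {words[j]}")
    return feats
```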
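With the Hamming^(-1/2) weighting and K = |corpus|, classification reduces to a weighted sum over every training example rather than a cutoff at K neighbors. A sketch under the same set-of-features assumption as above; the small epsilon guarding against a zero distance is our addition:

```python
def weighted_vote(unknown, known, power=-0.5):
    """Vote over the whole corpus: each (feature_set, label) example adds
    hamming_distance ** power to its class's score; nearer examples
    therefore count for more.  Returns the highest-scoring class."""
    scores = {}
    for feats, label in known:
        d = len(unknown ^ feats)                    # set symmetric difference
        scores[label] = scores.get(label, 0.0) + (d + 1e-9) ** power
    return max(scores, key=scores.get)
```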
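The thick-threshold rule (test first; train on errors, and also on correct answers thinner than the threshold) can be sketched as a training loop. The `classifier` object here is hypothetical, with a `score` method returning a signed pR value and a `train` method; the sign convention (positive pR means "good") is our assumption, not something the deck specifies:

```python
def thick_threshold_train(classifier, stream, threshold=10.0):
    """TOE with a thick threshold: score each message first; train on
    every error, and also on correct answers whose |pR| margin falls
    inside `threshold`."""
    for msg, label in stream:
        pr = classifier.score(msg)               # signed pR; > 0 read as "good"
        predicted = "good" if pr > 0 else "spam"
        if predicted != label or abs(pr) < threshold:
            classifier.train(msg, label)
```

A confidently correct answer (|pR| at or beyond the threshold) is left alone; everything else becomes training data.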