
CLASSIFICAÇÃO AUTOMÁTICA DE DOCUMENTOS

TEMPORALMENTE ROBUSTA


THIAGO CUNHA DE MOURA SALLES

CLASSIFICAÇÃO AUTOMÁTICA DE DOCUMENTOS

TEMPORALMENTE ROBUSTA

Dissertation presented to the Graduate Program in Computer Science of the Institute of Exact Sciences of the Universidade Federal de Minas Gerais in partial fulfillment of the requirements for the degree of Master in Computer Science.

ADVISOR: MARCOS ANDRÉ GONÇALVES

CO-ADVISOR: LEONARDO CHAVES DUTRA DA ROCHA

Belo Horizonte

March 2011

THIAGO CUNHA DE MOURA SALLES

AUTOMATIC DOCUMENT CLASSIFICATION

TEMPORALLY ROBUST

Dissertation presented to the Graduate Program in Computer Science of the Federal University of Minas Gerais in partial fulfillment of the requirements for the degree of Master in Computer Science.

ADVISOR: MARCOS ANDRÉ GONÇALVES

CO-ADVISOR: LEONARDO CHAVES DUTRA DA ROCHA

Belo Horizonte

March 2011

© 2011, Thiago Cunha de Moura Salles. All rights reserved.

Salles, Thiago Cunha de Moura.
S168c    Classificação Automática de Documentos Temporalmente Robusta / Thiago Cunha de Moura Salles. — Belo Horizonte, 2011.
         xxxvi, 106 f. : il. ; 29cm
         Dissertation (Master's) — Universidade Federal de Minas Gerais, Department of Computer Science.
         Advisor: Marcos André Gonçalves. Co-advisor: Leonardo Chaves Dutra da Rocha.
         1. Computing - Theses. 2. Information retrieval - Theses. I. Advisor. II. Title.

CDU 519.6*73 (043)


“In times of change, learners inherit the Earth,

while the learned find themselves beautifully equipped

to deal with a world that no longer exists.”

(Eric Hoffer)


Resumo

Automatic Document Classification (ADC) is a highly relevant research topic in the machine learning and information retrieval communities, and several ADC algorithms have been proposed in the literature. Most ADC algorithms, however, assume a static data distribution, a premise that is commonly violated in real data. In this work, we address the challenges posed by the temporal dynamics observed in textual collections. We present evidence of the existence of three temporal effects in three real collections, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes. We then quantify the impact of these temporal effects on four traditional ADC algorithms by running a series of full factorial designs. We show that these effects affect the collections in distinct ways and impact the effectiveness of the ADC algorithms to different extents. The quantitative analyses provide valuable information for a better understanding of how ADC algorithms behave when faced with data distributions that vary over time, and point out important requirements for the proposal of more accurate classification models. Based on the conducted analyses, and aiming to minimize the impact of such effects on ADC algorithms, we introduce a temporal weighting function (TWF) that reflects the varying nature of textual collections and propose a methodology to determine both its expression and its parameters. This methodology was applied to three textual collections. Three traditional ADC algorithms (kNN, Rocchio, and Naïve Bayes) were extended to incorporate the TWF, according to two proposed strategies, yielding what we call temporally robust classifiers. The temporally robust classifiers achieved significant gains in effectiveness over their traditional counterparts.

Abstract

Automatic Document Classification (ADC) continues to be a relevant research topic in the machine learning and information retrieval communities, and several ADC algorithms have been proposed. However, the majority of ADC algorithms assume that the underlying data distribution does not change over time. In this work, we are concerned with the challenges imposed by the temporal dynamics observed in textual datasets. We provide evidence of the existence of three main temporal effects in three textual datasets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes. We then quantify, using a series of full factorial design experiments, the impact of these effects on four well-known ADC algorithms. We show that these temporal effects affect each analyzed dataset differently, and that they restrict the performance of each considered ADC algorithm to different extents. The reported quantitative analyses provide valuable insights to better understand the behavior of ADC algorithms when faced with non-static (temporal) data distributions and highlight important requirements for the proposal of more accurate classification models. Based on the performed analyses, in order to minimize the impact of temporal effects in ADC algorithms, we introduce a temporal weighting function (TWF) which reflects the varying nature of textual datasets and propose a methodology to determine its expression and parameters. We applied this methodology to three textual datasets and then proposed two strategies to extend three ADC algorithms (namely kNN, Rocchio and Naïve Bayes) to incorporate the TWF, which we call temporally-aware classifiers. Experiments showed that the temporally-aware classifiers achieved significant gains, outperforming (or at least matching) state-of-the-art algorithms in almost all cases.

Resumo Estendido

Introduction

Automatic Document Classification (ADC) is a highly relevant research topic in the Machine Learning and Information Retrieval communities. Indeed, the development of effective and efficient ADC algorithms has proven to be of great importance, given the increasing complexity and scale of current application scenarios, such as the Web. The ADC task consists of learning models that associate documents with semantically cohesive classes, based on a set of previously classified documents. These models are key components for supporting and improving a variety of tasks, such as the design of topic directories, identification of writing styles, organization of digital libraries, and helping users to better interact with search engines, among others.

The Problem

To better understand the problem studied in this work, we briefly present the ADC task under the supervised paradigm. The main goal of ADC is to predict the (unknown) class of a new document, based on a set of previously classified documents (Sebastiani, 2002). Let d_i = (\vec{x}_i, c_i) be a document whose vector ("bag of words") representation is given by \vec{x}_i and whose class c_i ∈ C is a categorical attribute drawn from a finite set C of classes. The goal of ADC can thus be defined as learning a discrete approximation of the a posteriori class distribution P(c_i | d_i), which reflects the predictive relationship between documents and classes. This learning is carried out according to the set of previously classified documents (the training set).

The approximation of P(c_i | d_i) can be obtained either by direct estimation or by indirect estimation (through Bayes' rule). The first strategy defines the so-called discriminative classifiers, characterized by learning the inter-class boundaries so as to minimize the error rate (or some related metric), literally discriminating the classes without making any assumption about the probability density function of each class. The second strategy, in turn, defines the so-called generative classifiers, which rely on estimating both the class-conditional probability P(d_i | c_i) and the a priori class probability P(c_i) in order to estimate the desired a posteriori probability. In this case, a model is assumed both for the densities P(d_i | c_i) and for the prior probabilities P(c_i), with the model parameters estimated from the training set. The a posteriori probability is then obtained by applying Bayes' rule:

P(c_i \mid d_i) = \frac{P(c_i) \cdot P(d_i \mid c_i)}{\sum_{c' \in C} P(c') \cdot P(d_i \mid c')},    (1)

where P(c_i) and P(d_i | c_i) denote, respectively, the a priori and the class-conditional probabilities.
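As an illustration of Equation (1), the short sketch below computes posterior class probabilities from given priors and class-conditional likelihoods. The array names and toy values are ours, not part of the dissertation.

```python
import numpy as np

# Toy priors P(c) and class-conditional likelihoods P(d|c) for a single
# document d over three classes (values are illustrative only).
prior = np.array([0.5, 0.3, 0.2])         # P(c)
likelihood = np.array([0.02, 0.05, 0.01]) # P(d|c)

# Bayes' rule (Equation 1): the posterior is proportional to prior * likelihood,
# normalized by the evidence summed over all classes.
joint = prior * likelihood
posterior = joint / joint.sum()           # P(c|d), e.g. [0.370 0.556 0.074]

print(posterior)
print(posterior.argmax())                 # predicted class index
```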

The basic premise adopted by the vast majority of ADC algorithms is that the training data used to build a classification model are random samples drawn from a stationary data distribution. However, this may not be the case. In fact, in several (perhaps most) real classification problems, the data used for training may not come from the same distribution that governs the data to be classified, due to their temporal dynamics. For example, spam filtering and recommender systems are naturally confronted with inherently dynamic data. Thus, the success of classification algorithms may be compromised when faced with non-static data.

As analyzed by Kelly et al. (1999), the variations observed in data distributions are reflected in at least three aspects:

• Variations in the a priori probabilities—P(c_i);

• Variations in the a posteriori probabilities—P(c_i | d_i);

• Variations in the conditional probabilities—P(d_i | c_i).

Note that, according to Equation (1), since P(c_i | d_i) depends on P(d_i | c_i), both discriminative and generative classifiers that assume a stationary data distribution may have their effectiveness limited when applied to non-static data distributions.

In this work, we are particularly interested in the impact that the temporal dynamics observed in textual data have on ADC algorithms. Due to the dynamics of knowledge and even of languages, the characteristics of textual collections may vary over time. Indeed, as analyzed by Mourão et al. (2008), three temporal effects, which can ultimately be seen as manifestations of the three aspects listed above, proved significant in two real textual collections. The first effect, CD ("Class Distribution variation"), refers to variations in the class distribution over time (that is, the relative class frequencies do not remain static). The second effect, TD ("Term Distribution variation"), refers to variations observed over time in the term distribution, reflected by variations in the representativeness of terms with respect to the classes in which they occur. Finally, the third effect, CS ("Class Similarity variation"), refers to variations in the pairwise class similarities as time goes by. In fact, two classes may be similar (or dissimilar) to each other at a given moment, and this similarity may decrease (or increase) over time. Moreover, in (Mourão et al., 2008) the authors showed that this temporal evolution poses a challenge to learning algorithms, which may have their effectiveness limited if this aspect is neglected.

In this work, we advance the knowledge in the area by quantifying and minimizing the impact of temporal effects on ADC algorithms. By running a series of full factorial designs, we quantify the extent of the temporal effects in different textual collections, as well as their impact on four traditional ADC algorithms. Based on the knowledge obtained from this deeper characterization, we develop strategies to minimize the impact of such effects on three algorithms, achieving results competitive with the state of the art in automatic document classification, at a lower computational cost.

Quantitative Analysis of Temporal Effects on Automatic Document Classification

In order to quantify the impact of temporal effects on ADC algorithms, we first revisit the characterization reported in (Mourão et al., 2008), in which the authors present evidence of the existence of the three temporal effects discussed above in two real textual collections: ACM-DL and MEDLINE. The former comprises 24,897 documents from the ACM Digital Library, distributed among 11 disjoint classes and created between 1980 and 2002. The latter comprises 861,454 documents classified into 7 Medicine-related classes, created between 1970 and 1985. We also include a third collection, from the news domain, in order to provide evidence of the existence of the temporal effects in it. This is AG-NEWS, a collection of 835,795 documents distributed among 11 disjoint classes and created within an interval of 573 days. It is potentially a more dynamic collection than the others.

Indeed, when we characterize this collection with respect to the temporal effects, following the methodology proposed in (Mourão et al., 2008), it becomes clear that AG-NEWS is affected by all three temporal effects. As an example, Figure 1 shows the relative class distribution observed over time (using a weekly time unit). The class distribution clearly varies. Further details on this characterization can be found in the full text of the dissertation.

Figure 1: Class Distribution Variation—AG-NEWS.

Full Factorial Design

Having shown the existence of the temporal effects in the three adopted textual collections, we then proceed to a deeper characterization, quantifying how they affect the collections and the effectiveness of four ADC algorithms widely used by the Machine Learning community, namely Rocchio, K Nearest Neighbors (kNN), Naïve Bayes, and Support Vector Machine (SVM).

Given k factors, each of which can assume n levels (possible values), and a response variable, an n^k r factorial design seeks to quantify the impact of each factor (as well as the interactions among them) on the response variable, by means of r experimental replications. In our case, we aim to quantify the impact of the temporal effects (factors), and their interactions, on the effectiveness of ADC algorithms (response variable). We consider two possible levels: a low level and a high level, referring to a low and a high influence of the temporal effects, respectively.
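As a minimal sketch of how a two-factor design of this kind attributes variation to factors, the code below computes the main effects and the interaction of two factors from mean responses using the standard sign-table formulation of a 2^2 design. The values and the simplified variance split are our illustration, not the exact procedure used in the dissertation.

```python
import numpy as np

# Mean response (e.g., MacroF1) for each combination of two factors
# (A, B) at low (-1) and high (+1) levels, averaged over r replications.
# Values are illustrative only.
responses = {(-1, -1): 0.62, (+1, -1): 0.55, (-1, +1): 0.60, (+1, +1): 0.48}

signs_a = np.array([a for (a, b) in responses])
signs_b = np.array([b for (a, b) in responses])
y = np.array(list(responses.values()))

q_a = (signs_a * y).sum() / 4             # main effect of factor A
q_b = (signs_b * y).sum() / 4             # main effect of factor B
q_ab = (signs_a * signs_b * y).sum() / 4  # interaction A x B

# Portion of the total variation explained by each factor
# (ignoring the replication error term for brevity).
ss = np.array([q_a, q_b, q_ab]) ** 2
print(dict(zip(["A", "B", "AB"], ss / ss.sum())))
```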

A first issue to be addressed in order to carry out the factorial design is to isolate the levels of each factor. This is done by partitioning the documents of the collection under study into groups presenting low and high levels of influence of the temporal effects. To this end, we propose mechanisms to perform this isolation, as described next:

Class Distribution (CD): We measure the variation of the distribution of each class c over time by means of the Coefficient of Variation (CV_c = σ_c/µ_c) of the relative proportion of c at each point in time. To do so, we compute the proportion P_{c,p} of documents belonging to class c at each time point p and obtain both the mean µ_c and the standard deviation σ_c of these values. We then associate with each class c its respective Coefficient of Variation CV_c. We define a threshold δ_CD such that documents belonging to classes whose CV is below δ_CD are assigned to the low level (group CD↓) and the remaining ones are assigned to the high level (group CD↑); see the sketch after this list.

Term Distribution (TD): In order to isolate the low and high levels of this temporal effect, we propose a metric called Document Stability Level (DSL). The DSL of a document d denotes the density of stable terms (that is, terms presenting low variation in their representativeness with respect to the classes) that compose d. We define a threshold δ_TD to isolate the two levels: documents whose DSL is below δ_TD are assigned to the low level (group TD↓) and the remaining ones are assigned to the high level (group TD↑).

Class Similarity (CS): To isolate the levels associated with this effect, we consider the variations observed over time in the pairwise class similarities. Consider the class pair ⟨c_i, c_j⟩, with i ≠ j. For each time point p, we define V_{i,p} and V_{j,p} as the vocabularies of classes c_i and c_j observed at p, respectively, composed of the k most representative terms for those classes at that time point according to the Information Gain metric. We compute the cosine similarity between both vocabularies and measure the variability observed over time by means of the Coefficient of Variation. Thus, for each class c_i, we measure the variability observed in its similarity to the other classes c_j ≠ c_i. As before, we define a threshold δ_CS in order to separate the documents associated with the classes with lower variability (group CS↓) from those associated with the classes with higher variability (group CS↑).
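The sketch below illustrates the CD-level isolation described in the list above: it computes the per-class Coefficient of Variation of the relative class proportions over time and splits classes by a threshold. Function and variable names, as well as the threshold value, are ours, chosen for illustration.

```python
from collections import Counter, defaultdict

def split_classes_by_cd(docs, delta_cd=0.5):
    """docs: list of (class_label, time_point) pairs.
    Splits classes into low/high Class Distribution variation groups
    using the Coefficient of Variation CV_c = sigma_c / mu_c of the
    relative proportion of each class at each time point."""
    labels = {c for c, _ in docs}
    per_time = defaultdict(Counter)
    for label, p in docs:
        per_time[p][label] += 1

    low, high = set(), set()
    for c in labels:
        # Relative proportion P_{c,p} of class c at every time point p.
        props = [counts[c] / sum(counts.values()) for counts in per_time.values()]
        mu = sum(props) / len(props)
        sigma = (sum((v - mu) ** 2 for v in props) / len(props)) ** 0.5
        cv = sigma / mu if mu > 0 else 0.0
        (low if cv < delta_cd else high).add(c)
    return low, high

# Example usage with toy (class, time point) data.
docs = [("A", 1), ("A", 1), ("B", 1), ("A", 2), ("B", 2), ("B", 2)]
print(split_classes_by_cd(docs))
```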

Once the levels of each factor were isolated, we observed a high correlation between the CD and CS temporal effects. This correlation in fact precludes conducting a 2^3 r factorial design (that is, one with the three temporal effects considered simultaneously). We therefore adopted a pairwise experimentation strategy, evaluating in isolation the impact of the CD and TD effects (factorial design CD×TD) and the impact of the CS and TD effects (factorial design CS×TD) on the ADC algorithms. For each combination of the adopted ADC algorithms and collections, we ran the pair of factorial designs CD×TD and CS×TD.

Main Results

Running the aforementioned factorial designs revealed a series of pertinent insights about the behavior of the collections from a temporal standpoint, as well as about the behavior of ADC algorithms when applied to collections whose characteristics vary over time. First, we show that the temporal effects occur more prominently in the ACM-DL and AG-NEWS collections than in MEDLINE. More specifically, with 99% confidence, we obtained the following partial orderings:

CD_MEDLINE < CD_ACM-DL ∼ CD_AG-NEWS,
CS_MEDLINE < CS_ACM-DL ∼ CS_AG-NEWS,
TD_MEDLINE < TD_ACM-DL < TD_AG-NEWS.

Second, for the ACM-DL collection, the impact of the CD and CS effects proved statistically equivalent to the impact of the TD effect, whereas for the MEDLINE and AG-NEWS collections both CD and CS proved more prominent than the TD effect.

Furthermore, all four analyzed ADC algorithms were negatively impacted by the temporal effects in terms of classification effectiveness. Indeed, the largest effectiveness degradations were observed when the algorithms were applied to the most dynamic collections (ACM-DL and AG-NEWS). Considering the algorithms individually, the quantitative analysis allowed a better understanding of the strengths and weaknesses of the classifiers with respect to the three temporal effects studied. For example, the SVM classifier proved more robust to the TD effect, while being markedly impacted by the other effects. Such behavior can be explained by the classifier's own characteristics, as discussed in the dissertation. We also show that the other three classifiers under study are quite sensitive to all three temporal effects. Table 1 presents the partial ordering of the algorithms, for each adopted dataset, with respect to the impact of the observed temporal effects. The reported relationships highlight the fact that, besides ADC algorithms being negatively affected by the temporal effects, the observed degradation is peculiar to each algorithm and to each dataset.

Temporal Effect | ACM-DL              | MEDLINE             | AG-NEWS
CD              | SVM > NB ∼ KNN ∼ RO | RO > SVM > NB > KNN | RO ∼ KNN > SVM ∼ NB
CS              | SVM > KNN ∼ RO > NB | RO > SVM ∼ NB > KNN | RO ∼ KNN ∼ NB > SVM
TD              | SVM ∼ KNN ∼ RO ∼ NB | SVM > RO ∼ NB ∼ KNN | RO > NB > KNN > SVM

Table 1: A Comparative Study of the Impact of the Temporal Effects on each ADC Algorithm—Rocchio (RO), SVM, Naïve Bayes (NB), and KNN.

The results obtained from this analysis therefore corroborate our argument that the temporal dimension is a highly important aspect which, despite the intrinsic challenges associated with temporal dynamics, must be properly taken into account in the development of accurate classification models.

Temporally Robust Automatic Document Classification

Based on the lessons learned from the temporal characterization described above, we propose strategies to minimize the impact of the temporal effects on ADC algorithms when applied to data drawn from distributions that vary over time. These strategies rely on what we call the Temporal Weighting Function (TWF). We first propose a methodology, based on a series of statistical tests, to determine the expression and the parameters of the TWF, so as to best describe the underlying evolutionary process that governs the data variation. We instantiate this methodology considering the three textual collections described above. We found that the TWFs associated with the ACM-DL and MEDLINE collections follow a lognormal distribution, with 99% confidence. However, the same tests failed for AG-NEWS. Therefore, the TWF associated with AG-NEWS follows a distinct distribution, and other tests (potentially more complex ones, which may preclude their use by those lacking the required statistical skills) become necessary. In fact, for temporally robust classification, only the positive real values associated with the temporal distances are needed. Thus, to enable the applicability of these classifiers in cases where the tests needed to determine the TWF are more complex (or even unknown), we offer an automatic strategy to determine this function, without the need to perform any statistical test.
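For the statistically defined TWF, a curve-fitting step of the kind described above could look like the sketch below, which fits a lognormal shape to observed (temporal distance, weight) pairs with SciPy. The data points, the scaling parameter, and the exact functional form are illustrative assumptions, not the dissertation's estimated parameters.

```python
import numpy as np
from scipy.optimize import curve_fit

def lognormal_twf(delta, a, mu, sigma):
    """Lognormal-shaped temporal weighting function of the temporal
    distance delta > 0: a scaled lognormal(mu, sigma) density."""
    return a * np.exp(-((np.log(delta) - mu) ** 2) / (2 * sigma ** 2)) / (
        delta * sigma * np.sqrt(2 * np.pi))

# Toy observations: average term-class weight per temporal distance.
deltas = np.array([1, 2, 3, 4, 5, 6, 8, 10, 15, 20], dtype=float)
weights = np.array([0.9, 1.0, 0.8, 0.6, 0.45, 0.35, 0.22, 0.15, 0.07, 0.04])

params, _ = curve_fit(lognormal_twf, deltas, weights, p0=(1.0, 1.0, 1.0))
print(dict(zip(["a", "mu", "sigma"], params)))
```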

Once the TWF is defined, it is necessary to provide mechanisms to incorporate it into the classification framework. We therefore propose three strategies to do so:

TWF applied to Documents: This strategy consists of weighting each training document by the TWF, according to the temporal distance between it and the document to be classified. In this way, training documents coming from time points at which the data distribution diverges from the one observed at the creation time of the document to be classified have their influence on the decision rule minimized. Figure 2 presents a schematic description of this strategy.

Figure 2: TWF Applied to Documents.
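One plausible reading of this strategy, sketched below for a kNN-style decision rule, multiplies each neighbor's vote by twf(δ). The helper names and the similarity function are ours, for illustration only.

```python
from collections import defaultdict

def classify_twf_doc(test_vec, test_time, training, twf, similarity, k=10):
    """training: list of (vector, label, time_point) tuples.
    Lazy kNN-style rule: each of the k most similar training documents
    votes with weight similarity * twf(delta), where delta is the
    temporal distance between the training and the test document."""
    neighbors = sorted(((similarity(test_vec, vec), label, p)
                        for vec, label, p in training), reverse=True)[:k]
    votes = defaultdict(float)
    for sim, label, p in neighbors:
        votes[label] += sim * twf(abs(test_time - p))
    return max(votes, key=votes.get)
```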

TWF applied to Scores: In this case, we take the scores produced by a traditional classifier trained on a training set in which the class c of each document is transformed into the derived class ⟨c, p⟩ (where p denotes the time point at which the document was created), thus tying the observed patterns not only to the classes but also to the time point at which they were observed. The scores obtained by the traditional classifier for each ⟨c, p⟩ are then aggregated through a weighted sum, where the weights are given by the TWF. Figure 3 gives a graphical description of this strategy.

Figure 3: TWF Applied to Scores.
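The aggregation step of this strategy can be sketched as follows: given per-⟨class, time point⟩ scores from any base classifier, the final score of each class is a TWF-weighted sum over time points. Names and the toy TWF are ours.

```python
from collections import defaultdict

def aggregate_scores(derived_scores, test_time, twf):
    """derived_scores: dict mapping (class_label, time_point) -> score
    produced by a traditional classifier trained on derived classes <c, p>.
    Returns the predicted class after TWF-weighted aggregation."""
    final = defaultdict(float)
    for (label, p), score in derived_scores.items():
        final[label] += twf(abs(test_time - p)) * score
    return max(final, key=final.get)

# Example: class "A" wins once older evidence is down-weighted by the TWF.
twf = lambda delta: 1.0 / (1.0 + delta)
scores = {("A", 9): 0.7, ("A", 3): 0.2, ("B", 9): 0.4, ("B", 1): 0.9}
print(aggregate_scores(scores, test_time=10, twf=twf))  # prints "A"
```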

TWF applied to Scores (Extended Version): This strategy partitions the training documents into sub-groups composed of documents created at the same time point (hence, with no temporal variation). Traditional classifiers are then applied to each document partition in order to classify the test document based on these several training sets. The scores for class c obtained for each data partition are aggregated through a weighted sum, with the weights given by the TWF. A schematic representation of this strategy is shown in Figure 4.

Figure 4: TWF Applied to Scores (Extended Version).
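A rough sketch of the extended strategy, assuming a scikit-learn-style base classifier with fit/predict_proba and a NumPy or sparse feature matrix, trains one model per time partition and combines their class scores with TWF weights; the partitioning and combination details are simplified assumptions.

```python
from collections import defaultdict
from sklearn.base import clone

def fit_partitions(base_clf, X, y, time_points):
    """Train one copy of base_clf per creation time point p, using only
    the documents created at p (so each partition has no temporal drift)."""
    partitions = defaultdict(list)
    for i, p in enumerate(time_points):
        partitions[p].append(i)
    models = {}
    for p, idx in partitions.items():
        clf = clone(base_clf)
        clf.fit(X[idx], [y[i] for i in idx])
        models[p] = clf
    return models

def predict_twf_ext(models, x_row, test_time, twf):
    """TWF-weighted sum of the per-partition class scores.
    x_row must be a single-row feature matrix."""
    scores = defaultdict(float)
    for p, clf in models.items():
        proba = clf.predict_proba(x_row)[0]
        for label, s in zip(clf.classes_, proba):
            scores[label] += twf(abs(test_time - p)) * s
    return max(scores, key=scores.get)
```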

The three strategies described above were implemented using three classifiers, namely Rocchio, KNN, and Naïve Bayes.

Main Results

We experimentally evaluated the effectiveness and efficiency of the proposed classifiers. To do so, we adopted a 10-fold cross-validation strategy and statistically validated the results using a two-tailed t-test with 99% confidence.
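A minimal sketch of this kind of significance check, assuming per-fold MacroF1 values from the same 10 folds for a baseline and a temporally robust variant (values are illustrative only, and we assume the test is paired across folds):

```python
from scipy.stats import ttest_rel

# Per-fold MacroF1 (%) of a baseline and a temporally robust variant
# over the same 10 cross-validation folds (illustrative values).
baseline = [57.1, 58.0, 56.8, 57.9, 57.2, 57.5, 58.3, 56.9, 57.0, 57.6]
temporal = [59.8, 60.4, 59.1, 60.2, 59.7, 60.0, 60.9, 59.3, 59.5, 60.1]

t_stat, p_value = ttest_rel(temporal, baseline)  # paired, two-tailed
significant = p_value < 0.01                     # 99% confidence level
print(t_stat, p_value, significant)
```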

The temporally robust classifiers obtained statistically significant improvements over the traditional approaches in most cases. As an example, consider Table 2. As we can observe, all temporally robust versions of Rocchio and KNN obtained results statistically superior to their traditional versions (in terms of MacroF1 and MicroF1). We also observe that the temporal version of Naïve Bayes based on applying the TWF to scores incurred significant losses. We attribute this problem to the class imbalance artificially increased by this strategy, as well as to the reduced number of training documents associated with the derived classes ⟨c, p⟩, which prevents accurate estimates of the data distribution. The extended strategy of applying the TWF to scores seeks to attenuate the imbalance problem (although it is still penalized by data scarcity).

Algorithm            |            Rocchio             |              KNN               |           Naïve Bayes
Metric               | macF1(%)        micF1(%)       | macF1(%)        micF1(%)       | macF1(%)         micF1(%)
Traditional          | 57.39           68.24          | 58.48           71.84          | 57.27            73.24
TWF in documents     | 60.02 (+4.58)▲  70.64 (+3.52)▲ | 59.92 (+2.46)▲  73.84 (+2.78)▲ | 60.78 (+6.13)▲   74.11 (+1.19)•
TWF in scores        | 59.85 (+4.29)▲  72.47 (+6.20)▲ | 62.02 (+6.05)▲  74.45 (+3.63)▲ | 44.85 (-27.69)▼  63.93 (-14.56)▼
TWF in scores (ext.) | 59.27 (+3.28)▲  71.39 (+4.62)▲ | 59.78 (+2.22)▲  73.85 (+2.80)▲ | 56.23 (-1.84)•   72.35 (+1.23)•

Table 2: Results Obtained by Incorporating the Statistically Defined TWF into Rocchio, KNN, and Naïve Bayes—ACM-DL.

We also evaluated the use of the automated strategy for determining the TWF. For illustration purposes, Table 3 reports the results on the ACM-DL collection obtained by the temporally robust classifiers using the TWF determined by this strategy. Indeed, the automatic TWF determination procedure proved effective, yielding results statistically equivalent to those obtained using the statistically determined TWF, as can be observed by contrasting Tables 2 and 3. We also compared the effectiveness of this strategy when using the whole training set or only 10% of it to determine the TWF (rows "100% of D" and "10% of D", respectively). As we can observe, with only 10% of the training set it is possible to determine the TWF accurately and obtain results statistically equivalent to those obtained using the entire training set. Clearly, determining the TWF with a reduced training sample leads to a drastic reduction in execution time. For example, determining the TWF with Rocchio takes 4.49 ± 0.04 seconds when using the whole training set, whereas with 10% of the training set the execution time drops to only 0.77 ± 0.02 seconds, a value that is negligible compared to the time spent by the classification task itself.

Finally, we compared our best temporal classifiers with the state-of-the-art SVM in terms of effectiveness and efficiency. As we can observe in Table 4, our best classifiers presented effectiveness statistically equivalent (or even superior) to the SVM, with a much lower execution time (given by the time spent both to train and to test), even considering the fact that the temporal classifiers carry an overhead related to handling the temporal aspect and are inherently lazy classifiers. This clearly highlights the quality of the proposed solutions.

Algorithm                        |            Rocchio             |              KNN               |           Naïve Bayes
Metric                           | macF1(%)        micF1(%)       | macF1(%)        micF1(%)       | macF1(%)         micF1(%)
Traditional                      | 57.39           68.24          | 58.48           71.84          | 57.27            73.24
TWF (100% of D) in documents     | 60.21 (+4.91)▲  70.70 (+3.60)▲ | 60.08 (+2.74)▲  73.88 (+2.84)▲ | 61.38 (+7.18)▲   74.60 (+1.86)•
TWF (10% of D) in documents      | 60.52 (+5.45)▲  70.88 (+3.87)▲ | 61.02 (+4.84)▲  74.27 (+3.82)▲ | 61.44 (+7.28)▲   74.24 (+1.36)•
TWF (100% of D) in scores        | 60.47 (+5.47)▲  72.90 (+6.83)▲ | 61.88 (+5.81)▲  74.53 (+3.74)▲ | 45.16 (-26.82)▼  64.55 (-13.46)▼
TWF (10% of D) in scores         | 59.68 (+3.99)▲  72.40 (+6.10)▲ | 61.37 (+4.94)▲  73.77 (+2.69)▲ | 44.47 (-28.78)▼  64.58 (-13.41)▼
TWF (100% of D) in scores (ext.) | 59.96 (+4.48)▲  71.99 (+5.49)▲ | 59.80 (+2.26)▲  73.95 (+2.94)▲ | 56.28 (-1.76)•   72.73 (-0.70)•
TWF (10% of D) in scores (ext.)  | 59.85 (+4.29)▲  71.79 (+5.20)▲ | 59.76 (+2.19)▲  73.85 (+2.80)▲ | 56.19 (-1.89)•   72.70 (-0.74)•

Table 3: Results Obtained by Incorporating the Automatically Defined TWF into Rocchio, KNN, and Naïve Bayes—ACM-DL.

Algorithm                         | macF1(%)        | micF1(%)        | Time (s)
SVM                               | 59.91           | 73.88           | 144.10 ± 5.30
Rocchio with TWF in scores        | 60.47 (+0.93)•  | 72.90 (−1.34)•  | 9.00 ± 0.00
KNN with TWF in documents         | 59.78 (−0.22)•  | 73.88 (+0.00)•  | 11.03 ± 0.48
KNN with TWF in scores            | 61.88 (+3.29)▲  | 74.53 (+0.88)•  | 10.10 ± 0.31
Naïve Bayes with TWF in documents | 61.38 (+2.45)▲  | 74.60 (+0.97)•  | 9.10 ± 0.32

Table 4: Best temporal classifiers versus SVM—ACM-DL.

Conclusions

In this work we presented a quantitative analysis of the impact of temporal effects on four ADC algorithms widely used by the Machine Learning community, applied to three real textual collections with potentially distinct temporal dynamics. We showed that, contrary to the assumption adopted by most learning algorithms that the data follow a static distribution, the studied collections present distinct temporal dynamics, with variations in the data distribution. Such temporal variations potentially limit the effectiveness of classifiers. Indeed, the conducted analysis showed that all four studied classifiers were negatively affected by the temporal effects, with the most prominent degradations observed when they were applied to the most dynamic collections (ACM-DL and AG-NEWS). Thus, the temporal dimension proves to be an important aspect to be considered in order to provide accurate classifiers.

Beyond quantifying the impact of the temporal effects on ADC algorithms, we proposed three strategies to minimize that impact. These strategies are based on applying what we call the Temporal Weighting Function (TWF). We proposed both a statistical methodology and an automated procedure to determine the TWF. The results obtained by applying the temporally robust classifiers showed that taking temporal information into account leads to statistically significant gains over the traditional approaches. Moreover, the proposed classifiers that obtained the best results proved competitive with the state-of-the-art SVM classifier, both in terms of effectiveness and in terms of execution time.


List of Figures

4.1  Class Distributions in the Three Reference Datasets . . . 27
4.2  Class Distribution Temporal Variation in Each Reference Dataset . . . 31
4.3  Term Distribution Temporal Variation of Each Reference Dataset . . . 32
4.4  Determining the Lower and Upper Levels of CD and CS—ACM-DL . . . 43
4.5  Determining the Lower and Upper Levels of TD—ACM-DL . . . 44
4.6  Determining the Lower and Upper Levels of CD and CS—MEDLINE . . . 46
4.7  Determining the Lower and Upper Levels of TD—MEDLINE . . . 47
4.8  Determining the Lower and Upper Levels of CD and CS—AG-NEWS . . . 48
4.9  Determining the Lower and Upper Levels of TD—AG-NEWS . . . 49
4.10 Cumulative Distribution Function of Document Stability Level Values . . . 54
5.1  Dδ Distribution (Scaled to [0, 1] Interval) . . . 66
5.2  Fitted Temporal Weighting Function with Log-Transformed Data . . . 68
5.3  Estimated Temporal Weighting Function . . . 71
5.4  Graphical Representation of TWF in Documents . . . 72
5.5  Graphical Representation of TWF in Scores . . . 75
5.6  Graphical Representation of Extended TWF in Scores . . . 77
5.7  Relative ⟨c, p⟩ Sizes . . . 83
5.8  Relative ⟨c, p⟩ Sizes for AG-NEWS Dataset . . . 88

List of Tables

2.1  Contingency Table for Classification Effectiveness Evaluation . . . 11
4.1  Adopted Class Identifiers for each Reference Dataset . . . 26
4.2  Pairwise Class Similarity (standard deviations) in ACM-DL . . . 33
4.3  Pairwise Class Similarity (standard deviations) in MEDLINE . . . 33
4.4  Pairwise Class Similarity (standard deviations) in AG-NEWS . . . 34
4.5  Factorial Design—ACM-DL . . . 45
4.6  Factorial Design—MEDLINE . . . 48
4.7  Factorial Design—AG-NEWS . . . 50
4.8  Comparative Study: The Impact of the Temporal Effects on the ADC Algorithms . . . 57
5.1  D'Agostino's D-Statistic Test of Normality . . . 66
5.2  Temporal Distances versus Terms . . . 67
5.3  Estimated Parameters for Both Datasets, with 99% Confidence Intervals . . . 67
5.4  Results Obtained with the Statistically Defined TWF—ACM-DL . . . 81
5.5  Results Obtained with the Statistically Defined TWF—MEDLINE . . . 81
5.6  Results Obtained for the Least and Most Frequent Classes ⟨c, p⟩ Sampling for Naïve Bayes—MEDLINE . . . 82
5.7  Results Obtained with the Estimated TWF—ACM-DL . . . 86
5.8  Results Obtained with the Estimated TWF—MEDLINE . . . 86
5.9  Results Obtained with the Estimated TWF—AG-NEWS . . . 87
5.10 Effectiveness Comparison: Best Performing Temporally-Aware Classifiers versus SVM . . . 90
5.11 Execution Time (in seconds) of each Explored ADC Algorithm . . . 91
5.12 Execution Time Comparison: Best Performing Temporally-Aware Classifiers versus SVM . . . 92
5.13 Execution Time of the TWF Estimation using the Rocchio Classifier . . . 93

List of Algorithms

1  Factorial Design Procedure . . . 37
2  Automatic TWF Determination . . . 70
3  Rocchio-TWF-Doc: Rocchio with Temporal Weighting in Documents . . . 73
4  KNN-TWF-Doc: KNN with Temporal Weighting in Documents . . . 74
5  Naïve Bayes TWF-Doc: Naïve Bayes with Temporal Weighting in Documents . . . 75
6  TWF-Sc: Temporal Weighting in Scores . . . 76
7  TWF-Sc-Ext: Extended Temporal Weighting in Scores . . . 78

Contents

Resumo . . . xi
Abstract . . . xiii
Resumo Estendido . . . xv
List of Figures . . . xxvii
List of Tables . . . xxix
List of Algorithms . . . xxxi

1 Introduction . . . 1
  1.1 Context and Motivation . . . 1
  1.2 Dissertation Hypothesis . . . 2
  1.3 Work Description . . . 2
  1.4 Contributions . . . 5
  1.5 Roadmap . . . 8

2 Preliminaries: Basic Concepts . . . 9
  2.1 Automatic Document Classification . . . 9
  2.2 Evaluation Techniques . . . 11
  2.3 Temporal Representation of Documents . . . 13

3 Related Work . . . 15
  3.1 Problem Overview . . . 15
  3.2 Strategies Overview . . . 17
    3.2.1 Detecting Data Variations . . . 17
    3.2.2 Dealing with Data Variations . . . 17
    3.2.3 Characterizing Data Variations . . . 20
  3.3 Chapter Summary . . . 22

4 A Quantitative Analysis of Temporal Effects on ADC . . . 23
  4.1 Experimental Workload . . . 25
    4.1.1 Reference Datasets . . . 25
    4.1.2 ADC Algorithms . . . 27
  4.2 Characterization of Temporal Effects on Textual Datasets . . . 29
    4.2.1 Class Distribution Temporal Variation . . . 30
    4.2.2 Term Distribution Temporal Variation . . . 31
    4.2.3 Class Similarity Temporal Variation . . . 32
  4.3 Experimental Design . . . 34
    4.3.1 Factorial Design . . . 34
    4.3.2 Applying 2^k r Design in the Characterization of Temporal Effects . . . 38
    4.3.3 Quantifying the Impact of Temporal Effects on ADC . . . 42
  4.4 Discussion . . . 49
    4.4.1 Impact of Temporal Effects on the Reference Datasets . . . 51
    4.4.2 Impact of Temporal Effects on the ADC Algorithms . . . 53
    4.4.3 Implications . . . 57
  4.5 Chapter Summary . . . 58

5 Temporally-Aware Algorithms for ADC . . . 61
  5.1 Temporal Weighting Function . . . 64
  5.2 Fully-Automated TWF Definition . . . 68
  5.3 Temporally-aware ADC . . . 71
    5.3.1 Temporal Weighting in Documents . . . 72
    5.3.2 Temporal Weighting in Scores . . . 74
    5.3.3 Extended Temporal Weighting in Scores . . . 76
  5.4 Results . . . 78
    5.4.1 Parameter Settings . . . 79
    5.4.2 Experiments with the Statistically Defined TWF . . . 80
    5.4.3 Experiments with the Estimated TWF . . . 85
    5.4.4 Runtime Analysis . . . 89
  5.5 Chapter Summary . . . 93

6 Conclusions and Future Work . . . 95
  6.1 A Quantitative Analysis of Temporal Effects on ADC . . . 95
  6.2 Temporally-Aware Algorithms for ADC . . . 96
    6.2.1 Limitations . . . 97
  6.3 Future Work . . . 98

Bibliography . . . 101

Chapter 1

Introduction

In this chapter, we discuss the main motivations and arguments that support this work. We

also briefly describe our work and explicitly state our contributions.

1.1 Context and Motivation

Text classification is still one of the major information retrieval problems, and robust and accurate classification models remain in great demand as a consequence of the increasing complexity and scale of current application scenarios, such as the Web. The task of Automatic Document Classification (ADC) aims at creating models that associate documents with semantically meaningful categories. These models are key components for supporting and enhancing a variety of other tasks such as automated topic tagging (that is, assigning labels to documents), building topic directories, identifying the writing style of a document, organizing digital libraries, improving the precision of Web searching, and even helping users to interact with search engines.

Similarly to other machine learning techniques, ADC usually follows a supervised learning strategy: a training set of already classified documents is employed for creating a classifier. Once built, the classifier is used for predicting classes for a new set of unclassified documents. The majority of supervised algorithms consider that all (pre-classified) documents provide equally important information to discover the features that best identify a (previously unclassified) document's class. However, this may not hold in practice due to several factors such as the document's timeliness, the venue in which it was published, and its authors, among others (de M. Palotti et al., 2010).
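For concreteness, a minimal supervised ADC pipeline of the kind described above could be sketched with scikit-learn as below; the toy dataset, the vectorizer, and the classifier choice are illustrative assumptions, not the dissertation's experimental setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy pre-classified training set: (document text, class label).
train_docs = ["ant colony optimization heuristic", "pheromone trails in ants",
              "support vector machines for text", "ranking web search results"]
train_labels = ["AI", "Biology", "AI", "IR"]

# Bag-of-words (tf-idf) representation + a generative classifier.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

# Predict the (unknown) class of new, unclassified documents.
print(clf.predict(["pheromone based routing algorithm"]))
```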


1.2 Dissertation Hypothesis

In the following, we state the fundamental hypotheses that serve as guidance to this work:

• The temporal evolution of textual data limits the performance of ADC classifiers;

• Distinct textual datasets present differing dynamical behavior;

• Different ADC algorithms may be distinctively affected by the temporal evolution of

data;

• The temporal evolution of data may be explored to devise more effective classification models.

1.3 Work Description

In this work, we are particularly concerned with the impact that the temporal effects may have on ADC algorithms. Due to several factors, such as the dynamics of knowledge and even the dynamics of languages, the characteristics of a textual dataset may change over time. For example, the relative proportion of documents belonging to different classes may change as a consequence of the so-called virtual concept drift (Tsymbal, 2004). Thus, density-based classifiers, which are sensitive to class distribution, may not work well, since the "assumed" class frequencies observed from an independent training set may not represent the "true" frequencies observed when the test document was created (Yang and Zhou, 2008; Zhang and Zhou, 2010). As we shall see, not only may the temporal variations in class frequencies affect classification effectiveness, but so may the relationships between terms and classes. That is, the distribution of terms among classes may vary over time, due to changes in writing style, term usage, and so on. Consider, for instance, the terms pheromone and ant colony. Before the 1990s, they referred exclusively to documents in the area of Natural Sciences. However, after the introduction of the technique of Ant Colony Optimization in the area of Artificial Intelligence, these terms became relevant for classifying Computer Science documents too. In such scenarios, classification effectiveness may deteriorate over time. Therefore, the temporal dynamics of the data is an important aspect that must be taken into account in the learning of more accurate classification models.

As a matter of fact, Mourão et al. (2008) have recently distinguished three different temporal effects that may affect the performance of automatic classifiers. The first effect is the class distribution variation, which accounts for the impact of the temporal evolution on the relative frequencies of the classes. The second effect is the term distribution variation, which refers to changes in the terms' representativeness with respect to the classes as time goes by. The third effect is the class similarity variation, which considers how the similarity among classes, as a function of the terms that occur in their documents, changes over time. The authors showed that accounting for the temporal evolution of documents poses a challenge to learning a classification model, which is usually less effective when such factors are neglected, as assumptions made when the model is built (that is, learned) may no longer hold due to temporal effects.

Despite these previous studies, to the best of our knowledge, a deeper and more thorough analysis of how and to what extent these temporal effects really impact ADC algorithms has not been performed yet. A key aspect to be addressed in this task concerns the peculiar behavior that each temporal effect may present in different datasets. For example, while some datasets may present large class distribution variations over time, other datasets may, in contrast, present a more significant variability in term distribution. Moreover, different ADC algorithms may be distinctively affected by these effects due to their sensitivity or robustness to each specific effect. In other words, the best strategy to handle temporal effects may depend on the specific characteristics of both the dataset and the ADC algorithm used, thus making the learning of a more accurate classification model that deals with these effects an even more challenging task.

In sum, two important questions that must be answered in order to better understand the impact of temporal effects are: (i) Which temporal effects have more influence in each dataset? (ii) What is the behavior of each ADC algorithm when faced with different levels of each temporal effect? In fact, it has already been established that these temporal effects do exist in some collections and negatively affect one specific algorithm, namely the SVM classifier (Mourão et al., 2008). In this work, we take a step further towards answering the posed questions, by proposing a factorial experimental design (Jain, 1991) aimed at quantifying the impact of the temporal effects on four representative ADC algorithms, considering three textual datasets with differing characteristics in their temporal evolution.

Hence, the first part of this dissertation aims at quantifying the impact of temporal effects on ADC algorithms and provides as contributions: (i) a re-visitation of the characterization reported in (Mourão et al., 2008), with the inclusion of a third dataset belonging to a distinct and more dynamic domain, in order to strengthen the argument for the existence of such temporal effects; (ii) the proposal of a methodology to enable a deeper study of the aforementioned temporal effects, by means of a factorial experimental design aimed at uncovering how each temporal effect affects each ADC algorithm and textual dataset; (iii) an instantiation of that methodology considering three real textual datasets and four well-known ADC algorithms, along with a detailed study regarding the impact of the temporal effects on them. Specifically, we focus on four traditional ADC algorithms, namely Rocchio, K Nearest Neighbors (KNN), Naïve Bayes and Support Vector Machine (SVM), and on three different and widely used textual collections covering long time periods, namely, ACM-DL (22 consecutive years), MEDLINE (15 consecutive years) and, finally, AG-NEWS (573 consecutive days).

As we shall see, there is a higher impact of the temporal effects in the ACM-DL and AG-NEWS datasets when compared to the MEDLINE dataset. In the ACM-DL dataset, the impact of class distribution and class similarity variations is statistically equivalent to the impact of the term distribution variation, whereas MEDLINE and AG-NEWS are more impacted by the first two effects. These findings motivate the development of strategies to handle the temporal effects in ADC algorithms according to each dataset's specific dynamical behavior. Furthermore, all four analyzed ADC algorithms suffered a negative impact of the temporal effects in terms of classification effectiveness. Indeed, the most significant performance losses were observed when these algorithms were applied to the most dynamic ACM-DL and AG-NEWS datasets. Extending the results presented in (Mourão et al., 2008) by quantifying the impact of each temporal effect on the ADC algorithms, we here show that the SVM classifier is more resilient to the term distribution effect, while still being impacted by the other two effects. We also show that the other three algorithms, on the other hand, are very sensitive to all three effects. These results corroborate our argument that the temporal dimension is an important aspect that has to be considered when learning accurate classification models.

Based on the performed quantitative analysis of the impact of temporal effects on ADC algorithms, the second part of this dissertation focuses on how to minimize their impact on ADC algorithms. We propose a strategy to incorporate temporal information into document classifiers, aiming at improving their effectiveness by properly handling data with varying distributions. Our strategy is based on the evolution of the term-class relationship over time, captured by a metric of dominance. We start by determining a temporal weighting function for a collection according to its characteristics, based on a series of statistical tests performed to determine its expression, and a curve fitting procedure to determine its parameters. We found that this function follows a lognormal distribution for two datasets we used, namely ACM-DL and MEDLINE. However, the set of statistical tests performed to define the TWF expressions for the ACM-DL and MEDLINE datasets was not able to properly define the TWF expression for the AG-NEWS dataset, which does not follow a (log-)normal distribution. Indeed, the required tests may be prohibitively complex to perform depending on the dataset characteristics, limiting the practical applicability of this strategy. Thus, we also propose an automatic procedure to learn the TWF, without the need to perform such statistical tests.
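The term-class dominance mentioned above can be sketched under our own simplifying assumption that the dominance of a term with respect to a class, at a given time point, is the fraction of that term's document occurrences falling in that class; the dissertation defines the exact metric in Chapter 5.

```python
from collections import defaultdict

def dominance(docs):
    """docs: list of (terms, class_label, time_point) tuples.
    Returns dom[(term, time_point)][class_label]: the fraction of the
    documents containing `term` at `time_point` that belong to
    `class_label` (our simplified reading of the dominance metric)."""
    counts = defaultdict(lambda: defaultdict(int))
    for terms, label, p in docs:
        for t in set(terms):
            counts[(t, p)][label] += 1
    dom = {}
    for key, per_class in counts.items():
        total = sum(per_class.values())
        dom[key] = {c: n / total for c, n in per_class.items()}
    return dom

docs = [(["pheromone", "ant"], "Biology", 1985),
        (["pheromone", "optimization"], "CS", 1999),
        (["pheromone", "colony"], "CS", 1999)]
print(dominance(docs)[("pheromone", 1999)])  # {'CS': 1.0}
```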

The final step is to incorporate the temporal weighting function into ADC algorithms, and we propose three strategies that follow a lazy classification approach. In the three strategies, the weights assigned to each example depend on the notion of a temporal distance δ, defined as the difference between the creation time p of a training example and a reference time point p_r. The first strategy, named temporal weighting in documents, weights training instances according to δ. The second strategy, called temporal weighting in scores, takes into account the scores (e.g., similarities, probabilities) produced by a traditional classifier applied to a modified training set where the class c of each training document is mapped to a derived class c ↦ ⟨c, p⟩, with p denoting the training document's creation point in time, ultimately tying together the observed patterns and both the class and temporal information. A weighted sum of the learned scores is then performed, according to the TWF, and used to make the final classification decision. Finally, the third strategy, named extended temporal weighting in scores, partitions the training set D into sub-groups of documents D_p with the same creation point in time p. Then, a classification model is built based on each D_p in isolation. The class scores are then produced for each D_p and, as before, they are aggregated using the TWF to weight them. We specifically show how these strategies are implemented in three traditional ADC algorithms, namely, Rocchio, k Nearest Neighbors (KNN), and Naïve Bayes.

We evaluated our strategies using three actual textual datasets that span decades (ACM-DL and MEDLINE) or several months (AG-NEWS). The temporally-aware classifiers achieved significant improvements in classification effectiveness, even matching or outperforming the state-of-the-art SVM classifier in some cases, with a drastically reduced execution time.

1.4 Contributions

The specific contributions of this work are:

• a quantification of the impact of three main temporal effects on four widely used ADC algorithms. More specifically,

  – we re-visit the characterization reported in (Mourão et al., 2008), by including a third dataset belonging to a distinct and more dynamic domain, in order to strengthen the argument for the existence of variations in textual data;

  – we propose a methodology to enable a deeper study of the three temporal effects, by means of a factorial experimental design aimed at uncovering how each temporal effect affects each ADC algorithm and textual dataset;

  – we instantiate that methodology considering three real textual datasets and four ADC algorithms, and provide a detailed study regarding the impact of the temporal effects on them;

• the proposal of strategies to minimize the impact of the temporal effects in ADC algorithms. Again, more specifically,

  – we introduce a temporal weighting function to capture the varying behavior of textual datasets, and propose two strategies to devise it;

  – we extend three well-known ADC algorithms to incorporate such a function, devising the temporally-aware algorithms for ADC;

  – we perform an extensive experimental analysis in order to assess the benefits of considering the temporal dynamics of data.

In the following we enumerate the work already published as direct contributions of this dissertation, along with some work published during the M.Sc. course:

• Salles, T., Rocha, L., Pappa, G. L., Mourão, F., Gonçalves, M. A., and Meira Jr., W. Temporally-aware algorithms for document classification. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 307–314, Geneva, Switzerland, 2010.

• Salles, T., Rocha, L., Mourão, F., Pappa, G. L., Cunha, L., Gonçalves, M. A., and Meira Jr., W. Automatic document classification temporally robust. Journal of Information and Data Management, 1(2):199–212, 2010.

• Salles, T., Rocha, L., Mourão, F., Pappa, G. L., Cunha, L., Gonçalves, M. A., and Meira Jr., W. Classificação Automática de Documentos Robusta Temporalmente. In XXIV Simpósio Brasileiro de Banco de Dados, pages 106–119, Fortaleza, Brazil, 2009.

• Salles, T., Rocha, L., Pappa, G. L., Mourão, F., Gonçalves, M. A., and Meira Jr., W. A Quantitative Analysis of the Temporal Effects on Automatic Document Classification. Journal of Machine Learning Research, 2011 (submitted).

• Pappa, G. L., Zadrozny, B., Rocha, L., Salles, T., Meira Jr., W., Gonçalves, M. A. Exploiting Contexts to Deal with Uncertainty in Classification. In Proceedings of the First ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data, pages 19–22, Paris, France, 2009.


• de M. Palotti, J. R., Salles, T., Pappa, G. L., Gonçalves, M. A., and Meira Jr., W. Assessing Documents' Credibility with Genetic Programming. IEEE Congress on Evolutionary Computation, 2011 (to appear).

• de M. Palotti, J. R., Salles, T., Pappa, G. L., Arcanjo, F., Gonçalves, M. A., and Meira Jr., W. Estimating the credibility of examples in automatic document classification. Journal of Information and Data Management, 1(3):439–454, 2010.

• Figueiredo, F., Rocha, L., Couto, T., Salles, T., Gonçalves, M. A., Meira Jr., W. Word Co-occurrence Features for Text Classification. Information Systems, 2011 (in press).


1.5 Roadmap

This work is structured in six chapters. The remainder of the text is organized as follows.

Chapter 2: In this chapter we briefly describe the supervised ADC task and some evaluation

strategies. We also present some of the notational conventions adopted in this work.

Chapter 3: In this chapter we describe related work. We start by discussing some application scenarios where time is an important aspect to be considered. Then, we discuss some of the efforts towards either detecting or handling variations in the data distribution. We distinguish two broad areas for doing so: concept drift and adaptive document classification.

Chapter 4: In this chapter we provide evidence of the existence of temporal effects. We provide an extensive characterization of the properties of three textual datasets with respect to the extent of each temporal effect on them, and quantify the impact of the temporal effects on four well-known ADC algorithms (i.e., Rocchio, K Nearest Neighbors, Naïve Bayes and Support Vector Machine).

Chapter 5: In this chapter we propose three strategies, based on a temporal weighting function (TWF), to address and minimize the impact of the temporal effects in extended versions of three ADC algorithms. We start by introducing the TWF and proposing two strategies to determine it. Then, we describe how to modify three ADC algorithms (namely, Rocchio, K Nearest Neighbors and Naïve Bayes) in order to incorporate the TWF into them, proposing three strategies for doing so.

Chapter 6: Finally, in this chapter we conclude the dissertation, summarize our main findings and propose some directions for further investigation.


Chapter 2

Preliminaries: Basic Concepts

In this work, we are mainly concerned with Automatic Document Classification (ADC), a well-studied subject related to the classification problem,¹ considering a supervised learning paradigm. This chapter serves two main purposes: (i) to briefly describe the supervised ADC task and some evaluation strategies, in order to provide the reader with some basic notions on the subject; and (ii) to present some notational conventions adopted in this work.

¹Also known as the discrimination problem in the statistics literature.

2.1 Automatic Document Classification

The purpose of supervised ADC algorithms is to predict the unknown class of a document, based on a set of already classified documents (Sebastiani, 2002). Let d_i = (\vec{x}_i, c_i) be a document, where \vec{x}_i denotes its vectorial (bag-of-words) representation and c_i ∈ C a categorical attribute (or response variable) indicating its class (C is a finite set composed of all the possible classes). The main goal of an ADC algorithm is thus to learn a discrete approximation of the class a posteriori probability distribution P(c_i|d_i), which underlies the relationships between documents and their associated classes. This probability distribution is learned from a training set composed of already classified documents. There are two approaches for doing so, based either on a direct estimation of P(c_i|d_i) or on an indirect estimation of P(c_i|d_i).

The first approach, which defines the so-called discriminative classifiers, learns the class boundaries that minimize the error rate (or some correlated measure), ultimately discriminating between classes without making any assumption regarding the probability density function of each class. On the other hand, the second approach, which defines the generative classifiers, learns the class conditional probability distribution and the a priori class probabilities to estimate the class a posteriori


probability distribution P(c_i|d_i). In this case, one should assume a model for the class densities P(d_i|c_i), whose parameters are estimated from the training set. For example, a normal distribution may be chosen, with its mean and variance parameters estimated from the already classified data. The class a posteriori probability distribution P(c_i|d_i) is then estimated according to Bayes' rule:

    P(c_i \mid d_i) = \frac{P(c_i) \, P(d_i \mid c_i)}{\sum_{c' \in C} P(c') \, P(d_i \mid c')},    (2.1)

where P(c_i) denotes the class priors and P(d_i|c_i) denotes the class densities.
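As a minimal sketch of this generative route, assuming one-dimensional Gaussian class densities purely for illustration (not the setting used in this dissertation), Equation 2.1 could be applied as follows; all names are hypothetical.

    import math

    def gaussian_pdf(x, mean, var):
        # density of a normal distribution with the given mean and variance
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    def posterior(x, priors, params):
        # priors: {class: P(c)}; params: {class: (mean, var)} estimated from training data
        joint = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in priors}
        norm = sum(joint.values())                  # denominator of Equation 2.1
        return {c: joint[c] / norm for c in joint}  # P(c | x) for every class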

Informally, given a training set of already classified documents with feature measurements, we build a classification model, or learner, which will enable us to classify a new unseen document. A good learner is one that accurately predicts such a class. From the perspective of function approximation, this translates into finding a good approximation \hat{f} of the function f : D_U → C that underlies the predictive relationship between the documents and their associated classes, based on the training set D ⊂ D_U, where D_U denotes the input space composed of both classified and unclassified documents.

In order to assess how good an approximation is, one should consider the generalization capabilities of the approximated \hat{f}. Recall that \hat{f} is an approximation based on the training set, that is, \hat{f} : D → C. The quality of such an approximation refers to how well \hat{f} predicts the classes of unseen documents (i.e., documents d' ∉ D). This is assessed by the generalization capability of \hat{f}. Clearly, a function \hat{f} that accurately predicts the class of documents from D may not be accurate when predicting the class of documents from D_U \ D (i.e., the set of unclassified documents).² In this case, we say that \hat{f} is overfitted w.r.t. D. Hence, there exists a trade-off between the complexity of \hat{f} (the more complex \hat{f} is, the more specific are the patterns learned from the training set) and the generalization power of \hat{f} (patterns that are too specific to D may not be observed in D_U \ D).

²A \ B denotes the set difference between A and B, that is, the set composed of the elements in A that are not in B.

It has already been proved that, asymptotically, the discriminative classifiers are superior to the generative ones (Vapnik, 1998), with several reported experiments corroborating this superiority (Drummond, 2006). In fact, if there are not enough training examples, the parametric model is deemed to overfit, decreasing its generalization power (Hastie et al., 2009). However, some authors claim, based on experimental evaluation, that with realistic training set sizes the generative classifiers can also perform as well as or better than the discriminative ones. This holds if the parametric model assumed by the generative classifier is correct. In this case, the class priors become useful information, which is ignored by the discriminative classifiers. As will be described in Section 4.1, in this work we consider both generative classifiers (represented by the Naïve Bayes classifier) and



discriminative classifiers (represented by the Rocchio, K Nearest Neighbors and Support Vector Machine classifiers).

2.2 Evaluation Techniques

An important aspect to be considered is how to evaluate the effectiveness of a classifier (that is, its accuracy in classifying unseen data or, in other words, its generalization power). This is assessed by first learning a classification model based on the training set and then applying it to classify a set of unseen documents (the test set). Some measures of classification effectiveness are then used to assess the quality of the learned classification model. Several measures for this purpose have been proposed in the literature and some of them are widely used by the machine learning community. Perhaps the most used measures are precision, recall and the F1 measure. In order to describe each of these measures, consider the contingency table represented in Table 2.1 (also known as a confusion matrix), where TP, TN, FP and FN denote, respectively, the number of true positives, true negatives, false positives and false negatives, defined as:

True Positive (TP): positive test document correctly classified into the positive class,

True Negative (TN): negative test document correctly classified into the negative class,

False Positive (FP): negative test document incorrectly classified into the positive class,

False Negative (FN): positive test document incorrectly classified into the negative class.

The precision p of a performed classification denotes the fraction of all documents assigned to the positive class c_i by the classifier that really belong to c_i. In terms of the contingency table, this translates into

    p = \frac{TP}{TP + FP}.

                              Ground Truth (positive class = c_i)
                              c_i          Not c_i
    Prediction    c_i         TP           FP
                  Not c_i     FN           TN

    Table 2.1: Contingency Table for Classification Effectiveness Evaluation.


The recall r of a performed classification denotes the fraction of all documents that belong to the positive class c_i that were correctly assigned to c_i by the classifier. Again, in terms of the contingency table, this can be expressed as

    r = \frac{TP}{TP + FN}.

Finally, the F1 measure is defined as the harmonic mean of precision and recall, given by

    F_1 = \frac{2pr}{p + r}.
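A minimal sketch of these three measures computed from the entries of a contingency table such as Table 2.1 (the function and variable names are illustrative):

    def precision_recall_f1(tp, fp, fn):
        # tp, fp, fn: entries of the contingency table for the positive class
        p = tp / (tp + fp) if (tp + fp) > 0 else 0.0   # precision
        r = tp / (tp + fn) if (tp + fn) > 0 else 0.0   # recall
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        return p, r, f1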

There are two conventional methods to evaluate classification algorithms when applied to problems with more than two classes, namely by micro-averaging and macro-averaging the F1 measure. The micro-averaged F1 (microF1) is calculated from a global contingency table (similar to Table 2.1), with precision and recall computed over the sums of the corresponding entries of the per-class tables:

    p_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}, \qquad
    r_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}.

In contrast, the macro-averaged F1 (macroF1) is calculated by first computing the precision and recall values for each class and then averaging them:

    p_{macro} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FP_i}, \qquad
    r_{macro} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FN_i}.

Notice that the main difference between both strategies is that the microF1 is a document-

pivoted measure that gives equal weights to the documents while the macroF1 measure is a

class-pivoted measure that gives equal weights to the classes.
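The sketch below illustrates the two averaging schemes from per-class contingency counts; the list-of-tuples input layout is an assumption made only for this example.

    def micro_macro_f1(per_class_counts):
        # per_class_counts: list of (TP_i, FP_i, FN_i), one tuple per class
        def f1(p, r):
            return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

        # microF1: sum the entries of all per-class tables first
        tp = sum(c[0] for c in per_class_counts)
        fp = sum(c[1] for c in per_class_counts)
        fn = sum(c[2] for c in per_class_counts)
        p_micro = tp / (tp + fp) if (tp + fp) else 0.0
        r_micro = tp / (tp + fn) if (tp + fn) else 0.0
        micro = f1(p_micro, r_micro)

        # macroF1: average per-class precision and recall, then combine
        n = len(per_class_counts)
        p_macro = sum(t / (t + f) if (t + f) else 0.0 for t, f, _ in per_class_counts) / n
        r_macro = sum(t / (t + m) if (t + m) else 0.0 for t, _, m in per_class_counts) / n
        macro = f1(p_macro, r_macro)
        return micro, macro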

Since the ADC task is inherently a stochastic process, it is fundamental to adopt evaluation strategies that guarantee the statistical validity of the obtained classification results, which is achieved by replicating the experiments using different training sets to learn a classification model. For this purpose, cross validation has become a standard in the machine learning community. There are at least two usual strategies for cross validation: K-fold cross validation and repeated random sub-sampling (Kohavi, 1995).


The K-fold cross validation consists of randomly splitting the data into K independent folds. At each iteration, one fold is retained as the test set, and the remaining K − 1 folds are used as the training set. The repeated random sub-sampling consists of randomly selecting a fraction of the documents from the dataset, without replacement, to compose the test set, with the remaining documents retained as the training set; this is performed for each replication. Since in K-fold cross validation the size of the folds depends on the number of iterations, it is more suitable for medium/large datasets, whereas repeated random sub-sampling is usually adopted for small datasets when the number of replications is large.
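As a plain illustration of the two replication schemes, not tied to any particular library, the following sketch generates K-fold and repeated random sub-sampling splits over document indices (names and defaults are illustrative):

    import random

    def kfold_splits(n_docs, k, seed=42):
        # shuffle indices once and cut them into k folds; each fold is the test set in turn
        idx = list(range(n_docs))
        random.Random(seed).shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, test

    def random_subsampling(n_docs, test_fraction, replications, seed=42):
        # for each replication, draw a fresh random test set without replacement
        rng = random.Random(seed)
        n_test = int(n_docs * test_fraction)
        for _ in range(replications):
            idx = list(range(n_docs))
            rng.shuffle(idx)
            yield idx[n_test:], idx[:n_test]   # (train, test)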

For more details on ADC and evaluation strategies, we refer the reader to

(Baeza-Yates and Ribeiro-Neto, 2011; Hastie et al., 2009; Manning et al., 2008).

2.3 Temporal Representation of Documents

In this work we deal with the documents’ timeliness, represented by their creation points in

time. We consider time as a discrete attribute associated todocuments. Thus, we represent

the documents by a tripledi = (~xi, ci, pi), where~xi denotes the vectorial “bag of words”

representation ofdi, ci denotes its associated class andpi denotes its creation point in time.

An important aspect to consider refers to the temporal unit used. The temporal unit

should be the minimum time interval between relevant changes observed in data and is,

clearly, dataset dependent. For example, since scientific conferences are usually annual,

relevant changes usually occur yearly, and the temporal unit should be one year. On the

other hand, the temporal unit to be used for data from published news articles should be

more fine grained (e.g., one day or one month).
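A minimal sketch of the triple representation d_i = (x_i, c_i, p_i) adopted here; the field names and the example values are illustrative only.

    from typing import Dict, NamedTuple

    class Document(NamedTuple):
        bow: Dict[str, float]   # bag-of-words vector x_i (term -> weight)
        label: str              # class c_i
        created: int            # creation point in time p_i, in the dataset's temporal unit

    # e.g., a scientific article represented with a yearly temporal unit
    doc = Document(bow={"classification": 2.0, "temporal": 1.0},
                   label="Information Systems", created=1998)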


Chapter 3

Related Work

In this chapter, we discuss related work. First, we report some efforts related to the dissertation's target problem, that is, the impact of varying data distributions on learning algorithms when applied to some important scenarios. Then, we focus our attention on works aimed at either detecting or dealing with this problem.

3.1 Problem Overview

A fundamental assumption of the vast majority of automatic classifiers is that the data used to learn a classification model are random samples, independently and identically distributed (i.i.d.), drawn from a stationary distribution that also governs the test data. However, this may not be the case. In fact, in many (perhaps most) real-world classification problems, the training data may not be randomly drawn from the same distribution as the test data (to which the classifier will be applied), since there are variations in the underlying data distribution. Hence, the success of classification algorithms may be diminished when they are faced with real-world time-varying data. As argued by Alonso et al. (2007), "time is an important dimension of any information space and can be very useful in information retrieval".

As analyzed by Kelly et al. (1999), the observed variations in the data distributions may be reflected in at least three aspects:

1. varying a priori class probabilities P(c_i);

2. varying class a posteriori probabilities P(c_i|d_i);

3. varying class densities P(d_i|c_i).

Notice that, according to Equation 2.1, since P(c_i|d_i) depends on P(d_i|c_i), both the generative and the discriminative classifiers that assume a static underlying data distribution


are deemed to be error-prone when faced with non-stationary data. This problem becomes critical, as it is not a hard task to enumerate real-world examples of scenarios in which automatic classification procedures are applied to inherently dynamic data. For example, in spam filtering applications the ultimate goal is to filter out undesired spam messages. However, spammers actively change the nature of their messages to elude the spam filters, and developing strategies that take such dynamic behavior into account becomes a necessary task to guarantee the effectiveness of the filters (Fdez-Riverola et al., 2007).

Another example relates to the information filtering techniques employed by personal assistance applications aimed at personalizing the flow of information according to the user's interests. A specific type of information filtering technique is to recommend information items to users according to their interests. This is accomplished by predicting which information items meet the users' interests, based on their profiles. Clearly, changes in user interests are problematic and should be addressed in order to guarantee effective recommendations. Thus, modeling the temporal dynamics of user interests should be a key concern when designing such systems. Indeed, there was recently an open competition for the best filtering algorithm to predict user ratings for films, based on previous ratings (the Netflix Prize). The winners of the contest explored the temporal aspect as one of the keys to the problem, considering that both movie popularity and user preferences change over time (Koren, 2010). This reinforces the importance of properly handling dynamic data. Yet another example is automatic credit card fraud detection (Wang et al., 2003), where previously observed patterns regarding fraudulent credit card transactions are used to learn a classification model that is able to predict the legitimacy of new transactions. However, such patterns also change over time, and should be taken into account in order to avoid fraudulent transactions. It should be clear by now that variations in the data distribution pose an important problem to be tackled in order to improve the effectiveness of learning algorithms.

In this work, we focus on the temporal dynamics observed in textual datasets. As a matter of fact, due to several factors, such as the dynamics of knowledge and even the dynamics of languages, the characteristics of textual data may change over time (Mourão et al., 2008). As previously discussed, automatic document classifiers may have trouble with this kind of data. Thus, this work tackles the following problem:

Problem 1 (Problem Statement) The majority of automatic document classifiers assume a stationary data distribution. However, in many (perhaps most) real-world classification problems this premise is violated, making it important to consider the temporal dynamics of the data in order to boost the effectiveness of the classifiers.


3.2 Strategies Overview

Although ADC is a widely studied subject, the analysis of temporal aspects in this class of

algorithms is quite recent—it has been studied only in the last decade. Most previous studies

have focused on detecting and dealing with these effects to improve classification quality,

whereas we are aware of only one prior effort towards characterizing the impact of temporal

effects on ADC effectiveness.

3.2.1 Detecting Data Variations

We start by reviewing previous attempts to detect significant changes in the underlying data distribution due to temporal effects. Gama et al. (2004) presented a method to detect changes in the distribution of the training examples by means of an online classifier that carries out a sequence of classification trials. On each trial, it makes some predictions and receives feedback accounting for the classification error, in order to detect significant changes in the data at hand. This approach is able to detect both gradual and abrupt changes. Similarly, Nishida and Yamauchi (2009) propose a system to detect and predict changing distributions by managing a set of offline and online classifiers to account for, respectively, data variations and classifiers' prediction errors. Furthermore, the system also performs a clustering step to allow the prediction of future variations. Other studies explore statistical tests to detect drift (Dries and Rückert, 2009; Nishida and Yamauchi, 2007). In (Dries and Rückert, 2009), for instance, the authors propose three adaptive tests that are capable of adapting to different (gradual or abrupt) changing behaviors. In (Nishida and Yamauchi, 2007), the authors propose to classify a set of examples belonging to a recent time window, and to compare the achieved accuracy against the one obtained with a global classifier that considers all available data. The basic idea is that statistically significant decreases in accuracy suggest data variations. Such a solution is able to quickly detect drift when the window size is small, at the cost of being susceptible to data sparseness.

3.2.2 Dealing with Data Variations

Previous efforts to deal with varying data distributions can be categorized into two broad areas, namely, adaptive document classification and concept drift.

3.2.2.1 Adaptive Document Classification

Adaptive document classification (Cohen and Singer, 1999) embodies a set of techniques to

deal with changes in the underlying data distribution so as to improve the effectiveness of


document classifiers through incremental and efficient adaptation of the classification models. Adaptive document classification brings three main challenges to document classification (Liu and Lu, 2002). The first one is the definition of a context and how it may be exploited to devise more accurate classification models. A context is a semantically significant set of documents. Previous research suggests that contexts may be determined through at least two strategies: identification of terms neighboring a certain keyword (Lawrence and Giles, 1998), and identification of terms that indicate the scope and semantics of the document (Caldwell et al., 2000). In our case, the strategies to deal with varying data distributions explore the stability of terms, which can be seen as a kind of (temporal) context, but at a finer granularity (i.e., terms). The second challenge is how to build the classification models incrementally (Kim et al., 2004), whereas the third challenge relates to the computational efficiency of the resulting classifiers. Here, we do not consider the incremental construction of classification models. Our temporally-aware classifiers use the temporal information to learn more accurate classification models, instead of updating them in an incremental fashion. Clearly, this is a natural extension of our work and we intend to consider it in the future.

3.2.2.2 Concept Drift

Concept or topic drift (Tsymbal, 2004) comprises another relevant set of efforts to deal with varying data distributions in classification. A prevailing approach to address concept drift is to completely retrain the classifier according to a sliding window, which ultimately involves example selection techniques. A number of previous studies fall into this category. For instance, the method presented in (Klinkenberg and Joachims, 2000) maintains a window with examples sufficiently "close" to the current target concept, and automatically adjusts the window size so that the estimated generalization error is minimized. In (Žliobaite, 2009), a classification model is built using training examples that are close to the test examples in terms of both time and space. The methods presented in (Klinkenberg, 2004) either maintain an adaptive time window on the training data, select representative training examples, or weight them. Widmer and Kubat (1996) describe a set of algorithms that react to concept drift in a flexible way and can take advantage of situations where contexts reappear. The main idea of these algorithms is to keep only a window of currently trusted examples and hypotheses, and to store concept descriptions in order to reuse them if a previous context reappears. In (Rocha et al., 2008), the authors introduce the concept of temporal context, defined as a subset of the dataset that minimizes the impact of temporal effects on the performance of classifiers. They also propose an algorithm, named Chronos, to identify these contexts based on the stability of the terms in the training set. Temporal contexts are used to sample the training examples for the classification process, and examples considered to be outside


the temporal context are discarded by the classifier.

Unlike previous efforts that use a single window to determine drift in the data, Lazarescu et al. (2004) present a method that uses three windows of different sizes to estimate the change in the data. While algorithms that use a window of fixed size impose hard constraints over drift patterns, those that use heuristics to adjust the window size to the current extent of concept drift often involve many parameters to be calibrated. In order to provide some theoretical basis for the choice of window size, Kuncheva and Žliobaite (2009) developed a framework relating the classification error to the window size, aiming at providing an optimal window size choice. Such an optimal choice leads to statistically significant improvements in window-based strategies. Following this direction, in (Bifet and Gavaldà, 2006) the authors propose a window-based strategy for drifting data streams, called ADWIN, that automatically chooses the optimal window size. This approach keeps a window W with the most recent data and splits it into two adjacent sub-windows W_0 and W_1. Using statistical tests to compare both windows, it detects when a drift has occurred. In this case, all possible adjacent sub-windows must be considered. Clearly, this is a costly operation (both in terms of time and memory). In (Bifet and Gavaldà, 2007), the authors propose an improvement over ADWIN, called ADWIN2, with the same effectiveness guarantees as ADWIN and more efficient data structures.

Window-based approaches may be considered too rigid, since they may miss valuable information lying outside the window. Accordingly, a second approach to deal with concept drift consists in properly weighting training examples while building the classification model, in order to reflect the temporal variations in the underlying data distribution, instead of simply discarding them.¹ Following this direction, Koychev (2000) defined a linear time-based utility function to account for variations in the data distribution, such that the impact of the examples on the classification model decreases with time. Experimental evaluation conducted with the Naïve Bayes and ID3 algorithms showed the effectiveness of such an approach. In (Klinkenberg and Rüping, 2003), the authors defined an exponential time-based function in order to weight examples based on their age. The reported experimental evaluation showed that weighting examples in drifting scenarios leads to significant improvements over fixed-window strategies, while being outperformed by an adaptive-window approach. However, such time-based utility functions are typically defined in a very ad hoc manner (e.g., linear functions, exponential functions, etc.), without any theoretical justification built from changes in data patterns.

¹In this sense, window-based approaches can be thought of as a type of binary weighting function.
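For concreteness, two generic time-based weights of the kind mentioned above (one linear, one exponential) could look like the sketch below; the parameter names are illustrative and these are not the exact functions used in the cited works.

    import math

    def linear_decay(age, rate):
        # weight decreases linearly with the example's age, never below zero
        return max(0.0, 1.0 - rate * age)

    def exponential_decay(age, lam):
        # weight decreases exponentially with the example's age
        return math.exp(-lam * age)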

Thus, the following question remains unanswered: how can we properly define such a time-based utility function? In order to answer that question, not only the temporal distance



between training and test examples should be considered, but also the varying characteristics of the underlying data distribution. Following this direction, in this work we report a statistical analysis of the temporal effects on three textual datasets in order to define a temporal weighting function (TWF) which properly models the changing behavior of the underlying data distribution, reflecting its dynamic nature, and capturing both the temporal distance between training and test examples and the variations in the characteristics of the dataset (Salles et al., 2010b). We also propose three instance weighting strategies that employ the temporal weighting function to deal with these temporal effects (Salles et al., 2010a). We applied these strategies to three well-known ADC algorithms, namely, Rocchio, KNN and Naïve Bayes, and, as reported in Section 5.4, we found that the new temporally-aware classifiers achieve statistically significant gains over their traditional counterparts.

Another common approach to deal with concept drift focuses on the combination of various classification models generated from different algorithms (ensembles), pruning them or adapting their weights according to recent data (Folino et al., 2007; Kolter and Maloof, 2003; Scholz and Klinkenberg, 2007). Scholz and Klinkenberg (2007) proposed a boosting-like method to train a classifier ensemble from data streams. It naturally adapts to concept drift and allows one to quantify the drift in terms of its base learners. The algorithm was shown to outperform learning algorithms that ignore concept drift. In the same direction, Kolter and Maloof (2003) presented a technique that maintains an ensemble of base learners, predicts instance classes using a weighted-majority vote of these "experts", and dynamically creates and deletes experts in response to changes in performance. Additionally, Folino et al. (2007) proposed to build an ensemble of classifiers using genetic programming to inductively generate decision trees. In spite of these prior proposals, one important challenge of approaches based on classifier ensembles is the efficient management of multiple models. As a matter of fact, one of our proposed strategies is also based on the combination of various classification models, but with a much simpler way to manage them, by exploiting the TWF.

3.2.3 Characterizing Data Variations

In addition to the aforementioned studies, which aim at either detecting or exploiting changes in the data distribution, in (Forman, 2006) the author provides a characterization of varying data distributions in the textual data domain, where the concept drift problem is studied considering three main types of data variations: (i) shifting class distribution, which is reflected by the observed variations over time in the proportion of documents assigned to each class; (ii) shifting subclass distribution, which accounts for varying feature distributions; and, finally, (iii) fickle concept drift, which denotes the cases where documents are


assigned to distinct classes at different points in time. Moreover, in that work, the author proposes a visualization tool aimed at analyzing the feature space (in a binary classification setting) and thus providing clues about the varying behavior of the most predictive features as time goes by. A real textual dataset, composed of news articles, was characterized according to the three mentioned drifting patterns and was shown to be a very dynamic dataset.

Following this direction, in (Mourão et al., 2008), the authors provide a characterization of these changes in terms of three main temporal effects: (i) the class distribution variation, which accounts for the impact of the temporal evolution on the relative frequencies of the classes; (ii) the term distribution variation, which refers to changes in the representativeness of the terms with respect to the classes as time goes by; and, finally, (iii) the class similarity variation, which considers how the similarity among classes, as a function of the terms that occur in their documents, changes over time. In fact, the class distribution variation and the term distribution variation effects correspond, respectively, to the shifting class distribution and the shifting subclass distribution discussed in (Forman, 2006). Furthermore, while the class similarity variation effect is not analyzed in (Forman, 2006), the fickle drifting pattern is not considered in (Mourão et al., 2008). As a matter of fact, the fickle drift type, which corresponds to the change of class of a given document due to an occasional correction, is probably the most difficult case to be handled. These are very rare events which may not affect the classifier effectiveness, and even the strategies discussed in (Forman, 2006) to handle concept drift do not deal with this case. Hence, here we focus on the three temporal effects analyzed in (Mourão et al., 2008), adopting the authors' proposed nomenclature.

Building upon the characterization reported in both studies, we here propose a methodology to enable a deeper study of temporal effects. We propose to use a factorial experimental design to quantify to what extent each of these variations impacts ADC algorithms, according to datasets with distinct temporal dynamics. This quantitative analysis is an advance over the aforementioned studies, since both analyze the variations in the data distribution in a purely qualitative manner. We also instantiate the proposed methodology using three real textual datasets and four traditional ADC algorithms. In comparison with previous work, our characterization methodology and results contribute directly to the definition of more successful strategies to deal with and to exploit temporal effects. They also provide valuable insights into the behavior of the analyzed algorithms when faced with changing distributions.

It is interesting to notice that, while the majority of the aforementioned works aimed at dealing with varying data distributions typically consider scenarios characterized by the classification of future data (with older data becoming obsolete as time goes by), here we propose an approach to classify documents in scenarios where we may have information about both the past and the future when classifying the test data, and this information may change over time. For example, considering a training set composed of documents created


between the years 1980 and 2011, when classifying a test document created in the year 2000 we take into account both past and future data. It should be noticed, however, that our approach may be easily adapted to scenarios where we only have past information, such as Adaptive Document Classification and Concept Drift.

3.3 Chapter Summary

We discussed in this chapter the importance of considering the temporal dynamics of data

in machine learning techniques. We also reported some work aimed at either detecting or

handling variations in the data distribution in automatic classification tasks. We saw that the

main approaches for detecting data distribution variations are based on statistical tests and

classifier ensembles. Moreover, we discussed three main techniques for handling varying

data distributions (instance selection, instance weighting and ensembles) along with their

merits and drawbacks. Throughout the discussion, we pointed out how our work advances

the current research efforts.


Chapter 4

A Quantitative Analysis of Temporal Effects on ADC

In this chapter, we are particularly concerned with the impact that temporal effects may have on ADC algorithms. Due to several factors, such as the dynamics of knowledge and even the dynamics of languages, the characteristics of a textual dataset may change over time. For example, the relative proportion of documents belonging to different classes may change as a consequence of the so-called virtual concept drift (Tsymbal, 2004). Thus, density-based classifiers, which are sensitive to class distribution, may not work well, since the "assumed" class frequencies observed from an independent training set may not represent the "true" frequencies observed when the test document was created (Yang and Zhou, 2008; Zhang and Zhou, 2010). As we shall see, not only may the temporal variations in class frequencies affect classification effectiveness, but also the relationships between terms and classes. That is, the distribution of terms among classes may vary over time, due to changes in writing style, term usage, and so on. In such scenarios, the classification effectiveness may deteriorate over time. Therefore, the temporal dynamics of the data is an important aspect that must be taken into account in the learning of more accurate classification models.

As a matter of fact, Mourão et al. (2008) have recently distinguished three different temporal effects that may affect the performance of automatic classifiers. The first effect is the class distribution variation, which accounts for the impact of the temporal evolution on the relative frequencies of the classes. The second effect is the term distribution variation, which refers to changes in the terms' representativeness with respect to the classes as time goes by. The third effect is the class similarity variation, which considers how the similarity among classes, as a function of the terms that occur in their documents, changes over time. The authors showed that accounting for the temporal evolution of documents poses a challenge to learning a classification model, which is usually less effective when such factors are


neglected, as assumptions made when the model is built (that is, learned) may no longer hold due to temporal effects.

Despite these previous studies, to the best of our knowledge, a deeper and more thorough analysis of how and to what extent these temporal effects really impact ADC algorithms has not yet been performed. A key aspect to be addressed in this task concerns the peculiar behavior that each temporal effect may present in different datasets. For example, while some datasets may present large class distribution variations over time, other datasets may, in contrast, present a more significant variability in term distribution. Moreover, different ADC algorithms may be affected differently by these effects due to their sensitivity or robustness to each specific effect. In other words, the best strategy to handle temporal effects may depend on the specific characteristics of both the dataset and the ADC algorithm used, thus making the learning of a more accurate classification model that deals with these effects an even more challenging task.

In sum, two important questions that must be answered in order to better understand the impact of temporal effects are: (i) Which temporal effects have greater influence in each dataset? (ii) What is the behavior of each ADC algorithm when faced with different levels of each temporal effect? In fact, it has already been established that these temporal effects do exist in some collections and negatively affect one specific algorithm, namely the SVM classifier (Mourão et al., 2008). In this chapter, we take a step further towards answering the posed questions, by proposing a factorial experimental design (Jain, 1991) aimed at quantifying the impact of the temporal effects on four representative ADC algorithms, considering three textual datasets with differing characteristics in their temporal evolution.

The original contributions of this chapter are: (i) a re-visitation of the characterization reported in (Mourão et al., 2008), with the inclusion of a third dataset belonging to a distinct and more dynamic domain, in order to strengthen the argument for the existence of such temporal effects; (ii) the proposal of a methodology to enable a deeper study of the aforementioned temporal effects, by means of a factorial experimental design aimed at uncovering how each temporal effect affects each ADC algorithm and textual dataset; (iii) an instantiation of that methodology considering three real textual datasets and four well-known ADC algorithms, along with a detailed study regarding the impact of the temporal effects on them. Specifically, we focus on four traditional ADC algorithms, namely Rocchio, K Nearest Neighbors (KNN), Naïve Bayes and Support Vector Machine (SVM), and on three different and widely used textual collections covering long time periods, namely, ACM-DL (22 consecutive years), MEDLINE (15 consecutive years) and, finally, AG-NEWS (573 consecutive days).

As we shall see, there is a higher impact of the temporal effects in the ACM-DL and

AG-NEWS datasets when compared to the MEDLINE dataset. In the ACM-DL dataset,


the impact of the class distribution and class similarity variations is statistically equivalent to the impact of the term distribution variation, whereas MEDLINE and AG-NEWS are more impacted by the first two effects. These findings motivate the development of strategies to handle the temporal effects in ADC algorithms according to each dataset's specific dynamic behavior. Furthermore, all four analyzed ADC algorithms suffered a negative impact of the temporal effects in terms of classification effectiveness. Indeed, the most significant performance losses were observed when these algorithms were applied to the most dynamic ACM-DL and AG-NEWS datasets. Extending the results presented in (Mourão et al., 2008) by quantifying the impact of each temporal effect on the ADC algorithms, we here show that the SVM classifier is more resilient to the term distribution effect, while still being impacted by the other two effects. We also show that the other three algorithms, on the other hand, are very sensitive to all three effects. These results corroborate our argument that the temporal dimension is an important aspect that has to be considered when learning accurate classification models.

This chapter is organized as follows: In Section 4.1 we describe the workload used in our experimental design, that is, the reference datasets and the analyzed ADC algorithms. An extension of the characterization done by Mourão et al. (2008), providing evidence of the existence of temporal effects in three textual datasets, is presented in Section 4.2. Next, Section 4.3 describes the factorial experimental approach proposed as a methodology to provide a more precise picture of the impact of temporal effects on different ADC algorithms, whereas the results of applying the proposed methodology to the considered datasets and ADC algorithms are discussed in Section 4.4. Finally, Section 4.5 summarizes our findings.

4.1 Experimental Workload

In this section, we present the experimental workload used in our analysis and in the remaining chapters. We provide a brief description of the three reference datasets (Section 4.1.1) as well as of the four ADC algorithms analyzed (Section 4.1.2).

4.1.1 Reference Datasets

The three reference datasets considered in our study consist of sets of textual documents, each one assigned to a single class (a single-label problem). For clarity purposes, throughout this dissertation we refer to each class by a corresponding identifier, as listed, for each dataset, in Table 4.1. The considered datasets are:

ACM-DL: a subset of the ACM Digital Library with 24897 documents containing articles related to Computer Science created between 1980 and 2002. We considered only


the first level of the taxonomy adopted by the ACM, including 11 classes, which remained the same throughout the period of analysis. The distribution of the 24897 documents among the 11 classes, over the entire time period, is presented in Figure 4.1a.

MEDLINE: a derived subset of the MedLine dataset, with 861454 documents classified into 7 distinct classes related to Medicine, and created between the years of 1970 and 1985. The class distribution of the 861454 documents during the entire time period is depicted in Figure 4.1b.

AG-NEWS: a collection of 835795 news articles, classified into 11 distinct classes, that spans 573 days. This dataset presents some interesting characteristics that are typical of news datasets. For instance, some topics appear and disappear very suddenly due to periodic or ephemeral events. Moreover, there is a higher variability in the meaning of the terms, along with a greater extent of class imbalance, due to the very dynamic nature of the news domain. The class distribution, spanning the whole 573-day period, is shown in Figure 4.1c.

    ACM-DL                               MEDLINE                       AG-NEWS
    0. General Literature                0. Aids                       0. Business
    1. Hardware                          1. Bioethics                  1. Science & Technology
    2. Computer Systems Organization     2. Cancer                     2. Entertainment
    3. Software                          3. Complementary Medicine     3. Sports
    4. Data                              4. History                    4. United States
    5. Theory of Computation             5. Space Life                 5. World
    6. Mathematics of Computing          6. Toxicology                 6. Health
    7. Information Systems                                             7. Top News
    8. Computing Methodologies                                         8. Europe
    9. Computer Applications                                           9. Italia
    10. Computing Milieux                                              10. Top Stories

    Table 4.1: Adopted Class Identifiers for each Reference Dataset.

These datasets potentially present distinct evolution patterns, due to their own characteristics. In particular, we expect that MEDLINE exhibits a more stable behavior, in comparison to the other two datasets, since it represents a more consolidated knowledge area. Thus, we expect a tendency of newly inserted terms becoming stable over the years. In contrast, we expect a higher dynamism in AG-NEWS, a natural behavior of news datasets, which tend to present higher variability in their characteristics (for example, variations in class distributions according to transient events, hot topics, and so on).


Figure 4.1: Class Distributions in the Three Reference Datasets ((a) ACM-DL, (b) MEDLINE, (c) AG-NEWS).

4.1.2 ADC Algorithms

We selected four representative and widely used ADC algorithms to conduct our study. These

algorithms are:

Rocchio: an eager classifier that uses the centroid of a class to find boundaries between classes. The centroid of a class is defined as the average vector computed over its training examples. When classifying a new example d', Rocchio associates it with the class represented by the centroid closest to d'.

KNN: a lazy classifier that assigns to a test document d' the majority class among those of its k nearest neighbor training documents in the vector space. Unlike Rocchio, KNN determines the decision boundary locally, considering each training document independently. We here use the cosine similarity to determine the nearest neighbors of a test document.
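A minimal sketch of the cosine-based neighbor search behind KNN, with bag-of-words vectors represented as term-weight dictionaries (the helper names are illustrative, not the actual implementation used in this work):

    import math
    from collections import Counter

    def cosine(u, v):
        # cosine similarity between two sparse term-weight vectors
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm > 0 else 0.0

    def knn_predict(test_vec, training_docs, k):
        # training_docs: list of (vector, label); pick the majority class of the k closest
        neighbors = sorted(training_docs, key=lambda d: cosine(test_vec, d[0]), reverse=True)[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]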

Naïve Bayes (NB): a probabilistic learning method that aims at inferring a model for each class, assigning to a test document d' the class associated with the most probable model that would have generated it. Here, we adopt the Multinomial Naïve Bayes approach (Manning et al., 2008), since it is widely used for probabilistic text classification. The posterior class probabilities are defined as

    P(c \mid d') = \eta \, P(c) \prod_{t \in d'} P(t \mid c),    (4.1)

where \eta denotes a normalizing factor, P(c) is the class prior probability and P(t|c) denotes the conditional probability of observing term t given class c. The NB classifier assigns to a test example d' the class c with the highest posterior probability P(c|d').
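A sketch of the multinomial scoring rule in log space; Laplace smoothing is added only to keep the example well defined, and the data structures are assumptions of this illustration.

    import math

    def nb_score(doc_terms, prior, term_counts, class_total, vocab_size):
        # doc_terms:   {term: frequency in the test document d'}
        # prior:       P(c) estimated from the training set
        # term_counts: {term: count of the term in documents of class c}
        # class_total: total term count in class c; vocab_size: |V| (Laplace smoothing)
        score = math.log(prior)
        for term, freq in doc_terms.items():
            p_tc = (term_counts.get(term, 0) + 1) / (class_total + vocab_size)
            score += freq * math.log(p_tc)   # one log P(t|c) addend per term occurrence
        return score

    # the predicted class is the one maximizing nb_score over all classes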

Support Vector Machine (SVM): the SVM classifier aims at finding an optimal separating hyperplane between the positive and negative training documents, maximizing the distance (margin) to the closest points from either class. Given N training documents


represented as pairs (x_i, y_i), where x_i is the weighted feature vector of the i-th training document and y_i ∈ {−1, +1} the set membership of the document, SVM tries to maximize the margin between the two classes on the training data, which leads to better classification effectiveness on test data. We may state the problem as

    \min_{\beta, \beta_0} \; \frac{1}{2}\|\beta\|^2, \quad \text{subject to} \quad y_i (x_i^T \beta + \beta_0) \ge 1,    (4.2)

where \beta is a vector normal to the hyperplane (the so-called weight vector), \beta_0 is its intercept, and 1 \le i \le N.

After introducing Lagrange multipliers \alpha_i (1 \le i \le N) for each inequality constraint in Equation 4.2, along with slack variables \xi_i to account for non-separable data (a bounded tolerable training error rate), we form the following Lagrangian (primal):

    L_P = \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i (x_i^T \beta + \beta_0) - (1 - \xi_i) \right] - \sum_{i=1}^{N} \mu_i \xi_i,    (4.3)

which we minimize with respect to \beta, \beta_0 and \xi_i, where the \mu_i are Lagrange multipliers employed to enforce \xi_i \ge 0. Setting the corresponding derivatives to zero yields:

    \beta = \sum_{i=1}^{N} \alpha_i y_i x_i    (4.4)

    0 = \sum_{i=1}^{N} \alpha_i y_i    (4.5)

    \alpha_i = C - \mu_i,    (4.6)

where \alpha_i \ge 0, \mu_i \ge 0 and \xi_i \ge 0, \forall i. By substitution into Equation 4.3, we get the so-called Lagrangian Wolfe (dual) function:

    L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^T x_j.

Furthermore, the solution must satisfy the Karush-Kuhn-Tucker (KKT) conditions, which include, along with Equations 4.4, 4.5 and 4.6, the following ones:


    \alpha_i \left[ y_i (x_i^T \beta + \beta_0) - (1 - \xi_i) \right] = 0    (4.7)

    \mu_i \xi_i = 0

    y_i (x_i^T \beta + \beta_0) - (1 - \xi_i) \ge 0,

where 1 \le i \le N.

Finally, the solution for \beta is \beta = \sum_{i=1}^{N} \alpha_i y_i x_i, with non-zero \alpha_i only for the support points (the so-called support vectors). The solution for \beta_0 may be obtained from Equation 4.7, typically by averaging the solutions over the support points to achieve numerical stability. Thus, we can express the SVM's decision function as:

    F = \operatorname{sign}(x^T \beta + \beta_0),

where the sign of the score is used to predict the example's class. Since SVM is a binary classifier, it must be adapted to handle multi-class classification problems. The two most common strategies for doing so are one-against-one and one-against-all (Manning et al., 2008).
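As a sketch of the one-against-all decision step only (the per-class weight vectors would come from some binary SVM trainer; the data layout is an assumption of this example):

    import numpy as np

    def one_vs_all_predict(x, models):
        # x:      feature vector of the test document
        # models: {class_label: (beta, beta0)} -- one binary SVM per class
        scores = {label: float(np.dot(x, beta) + beta0)
                  for label, (beta, beta0) in models.items()}
        return max(scores, key=scores.get)   # class whose hyperplane gives the largest score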

4.2 Characterization of Temporal Effects on Textual Datasets

In this section, we briefly describe the characterization reported in (Mourão et al., 2008), which uncovered three main temporal effects that affect the ACM-DL and MEDLINE datasets: (i) the class distribution variation; (ii) the term distribution variation; and, finally, (iii) the class similarity variation. More importantly, we also extend this prior characterization to include a third, distinct, and more dynamic dataset, namely AG-NEWS. Our main goal is to strengthen the argument for the existence of temporal effects in the reference datasets, thus motivating our quantitative analysis of their impact on ADC algorithms when applied to these datasets.

Before proceeding, we must first discretize the temporal dimension in order to capture the variabilities in the characteristics of the explored datasets. Time can be seen as a discretization of the natural changes inherent to any knowledge area. Detectable changes, however, may occur at different time scales, depending on the characteristics of the given knowledge area. In the case of ACM-DL and MEDLINE, which are sets of scientific articles, we adopted yearly intervals for identifying such changes, as scientific conferences usually occur


once per year. For the AG-NEWS dataset, we adopted, instead, a daily granularity, which should more accurately capture changes in a set of news articles. Next, we discuss the main findings of the characterization of each temporal effect in the three datasets.

4.2.1 Class Distribution Temporal Variation

The impact of temporal evolution on class distribution (CD) relates to the variation, over time, of the fraction of documents assigned to each class. CD temporal variation should be properly considered in order to avoid undesirable classifier bias. For instance, as mentioned before, if the CD varies significantly, the "assumed" class distribution may not reflect the "true" class distribution observed when the test data was created. Notice that, as an extreme case, classes may appear and disappear as a consequence of splits and joins of existing classes. For example, the sub-classes Information Retrieval and Artificial Intelligence in the ACM-DL Computing Classification System (CCS) belonged to the same class, Applications, in 1964. Currently, each one belongs to a different class: Information Retrieval belongs to Information Systems, whereas Artificial Intelligence belongs to Computing Methodologies.

To assist the analysis of the CD temporal variation in each dataset, Figure 4.2 shows the class probability distributions for each year of ACM-DL and MEDLINE (as in Mourão et al. 2008) and for each week of AG-NEWS.¹ The figure illustrates the variation in the representativeness of the classes, that is, in the fraction of document occurrences in each class, as time goes by. As the figures show, most classes, particularly in ACM-DL and AG-NEWS, exhibit frequent oscillations in their representativeness, whereas others become more or less representative with time. For instance, the Mathematics of Computing class, in ACM-DL, became less representative with time, whereas the AG-NEWS World class presented a peak in its representativeness between the 25th and 37th weeks. Another interesting case is the MEDLINE Aids class. Although it contains documents dating from 1970, the fraction of documents belonging to it only became significant after 1985.

These results illustrate that one needs to be very careful when creating classification models in order to avoid generating a biased model that may not be accurate for the dataset to be tested. The fact that the fractions of documents in several classes are constantly changing over time, as can be seen for several classes in all three datasets in Figure 4.2, makes this a real problem that must be taken into account.

¹We show AG-NEWS results on a weekly basis to improve graph readability.


Figure 4.2: Class Distribution Temporal Variation in Each Reference Dataset ((a) ACM-DL, (b) MEDLINE, (c) AG-NEWS).

4.2.2 Term Distribution Temporal Variation

Term Distribution (TD) variation is related to how the distribution of terms among the classes changes over time as a consequence of terms appearing, disappearing, and having variable discriminative power across classes. Take the following example with two classes, Mythology and Astrophysics. Besides being the god of the underworld in classical mythology, Pluto was also considered to be a planet until mid-2006. Up to that date, documents with the term Pluto had a higher probability of being classified in the Astrophysics class due to the great number of references that mention Pluto as a planet. From that date on, since Pluto is no longer considered to be a planet, there has been a significant reduction in the number of documents referring to it in this context. In mythology, however, references to Pluto did not present any noticeable variation. In this case, the term Pluto lost discriminative power in the Astrophysics class and gained it in the Mythology class. Intuitively, we may state that TD evolution usually happens gradually, so that the distributions of terms observed at time periods that are closer time-wise also tend to be more similar.


In order to characterize the TD temporal variation effect, we define, for each class and each point in time, the class vocabulary as the set of terms that have the highest values of information gain (Forman, 2003). The vocabulary of a class at a given point in time t represents that class in t. We then compare the vocabularies produced for the same class across all points in time using the normalized cosine similarity between them. Figure 4.3 shows the average cosine similarities as we vary the time distance between the vocabularies. For the sake of clarity, we present results for a subset of the classes of each dataset, since the same behavior is observed for all classes. Clearly, for all three datasets, the class vocabularies vary significantly over time. For the less stable ACM-DL and AG-NEWS datasets, the similarities drop significantly even for a time distance equal to 1.²

Figure 4.3: Term Distribution Temporal Variation of Each Reference Dataset (panels: (a) ACM-DL, (b) MEDLINE, (c) AG-NEWS).
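To make this characterization concrete, the fragment below is an illustrative Python sketch (added here as an example; it is not part of the original experimental code, and the data layout and helper names are assumptions). It builds per-period class vocabularies from precomputed info-gain scores and averages the cosine similarity of the vocabularies as a function of their time distance; treating each vocabulary as a binary term vector, the cosine similarity reduces to |A ∩ B| / sqrt(|A| |B|).

    from math import sqrt

    def top_k_vocabulary(term_scores, k):
        """Return the set of the k terms with the highest info-gain scores."""
        return {t for t, _ in sorted(term_scores.items(), key=lambda kv: -kv[1])[:k]}

    def vocabulary_cosine(vocab_a, vocab_b):
        """Cosine similarity between two vocabularies seen as binary term vectors."""
        if not vocab_a or not vocab_b:
            return 0.0
        return len(vocab_a & vocab_b) / sqrt(len(vocab_a) * len(vocab_b))

    def similarity_by_time_distance(scores_per_period, k=100):
        """scores_per_period: {period: {term: info_gain}} for a single class.
        Returns {time distance: average cosine similarity between vocabularies}."""
        periods = sorted(scores_per_period)
        vocabs = {p: top_k_vocabulary(scores_per_period[p], k) for p in periods}
        sums, counts = {}, {}
        for i, p in enumerate(periods):
            for q in periods[i + 1:]:
                d = q - p
                sums[d] = sums.get(d, 0.0) + vocabulary_cosine(vocabs[p], vocabs[q])
                counts[d] = counts.get(d, 0) + 1
        return {d: sums[d] / counts[d] for d in sums}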

Since the class vocabulary changes significantly with time, it becomes clear that a classification model generated considering documents created in a certain period of time may be less effective when tested on documents from another period, because the vocabulary may have changed in such a way that the assumptions made when learning the classifier no longer hold, that is, the discriminative terms may not be the same. Such difficulty turns out to be a very interesting challenge as well.

4.2.3 Class Similarity Temporal Variation

Finally, class similarity (CS) variation relates to how the pairwise similarity among classes, as a function of the terms that occur in their documents, varies over time. The similarity between two arbitrary classes may change over time due to the migration and variation of the frequency of the terms in their vocabularies: two classes may be similar at a given moment and become less similar in the future, and vice versa.

² At time distance zero the similarity is equal to 1 for all classes, since we are comparing a vocabulary to itself, which corresponds to the maximum possible similarity value.


In order to analyze the CS temporal variation, we calculate the cosine similarity between the vocabularies of each pair of distinct classes at any given point in time (years for ACM-DL and MEDLINE, and days for AG-NEWS). Tables 4.2, 4.3 and 4.4 show the results for ACM-DL, MEDLINE and AG-NEWS, respectively. Each entry in the tables comprises the standard deviation of the similarities between the associated pair of classes computed over all points in time. As we can observe, the similarities between some pairs of classes vary significantly with time. For example, the similarities between General Literature (id 0) and Theory of Computation (id 5) in ACM-DL, Complementary Medicine (id 3) and History (id 4) in MEDLINE, and World and Top Stories (ids 5 and 10, respectively) in AG-NEWS have standard deviations equal to 0.29, 0.21 and 0.33, respectively. This means that these pairs of classes may have been very similar in some periods, but also loosely related in others. Thus, the difficulty in separating them varies significantly as time goes by.

Class ID    0     1     2     3     4     5     6     7     8     9     10
0           0     0.14  0.12  0.12  0.12  0.29  0.14  0.13  0.14  0.12  0.29
1           -     0     0.08  0.13  0.11  0.12  0.11  0.12  0.10  0.10  0.13
2           -     -     0     0.10  0.09  0.10  0.08  0.07  0.08  0.10  0.13
3           -     -     -     0     0.09  0.06  0.09  0.10  0.11  0.12  0.13
4           -     -     -     -     0     0.05  0.08  0.09  0.10  0.13  0.13
5           -     -     -     -     -     0     0.14  0.13  0.07  0.06  0.29
6           -     -     -     -     -     -     0     0.13  0.10  0.09  0.15
7           -     -     -     -     -     -     -     0     0.10  0.08  0.15
8           -     -     -     -     -     -     -     -     0     0.11  0.13
9           -     -     -     -     -     -     -     -     -     0     0.12
10          -     -     -     -     -     -     -     -     -     -     0

Table 4.2: Pairwise Class Similarity (standard deviations) in ACM-DL.

Class ID    0     1     2     3     4     5     6
0           0     0.19  0.16  0.18  0.19  0.18  0.19
1           -     0     0.04  0.20  0.17  0.19  0.12
2           -     -     0     0.04  0.03  0.04  0.05
3           -     -     -     0     0.21  0.08  0.05
4           -     -     -     -     0     0.20  0.11
5           -     -     -     -     -     0     0.05
6           -     -     -     -     -     -     0

Table 4.3: Pairwise Class Similarity (standard deviations) in MEDLINE.

Class ID    0     1     2     3     4     5     6     7     8     9     10
0           0     0.17  0.16  0.23  0.15  0.18  0.20  0.15  0.17  0.01  0.16
1           -     0     0.20  0.25  0.24  0.24  0.21  0.19  0.20  0.02  0.15
2           -     -     0     0.24  0.18  0.10  0.20  0.18  0.26  0.04  0.40
3           -     -     -     0     0.30  0.30  0.27  0.20  0.31  0.01  0.16
4           -     -     -     -     0     0.19  0.19  0.21  0.26  0.01  0.19
5           -     -     -     -     -     0     0.23  0.16  0.28  0.04  0.33
6           -     -     -     -     -     -     0     0.15  0.22  0.01  0.13
7           -     -     -     -     -     -     -     0     0.13  0.02  0.15
8           -     -     -     -     -     -     -     -     0     0.02  0.24
9           -     -     -     -     -     -     -     -     -     0     0.05
10          -     -     -     -     -     -     -     -     -     -     0

Table 4.4: Pairwise Class Similarity (standard deviations) in AG-NEWS.

Summarizing this discussion, there is clear evidence of temporal variations in the class and term distributions, as well as in the similarities among classes, in all three analyzed datasets. These variations may ultimately affect the performance of classifiers. In the following section, we detail the proposed methodology to quantify the impact of each of these three temporal effects on ADC algorithms.

4.3 Experimental Design

In this section, we describe our proposed methodology to assess the impact of the identified temporal effects on each ADC algorithm and textual dataset. The core component of our methodology is a factorial experimental design (Jain, 1991). This technique has already been applied in multiple contexts to quantify the effect of different factors and inter-factor interactions on a given response variable (see examples in de Lima et al. 2010; Jain 1991; Orair et al. 2010; Vaz de Melo et al. 2008). However, to the best of our knowledge, this is the first time it is applied in the specific context of temporal effects and ADC algorithms. As will be discussed below, the application of this technique in this context brings challenges of its own.

We start by reviewing the factorial design procedure in general terms in Section 4.3.1. We discuss how it can be applied to evaluate the impact of temporal effects on ADC algorithms in Section 4.3.2, and present its application to the four selected ADC algorithms and the three chosen textual datasets in Section 4.3.3.

4.3.1 Factorial Design

Given k factors (the so-called independent variables), which can assume n levels (possible values), and a response variable, a full factorial n^k experimental design aims at quantifying


the impact of each individual factor, as well as of all inter-factor interactions (of all orders), on a given response variable. In other words, it aims at quantifying the effect of these factors and interactions on the variations observed in the response across a series of n^k experiments, carefully designed to cover all possible configurations of factor levels.

To conduct the n^k design, the parameters that affect the system under study must be carefully controlled, in order to avoid misleading conclusions due to unexpected effects. Thus, one has to be able to isolate and carefully vary the factors, which are parameters related to the goals of the study and thus selected to be analyzed, while controlling the other parameters, which are kept fixed. Usually, factors are varied from smaller to larger values, based on the assumption of monotonicity, that is, that the response variable continuously increases (or decreases) as the factor value becomes larger. In many scenarios, the system under study presents an inherent variability, and thus measurements are susceptible to inaccuracies, referred to as experimental errors. In such cases, the impact of the factors and of their interactions should be assessed in comparison to such errors, and an experimental design with r replications (n^k r) should be adopted. This is done by replicating the measurements for each factor-level combination r times. It is important to emphasize the need of controlling all parameters with significant impact on the system, by either treating them as factors or keeping them fixed, as the effect of uncontrolled parameters cannot be distinguished from experimental errors.

Such an experimental design is typically used as a primary tool to help one sort factors and inter-factor interactions in terms of their impact on the response variable, thus providing quantitative evidence of which factors (and/or interactions) are more relevant for further (more detailed) investigation. The examination of every possible factor-level combination enables one to have a complete picture of the system behavior regarding the factors considered. However, it comes at the expense of a potentially very costly study. The required number of experiments (i.e., n^k r experiments) may be too large and infeasible to perform due to resource and time constraints. One of the most recommended strategies to reduce the number of required experiments consists of reducing the number of levels considered for each factor (Jain, 1991). As a matter of fact, for an initial assessment, one can consider only two levels (lower and upper) of each factor, thus performing a 2^k r factorial design. By doing so, one can determine the relative importance of all factors and interactions, and leave the analysis of more levels of the most relevant factors for a more detailed study.

We describe the main steps of a 2^k r factorial design using, for illustration purposes, k = 2 factors, referred to as A and B. The 2^2 r design aims to fit an additive model that characterizes the impact of each factor A and B, as well as of their interaction AB, on the response variable y. This model is given by:


y = q_0 + q_A x_A + q_B x_B + q_AB x_A x_B + ε,     (4.8)

where q_0 is the mean value of the response variable, q_A, q_B and q_AB stand for the effects associated with factors A, B and the interaction AB, and ε denotes the experimental errors. For each factor f ∈ {A, B}, a variable x_f is defined as

x_f = −1 if f is at the lower level, and x_f = +1 if f is at the upper level.

Thus, q_f denotes the average extent of the variation imposed by factor f on the global average q_0.

The 2^2 r experimental design can be summarized into five steps. Step 1 consists of parameter estimation, in which we compute q_0 and the effects q_A, q_B and q_AB. Once the effects have been computed, the model can be used to estimate the response for any given factor values (x-values). For instance, the estimated response when factors A and B are at levels x_Ai and x_Bi, respectively, is computed as:

ŷ_i = q_0 + q_A x_Ai + q_B x_Bi + q_AB x_Ai x_Bi     (4.9)

The importance of a factor can be measured by the proportion of the total variation in the response variable that can be explained by it. Thus, in Step 2, we compute the variation of the response y across all experiments that can be explained by each factor (SS_A, SS_B, SS_AB, respectively), as well as the variation that remains unexplained and is thus credited to experimental errors (SSE). In other words, we compute SS_f = 2^k r q_f^2 (f ∈ {A, B, AB}) and SSE = Σ_{i=1}^{2^k} Σ_{j=1}^{r} e_ij^2, where the error e_ij denotes the difference between the estimated response ŷ_i for the ith experiment and the value y_ij measured in its jth replication. The total variation, referred to as the Sum of Squares Total (SST), is also computed, as the sum of SS_A, SS_B, SS_AB and SSE.

Next (Step 3), we express SS_f (f ∈ {A, B, AB}) and SSE as percentages of the total variation SST, so as to more easily assess the importance of each factor and of the experimental errors in the observed response variations. Factors (and interactions) that explain a higher percentage of the total variation are considered more important and are thus candidates for further analysis.

Since the effects are computed from a sample, they are indeed random variables, and could take different values if another set of experiments were performed. Thus, it is necessary to compute their associated confidence intervals (Step 4). We do so by first computing the root mean square error (RMSE) and the standard deviation s_f of each effect q_f (f ∈ {A, B, AB}). The RMSE denotes the standard error of the estimates, thus measuring how well the model explains the observations. It is computed as the square root of the ratio of SSE to the degrees of freedom associated with the experimental errors (in the current design, 2^2 (r − 1)).³ The 100(1 − α)% two-sided confidence intervals are computed using either a Student's t distribution or a z distribution, depending on the degrees of freedom 2^2 (r − 1) (see Jain, 1991). Any effect whose confidence interval does not include zero is statistically significant with the given confidence.

Finally, in Step 5 we assess the model quality by means of the coefficient of determination R^2. This is done by comparing the unexplained variation (SSE) with the total variation (SST), being a measure of the goodness of fit of the additive model in Equation 4.8. The closer it is to 1, the better the fitted model.

The general procedure to perform a 2^k r design, for any values of k and r, is presented in Algorithm 1.

Algorithm 1: Factorial Design Procedure.

function FACTORIALDESIGN
    Step 1: Estimate model parameters (i.e., grand mean and factor effects)
        q_0 ← (1 / (2^k r)) Σ_{i=1..2^k} Σ_{j=1..r} y_ij
        q_f ← (1 / 2^k) Σ_{i=1..2^k} x_fi ȳ_i, where f ∈ [1, 2^k − 1] and ȳ_i = (1/r) Σ_{j=1..r} y_ij
    Step 2: Compute total variation as well as variation due to each factor and to experimental errors
        SS_f ← 2^k r q_f^2, where f ∈ [1, 2^k − 1]
        SSE ← Σ_{i=1..2^k} Σ_{j=1..r} e_ij^2, where e_ij = y_ij − ŷ_i (in this full model, ŷ_i = ȳ_i)
        SST ← Σ_{f=1..2^k−1} SS_f + SSE
    Step 3: Compute percentage of variation each factor/error is responsible for
        P_f ← (SS_f / SST) × 100, where f ∈ [1, 2^k − 1]
        P_E ← (SSE / SST) × 100
    Step 4: Compute confidence intervals of the effects
        RMSE ← sqrt( SSE / (2^k (r − 1)) )
        s_f ← RMSE / sqrt(2^k r), where f ∈ [1, 2^k − 1]
        CI_f ← q_f ± t[1 − α/2; 2^k (r − 1)] · s_f, where f ∈ [1, 2^k − 1]
    Step 5: Assess model accuracy by the coefficient of determination
        R^2 ← 1 − SSE / SST
end function

³ In a general 2^k r design, the degrees of freedom of the experimental errors is given by 2^k (r − 1).
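As a complement to Algorithm 1, the Python sketch below is an added illustration (not code from the dissertation) of the 2^2 r case for two factors A and B; the expected input format is an assumption, and the Student's t quantile is taken from SciPy.

    from math import sqrt
    from scipy.stats import t as student_t

    def factorial_design_2x2(runs, alpha=0.01):
        """runs: {(xA, xB): [y_1, ..., y_r]} with xA, xB in {-1, +1}.
        Returns the effects, the percentages of variation, the confidence
        intervals of the effects and the coefficient of determination R^2."""
        r = len(next(iter(runs.values())))
        cells = sorted(runs)                              # the 2^k = 4 combinations
        means = {c: sum(runs[c]) / r for c in cells}      # replication mean per combination

        # Step 1: grand mean and effects (the interaction uses the sign xA * xB).
        q = {
            "0":  sum(means.values()) / 4,
            "A":  sum(xa * means[(xa, xb)] for xa, xb in cells) / 4,
            "B":  sum(xb * means[(xa, xb)] for xa, xb in cells) / 4,
            "AB": sum(xa * xb * means[(xa, xb)] for xa, xb in cells) / 4,
        }

        # Step 2: sums of squares of each effect and of the experimental errors.
        ss = {f: 4 * r * q[f] ** 2 for f in ("A", "B", "AB")}
        sse = sum((y - means[c]) ** 2 for c in cells for y in runs[c])
        sst = sum(ss.values()) + sse

        # Step 3: percentage of variation explained by each factor and by errors.
        pct = {f: 100 * ss[f] / sst for f in ss}
        pct["error"] = 100 * sse / sst

        # Step 4: confidence intervals of the effects.
        dof = 4 * (r - 1)
        rmse = sqrt(sse / dof)
        s_f = rmse / sqrt(4 * r)
        t_crit = student_t.ppf(1 - alpha / 2, dof)
        ci = {f: (q[f] - t_crit * s_f, q[f] + t_crit * s_f) for f in ("A", "B", "AB")}

        # Step 5: coefficient of determination.
        r2 = 1 - sse / sst
        return q, pct, ci, r2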


4.3.2 Applying the 2^k r Design in the Characterization of Temporal Effects

In this section we describe how the 2^k r design can be applied to quantify the impact of temporal aspects on ADC algorithms, considering different datasets. As we focus on three different temporal aspects, namely the class distribution temporal variation (CD), the term distribution temporal variation (TD) and the class similarity temporal variation (CS), our experimental design takes k = 3 factors. The two levels considered for each factor, which we call the “lower” and “upper” levels and define below, thus refer to the degree of temporal variation observed for it. Given a reference dataset and an ADC algorithm, the goal is to partition the document set into 2^3 groups corresponding to all possible factor-level configurations, and then evaluate the algorithm for each configuration, considering the grouped documents. We then apply the 2^k r design procedure, described in Algorithm 1, to quantify the effect of each factor and inter-factor interaction on the effectiveness of the ADC algorithm. The response variable y is thus the classification effectiveness, which is here assessed by the commonly used F1 measure. F1 is the harmonic mean between the precision p and the recall r, given by:

F1 = 2pr / (p + r),

where precision is the percentage of documents assigned by the classifier to class c_i that were correctly classified, and recall is the percentage of documents belonging to class c_i that were correctly classified.⁴

⁴ The described F1 measure corresponds to the overall performance of the methods across all classes. Using a per-class variation of the measure (also known as Macro-F1) would imply having to consider another parameter in the analysis: the class imbalance. In order to focus our analysis on the time-related factors, the goal of the present study, we would have to isolate or control this parameter. However, possible solutions to isolate this parameter (for example, under- or oversampling; Lin et al. 2009; Liu et al. 2007) are typically very hard to apply in practice without affecting the temporal factors, which ultimately could compromise our study. Thus, we leave the consideration of this metric in our experimental design for future work. We note, however, that, in the absence of very skewed class distributions, both variations of the F1 metric tend to produce compatible results.
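As a small, self-contained illustration (not the authors' evaluation code), the overall F1 described above can be computed, for single-label predictions, by pooling true positives, false positives and false negatives over all classes:

    def micro_f1(y_true, y_pred):
        """Micro-averaged F1 over single-label predictions: pool TP, FP and FN
        across all classes before computing precision and recall."""
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p)
        # In the single-label case, every misclassified document counts as one FP
        # (for the predicted class) and one FN (for the true class).
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != p)
        fn = fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0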

For each configuration, we run a number r of replications following a cross-validation strategy commonly adopted by the machine learning community. There are at least two usual approaches for doing so: K-fold cross validation and repeated random sub-sampling. K-fold cross validation consists of randomly splitting the data into K independent folds. At each iteration, one fold is retained as the test set, and the remaining K − 1 folds are used as the training set. Repeated random sub-sampling consists of randomly selecting a fraction of documents from the dataset, without replacement, to compose the test set, with the remaining documents retained as the training set. This is performed for each replication. Since in K-fold cross validation the sizes of the folds depend on the


number of iterations, it becomes more suitable for medium/large-sized datasets, while repeated random sub-sampling is usually adopted for small-sized datasets when the number of replications is large.

One challenge in building our factorial design is how to define the 2^3 groups of documents. For that, we must quantify the temporal variation of each factor in the set of documents of the reference dataset, define the two levels for each factor and, based on them, group the documents according to all possible factor-level combinations. The following three sections (Sections 4.3.2.1-4.3.2.3) describe how we performed these steps for the CD, TD and CS factors. Note that, since CD and CS relate exclusively to the characteristics of the class to which a document belongs, we define the CD and CS levels associated with a document based on the corresponding values of its class. TD, on the other hand, relates to the relationships among terms and classes. Thus, in order to define the TD level associated with a document, isolating this factor from the others, we adopt a finer-grained approach that analyzes the document's contents. After defining the factor levels, we discuss a few other aspects that require attention to avoid misleading results (Section 4.3.2.4).

4.3.2.1 Class Distribution: Lower and Upper Levels

Let C and P be the sets of classes and points in time observed in the reference dataset, respectively. To isolate the class distribution effect into lower and upper levels, we consider the relative sizes of the classes (i.e., the fraction of the dataset documents assigned to each class) at each point in time p ∈ P. For each class c ∈ C, we compute the coefficient of variation CV (that is, the ratio of the standard deviation to the mean) of the relative size of c over all values of p. The CV is used since it is dimensionless and scale invariant, and thus more appropriate to deal with the temporal changes in class distribution observed in the reference dataset.

We then partition the documents into two groups based on a given threshold δ_CD: those whose classes present CV values lower than δ_CD are assigned to the “lower” group (CD↓, with associated variable x_CD = −1), while those whose classes present CV values higher than δ_CD are assigned to the “upper” group (CD↑, with associated variable x_CD = +1). We defer to Section 4.3.3 the details regarding how we define the δ_CD threshold.
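The following Python sketch is an added illustration (the data layout and the default threshold choice are assumptions consistent with the description above): it computes the CV of each class's relative size over time and assigns documents to the CD↓ (−1) or CD↑ (+1) group.

    from statistics import mean, pstdev

    def class_size_cv(docs):
        """docs: list of (class_label, time_period) pairs.
        Returns {class: CV of its relative size across all time periods}."""
        periods = sorted({p for _, p in docs})
        classes = {c for c, _ in docs}
        totals = {p: sum(1 for _, q in docs if q == p) for p in periods}
        cv = {}
        for c in classes:
            shares = [sum(1 for d, q in docs if d == c and q == p) / totals[p]
                      for p in periods]
            cv[c] = pstdev(shares) / mean(shares) if mean(shares) > 0 else 0.0
        return cv

    def split_by_cd(docs, delta_cd=None):
        """Assign each document the level -1 (CD lower) or +1 (CD upper),
        using the average CV as the default threshold delta_CD."""
        cv = class_size_cv(docs)
        if delta_cd is None:
            delta_cd = mean(cv.values())
        return [(-1 if cv[c] < delta_cd else +1) for c, _ in docs]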

4.3.2.2 Term Distribution: Lower and Upper Levels

We determine the TD level to which a document belongs by computing the document stability level, which is characterized by the density of the document's terms that are stable. In order to assess the stability of a given term, we use the concept of stability period (Rocha et al., 2008).


Definition 1 (Stability period). Let DF(t, c, p) be the number of documents belonging to class c ∈ C that contain term t and that were created at the point in time p ∈ P. A stability period S_{t,p_r} of a term t, considering p_r ∈ P as the reference point in time, is the set of points in time p present in the largest continuous period of time,⁵ starting from p_r and growing both to the past and the future, while there exists some class c such that

DOMINANCE(t, c, p) = DF(t, c, p) / Σ_{c′∈C} DF(t, c′, p) > α,

for some predefined 0 < α ≤ 1.

⁵ We consider the same definition of stability period as Rocha et al. (2008), adopting a continuous period of time, due to computational feasibility. Considering non-continuous intervals increases the search space exponentially with the number of points in time (2^|P| possible intervals to be considered). This is a safe decision because, as we can observe in Figure 4.3, the variations observed in the relationships between terms and classes are smooth (that is, we do not observe any abrupt steps in the curves).

We characterize the stability of a term t, regarding a reference point in time p_r, by the term stability level (TSL), defined as:

TSL(t, p_r) = |STABILITYPERIOD(t, p_r)| / |P|

We then use the TSL to estimate the document stability level (DSL) of a given document d. Let p be the point in time when d was created. We define the DSL of d as:

DSL(d) = ( Σ_{t∈d} TSL(t, p) ) / |{t′ | t′ ∈ d}|

As we can observe, 0 ≤ DSL(d) ≤ 1, where the lower bound (DSL(d) = 0) occurs for documents without stable terms, and the upper bound (DSL(d) = 1) occurs for documents composed only of terms t with maximal TSL(t, p) (that is, terms that have stability periods with maximum duration with respect to the time when d was created).

The documents are then partitioned into two groups: those with DSL lower than a pre-defined threshold δ_TD are assigned to the “lower” group (TD↓, with associated variable x_TD = −1), and the remaining documents are assigned to the “upper” group (TD↑, with associated variable x_TD = +1). Again, we defer the definition of this threshold to Section 4.3.3.
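The sketch below is an added Python illustration of Definition 1 and of the TSL/DSL measures; the data structure df, holding the DF(t, c, p) counts, is an assumption, and the period is grown while some class dominates the term, following the reading adopted above.

    def dominance(df, term, cls, p, classes):
        """DOMINANCE(t, c, p) = DF(t, c, p) / sum over c' of DF(t, c', p)."""
        total = sum(df.get((term, c, p), 0) for c in classes)
        return df.get((term, cls, p), 0) / total if total else 0.0

    def stability_period(df, term, p_ref, periods, classes, alpha=0.5):
        """Largest continuous run of periods around p_ref in which some class
        dominates the term (dominance > alpha)."""
        def dominated(p):
            return any(dominance(df, term, c, p, classes) > alpha for c in classes)
        if not dominated(p_ref):
            return set()
        idx = periods.index(p_ref)
        lo = hi = idx
        while lo - 1 >= 0 and dominated(periods[lo - 1]):
            lo -= 1
        while hi + 1 < len(periods) and dominated(periods[hi + 1]):
            hi += 1
        return set(periods[lo:hi + 1])

    def tsl(df, term, p_ref, periods, classes, alpha=0.5):
        """Term stability level: |stability period| / |P|."""
        return len(stability_period(df, term, p_ref, periods, classes, alpha)) / len(periods)

    def dsl(df, doc_terms, p_doc, periods, classes, alpha=0.5):
        """Document stability level: average TSL of the document's distinct terms."""
        terms = set(doc_terms)
        return sum(tsl(df, t, p_doc, periods, classes, alpha) for t in terms) / len(terms)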

4.3.2.3 Class Similarity: Lower and Upper Levels

The “lower” group (CS↓, with associated variable x_CS = −1) is composed of documents whose classes are more stable in terms of their similarities with other classes during the whole period covered by the reference dataset. Accordingly, the “upper” group (CS↑, with


associated variable x_CS = +1) is composed of documents whose classes present higher variability in their similarities with other classes. To quantify this variability for a class c, we first compute the similarity sim(V_{c,p}, V_{c′,p}), where c, c′ ∈ C, c ≠ c′, and V_{c,p} denotes c's vocabulary at the point in time p ∈ P. The vocabulary of a class c at time p consists of the top-K terms with highest Information Gain (Forman, 2003) in c at that time. We then compute the coefficient of variation CV of the (|C| − 1)|P| pooled similarities.⁶ We separate documents into two groups based on the CV values of their classes and on a pre-defined threshold δ_CS, which will be further discussed in Section 4.3.3.

⁶ Note that, as in Section 4.3.2.1, we here use the CV metric to characterize class similarity temporal variations. This is in contrast to Section 4.2, where, following Mourão et al. (2008) strictly, we characterize the class similarity variation using the standard deviation of the pooled similarities, a metric that depends on the unit and scale of the measurements.
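Analogously to the CD case, the sketch below (an added illustration; names and data layout are assumptions) pools, for each class, its (|C| − 1)|P| similarities against the other classes and takes the coefficient of variation used against δ_CS; the set-based cosine from the earlier sketch is inlined as cos.

    from math import sqrt
    from statistics import mean, pstdev

    def class_similarity_cv(vocabs):
        """vocabs: {(class, period): set of top-K info-gain terms}.
        Returns {class: CV of its pooled similarities with all other classes}."""
        def cos(a, b):
            return len(a & b) / sqrt(len(a) * len(b)) if a and b else 0.0
        classes = {c for c, _ in vocabs}
        periods = {p for _, p in vocabs}
        cv = {}
        for c in classes:
            pooled = [cos(vocabs[(c, p)], vocabs[(c2, p)])
                      for p in periods for c2 in classes if c2 != c]
            cv[c] = pstdev(pooled) / mean(pooled) if mean(pooled) > 0 else 0.0
        return cv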

4.3.2.4 Other Challenging Aspects

As a requirement for a well-conducted experimental design, we must control the parameters that may influence the responses but are not the target of the analysis (i.e., are not treated as factors in the design). One such parameter is the sampling effect, characterized by the differences in classification effectiveness obtained by varying the size of the training set. As is well known, as the training set used by supervised learning strategies becomes larger, more information becomes available to build the classification model, which ultimately influences the effectiveness of the classifier. If we neglect this matter and consider different training set sizes for each factor-level combination, we may mask the actual impact of the temporal effects on the ADC algorithms. Clearly, we must isolate the sampling effect to remove its influence on the response variable. Therefore, for each experimental replication, we randomly selected the same number of documents from each of the 2^k partitions, according to the size of the smallest partition. This ensures training sets with equal sizes across all factor-level combinations, thus isolating the sampling effect.
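A minimal sketch of this control (added here for illustration; partition names and the seed handling are assumptions) simply draws, at every replication, the same number of documents from each partition, equal to the size of the smallest one:

    import random

    def equalize_partitions(partitions, seed=None):
        """partitions: {name: list of documents}. Returns a new dict in which every
        partition is a random sample whose size equals that of the smallest partition."""
        rng = random.Random(seed)
        n = min(len(docs) for docs in partitions.values())
        return {name: rng.sample(docs, n) for name, docs in partitions.items()}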

One important dataset-dependent aspect is that the documents and classes in the reference dataset must fill all 2^3 groups to enable us to conduct the proposed experimental design. However, in some cases, as in the reference datasets analyzed here, this might not hold, particularly due to combinations involving the CD and CS factors (see discussion in Section 4.3.3). In such cases, we are not able to isolate and simultaneously analyze all three temporal factors. To overcome this issue, and yet provide valuable insights about the temporal effects, we propose a pairwise approach, consisting of two 2^2 r designs, referred to as CD×TD and CS×TD. This decision comes with a cost, as we are not able to analyze a possible interaction between CD and CS. However, as we will see in the next section, these two factors are typically very correlated. Thus, analyzing them in separate experimental


designs might still be worthwhile.

The first experimental design, CD×TD, aims at analyzing the impact of CD, TD and their interaction on the classification effectiveness achieved by the four algorithms in the three reference datasets. The second one, referred to as CS×TD, allows one to quantify the impact of CS, TD and their interaction. For both designs, all documents of each reference dataset are divided into four partitions, with the same number of documents randomly sampled at each replication, thus covering all possible factor-level combinations: {CD↓TD↓, CD↓TD↑, CD↑TD↓, CD↑TD↑} for the former and {CS↓TD↓, CS↓TD↑, CS↑TD↓, CS↑TD↑} for the latter, where ↓ and ↑ denote the “lower” and “upper” levels, respectively.

4.3.3 Quantifying the Impact of Temporal Effects on ADC

In this section, we present how we applied the proposed methodology to quantify the impact of temporal effects on ADC using, as experimental workload, the four ADC algorithms (Rocchio, KNN, Naïve Bayes and SVM) and the three textual datasets (ACM-DL, MEDLINE and AG-NEWS) presented in Section 4.1. In other words, we performed a series of experiments, following the proposed methodology, for each combination of ADC algorithm and reference dataset. As the number of available documents in all three datasets is not enough to cover all 2^3 partitions, we adopted the strategy described in Section 4.3.2.4, conducting two separate 2^2 r designs in each case.

Recall that, by Definition 1, in order to define the TD↓ and TD↑ document groups, we must determine the dominance threshold α to compute the stability periods. Different values of α were evaluated and, as they lead to similar results, we fixed α = 50%, ensuring that the terms will have a high degree of exclusivity with a single class. Furthermore, as described in Section 4.1, the KNN and SVM classifiers have some tuning parameters. In particular, one must define the number of nearest neighbors to be considered (parameter K) to use KNN. The SVM parameters depend on the kernel function used. We chose the RBF kernel function, since it yielded more stable results across replications than its linear counterpart. For this classifier, the tuning parameters are the cost C of misclassification and the shape of the RBF kernel function (parameter γ). All parameters were calibrated with a cross-validation performed over the training set. We used the LibSVM implementation (Chang and Lin, 2001), employing a one-against-one procedure to adapt the binary SVM to the multi-class scenario, since this is the case in our reference datasets.
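The experiments themselves used LibSVM; purely as an illustration of this calibration step, the sketch below relies on scikit-learn's SVC (which wraps LibSVM and applies a one-against-one scheme internally for multi-class problems) to tune C and γ by cross-validation over the training set. The grid values are placeholders, not the ones used in the dissertation.

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    def tune_rbf_svm(X_train, y_train):
        """Calibrate C and gamma of an RBF-kernel SVM by cross-validation over the
        training set; the grid below is an illustrative placeholder."""
        grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
        search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5, scoring="f1_micro")
        search.fit(X_train, y_train)
        return search.best_estimator_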

Next, we discuss the experimental design conducted for each reference dataset.


4.3.3.1 ACM-DL

The first step is to partition the ACM-DL documents into four groups for the CD×TD design and four other groups for the CS×TD design. We do so by first partitioning them into one pair of groups for each design, using the δ_CD and δ_CS thresholds. We set δ_CD as the average CV of class sizes (Section 4.3.2.1), computed across all classes. Similarly, we set δ_CS as the average CV of pooled similarities (Section 4.3.2.3). These thresholds, along with the CV values of the individual classes, are shown in Figure 4.4. We note that, to compute δ_CS, we disregarded the CV associated with the General Literature class (id 0), since, as shown in Figure 4.4(b), it is significantly larger than the CVs of the other classes. We believe that, for an initial assessment, this decision might not significantly impair our analysis. The documents of class 0 were then assigned to the CS↑ partition.

Analyzing Figure 4.4, we can further understand why the ideal 2^3 design could not be conducted on the ACM-DL dataset. Let C_CD↑, C_CD↓, C_CS↑ and C_CS↓ denote the sets of classes in partitions CD↑, CD↓, CS↑ and CS↓, respectively. As we can observe, C_CD↑ ∩ C_CS↓ = ∅, whereas |C_CD↓ ∩ C_CS↑| = 1. As we need at least two classes in each partition to proceed with the classification task, there are not enough documents to fill all the cells of the ideal 2^3 r design. Figure 4.4 also shows that 3 out of the 4 classes with high CS also present high CD, and all classes with low CS also have low CD. In other words, there is a high correlation between these two factors, which supports our decision to ignore a possible interaction between them, decoupling the analysis into two separate 2^2 r factorial designs.

Figure 4.4: Determining the Lower and Upper Levels of CD and CS (ACM-DL). Panels: (a) Class Distribution Variation (CD), (b) Class Similarity Variation (CS).

Next, we further subdivide each CD-based document partition according to the TD factor, using the δ_TD threshold (Section 4.3.2.2). We do the same for the CS-based document partitions. We set δ_TD equal to the average DSL value across all documents in each partition. Figure 4.5 shows the distribution of DSL values and the δ_TD for each partition. Documents from the CD↓ (or CS↓) partition with DSL smaller than the corresponding δ_TD are assigned to the CD↓TD↓ (or CS↓TD↓) group, whereas those with DSL higher than δ_TD are assigned to the CD↓TD↑ (or CS↓TD↑) group. The same applies to the documents from CD↑ and CS↑.

Figure 4.5: Determining the Lower and Upper Levels of TD (ACM-DL). Panels: (a) low CD, (b) high CD, (c) low CS, (d) high CS.

Recall that a 2^k r design requires r replications to be performed for each configuration and, as discussed in the previous section, this can be achieved by employing either K-fold cross validation or repeated random sub-sampling. Due to the small size of the ACM-DL dataset and the use of sampling to isolate the sampling effect, we use the repeated random sub-sampling strategy, selecting 50% of the documents to compose the test set, with the remainder retained for the training set. We performed r = 50 replications.

Table 4.5 shows the results of both factorial designs (CD×TD and CS×TD) for each ADC algorithm (first column). For better presentation, we represent CD (CS) as A and TD as


B for the CD×TD (CS×TD) design. For each algorithm and design, the “%Var” row lists the percentage of variation in classification effectiveness that can be explained by each effect q_f (f ∈ {A, B, AB}) and by experimental errors (ε). Similarly, the “Mean” row denotes the estimated coefficients of the model, capturing the “average” impact of each factor: positive values indicate an increase in classification effectiveness and negative values indicate the opposite. Note that q_0 refers to the grand mean, computed over all observations. The “99% CI” rows report the 99% confidence intervals associated with the grand mean q_0 and each effect q_f (f ∈ {A, B, AB}). Intervals that include zero indicate a statistically non-significant impact of the associated factors. Finally, the “R^2” column reports the coefficient of determination of the proposed model: values close to 1 indicate a well-fitted model. Similar tables, referred to as ANOVA (ANalysis Of VAriance) tables, will be used to summarize the results obtained with the other datasets as well. We leave a detailed discussion of all results to Section 4.4.

Analysis of Variance (ANOVA). Model: y = q_0 + q_A x_A + q_B x_B + q_AB x_A x_B + ε

ADC      Design  Effects:  q_0             q_A               q_B               q_AB            ε      R^2
Rocchio  CD×TD   %Var      -               40.41%            53.83%            1.80%           3.96%  0.96
                 Mean      64.55           -10.38            -11.98            2.19            -
                 99% CI    [62.93, 66.17]  [-12.00, -8.76]   [-13.60, -10.36]  [0.57, 3.81]    -
         CS×TD   %Var      -               50.86%            40.97%            3.72%           4.45%  0.95
                 Mean      66.86           -7.95             -7.13             -2.15           -
                 99% CI    [65.69, 68.04]  [-9.12, -6.78]    [-8.31, -5.96]    [-3.32, -0.98]  -
KNN      CD×TD   %Var      -               48.61%            44.77%            2.17%           4.45%  0.95
                 Mean      65.40           -12.90            -12.38            2.73            -
                 99% CI    [63.45, 67.34]  [-14.84, -10.95]  [-14.32, -10.43]  [0.78, 4.67]    -
         CS×TD   %Var      -               50.64%            44.16%            2.26%           2.93%  0.97
                 Mean      65.83           -9.14             -8.54             -1.93           -
                 99% CI    [64.74, 67.34]  [-10.24, -8.05]   [-9.63, -7.44]    [-3.03, -0.84]  -
NB       CD×TD   %Var      -               42.92%            51.39%            1.70%           3.98%  0.96
                 Mean      63.56           -11.41            -12.48            2.27            -
                 99% CI    [61.83, 65.29]  [-13.14, -9.68]   [-14.21, -10.75]  [0.54, 4.00]    -
         CS×TD   %Var      -               34.38%            53.90%            3.84%           7.87%  0.92
                 Mean      62.50           -5.97             -7.47             -1.99           -
                 99% CI    [61.08, 63.92]  [-7.39, -4.55]    [-8.90, -6.05]    [-3.42, -0.57]  -
SVM      CD×TD   %Var      -               64.67%            33.48%            0.21%           1.64%  0.98
                 Mean      58.44           -17.16            -12.35            0.98            -
                 99% CI    [57.08, 59.80]  [-18.52, -15.80]  [-13.71, -10.98]  [-0.38, 2.34]   -
         CS×TD   %Var      -               61.51%            31.28%            3.99%           3.22%  0.97
                 Mean      60.98           -10.81            -7.71             -2.75           -
                 99% CI    [59.75, 62.21]  [-12.04, -9.58]   [-8.94, -6.48]    [-3.98, -1.52]  -

Table 4.5: Factorial Design applied to Rocchio, KNN, Naïve Bayes and SVM for ACM-DL (CD×TD design: A = CD and B = TD; CS×TD design: A = CS and B = TD).

4.3.3.2 MEDLINE

In order to build the four document partitions of each factorial design for the MEDLINE dataset, we follow the same strategy adopted for ACM-DL. First, we partition the documents regarding the CD and CS factors, setting δ_CD and δ_CS as the average CV measures computed for each factor. These partitions and corresponding thresholds are shown in Figure 4.6. As with the General Literature class in ACM-DL and the CS-based partition, we here choose to ignore the Aids class (id 0) to define δ_CD, assigning its documents to the CD↑ partition. We then further subdivide each of these partitions according to the TD factor, setting δ_TD as the average DSL of all documents in each partition, as depicted in Figure 4.7.

Figure 4.6: Determining the Lower and Upper Levels of CD and CS (MEDLINE). Panels: (a) Class Distribution Variation (CD), (b) Class Similarity Variation (CS).

Figure 4.7: Determining the Lower and Upper Levels of TD (MEDLINE). Panels: (a) low CD, (b) high CD, (c) low CS, (d) high CS.

Note that, according to Figure 4.6, C_CD↑ ∩ C_CS↓ = ∅ and |C_CD↓ ∩ C_CS↑| = 1. Thus, the argument for the unfeasibility of a three-factor experimental design applied to ACM-DL also holds for MEDLINE. However, the figure also shows that 3 out of the 4 classes with high CS also have high CD, and all classes with low CS also have low CD. Thus, once again, there is a high correlation between both factors, motivating our approach of decoupling the three-factor design into two independent two-factor analyses.

Since the MEDLINE dataset is quite large (over 800 thousand documents), we are able to replicate each experiment by performing a 10-fold cross validation, as the test set is sufficiently large to achieve stable results among the replications. The results achieved with both factorial designs (CD×TD and CS×TD), considering each ADC algorithm, are summarized in Table 4.6 and will be analyzed in Section 4.4.

4.3.3.3 AG-NEWS

Finally, the same overall procedure is also adopted to build the two 2^2 r experimental designs for AG-NEWS. We partition the documents with respect to the CD and CS factors, using δ_CD and δ_CS values equal to the corresponding average CV values, as shown in Figure 4.8. Once again, we choose to disregard the Top Stories class (id 10) from the δ_CD computation, as it



presents a much higher CV than the other 9 classes. We assign its documents to the CD↑ partition. Regarding the computation of δ_CS, Figure 4.8(b) shows that, unlike in the previous cases, not one but two classes, namely Top Stories (id 10) and Italia (id 9), have much larger average CV values. Instead of disregarding both measures, we adopt a different strategy to deal with these large deviations, smoothing their impact on the final computation. We first compute an average CV across classes 9 and 10; let it be CV_{9,10}. We then take as δ_CS the overall average computed over the average CVs of all remaining classes (shown in Figure 4.8(b)) and CV_{9,10}.

Similarly to the other two datasets, Figure 4.8 shows that the number of documents in AG-NEWS is not enough to fill all partitions of the ideal 2^3 experimental design, as |C_CD↓ ∩ C_CS↑| = 1. Thus, the lack of enough samples to fill the CD↓CS↑ partition prevents us from performing a complete three-factor design. The figure also shows that 2 out of the 3 classes with high CS also have high CD, and 3 out of the 4 classes with low CS also have low CD, indicating, once again, that both factors are very correlated.


Analysis of Variance (ANOVA). Model: y = q_0 + q_A x_A + q_B x_B + q_AB x_A x_B + ε

ADC      Design  Effects:  q_0             q_A               q_B               q_AB            ε      R^2
Rocchio  CD×TD   %Var      -               92.57%            5.77%             0.85%           0.80%  0.99
                 Mean      79.97           -6.28             -1.57             0.60            -
                 99% CI    [79.68, 80.26]  [-6.57, -5.99]    [-1.86, -1.28]    [0.31, 0.89]    -
         CS×TD   %Var      -               87.96%            11.15%            0.00%           0.89%  0.99
                 Mean      81.59           -4.41             -1.57             -0.01           -
                 99% CI    [81.37, 81.81]  [-4.63, -4.19]    [-1.79, -1.35]    [-0.23, 0.21]   -
KNN      CD×TD   %Var      -               72.24%            25.68%            0.24%           1.84%  0.98
                 Mean      84.95           -3.48             -2.08             0.20            -
                 99% CI    [84.67, 85.22]  [-3.76, -3.20]    [-2.35, -1.80]    [-0.08, 0.48]   -
         CS×TD   %Var      -               76.84%            20.17%            0.36%           2.63%  0.97
                 Mean      84.39           -3.86             -1.98             -0.27           -
                 99% CI    [84.03, 84.74]  [-4.21, -3.50]    [-2.33, -1.62]    [-0.62, 0.09]   -
NB       CD×TD   %Var      -               76.01%            22.69%            0.01%           1.28%  0.99
                 Mean      86.49           -4.18             -2.28             0.06            -
                 99% CI    [86.22, 86.76]  [-4.45, -3.91]    [-2.55, -2.01]    [-0.21, 0.33]   -
         CS×TD   %Var      -               62.93%            35.70%            0.17%           1.20%  0.99
                 Mean      87.67           -2.57             -1.94             -0.13           -
                 99% CI    [87.49, 87.85]  [-2.75, -2.39]    [-2.11, -1.76]    [-0.31, 0.04]   -
SVM      CD×TD   %Var      -               76.33%            22.45%            0.26%           0.96%  0.99
                 Mean      86.19           -4.75             -2.58             -0.28           -
                 99% CI    [85.92, 86.45]  [-5.02, -4.49]    [-2.84, -2.31]    [-0.54, -0.01]  -
         CS×TD   %Var      -               61.90%            35.14%            0.93%           2.03%  0.98
                 Mean      87.90           -2.68             -2.02             -0.33           -
                 99% CI    [87.66, 88.14]  [-2.92, -2.44]    [-2.26, -1.78]    [-0.57, -0.09]  -

Table 4.6: Factorial Design applied to Rocchio, KNN, Naïve Bayes and SVM for MEDLINE (CD×TD design: A = CD and B = TD; CS×TD design: A = CS and B = TD).

Figure 4.8: Determining the Lower and Upper Levels of CD and CS (AG-NEWS). Panels: (a) Class Distribution Variation (CD), (b) Class Similarity Variation (CS).

Figure 4.9: Determining the Lower and Upper Levels of TD (AG-NEWS). Panels: (a) low CD, (b) high CD, (c) low CS, (d) high CS.

We replicate each experiment by performing a 10-fold cross validation since, similarly to MEDLINE, AG-NEWS is also a very large dataset. Table 4.7 summarizes the results, which are discussed in the next section.



4.4 Discussion

Having presented our methodology to analyze the impact of temporal effects on ADC algorithms and illustrated its applicability to four algorithms and three reference datasets, we now discuss our results, reported in Tables 4.5-4.7. Recall that, when analyzing the results of a specific experimental design, the impact of each factor on the response variable is captured by the percentage of variation explained by it (“%Var” in the ANOVA tables). However, when comparing results across different designs, as we do here, it is also important to analyze the effects q_f associated with each factor f, and their relative impact on the grand mean q_0. Since the total variation of the responses (SST) may vary across different designs, the relative impact of each q_f on the grand mean q_0 allows a fairer comparison of the impact of each factor on the results across the designs. Ultimately, it represents the extent to which classification effectiveness improves or degrades, depending on the sign of q_f, when factor f is at its upper or lower level.


Analysis of Variance (ANOVA). Model: y = q_0 + q_A x_A + q_B x_B + q_AB x_A x_B + ε

ADC      Design  Effects:  q_0             q_A               q_B               q_AB            ε      R^2
Rocchio  CD×TD   %Var      -               55.56%            44.09%            0.30%           0.06%  0.99
                 Mean      77.12           -11.17            -9.95             0.82            -
                 99% CI    [76.94, 77.30]  [-11.35, -10.99]  [-10.13, -9.77]   [0.64, 1.00]    -
         CS×TD   %Var      -               51.11%            46.69%            2.10%           0.09%  0.99
                 Mean      76.83           -10.52            -10.05            2.13            -
                 99% CI    [76.61, 77.05]  [-10.74, -10.30]  [-10.28, -9.83]   [1.91, 2.35]    -
KNN      CD×TD   %Var      -               80.18%            18.56%            1.21%           0.06%  0.99
                 Mean      84.29           -11.24            -5.41             -1.38           -
                 99% CI    [84.14, 84.44]  [-11.38, -11.09]  [-5.55, -5.26]    [-1.53, -1.23]  -
         CS×TD   %Var      -               80.29%            18.63%            1.01%           0.06%  0.99
                 Mean      85.57           -10.43            -5.02             -1.17           -
                 99% CI    [85.43, 85.72]  [-10.57, -10.28]  [-5.17, -4.88]    [-1.32, -1.03]  -
NB       CD×TD   %Var      -               68.16%            31.78%            0.003%          0.06%  0.99
                 Mean      82.86           -10.12            -6.91             -0.07           -
                 99% CI    [82.71, 83.01]  [-10.27, -9.97]   [-7.06, -6.76]    [-0.22, 0.08]   -
         CS×TD   %Var      -               81.05%            18.67%            0.19%           0.09%  0.99
                 Mean      84.61           -10.71            -5.14             -0.53           -
                 99% CI    [84.42, 84.78]  [-10.89, -10.53]  [-5.32, -4.96]    [-0.70, -0.34]  -
SVM      CD×TD   %Var      -               80.98%            17.76%            1.15%           0.11%  0.99
                 Mean      85.91           -10.32            -4.83             -1.23           -
                 99% CI    [85.71, 86.10]  [-10.51, -10.12]  [-5.03, -4.64]    [-1.42, -1.03]  -
         CS×TD   %Var      -               83.37%            14.54%            2.00%           0.09%  0.99
                 Mean      87.75           -9.63             -4.02             -1.49           -
                 99% CI    [87.58, 87.90]  [-9.78, -9.47]    [-4.18, -3.86]    [-1.65, -1.33]  -

Table 4.7: Factorial Design applied to Rocchio, KNN, Naïve Bayes and SVM for AG-NEWS (CD×TD design: A = CD and B = TD; CS×TD design: A = CS and B = TD).

We start with two general observations. First, across all reference datasets and ADC algorithms, our experimental designs are successful in isolating the parameters that are the target of the study: the analyzed temporal effects explain the vast majority of the variations observed in the results. Indeed, the percentages of variation left unexplained, and thus credited to experimental errors (column ε), are under 8%, 3% and 1% for ACM-DL, MEDLINE and AG-NEWS, respectively. The larger variations left unexplained for the ACM-DL dataset are possibly due to the fact that this dataset is much smaller than the other two (small sample sizes incur greater variability). However, as we can observe in Table 4.5, the percentages of variation credited to experimental errors are inferior to the percentages credited to the temporal factors. Consistently, the coefficient of determination R^2 is above 0.95 in most cases.

Our second general observation is that the percentages of variation explained by the secondary factors (column q_AB), i.e., the interactions between CD and TD for the CD×TD design, and between CS and TD for the CS×TD design, are very small across all datasets and algorithms, falling below 4%, 1% and 2.2% for the ACM-DL, MEDLINE and AG-NEWS datasets. Indeed, the effect of this interaction is statistically insignificant, with 99% confidence, in many of these cases (see the “99% CI” rows). If significant, the effect associated with


the interaction is often negative, implying that it contributes to a degradation in classification effectiveness, although the magnitude of such degradation is very small (up to 4.51%, 0.37% and 1.70%, on average, with respect to the overall average performance reported in q_0, for ACM-DL, MEDLINE and AG-NEWS, respectively). In a few cases, the effect of the interaction is positive, implying that it actually contributes to improve classification effectiveness. We conjecture that this is a side effect of the interactions between the CD and CS factors that are not captured by our pairwise experimental designs. In other words, the positive interaction is possibly due to a few classes having CD↓ and CS↑ in all three datasets, as argued in Sections 4.3.3.1-4.3.3.3. Nevertheless, even when positive, the effect due to the interaction of multiple factors is very small, with an impact on the grand mean of at most 4.17%, 0.75% and 2.77%, on average, for ACM-DL, MEDLINE and AG-NEWS, respectively. Thus, we argue that the primary factors CD, CS and TD are the main sources of impact on classification effectiveness across all analyzed scenarios, and we focus our discussion on them.

In the following, we analyze specific results for each reference dataset, discussing the overall behavior observed across all ADC algorithms in Section 4.4.1. We then discuss the results for each specific ADC algorithm, pointing out invariants across datasets and drawing insights into the influence of the temporal effects on each algorithm in Section 4.4.2. Finally, in Section 4.4.3, we summarize the main implications of our findings.

4.4.1 Impact of Temporal Effects on the Reference Datasets

We start by analyzing the relative impact of the temporal factors (q_f) on the average effectiveness q_0 of the ADC algorithms in each dataset, given by the ratio q_f / q_0. As Tables 4.5-4.7 show, the effects associated with the temporal factors (i.e., columns q_A and q_B) represent an impact on the average effectiveness of the ADC algorithms (i.e., column q_0) that falls, on average, between 9.55% and 29.36% of q_0 in the ACM-DL dataset, 1.92% and 7.85% of q_0 in the MEDLINE dataset, and 4.58% and 14.48% of q_0 in the AG-NEWS dataset. Thus, the impact of the temporal effects on classification is much higher in the ACM-DL and AG-NEWS datasets than in the MEDLINE dataset. This observation is consistent with the characterization reported in Section 4.2, which points out a more stable behavior of MEDLINE in contrast to the more dynamic nature of ACM-DL and AG-NEWS. It is also consistent with the qualitative analysis reported in (Mourão et al., 2008), which showed that: (i) once a term appears, it tends to remain more stable over time in MEDLINE than in the other two datasets, thus implying a smaller impact of TD on classification of the former, and (ii) the more consolidated knowledge area captured in the MEDLINE dataset justifies the smaller impact of CD and CS on it.


To further corroborate such findings, we performed a two-sided Mann-Whitney test (Hollander and A., 1999) to compare the coefficients of variation (CVs) of class sizes (i.e., regarding CD) computed for the three reference datasets.⁷ Recall that the CV values are reported in Figures 4.4a, 4.6a and 4.8a for ACM-DL, MEDLINE and AG-NEWS, respectively. With 99% confidence, we found that the CVs of class sizes in the MEDLINE dataset are indeed smaller than those computed for the ACM-DL and AG-NEWS datasets (p-values of 0.001 and 0.005, respectively). Comparing the CV values computed for ACM-DL and AG-NEWS, we found that both samples are statistically indistinguishable (p-value of 0.24). Thus, we write CD_MEDLINE < CD_ACM-DL ∼ CD_AG-NEWS to refer to the relative impact of CD in each dataset.

⁷ We chose a nonparametric test since the CV values regarding the CD aspect are not normally distributed.

The same test was performed to compare the CVs of pooled similarities (i.e., related to CS), reported in Figures 4.4b, 4.6b and 4.8b. Once again, we found that the CV values computed for the MEDLINE dataset are smaller than those obtained for ACM-DL and AG-NEWS (p-values of 0.005 and 0.0006, respectively), whereas no statistical difference was observed between the values computed for ACM-DL and AG-NEWS (p-value of 0.15). Thus, we write CS_MEDLINE < CS_ACM-DL ∼ CS_AG-NEWS.
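For reference, the comparison just described can be reproduced with a short SciPy call (an added illustration; the CV samples and the significance level are placeholders):

    from scipy.stats import mannwhitneyu

    def compare_cv_samples(cv_dataset_a, cv_dataset_b, alpha=0.01):
        """Two-sided Mann-Whitney U test on two samples of per-class CV values."""
        stat, p_value = mannwhitneyu(cv_dataset_a, cv_dataset_b, alternative="two-sided")
        return p_value, p_value < alpha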

To compare the impact of TD on the three datasets, we show, in Figure 4.10, the empirical cumulative distribution of the observed document stability level (DSL) values, for each level of CD and CS and for each reference dataset. The curves for MEDLINE show a clear bias towards higher DSL values, thus indicating a smaller impact of TD. The curves for both ACM-DL and AG-NEWS exhibit a much stronger bias towards less stable documents, exposing the more dynamic nature of these datasets. We note that, for the MEDLINE dataset, the bias towards more stable documents is stronger for the CD↑ and CS↑ levels. In other words, the partitions with higher temporal variations in CD and CS tend to have more stable documents, in comparison with the partitions with lower variations in the two effects. This behavior is a peculiarity of the MEDLINE dataset, not being observed in either the ACM-DL or the AG-NEWS dataset. Once again, we applied the Mann-Whitney test, finding that the DSL values are indeed larger for MEDLINE than for ACM-DL and AG-NEWS (p-values smaller than 10^-5), and that DSL values are larger in ACM-DL than in AG-NEWS (p-value < 10^-5). Thus, we write TD_MEDLINE < TD_ACM-DL < TD_AG-NEWS.

Figure 4.10: Cumulative Distribution Function of Document Stability Level Values. Panels: (a) low CD, (b) high CD, (c) low CS, (d) high CS.

Having compared the relative impact of each temporal aspect across the three datasets, we now analyze the relative impact of the three temporal effects within each dataset. For the ACM-DL dataset, we cannot distinguish, with 99% confidence, the relative impact of CD (or CS) from the relative impact of TD on most ADC algorithms. One exception is observed when SVM is used, for which the impact of TD is smaller than the impact of the other two


effects. Indeed, the effect associated with TD is 38.95% (40.21%) smaller than the effect of CD (CS). Another exception is Naïve Bayes, for which the effect associated with TD is statistically different from, and somewhat larger (25.13%) than, the effect of CS. We reach this finding by analyzing the 99% confidence intervals for the effects of each factor (the “99% CI” rows in Table 4.5), which show statistical ties between q_TD and q_CD (or q_CS) for the Rocchio and KNN classifiers, as well as a statistical tie between q_TD and q_CD for Naïve Bayes. In contrast, for MEDLINE and AG-NEWS, the impact of TD is consistently lower than the impact of CD (or CS) on all four algorithms. For instance, in the case of MEDLINE, the impact of TD is four times smaller than the impact of CD, and almost two times smaller than the impact of CS, considering the Rocchio classifier. In the case of AG-NEWS there is also a pronounced skew between the impact of TD and the impact of CD/CS, with cases where the impacts of CD and CS are almost double the impact of TD. Similar conclusions are reached by analyzing the percentages of variation explained by each individual factor. These findings reveal the challenges imposed by the temporal effects, and developing strategies to handle them in ADC algorithms emerges as a promising research direction.

Note that, except for Naïve Bayes on the ACM-DL dataset, CD's impact on classification is higher than TD's impact if and only if CS's impact is also higher than TD's. This should come as no surprise given the strong positive correlation between both factors, as discussed in Section 4.3.3. Temporal variations in class sizes directly impact temporal variations in class vocabularies, and ultimately the similarities across classes. For instance, if a class increases in size with time, the number of candidate terms in its vocabulary may also increase, causing more variation in its similarities with the remaining classes. Thus, temporal variations in class distribution contribute to variations in class similarities, justifying the high correlation between both factors.

4.4.2 Impact of Temporal Effects on the ADC Algorithms

We now turn our attention to the impact of the temporal effects on the ADC algorithms. As we can observe from the three ANOVA tables, all three factors have negative effects (columns q_A and q_B) in all analyzed scenarios, implying that all explored ADC algorithms are negatively impacted by the temporal effects in all three datasets. In fact, relative to the overall average performance (q_0), the effect of CD contributes to an average decrease in classification effectiveness of as much as 29.36% (for the SVM classifier). Similarly, higher levels of CS and TD contribute to a classification degradation of as much as 17.73% and 21.13% (also for the SVM classifier), on average. Moreover, the degradation is more significant for the reference datasets in which the impact of the temporal effects is stronger, i.e., ACM-DL



and AG-NEWS, as expected. In the following, we discuss the impact on each specific algorithm, focusing on the results for the ACM-DL dataset (Table 4.5), as it is the one most influenced by all three temporal effects.

Starting with the Rocchio classifier, we observe that all three temporal effects greatly impact classification effectiveness, with more than 40% of the observed variation allocated to each of them in both experimental designs. Indeed, the factors contribute to a significant degradation in the overall classification effectiveness in each design. For instance, in the CD×TD design, a higher level of CD incurs an average degradation of 16.08% in the average performance, whereas the degradation caused by a higher level of TD is 18.56%, on average. Similarly, in the CS×TD design, the corresponding degradations due to higher levels of CS and TD are 11.89% and 10.66%, on average. The reasons for such a significant impact on Rocchio's performance are the following. CD and CS affect the coordinates of the centroids learned by the Rocchio classifier: as Miao and Kamel (2011) pointed out, the centroid vector does not take the distribution of class sizes into account, and thus may be affected by varia-


tions in such distribution. Since the distribution of class sizes in the entire dataset may not be the same as the corresponding distribution observed when the test document was created, the classifier's prediction may be error prone. TD also significantly affects this classifier since, when averaging the vectors of each class to compute the class centroids, it considers all training points to determine the class of a test document. Thus, it may be affected by the variations in the term-class relationships.

Similarly, both the KNN and Naïve Bayes classifiers are also greatly impacted by the three temporal effects, with more than 44% and 34%, respectively, of the total variation allocated to each of them. Indeed, both classifiers have a bias regarding the distribution of class sizes. In the KNN classifier, larger classes tend to have more documents in the K-neighbor set of each test document (Tan, 2005). The Naïve Bayes classifier, in turn, tends to privilege larger classes due to the class prior probability expressed in Equation 4.1: when the class sizes are similar, this classifier uses a priori information to break ties, being directly affected by variations in the distribution of class sizes. Thus, similarly to Rocchio, CD affects both classifiers' decision boundaries: since the distribution of class sizes considering the entire dataset may not reflect the distribution when the test document was created, both classifiers may make wrong predictions. In fact, the average degradations incurred by the CD effect in KNN and Naïve Bayes are of 19.72% and 17.95% in the average response, respectively.

Moreover,CS affects the KNN classifier (with an average decrease of13.88% in the aver-

age performance) because it directly perturbs the K nearestneighbor set, that is, because of

differences in the pairwise class similarities, this set may be composed by classes that were

similar in different points in time. Naïve Bayes, in turn, isaffected byCS (with an aver-

age decrease of9.55% in the average performance) as this classifier considers a somewhat

local metric to assess the relationships between terms and classes, expressed by the term

conditional probabilityP (t|c). As discussed in Section4.3.2.3, estimatingP (t|c) ultimately

searches a subset of the class vocabularies and, when vocabularies change over time the de-

cision rule also changes. Finally,TD significantly impacts both classifiers because both of

them consider the terms present in all training points to determine the decision boundaries.

In KNN, the impact ofTD takes place when building theK-neighbor set, while in Naïve

Bayes such impact occurs when estimating the maximum likelihood estimates (MLE) from

the training set—more specifically, the term conditional probabilitiesP (t|c). Thus, both

classifiers are also sensitive to variations in the term-class relationships.

The reader should note the peculiar behavior of the Naïve Bayes classifier in the ACM-DL dataset regarding the impacts of the TD and CS effects. Unlike the other three classifiers, the degradation caused by the TD effect is somewhat larger (25.13%) than the degradation incurred by the CS effect. This can be explained by the nature of this classifier. Recall that Rocchio, KNN and SVM are examples of discriminative classifiers, which directly


model P(c|d) by learning the class boundaries that minimize the error rate (or some correlated measure), ultimately discriminating between classes (that is, they learn class borders). On the other hand, as a generative classifier, Naïve Bayes learns P(c|d) indirectly, by applying Bayes' rule and estimating both P(c) and P(d|c) (Rasmussen and Williams, 2006). In fact, the class conditional P(d|c) is influenced by variations in the term-class relationships (hence, by the TD and CS effects). However, as discussed in Section 4.3.2.3, unlike the TD effect, the CS effect is bounded by the variations in the relationships between the most informative terms and the classes as time goes by. Recall that the vocabulary Vc,p denotes the set of terms that occurred in class c at the point in time p. Let Vc be the set of all terms that occurred in class c, irrespective of the point in time (that is, the class vocabulary). Since Vc,p is smaller than Vc, it is expected that in the generative case the influence of TD dominates the influence of CS. For the discriminative case, however, minimizing the error rate (or some correlated measure) bounds, to a certain extent, the influence of the variations in the term-class relationships to the most discriminative terms. Thus, the discriminative classifiers are naturally less sensitive to the TD effect than to the CS effect.

Considering SVM, both CD and CS explain, each, more than 60% of the variations in classification effectiveness in both experimental designs. TD, in contrast, is responsible for at most 33% of the explained variation. Thus, CS and CD do have a more significant impact on SVM's effectiveness than TD. The reasons for this are the following. First, variations in the distribution of class sizes lead to boundary hyperplane skewness (see Sun et al., 2009), potentially misleading the classification decisions when considering data distributed over several points in time with changing distribution. Due to the KKT condition expressed by Equation 4.5, an increase of some αi at the positive side of the hyperplane forces an increase of some αi at the negative side to satisfy that constraint and, due to possible imbalances in the distribution of class sizes, either of them may receive a higher value, causing the hyperplane to be skewed towards the smaller class. Thus, clearly, CD does have a strong impact on this classifier, and so does CS, given that the two effects are strongly correlated. In contrast, SVM has a natural robustness to the TD aspect: only the support points are taken into account during the test phase to determine the classes, thus ultimately limiting the impact of TD during the classification process. As expressed by Equation 4.4, only the points with αi > 0 (the support points) affect the decision rule, this being the only step in which TD impacts the classification process.

Turning our attention to the results for MEDLINE and AG-NEWS, reported in Tables 4.6 and 4.7, we find that, in both datasets, the impact of CD and CS on all four ADC algorithms is consistently higher than the impact of TD. This should come as no surprise since, as discussed in Section 4.4.1, these datasets are more influenced by the CD/CS effects than by TD.


We summarize our findings on the impact of the temporal effects on the four ADC algorithms in Table 4.8, which shows a partial ordering of the algorithms with respect to the average impact of each temporal effect, for each dataset. This ordering is determined by taking the overall average performance of each algorithm (q0) as baseline, analyzing the effect associated with each factor qf (along with its corresponding 99% confidence interval), and its relative difference to the overall average. As we can see, the Rocchio classifier is the most affected by all three effects in both MEDLINE and AG-NEWS. SVM is also very affected, particularly in ACM-DL and MEDLINE, with both CD and CS impacting SVM more than the other three algorithms in ACM-DL. The impact of TD, on the other hand, is approximately the same on all four algorithms in ACM-DL. These relationships reinforce that, apart from being negatively impacted by all three temporal effects, the four explored ADC algorithms exhibit distinct behavior when faced with datasets with specific temporal dynamics, as revealed by the conducted factorial designs.

Temporal Effect | ACM-DL               | MEDLINE              | AG-NEWS
CD              | SVM > NB ∼ KNN ∼ RO  | RO > SVM > NB > KNN  | RO ∼ KNN > SVM ∼ NB
CS              | SVM > KNN ∼ RO > NB  | RO > SVM ∼ NB > KNN  | RO ∼ KNN ∼ NB > SVM
TD              | SVM ∼ KNN ∼ RO ∼ NB  | SVM > RO ∼ NB ∼ KNN  | RO > NB > KNN > SVM

Table 4.8: A Comparative Study on the Impact of the Temporal Effects on each ADC Algorithm—Rocchio (RO), SVM, Naïve Bayes (NB) and KNN.

4.4.3 Implications

The analyses performed in the previous sections provide some general guidelines regarding the definition of requirements for strategies to deal with temporal effects in ADC. First, these strategies should consider stable terms, since they untangle some latent structural properties of the classes. It may be tempting to consider only training documents created at (or near) the creation time of the test document (a window-based approach) to define class boundaries. However, such a strategy may not be a wise choice, since it can lead to data sparseness problems. Moreover, it may also discard valuable information regarding stable terms occurring in training documents created at points in time other than the test document's creation time—which may reveal discriminative evidence about the classes' structural properties. Even when considering training documents created at different points in time with respect to the test document, stable terms can still provide valuable information to the classifier. Such information, however, is neglected when adopting window-based strategies. This motivates


the use of instance weighting strategies, especially when dealing with more stable datasets, such as MEDLINE. However, in order for this strategy to be successful, the weighting function must capture the underlying process that guides the temporal evolution of the dataset. Furthermore, not only the stability of the terms over time should be explored, but also the variations in the distributions of class sizes and class similarities.

Similarly, the proposed methodology to evaluate the impact of the temporal effects on classification effectiveness provides valuable insights to better understand the behavior not only of the considered ADC algorithms when faced with these effects but also of strategies aimed at overcoming them. This aspect will be taken into account when analyzing the behavior of the temporally-aware algorithms proposed in Chapter 5. In fact, the analyses and methodologies performed here now allow us to have a deeper understanding of the results reported in Section 5.4.

4.5 Chapter Summary

In this chapter, we proposed a methodology, based on a series of full factorial designs, to evaluate the impact of temporal aspects on ADC algorithms when applied to several reference datasets. First, we extended the characterization performed by Mourão et al. (2008), providing evidence of the existence of three temporal aspects in three textual datasets, namely ACM-DL, MEDLINE and AG-NEWS. Then, we instantiated the methodology to quantify the impact of the temporal aspects on the classification effectiveness of four well-known ADC algorithms, namely Rocchio, KNN, Naïve Bayes and SVM.

Our characterization results show that, contrary to the assumption of a static data distribution on which all four explored algorithms are based, each reference dataset has a specific temporal behavior, exhibiting changes in the underlying data distribution with time. Such temporal variations potentially limit the classification performance. According to our results, the ACM-DL and AG-NEWS datasets are much more dynamic than the MEDLINE dataset, resulting in the four explored ADC algorithms being more impacted by the temporal aspects in the first two datasets. In addition to these findings, our proposed methodology enabled us to quantify the impact of each temporal aspect on the analyzed datasets and algorithms, allowing us to answer the following two questions, posed in this chapter:

1. Which temporal effects are more representative in each dataset? In the ACM-DL dataset, the impact of the observed temporal variations in the distribution of class sizes and in the pairwise class similarities is statistically equivalent to the impact of the observed variations in the term distribution on most classifiers (SVM being an exception). MEDLINE and AG-NEWS, on the other hand, are clearly more impacted


by the first two temporal aspects. These findings reveal the challenges imposed by the temporal effects and indicate that developing strategies to handle them in ADC algorithms is a promising research direction.

2. What is the behavior of each ADC algorithm when faced with different levels of each temporal aspect? All four explored ADC algorithms suffer a negative impact from the temporal aspects in terms of classification effectiveness, with the most significant impacts observed when these algorithms are applied to the most dynamic datasets (i.e., ACM-DL and AG-NEWS). The SVM classifier was shown to be more robust to the term distribution aspect, while still being impacted by the other two aspects. The other three algorithms, on the other hand, are very sensitive to all three aspects. Thus, the temporal dimension turns out to be an important aspect that has to be considered when learning accurate classification models.


Chapter 5

Temporally-Aware Algorithms for Automatic Document Classification

In this chapter we are particularly concerned with how to minimize the impact that temporal effects may have on ADC algorithms, in light of the lessons learned in Chapter 4. As previously discussed, the temporal dynamics of data, reflected by the quantified temporal effects, may violate the common assumption of stationary data distributions, limiting the performance of ADC algorithms. As we have shown, all three explored textual datasets present varying class distributions, along with varying term distributions and pairwise class similarities, to differing extents. Moreover, the analyzed ADC algorithms had their effectiveness hindered by these variations (again, to different degrees). To address this issue, we propose strategies to devise temporally-aware ADC algorithms. Recall that the class distribution variation relates to the observed variations over time in the representativeness of classes, whereas the term distribution variation and the class similarity variation effects relate to the observed variations over time in the term-class relationships and to variations in the pairwise class similarities, respectively. Similarly to the strategies adopted to isolate each factor in the experimental designs described in Section 4.3, here the first effect is addressed on a finer-grained basis, at the document level, while the other two effects are handled at the collection level.

In order to incorporate temporal awareness into document classifiers, we introduce a weighting function that we call the temporal weighting function—or simply TWF—aimed at addressing the previously quantified temporal effects. This weighting function is modeled according to the observed evolution of the term-class relationships over time, captured by a metric of dominance (see Section 4.3.2). We start by determining the temporal weighting function for a collection according to its characteristics—instead of considering simple ad-hoc functions based on the document's age, as done in previous work (see Klinkenberg 2004; Koychev 2000). Towards this end, we offer a modeling framework that enables us to conduct


a series of statistical tests in order to derive a function that effectively models the underlying process governing the dynamical nature of the datasets. For example, as we shall see, this function follows a lognormal distribution for the ACM-DL and MEDLINE datasets.

As will be described in Section 5.1, deriving the TWF demands some statistical procedures that may not be suitable for a practitioner to perform, due to the diversity and sophistication of the tests that may be needed to determine its expression. As reported in Section 5.1, the widely used procedures for independence and normality tests of random variables failed when applied to the AG-NEWS dataset, since its TWF does not follow a Gaussian process, even in the log-transformed space. In this case, some other (possibly more complex) tests should be performed, and this may be prohibitively hard for a practitioner (who typically aims to achieve highly accurate classification, without regard to the properties of the function that underlies the data variation process—which are, by the way, reflected by its expression and parameters), hurting the practical applicability of the proposed framework. Furthermore, automatic data-mining processes focused on classification may need fully-automated ways to determine the TWF. As a matter of fact, for the sake of temporally-aware ADC, one just needs to know the positive real-valued weights associated with each temporal distance and, while the proposed statistical framework is able to uncover the properties of the function that underlies the data variations, it may not be applicable in the two aforementioned scenarios, so strategies to overcome this are desirable. To cover these practical scenarios, we propose an automatic approach, which overcomes the need to perform the required statistical tests, in which the ADC algorithms themselves are employed to derive the TWF.

The final step is to incorporate the TWF into the ADC algorithms. We propose three strategies for doing so, where the weights assigned to each example depend on the notion of a temporal distance δ. The temporal distance is defined as the difference between a point in time p, in which a training example was created, and a reference point in time pr, in which the test example was created. Such weights reflect the observed variability in the data distribution, as captured by the TWF. The first strategy, named temporal weighting in documents, weights training instances according to δ, ultimately addressing the explored term distribution variation (TD) effect—since the TWF considers the observed variations over time in the term-class relationships. However, as we have shown, the class distribution variation (CD) and the class similarity variation (CS) effects also play an important role for some datasets, and we should minimize their impact. Towards this end, we propose a second strategy, called temporal weighting in scores, which is based on a mapping in the class domain c ↦ ⟨c, p⟩, in which the training documents' classes are mapped to derived classes ⟨c, p⟩, where c denotes the actual document's class and p denotes its creation point in time. In this case, the scores (for example, similarities, probabilities) learned by a traditional


classifier applied to the modified training set are weighted according to δ = p − pr, where p is the point in time associated with the derived class ⟨c, p⟩ and pr is the point in time in which the test document was created—that is, score_c = Σ_{p∈P} score_{c,p}(c, p) · TWF(δ). The combined weighted scores for each class are then used to make the final classification decision. This strategy is motivated by the fact that, when considering each point in time in isolation, we ultimately isolate the temporal effects. But, since it is not always plausible to consider just the documents created at the reference point in time pr (due to data scarcity), we aggregate the obtained scores for each point in time, using the TWF to account for the variations in the term-class relationships (the only potential effect reflected when aggregating the intermediate scores). However, this strategy has a somewhat undesirable property related to the mapping c ↦ ⟨c, p⟩. As the number of documents from the derived class ⟨c, p⟩ is typically much smaller than the number of documents belonging to class c, the class imbalance artificially increases. Since class imbalance may be harmful for document classifiers (Chen et al., 2011), we propose a third strategy aimed at ameliorating this, namely, the extended temporal weighting in scores. In this strategy the training set D is partitioned into subgroups of documents Dp with the same creation point in time p. Then, a classification model is built based on each Dp in isolation. The classes' scores are then produced for each Dp and, as before, they are aggregated using the TWF to weight them. By construction, the class imbalance problem is bounded by the imbalance observed in the class distribution of Dp, which is usually smaller.

The three proposed strategies were implemented in three ADC algorithms (Rocchio, K Nearest Neighbors (KNN), and Naïve Bayes) and were evaluated using the three textual datasets described in Section 4.1 (ACM-DL, MEDLINE and AG-NEWS). As we shall see, we achieved significant improvements in classification effectiveness for all classifiers. For instance, the temporally-aware version of Naïve Bayes outperformed the state-of-the-art classifier (SVM) by up to 10%.

This chapter is organized as follows: In Section 5.1 we introduce the temporal weighting function and describe a methodology to determine its expression and its parameters. In Section 5.2 we propose a strategy to automatically determine the TWF, without the need to perform any kind of statistical testing. In Section 5.3 we describe our extensions to traditional ADC algorithms, in order to incorporate the TWF into them. We report the experimental evaluation performed to assess the benefits of considering the temporal dimension in ADC algorithms in Section 5.4. Finally, in Section 5.5 we summarize our findings.


5.1 Temporal Weighting Function

The impact that certain temporal effects have on term-class relationships may greatly influence the results of the classification process, as characterized in Chapter 4. Thus, incorporating information about these changes into the ADC algorithms has the potential to improve their effectiveness.

We address this issue through a temporal weighting function (TWF) that quantifies the influence of a training document while classifying a test document, as a function of the temporal distance between their creation times. We distinguish two major steps in defining such a function: determining its expression and its parameters. The expression is usually harder to determine, since it may express the generative process behind the function, that is, the fundamental properties of the data variation phenomena—which can be smooth (possibly following some linear nature), abrupt (maybe with some exponential behavior) or even periodic—while the parameters are usually obtained using approximation strategies.

Intuitively, given a test document to be classified, the TWF must assign higher weights to training documents that are more similar to that test document with respect to the strength of term-class relationships. As described in Section 4.3.2, one metric that expresses such strength is the dominance, since the more exclusive a term is to a given predefined class, the stronger this relationship. The simplest approach to model the function that governs such variations would be to use a unit pulse function ⊓(δ) at temporal distance 0:

\[
\sqcap(\delta) =
\begin{cases}
\alpha & \text{if } \delta = 0,\\
0 & \text{if } \delta \neq 0,
\end{cases}
\]

with the pulse magnitude α proportional to the observed term dominance associated with the training documents created at the same point in time as the test document. However, as argued in Section 4.4, considering a larger time interval instead of a single point in time is better, since it better handles smooth data variations and does not discard potentially useful information regarding stable terms. We then need to determine the time period that must be considered when modeling the underlying data variations, and this can be accomplished with the notion of stability period described in Section 4.3.2.2 (see Definition 1).

Recall that the stability period St,pr of a term t, considering the reference point in time pr in which the test document was created, consists of the largest continuous period of time, starting from pr and growing both to the past and the future, where Dominance(t, c) > α (for some predefined α and any class c). In the case of the explored datasets, we investigated different values for α when computing stability periods and, as they led to similar results, we adopted α = 50%, ensuring that the terms will have a high degree of exclusivity with some class.

Notice that the stability period of a term depends on a reference point in time and thus a term may present different stability periods, one for each point in time in which it occurred in the dataset. We first determine the stability period for each term and then combine them, as follows. In order to handle such a situation, we mapped all the time points in a stability period to temporal distances, where the reference year is considered as distance 0. For instance, a term t1 may have different stability periods when considering the years 1989 or 2000 as a reference. More specifically, if the stability period of t1 is {1999, 2000, 2001} regarding pr = 2000, and {1988, 1989, 1990} regarding pr = 1989, both periods would be mapped to {-1, 0, 1}. Considering S′t as the set of temporal distances that occur in the stability periods of term t (considering all reference moments pr), then S′t = {δ = pn − pr | pr ∈ P and pn ∈ St,pr}. Making the stability periods easily comparable is important because our real interest is to know what kind of distribution the temporal distances follow with respect to different terms.
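To make this mapping concrete, the short Python sketch below collapses per-reference stability periods into a multiset of temporal distances; the toy data structure and function name are illustrative assumptions, not the dissertation's actual code.

```python
from collections import Counter

def temporal_distances(stability_periods):
    """Map stability periods anchored at different references to distances.

    `stability_periods` maps a reference point in time p_r to the points in
    time p_n that belong to S_{t, p_r}; each distance delta = p_n - p_r is counted.
    """
    distances = Counter()
    for p_r, period in stability_periods.items():
        for p_n in period:
            distances[p_n - p_r] += 1
    return distances

# Toy example mirroring the text: term t1 with two reference years.
s_t1 = {2000: [1999, 2000, 2001], 1989: [1988, 1989, 1990]}
print(temporal_distances(s_t1))  # Counter({-1: 2, 0: 2, 1: 2})
```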

The next step is to determine the function expression and, towards this goal, we considered the stability period of each term as a random variable (RV), where the occurrence of each possible temporal distance in its stability period is an event. More formally, as Table 5.2 shows, we are interested in the frequencies of the temporal distances δ1 to δn, for terms t1 to tk. An interesting property that we may test is whether these RVs are independent. This hypothesis can be tested with Fisher's Exact Test (Clarkson et al., 1993) to assess the independence of each RVi and RVj, for all i ≠ j, where, as mentioned, each RV represents the occurrence of a temporal distance δ for a term t.

We applied this test to the three reference datasets, ACM-DL, MEDLINE and AG-NEWS. For the first two, we obtained a p-value of 0.99 through a Monte Carlo simulation, which gives no evidence against independence and allows us to treat their associated random variables as independent. Thus, the observed variability of occurrences of δ for different terms is a result of independent effects (Limpert et al., 2001). For the AG-NEWS dataset, this independence does not hold, as indicated by the low p-value obtained (10^-4), and some other hypothesis should be tested. This highlights the difficulty faced when defining the function that best models the varying behavior of the data at hand, motivating the development of a fully-automated strategy that overcomes the need to explicitly determine the TWF expression. As a matter of fact, one can afford to avoid the explicit determination of the TWF expression and parameters since, for the sake of temporally-aware ADC, just the TWF image matters (that is, the weights associated with each temporal distance). Clearly, avoiding the determination of the TWF expression and parameters comes at the cost of missing the opportunity to discover the TWF's properties—which are revealed when determining its expression and parameters. With this trade-off in mind, in Section 5.2 we describe a fully-automated strategy to determine the weights of each temporal distance.

Turning our attention to the ACM-DL and MEDLINE datasets (which passed the independence tests), it is still not clear whether the effects responsible for the observed variability in the temporal distance distribution Dδ are additive (leading to a normal distribution) or multiplicative (leading to a lognormal distribution). In Figure 5.1 we show the Dδ distribution, scaled to the [0, 1] interval. We then apply a statistical normality test to both the original and the log-transformed distribution. According to D'Agostino's D-Statistic Test of Normality (D'Agostino R.B., 1973), with 99% confidence, we found that the lognormal distribution best fits both the ACM-DL and MEDLINE collections, as presented in Table 5.1.

[Figure 5.1: Dδ Distribution (Scaled to the [0, 1] Interval). Panels: (a) ACM-DL Dataset, (b) MEDLINE Dataset.]

Data            | ACM-DL    | MEDLINE
Original        | 4.497e-6  | 0.002762
Log-Transformed | 0.2144*   | 0.6802*

Table 5.1: D'Agostino's D-Statistic Test of Normality (p-values). Starred entries mark the tests for which we cannot reject the null hypothesis of normality.
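As a hedged illustration of this check, the sketch below applies SciPy's D'Agostino–Pearson normality test (scipy.stats.normaltest) to a synthetic lognormal sample standing in for Dδ; the data and names are assumptions for demonstration, not the thesis' actual distributions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-in for D_delta: lognormal samples, so the raw data should
# fail the normality test while the log-transformed data should pass it.
d_delta = rng.lognormal(mean=0.0, sigma=0.5, size=2000)

for label, sample in [("original", d_delta), ("log-transformed", np.log(d_delta))]:
    stat, p_value = stats.normaltest(sample)  # D'Agostino-Pearson omnibus test
    verdict = "cannot reject normality" if p_value > 0.01 else "reject normality"
    print(f"{label:16s} p-value = {p_value:.4g} -> {verdict} at the 99% level")
```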

Consider that the distribution Dδ, related to the occurrences of the temporal distances δ in the stability periods and representing the distribution of each δi over all terms t, is lognormally distributed if ln Dδ is normally distributed. More generally, since the temporal distances δi are RVs under the independence assumption with finite mean and variance, then, by the Central Limit Theorem, ln Dδ = Σ_{i=1}^{n} ln δi will asymptotically approach a normal distribution and, by definition, Dδ converges to a lognormal distribution (Crow EL, 1988).


     | t1  | t2  | ... | tk  | Dδ
δ1   | f11 | f12 | ... | f1k | Σ_{i=1}^{k} f1i
δ2   | f21 | f22 | ... | f2k | Σ_{i=1}^{k} f2i
...  |     |     |     |     |
δn   | fn1 | fn2 | ... | fnk | Σ_{i=1}^{k} fni

Table 5.2: Temporal Distances versus Terms.

For a lognormal distribution, the asymptotically most efficient method for estimating its associated parameters relies on a log-transformation (Limpert et al., 2001). Using a Maximum Likelihood method, we estimated those parameters for both collections and then back-transformed them, as shown in Table 5.3. We considered a 3-parameter Gaussian function,

\[
F = a_i \, e^{-\frac{(x - b_i)^2}{2 c_i^2}},
\]

where the parameter a_i is the height of the curve's peak, b_i is the position of the center of the peak, and c_i controls the width of the curve. The last one, also called the shape parameter, reflects the nature of the variations of term-class relationships over time. Since abrupt or smooth variations lead to smaller or larger stability periods, respectively, the shape of the distribution changes accordingly, and capturing such distinct natures is a matter of parameter estimation. We performed two curve fitting procedures, considering a single Gaussian F and a mixture of two Gaussians, given by G = G1 + G2, where each Gi denotes a Gaussian function. The latter was the model that best fitted Dδ, and its parameters are presented in Table 5.3, along with the goodness-of-fit measure R². The R² measure denotes the percentage of variance explained by the model and, for both collections, the obtained model explains 99% of such variance.

Parameter |        ACM-DL            |        MEDLINE
          | Value  | Conf. Interval  | Value  | Conf. Interval
a1        | 0.325  | (0.288, 0.362)  | 0.089  | (0.066, 0.113)
b1        | -0.028 | (-0.309, 0.253) | -0.013 | (-0.349, 0.324)
c1        | 3.636  | (3.117, 4.154)  | 1.635  | (1.099, 2.17)
a2        | 0.616  | (0.589, 0.643)  | 0.901  | (0.891, 0.911)
b2        | 0.037  | (-0.395, 0.470) | 0.092  | (-0.130, 0.314)
c2        | 20.14  | (20.93, 23.35)  | 24.51  | (23.71, 25.3)
R²        | 0.990  |                 | 0.992  |

Table 5.3: Estimated Parameters for Both Datasets, with 99% Confidence Intervals.
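For concreteness, the sketch below shows one way such a two-Gaussian mixture could be fitted with SciPy's curve_fit; the synthetic data, initial guesses and variable names are illustrative assumptions, not the dissertation's actual fitting pipeline.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, b, c):
    """Three-parameter Gaussian: peak height a, center b, width c."""
    return a * np.exp(-((x - b) ** 2) / (2.0 * c ** 2))

def two_gaussians(x, a1, b1, c1, a2, b2, c2):
    """Mixture G = G1 + G2 used to model the D_delta distribution."""
    return gaussian(x, a1, b1, c1) + gaussian(x, a2, b2, c2)

# Synthetic stand-in for the (scaled) D_delta values over temporal distances.
x = np.linspace(-30, 30, 121)
rng = np.random.default_rng(0)
y = two_gaussians(x, 0.3, 0.0, 3.5, 0.6, 0.0, 20.0) + 0.01 * rng.normal(size=x.size)

p0 = [0.5, 0.0, 2.0, 0.5, 0.0, 15.0]               # rough initial guesses
params, _ = curve_fit(two_gaussians, x, y, p0=p0)

residuals = y - two_gaussians(x, *params)
r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
print("fitted parameters:", np.round(params, 3))
print("R^2:", round(r_squared, 3))
```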

The greater the frequency of δ in stability periods, the more suitable training


documents created at temporal distance δ are for building an accurate classification model, making the modeling of the Temporal Weighting Function as a lognormal distribution an effective strategy.

Figure 5.2 shows the distribution of temporal scores for each possible temporal distance between the creation time of test document d′ and the training documents, for the ACM-DL and the MEDLINE datasets.

[Figure 5.2: Fitted Temporal Weighting Function with Log-Transformed Data. Panels: (a) ACM-DL Dataset, (b) MEDLINE Dataset.]

5.2 Fully-Automated TWF Definition

Clearly, determining the expression and parameters of a function that effectively models the underlying data variations is an important task, since it reveals the properties of the data variations and offers substantial knowledge that can be exploited towards the development of accurate classification models. However, defining the TWF demands some statistical


procedures that may not be suitable for a practitioner to perform, due to the diversity and sophistication of the tests that may be needed to define its expression. As reported in Section 5.1, the most straightforward procedures for independence testing of random variables failed when applied to the AG-NEWS dataset. Thus, unlike the TWFs associated with the ACM-DL and MEDLINE datasets, the AG-NEWS TWF does not follow a Gaussian process, and some other (possibly more complex) tests should be performed to assess its expression. However, this may be prohibitively hard for a practitioner (who typically aims at achieving highly accurate classification without regard to the properties of the function that underlies the data variation process—which are, by the way, reflected by its expression and parameters), hurting the practical applicability of the proposed framework to devise the TWF and, consequently, the applicability of the temporally-aware classifiers described in Section 5.3. Furthermore, automatic data-mining processes focused on classification may need automatic ways to determine the TWF. As a matter of fact, for the sake of temporally-aware ADC, one just needs to know the positive real-valued weights associated with each temporal distance. While the proposed statistical framework is able to uncover the properties of the function that underlies the data variations, it may not be applicable in the two aforementioned scenarios, and strategies to overcome this issue are desirable.

To cover such practical scenarios, in this section we describe a technique to automatically determine the TWF, without the need to perform any statistical test. Hence, we describe a straightforward and suitable way for a practitioner, or for some other automated data-mining process, to devise the TWF. Our goal is thus to develop a procedure which, given a set of already classified documents, outputs a function TWF_EST : δ ↦ [0, 1] that ultimately models the underlying data variations. More specifically, the ADC algorithms themselves are used to devise such a mapping.

Let D be the training set composed of already classified documents di = (⟨x⃗i, pi⟩, ci), where x⃗i is the vectorial (bag-of-words) representation of di, pi denotes its creation point in time and ci denotes its associated class. The first step of our procedure consists of changing the associated class of each document to its creation point in time, that is, we represent di as d′i = (x⃗i, pi). Then, the training set is randomly partitioned into two subsets, Dt and Dv, and a classification procedure is performed, using Dt as a training set and Dv as a validation set. Our basic assumption is that, due to the temporal effects previously described, the underlying data distribution may be different for each point in time pi, and so we expect that the classifier may be able to learn some structural properties observed in each of them. Clearly, the classifier will not achieve high accuracy, since, as shown in Chapter 4, the variations observed due to the temporal effects are typically smooth and hence it is unlikely that the observed changes produce enough variation to enable discriminating data between each point in time. However, the classifier will potentially predict nearby points in time, since


data from nearby points in time tend to have similar underlying distributions.

Formally, an ADC algorithm is used to learn the underlying relationships between the data and their creation points in time, expressed by the a posteriori probability distribution P(pi|di). Thus, if documents created at a point in time pi share some structural properties with data from a reference point in time pr (namely, when the documents from Dv were created), then pi will receive a higher score than some other uncorrelated point in time pj ≠ pi. Since we aim at associating a real-valued weight with the temporal distances δi = pi − pr, we adopt the following rule to devise the TWF:

\[
TWF(\delta_i) = \frac{\sum_{j=1}^{N} I(p_j - p_r = \delta_i)}{N},
\]

where I(•) is an indicator function that returns 1 if the predicate • is true and 0 otherwise, pr is the actual creation point in time of the classified documents from Dv (that is, the reference point in time), pj is the predicted point in time (the one that received the highest score from the classifier) and N = |Dv| denotes the number of documents classified. Intuitively, temporal distances with higher weights contain the most useful documents for building the classification model, since they provide data sampled from distributions similar to the ones that govern the test data. On the other hand, temporal distances with smaller weights tend to have more unstable data, which may induce the classifier into misleading predictions. The described procedure to automatically determine the TWF is listed in Algorithm 2.

Algorithm 2 Automatic TWF Determination
1: function LEARNTWF(D)
2:     D′ ← {d′i | d′i = (di.xi, di.pi) ∧ di ∈ D}
3:     (Dt, Dv) ← RANDOMSPLIT(D′)
4:     p[ ] ← CLASSIFY(Dt, Dv)
5:     TWF(δi) = (Σ_{j=1}^{N} I(pj − d′j.pj = δi)) / N, where d′j ∈ Dv
6:     return TWF
7: end function
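As a rough Python illustration of Algorithm 2, the sketch below relabels documents with their creation period, trains a classifier on one split, and turns the frequencies of the predicted-minus-actual temporal distances on the other split into weights; scikit-learn is used only as a convenient stand-in and all names are assumptions.

```python
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def learn_twf(texts, creation_times, seed=0):
    """Estimate TWF(delta) in the spirit of Algorithm 2 (illustrative sketch)."""
    x = TfidfVectorizer().fit_transform(texts)   # bag-of-words features
    y = np.asarray(creation_times)               # labels = creation periods
    x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.3, random_state=seed)

    predicted = MultinomialNB().fit(x_tr, y_tr).predict(x_val)
    counts = Counter(int(p - r) for p, r in zip(predicted, y_val))
    n = len(y_val)
    return {delta: freq / n for delta, freq in counts.items()}

# Hypothetical usage: `docs` is a list of strings and `years` their creation years.
# twf = learn_twf(docs, years)
# weight = twf.get(p - p_r, 0.0)   # weight for temporal distance p - p_r
```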

We show in Figure 5.3 the TWFs estimated for each explored textual dataset by each explored classifier. They were obtained during the 10-fold cross-validation procedure used to assess the effectiveness of the temporally-aware algorithms, as reported in Section 5.4. The reader should notice the similarities among the TWFs learned by each classifier. In fact, it does not matter which classifier is employed in line 4 of Algorithm 2, as we shall discuss in Section 5.4.


[Figure 5.3: Estimated Temporal Weighting Function. Panels: (a) ACM-DL Dataset, (b) MEDLINE Dataset, (c) AG-NEWS Dataset.]

5.3 Temporally-aware ADC

This section shows how three well-known text classifiers, namely Rocchio, KNN and Naïve Bayes (Manning et al., 2008), can be modified to take into account the temporal weighting function defined in Sections 5.1 and 5.2. The three algorithms are modified following two strategies: temporal weighting in documents and temporal weighting in scores, as detailed


below.

5.3.1 Temporal Weighting in Documents

The temporal weighting in documents strategy weights each training document by the temporal weighting function according to its temporal distance to the test document d′, as represented in Figure 5.4. In the following we detail this strategy.

Figure 5.4: Graphical Representation of TWF in Documents.

The strategy for incorporating the weight of each training document into a given classifier depends inherently on the characteristics of the classification algorithm being modified. In the case of distance-based classifiers, the temporal weighting function can be easily applied when calculating the distance between the training and test documents, by weighting each training document (in its vectorial representation) by its associated temporal weight. In the case of Naïve Bayes, the temporal function can be used to weight the impact of each training example in the estimation of both the a priori and the conditional probabilities (that is, to weight its impact on the counts), in order to generate a more accurate a posteriori probability.

Rocchio. Recall from Section 4.1.2 that the Rocchio classifier uses the centroid of a class to find boundaries between classes. As an eager classifier, Rocchio does not require any information from d′ to create a classification model. Hence, we adapt it to become a lazy classifier when using the temporal weighting function, since the weights depend on the creation point in time of a test document. When classifying a new document d′, Rocchio assigns it to the class represented by the centroid closest to d′. In order to make Rocchio a lazy classifier, we explicitly change the separation boundaries of the classes according to the temporal weights produced by the TWF function.

Hence, we need to calculate each of Rocchio's class centroids based on the creation point in time pr of a test document d′. Consider the set Dc ⊆ D of training documents that belong to the class c. This set can be partitioned into subgroups Dc,p ⊆ Dc of documents


created at the same point in time p ∈ P. The centroid μ⃗c for class c is thus defined by weighting the documents' vector representations with the score produced by the temporal function TWF(δ), obtained using the temporal distance δ between the creation point in time of d ∈ Dc,p and d′, for all p ∈ P. Thus, a centroid μ⃗c is given by:

\[
\vec{\mu}_c = \frac{1}{\|D_c\|} \sum_{d \in D_c} \left( \sum_{p \in P} \vec{d}_p \cdot TWF(\delta) \right),
\]

where ‖Dc‖ is the number of documents in class c, P is the set of points in time observed in the training set, d⃗p ∈ Dc denotes a training document created at the point in time p and δ is the temporal distance between d⃗p and the test document d′.

This approach redefines the centroid's coordinates in the vector space by considering each document's representativeness in class c w.r.t. the reference point in time pr. Both training and classification procedures are presented in Algorithm 3.

Algorithm 3 Rocchio-TWF-Doc: Rocchio with Temporal Weighting in Documents
1: function TRAIN(C, D, d′, TWF)
2:     for each c ∈ C do
3:         μ⃗c ← (1/‖Dc‖) Σ_{d∈Dc} ( Σ_{p∈P} d⃗p · TWF(p − d′.p) )
4:     end for
5:     return {μ⃗c : c ∈ C}
6: end function
7: function CLASSIFY(D, C, d′)
8:     {μ⃗c : c ∈ C} ← TRAIN(D, C, d′)
9:     return argmax_c cos(μ⃗c, d⃗′)
10: end function
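A minimal NumPy sketch of this lazy, temporally weighted centroid computation is shown below; the dense array layout, the twf dictionary and the function name are assumptions made for illustration, not the dissertation's implementation.

```python
import numpy as np

def rocchio_twf_classify(x_train, y_train, t_train, x_test, t_test, twf):
    """Lazy Rocchio with temporal weighting in documents (illustrative sketch).

    x_train: (n_docs, n_terms) array; y_train: class labels; t_train: creation
    points in time; twf: dict mapping temporal distance -> weight in [0, 1].
    """
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train)
    weights = np.array([twf.get(p - t_test, 0.0) for p in t_train])
    scores = {}
    for c in np.unique(y_train):
        mask = y_train == c
        # Temporally weighted centroid of class c w.r.t. the test creation time.
        centroid = (x_train[mask] * weights[mask, None]).sum(axis=0) / mask.sum()
        norm = np.linalg.norm(centroid) * np.linalg.norm(x_test)
        scores[c] = float(centroid @ x_test) / norm if norm else 0.0
    return max(scores, key=scores.get)
```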

KNN. As described in Section 4.1, KNN is a lazy classifier that assigns to a test document d′ the majority class among those of its k nearest neighbor documents in the vector space. Determining the test document's class from the k nearest neighbor training documents may not be ideal in the presence of term-class relationships that vary considerably over time. To deal with this, we apply the proposed temporal weighting function during the computation of similarities between d′ and the documents in the training set, aiming to select the closest documents in terms of both similarity and timeliness.

Let s be the cosine similarity between a training document d and d′. If d is similar to d′ but is temporally distant, then it is moved away from d′, reducing the probability of it being among the k nearest documents of d′. Let TWF(δ) be the temporal weight associated with the temporal distance between the creation times of documents d and d′. Then, the documents' similarity is given by:


sim(d, d′) ← cos(d, d′) · TWF(δ).

Both training and classification procedures are presented in Algorithm 4.

Algorithm 4 KNN-TWF-Doc: KNN with Temporal Weighting in Documents
1: function KNEARESTNEIGHBORS(D, d′, k, TWF)
2:     for each d ∈ D do
3:         δ ← d.p − d′.p
4:         sim(d, d′) ← cos(d, d′) · TWF(δ)
5:         priorityQueue.insert(sim, d)
6:     end for
7:     return priorityQueue.first(k)
8: end function
9: function CLASSIFY(D, d′, k)
10:     knn ← KNEARESTNEIGHBORS(D, d′, k)
11:     return argmax_c |{d ∈ knn : d.c = c}|
12: end function
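The Python sketch below mirrors this idea: cosine similarities are damped by the TWF weight of each training document's temporal distance before the k nearest neighbors vote; the dense array layout and names are illustrative assumptions.

```python
from collections import Counter

import numpy as np

def knn_twf_classify(x_train, y_train, t_train, x_test, t_test, twf, k=30):
    """kNN with temporal weighting in documents (illustrative sketch)."""
    x_train = np.asarray(x_train, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    # Cosine similarity between the test document and every training document.
    norms = np.linalg.norm(x_train, axis=1) * np.linalg.norm(x_test)
    cos = (x_train @ x_test) / np.where(norms == 0, 1.0, norms)
    # Damp each similarity by the TWF weight of its temporal distance.
    weights = np.array([twf.get(p - t_test, 0.0) for p in t_train])
    top_k = np.argsort(cos * weights)[::-1][:k]
    votes = Counter(np.asarray(y_train)[top_k])
    return votes.most_common(1)[0][0]
```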

Naïve Bayes. Similarly to the previously defined “temporal weighting in documents” approaches, here we apply the temporal weighting function to the information used by the learning method, namely the relative frequencies of documents and terms, as follows:

\[
P(c \mid d') = \eta \cdot \frac{\sum_{p}\left(N_{cp} \cdot TWF(\delta)\right)}{\sum_{p}\left(N_{p} \cdot TWF(\delta)\right)} \cdot \prod_{t \in d'} \frac{\sum_{p}\left(f_{tcp} \cdot TWF(\delta)\right)}{\sum_{p}\sum_{t' \in V}\left(f_{t'cp} \cdot TWF(\delta)\right)},
\]

where η denotes a normalizing factor, Ncp is the number of training documents of D assigned to class c and created at the point in time p, Np is the number of training documents created at the point in time p, ftcp stands for the frequency of occurrence of term t in training documents of class c that were created at the point in time p and, finally, δ denotes the temporal distance between p and the creation time of d′ (that is, the reference point in time).

The main goal of this strategy is to reduce the impact that temporally distant information has when estimating the a posteriori probabilities. Algorithm 5 presents this strategy.

5.3.2 Temporal Weighting in Scores

A more sophisticated approach to exploit the temporal weighting function considers the “scores” produced by the traditional classifiers, as represented in Figure 5.5. By score we mean: (i) the smallest distance from the test document d′ to a class centroid, for Rocchio; (ii) the smallest sum of the distances of the k nearest neighbors of document d′ assigned to class c, in the case of KNN; or (iii) the probability of generating d′ with the model associated with some class c, for Naïve Bayes.


Algorithm 5 Naïve Bayes TWF-Doc: Naïve Bayes with Temporal Weighting in Documents
1: function CLASSIFY(D, d′, TWF)
2:     for each c ∈ C do
3:         aPriori[c] ← Σ_p(Ncp · TWF(δ)) / Σ_p(Np · TWF(δ))
4:         termCond[c] ← Π_{t∈d′} [ Σ_p(ftcp · TWF(δ)) / Σ_p Σ_{t′∈V}(ft′cp · TWF(δ)) ]
5:     end for
6:     return argmax_c η · aPriori[c] · termCond[c]
7: end function
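The fragment below sketches this temporally weighted Naïve Bayes estimation in NumPy for a single test document; the count-matrix layout, the Laplace smoothing and the log-space computation are simplifying assumptions added for illustration.

```python
import numpy as np

def naive_bayes_twf_classify(n_cp, f_tcp, test_term_ids, t_periods, t_test, twf, alpha=1.0):
    """Naïve Bayes with temporal weighting in documents (illustrative sketch).

    n_cp:  (n_classes, n_periods) document counts per class and period.
    f_tcp: (n_terms, n_classes, n_periods) term frequencies per class and period.
    twf:   dict mapping temporal distance -> weight; t_periods gives the point
           in time of each period column; alpha is Laplace smoothing.
    """
    w = np.array([twf.get(p - t_test, 0.0) for p in t_periods])      # TWF(delta) per period
    prior = (n_cp * w).sum(axis=1) / max((n_cp * w).sum(), 1e-12)    # weighted class priors
    weighted_tc = (f_tcp * w).sum(axis=2)                            # (n_terms, n_classes)
    cond = (weighted_tc + alpha) / (weighted_tc.sum(axis=0) + alpha * weighted_tc.shape[0])
    log_scores = np.log(prior + 1e-12) + np.log(cond[test_term_ids]).sum(axis=0)
    return int(np.argmax(log_scores))
```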

From now on, we refer to this approach as temporal weighting in scores.

Figure 5.5: Graphical Representation of TWF in Scores.

Let C and P be the sets of classes and of creation points in time of the training documents. First, each training document's class c ∈ C is associated with the corresponding creation point in time p ∈ P, generating a new class defined as ⟨c, p⟩ ∈ C × P. Then, we use a traditional classification algorithm to generate scores for each new class ⟨c, p⟩. Thus, the first step of this strategy consists of generating a new training set Dc,p with the class domain transformed from C to C × P. Then, the test document d′ is classified by a traditional classifier applied to this new training set, ultimately generating scores for each ⟨c, p⟩. Note that this scenario isolates term-class relationship variations, since it ties the predictive relationships of the patterns observed in each class c to the point in time p in which the patterns were observed. To decide to which class c the document d′ should be assigned, the learned scores for each ⟨c, p⟩ are summed up, for all p ∈ P, weighting them by TWF(δ), where δ = p − pr corresponds to the temporal distance between p and the creation time pr of d′, that is,


\[
scores_{c,p} \leftarrow \textsc{TraditionalClassifier}(d', D_{c,p}),
\]
\[
score_c \leftarrow \sum_{p \in P} scores_{c,p}(c, p) \cdot TWF(\delta),
\]

where Dc,p is the new set of training documents generated by mapping each document's class c to the derived class ⟨c, p⟩, according to its creation point in time. At the end of this process, d′ will be assigned to the class c with the highest score, as listed in Algorithm 6.

Algorithm 6 TWF-Sc: Temporal Weighting in Scores
1: function CLASSIFY(d′, C, P, D, TWF)
2:     Dc,p ← {dc,p = (d⃗.x, ⟨d.c, d.p⟩) | d ∈ D}
3:     scoresc,p ← TRADITIONALCLASSIFIER(d′, Dc,p)
4:     for each c ∈ C do
5:         δ ← p − d′.p
6:         scorec ← Σ_{p∈P} scoresc,p(c, p) · TWF(δ)
7:     end for
8:     return argmax_c scorec
9: end function
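A compact Python sketch of the in-scores idea follows: training labels are replaced by derived ⟨class, period⟩ labels, a base classifier scores every derived class, and the scores are aggregated with TWF weights; scikit-learn's MultinomialNB is only a convenient stand-in base classifier, and all names and the non-negative feature matrix are assumptions.

```python
from collections import defaultdict

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def twf_in_scores_classify(x_train, y_train, t_train, x_test, t_test, twf):
    """Temporal weighting in scores (illustrative sketch).

    Each training document is relabeled with the derived class <c, p>; a base
    classifier scores every derived class and the scores are combined with
    TWF(p - t_test) to pick the original class c.
    """
    labels = [f"{c}|{p}" for c, p in zip(y_train, t_train)]   # derived <c, p> labels
    clf = MultinomialNB().fit(x_train, labels)
    probs = clf.predict_proba(np.asarray(x_test).reshape(1, -1))[0]

    score = defaultdict(float)
    for label, prob in zip(clf.classes_, probs):
        c, p = label.rsplit("|", 1)
        score[c] += prob * twf.get(int(p) - t_test, 0.0)
    return max(score, key=score.get)
```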

5.3.3 Extended Temporal Weighting in Scores

When generating the scores for the derived class ⟨c, p⟩ during the classification of a test document, the number of positive training documents (that is, training documents belonging to ⟨c, p⟩) is usually outnumbered by the number of negative training documents (that is, training documents belonging to a derived class ⟨c, p⟩′ ≠ ⟨c, p⟩). This is known as the class imbalance problem, and it is an issue for classifiers with some bias towards the majority classes. Indeed, this problem is inherent to the classification task and the majority of automatic classifiers are affected by it. The “in scores” strategy becomes vulnerable to the class imbalance problem since it artificially increases the imbalance when mapping the classes to ⟨c, p⟩. Several works have already proposed strategies to minimize this problem, for example, strategies to under-sample the majority classes (Lin et al., 2009) or over-sample the minority classes (Chen et al., 2011). We address this issue by modifying the “in scores” strategy in order to minimize the class imbalance problem.

More formally, let Dc denote the set of documents belonging to class c and D^p_c ⊆ Dc denote the set of documents created at the point in time p that also belong to class c. Clearly, |D^p_c| ≤ |Dc|. Now, consider our previously proposed “in scores” strategy, in which a classifier is used to learn the scores for ⟨c, p⟩. The difference between |D^p_c| (the number


of positive documents) and |Dc \ D^p_c| (the number of negative documents) is expected to be greater than the difference between |Dc| and |D \ Dc|. In other words, the number of negative documents observed in the transformed class domain (C × P) outnumbers the number of positive documents to a much greater extent than when considering the original class domain (C). Thus, the “in scores” strategy artificially increases the class imbalance when considering the derived classes ⟨c, p⟩, and such imbalance is greater than that observed when considering the original class distribution.

Figure 5.6: Graphical Representation of Extended TWF in Scores.

Based on this observation, the extended version of the “in scores” strategy aims at mitigating the class imbalance problem by considering each point in time in isolation, as represented in Figure 5.6, employing a series of classifiers to associate scores with the classes considering only documents belonging to each point in time independently, but belonging to all classes. The scores obtained by each classifier are then aggregated with the corresponding TWF weight, according to the temporal distance between the point in time associated with each classifier and the creation time of the test document. Let Dp denote the set of documents created at the point in time p. Since D^p_c ⊆ Dp, then |D^p_c| ≤ |Dp| and the majority class size observed in Dp is bounded by |Dp|. In the “in scores” strategy, the majority class size is bounded by |Dc,p| = |D|. Consequently, the class imbalance observed in the first approach is smaller than the class imbalance observed in the second one.

There are two rather subtle differences between this new strategy and the previous in scores approach. First, as mentioned, by construction the class imbalance problem is bounded by the class imbalance observed in Dp, since it is the set considered when training


the intermediate classifiers. Second, consider the traditional classification procedure performed by both strategies. While in the “in scores” strategy such classification is performed considering the modified training set Dc,p, in the extended version only documents belonging to Dp are considered, which implies that documents created at a point in time p′ ≠ p do not influence the learned scores. As an example, consider the class Top Stories (id 10) of the AG-NEWS dataset. From Figure 4.2, we can observe that, in the 50th week, none of the documents belong to that class. Now assume that the classification procedure of both strategies will attempt to learn a score for this class when classifying a test document belonging to the 1st week. The scores learned by the classifier under the “in scores” strategy will consider all training documents, including those belonging to the 50th week. Thus, these negative documents ultimately influence the learned scores. On the other hand, in the “extended in scores” strategy, the classification procedures applied to each point in time act in isolation: the classifiers assigned to the points in time within the 50th week will output scores equal to zero for this class, and only the classifiers assigned to points in time with documents belonging to this class will output non-zero scores. The extended in scores procedure is listed in Algorithm 7.

Algorithm 7 TWF-Sc-Ext: Extended Temporal Weighting in Scores
1: function CLASSIFY(d′, C, P, D, TWF)
2:     for each p ∈ P do
3:         Dp ← {d ∈ D | d.p = p}
4:         δ ← p − d′.p
5:         scorec(c) += TRADITIONALCLASSIFIER(d′, Dp) · TWF(δ)
6:     end for
7:     return argmax_c scorec
8: end function
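Continuing the previous sketch, the extended variant below trains one base classifier per creation period and aggregates the per-period class scores with the TWF; again, the use of scikit-learn, the skipping of degenerate periods and the variable names are illustrative assumptions.

```python
from collections import defaultdict

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def twf_extended_in_scores_classify(x_train, y_train, t_train, x_test, t_test, twf):
    """Extended temporal weighting in scores (illustrative sketch).

    One classifier is trained per point in time p on D_p only, and its class
    scores are weighted by TWF(p - t_test) before being summed per class.
    """
    y_train = np.asarray(y_train)
    t_train = np.asarray(t_train)
    score = defaultdict(float)
    for p in np.unique(t_train):
        mask = t_train == p
        if len(np.unique(y_train[mask])) < 2:      # skip periods with a single class
            continue
        clf = MultinomialNB().fit(x_train[mask], y_train[mask])
        probs = clf.predict_proba(np.asarray(x_test).reshape(1, -1))[0]
        weight = twf.get(int(p) - t_test, 0.0)
        for c, prob in zip(clf.classes_, probs):
            score[c] += prob * weight
    return max(score, key=score.get)
```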

5.4 Results

Having presented our strategies to determine the TWF and our temporally-aware classifiers, we now report our experimental evaluation to assess the effectiveness of the temporally-aware classifiers in minimizing the impact of the temporal effects observed in the three explored textual datasets. Recall from Section 5.1 that we found that for the ACM-DL and MEDLINE datasets the TWF follows a lognormal distribution, unlike the TWF associated with the AG-NEWS dataset, whose expression is still unknown (meaning that a different, possibly more complex, statistical test is required to assess its TWF). Hence, we start by evaluating the temporal algorithms using the original TWF obtained in Section 5.1, applied to the ACM-DL and MEDLINE datasets. Next, we evaluate our temporally-aware classifiers using the


TWF estimated using a machine learning approach, as described in Section 5.2. In this case, since complex statistical tests are no longer necessary, we are able to determine the TWF for the three textual datasets in a fully-automated way. Thus, we evaluate our temporally-aware classifiers using this TWF applied to the three reference datasets in order to assess their effectiveness.

In order to evaluate the impact that the proposed TWF has on the classification task, we compare both the traditional and the temporally-aware versions of Rocchio, KNN and Naïve Bayes on the three adopted datasets (ACM-DL, MEDLINE and AG-NEWS). For comparison we use two standard information retrieval measures: micro-averaged F1 (MicroF1) and macro-averaged F1 (MacroF1). As described in Section 2.2, while MicroF1 measures the classification effectiveness over all decisions made by the classifier, MacroF1 measures the classification effectiveness for each individual class and averages them. All experiments were executed using a 10-fold cross-validation (Breiman and Spector, 1992) procedure considering training, validation and test sets. The parameters were set using the validation set, and the effectiveness of the algorithms was measured on the test partition.

We start by reporting, in Section 5.4.1, the parameter setup performed in order to conduct our experimental evaluation. Then, in Section 5.4.2 we report and analyze the results obtained when using the original definition of the TWF (described in Section 5.1) for the ACM-DL and MEDLINE datasets. Finally, in Section 5.4.3 we evaluate the use of the fully-automated strategy to devise the TWF used to feed our temporally-aware classifiers and discuss some important aspects regarding its efficiency in terms of runtime. All experiments were run using a Quad-Core AMD Opteron™ CPU with 16 GB of RAM.

5.4.1 Parameter Settings

An important aspect to be considered when dealing with the temporally-aware classifiers is that the TWF scale must be compatible with the values weighted by the TWF. Clearly, this is algorithm specific and should be properly set to ensure that the TWF effectively improves the classifier's decision rules without compromising them. To explicitly control the TWF scale (without modifying its shape), we introduce a scaling factor β which should be properly calibrated over the training set.

Hence, in order to run the experiments, two important parameters had to be set: the value of K for KNN and the scaling factor β. We first performed some experiments with KNN to define the value of K. This parameter significantly impacts the quality of the classifier and must be carefully chosen. The following values were tested, by means of cross-validation over the training set, for each version of the traditional and temporally-aware algorithms: 3, 10, 30, 50, 150, and 200. For the traditional version of the algorithm K = 30


achieved better results, while for both thein documentsand in scoresversions of KNN

the best value ofk was50. The intuition for the traditional KNN to perform better with

smaller values ofK is that, as the number of neighbors increases, the variationon term-

class relationships also increases, and the probability ofmisclassification increases. On the

other hand, when settingK < 30 the traditional version of KNN performed poorly due

to overfitting. When considering temporal information by means of the proposed temporal

weights, in contrast, more consistent information becomesavailable (due to a largerK),

allowing a more accurate model. Finally, theextended in scoresversion of KNN performed

better withK = 3 in the ACM-DL dataset andK = 10 in the MEDLINE and AG-NEWS

datasets (recall that in this strategy, the KNN’s classification model is built from reduced

training sets, composed by documents belonging to the same point in time, which justifies

this smaller value).

We empirically tested three values for β: 1, 10, and 100. The best value of each version of each classifier was considered. For Rocchio and KNN, the best results were obtained with β = 1. For Naïve Bayes, the best value was β = 10.

5.4.2 Experiments with the Statistically Defined TWF

In this section, we report our experiments to compare the traditional and the proposed temporally-aware versions of Rocchio, KNN and Naïve Bayes, using the statistically defined TWF, reported in Section 5.1, for the ACM-DL and MEDLINE datasets. We defer the analysis regarding the AG-NEWS dataset to Section 5.4.3, when we discuss the results obtained using the estimated TWF (as described in Section 5.2). The results obtained for the ACM-DL and MEDLINE datasets are reported in Tables 5.4 and 5.5, respectively. In both tables, each line presents the results achieved by the versions of the classifiers identified in the first row and column. The values obtained for MacroF1 ("macF1") and MicroF1 ("micF1") are reported, as well as the percentage difference between the values achieved by the temporally-aware methods and the traditional version of the classifiers. This percentage difference is followed by a symbol that indicates whether the variations are statistically significant according to a 2-tailed paired t-test, given a 99% confidence level. ▲ denotes a significant positive variation, • a non-significant variation and ▼ a significant negative variation. This notation will also be adopted in Section 5.4.3.

Algorithm            Rocchio              KNN                  Naïve Bayes
Metric               macF1(%)   micF1(%)  macF1(%)   micF1(%)  macF1(%)   micF1(%)
Baseline             57.39      68.24     58.48      71.84     57.27      73.24
TWF in documents     60.02      70.64     59.92      73.84     60.78      74.11
                     (+4.58)▲   (+3.52)▲  (+2.46)▲   (+2.78)▲  (+6.13)▲   (+1.19)•
TWF in scores        59.85      72.47     62.02      74.45     44.85      63.93
                     (+4.29)▲   (+6.20)▲  (+6.05)▲   (+3.63)▲  (-27.69)▼  (-14.56)▼
TWF in scores ext.   59.27      71.39     59.78      73.85     56.23      72.35
                     (+3.28)▲   (+4.62)▲  (+2.22)▲   (+2.80)▲  (-1.84)•   (+1.23)•

Table 5.4: Results Obtained when Incorporating the Statistically Defined TWF to Rocchio, KNN, and Naïve Bayes (ACM-DL).

Algorithm            Rocchio              KNN                  Naïve Bayes
Metric               macF1(%)   micF1(%)  macF1(%)   micF1(%)  macF1(%)   micF1(%)
Baseline             54.26      69.27     72.49      82.86     64.61      80.82
TWF in documents     54.08      69.48     74.10      83.36     66.75      82.87
                     (-0.33)•   (+0.30)•  (+2.22)▲   (+0.60)•  (+3.31)▲   (+2.54)▲
TWF in scores        63.95      77.63     75.89      86.35     58.12      80.49
                     (+17.86)▲  (+12.07)▲ (+4.69)▲   (+4.21)▲  (-10.04)▼  (-0.41)•
TWF in scores ext.   63.63      77.28     74.45      84.96     63.41      81.06
                     (+17.27)▲  (+11.56)▲ (+2.70)▲   (+2.53)▲  (-1.89)•   (+0.30)•

Table 5.5: Results Obtained when Incorporating the Statistically Defined TWF to Rocchio, KNN, and Naïve Bayes (MEDLINE).

As we can see in Tables 5.4 and 5.5, all modified versions of Rocchio and KNN achieved better results than the baseline in ACM-DL. In MEDLINE, the "in scores" and "extended in scores" versions achieved statistically significant gains, while the "in documents" versions were statistically tied with the baseline. In particular, Rocchio with TWF in scores presents the most significant improvements in both datasets, with gains of up to +17.86% and +12.07% for MacroF1 and MicroF1, respectively. Similarly, KNN with TWF in scores achieves the best results among all KNN variations, with gains of +6.05% and +4.21% for MacroF1 and MicroF1, respectively. In the case of Rocchio, the improvements achieved using the TWF can be explained by the fact that, in the traditional version, the documents are summarized in a unique representative vector (centroid), aggregating documents from distinct creation points in time, ultimately affecting the prediction ability of the classifier. In the case of KNN, the definition of class boundaries is done considering each training document independently. KNN assumes that documents of the same class are located close by in the vectorial space. By using the TWF, the K nearest documents are reorganized, and the most temporally relevant documents are placed closer to the document being classified, according to the temporal distance between them.
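As an illustration of how such a reorganization could be realized, the sketch below shows a temporally-weighted KNN vote in which each neighbor's contribution is scaled by the TWF of its temporal distance to the test document; the similarity function and the document fields are assumptions, and this is a simplified sketch rather than the implementation evaluated here.

    # Simplified sketch: KNN with the TWF applied "in documents".
    # Each of the K nearest training documents votes with a weight that combines
    # its similarity to the test document and TWF(delta), where delta is the
    # temporal distance between their creation points in time.
    from collections import defaultdict

    def knn_twf_in_documents(test_doc, training_docs, similarity, twf, k=50):
        neighbors = sorted(training_docs, key=lambda d: similarity(test_doc, d), reverse=True)[:k]
        scores = defaultdict(float)
        for d in neighbors:
            delta = abs(test_doc.time - d.time)
            scores[d.label] += similarity(test_doc, d) * twf(delta)
        return max(scores, key=scores.get)   # predicted class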

The Naïve Bayes with TWF in documents presents better results for MacroF1 on both ACM-DL and MEDLINE, and better MicroF1 in the MEDLINE dataset. Note that the best improvement was achieved in MacroF1, pointing out that this strategy effectively reduces the Naïve Bayes bias towards the most frequent classes and, consequently, improves the effectiveness of this classifier when predicting documents from the smaller classes. However, in contrast with Rocchio and KNN, the Naïve Bayes with TWF in scores performs poorly in both datasets. A closer look at the "in scores" strategy reveals that if it is built upon a traditional classifier whose decision rule is strongly influenced by the negative documents (as KNN and Naïve Bayes), its performance is bound to be poor when applied to datasets with skewed ⟨c, p⟩ distributions. Although in KNN this problem can be ameliorated by a proper tuning of the parameter K, in Naïve Bayes this is not possible. Thus, we attribute the poor performance of the Naïve Bayes with TWF in scores to two major weaknesses of the traditional Naïve Bayes version. First, when facing skewed data distributions, the traditional version of Naïve Bayes unwittingly prefers larger classes over others, causing decision boundaries to be biased (in this case, the prediction of the smaller classes is influenced by the negative documents belonging to the major classes). Second, when data is scarce, there is not enough information to perform accurate estimates, leading to bad results.

The skewness of the data distribution among the classes ⟨c, p⟩ can be quantified by the Coefficient of Variation CV = σ/µ of their sizes, where σ and µ stand for the standard deviation and the mean. To explore the impact of data skewness on Naïve Bayes, we sampled MEDLINE, creating two sub-collections composed of the least and most frequent classes ⟨c, p⟩, minimizing data skewness. While the entire collection presents CV = 1.33, the sub-collections with the least and most frequent classes present CV equal to 0.57 and 0.43, respectively. As we can observe in Tables 5.5 and 5.6, the greater the CV, the worse are the results.

Naïve Bayes       Least frequent classes ⟨c, p⟩        Most frequent classes ⟨c, p⟩
Metric            CV     macF1(%)   micF1(%)           CV     macF1(%)   micF1(%)
Baseline          0.57   74.72      80.42              0.43   88.70      87.65
TWF in scores            78.49      84.75                     91.16      89.66
                         (+5.04)▲   (+5.38)▲                  (+2.77)▲   (+2.29)▲

Table 5.6: Results Obtained for the Least and Most Frequent Classes ⟨c, p⟩ Sampling for Naïve Bayes (MEDLINE).
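For reference, the Coefficient of Variation used above can be computed as in the short sketch below (the class-size lists are hypothetical and the population standard deviation is assumed).

    # CV = sigma / mu of the <c, p> class sizes; a larger CV means a more skewed distribution.
    from statistics import mean, pstdev

    def coefficient_of_variation(class_sizes):
        return pstdev(class_sizes) / mean(class_sizes)

    print(coefficient_of_variation([100, 120, 90, 3000]))   # heavily skewed (hypothetical sizes)
    print(coefficient_of_variation([450, 500, 480, 520]))   # nearly balanced (hypothetical sizes)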

Figure 5.7 shows the histogram with the percentages of classes ⟨c, p⟩ with up to a given size (specified on the x-axis) for the ACM-DL and MEDLINE datasets. As we can observe in Figure 5.7a, data scarcity is prominent in the ACM-DL dataset, contributing to the poor performance of the Naïve Bayes with TWF in scores. Notice that 70% of the classes ⟨c, p⟩ have less than 100 documents, a number too low to guarantee accurate estimates. This is also observed in the MEDLINE dataset (see Figure 5.7b), but to a smaller extent: more specifically, 13% of the classes ⟨c, p⟩ are composed of less than 500 documents, whereas 35% are composed of 2500 to 3000 documents. In addition, ACM-DL has an even more skewed data distribution over each time point (with a CV equal to 1.69 regarding the ⟨c, p⟩ sizes), preventing us from sampling it into sub-collections with smaller CV, as performed with the MEDLINE dataset.

Figure 5.7: Relative ⟨c, p⟩ Sizes. (a) ACM-DL Dataset; (b) MEDLINE Dataset.

Recall that the main motivation behind the "extended in scores" strategy is to ameliorate the class imbalance problem, which negatively impacts the "in scores" effectiveness. In the "extended in scores" strategy, the influence of negative documents is bounded by considering data from each point in time in isolation. More specifically, the class imbalance is not given by the ⟨c, p⟩ distribution as in the "in scores" strategy, but by the observed class imbalance within each point in time. In fact, this class distribution is typically more evenly distributed than the artificial ⟨c, p⟩ distribution. As we can observe in the reported results, the extended in scores version of Naïve Bayes performed better than its in scores version. However, it still did not perform better than the baseline (with statistically equivalent results in all cases), due to the discussed data scarcity problem, which prevents this classifier from learning accurate estimates of the class densities. Strategies to handle data scarcity (for instance, by oversampling the training set) are one of our current research focuses and we plan to further investigate this matter as future work.

Now, we analyze the obtained results in the light of the quantitative analysis reported in Chapter 4. As observed, using the TWF in scores in most cases led to better results than applying the TWF in documents. This is due to the fact that the "in scores" strategy simultaneously addresses the three discussed temporal effects, namely, the class distribution variation (CD), the pairwise class similarity variation (CS) and the term distribution variation (TD), whereas the "in documents" strategy takes into account just the TD effect, as discussed next. Furthermore, as we can observe in Table 5.5 regarding the MEDLINE dataset, for the "in documents" strategy the results obtained were statistically equivalent to the baselines in almost all cases, with the Naïve Bayes being an exception. As will be discussed in the following, this is due to the MEDLINE characteristics with respect to the extent of the TD effect.

Recall that the temporal weighting in documents strategy weights each training document by the TWF according to its temporal distance to the test document. The TWF is modeled according to the observed variations over time in the term-class relationships for each dataset, ultimately addressing the TD aspect. Furthermore, recall that both the temporal weighting in scores and its extended version tie together the observed patterns with both class and temporal information. While the "in scores" strategy transforms the class domain from C to C × P (generating a new training set), the "extended in scores" strategy groups training documents into partitions composed of documents created at the same point in time, performing a traditional classification procedure over each partition. Both strategies assume that the temporal effects may be safely neglected within a single point in time and thus that the classification models learned considering each point in time in isolation are not affected by them. However, as previously stated, considering only the data related to a single point in time may disregard valuable information to learn an accurate classification model. Thus, the second step of these strategies consists of aggregating the information learned for each point in time, weighting the obtained classification scores by the TWF. Aggregating the obtained scores for each point in time is affected by the TD effect, since the scores reflect the relationships between terms and classes. In order to overcome the observed variations in the term-class relationships across the different points in time, the TWF is used to weight them according to the temporal distance between the point in time associated to each partition and the creation time of the test document. Thus, while the first step addresses the CD and CS effects, the second step addresses the TD aspect observed when aggregating the scores.
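The aggregation step just described can be sketched as follows; it is an illustration of the weighted sum of per-point-in-time scores by the TWF, with the underlying scores and the TWF left abstract, rather than the actual implementation.

    # Sketch of the score aggregation shared by the "in scores" and
    # "extended in scores" strategies: scores learned for the derived classes
    # <c, p> (or from per-partition models) are combined, weighted by TWF(delta),
    # where delta is the distance between p and the test document's creation time.
    def aggregate_scores(base_scores, twf, test_time):
        """base_scores: dict mapping (class, point_in_time) -> classification score."""
        final = {}
        for (c, p), s in base_scores.items():
            delta = abs(test_time - p)
            final[c] = final.get(c, 0.0) + s * twf(delta)
        return max(final, key=final.get)   # predicted class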

Analyzing the reported results, we can observe that, for ACM-DL, the three strategies achieved significant gains. Considering the temporal weighting in documents approach, we can justify its gains by the high impact of TD in that dataset. Moreover, since ACM-DL is also subject to a high impact of both CD and CS, both the temporal weighting in scores and its extended version also performed well, since they address such effects, as previously discussed. In MEDLINE, in contrast, since the impact of TD is smaller than the impact of the other two effects, we should expect less significant gains from the temporal weighting in documents. Indeed, this was the observed behavior: such approach achieved statistical ties compared to the baselines in almost all cases. However, as both CD and CS are important factors in that dataset, we can observe statistically significant improvements in classification effectiveness when the temporal weighting in scores and its extension are applied. Furthermore, the largest improvements are achieved when the temporal weighting in scores is applied with the Rocchio classifier, which, as discussed in the previous section, is the most affected by both CD and CS in that dataset (see summary in Table 4.8).

5.4.3 Experiments with the Estimated TWF

In this section, we report our experimental evaluation to assess the effectiveness of the proposed temporally-aware classifiers using the TWF learned by the fully-automated procedure described in Section 5.2. The goal here is to increase the applicability of the temporally-aware classifiers. For example, even if uncertain about the expression (and parameters) of the TWF associated to the AG-NEWS dataset, we can still determine the weights associated to each temporal distance using the procedure described in Algorithm 2, and use our temporally-aware classifiers with the learned TWF. Thus, in this section we examine the results obtained when applying the temporally-aware classifiers to the three reference datasets, using the estimated TWF. Recall that, in this case, the TWF is learned from the training set D. An interesting aspect to be analyzed refers to the amount of data required to accurately estimate this function. Is the whole training set needed to learn the TWF? In order to have a first glance at this matter, we evaluate our strategies using the TWF learned from the entire D and from a sample composed of 10% of D, selected by a per point in time random sampling (to guarantee that each point in time will have at least one document).
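Such a per point in time sampling could be implemented along the lines of the sketch below (an illustration only; the grouping attribute and the handling of very small groups are assumptions).

    # Sketch: sample a fraction of the training set per point in time,
    # keeping at least one document for every point in time.
    import random
    from collections import defaultdict

    def sample_per_point_in_time(training_docs, fraction=0.10, seed=42):
        rng = random.Random(seed)
        by_time = defaultdict(list)
        for d in training_docs:
            by_time[d.time].append(d)                      # group documents by creation time
        sample = []
        for docs in by_time.values():
            k = max(1, int(round(fraction * len(docs))))   # at least one document per point in time
            sample.extend(rng.sample(docs, k))
        return sample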

We start by comparing the results obtained using the estimated TWF with the results obtained when using the statistically defined TWF, considering the ACM-DL and MEDLINE datasets. We stress here that the results obtained by estimating the TWF using the three classifiers (line 4 of Algorithm 2) were statistically equivalent, as could be expected from the observed similarities in Figure 5.3. Thus, we report just the results obtained by estimating the TWF using the Rocchio classifier. As we shall see, the use of the estimated TWF led to results statistically equivalent to the ones obtained when using the original definition of the TWF. Then, we compare the effectiveness of the traditional and the temporally-aware classifiers when applied to the AG-NEWS dataset (since the same conclusions drawn in the previous section, regarding ACM-DL and MEDLINE, also hold here).

An important aspect to be observed regarding Tables 5.7 and 5.8 is that using the estimated TWF led to results statistically equivalent to those obtained with the statistically defined TWF. This was assessed by a 2-tailed paired t-test, with a 99% confidence level. In fact, there is an interesting similarity between the distribution of temporal distances used to determine the TWF expression, illustrated in Figure 5.1, and the estimated TWFs for both datasets, illustrated in Figures 5.3a and 5.3b. This implies that we can adopt the automated procedure to determine the TWF without affecting the effectiveness of the temporally-aware algorithms. Furthermore, the same discussion presented in the previous section, regarding the quantitative analysis of the behavior of the temporally-aware classifiers w.r.t. each dataset, also holds here. Similarly, the poor performance of both in scores versions of the Naïve Bayes classifier is also attributed to the class imbalance problem (for the TWF in scores strategy) and to the lack of training documents associated to each class (for the extended TWF in scores strategy), just as before.

Algorithm                        Rocchio               KNN                   Naïve Bayes
Metric                           macF1(%)   micF1(%)   macF1(%)   micF1(%)   macF1(%)   micF1(%)
Baseline                         57.39      68.24      58.48      71.84      57.27      73.24
TWF in documents (100% of D)     60.21      70.70      60.08      73.88      61.38      74.60
                                 (+4.91)▲   (+3.60)▲   (+2.74)▲   (+2.84)▲   (+7.18)▲   (+1.86)•
TWF in documents (10% of D)      60.52      70.88      61.02      74.27      61.44      74.24
                                 (+5.45)▲   (+3.87)▲   (+4.84)▲   (+3.82)▲   (+7.28)▲   (+1.36)•
TWF in scores (100% of D)        60.47      72.90      61.88      74.53      45.16      64.55
                                 (+5.47)▲   (+6.83)▲   (+5.81)▲   (+3.74)▲   (-26.82)▼  (-13.46)▼
TWF in scores (10% of D)         59.68      72.40      61.37      73.77      44.47      64.58
                                 (+3.99)▲   (+6.10)▲   (+4.94)▲   (+2.69)▲   (-28.78)▼  (-13.41)▼
TWF in scores ext. (100% of D)   59.96      71.99      59.80      73.95      56.28      72.73
                                 (+4.48)▲   (+5.49)▲   (+2.26)▲   (+2.94)▲   (-1.76)•   (-0.70)•
TWF in scores ext. (10% of D)    59.85      71.79      59.76      73.85      56.19      72.70
                                 (+4.29)▲   (+5.20)▲   (+2.19)▲   (+2.80)▲   (-1.89)•   (-0.74)•

Table 5.7: Results Obtained when Incorporating the Estimated TWF to Rocchio, KNN, and Naïve Bayes (ACM-DL).

Algorithm                        Rocchio               KNN                   Naïve Bayes
Metric                           macF1(%)   micF1(%)   macF1(%)   micF1(%)   macF1(%)   micF1(%)
Baseline                         54.26      69.27      72.49      82.86      64.61      80.82
TWF in documents (100% of D)     54.03      69.48      73.96      82.76      67.95      82.98
                                 (-0.43)•   (+0.30)•   (+2.03)▲   (-0.12)•   (+5.17)▲   (+2.67)▲
TWF in documents (10% of D)      55.01      70.35      73.63      82.87      67.84      82.89
                                 (+1.38)•   (+1.56)•   (+1.57)•   (+0.01)•   (+5.00)▲   (+2.56)▲
TWF in scores (100% of D)        64.47      77.12      75.99      86.33      58.20      80.48
                                 (+18.82)▲  (+11.33)▲  (+4.83)▲   (+4.19)▲   (-9.92)▼   (-0.42)•
TWF in scores (10% of D)         64.25      77.03      75.88      86.36      58.23      80.51
                                 (+18.41)▲  (+11.20)▲  (+4.68)▲   (+4.22)▲   (-9.87)▼   (-0.38)•
TWF in scores ext. (100% of D)   64.53      77.16      74.63      85.07      64.64      81.12
                                 (+18.93)▲  (+11.39)▲  (+2.95)▲   (+2.67)▲   (-0.05)•   (+0.37)•
TWF in scores ext. (10% of D)    64.32      77.24      74.74      84.99      64.74      81.10
                                 (+18.54)▲  (+11.51)▲  (+3.10)▲   (+2.57)▲   (+0.20)•   (+0.35)•

Table 5.8: Results Obtained when Incorporating the Estimated TWF to Rocchio, KNN, and Naïve Bayes (MEDLINE).

Another important aspect to be observed in Tables 5.7 and 5.8 is that it does not matter whether we use the entire training set D or just 10% of it to learn the TWF. In fact, both alternatives led to statistically equivalent results (assessed by a 2-tailed paired t-test with 99% confidence) in all cases. This is an important property of Algorithm 2, since the smaller the training set (that is, its input), the smaller the expected runtime to learn the TWF.

Algorithm                        Rocchio               KNN                   Naïve Bayes
Metric                           macF1(%)   micF1(%)   macF1(%)   micF1(%)   macF1(%)   micF1(%)
Baseline                         54.89      58.16      60.05      68.49      60.92      67.83
TWF in documents (100% of D)     58.34      62.34      58.96      67.45      62.24      68.46
                                 (+6.29)▲   (+7.19)▲   (-1.85)•   (-1.54)•   (+2.17)▲   (+0.93)•
TWF in documents (10% of D)      58.35      62.29      58.91      67.35      62.38      68.55
                                 (+6.30)▲   (+5.68)▲   (-1.90)•   (-1.66)•   (+2.40)▲   (+1.06)•
TWF in scores (100% of D)        57.82      66.26      58.36      64.94      51.65      61.91
                                 (+5.34)▲   (+13.93)▲  (-2.90)▼   (-5.47)▼   (-15.22)▼  (-8.73)▼
TWF in scores (10% of D)         58.01      66.30      58.15      64.84      51.69      61.97
                                 (+5.68)▲   (+14.00)▲  (-3.16)▼   (-5.33)▼   (-15.15)▼  (-8.64)▼
TWF in scores ext. (100% of D)   57.72      66.12      59.12      68.93      56.43      65.21
                                 (+5.16)▲   (+13.69)▲  (-1.57)•   (+0.64)•   (-7.37)▼   (-3.86)▼
TWF in scores ext. (10% of D)    57.69      65.99      59.08      68.77      56.47      65.22
                                 (+5.10)▲   (+13.46)▲  (-1.61)•   (+0.41)•   (-7.30)▼   (-3.85)▼

Table 5.9: Results Obtained when Incorporating the Estimated TWF to Rocchio, KNN, and Naïve Bayes (AG-NEWS).

We now turn our attention to the AG-NEWS dataset. Similarly to the ACM-DL and MEDLINE datasets, the temporally-aware versions of the Rocchio classifier present the most significant improvements over the baseline, with gains of up to 6.29% and 14.00% for MacroF1 and MicroF1, respectively. As in the MEDLINE dataset, both "in scores" versions of the Rocchio classifier performed better than its "in documents" version. This is due to the nature of this dataset, evidenced by the quantitative analysis reported in Chapter 4. In fact, this dataset presents more prominent variations in the class distribution (CD) and the class similarities (CS) than in the term distribution (TD): as reported in Section 4.4.1, "the impact of the TD effect is consistently lower than the impact of CD (or CS) on all four algorithms" in this dataset. However, unlike the temporally-aware versions of Rocchio, the "in documents" versions of KNN and Naïve Bayes were statistically tied with their baselines (with the Naïve Bayes with TWF in documents being an exception, with statistically significant gains in MacroF1 of up to 2.40%). This is justifiable by the smaller extent of the TD effect in the AG-NEWS dataset. Accordingly, their "in scores" versions should perform better, but this was not what we observed. Indeed, both the KNN and Naïve Bayes with TWF in scores led to significant losses in both MacroF1 and MicroF1. Again, we attribute this to both the class imbalance and the data scarcity problems. In order to provide evidence for this problem, in Figure 5.8 we show the histogram of the ⟨c, p⟩ sizes (as done with the other two datasets). In fact, we can observe that 72% of the ⟨c, p⟩ sizes are smaller than 200 (with 46% of the ⟨c, p⟩ classes composed of at most 100 documents), with a CV equal to 1.85, a much more skewed and sparse distribution than those of the other two datasets.

Figure 5.8: Relative ⟨c, p⟩ Sizes for the AG-NEWS Dataset.

Recall that the "extended in scores" strategy aims at minimizing the influence of the class imbalance problem on the classification effectiveness. In fact, this strategy outperformed the "in scores" strategy in all cases. However, it was not able to outperform the baselines. This is due to the previously discussed data scarcity problem, which is, by the way, more pronounced in the AG-NEWS dataset than in the other two datasets. In contrast to the improvements obtained by the "extended in scores" version of the KNN classifier in the MEDLINE dataset, in AG-NEWS it resulted in statistical ties. Furthermore, in contrast to the statistical ties obtained by the extended in scores version of the Naïve Bayes classifier, in the AG-NEWS dataset it produced statistically significant losses. We conjecture that the adoption of strategies to overcome the data scarcity problem may improve the effectiveness of this strategy. Again, we leave this matter for future work.

Finally, we also compared the best temporally-aware classifiers to the state-of-the-art Support Vector Machine (Joachims, 1999) classifier. We adopted an efficient SVM implementation, SVM_Perf (Joachims, 2006), which is based on the maximum-margin approach and can be trained in linear time. We used a one-against-all methodology (see Manning et al., 2008) to adapt the binary SVM to multi-class classification, since, as presented in Section 4.1, the explored datasets present more than two classes. Such comparison is presented in Table 5.10. For the ACM-DL dataset (Table 5.10a), the significant gains are of 3.29% and 2.45% in MacroF1 (with statistical ties in MicroF1), for KNN with TWF in scores and Naïve Bayes with TWF in documents, respectively. Furthermore, both the Rocchio with TWF in scores and the KNN with TWF in documents obtained results statistically equivalent to the SVM results. In all these cases, the temporally-aware classifiers are faster than the SVM by more than one order of magnitude. For the MEDLINE dataset (Table 5.10b), the most significant gains are of 2.03% and 2.48% in MacroF1 and MicroF1, respectively, obtained by the KNN with TWF in scores. The extended in scores version of KNN achieved statistically tied results. As we shall discuss in Section 5.4.4, both classifiers are significantly faster than the SVM. Considering that SVM is a state-of-the-art classifier, and that both datasets are imbalanced, our results evidence the quality of the proposed solution. Considering the AG-NEWS dataset (Table 5.10c), the best performing temporally-aware classifier was unable to outperform the SVM due to the limitations already discussed. However, it is worth noting that our temporally-aware classifier was not drastically outperformed, and there is room for improvement.

5.4.4 Runtime Analysis

Now we turn our attention to the efficiency of our proposed classifiers, in terms of execution time. We start by considering the average time spent by the classifiers in each iteration of the K-fold cross-validation, and comparing the temporally-aware algorithms with both their traditional counterparts and the state-of-the-art Support Vector Machine (Joachims, 1999) classifier (using the previously described SVM_Perf implementation). Next, we analyze the additional cost associated with the automatic determination of the TWF.

We report in Table 5.11 the average execution time of each traditional classifier (rows entitled "Traditional", including the measurement for the SVM classifier) and of their temporally-aware versions (rows "In Documents", "In Scores" and "Extended In Scores"), along with the standard deviation around the mean value (reported after the ± symbol). We consider the execution time measured for the overall classification task, comprising both the training and test stages.¹ The measurements regarding the ACM-DL dataset refer to the classification of 2490 documents using 19918 training documents, while the measurements regarding the MEDLINE dataset refer to the testing of 86145 documents with a classification model learned from 689163 training documents. Finally, the measurements regarding the AG-NEWS dataset refer to the classification of 83580 documents based on a classification model learned from 668636 documents. Clearly, the columns of this table are not comparable, and we consider the measurements for each dataset independently.

¹ Recall that in our experimental setup, using the 10-fold cross-validation strategy, one fold is used as the test set, another one is retained as the validation set and the remaining folds are used as the training set.


(a) ACM-DL Dataset
Algorithm                            macF1(%)    micF1(%)
SVM                                  59.91       73.88
Rocchio with TWF in scores           60.47       72.90
                                     (+0.93)•    (-1.34)•
KNN with TWF in documents            59.78       73.88
                                     (-0.22)•    (+0.00)•
KNN with TWF in scores               61.88       74.53
                                     (+3.29)▲    (+0.88)•
Naïve Bayes with TWF in documents    61.38       74.60
                                     (+2.45)▲    (+0.97)•

(b) MEDLINE Dataset
Algorithm                            macF1(%)    micF1(%)
SVM                                  74.48       84.24
KNN with TWF in scores               75.99       86.33
                                     (+2.03)▲    (+2.48)▲
KNN with TWF in scores ext.          74.63       85.07
                                     (+0.20)•    (+0.98)•

(c) AG-NEWS Dataset
Algorithm                            macF1(%)    micF1(%)
SVM                                  64.94       72.59
Naïve Bayes with TWF in documents    62.38       68.55
                                     (-4.10)▼    (-5.56)▼

Table 5.10: Effectiveness Comparison: Best Performing Temporally-Aware Classifiers versus SVM.

As one could expect, our temporally-aware classifiers are typically slower than their traditional counterparts. This comes as no surprise, since there is the overhead of considering and managing the temporal information. Furthermore, the temporally-aware classifiers are, by nature, lazy classifiers, which comes at the cost of a higher runtime. In addition, our temporally-aware versions incurred a higher increase in execution time in the AG-NEWS dataset, due to the higher number of points in time of this dataset. However, in almost all cases our lazy temporally-aware classifiers were more efficient, in terms of execution time, than the SVM classifier.

Algorithm                              Runtime (seconds) per Dataset
                                       ACM-DL         MEDLINE           AG-NEWS
SVM           Traditional              144.10±5.30    26955.0±2356.0    28667.0±1151.0
Rocchio       Traditional              2.00±0.00      111.0±0.0         96.5±0.5
              In Documents             6.60±0.52      209.5±12.5        4615.5±89.5
              In Scores                9.00±0.00      300.5±3.5         5287.5±9.5
              Extended In Scores       7.20±0.42      263.5±0.5         3807.0±29.0
KNN           Traditional              8.90±0.32      13442.5±79.5      8154.0±60.0
              In Documents             11.03±0.48     15949.0±51.0      10368.5±8.5
              In Scores                10.10±0.31     12557.5±630.5     8630.5±349.5
              Extended In Scores       8.40±0.75      7753.5±78.5       4711.5±46.5
Naïve Bayes   Traditional              5.00±0.00      213.0±7.0         186.5±0.5
              In Documents             9.10±0.32      293.0±2.0         3780.0±95.0
              In Scores                63.80±1.32     1311.0±1.0        43570.0±85.0
              Extended In Scores       60.50±1.18     656.5±6.5         38966.5±108.5

Table 5.11: Execution Time (in seconds) of each Explored ADC Algorithm.

We also compared the best versions of the methods previously proposed, for example, the KNN with TWF in scores and the Naïve Bayes with TWF in documents, to the SVM classifier, in terms of efficiency (execution time). Such comparison is presented in Table 5.12. For the ACM-DL dataset (Table 5.12a), our best performing classifiers were up to 13 times faster than the SVM, while, at least, matching the SVM effectiveness (as reported in Table 5.10a). For the MEDLINE dataset (Table 5.12b), the KNN with TWF in scores was more than two times faster than the SVM, and, as previously reported, outperformed this classifier in both MacroF1 and MicroF1. The extended in scores version of KNN was more than three times faster than the SVM. Considering the AG-NEWS dataset (Table 5.12c), our best performing temporally-aware classifier was almost eight times faster than the SVM (but unable to outperform the SVM classifier in terms of effectiveness).

(a) ACM-DL Dataset
Algorithm                            Time (s)
SVM                                  144.10±5.30
Rocchio with TWF in scores           9.00±0.00
KNN with TWF in documents            11.03±0.48
KNN with TWF in scores               10.10±0.31
Naïve Bayes with TWF in documents    9.10±0.32

(b) MEDLINE Dataset
Algorithm                            Time (s)
SVM                                  26955.0±2356.0
KNN with TWF in scores               12557.5±630.5
KNN with TWF in scores ext.          7753.5±78.5

(c) AG-NEWS Dataset
Algorithm                            Time (s)
SVM                                  28667.0±1151.0
Naïve Bayes with TWF in documents    3780.0±95.0

Table 5.12: Execution Time Comparison: Best Performing Temporally-Aware Classifiers versus SVM.

Finally, we now consider the efficiency of the TWF determination algorithm. Recall from Algorithm 2 that it is necessary to perform a classification over the training set in order to estimate the a posteriori probability distribution P(p_i | d_i) and then determine the TWF. There are two key aspects to be considered. First, since it does not matter which of the three classifiers is used to learn the TWF (since they led to statistically equivalent results), it is advisable to use the Rocchio classifier for doing so, since it is, by far, the most efficient one (as can be observed in Table 5.11, by comparing the traditional versions of each of them). Second, it is clear that if the training set size increases, the cost involved in determining the TWF also increases and can be potentially prohibitive. To better understand the dependency between the execution time of the TWF determination and the training set size, we measured the execution time spent to determine the TWF using the entire training set D and a per point in time sample of D, obtained by randomly selecting 10% of the documents of D. We then compared such measurements with the time spent by the fastest temporally-aware classifiers and the SVM classifier, for each explored dataset. This comparison is reported in Table 5.13. As we can observe, the time required to automatically learn the TWF is negligible when compared to the time spent by the classification task. In addition to this efficiency aspect, the TWF determination is inherently an offline procedure, guaranteeing its practical applicability.


Dataset      Runtime (seconds)
             TWF Determination                   Fastest Temporally-Aware Classifier
             10% of D        Entire D
ACM-DL       0.77±0.02       4.49±0.04           6.60±0.52 (Rocchio in documents)
MEDLINE      31.00±3.00      180.00±28.00        209.50±12.50 (Rocchio in documents)
AG-NEWS      120.00±8.00     1560.00±25.00       3780.00±95.00 (Naïve Bayes in documents)

Table 5.13: Execution Time of the TWF Estimation using the Rocchio Classifier.

5.5 Chapter Summary

In this chapter, we discussed the impact that temporal effects may have on ADC, and proposed new strategies for instance weighting that lead to more accurate classification models. We started by proposing a methodology to model a temporal weighting function (TWF) that captures changes in term-class relationships for a given period of time. For the ACM-DL and MEDLINE datasets, we showed that the TWF follows a lognormal distribution, whose parameters may be easily determined using statistical methods (see Section 5.1). For the AG-NEWS dataset, on the other hand, we showed that the same hypothesis testing procedures adopted for the ACM-DL and MEDLINE datasets failed, implying that its associated TWF follows a distinct (yet unknown) distribution. Thus, assessing the TWF associated to the AG-NEWS dataset requires some other (possibly more complex) statistical tests, motivating the development of a strategy to determine the TWF that overcomes the need to perform such tests. As a matter of fact, for the sake of temporally-aware ADC, one just needs to know the positive real-valued weight associated to each temporal distance δ. In Section 5.2 we described such a strategy, which uses the ADC algorithms themselves to determine the mapping δ ↦ R+. This is accomplished by estimating the a posteriori probability distribution P(p_i | d_i) and gathering the relative frequencies of the temporal distances δ between the predicted point in time p_i and the actual point in time d_i.p in which d_i was created.
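The idea can be sketched as follows; this is an illustration of the description above, not a verbatim transcription of Algorithm 2, and the point-in-time predictor is a placeholder for whichever ADC algorithm is used (Rocchio in our experiments).

    # Sketch: estimate the TWF as the relative frequency of the temporal distances
    # delta between the predicted and the actual creation points in time of the
    # training documents.
    from collections import Counter

    def estimate_twf(training_docs, predict_point_in_time):
        deltas = Counter()
        for d in training_docs:
            p_predicted = predict_point_in_time(d)   # most likely point in time for document d
            deltas[abs(p_predicted - d.time)] += 1   # temporal distance delta
        total = sum(deltas.values())
        weights = {delta: count / total for delta, count in deltas.items()}
        return lambda delta: weights.get(delta, 0.0)  # mapping delta -> weight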

We then presented our three strategies to incorporate the TWF into classifiers: TWF in documents, TWF in scores and an extended version of the TWF in scores strategy. The TWF in documents strategy weights each training document by the TWF according to its temporal distance to the test document. The TWF in scores strategy, in contrast, learns the scores for each class c by using a traditional ADC algorithm to first learn scores for the derived classes ⟨c, p⟩, and then aggregating these scores using the TWF to weight them (that is, score_c = Σ_p score(c, p) · TWF_δ). Finally, the extended TWF in scores strategy partitions the training documents into sub-groups of documents with the same creation point in time (and thus without temporal variability in the term-class relationships), learns a series of classification models for each partition and aggregates the generated scores using the TWF to weight them.

The three strategies were implemented considering three traditional classifiers, namely Rocchio, KNN, and Naïve Bayes. Results with the traditional versions of these classifiers and the temporally-aware ones showed that considering temporal information significantly improves the results of the traditional classifiers. We also showed that even when using only 10% of the training set to automatically determine the TWF we can accurately estimate it and achieve results comparable to the ones obtained using the whole training set for doing so. This highlights that, in addition to the fact that this strategy overcomes the need to perform potentially complex hypothesis tests to determine the TWF, it demands quite a small additional cost for doing so, being usually performed in an offline manner. Also, both the temporally-aware KNN and Naïve Bayes achieved better results than the SVM in the ACM-DL and MEDLINE datasets, with better runtime performance. Considering that SVM is a state-of-the-art classifier, and that the explored datasets are imbalanced, our results evidence the quality of our solution, coupled with an efficient implementation.


Chapter 6

Conclusions and Future Work

In this chapter we summarize the research contributions of this dissertation and point out

some directions for further investigation.

6.1 A Quantitative Analysis of Temporal Effects on ADC

In this work, we proposed a methodology, based on a series of full factorial designs, to

evaluate the impact of temporal effects on ADC algorithms when applied to distinct textual

datasets. First, we extended the characterization performed by Mourão et al. (2008), pro-

viding evidence of the existence of three temporal effects in three textual datasets, namely

ACM-DL, MEDLINE and AG-NEWS. Then, we instantiated the methodology to quantify

the impact of the temporal aspects on the classification effectiveness of four well-known

ADC algorithms, namely Rocchio, KNN, Naïve Bayes and SVM.

Our characterization results show that, contrary to the assumption of static data dis-

tribution on which most of the ADC algorithms rely, each reference dataset has a specific

temporal behavior, exhibiting changes in the underlying data distribution with time. Such

temporal variations potentially limit the classification performance. According to our re-

sults, the ACM-DL and AG-NEWS datasets are much more dynamic than the MEDLINE dataset, implying that the four explored ADC algorithms would be more impacted by the temporal aspects in the first two datasets. In addition to such findings, our proposed methodology enabled us to quantify the impact of each temporal aspect on the analyzed datasets and algorithms, allowing us to answer the following two questions, posed in Chapter 4:

1. Which temporal effects are more influential in each dataset? In the ACM-DL dataset, the impact of the observed temporal variations in the distribution of class sizes and in the pairwise class similarities is statistically equivalent to the impact of the observed variations in the term distribution on most classifiers (SVM being an exception). MEDLINE and AG-NEWS, on the other hand, are clearly more impacted by the first two temporal aspects. These findings reveal the challenges imposed by the temporal effects and show that developing strategies to handle them in ADC algorithms is a promising research direction.

2. What is the behavior of each ADC algorithm when faced with different levels of each temporal aspect? All four explored ADC algorithms suffer a negative impact of the temporal aspects in terms of classification effectiveness, with the most significant impacts observed when these algorithms are applied to the most dynamic datasets (i.e., ACM-DL and AG-NEWS). The SVM classifier was shown to be more robust to the term distribution aspect, while still being impacted by the other two aspects. The other three algorithms, on the other hand, are very sensitive to all three aspects. Thus, the temporal dimension turns out to be an important aspect that has to be considered when learning accurate classification models.

6.2 Temporally-Aware Algorithms for ADC

Beyond quantifying the impact of the temporal effects on ADC algorithms, we proposed strategies to minimize their impact on three well-known ADC algorithms, based on an instance weighting paradigm to devise more accurate classification models. We started by proposing a methodology to model a Temporal Weighting Function (TWF) that captures changes in term-class relationships for a given period of time. For two of the three real datasets explored, namely ACM-DL and MEDLINE, we showed that their TWFs follow a lognormal distribution, whose parameters may easily be tuned using statistical methods. On the other hand, the TWF associated to the AG-NEWS dataset does not follow a normal distribution (even in the log-transformed space). Indeed, the straightforward tests for independence and normality of random variables failed, with 99% confidence, and some other (possibly more complex) tests should be performed. To guarantee the practical employment of the temporally-aware classifiers, automated ways to determine the TWF are desirable. As a matter of fact, for the sake of temporally-aware ADC, one just needs to know the positive real-valued weights associated to each temporal distance. Thus, we also proposed a fully-automated strategy to devise the TWF.

In order to incorporate the TWF into classifiers, we proposed three approaches: TWF in documents, TWF in scores and the extended TWF in scores. TWF in documents weights each training document by the TWF according to its temporal distance to the test document. TWF in scores, in contrast, takes into account scores produced by a traditional classifier applied to a modified training set where the class of each training document c is mapped to a derived class c ↦ ⟨c, p⟩, with p denoting the training document's creation point in time, ultimately tying together the observed patterns and both the class and temporal information. A weighted sum of the learned scores is then performed, according to the TWF. Finally, the extended TWF in scores partitions the training documents into sub-groups of documents with the same creation point in time (and thus without temporal variability in the term-class relationships) including documents of all classes, learns a series of classification models for each partition and aggregates the generated scores using the TWF to weight them. These strategies were incorporated into three traditional classifiers, namely Rocchio, KNN, and Naïve Bayes.

Results with the traditional versions of these classifiers and the temporally-aware ones showed that considering temporal information significantly improves the results of the traditional classifiers. We also studied the impact of estimating the TWF and incorporating it into the classifiers, both in terms of effectiveness and efficiency. Two important aspects were discussed. First, all three explored ADC algorithms provided an accurate TWF estimation. Due to its efficiency and the similar results obtained when compared to the other classifiers, we chose Rocchio to estimate the TWF. Second, sampling 10% of the training documents (on a per point in time basis) to learn the TWF provided the same gains in the temporally-aware classifiers as when using all the training set. This further reduces the additional cost in the runtime of the classification task. Also, both the temporally-aware KNN and Naïve Bayes achieved more effective results than the SVM, also with better overall performance (e.g., considering the ACM-DL dataset, our best performing classifiers were up to 13 times faster than the SVM). Considering that SVM is a state-of-the-art classifier, and that both collections are very imbalanced, our results evidence the quality of our solution, coupled with an efficient implementation.

6.2.1 Limitations

The proposed temporally-aware algorithms have some limitations and, consequently, there is room for further improvements. These include:

Data Imbalance: As we discussed, the "in scores" versions of our classifiers are sensitive to the data imbalance observed when considering each derived class ⟨c, p⟩. Actually, class imbalance is considered a challenge by the Data Mining community (Yang and Wu, 2006). This is a rather common scenario that arises due to several factors, such as incomplete sampling of labeled data due to crawling problems, ephemeral events, the high costs involved in labeling data, and so on. Strategies to handle these cases are promising towards improving the effectiveness of the "in scores" strategy.

Data Scarcity: Another major technical challenge faced by the Data Mining community relates to the scarcity of training data. In fact, both the "in scores" and the "extended in scores" versions of our temporally-aware classifiers have their performance limited by this problem, since the number of documents assigned to some class c and created at the point in time p may not be sufficient to learn accurate estimates. Again, strategies to tackle this problem are good candidates to improve the effectiveness of both versions of the temporally-aware classifiers.

6.3 Future Work

As future work, we intend to incorporate temporal information into the SVM classifier, by defining kernel functions that use the proposed TWF. We also plan to refine the TWF, which can be further improved in, at least, two ways. First, it can be defined on a finer-grained basis, in order to account for the potentially distinct evolutive behavior of terms (that is, the TWF may be further refined to account not only for the temporal distances between documents but also for each term in isolation). Second, as discussed in Section 2.3, the temporal unit used to determine the documents' timeliness is defined according to the domain to which the temporally-aware classifiers are applied. This is done in a purely qualitative fashion. A well-established way to define the temporal unit is thus highly desirable, and a promising strategy for doing so is Formal Concept Analysis (FCA) (Ganter and Wille, 1999). FCA is a well-studied mathematical framework that is able to uncover implicit relationships between objects and their attributes, ultimately finding ontologies (Ganter et al., 2005; Wille, 2005). Such a framework is widely used in concept classification and knowledge management. Considering our temporally-aware strategies, one can use FCA to automatically determine temporal periods which can be used as a kind of temporal unit, instead of its purely qualitative determination. With such a strategy, one can determine semantically meaningful groups of documents that share some underlying data distribution, which is invariant over time, and thus can be exploited to infer a proper temporal granularity in a fully-automated manner.

Another aspect that can be further improved relates to memory and time efficiency. Nowadays, very large databases are becoming even more common. Several organizations have to maintain databases that grow without a limit, at a surprisingly fast rate. Clearly, the classification of such data streams brings some challenging problems to be handled, such as hard memory/time constraints. In fact, mining high-speed drifting data streams is a topic that is continuously receiving attention from the Data Mining community. While our classifiers are still able to provide high quality classification with execution times much smaller than those of the state-of-the-art SVM classifier, the assessment of the test document's creation point in time before learning the classification model (that is, the lazy nature of our classifiers) may prevent the applicability of the temporally-aware classifiers in such high-speed streaming scenarios. The definition of non-lazy strategies for ADC that can take advantage of temporal information in a memory/time efficient way (e.g., by incrementally adjusting the classification model according to the observed variations in the underlying data distribution) is a promising research direction.

Factors other than the documents' timeliness may also be exploited towards the construction of more effective classification models. Indeed, we have already achieved some interesting results when exploiting the underlying citation and authorship networks extracted from the ACM-DL dataset (de M. Palotti et al., 2010), and a further investigation on this matter may be valuable. For example, tying together the information gathered from these networks with the documents' timeliness may be an interesting research direction.

Finally, in a classification framework, not only the learning step may be affected by the temporal dynamics of data, but also some of the data pre-processing steps, such as feature selection and data sampling. For example, since several ADC algorithms are affected by the class imbalance problem, where some classes are more representative than others, it is a common strategy to pre-process the data in order to provide more balanced training sets. The usual way to balance the class distribution on training data consists of oversampling the smaller classes or undersampling the larger ones. However, to the best of our knowledge, none of the already proposed strategies for data balancing handles the temporal dimension. Thus, we plan to further study the impact of the temporal dynamics on class balancing strategies. Furthermore, we consider that strategies for feature selection may be improved by considering the evolutive behavior of terms (for example, considering not only the predictive power of terms, but also their temporal stability). This may reveal effective approaches to further improve such data processing strategies and, ultimately, lead to more accurate classification models.


Bibliography

Alonso, O., Gertz, M., and Baeza-Yates, R. (2007). On the value of temporal information in information retrieval. SIGIR Forum, 41(2):35–41.

Baeza-Yates, R. and Ribeiro-Neto, B. (2011). Modern Information Retrieval: The Concepts and Technology Behind Search. Addison-Wesley, Boston, MA.

Bifet, A. and Gavaldà, R. (2006). Kalman filters and adaptive windows for learning in data streams. In Discovery Science, pages 29–40, Barcelona, Spain.

Bifet, A. and Gavaldà, R. (2007). Learning from time-changing data with adaptive windowing. In Proceedings of the SIAM International Conference on Data Mining, pages 443–448, Minneapolis, USA.

Breiman, L. and Spector, P. (1992). Submodel Selection and Evaluation in Regression - the X-Random Case. International Statistical Review, 60(3):291–319.

Caldwell, N. H. M., Clarkson, P. J., Rodgers, P. A., and Huxor, A. P. (2000). Web-based knowledge management for distributed design. IEEE Intelligent Systems, 15(3):40–47.

Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: A Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chen, E., Lin, Y., Xiong, H., Luo, Q., and Ma, H. (2011). Exploiting probabilistic topic models to improve text categorization under class imbalance. Information Processing & Management, 47(2):202–214.

Clarkson, D. B., Fan, Y.-a., and Joe, H. (1993). A remark on algorithm 643: FEXACT: an algorithm for performing Fisher's exact test in r x c contingency tables. ACM Transactions on Mathematical Software, 19(4):484–488.

Cohen, W. W. and Singer, Y. (1999). Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems, 17(2):141–173.


Crow, E. L. and Shimizu, K. (1988). Log-normal Distributions: Theory and Applications. Dekker, New York, NY.

D'Agostino, R. B. and Pearson, E. S. (1973). Tests for departure from normality. Biometrika, 60:613–622.

de Lima, E. B., Pappa, G. L., de Almeida, J. M., Gonçalves, M. A., and Meira Jr., W. (2010). Tuning genetic programming parameters with factorial designs. In Proceedings of the IEEE Congress on Evolutionary Computation, pages 1–8, Barcelona, Spain.

de M. Palotti, J. R., Salles, T., Pappa, G. L., Arcanjo, F., Gonçalves, M. A., and Meira Jr., W. (2010). Estimating the credibility of examples in automatic document classification. Journal of Information and Data Management, 1(3):439–454.

Dries, A. and Rückert, U. (2009). Adaptive concept drift detection. Statistical Analysis and Data Mining, 2(5-6):311–327.

Drummond, C. (2006). Discriminative vs. generative classifiers for cost sensitive learning. In Canadian Conference on AI, pages 479–490, Québec, Canada.

Fdez-Riverola, F., Iglesias, E., Díaz, F., Méndez, J., and Corchado, J. (2007). Applying lazy learning algorithms to tackle concept drift in spam filtering. Expert Systems with Applications, 33(1):36–48.

Folino, G., Pizzuti, C., and Spezzano, G. (2007). An adaptive distributed ensemble approach to mine concept-drifting data streams. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, pages 183–188, Patras, Greece.

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289–1305.

Forman, G. (2006). Tackling concept drift by temporal inductive transfer. In Proceedings of the International ACM SIGIR Conference on Research & Development of Information Retrieval, pages 252–259, Washington, USA.

Gama, J., Medas, P., Castillo, G., and Rodrigues, P. (2004). Learning with drift detection. In Proceedings of the Brazilian Symposium on Artificial Intelligence, pages 286–295, São Luís, Brazil.

Ganter, B., Stumme, G., and Wille, R., editors (2005). Formal Concept Analysis, Foundations and Applications, volume 3626 of Lecture Notes in Computer Science. Springer.

Ganter, B. and Wille, R. (1999). Formal Concept Analysis: Mathematical Foundations. Springer, Berlin, Heidelberg.


Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning. Springer, New York, NY.

Hollander, M. and Wolfe, D. A. (1999). Nonparametric Statistical Methods. Wiley-Interscience, New York, NY.

Jain, R. (1991). The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. John Wiley, New York, NY.

Joachims, T. (1999). Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods - Support Vector Learning, chapter 11, pages 169–184. MIT Press.

Joachims, T. (2006). Training linear SVMs in linear time. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226, Philadelphia, USA.

Kelly, M. G., Hand, D. J., and Adams, N. M. (1999). The impact of changing populations on classifier performance. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 367–371, San Diego, USA.

Kim, Y. S., Park, S. S., Deards, E., and Kang, B. H. (2004). Adaptive web document classification with MCRDR. In Proceedings of the International Conference on Information Technology: Coding and Computing, pages 476–480, Las Vegas, USA.

Klinkenberg, R. (2004). Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis, 8(3):281–300.

Klinkenberg, R. and Joachims, T. (2000). Detecting concept drift with support vector machines. In Proceedings of the International Conference on Machine Learning, pages 487–494, Stanford, USA.

Klinkenberg, R. and Rüping, S. (2003). Concept drift and the importance of example. In Text Mining - Theoretical Aspects and Applications, pages 55–78. Physica-Verlag.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1137–1143, Québec, Canada.

Kolter, J. Z. and Maloof, M. A. (2003). Dynamic weighted majority: A new ensemble method for tracking concept drift. Technical report, Department of Computer Science, Georgetown University, Washington, USA.


Koren, Y. (2010). Collaborative filtering with temporal dynamics. Communications of the

ACM, 53:89–97.

Koychev, I. (2000). Gradual forgetting for adaptation to concept drift. InProceedings of the

ECAI Workshop Current Issues in Spatio-Temporal Reasoning, pages 101–106, Berlin,

Germany.

Kuncheva, L. I. and Žliobaite, I. (2009). On the window size for classification in changing

environments.Intelligent Data Analysis, 13(6):861–872.

Lawrence, S. and Giles, C. L. (1998). Context and page analysis for improved web search.

IEEE Internet Computing, 2(4):38–46.

Lazarescu, M. M., Venkatesh, S., and Bui, H. H. (2004). Using multiple windows to track concept drift. Intelligent Data Analysis, 8(1):29–59.

Limpert, E., Stahel, W. A., and Abbt, M. (2001). Log-normal distributions across the sciences: Keys and clues. BioScience, 51(5):341–352.

Lin, Z., Hao, Z., Yang, X., and Liu, X. (2009). Several SVM ensemble methods integrated with under-sampling for imbalanced data learning. In Proceedings of the International Conference on Advanced Data Mining and Applications, pages 536–544, Beijing, China.

Liu, A., Ghosh, J., and Martin, C. (2007). Generative oversampling for mining imbalanced datasets. In Proceedings of the International Conference on Data Mining, pages 66–72, Las Vegas, USA.

Liu, R.-L. and Lu, Y.-L. (2002). Incremental context mining for adaptive document classification. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 599–604, Edmonton, Canada.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, New York, NY.

Miao, Y.-Q. and Kamel, M. (2011). Pairwise optimized Rocchio algorithm for text categorization. Pattern Recognition Letters, 32(2):375–382.

Mourão, F., Rocha, L., Araújo, R., Couto, T., Gonçalves, M., and Meira Jr., W. (2008). Understanding temporal aspects in document classification. In Proceedings of the International Conference on Web Search and Web Data Mining, pages 159–170, Palo Alto, USA.

Nishida, K. and Yamauchi, K. (2007). Detecting concept drift using statistical testing. In Proceedings of the International Conference on Discovery Science, pages 264–269, Sendai, Japan.

Nishida, K. and Yamauchi, K. (2009). Learning, detecting, understanding, and predicting concept changes. In Proceedings of the International Joint Conference on Neural Networks, pages 283–290, Atlanta, USA.

Orair, G. H., Teixeira, C., Wang, Y., Meira Jr., W., and Parthasarathy, S. (2010). Distance-based outlier detection: Consolidation and renewed bearing. Proceedings of the VLDB Endowment, 3(2):1469–1480.

Rasmussen, C. E. and Williams, C. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.

Rocha, L., Mourão, F., Pereira, A., Gonçalves, M. A., and Meira Jr., W. (2008). Exploiting temporal contexts in text classification. In Proceedings of the ACM Conference on Information and Knowledge Management, pages 243–252, Napa Valley, USA.

Salles, T., Rocha, L., Mourão, F., Pappa, G. L., Cunha, L., Gonçalves, M. A., and Meira Jr., W. (2010a). Automatic document classification temporally robust. Journal of Information and Data Management, 1(2):199–212.

Salles, T., Rocha, L., Pappa, G. L., Mourão, F., Gonçalves, M. A., and Meira Jr., W. (2010b). Temporally-aware algorithms for document classification. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 307–314, Geneva, Switzerland.

Scholz, M. and Klinkenberg, R. (2007). Boosting classifiers for drifting concepts. Intelligent Data Analysis, 11(1):3–28.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.

Sun, A., Lim, E.-P., and Liu, Y. (2009). On strategies for imbalanced text classification using SVM: A comparative study. Decision Support Systems, 48(1):191–201.

Tan, S. (2005). Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Systems with Applications, 28(4):667–671.

Tsymbal, A. (2004). The problem of concept drift: Definitions and related work. Technical report, Department of Computer Science, Trinity College, Dublin, Ireland.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience, New York, NY.

Vaz de Melo, P. O., da Cunha, F. D., Almeida, J. M., Loureiro, A. A., and Mini, R. A. (2008). The problem of cooperation among different wireless sensor networks. In Proceedings of the International Symposium on Modeling, Analysis and Simulation of Wireless and Mobile Systems, pages 86–91, Vancouver, Canada.

Žliobaite, I. (2009). Combining time and space similarity for small size learning under concept drift. In Proceedings of the International Symposium on Foundations of Intelligent Systems, pages 412–421, Prague, Czech Republic.

Wang, H., Fan, W., Yu, P. S., and Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235, Washington, USA.

Widmer, G. and Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101.

Wille, R. (2005). Formal concept analysis as mathematical theory of concepts and concept hierarchies. In Formal Concept Analysis, pages 1–33.

Yang, C. and Zhou, J. (2008). Non-stationary data sequence classification using online class priors estimation. Pattern Recognition, 41(8):2656–2664.

Yang, Q. and Wu, X. (2006). 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making, 5(4):597–604.

Zhang, Z. and Zhou, J. (2010). Transfer estimation of evolving class priors in data stream classification. Pattern Recognition, 43(9):3151–3161.