Unsupervised Document Classification and Automatic Topic Extraction
Joaquim Silva, Universidade Nova de Lisboa, Portugal
João Mexia, Universidade Nova de Lisboa, Portugal
Agra Coelho, Universidade Técnica de Lisboa, Portugal
Gabriel Lopes, Universidade Nova de Lisboa, Portugal
FC / 02
• Extracting Relevant Expressions (REs) from Documents
• Application to Document Clustering and Automatic Ontology Extraction
• Other applications
Extracting REs from Documents
REs:
• Common agriculture policy
• Common Customs
• Produits agricoles
• Economia de energia
• Rational use of energy
• Energy saving in the public sector
• Publication au journal officiel des Communautés
First experiments: frequencies of bigrams and 4-grams in decreasing order

Freq.  Bigram
1528   - O
 891   - A
 348   Estados Unidos
 203   05 Jan
 195   De acordo
 188   Agência Lusa
 179   Banco de
 165   Conselho de
  40   Libertação Nacional
  40   Irlanda do
  40   Câmara de
  40   13 -
  39   Nacional de
  39   Na sua
  39   Geral de
  39   Campeonato Nacional
  15   Câmara dos
  15   Comissão Nacional
  15   Com o
  15   Carvalho da
  15   Cabo Verde
  15   Bósnia e
  15   Associação 25
  15   As conversações
   4   Mês Cultural
   4   México e
   4   Mário Tomé
   4   Municipalizados de
   4   Municipal e
   4   Mundo dos
   4   Ministério de
   4   Minas Gerais

Freq.  4-gram
  75   - Notícias breves da
  74   Notícias breves da actualidade
  64   - A bolsa de
  60   do Banco de Portugal
  59   ministro dos negócios estrangeiros
  58   - Notícias breves da
  57   Notícias breves da actualidade
  54   De acordo com o
  51   De acordo com a
  49   por cento do que
  49   disse à Agência Lusa
  46   na africa do Sul
  45   com o objectivo de
  20   na abertura do mercado
  20   na Assembleia da Republica
  20   em conferência de imprensa
  20   do que no fecho
  20   do campeonato português de
  20   Ministro dos Negócios Estrangeiros
  20   - A Camara Municipal
  19   presidente de Camara Municipal
  19   por cento para o
  19   face às principais divisas
  19   disse hoje à Agência
  19   de final da Taça
  19   da Santa Casa da
   4   visita oficial de dois
   4   visa protestar contra a
   4   vila franca do campo
   4   vice-ministro dos negócios estrangeiros
   4   verde deverá continuar a
   4   venda e do transkei
   4   valores estavam hoje a

This criterion (raw frequency) penalizes sequence length.
Collocations proposed after the Justeson and Katz filters

f(w1 w2)  w1 w2            Pattern
11487     New York         A N
 7261     United States    A N
 5412     Los Angeles      N N
 3301     last year        A N
 3191     Saudi Arabia     N N
 2699     last week        A N
 2514     vice president   A N
 2378     Persian Gulf     A N
 2161     San Francisco    N N
 2106     President Bush   N N
 2001     Middle East      A N
 1942     Saddam Hussein   N N
 1867     Soviet Union     A N
 1850     White House      A N
 1633     United Nations   A N
 1337     York City        N N
 1328     oil prices       N N
 1210     next year        A N
 1074     chief executive  A N
 1073     real estate      A N

Morpho-syntactic information is required. Long sequences are still penalized by the frequency criterion.
Tools used for extracting REs [Silva and Lopes 99]:
• LocalMaxs algorithm
• Fair Dispersion Point Normalization (FDPN)
• Symmetric Conditional Probability (SCP)
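Measuring 2-gram cohesion

$$SCP((x,y)) = p(x|y) \cdot p(y|x) = \frac{p(x,y)^2}{p(x) \cdot p(y)}$$

Applying FDPN to measure n-gram cohesion (n > 2). Building pseudo-2-grams

$$SCP\_f((w_1 \ldots w_n)) = \frac{p(w_1 \ldots w_n)^2}{F}$$

where

$$F = \frac{1}{n-1} \sum_{i=1}^{n-1} p(w_1 \ldots w_i) \cdot p(w_{i+1} \ldots w_n)$$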
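Ex:

$$F = \frac{1}{6-1} \big[\, p(\text{energy}) \cdot p((\text{saving, in, the, public, sector}))$$
$$\quad + \; p((\text{energy, saving})) \cdot p((\text{in, the, public, sector}))$$
$$\quad + \; p((\text{energy, saving, in})) \cdot p((\text{the, public, sector}))$$
$$\quad + \; p((\text{energy, saving, in, the})) \cdot p((\text{public, sector}))$$
$$\quad + \; p((\text{energy, saving, in, the, public})) \cdot p(\text{sector}) \,\big]$$

$$SCP\_f((\text{energy, saving, in, the, public, sector})) = \frac{p((\text{energy, saving, in, the, public, sector}))^2}{F}$$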
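The SCP_f computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `ngram_counts` and `scp_f` are hypothetical, and p(.) is estimated as the n-gram count divided by the number of tokens in the corpus, which the slides do not specify.

```python
from collections import Counter

def ngram_counts(tokens, max_n):
    """Count every contiguous n-gram (as a tuple) for n = 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def scp_f(ngram, counts, total):
    """SCP_f of an n-gram (n >= 2): p(w1..wn)^2 / F.

    F is the FDPN average over the n-1 ways of splitting the n-gram
    into a left part and a right part (the pseudo-2-grams).
    """
    n = len(ngram)

    def p(gram):
        # Assumption: probability estimated as count / corpus size.
        return counts[gram] / total

    f_avg = sum(p(ngram[:i]) * p(ngram[i:]) for i in range(1, n)) / (n - 1)
    return p(ngram) ** 2 / f_avg if f_avg > 0 else 0.0

tokens = ["a", "b", "a", "b", "c"]          # toy corpus, not from the slides
counts = ngram_counts(tokens, max_n=3)
cohesion_ab = scp_f(("a", "b"), counts, len(tokens))
cohesion_abc = scp_f(("a", "b", "c"), counts, len(tokens))
```

For a 2-gram the average F reduces to p(x).p(y), so the same function covers both the plain SCP and its FDPN generalization.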
LocalMaxs Algorithm

W is a RE if, for $\forall x \in \Omega_{n-1}(W)$, $\forall y \in \Omega_{n+1}(W)$:

[length(W) = 2 and g(W) > y] or [length(W) > 2 and g(W) ≥ x and g(W) > y]

$\Omega_{n-1}(W)$ is the set of the g(.) values of all the (n-1)-grams contained in the n-gram W.
$\Omega_{n+1}(W)$ is the set of the g(.) values of all the (n+1)-grams containing the n-gram W.

LocalMaxs Algorithm (improved)

W is a RE if, for $\forall x \in \Omega_{n-1}(W)$, $\forall y \in \Omega_{n+1}(W)$:

[length(W) = 2 and g(W) > y] or [length(W) > 2 and g(W) > (x+y)/2]

with $\Omega_{n-1}(W)$ and $\Omega_{n+1}(W)$ defined as the sets of g(.) values of all the (n-1)-grams contained in W and of all the (n+1)-grams containing W, respectively.
LocalMaxs Algorithm: the cohesion values of the n-grams and the election of REs
[Figure: cohesion values g(.) = SCP_f(.) for the n-grams "in energy saving", "energy saving", "energy saving in", "energy saving in the", "energy saving in the public", "energy saving in the public sector" and "energy saving in the public sector has"]
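A minimal Python sketch of the LocalMaxs election rule, in both its original and improved forms, is given here for illustration. The function name `local_maxs` is hypothetical. Two simplifications are assumptions of this sketch, not part of the slides: an (n-1)-gram missing from the dictionary counts as cohesion 0, and an n-gram with no recorded (n+1)-gram is skipped because its Ω(n+1) set cannot be evaluated. The sample values are SCP_f figures quoted on the slides for "Universidade Nova de Lisboa".

```python
def local_maxs(g, improved=False):
    """Elect REs from a dict {n-gram tuple: cohesion value g(W)}."""
    # Index the cohesion of every (n+1)-gram under the two n-grams it contains.
    supers = {}
    for w, value in g.items():
        if len(w) >= 3:
            for sub in (w[:-1], w[1:]):
                supers.setdefault(sub, []).append(value)

    elected = set()
    for w, gw in g.items():
        ys = supers.get(w)
        if len(w) < 2 or not ys:
            continue  # assumption: cannot decide without any (n+1)-gram
        if len(w) == 2:
            ok = all(gw > y for y in ys)
        else:
            # Assumption: a missing (n-1)-gram is treated as cohesion 0.
            xs = [g.get(w[:-1], 0.0), g.get(w[1:], 0.0)]
            if improved:
                ok = all(gw > (x + y) / 2 for x in xs for y in ys)
            else:
                ok = all(gw >= x for x in xs) and all(gw > y for y in ys)
        if ok:
            elected.add(w)
    return elected

values = {
    "Universidade Nova": 0.0009276,
    "Universidade Nova de": 0.0001322,
    "da Universidade Nova": 0.0004058,
    "na Universidade Nova": 0.00005399,
    "Nova de Lisboa": 0.0002555,
    "Universidade Nova de Lisboa": 0.0053873,
    "Universidade Nova de Lisboa (": 0.0001187,
    "Universidade Nova de Lisboa ,": 0.00006521,
    "Universidade Nova de Lisboa .": 0.00002609,
    "na Universidade Nova de Lisboa": 0.0001675,
    "da Universidade Nova de Lisboa": 0.0005022,
}
g = {tuple(k.split()): v for k, v in values.items()}
res_original = local_maxs(g)
res_improved = local_maxs(g, improved=True)
```

On this small excerpt both variants elect "Universidade Nova" and "Universidade Nova de Lisboa", whose cohesion is a local maximum with respect to their shorter and longer neighbours.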
SCP_f(.)     n-gram
0.0009276    Universidade Nova
0.0001322    Universidade Nova de
0.0004058    da Universidade Nova
0.00005399   na Universidade Nova
0.0002555    Nova de Lisboa
0.0053873    Universidade Nova de Lisboa
0.0001187    Universidade Nova de Lisboa (
0.00006521   Universidade Nova de Lisboa ,
0.00002609   Universidade Nova de Lisboa .
0.0001675    na Universidade Nova de Lisboa
0.0005022    da Universidade Nova de Lisboa
0.02768      Faculdade de Economia da Universidade
0.0001675    de Economia da Universidade Nova
0.004839     reitor da Universidade Nova de Lisboa
0.03134      Faculdade de Economia da Universidade Nova
0.00004907   , reitor da Universidade Nova de Lisboa
0.0001744    o reitor da Universidade Nova de Lisboa
0.00004893   reitor da Universidade Nova de Lisboa ,
0.00007832   reitor da Universidade Nova de Lisboa .
0.0001992    Faculdade de Economia da Universidade Nova ,
0.0007259    da Faculdade de Economia da Universidade Nova
Universidade Autodidacta; Universidade Nova; Universidade Tecnica; Universidade Técnica; Universidades Portuguesas; Associacao de Estudantes da Universidade do Algarve; cento dos estudantes da Universidade de Coimbra; reitor da Universidade Nova de Lisboa; Faculdade de Economia da Universidade Nova; académica da Universidade da Beira Interior; criação de uma Universidade de Bragança; dirigente da associação académica da Universidade; reitor da Universidade de Aveiro; Associacao de Estudantes da Universidade; Associação de Estudantes da Universidade; Estudantes da Universidade do Algarve; Hospitais da Universidade de Coimbra; Reitoria da Universidade de Lisboa; cento dos estudantes da Universidade; uma Universidade de Bragança; Economia da Universidade Nova

Universidade Clássica de Lisboa; Universidade Nova de Lisboa; Universidade da Beira Interior; associação académica da Universidade; criação de uma Universidade; Estudantes da Universidade; Hospitais da Universidade; Reitores de Universidades; Universidade Católica Portuguesa; Universidade de Aveiro; Universidade de Coimbra; Universidade de Edimburgo; Universidade de Evora; Universidade do Algarve; reitor da Universidade
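Reducing the Number of Base Features
25,838 REs (Base Features) extracted from the multilingual corpus: 872,795 words in 324 documents

$$RE = \begin{bmatrix} RE_{1,1} & RE_{1,2} & \cdots & RE_{1,p} \\ RE_{2,1} & RE_{2,2} & \cdots & RE_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ RE_{p,1} & RE_{p,2} & \cdots & RE_{p,p} \end{bmatrix}, \qquad RE_{i,k} = Cov(RE_i, RE_k)$$

Using Principal Components Analysis to reduce the 25,838 Base Features? No!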
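Matrix of Document Similarity
A smaller (324 × 324) covariance matrix, with n documents and p REs:

$$S = \begin{bmatrix} S_{1,1} & S_{1,2} & \cdots & S_{1,n} \\ S_{2,1} & S_{2,2} & \cdots & S_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ S_{n,1} & S_{n,2} & \cdots & S_{n,n} \end{bmatrix}$$

$$S_{j,l} = \frac{1}{p-1} \sum_{i=1}^{p} (z^*_{i,j} - z^*_{\cdot,j})(z^*_{i,l} - z^*_{\cdot,l})$$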
$$z^*_{i,j} = \frac{x^*_{i,j} - x^*_{\cdot,j}}{\sqrt{V(D^*_j)}}, \qquad i = 1, \ldots, p; \quad j = 1, \ldots, n$$

$$x^*_{i,j} = x_{i,j} \cdot V(RE_i) \cdot AL(RE_i)$$

$$V(D^*_j) = \frac{1}{p-1} \sum_{i=1}^{p} (x^*_{i,j} - x^*_{\cdot,j})^2$$

$z^*_{i,j}$ is a Transformed Occurrence.
$x^*_{i,j}$ is a Weighted Occurrence.
$x_{i,j}$ is the number of occurrences of the ith RE in the jth document.
$AL(RE_i)$ is the Average Length of words in the ith RE.
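The "Variance" of a RE. Normalizing the document size

$$V(RE_i) = \frac{1}{n-1} \sum_{j=1}^{n} (z_{i,j} - z_{i,\cdot})^2$$

$$z_{i,j} = \frac{x_{i,j} - x_{\cdot,j}}{\sqrt{V(D_j)}}, \qquad i = 1, \ldots, p; \quad j = 1, \ldots, n$$

$$V(D_j) = \frac{1}{p-1} \sum_{i=1}^{p} (x_{i,j} - x_{\cdot,j})^2$$

$$x_{\cdot,j} = \frac{1}{p} \sum_{i=1}^{p} x_{i,j}$$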
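The chain of definitions above (standardized occurrences, RE "variance", weighted occurrences, transformed occurrences, similarity matrix S) can be sketched as follows. This is an illustration only: the function names are hypothetical, the toy counts and average word lengths are made up, and the division by the square root of the document variance is an assumption about the formula on the slides.

```python
from math import sqrt

def standardize_columns(m):
    """Center each column (document) and divide by the root of its variance."""
    p, n = len(m), len(m[0])
    out = [[0.0] * n for _ in range(p)]
    for j in range(n):
        col = [m[i][j] for i in range(p)]
        mean = sum(col) / p
        var = sum((v - mean) ** 2 for v in col) / (p - 1)
        for i in range(p):
            out[i][j] = (col[i] - mean) / sqrt(var)
    return out

def re_variances(x):
    """V(RE_i): variance across documents of the standardized occurrences."""
    z = standardize_columns(x)
    n = len(z[0])
    result = []
    for row in z:
        mean = sum(row) / n
        result.append(sum((v - mean) ** 2 for v in row) / (n - 1))
    return result

def similarity_matrix(x, avg_len):
    """S (n x n) from occurrence counts x (p REs x n documents)."""
    p, n = len(x), len(x[0])
    v_re = re_variances(x)
    # Weighted occurrence: x*_ij = x_ij . V(RE_i) . AL(RE_i)
    x_star = [[x[i][j] * v_re[i] * avg_len[i] for j in range(n)] for i in range(p)]
    z_star = standardize_columns(x_star)
    means = [sum(z_star[i][j] for i in range(p)) / p for j in range(n)]
    return [[sum((z_star[i][j] - means[j]) * (z_star[i][l] - means[l])
                 for i in range(p)) / (p - 1)
             for l in range(n)] for j in range(n)]

# Toy data (made up): 4 REs x 3 documents, and average word length per RE.
x = [[3, 0, 1], [0, 2, 0], [1, 1, 4], [0, 0, 2]]
avg_len = [5.0, 6.5, 4.0, 7.0]
S = similarity_matrix(x, avg_len)
```

Because every document column of z* is standardized, the diagonal of S equals 1 and the off-diagonal entries behave like correlations between documents. S has only n × n entries, however many REs are used.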
S is a covariance matrix:

$$S = P \Lambda P^T = Q Q^T, \qquad \text{with } Q = P \Lambda^{1/2}$$

The first k columns of Q concentrate PTV(k) × 100% of the total information contained in the original Base Features:

$$PTV(k) = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{n} \lambda_j}$$

where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of S.

Q is a matrix of documents characterized by uncorrelated features (components) [Escoufier and L'Hermier 78]
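As a small illustration of the PTV(k) ratio, the sketch below computes it from a list of eigenvalues and finds the smallest k that reaches a target share. The eigenvalues are made up, the function names are hypothetical, and choosing k from a target share is an assumption of this sketch; the slides only report PTV for the values of k that were used.

```python
def ptv(eigenvalues, k):
    """Share of total variance carried by the k largest eigenvalues."""
    lam = sorted(eigenvalues, reverse=True)
    return sum(lam[:k]) / sum(lam)

def smallest_k(eigenvalues, target):
    """Smallest number of components whose PTV reaches the target share."""
    for k in range(1, len(eigenvalues) + 1):
        if ptv(eigenvalues, k) >= target:
            return k
    return len(eigenvalues)

eigenvalues = [5.0, 3.0, 1.5, 0.5]   # hypothetical eigenvalues of S
share_2 = ptv(eigenvalues, 2)
k_90 = smallest_k(eigenvalues, 0.90)
```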
Model-Based Cluster Analysis [Fraley and Raftery 98]
Parameterizations of the covariance matrix in the Gaussian model and their geometric interpretation

Σc               Distrib.  Vol.   Shape  Orient.
λ I              Spher.    Equal  Equal
λc I             Spher.    Vari.  Equal
λ D A D^T        Ellips.   Equal  Equal  Equal
λc Dc Ac Dc^T    Ellips.   Vari.  Vari.  Vari.
λ Dc A Dc^T      Ellips.   Equal  Equal  Vari.
λc Dc A Dc^T     Ellips.   Vari.  Equal  Vari.

How many clusters? Bayesian Information Criterion. Evidence of Clustering
Cluster Topics
Topics correspond to the most important REs

$$Score(RE_i) = V(RE_i) \cdot AL(RE_i) \cdot Thr(RE_i)$$

$$Thr(RE_i) = \begin{cases} 1 & \text{if } SCP\_f(RE_i) \geq threshold \\ 0 & \text{otherwise} \end{cases}$$

The 15 most important REs of the cluster that occur in at least 50% of its documents and have a score(.) > max(score(.))/50 are considered as topics.
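The topic selection rule can be sketched as below. All numbers are made up for illustration, the function name and dictionary keys are hypothetical, and two readings are assumptions of this sketch: the threshold test is taken as "greater than or equal", and max(score(.)) is taken over all REs of the cluster.

```python
def cluster_topics(res, scp_threshold, max_topics=15, min_doc_share=0.5, ratio=50):
    """Pick the topic REs of one cluster.

    res: list of dicts with keys 'text', 'variance' (V), 'avg_len' (AL),
         'scp_f' and 'doc_share' (fraction of cluster documents with the RE).
    """
    scored = []
    for r in res:
        thr = 1 if r["scp_f"] >= scp_threshold else 0
        scored.append((r["variance"] * r["avg_len"] * thr, r))
    if not scored:
        return []
    best = max(score for score, _ in scored)   # assumption: max over all REs
    keep = [(score, r["text"]) for score, r in scored
            if r["doc_share"] >= min_doc_share and score > best / ratio]
    keep.sort(key=lambda item: item[0], reverse=True)
    return [text for _, text in keep[:max_topics]]

# Hypothetical REs with invented statistics.
res = [
    {"text": "re A", "variance": 2.0, "avg_len": 5.5, "scp_f": 0.3, "doc_share": 0.9},
    {"text": "re B", "variance": 1.0, "avg_len": 8.5, "scp_f": 0.2, "doc_share": 0.6},
    {"text": "re C", "variance": 3.0, "avg_len": 2.5, "scp_f": 0.001, "doc_share": 1.0},
    {"text": "re D", "variance": 0.01, "avg_len": 7.5, "scp_f": 0.4, "doc_share": 0.8},
    {"text": "re E", "variance": 4.0, "avg_len": 4.0, "scp_f": 0.5, "doc_share": 0.1},
]
topics = cluster_topics(res, scp_threshold=0.05)
```

Here "re C" is removed by the cohesion threshold, "re D" by the max/50 rule and "re E" by the 50% document coverage rule.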
Results
First level of clusters: 3 components. PTV(3)=.848; PTV(5)=.932 and PTV(8)=.955
Second level (sub-clusters): 11 components. PTV(11)=.82
Clust.  Main Topic               Corr.#  Tot.#  Act.Corr.#  Prc.%  Rec.%
1       European communities     108     108    108         100    100
2       Comunidade Europeia      108     107    107         100    99.1
3       Communauté européenne    108     109    108         99.1   100
1.1     Rational use of energy   23      23     20          86.9   86.9
1.2     Agricultural products    27      27     21          77.8   77.8
1.3     Combined nomenclature    58      58     51          87.9   87.9
2.1     Economia de energia      23      26     21          80.8   91.3
2.2     Produtos agrícolas       27      25     21          84     77.8
2.3     Nomenclatura Combinada   58      56     52          92.9   89.7
3.1     Politique énergétique    23      26     22          84.6   95.7
3.2     Produits agricoles       27      27     21          77.8   77.8
3.3     Nomenclature combinée    58      56     53          94.6   91.4
Cluster 1: European Communities, Member States, EUROPEAN COMMUNITIES, Council Regulation, Having regard to Council Regulation and Official Journal
Cluster 2: Comunidade Europeia (European Community), Nomenclatura Combinada (Combined Nomenclature), COMUNIDADES EUROPEIAS and directamente aplicável (directly applicable)
Cluster 3: Communauté européenne, nomenclature combinée, États membres, COMMUNAUTÉS EUROPÉENNES and directement applicable
Cluster 1.1: Rational use of energy, energy consumption and rational use
Cluster 1.2: Agricultural products, Official Journal, detailed rules, Official Journal of the European Communities, proposal from the Commission, publication in the Official Journal and entirely and directly
Cluster 1.3: Combined nomenclature, Common Customs, customs authorities, No 2658/87, goods described, general rules, appropriate CN, Having regard to Council Regulation, tariff and statistical and Customs Code
Cluster 2.1: economia de energia (energy saving), utilização racional (rational use), racional da energia (rational of energy) and consumo de energia (energy consumption)
Cluster 2.2: produtos agrícolas, Comunidades Europeias, Jornal Oficial (Official Journal), directamente aplicável (directly applicable), COMUNIDADES EUROPEIAS, Jornal Oficial das Comunidades (Official Journal of the Communities), directamente aplicável em todos os Estados-membros (directly applicable to all Member States), publicação no Jornal Oficial (publication in the Official Journal), publicação no Jornal Oficial das Comunidades and Parlamento Europeu (European Parliament)
Cluster 2.3: Nomenclatura Combinada, autoridades aduaneiras (customs authorities), indicados na coluna (indicated in the column), mercadorias descritas (goods described), informações pautais vinculativas (binding tariff information), Aduaneira Comum (Common Customs), regras gerais (general rules), códigos NC (NC codes) and COMUNIDADES EUROPEIAS
Cluster 3.1: politique énergétique (energy policy), rationnelle de énergie and l'utilisation rationnelle
Cluster 3.2: produits agricoles, organisation commune (common organization), organisation commune des marchés (common organization of the markets), directement applicable, Journal officiel, Journal officiel des Communautés and COMMUNAUTÉS EUROPÉENNES
Cluster 3.3: Nomenclature combinée, autorités douanières (customs authorities), nomenclature tarifaire, No 2658/87, marchandises décrites (goods described), tarifaire et statistique (tariff and statistical) and COMMUNAUTÉS EUROPÉENNES
Assessing Normality of Data
[Figure: ordered observations plotted against standard normal quantiles]
Assessing Normality of Data
[Figure: ordered distances plotted against chi-square percentiles]
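Data Transformations to Approximate Normality
For each column of Q, each observation can be modified from $y$ to $y^{(\lambda)}$ [Box and Cox, 64]:

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & \text{for } \lambda \neq 0 \\ \ln y & \text{for } \lambda = 0 \end{cases}$$

$$l(\lambda) = -\frac{m}{2} \ln\left[\frac{1}{m} \sum_{j=1}^{m} \left(y_j^{(\lambda)} - \overline{y^{(\lambda)}}\right)^2\right] + (\lambda - 1) \sum_{j=1}^{m} \ln y_j$$

$$\overline{y^{(\lambda)}} = \frac{1}{m} \sum_{j=1}^{m} y_j^{(\lambda)}$$

$\lambda$ is chosen such that $l(\lambda)$ is maximized; m is the number of elements of the cluster.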
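A minimal sketch of the Box-Cox step: transform the observations, evaluate l(λ), and pick the λ that maximizes it over a grid. The function names, the sample data and the grid are hypothetical. The sketch assumes strictly positive observations, since the logarithm requires them; the slides do not say how non-positive component values are handled.

```python
from math import log

def box_cox(y, lam):
    """Box-Cox transform of a single positive observation."""
    return log(y) if lam == 0 else (y ** lam - 1) / lam

def log_likelihood(ys, lam):
    """l(lambda) for a list of positive observations."""
    m = len(ys)
    t = [box_cox(y, lam) for y in ys]
    mean = sum(t) / m
    spread = sum((v - mean) ** 2 for v in t) / m
    return -m / 2 * log(spread) + (lam - 1) * sum(log(y) for y in ys)

def best_lambda(ys, grid):
    """Grid search: the lambda in grid that maximizes l(lambda)."""
    return max(grid, key=lambda lam: log_likelihood(ys, lam))

ys = [0.4, 0.9, 1.3, 2.1, 3.8, 7.5, 15.2]       # made-up positive data
grid = [k / 10 for k in range(-20, 21)]          # -2.0, -1.9, ..., 2.0
lam_hat = best_lambda(ys, grid)
transformed = [box_cox(y, lam_hat) for y in ys]
```

A grid search is used here only to keep the sketch short; any one-dimensional optimizer over λ would serve the same purpose.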
Document Classification
• Representing the new document with k components

We need:

$$x_j^T = [v_{1,j}, v_{2,j}, \ldots, v_{k,j}]$$

where $v_{i,j}$ is the value of document j for component i.

But we have:

$$y_j^T = [z^*_{1,j}, z^*_{2,j}, \ldots, z^*_{p,j}]$$

where $z^*_{i,l}$ is the transformed occurrence of document l for Relevant Expression (RE) i, and p is the number of REs.

Translate $y_j^T$ to $x_j^T$. Let

$$S = P \Lambda P^T$$

with S the document similarity matrix, P the eigenvector matrix and $\Lambda$ the diagonal matrix of eigenvalues, and let

$$s_j^T = \frac{1}{p-1} \, y_j^T Z$$

be a vector of similarities, where Z is presumably the p × n matrix of the $z^*_{i,l}$ values. So $s_j^T$ is a vector with n (the number of documents) elements. Then

$$v_j^T = s_j^T P \Lambda^{-1/2}, \qquad v_j^T = [v_{1,j}, v_{2,j}, \ldots, v_{n,j}]$$

The exponent $-1/2$ follows from $S = Q Q^T$ with $Q = P \Lambda^{1/2}$: a row $s^T$ of S satisfies $s^T = q^T \Lambda^{1/2} P^T$, so $q^T = s^T P \Lambda^{-1/2}$.

The vector $x_j^T = [v_{1,j}, v_{2,j}, \ldots, v_{k,j}]$ has the first k elements (components) of the vector $v_j^T$.
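Quadratic Discrimination Score

$$d_i^Q(x) = -\frac{1}{2} \ln|\Sigma_i| - \frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \ln(p_i)$$

• Document class $\Pi_i$ is represented by cluster i, and $\Sigma_i$ is estimated by SC (the covariance matrix between components):

$$SC = \begin{bmatrix} SC_{1,1} & SC_{1,2} & \cdots & SC_{1,k} \\ SC_{2,1} & SC_{2,2} & \cdots & SC_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ SC_{k,1} & SC_{k,2} & \cdots & SC_{k,k} \end{bmatrix}$$

$$SC_{l,m} = \frac{1}{n-1} \sum_{i=1}^{n} (c_{l,i} - c_{l,\cdot})(c_{m,i} - c_{m,\cdot})$$

$$c_{l,\cdot} = \frac{1}{n} \sum_{i=1}^{n} c_{l,i}$$

$\mu_i$ is estimated by the vector of means $\bar{c}^T = [c_{1,\cdot}, c_{2,\cdot}, \ldots, c_{k,\cdot}]$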
A Criterion for Classifying Documents
Let x be the vector of components for a document to classify and $\Pi_r$ a class represented by cluster r.
x belongs to class $\Pi_r$ if

$$d_r^Q(x) = \max_i \left(d_i^Q(x)\right) \quad \text{and} \quad d_r^Q(x) \geq \min_j \left(d_r^Q(c_j)\right)$$

$d_r^Q(x)$ is the quadratic score for x; i = 1, 2, …, g; j = 1, 2, …, n
g is the number of classes; n is the number of documents of the cluster r; $c_j$ is a document of cluster r
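The quadratic score and the accept/reject criterion can be sketched in plain Python as follows. Function names and the toy clusters are hypothetical. The sketch assumes the rejection rule reads "the winning score must be at least the lowest score obtained by the cluster's own documents", and it estimates the mean vector and covariance matrix from each cluster's documents as described on the slides.

```python
from math import log

def mean_and_cov(docs):
    """Mean vector and covariance matrix SC of a cluster's documents."""
    n, k = len(docs), len(docs[0])
    mean = [sum(d[l] for d in docs) / n for l in range(k)]
    cov = [[sum((d[l] - mean[l]) * (d[m] - mean[m]) for d in docs) / (n - 1)
            for m in range(k)] for l in range(k)]
    return mean, cov

def det_and_inverse(a):
    """Gauss-Jordan elimination with partial pivoting: (determinant, inverse)."""
    k = len(a)
    m = [list(row) + [float(i == j) for j in range(k)] for i, row in enumerate(a)]
    det = 1.0
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(m[r][c]))
        if m[pivot][c] == 0:
            raise ValueError("singular covariance matrix")
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det
        det *= m[c][c]
        pv = m[c][c]
        m[c] = [v / pv for v in m[c]]
        for r in range(k):
            if r != c:
                factor = m[r][c]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[c])]
    return det, [row[k:] for row in m]

def quadratic_score(x, mean, cov, prior):
    """d^Q(x) = -1/2 ln|cov| - 1/2 (x-mean)^T cov^-1 (x-mean) + ln(prior)."""
    det, inv = det_and_inverse(cov)
    d = [xi - mi for xi, mi in zip(x, mean)]
    maha = sum(d[a] * inv[a][b] * d[b] for a in range(len(d)) for b in range(len(d)))
    return -0.5 * log(det) - 0.5 * maha + log(prior)

def classify(x, clusters, priors):
    """Index of the winning cluster, or None when x is rejected."""
    params = [mean_and_cov(c) for c in clusters]
    scores = [quadratic_score(x, mu, sc, p) for (mu, sc), p in zip(params, priors)]
    r = max(range(len(scores)), key=scores.__getitem__)
    mu, sc = params[r]
    floor = min(quadratic_score(c, mu, sc, priors[r]) for c in clusters[r])
    return r if scores[r] >= floor else None

# Two made-up clusters in a 2-component space.
clusters = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)],
]
priors = [0.5, 0.5]
near_first = classify((0.5, 0.5), clusters, priors)
near_second = classify((10.4, 10.6), clusters, priors)
far_away = classify((5.0, 5.0), clusters, priors)
```

A document lying between the two clusters still has a "best" class, but its score falls below that of every training document of that class, so it is rejected rather than forced into a cluster.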
Results
• Average Precision: 93%
• Average Recall: 93%
• Average Precision on rejection: 91.5%
• Average Recall on rejection: 100%
Conclusions
• Unsupervised, statistics-based and language-independent approach
• No pre-defined topics, features or document descriptors
• Number of clusters not left to user choice
• About 80% of the clusters' REs can be taken as Topics
Other Applications
• NLP: Lexicon enrichment; Parsing precision; Attachment decision; Rewriting Grammar Rules; Sequences of Strongly connected Characters
• Information Retrieval
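$$\Phi^2((x,y)) = \frac{(N \cdot f(x,y) - f(x) \cdot f(y))^2}{f(x) \cdot f(y) \cdot (N - f(x)) \cdot (N - f(y))}$$

Applying the Fair Dispersion to $\Phi^2(.)$:

$$\Phi^2\_f((w_1 \ldots w_n)) = \frac{(N \cdot f(w_1 \ldots w_n) - Avp)^2}{Avp \cdot (N - Avx) \cdot (N - Avy)}$$

$$Avp = \frac{1}{n-1} \sum_{i=1}^{n-1} p(w_1 \ldots w_i) \cdot p(w_{i+1} \ldots w_n)$$

$$Avx = \frac{1}{n-1} \sum_{i=1}^{n-1} f(w_1 \ldots w_i)$$

$$Avy = \frac{1}{n-1} \sum_{i=2}^{n} f(w_i \ldots w_n)$$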
Results for Contiguous MWUs
The scores for the several statistics-based measures

Statistics-based measure g(.)   Precision (average)   Extracted MWUs (count)
SCP_f(.)                        81.00%                24476
SI_f(.)                         75.00%                20906
Φ²_f(.)                         76.00%                24711
Dice_f(.)                       58.00%                32381
LogLike_f(.)                    53.00%                40602

LUSA Corpus containing 919,253 words.
Evaluation criterion: compound nouns, locutions, frozen forms and Relevant Expressions are counted as correct MWUs.
LocalMaxs and SCP_f scores for Different Languages

Language             Precision   Extracted MWUs (count)   Corpus size
English              77.00%      8017                     493191
French               76.00%      8980                     512031
German               75.00%      5190                     454750
Medieval Portuguese  73.00%      5451                     377724

Multilingual corpora (European Parliament debates). No morpho-syntactic filters used.
Evaluation criterion: MWUs / Relevant Expressions are correct extractions.
The Scores for the Contiguous Compound Verb Extractions

Form     Precision   Extracted compound verbs
2-gram   81.00%      108
3-gram   73.00%      492

Evaluation criterion: from a tagged corpus [Marques & Lopes] (1,194,206 words), with verb forms changed to infinitive forms.
Ex:
- arredar pé (to leave)
- estar para chegar (to be about to arrive)
- ter pela frente (to face)
MWUs Samples
Universidade Autodidacta; Universidade Nova; Universidade Tecnica; Universidade Técnica; Universidades Portuguesas; Associacao de Estudantes da Universidade do Algarve *; cento dos estudantes da Universidade de Coimbra; reitor da Universidade Nova de Lisboa; Faculdade de Economia da Universidade Nova

English
sub-Thatcherite theology; sine qua non; deformed by the removal of a tumour; Vocational Training; Reform of the common agricultural; Council of Agriculture Ministers; Common agricultural policy; Spread of Organized Crime; Sanz Fernández; SIR JACK STEWART-CLARK; Royal Society; Richard Attenborough; Red Cross; LUCAS PIRES; Henry the Navigator

French
Contrôle de la croissance démographique; Infrastructures nécessaires; Résolutions adoptées; Président du tribunal; Mise en marché commune; Protection de la petite enfance; Miskito Tawahka Pech; Protection du touriste; Drame algérien; Commission a données aux amendements adoptés; Sécurité de nos approvisionnements énergétiques; Directive relative à la sécurité

German
- Annahme einer Richtlinie über die Werbung (adoption of a directive on advertising)
- Rechte beim Gerichtshof der Europäischen Gemeinschaft geltend (rights asserted before the Court of Justice of the European Community)
- Algerischen Volkes (Algerian people)
- Gefahr für die Volksgesundheit (danger to public health)
- Zusammensetzung der Ausschüsse und Delegationen (composition of the committees and delegations)
- Schaffung des EWR (creation of the EEA)
- Währung und Industriepolitik über den Vorschlag (currency and industrial policy on the proposal)
- Gemischten Parlamentarischen (Joint Parliamentary)
Using Tags as "Words" in LocalMaxs to obtain Preference Selection for relative clause or other clause attachments.

Tag sequence                            Example                                         Freq.
_PR _ADV _ADV _V                        que mais tipicamente corresponde                2
_PR _ART _N _V                          que os mesmos derem                             6
_PR _N _PPOA                            cujo reexame se                                 4
_PR _N _V _ART                          cuja realização impliquem a                     2
_PR _PPOA _V _VIRG _PREP _N             que se desdobra , com normalidade               2
_PR _V                                  quem vier                                       92
_PR _V _ART _PAR _N _PAR                que precedeu a " deliberação "                  3
_PR _V _CONTR _N _ADJ _PREP _PIND       que resultaram da análise sistemática de todo   2
_PREP _ADJ _N _PTO _V                   de particular importância . Vejamos             2
_PREP _ART _N _ADJ _ADJ _PREP           de os serviços públicos municipais de           4
_PREP _ART _N _CONJCOORD _N _ADJ        por o município ou municípios concedentes       5

Using Characters as "Words" in LocalMaxs to obtain Strongly connected Sequences of Characters (within words and between words)
Ex: trateg, trutur, xtracç, éctric, tratam, struíd, tradiç, trangeir, ntratuai, ncentraç, xtraí, xtrem, tritiva, trocas*, tráfico, trânsit, utras*in (utrazin)
• Information Retrieval
REs point to relevant information: Topics and Subtopics
Ex. about "Human Rights": extracting its related Subject Matters (Topics and Subtopics):
• European Convention on Human Rights
• European Court of Human Rights
• Universal Declaration of Human Rights
• European Commission of Human Rights
• Etc.
Selection by Topic / Subtopic