
Nuno Alberto Paulino da Fonseca

Parallelism in Inductive Logic Programming Systems

Faculdade de Ciências da Universidade do Porto
May 2006


Nuno Alberto Paulino da Fonseca

Parallelism in Inductive Logic Programming Systems

Thesis submitted to the Faculdade de Ciências da Universidade do Porto for the degree of Doctor in Computer Science

Advisors: Prof. Fernando Silva and Prof. Rui Camacho

Departamento de Ciência de Computadores
Faculdade de Ciências da Universidade do Porto

May 2006

Last update: October 11, 2006


To my parents and my wife.


Acknowledgments

First and foremost, I wish to thank my supervisors, Fernando Silva and Rui Camacho. I learned a lot from them. I especially thank them for their guidance, support, and patience over the last years.

A special word of appreciation to Vitor S. Costa for the support provided through the years... and for maintaining the YAP Prolog!

A great thanks to my colleagues at LIACC, in particular to Ricardo Rocha, Michel Ferreira, and Tiago Soares, who contributed to the part of the work described in this dissertation related to the exploitation of Tabling and RDBMSs by an ILP system.

A "dank u wel" to Hendrik Blockeel and the DTAI group at the Katholieke Universiteit Leuven for receiving me for three months. I am also thankful to David Page, of the Department of Biostatistics and Medical Informatics at the University of Wisconsin, for receiving me on two occasions.

I also thank the Fundação para a Ciência e Tecnologia (FCT) for the scholarship that made this dissertation possible. This work was supported by FCT grant SFRH/BD/7045/2001 and through the Programa Operacional Ciência, Tecnologia, Inovação (POCTI) and the Programa Operacional Sociedade da Informação (POSI) of the Quadro Comunitário de Apoio III (2000-2006), with European Community (FSE) and national funds.

Last but not least, a special word of appreciation to my family and friends for all the support.

Nuno Fonseca


Abstract

Inductive Logic Programming (ILP) is a Machine Learning approach with foundations in Logic Programming. ILP has been successfully applied in many application areas, such as engineering, bioinformatics, pharmacology (drug design), and protein structure prediction. Models discovered by ILP systems are typically represented as logic programs, a subset of first-order logic. The expressiveness of first-order logic grants flexibility and understandability to the induced models. However, ILP systems suffer from significant limitations that reduce their applicability. First, most ILP systems execute in main memory, limiting their ability to process large databases. Second, ILP systems are computationally expensive: on complex applications, they can take several hours, if not days, to return a model. Third, this lack of efficiency makes human-computer interaction impractical in complex applications. Therefore, a major obstacle that ILP systems must overcome is efficiency. A promising research direction to improve the efficiency of ILP systems is to exploit parallelism by making use of the growing number of existing computer clusters and grids.

In this dissertation we research techniques to exploit parallelism in ILP systems. We start by describing and evaluating a new ILP system called April, capable of transparently exploiting parallelism on distributed or shared memory computers, and show that April's sequential execution performance is comparable to that of a state-of-the-art ILP system. We study and evaluate the performance of several parallel algorithms implemented in April using well-known ILP applications. Furthermore, we study whether the parallel approaches produce better results than sequential randomized algorithms. We propose a novel parallel algorithm for ILP that combines data parallelism and pipelining. The results of a performance evaluation show that the algorithm outperforms other parallel algorithms and that it is able to achieve super-linear speedups on a distributed memory computer without affecting the quality of the models found.


Resumo

Inductive Logic Programming (ILP) is a machine learning method that uses Logic Programming techniques. ILP has been successfully applied to numerous areas, such as engineering, bioinformatics, drug design, and protein structure prediction. The models discovered by ILP systems are usually represented as logic programs, a subset of first-order logic. The expressiveness of first-order logic gives flexibility and understandability to the induced models. However, ILP systems have intrinsic limitations that reduce their applicability. Most ILP systems are unable to process large amounts of data, a consequence of executing with the data loaded into main memory. ILP systems are slow, and may take hours or days to produce a model, which in turn means that interactivity is not a strong point of ILP systems. The parallelization of ILP systems is a path that can significantly improve the performance of these systems by taking advantage of the growing number of clusters and grids.

In this dissertation we study techniques to exploit parallelism in ILP systems. We start by describing and evaluating a new ILP system, called April, capable of transparently exploiting parallelism on distributed memory or shared memory parallel computers. We show that April's sequential performance is comparable to that of a reference ILP system. We evaluate the performance of several ILP algorithms implemented in April using several applications. We compare the parallel algorithms with stochastic algorithms in order to assess the quality of the models produced and the execution times. We propose a new parallel algorithm for ILP that exploits data parallelism and pipelining, and we show that its performance is superior to the other parallel ILP algorithms.


Résumé

Inductive Logic Programming (ILP) is an automatic learning approach that stems from logic programming. ILP has been successfully applied in many application areas, such as engineering, bioinformatics, pharmacology (drug design), and protein structure prediction. The models discovered by ILP systems are typically represented as logic programs, a subset of first-order logic. The expressiveness of first-order logic provides flexibility and improves the understandability of the induced models. However, ILP systems suffer from significant limitations that reduce their applicability. First, most ILP systems execute in main memory, thus limiting their capacity to process large databases. Second, ILP systems are computationally expensive: on complex applications, ILP systems can take several hours, if not days, to produce a model. Third, their lack of efficiency makes human-computer interaction impractical. Consequently, one of the important obstacles that ILP systems must overcome is their lack of efficiency. A promising research direction for improving the efficiency of ILP systems is to exploit parallelism, making use of the ever-increasing number of existing clusters and grids.

In this thesis, we study techniques to exploit parallelism in ILP systems. We begin by describing and evaluating a new ILP system called April, capable of transparently exploiting parallelism on distributed or shared memory computers, and we show that April's sequential execution is comparable to that of a state-of-the-art ILP system. We evaluate the execution of several parallel algorithms implemented with April using well-known ILP applications. Furthermore, we study whether the parallel approaches produce better results than sequential randomized algorithms. We propose a new parallel algorithm for ILP that combines two parallelization techniques: data parallelism and pipelining. We show that this algorithm outperforms other parallel algorithms and that it can achieve super-linear speedups on a distributed memory computer without affecting the quality of the models found.


Contents

Abstract
Contents
List of Tables
List of Figures

1 Introduction
1.1 Context
1.2 Inductive Logic Programming
1.3 Motivation and Contributions
1.4 Thesis Organization
1.5 Bibliographic Notes

2 On Inductive Logic Programming
2.1 Introduction
2.2 The ILP Problem
2.2.1 Learning from Entailment
2.2.2 Learning from Interpretations
2.3 The Learning Process
2.3.1 A Generic Rule Covering Algorithm
2.3.2 Structuring the Hypothesis Space
2.3.3 Hypotheses Generation
2.3.4 Search Strategy
2.3.5 Search Method
2.3.6 Bounding the Search
2.3.7 Bias
2.3.8 Hypotheses Evaluation Measures
2.3.9 Matching Hypotheses
2.3.10 MDIE
2.4 Randomized Algorithms
2.4.1 Stochastic Clause Selection
2.4.2 Randomized Rapid Restarts
2.5 Improving the Efficiency of ILP Systems
2.5.1 Algorithms/Program Optimizations
2.5.2 Reducing the Hypothesis Space
2.5.3 Faster Hypotheses Evaluation
2.5.4 Data Handling
2.5.5 Search Algorithms
2.6 Summary

3 The April ILP system
3.1 Introduction
3.2 System Description
3.2.1 Characteristics
3.2.2 Dimensions
3.2.3 Meta-Language
3.3 April's Algorithm
3.3.1 Saturation
3.3.2 Clause Generation
3.3.3 Clause Evaluation
3.3.4 Clause Selection
3.4 Coupling with Relational Databases
3.5 Implementation
3.6 Experimental Evaluation
3.6.1 Aims
3.6.2 Materials
3.6.3 Methodology
3.6.4 Performance Evaluation: April vs Aleph
3.6.5 Performance Analysis of April
3.7 Related Work
3.8 Summary

4 Parallelization Strategies for ILP
4.1 Introduction
4.2 Parallelism
4.3 Strategies
4.3.1 Parallel Exploration of Independent Hypotheses
4.3.2 Parallel Exploration of the Search Space
4.3.3 Data Parallelism
4.3.4 Parallel Coverage Tests
4.4 Parallel ILP Systems
4.5 Parallel Algorithms
4.5.1 Parallel Coverage Tests
4.5.2 Parallel Stochastic Clause Selection
4.5.3 Data Parallel Learn Rule
4.5.4 Parallel Randomized Rapid Restarts
4.5.5 Data Parallel ILP
4.6 Experimental Evaluation
4.6.1 Aims
4.6.2 Materials
4.6.3 Methodology
4.6.4 Base (Sequential) Results
4.6.5 Parallel Results
4.6.6 Parallel versus Randomized Algorithms
4.7 Summary

5 A Pipeline Data-parallel Algorithm for ILP
5.1 Introduction
5.2 Pipelining and Parallelism
5.3 A pipelined data-parallel rule learning algorithm
5.3.1 The algorithm
5.3.2 Considerations
5.3.3 Characteristics
5.3.4 Reducing Overfitting by Exploring Sample Variance
5.4 p2-mdie: A pipelined data-parallel algorithm for ILP
5.5 Experiments and Results
5.5.1 Materials
5.5.2 Method
5.5.3 Speedup Analysis
5.5.4 Scalability Analysis
5.6 Related Work
5.7 Summary

6 Conclusion and Further Research
6.1 Summary
6.2 Key Contributions
6.3 Future Work
6.4 Final Remarks

Appendix

A Logic Programming and Prolog
A.1 Logic Programming
A.2 Prolog

B Supplementary Tables and Graphics
B.1 Performance Evaluation: April vs Aleph
B.2 April's Performance Analysis
B.3 Strategies to parallelize ILP
B.4 p2 Results

References

Index


List of Tables

2.1 Family data representation in a learning from entailment setting.
2.2 Confusion matrix for a two-class (binary) classification problem.
2.3 Evaluation measures.
2.4 Approaches and methods for improving efficiency.
2.5 Algorithms or implementation optimizations.
2.6 Methods for reducing the hypothesis space.
2.7 Methods for improving the efficiency of the hypothesis evaluation.
2.8 Data handling methods.

3.1 Dimensions of ILP: April and other well-known ILP systems.
3.2 Background knowledge properties declarations.
3.3 Time (in seconds) to evaluate 200 examples on 688 randomly generated clauses from four artificially generated problems [BGSS03].
3.4 Applications characterization.
3.5 Settings used by April and Aleph for each application.

4.1 Summary of the parallel ILP implementations and reported results.
4.2 Applications characterization. AET is the average estimated time to evaluate a single example.
4.3 Settings used by April in the experiments.
4.4 Comparison of SCS and RRR with BFBB as far as predictive accuracy is concerned. Significant changes in accuracy, for a p-value of 0.02, are marked in bold.
4.5 Rankings of the three sequential algorithms. An entry of 1 under time (T) means that the corresponding algorithm achieved the lowest runtime for the given problem, while an entry of 1 under accuracy (A) means that the corresponding algorithm achieved the highest predictive accuracy for the given problem. The last row shows the median ranks of each algorithm.
4.6 Rankings of the five parallel algorithms. An entry of 1 under speedup (S) means that the corresponding algorithm achieved the overall best speedup, while an entry of 1 under accuracy (A) means that the corresponding algorithm had the lowest or no loss in accuracy for the given problem ("=" means that the accuracy of the parallel algorithm is equal to the accuracy of the sequential one). The last row shows the median ranks of each algorithm.

5.1 Applications characterization.
5.2 p2: Settings used by April in the experiments.
5.3 p2-mdie: Comparison of DPILP with p2-mdie (with a pipeline width of 10) as far as speedup is concerned.
5.4 p2: Scalability results. Execution time (in seconds) observed when fixing the number of examples used by each processor.

B.1 Average sequential execution time (in seconds) taken by April and Aleph to search for a single rule. Standard deviation is presented within brackets. The relative value of the difference is computed as the ratio of April's minus Aleph's execution time to Aleph's execution time. Statistically significant differences, for a p-value of 0.02, are marked in bold.
B.2 Average predictive accuracies for the rules found by April and Aleph. The relative value of the difference is computed as the ratio of the difference of April's minus Aleph's accuracy to Aleph's accuracy. Statistically significant differences, for a p-value of 0.02, are marked in bold.
B.3 Average memory usage (in kbytes) to perform a single search for a rule by April and Aleph.
B.4 Sequential execution time (in seconds) with the BFBB, RRR, and SCS algorithms.
B.5 Average accuracy (Training/Test) with the BFBB, RRR, and SCS algorithms.
B.6 Number of clauses generated (C) and number of examples evaluated (E) with the BFBB, RRR, and SCS algorithms.
B.7 Execution time (in seconds) and respective speedup (within brackets) for the PCT, DPILP, and DPLR algorithms.
B.8 Execution time (in seconds) and speedup (within brackets) for the parallel randomized search algorithms (PRRR and PSCS).
B.9 Average test accuracy and accuracy variation (within brackets) by algorithm. Statistically significant changes (using a t-test with a p-value of 0.02) are marked in bold.
B.10 p2: speedup (S), accuracy and accuracy variation, and theory size (|H|) for 2, 4, 6 and 8 processors. Statistically significant differences in accuracy, for a p-value of 0.02, are marked in bold.
B.11 The impact of the p2 algorithm on the (average) number of epochs for 2, 4, 6 and 8 processors.
B.12 Average number of messages (|Msgs|), average amount of communication exchanged in MB, and average size of the maximum message exchanged while executing p2 on the Carc application with a pipeline width of 1, 10, and unlimited (inf).
B.13 Average number of messages (|Msgs|), average amount of communication exchanged in MB, and average size of the maximum message exchanged while executing p2 on the Mesh application with a pipeline width of 1, 10, and unlimited (inf).
B.14 Average number of messages (|Msgs|), average amount of communication exchanged in MB, and average size of the maximum message exchanged while executing p2 on the Mut application with a pipeline width of 1, 10, and unlimited (inf).
B.15 Average number of messages (|Msgs|), average amount of communication exchanged in MB, and average size of the maximum message exchanged while executing p2 on the Pyr application with a pipeline width of 1, 10, and unlimited (inf).


List of Figures

1.1 Overview of the Knowledge Discovery in Databases process.

2.1 Induction versus deduction.
2.2 The ILP problem.
2.3 A generic covering algorithm. The learn_rule() procedure returns a set which includes the best rule found that explains a subset of the positive examples (E+).
2.4 An example of a generic learn_rule() procedure.
2.5 Part of a refinement graph for the grandfather relation. The rules are ordered from the most general (top) to the more specific (bottom).
2.6 An example of a generic learn_rule() procedure with branch-and-bound. The changed/added lines are underlined.
2.7 A high-level description of a learn_rule() procedure that performs stochastic clause selection.
2.8 A high-level description of a procedure to perform randomized rapid restarts search (RRR).

3.1 April's main algorithm.
3.2 April's main modules.
3.3 Comparison of April and Aleph in terms of average execution time, accuracy, and memory usage.
3.4 Average memory usage of April in coverage lists (cache), search space (clauses' data), and other components of the system (including the Prolog engine memory usage).
3.5 Average distribution of April's execution time over the four main components (saturation, clause generation, clause evaluation, and clause selection) when performing a top-down search.

4.1 Simplified schemes of the messages exchanged by the parallel algorithms. Solid lines represent the execution flow, horizontal dashed lines represent message passing between the processes, and vertical dashed lines represent idleness. The algorithms are ordered by the granularity of their parallel tasks, from the finest-grained to the most coarse-grained.
4.1 Continuation of Figure 4.1.
4.2 The data parallel learn rule algorithm (DPLR).
4.3 A high-level description of a procedure to perform parallel randomized rapid restarts. The differences to the sequential version of the algorithm (Figure 2.8) are underlined.
4.4 Sequential execution: time (a) and speedup (b) of the randomized algorithms in relation to the BFBB algorithm.
4.5 Sequential execution: relation between the number of epochs (a) and the number of clauses generated (b) and the speedup observed.
4.6 Sequential execution: clauses generated (lines) and examples evaluated (bars).
4.7 Speedups observed with each parallel algorithm for 2, 4, 6 and 8 processors.
4.8 Accuracy variation observed with each parallel algorithm for 2, 4, 6 and 8 processors.

5.1 A pipeline with 5 stages.
5.2 Parallel pipelined rule search with 3 workers.
5.3 Example of a pipelined rule search. A squared box represents a node in the search space that is considered good. The good nodes are used as the search starting points in the next worker.
5.4 The p2-covering algorithm.
5.5 A pipelined learn_rule'() procedure. The main difference when compared to learn_rule() concerns the S argument, which contains a set of rules that define the starting points of the search space. next_worker() computes the identifier of the next worker on the pipeline. genNewRules() generates a set of rules from a given rule.
5.6 p2-mdie: a pipelined data-parallel covering algorithm based on MDIE.
5.7 Worker view of the pipelined data-parallel covering algorithm based on MDIE.
5.8 p2-mdie: Average speedups observed with 2, 4, 6 and 8 processors and a pipeline width of 1, 10, and unlimited (inf).
5.9 p2-mdie: Average amount of communication exchanged (in MB) with 2, 4, 6 and 8 processors and a pipeline width of 1, 10, and unlimited (inf).
5.10 p2-mdie: Average number of epochs with 2, 4, 6 and 8 processors and a pipeline width of 1, 10, and unlimited (inf).
5.11 p2-mdie: Average accuracy variation with 2, 4, 6 and 8 processors and a pipeline width of 10 and unlimited (inf).

B.1 Average distribution of April's execution time over the four main components (saturation, clause generation, clause evaluation, and clause selection) when performing Stochastic Clause Selection. The Stochastic Clause Selection involved the random generation and evaluation of 688 clauses uniformly distributed by length.
B.2 Average distribution of April's execution time over the four main components (saturation, clause generation, clause evaluation, and clause selection) when performing Stochastic Clause Selection (outer ring) or a top-down search (inner ring). The Stochastic Clause Selection involved the random generation and evaluation of 688 clauses uniformly distributed by length.
B.3 Average number of epochs by algorithm.


The beginning is the most important part of the work.

Plato

1 Introduction

Inductive Logic Programming (ILP) [Mug90] is a Machine Learning approach which uses techniques of Logic Programming. It has been successfully applied in many application areas, such as engineering [DBJ97], bioinformatics [CK03, Kin04], pharmacology (drug design) [KMS92, FMPS98], and protein structure prediction [MKS92, TMS98], just to mention a few (for an extensive list of applications of ILP we refer to [BM95, Dž01, PC03, ILP02]). A reason for its success is the readability of the models found by ILP systems. However, ILP still has some shortcomings [Pag00, PS03], one of them being the long execution time required to find the models.

This dissertation focuses on techniques to improve the efficiency of ILP systems. In particular, the main focus of the work is on techniques to parallelize the execution of ILP systems in order to reduce their execution time.

This chapter begins by situating the dissertation work in the context of several areas, such as Knowledge Discovery in Databases, Data Mining, and ILP. It is followed by a brief introduction to ILP, and then the motivation for this work is presented. Finally, the goals and contributions of the dissertation are outlined, as well as a road map for the remainder of the dissertation.


1.1 Context

The amount of data collected and stored in databases is growing considerably in almost all areas of human activity. A paramount example is the explosion of biotech data [BKML+05, HJZ+00], where the volume of data has been doubling every three to six months as a result of automation in biochemistry. In this example, as in many others, processing all data is either impossible or very expensive, both humanly and computationally. This justifies the increased interest in the automatic discovery of useful knowledge in databases (Knowledge Discovery in Databases).

Knowledge Discovery in Databases (KDD) is "the non-trivial process of identifying valid, novel, and potentially useful, and ultimately understandable structure in data" [FPSS96]. In a KDD context, the data is a set of facts and a structure corresponds to a high-level description of the data in the form of a pattern or a model.

A pattern is a piece of knowledge discovered in the data, expressed in some language, which describes (summarizes) a subset of the data. Patterns can be considered as knowledge: "a pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user's criteria) is called knowledge" [FPSS96]. The discovered patterns should be valid not only on the input data but also on new data, up to some degree of certainty (typically defined by a user). They should also be novel and potentially useful for the user. A model can be seen as a set of patterns that characterize the whole data. Sometimes models are presented as black boxes, but in most cases the models should be understandable in order to be useful.

KDD is a cyclic process composed of several steps. Following the CRoss Industry Standard Process for Data Mining (CRISP) methodology [She00], the major steps involve business understanding and data preparation (including data cleaning, pre-processing and integration), search for patterns, and evaluation. All these steps are repeated in multiple iterations. Figure 1.1 gives an overview of the steps comprising the KDD process. The process is non-trivial in the sense that some search is involved, as opposed to performing a straightforward computation like computing the average value of a set of numbers.

Data Mining is one of the steps of KDD, and perhaps the central one. It concerns the extraction of useful information (relationships and summaries) from databases or large data sets. The relationships and summaries extracted through data mining tools are usually referred to as models or patterns. Other main steps in the KDD process are concerned with preparing data and evaluating the discovered patterns.


Figure 1.1: Overview of the Knowledge Discovery in Databases process.

A more detailed description of the KDD process can be found in [BA96, Fay01].

Several Data Mining algorithms have been proposed in the literature [HSM01] to address different data mining tasks, such as predictive modeling (classification and regression) and descriptive modeling (clustering and summarization, among others).

• In predictive modeling the goal is to predict the value of some field(s) in a database based on the values of other fields (having a priori some examples where the value of the field is known). If the field being predicted may assume a numerical (continuous) value then the problem is a regression problem. If the field is categorical then it is a classification problem, and each category is referred to as a class value.

• In descriptive modeling the goal is to build a model that describes interesting regularities in the data. Clustering and data summarization are two examples of descriptive modeling tasks. A clustering task consists of grouping the data items into subsets that contain data items similar to each other. A data summarization task consists of extracting compact patterns (summaries) that describe subsets of the data.

Many Data Mining algorithms come from the Machine Learning (ML) area. ML algorithms have been successfully used to extract (learn) relationships and correlations hidden in data. Learning a pattern (or a model) can be seen as the task of searching through a space of possible patterns (or models) for one that fits the data [Mit97]. The goal of the search is to find the pattern that "best" fits the data. The search space can be extremely large (or even infinite), making an enumeration of all patterns impractical. To make the search feasible, several constraints (bias) are imposed on the pattern language (language bias) and on the search method (search bias). Practical algorithms often perform heuristic search and do not guarantee that the optimal pattern is found.
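In ILP, one common concrete form of such bias is the mode declaration. The sketch below uses Aleph-style syntax with illustrative predicates; it is an example made up for exposition, not the configuration of any system discussed in this dissertation.

    % Aleph-style bias declarations (illustrative predicates; a sketch).
    :- modeh(1, grandfather(+person, +person)).   % admissible rule heads
    :- modeb(*, father(+person, -person)).        % admissible body literals
    :- modeb(*, parent(+person, -person)).
    % determination/2 states which predicates may appear in the body of
    % clauses defining grandfather/2, further restricting the search.
    :- determination(grandfather/2, father/2).
    :- determination(grandfather/2, parent/2).

Declarations of this kind shrink the space of candidate patterns before any search begins, which is why bias plays such a central role in making learning tractable.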

Most Data Mining approaches search for patterns in a single table (or relation), where each row (or tuple) in the table characterizes one object of interest. This is commonly referred to as a propositional representation (or attribute-value representation). The algorithms that look for patterns in this representation are known as propositional algorithms. In more complex applications the data involves several relations, thus being spread over multiple tables. Great care and effort has to be put into data preparation in order to squeeze as much relevant data as possible into a single table so that propositional algorithms can be applied. However, in spite of the care, propositionalizing the data (i.e., converting it from multiple tables into a single one) may lead to redundancy, loss of information [Wro01], or tables of prohibitive size [DR98]. Fortunately, there is an alternative to propositional Data Mining called Multi-Relational Data Mining (MRDM).

Multi-Relational Data Mining can analyze data from multiple relations, with no need to transform the data into a single table first. Most relational Data Mining techniques have been developed within the area of Inductive Logic Programming (ILP). To learn patterns in multi-relational data, ILP approaches mainly use languages based on Logic Programming [Llo87, Hog90], which form an important subset of first-order logic and generalize relational databases, i.e., each relational database can be converted into a logic program [RU95]. Therefore, the patterns mined (extracted in a KDD context or learned in an ML context) can reside in a relational or deductive database∗. The relations in the database can be defined extensionally, as lists of tuples, or intensionally, as database views or sets of rules. The latter allows MRDM systems to take into account valid domain knowledge, commonly referred to as background knowledge.
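As a minimal sketch of this correspondence (the relation names are illustrative), a table becomes a set of ground facts and a view becomes a rule:

    % An extensional relation: each tuple of a 'parent' table is a fact.
    parent(henry, jane).
    parent(jane, john).

    % An intensional relation, analogous to a database view: its tuples
    % are derived by a rule from other relations (background knowledge).
    grandparent(X, Y) :- parent(X, Z), parent(Z, Y).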

1.2 Inductive Logic Programming

Arguably, the most expressive and human-understandable representation for learned models is if-then rules. Several learning approaches produce this kind of rule, such as propositional rule learning (see e.g., [Mic69, CN89]) and tree learning (see e.g., [Qui93]). Inductive Logic Programming (ILP) is another learning approach capable of producing if-then rules.

∗A deductive database is a combination of a conventional database containing facts, a knowledge base containing rules, and an inference engine that allows the derivation of information implied by the facts and rules [RU95].

Models discovered by ILP systems are typically represented as logic programs, a subset of first-order logic, and the patterns as clauses (rules). A model is therefore a set of rules. ILP systems induce models from input data that consists of training examples of the target relation and, often, from prior knowledge, also named background knowledge. Both examples and background knowledge are usually represented as logic programs.

Many ILP systems follow a generate-and-test approach to discover the patterns (see e.g., [QCJ93, BR98, DT00, MF01]). Using this approach, ILP systems learn the patterns by searching through a space of candidate patterns for one with the desired properties. The search space of patterns can be very large or even infinite. Finding a pattern by naively enumerating all candidate patterns is, in almost all practical applications, computationally too expensive or even infeasible. Therefore, ILP systems often employ search strategies such as greedy, randomized, or branch-and-bound search. Regardless of the search strategy chosen, each pattern generated is evaluated on the input data in order to determine its goodness. Patterns that cannot become good are discarded, while potentially interesting patterns are further expanded in subsequent stages of the search. The search ends when a suitable pattern is found.

The evaluation of a single pattern involves testing whether the pattern, together with the background knowledge, "explains" the training examples. The time needed to evaluate a single pattern depends primarily on the size of the set of training examples and on the computational effort required to evaluate the pattern with the given background knowledge. The evaluation of a single pattern, even for small sets of training examples, can take a very long time.
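The following Prolog sketch illustrates this evaluation step. It assumes the background knowledge is already loaded into the program, and all predicate names are illustrative helpers, not the code of any system evaluated in this dissertation.

    % Score a candidate clause by counting the positive and negative
    % examples it covers (a hedged sketch, not April's implementation).
    evaluate(Clause, Pos, Neg, TP, FP) :-
        count_covered(Clause, Pos, TP),
        count_covered(Clause, Neg, FP).

    count_covered(_Clause, [], 0).
    count_covered(Clause, [E|Es], N) :-
        count_covered(Clause, Es, N0),
        ( covers(Clause, E) -> N is N0 + 1 ; N = N0 ).

    % An example is covered if it unifies with the clause head and the
    % clause body succeeds against the background knowledge. The double
    % negation discards the bindings made during the test.
    covers((Head :- Body), Example) :-
        \+ \+ (Example = Head, call(Body)).

Since covers/2 runs the clause body against the background knowledge for every example, its cost, multiplied by the number of examples and candidate clauses, dominates the runtime of generate-and-test systems.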

Inductive Logic Programming has several advantages, namely:

• Expressiveness. First Order Logic (FOL) can represent more complex concepts than traditional attribute-value languages. Typically, structural concepts are hard to represent using a zero-order language.

• Readability. It is arguable whether logic formulae are easier to read than decision trees or a set of linear equations. However, they are potentially readable. If the knowledge is structured, a first-order representation is probably easier to read than a zero-order one.

• Use of background knowledge. Domain knowledge can be encoded and given as background knowledge. The source of the background knowledge can be an expert or a discovery system. In some cases background knowledge can be grown during discovery time. This is known as predicate invention or constructive synthesis.

• Multiple tables. Handling multiple relations is natural in FOL. Therefore, multiple database tables can be handled without explicit and expensive joins.

The expressiveness of FOL gives the induced models flexibility and understandability. However, ILP systems suffer from significant limitations that reduce their applicability in data mining tasks. Most ILP systems execute in main memory, limiting their ability to process large databases. Furthermore, ILP systems are computationally expensive, e.g., evaluating individual rules may take considerable time. On complex applications, ILP systems can take several hours, if not days, to return a model. Therefore, a major obstacle that ILP systems must overcome is efficiency.

1.3 Motivation and Contributions

There are several reasons to improve the efficiency of ILP systems. First, efficient ILP systems may be applied to a larger and wider spectrum of problems. Second, one of the challenges for ILP is the improvement of human-computer interaction [PS03]. To this end, ILP systems need to be interactive, which in turn implies that the systems should be very fast in order to make the interaction practical in real time. Third, fast ILP systems are required because it is common to perform several runs in order to search for good parameter settings and/or to produce performance statistics of the models found (e.g., using cross-validation [Koh95]).

In previous years, research on improving the efficiency of ILP systems has mainly focused on reducing their sequential execution time (see Section 2.5 for a survey). Another promising line of research to improve efficiency, as pointed out by several researchers in recent years [Pag00, PS03], is the parallel execution of ILP systems (a survey is presented in Chapter 4). Parallel processing is becoming widespread, making use of the growing number of existing computer clusters and grids. A particular example is the widespread use of "Beowulf Clusters", arrangements of several personal computers that allow for parallel processing. Although parallel processing cannot change the order of the time or space complexity of an algorithm, it may be used to obtain solutions faster, by using more processing power (more CPUs), or better solutions in the same amount of time.


The main goal of this work is to improve the efficiency of ILP systems by studying and developing techniques to parallelize their execution.

To achieve this goal we started by implementing a sequential ILP system (from scratch) and tried to make it competitive with state-of-the-art sequential ILP systems. The reasons for developing a new ILP system, as opposed to using an existing one, were twofold:

• Firstly, when evaluating parallel algorithms one tries to compare them to an efficient sequential algorithm. Many techniques have been proposed to reduce the sequential execution time of ILP systems. However, their implementation is scattered among several ILP systems. Hence, a competitive sequential implementation should include the latest techniques for improving efficiency;

• Secondly, a flexible and modular sequential ILP system would make the implementation of several parallel ILP algorithms an easier and faster task.

The ILP system implemented is called April. It combines and integrates several techniques to maximize efficiency. The development of a flexible ILP system, where (new) techniques can easily be plugged in, should make the evaluation and integration of novel proposals a faster and easier task.

There is undoubtedly a variety of parallel schemes and algorithms that can be implemented. A second contribution of this work is a comparative study of several parallelization strategies based on data partitioning and parallel searches. The goal of the study was to assess which strategy is better overall.

Randomized algorithms [MR95] make random choices during execution with the expectation of achieving good performance in the "average case". The solutions produced by randomized algorithms may or may not be the same as those of regular algorithms. A third contribution is an empirical comparison between parallel algorithms and (sequential and parallel) randomized algorithms. We assess whether the parallel algorithms produce better models, and in less time, than sequential randomized algorithms. In other words, we tried to assess if and when the parallelization of ILP systems pays off as opposed to performing randomized searches.

One last contribution is the proposal of a novel parallel covering algorithm targeted for distributed memory machines that exploits pipelined data-parallelism [HJ86, KCN90] and is geared to achieve good speedups while preserving the quality of the models.


1.4 Thesis Organization

The following chapter establishes the context and provides the necessary background for the remaining chapters of this thesis. It also presents a survey of approaches to improve the performance of ILP systems.

Chapter 3 describes the April ILP system and presents an empirical evaluation of the system, including a comparison with the state-of-the-art ILP system Aleph [Sri03].

Chapter 4 presents and evaluates several parallelization strategies for ILP.

Chapter 5 presents and evaluates a pipelined data-parallel algorithm for ILP systems and, more generally, for algorithms based on the generic covering algorithm.

Finally, in Chapter 6 we summarize the dissertation and outline future work.

An effort has been made to make each chapter of the dissertation relatively self-contained. Those who wish to do so can read any chapter independently after reading Chapters 1 and 2.

1.5 Bibliographic Notes

Parts of this dissertation have been published in conferences. The following list contains the key articles.

• A state-of-the-art survey on parallelization strategies for ILP, together with the evaluation of three data parallel algorithms, is presented in an ILP-2005 paper [FSC05] (awarded best student paper).

• The new parallel algorithm is described in a Cluster-2005 paper [FSCC05].

• The April ILP system is described in a paper to be included in the proceedings of the JELIA06 conference [FSC06].

• A proposal on avoiding redundancy in Inductive Logic Programming is discussedand evaluated in [FCCS04].

• A preliminary integration of Tabling [Mic68], a Logic Programming technique, with ILP is discussed in an ECML-2005 paper [RFC05]. My contribution in [RFC05] consisted in the design, implementation, and evaluation of a scheme to explore tabling in ILP systems.

• Two efficient data structures for Inductive Logic Programming systems are proposed and evaluated in [FRCS03].

• An evaluation of the coverage caching technique is performed in a paper at EPIA 2003 [FCSC03].


He who loves practice without theory is like the sailor who boards a ship without a rudder and compass and never knows where he may cast.

Leonardo da Vinci

2 On Inductive Logic Programming

This chapter introduces the main concepts, techniques, and terminology of Inductive Logic Programming (ILP) relevant for the remaining chapters. The chapter ends with a survey of techniques to improve the efficiency of ILP systems.

2.1 Introduction

Deduction operates by discovering the necessary implications of established truths; that is, established generalizations are applied either to other generalizations or to specific cases in order to discover new conclusions. In a deductive argument, if the premises∗ are true and if the argument is valid, then the conclusion must also be true. For instance, with a premise containing the rule that "all men are mortal" and knowing that "Socrates is a man", we can deduce that "Socrates is mortal".
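Written as a small logic program (the predicate names are ours), the example reads as follows:

    % Premises: a general rule and a specific fact.
    man(socrates).
    mortal(X) :- man(X).

    % Deduction: the query ?- mortal(socrates). succeeds, since the
    % conclusion is a necessary consequence of the premises.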

On the other hand, induction, or inductive logic, is the process of forming conclusions that reach beyond the premises, beyond the current boundaries of knowledge, thus making inductive conclusions probable rather than certain. At the core of inductive thinking is the "inductive leap", the stretch of imagination that draws a reasonable inference from the available information. Therefore, inductive conclusions are only probable. A generalization, or inductive generalization, is a particular type of induction that proceeds from a premise about a sample to a conclusion about the population. For example, "All my friends consider this an interesting dissertation, so everybody else must consider it interesting too".

∗Premises are statements (facts and rules) believed to be true, and they include reasons, evidence, and observations.

Induction draws conclusions about a class of objects based upon the characteristics observed in a sample of that class. Such conclusions are persuasive to the extent that the sample was connected causally to the larger class in such a way that the characteristics of the larger class will be reflected in the sample. However, inductive arguments are not erosion-proof: conclusions based on inductive reasoning may turn out to be false or improbable when given more data. As an example consider the case of Bertrand Russell's inductive turkey:

"The turkey found that, on his first morning at the turkey farm, he was fed at 9 a.m. Being a good inductivist turkey he did not jump to conclusions. He waited until he had collected a large number of observations that he was fed at 9 a.m. and made these observations under a wide range of circumstances, on Wednesdays, on Thursdays, on cold days, on warm days. Each day he added another observation statement to his list. Finally he was satisfied that he had collected a number of observation statements to inductively infer that "I am always fed at 9 a.m.". However, on the morning of Christmas eve he was not fed but instead had his throat cut."

Therefore, while deductive inference proceeds by the application of sound rules of inference, inductive inference typically involves unsound conjectures.

In a Machine Learning context, when learning from a sample of data, the model is based on the sample, but it is generalized for the whole population from which the sample was taken. Inductive Logic Programming (ILP) [Mug90] generalizes from individual instances in the presence of background knowledge, conjecturing patterns about yet unseen data. The patterns discovered by ILP systems are usually expressed as logic programs, a subset of first-order (predicate) logic.

Inductive Logic Programming (ILP) is a subfield of Machine Learning (ML). It is theoretically situated at the intersection of inductive learning and Logic Programming. Logic Programming (LP) is the programming paradigm that uses First Order Logic (FOL) to represent relations between objects and implements a form of deductive reasoning. From inductive machine learning ILP inherits its goal: the development of tools and techniques to induce hypotheses from observations (examples). From Logic Programming it inherits its representation formalism, its semantic orientation, and some operational techniques.

ILP goes beyond the theory of Logic Programming by investigating induction rather than deduction as the basic mode of inference. In a sense, ILP performs the inverse task of Logic Programming, in the same way that induction is the inverse of deduction (see Figure 2.1). When deducing we start with some facts and general laws (or rules), and from them new facts are derived. For instance, suppose it is known that "Henry is the father of Jane", "Jane is the parent of John", and a rule describing the grandfather relation. With this rule and facts it is valid to deduce that "Henry is the grandfather of John". On the other hand, suppose that given some examples (instances) of the grandparent relation we want to induce the grandparent rule. From the examples and premises (background knowledge) one can induce a definition of the grandparent relation.

Figure 2.1: Induction versus deduction.

The ILP problem described above is the classical inductive concept learning from examples task. Given some background knowledge and sets of positive and negative examples, the goal is to find a hypothesis that can predict the examples given the background knowledge. In this learning problem, termed predictive learning, the goal is to find a generalized hypothesis which can replace the examples. In contrast, in descriptive learning the goal is to find properties that describe the data, i.e., that hold on the data. The first learning problem is known as predictive ILP, whereas the second is known as descriptive ILP. Learning classification rules [MF01], induction of logical decision trees [BR98], and first-order regression [KB97] are examples of predictive ILP tasks. Learning integrity constraints [RD97], association rules [DT00], first-order clustering [LvF02], and subgroup discovery [Wro97] are examples of descriptive ILP tasks. In this thesis we will focus on predictive ILP, more specifically on learning classification rules.

The remainder of this chapter is organized as follows. We next present the ILP problem, followed by a description of strategies to solve it. We focus on two search techniques relevant for the description of the work: stochastic clause selection and randomized rapid restarts. For a thorough and more formal introduction to ILP we refer to [NCdW97, MDR94]. Finally, we conclude with a comprehensive survey on improving the performance of ILP systems.

2.2 The ILP Problem

Whilst Logic Programming is about deducing logical consequences from programs, ILP starts with the logical consequences and aims at obtaining logic programs. In general, an ILP system receives as input prior knowledge B (a.k.a. background knowledge) and some examples E, and induces (learns) a theory H (a.k.a. a model).

The background knowledge B is represented as a logic program. The predicates in B can be defined either intensionally or extensionally. A predicate is defined extensionally when it consists of a set of ground facts. A predicate is defined intensionally if it is defined through non-ground program clauses.

The set of examples (denoted as E) is typically represented as sets of ground unit clauses (facts). Other approaches to representing the examples may involve ground clauses [BR98] or non-ground clauses [Sri03]. Typically, there are two types of examples: positive examples, which are true, and negative examples, which are false. The sets of positive and negative examples are denoted by E+ and E−, respectively.

A theory H, a hypothesis∗, is a set of rules. Hypotheses are found within some hypothesis language L (also called concept language), typically a subset of the language of program clauses. Since the complexity of learning grows with the expressiveness of the hypothesis language, restrictions are usually imposed on the hypotheses. These restrictions are referred to as language bias and are used to reduce the number of candidate hypotheses (we will address this issue in Section 2.3.7).

∗Depending on the context, a hypothesis may refer to a single rule (clause) or a theory (set of rules).

The different tasks addressed by an ILP system can be generally defined as outlined in Figure 2.2 (more formal definitions can be found in [MDR94, NCdW97, WD95, Rae97]).

Given:

• a set of examples E

• background knowledge B

• a language L that defines the allowed hypotheses

• a notion of explanation

Predictive ILP: Find a hypothesis H ∈ L such that:

• H together with B explains E+ (completeness); and

• H together with B does not explain E− (consistency).

Other conditions imposed are:

• B alone explains neither E+ nor E−.

Descriptive ILP: Find a hypothesis H ∈ L such that:

• H explains B together with E+ (i.e., H is valid in B and E+);

and possibly that:

• H is complete, i.e., for any other H′ in L, H explains H′;

• H is non-redundant, i.e., there is no proper subset H′ of H which is valid and complete.

Figure 2.2: The ILP problem.

There are two main definitions for the notion of explanation: learning from entailment and learning from interpretations. The first notion of explanation is the most employed in ILP and is often used in predictive ILP, while learning from interpretations is more used in descriptive ILP. The two notions, described in more detail in the next sections, differ mainly in their representation of the examples.

Predictive ILP with the learning from entailment semantics is also known as normal semantics, the normal ILP setting, explanatory ILP, or strong ILP. Systems using this semantics are concerned with the induction of rules that explain (correctly classify) the given examples. Descriptive ILP systems, also termed non-monotonic ILP [MDR94] systems, are concerned with finding properties in the data (background knowledge and examples), as opposed to predictive ILP, where the goal is to find rules that can replace the data.

Descriptive ILP has a very interesting property: since no negative examples are usually used, if two individual clauses are valid on the data, their conjunction will also be valid on the data [RD97]. Therefore, parallel ILP systems following this setting are rather straightforward to implement, since clauses can be searched independently of each other (and therefore in parallel). This property is not satisfied when learning with negative examples, in either setting.

In this thesis the normal semantics is employed, i.e., we address the predictive ILP learning task with the learning from entailment semantics. Although the tasks and semantics are different, many of the ILP techniques described in the remainder of the chapter are applicable to both ILP tasks and to both notions of explanation.

2.2.1 Learning from Entailment

This setting, introduced by Muggleton [Mug91], is used by many ILP systems (e.g., Golem [MF90], Foil [QCJ93], Progol [MF01], Indlog [Cam00], and Aleph [Sri03]). The notion of explanation in this setting denotes coverage based on logical entailment. Given a background theory B, a hypothesis H, and an example set E, where H and B are sets of definite clauses:

an example e ∈ E is (intensionally) covered by H iff B ∧ H ⊨ e.

Also related with this setting, one can define global completeness and consistency as follows [Rae97]:

• A hypothesis H is (globally) complete iff B ∧ H ⊨ E+.

• A hypothesis H is consistent iff B ∧ H ⊭ E−.

• A hypothesis that is complete and consistent is said to be correct.

Note that ⊨ denotes logical implication (semantic entailment), which in practice is often replaced by syntactic entailment (⊢) or provability, where the SLDNF-resolution inference rule (as implemented by Prolog engines) is often used to prove that examples are derivable from a hypothesis and background knowledge.
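To make the coverage-based definitions concrete, the following minimal Python sketch (not taken from any ILP system) checks completeness, consistency, and correctness; covers is a hypothetical oracle, e.g. backed by a Prolog engine, that tests whether B ∧ H ⊨ e:

    def is_complete(covers, B, H, Epos):
        # H (together with B) must explain every positive example.
        return all(covers(B, H, e) for e in Epos)

    def is_consistent(covers, B, H, Eneg):
        # H (together with B) must explain no negative example.
        return not any(covers(B, H, e) for e in Eneg)

    def is_correct(covers, B, H, Epos, Eneg):
        # Correct = complete and consistent.
        return is_complete(covers, B, H, Epos) and is_consistent(covers, B, H, Eneg)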


B                              E+                          E−
male(henry).   female(eve).    grandfather(henry,jane).    grandfather(john,jane).
male(john).    female(jane).                               grandfather(eve,jane).
father(henry,john).                                        grandfather(eve,john).
father(john,jane).
mother(eve,john).
parent(X,Y) :- mother(X,Y).
parent(X,Y) :- father(X,Y).

Table 2.1: Family data representation in a learning from entailment setting.

In the most general formulation, each e ∈ E, as well as B and H, can be a clausal theory. In practice, each e is often a ground unit clause, while B and H are definite logic programs. In learning from entailment it is usual to have both positive and negative examples.

The data may be noisy, i.e., the background knowledge may contain errors or the set of examples may contain classification errors. To handle noise, ILP systems often relax the consistency condition and allow a rule to cover some negative examples. In cases where no negative examples are provided, it is possible to derive them explicitly by applying the Closed World Assumption (CWA). Negative examples are useful because they help ILP systems avoid producing overly general hypotheses, but they are not mandatory [Mug96].
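As an illustration of deriving negatives under the CWA, here is a minimal sketch (the predicate/constant representation is hypothetical): every ground atom of the target predicate that is not listed as a positive example is taken to be false.

    from itertools import product

    def cwa_negatives(constants, target, arity, positives):
        # Under the CWA, every unlisted ground atom of the target
        # predicate is assumed to be a negative example.
        return [(target, args)
                for args in product(constants, repeat=arity)
                if (target, args) not in positives]

    # With the family data of Table 2.1:
    positives = {("grandfather", ("henry", "jane"))}
    negatives = cwa_negatives(["henry", "eve", "john", "jane"],
                              "grandfather", 2, positives)
    # yields the 15 remaining ground atoms, including the three in E−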

Example 1 (An ILP problem in the normal ILP setting)

Given the data presented in Table 2.1, an ILP system using the normal setting of ILP could learn a rule (that may replace the examples) for the grandfather(X,Y) relationship, which states that person X is the grandfather of person Y. Knowing B and faced with the facts E+ and E−, an ILP system would be able to infer that:

grandfather(X,Y) :- father(X,Z), parent(Z,Y).

i.e., the grandfather of a person Y is the father of one of his/her parents.


2.2.2 Learning from Interpretations

In learning from interpretations [Rae97], hypotheses are clausal theories and examples are Herbrand interpretations. The requirement in learning from entailment that B ∧ H ⊨ e is replaced in learning from interpretations [RD94] by the requirement that H be true (valid) in the minimal Herbrand model of B ∧ e, for all e. Therefore, coverage under learning from interpretations is defined as follows. Let e be an example (Herbrand interpretation):

H covers e under interpretations if and only if B ∧ e is a model of H

Learning from entailment and learning from interpretations differ mainly in the form in which the examples and background knowledge are represented. From a practical point of view, in learning from interpretations each example e is represented as a separate Prolog program encoding its specific properties as sets of interpretations, and the background knowledge is given in the form of another Prolog program. In learning from entailment all the data is together in the background knowledge and there is no separation of the examples (besides being positive or negative). Therefore, data partitioning in learning from interpretations can be performed more efficiently, since all properties of an example are grouped together, while in learning from entailment that does not happen.

Example 2

Consider the data in Table 2.1. In the learning from interpretations setting the background knowledge and examples could be represented as follows.

B:
parent(X,Y) :- mother(X,Y).
parent(X,Y) :- father(X,Y).

E+:
grandfather(henry,jane) :- male(henry), female(jane), father(henry,john),
    male(john), father(john,jane).

E−:
grandfather(john,jane) :- male(john), father(henry,john), male(henry).
grandfather(eve,jane) :- female(eve), mother(eve,john), male(john),
    father(john,jane), father(henry,john), male(henry).
grandfather(eve,john) :- male(john), female(eve), mother(eve,john),
    father(john,jane), female(jane).

It has been shown [Rae97] that learning from interpretations reduces to learning from entailment. This means that solutions found in learning from entailment under the closed-world assumption are also solutions in the learning from interpretations setting, but the opposite does not always hold.

Learning from interpretations has been applied to several problems, namely learning first-order logical decision trees [BR98] and association rules [DT00], and performing first-order clustering [RB97] or subgroup discovery [Wro97].

2.3 The Learning Process

Usually, it is not obvious which hypothesis (H) we should choose from the set of possible hypotheses (L). One way to choose H comprises the enumeration of all possible hypotheses within the language. This can be done through general artificial intelligence techniques like generate-and-test algorithms. However, due to the large and possibly infinite size of the search space, a pure generate-and-test approach is computationally too expensive to be of practical interest. Therefore, as with many other learning problems, the approach followed is to map the learning problem into a search problem [Mit82].

The states in the search space (designated the hypothesis space) are hypotheses, and the goal is to find one or more hypotheses satisfying some quality criterion. Therefore, the search involves traversing the hypothesis space, which in turn involves generating and evaluating hypotheses.

The total execution time of an ILP system can be roughly divided into three major components:

• time spent to generate rules;

• time spent to evaluate rules;

• and, in some cases, the time to load the data from disk (for very large datasets).

The dominant component may depend on the ILP system used, on the search algorithm, on the parameter settings, and on the data. In most cases, the evaluation of the rules is responsible for a significant part of the execution time. It has been shown by Železný et al. [vSP02] (in two applications) that ILP runtimes may differ considerably, depending on the choice of a heuristic, on the problem instance, or on the choice of the seed examples.


We next describe how the final hypothesis produced by an ILP system is found.

2.3.1 A Generic Rule Covering Algorithm

A plethora of rule learning algorithms [Für99] use a variant of the generic covering algorithm (also called separate-and-conquer). An example of a generic covering algorithm is presented in Figure 2.3. The algorithm learns one rule at a time using some generalization procedure that performs a search through an ordered space of legal rules. After finding a rule, all covered positive examples are separated (removed) from the training set and the next rule is learned from the remaining examples. The algorithm's execution can be divided into a number of epochs. Each epoch is responsible for finding rules that cover a number of positive examples. In other words, an epoch corresponds to the outer cycle of the algorithm. The execution will consist of as many epochs as needed to have all positive examples covered or to have some other stopping criterion verified (e.g., a time constraint).

Most predictive ILP systems use some variant of the generic covering algorithm. The main difference between existing ILP systems and algorithms that use a variant of this covering algorithm (see e.g., [Mug95, QCJ93, Sri03]) resides in the learn_rule() procedure (step 4).

The learn_rule() procedure (outlined in Figure 2.4) accepts a set of examples and returns a consistent rule (clause), or a set of consistent rules, that explains some or all positive examples. It searches the (potentially infinite) space of rules, starting at some rule(s) (START_RULES), for a rule that optimizes some quality criterion. In Figure 2.3 the learn_rule() procedure is invoked with k = 1 and therefore it will return a single rule. Later on we will consider that it returns a set with the best k good rules, for some k ≥ 1.
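The control flow of the two procedures can be summarized in the following minimal Python sketch; learn_rule and covers are hypothetical stand-ins for the system's search and coverage-test procedures, and examples are assumed to carry a boolean label attribute:

    def covering(examples, learn_rule, covers):
        # Generic covering (separate-and-conquer) loop: one epoch per
        # iteration, removing the positives covered by the learned rule.
        rules_learned = []
        positives = [e for e in examples if e.label]      # E+
        while positives:
            rule = learn_rule(1, examples)[0]             # best rule of this epoch
            rules_learned.append(rule)
            positives = [e for e in positives if not covers(rule, e)]
        return rules_learned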

2.3.2 Structuring the Hypothesis Space

In order to facilitate the search, the set of rules (hypotheses ∈ L) is structured, i.e., ordered, through the dual notions of generalization and specialization. There are several orderings that can be used in ILP [NCdW97]. A popular ordering is θ-subsumption, used initially by Plotkin [Plo70] in the learning context:

Definition 1 Let c and d be two clauses. A clause c θ-subsumes d (c ⪯ d) if there exists a substitution θ such that cθ ⊆ d.


covering(E)

Input: a set of examples E.
Output: a set of consistent rules.

1. Rules_Learned = ∅
2. E+ = POSITIVES(E)
3. while E+ ≠ ∅ do
4.    R = learn_rule(1, E)
5.    Rules_Learned = Rules_Learned ∪ R
6.    E+ = E+ \ {examples covered by R}
7. end while
8. return Rules_Learned

Figure 2.3: A generic covering algorithm. The learn_rule() procedure returns a set which includes the best rule found that explains a subset of the positive examples (E+).

learn_rule(k, E)

Input: maximum number of rules to return (k) and a set of examples (E).
Output: a set containing the best k rules found.

1. Good = ∅
2. S = START_RULES
3. while some stop criterion is not satisfied do
4.    Pick = pickRule(S)
5.    S = S \ {Pick}
6.    NewRules = genNewRules(Pick)
7.    evalOnExamples(E, NewRules)
8.    Good = Good ∪ {r ∈ NewRules | good_rule(r)}
9. end while
10. return bestOf(k, Good)

Figure 2.4: An example of a generic learn_rule() procedure.


Example 3

To illustrate the above notion, consider the clauses c1 and c2:

c1 = grandfather(X,Y) :- parent(X,Y).
c2 = grandfather(X,Y) :- male(X), parent(X,Y).

Under clausal form c1 may be represented as {grandfather(X,Y), ¬parent(X,Y)}, where all variables are assumed to be universally quantified and the commas denote disjunction.

Clause c1 θ-subsumes c2 (c1 ⪯ c2) under the empty substitution θ = ∅, since {grandfather(X,Y), ¬parent(X,Y)} is a proper subset of {grandfather(X,Y), ¬male(X), ¬parent(X,Y)}.
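A θ-subsumption test is easy to implement by backtracking over candidate literal matches. The following is a minimal sketch, assuming clauses are represented as sets of (predicate, args) literals with the sign folded into the predicate name, and the two clauses standardized apart:

    def is_var(t):
        # Prolog convention: variables are uppercase-initial strings.
        return isinstance(t, str) and t[:1].isupper()

    def theta_subsumes(c, d):
        # Is there a substitution theta such that c.theta is a subset of d?
        def match(lits, theta):
            if not lits:
                return True
            (pred, args), rest = lits[0], lits[1:]
            for dpred, dargs in d:
                if dpred != pred or len(dargs) != len(args):
                    continue
                new, ok = dict(theta), True
                for a, b in zip(args, dargs):
                    if is_var(a) and a not in new:
                        new[a] = b          # bind a fresh variable of c
                    elif (new.get(a, a) if is_var(a) else a) != b:
                        ok = False          # clash with a binding or constant
                        break
                if ok and match(rest, new):
                    return True
            return False
        return match(list(c), {})

    # Example 3: c1 θ-subsumes c2 under the empty substitution.
    c1 = {("grandfather", ("X", "Y")), ("not_parent", ("X", "Y"))}
    c2 = {("grandfather", ("X", "Y")), ("not_male", ("X",)),
          ("not_parent", ("X", "Y"))}
    assert theta_subsumes(c1, c2) and not theta_subsumes(c2, c1)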

θ-subsumption introduces a syntactic notion of generality. A clause c is at least as general as clause d (c ⪯ d) if c θ-subsumes d. A clause c is more general than d (c ≺ d) if c ⪯ d holds but d ⪯ c does not. In this case it is said that d is a specialization (or refinement) of c, and c is a generalization of d.

The most relevant properties of θ-subsumption are [NCdW97]:

• If c θ-subsumes d then c ⊨ d. The opposite does not always hold. For instance, for the self-recursive clauses c3 = p(f(X)) ← p(X) and c4 = p(f(f(Y))) ← p(Y), c3 ⊨ c4 but c3 does not θ-subsume c4.

• A clause c is θ-equivalent to a clause d iff c θ-subsumes d and d θ-subsumes c. For example, the clause c1 described above and the clause c5 = grandfather(X,Y) :- parent(X,Y), parent(Z,W) θ-subsume one another and are thus equivalent. Since two clauses that are equivalent under θ-subsumption are also logically equivalent, ILP systems should generate only one clause for each equivalence class. In other words, ILP systems should generate reduced clauses [Plo70]: a clause is reduced iff it is not θ-equivalent to any proper subset of itself. For instance, the clause c1 is reduced while c5 is not.

• The relation ⪯ introduces a lattice∗ on the set of reduced clauses [Plo70]. The top element of the generalization lattice is □, the empty clause. The glb (greatest lower bound) of two reduced clauses c and d is called the most general instance (mgi) and is the union of the two clauses: mgi(c,d) = c ∪ d. The lub of two clauses c and d is called the least general generalization (lgg) [Plo70]. Under θ-subsumption the glb and lub of clauses are unique up to a renaming of variables.

∗A lattice is a partially ordered set, i.e., a reflexive, anti-symmetric and transitive binary relation, in which every pair of elements (a, b) has a greatest lower bound (glb) and a least upper bound (lub).

It is important to note that θ-subsumption is a purely syntactic notion that does not take into account the background knowledge. Naturally, the same holds for the notion of generality based on θ-subsumption.

An alternative notion of generality is logical implication: a clause c is at least as general as clause d with respect to the background knowledge B if B ∪ {c} ⊨ d. The generalization relation based on logical implication is the most natural notion of generalization; however, it poses some practical problems. For instance, given two clauses c and d, it is not decidable whether c ⊨ d [MP92]. Another issue is that two clauses do not necessarily have a least general generalization under implication [NCdW97], therefore not introducing a lattice on the set of clauses.

Therefore, although other notions of generalization exist [NCdW97], for practical reasons θ-subsumption is the most frequently used in ILP.

2.3.3 Hypotheses Generation

Having a structure on the set of hypotheses, ILP systems use refinement operators, a name coined by Shapiro [Sha83], to navigate through the lattice of hypotheses (herein referred to as the search space). A refinement operator ρ is a function which computes (generates) a set of specializations (dually, generalizations) of a (set of) clause(s). Next we will focus on refinement operators for a single clause.

Specialization refinement operators allow the navigation through the lattice of clauses from the most general towards more specific ones. This kind of refinement operator, called a downward refinement operator, basically employs two syntactic operations on a clause (the second operation is illustrated by the sketch after this list):

1. apply a substitution θ to the clause;

2. add a literal (or a set of literals) to a clause.
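As an illustration of the second operation, the following minimal sketch (not any particular system's operator) specializes a clause by adding one literal drawn from a fixed pool, e.g. the literals of a bottom clause:

    def refine_down(clause, candidate_literals):
        # Downward refinement by literal addition: yield every clause
        # obtained by appending one not-yet-used literal to the body.
        head, body = clause
        for lit in candidate_literals:
            if lit not in body:
                yield (head, body + [lit])

    # Specializing grandfather(X,Y) :- parent(X,Y):
    clause = ("grandfather(X,Y)", ["parent(X,Y)"])
    pool = ["male(X)", "female(Y)", "parent(X,Y)"]
    for refinement in refine_down(clause, pool):
        print(refinement)   # adds male(X), then female(Y)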

Refinement operators can also compute a set of generalizations of a clause, allowing the navigation through the lattice from the most specific towards more general clauses. This type of refinement operator, called an upward refinement operator, performs two basic syntactic operations on a clause:

1. apply an inverse substitution to the clause, and

2. remove a literal from the body of the clause.

A refinement operator is considered ideal [vdL95] if it respects three properties:

• properness, i.e., a refinement operator should not generate equivalent (redundant) clauses;

• local finiteness, i.e., it should compute only a finite set of refinements;

• completeness, i.e., every refinement should be reachable by a finite number of applications of the operator.

It has been shown [vdL95] that ideal operators do not exist for the unrestricted θ-subsumption ordered set of clauses, as used in most ILP systems. Hence, generic refinement operators for ILP cannot be ideal. Since guaranteeing completeness and local finiteness is fundamental, properness is usually sacrificed. Therefore, ILP systems generate redundant hypotheses. Besides completeness and local finiteness, further restrictions are imposed upon the refinement operators. One of those conditions requires that the generated hypotheses satisfy the language bias (see Section 2.3.7).

The hypothesis space is designated as a refinement graph. A refinement graph is a directed, acyclic graph in which nodes are clauses and arcs correspond to basic refinement operations. The average branching factor of a refinement graph, i.e., the average number of refinements, may be very large. Therefore, for efficiency reasons, the number of clauses considered may be limited (see e.g., [QCJ93]).

Figure 2.5 illustrates a part of a refinement graph for the grandfather relation problem presented in Table 2.1. At the top of the refinement graph is the clause c = grandfather(X,Y). Connected to c are clauses generated by some refinement operator that adds a literal to a clause.

2.3.4 Search Strategy

The search strategy used by an ILP system indicates how the lattice is traversed. The two most common approaches follow specific-to-general (bottom-up) and general-to-specific (top-down) directions.

Figure 2.5: Part of a refinement graph for the grandfather relation. The rules are ordered from the most general (top) to the more specific (bottom).

In a top-down search, the initial rule is, usually, the most general one. Each rule is repeatedly specialized through the application of downward refinement operators in order to remove inconsistencies with the negative examples.

In a bottom-up search, the examples, together with the background knowledge, are repeatedly generalized by applying upward refinement operators.

Other, less common approaches, neither top-down nor bottom-up, and possibly non-deterministic, also exist (see e.g., [vSP02, TNM03, Sri00]). Some of the non-deterministic approaches are described in Section 2.4, namely Stochastic Clause Selection and Randomized Rapid Restarts.

2.3.5 Search Method

Another component of the search (besides the strategy) is the search method used to find a hypothesis. The search methods often employed in ILP are the ones used in Artificial Intelligence for problem solving [RN03], namely uninformed search methods or informed (heuristic) search methods. The latter class of methods can be used when it is possible to know (or estimate) if a hypothesis is more promising than another. These methods differ in the order by which the nodes are expanded.

Some of the most well-known uninformed search methods are breadth-first, depth-first, and iterative-deepening search. Breadth-first search is a simple method in which the root node is expanded first, then all direct successors of the root node are expanded, then all their successors, and so on. In general, all nodes at a given depth of the search are expanded before any nodes at the following level. The main drawbacks of this search method are its high memory requirements and poor efficiency. Several ILP systems use this search method, e.g., MIS [Sha83] and Warmr [DT00]. Depth-first search always expands the deepest node. This search has the advantage of lower memory requirements, since a node can be removed from memory after being expanded, as soon as all its descendants have been fully explored. Since, in general, the depth of the graph can be infinite, a variant of depth-first search, called depth-limited search, imposes a limit l on the depth of the graph explored. Obviously, this search is incomplete if a solution lies in a region deeper than l. Iterative deepening search (ids) is a general strategy, often used in combination with depth-first search, that finds the best depth limit. It does this by gradually increasing the depth limit (first 0, then 1, then 2, and so on) until a solution is found. The drawback of ids is the wasteful repeated computation. Iterative deepening search is used in ILP systems like Aleph [Sri03] and Indlog [Cam00].
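A minimal sketch of iterative deepening over a refinement graph, with hypothetical refine (a downward refinement operator) and is_solution helpers:

    def depth_limited(clause, refine, is_solution, limit):
        # Depth-first search that gives up below the depth limit.
        if is_solution(clause):
            return clause
        if limit == 0:
            return None
        for child in refine(clause):
            found = depth_limited(child, refine, is_solution, limit - 1)
            if found is not None:
                return found
        return None

    def iterative_deepening(start, refine, is_solution, max_depth):
        # Gradually increase the depth limit until a solution is found.
        for limit in range(max_depth + 1):
            found = depth_limited(start, refine, is_solution, limit)
            if found is not None:
                return found
        return None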

Informed (heuristic) search methods can be used to find solutions more efficiently than uninformed ones, but can only be applied if there is a measure to evaluate and discriminate the nodes. Several heuristics have been proposed for ILP. This topic is discussed in Section 2.3.8.

A general approach to perform informed search is best-first search. Best-first search is similar to the breadth-first strategy, but with the difference that the node selected for expansion is based on the value of an evaluation function that estimates the "distance" to the solution. The node with the lowest (or highest, depending on the implementation) evaluation is selected for expansion∗. In practice one uses a heuristic function to estimate the quality of a node. A heuristic function gives an estimate of the cost/distance to the solution from a given node. Due to the use of a heuristic function, the designation best-first search is overrated, since it selects the node that seems the best and not necessarily the best one. Best-first search is used by systems like Cigol [MB88], Indlog, and Aleph.

A form of best-first search is greedy best-first search, which expands only the node estimated to be closest to the solution, on the grounds that it will lead to a solution quickly. The most widely known form of best-first search is called A∗ search. It evaluates the nodes by combining the cost to reach the node and the distance/cost to the solution. This algorithm is used by the Progol ILP system [Mug95].

∗If it were possible to evaluate and discriminate the nodes accurately, then expanding the best node first would not be a search at all: it would consist of following a straight path to the solution. Therefore, the evaluation functions return an estimate of the quality of a node and are not exactly accurate.
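A minimal best-first search sketch using a priority queue; refine, is_solution, and the heuristic estimate h are hypothetical helpers:

    import heapq

    def best_first(start, refine, is_solution, h, max_nodes=10000):
        # Expand nodes in order of their heuristic estimate h(clause);
        # the counter breaks ties so clauses are never compared directly.
        frontier, counter = [(h(start), 0, start)], 1
        while frontier and counter < max_nodes:
            _, _, clause = heapq.heappop(frontier)
            if is_solution(clause):
                return clause
            for child in refine(clause):
                heapq.heappush(frontier, (h(child), counter, child))
                counter += 1
        return None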

The search algorithms mentioned so far explore the search space systematically. Another class of search algorithms are local search algorithms, commonly used for solving computationally hard problems where the space of candidate solutions is very large. The goal of a local search algorithm is to find the best node in the search space, according to some objective function. The basic principle underlying local search is to start from an initial candidate solution and then, iteratively, make moves from one candidate solution to another candidate solution in its direct neighborhood. The moves are based on information local to the node, and continue until a termination condition is met. For a description of a generic local search algorithm we refer to [RN03]. Local search algorithms, although not systematic, have the advantage of using little memory and can find reasonable solutions in large or infinite search spaces, where systematic algorithms are unsuitable.

Hill-climbing search [RN03] is one example of a local search algorithm. The algorithm does not look ahead beyond the immediate successor nodes of the current node, and for this reason it is sometimes called greedy local search. Although it has the advantage of reaching a solution rapidly, it has some potential problems. For instance, it is possible to reach a foothill (a local maximum or a local minimum), a state that is better than all its neighbors but not better than some other states farther away. Foothills are potential traps for the algorithm. Hill-climbing is good for a limited class of problems where we have an evaluation function that fairly accurately predicts the actual distance to a solution. ILP systems like FOIL [QCJ93] and Forte [RM95] use hill-climbing search algorithms.

There are many variations of hill-climbing search [RN03]. For instance, local beam search algorithms keep k nodes in memory instead of just one. The search begins with k randomly generated nodes. At each step, all successors of all k states are generated. If any of the successors is the solution, the algorithm halts; otherwise, it selects the k best successors from the complete list and repeats. The greediness of this approach often has a good impact on performance. However, a solution may not be found because a wrong subset of paths was chosen.


2.3.6 Bounding the Search

The search space can be very large. Several techniques have been developed and applied to ILP in order to reduce the search space. The search can be bounded through the exploitation of bias declarations (see Section 2.3.7) and/or by bounding the search space from below (in the lattice) through the use of a bottom clause (see Section 2.3.10). Another approach is the use of a generic branch-and-bound technique.

Branch-and-bound is a general algorithmic method for finding optimal solutions to various optimization problems. It belongs to the class of implicit enumeration methods. It finds the optimal solution by keeping the best solution found so far. If a partial solution cannot improve on the best, it is abandoned (pruned).

The degree to which the search space may be pruned depends strongly on the nature of the problem being solved. In the worst case, no subtrees are pruned and the branch-and-bound procedure visits all the nodes in the search space. The technique is not helpful if all solutions are about the same, or if the initial solutions are very poor and better solutions are only found gradually. In either case, the cost is similar to exploring the entire search space. It is important to understand the trade-off being made: the search space is being pruned with the overhead of performing the tests as each node is visited. The technique is successful only if the savings that follow from pruning exceed the additional execution time arising from the tests.

The branch-and-bound technique can easily be incorporated into the learn_rule() procedure of Figure 2.4. Such an extension is outlined in Figure 2.6; the changed/added steps are indicated in the caption.

The lattice introduced by the generalization relation allows the justifiable pruning of large parts of the search space, hence reducing the hypotheses considered during the search. For instance, one can prune a hypothesis h in the following cases (sketched in code after the list):

• when specializing: if h does not explain the positive evidence, then no specialization of h will explain the positive evidence;

• when generalizing: if h is inconsistent, then all generalizations of h will also be inconsistent.
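In code, the two pruning tests could look like the following minimal sketch, where covers_any is a hypothetical helper backed by the system's coverage test:

    def prune(h, direction, covers_any, Epos, Eneg):
        # Lattice-based pruning: the node and its whole subtree of the
        # refinement graph can be discarded without losing solutions.
        if direction == "specializing":
            # No specialization can regain lost positive coverage.
            return not covers_any(h, Epos)
        if direction == "generalizing":
            # No generalization can shed negative coverage.
            return covers_any(h, Eneg)
        return False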

As will be discussed in the following subsections, the search space can be further reduced using bias.


learn_rule_bb(k, E)

Input: number of rules to return (k) and a set of examples (E).
Output: a set consisting of the best k rules found.

1. Good = ∅
2. S = START_RULE
3. while stop criterion not satisfied do
4.    Pick = pickRule(S)
5.    S = S \ {Pick}
6.    if not prune(Pick) then
7.       NewRules = genNewRules(Pick)
8.       NewRules = {r ∈ NewRules | not prune(r)}
9.       evalOnExamples(E, NewRules)
10.      Good = Good ∪ {r ∈ NewRules | good_rule(r)}
11.      S = S ∪ (NewRules \ Good)
12.   end if
13. end while
14. return bestOf(k, Good)

Figure 2.6: An example of a generic learn_rule() procedure with branch-and-bound. The changed/added steps relative to Figure 2.4 are 6, 8, 11, and 12.


2.3.7 Bias

ILP is a complex problem due to the large and potentially infinite size of the hypothesis space L. Practical ILP systems reduce the size of the hypothesis space by imposing all sorts of restrictions, mostly syntactic, on candidate hypotheses. Another approach is to reduce the search space, i.e., reduce the subset of L that is generated and evaluated during the search process. Bias [Mit80] determines the hypothesis space and search space, and is central to addressing efficiency concerns.

The notion of inductive bias can be categorized into three different types [NRA+96]: language bias, search bias, and preference bias.

Language bias determines the hypothesis space by defining the hypothesis language (L). The hypothesis language can be restricted to clauses with at most n literals, to clauses without function symbols, etc.

Search bias determines the search space, i.e., which part of the hypothesis space is searched and how it is searched. It can take the form of a restriction or a preference bias. A restriction bias determines which hypotheses should be ignored, while a preference bias determines which hypotheses should be considered first. Some examples of search bias are the example selection criterion and the refinement operator used. Validation bias determines an acceptance criterion for the learning system, telling the system when to stop the search. For instance, one obvious stopping criterion is when a correct hypothesis is found. Even a biased hypothesis space can be too extensive, making a complete search nonviable (from a computational point of view). Several types of bias, such as preference criteria, have been studied in order to try to avoid a complete search.

A bias is declarative if it is explicitly represented. The declarative representation of bias may be achieved by the use of languages that allow bias specification, or by configurable generic methods that allow both the specification and the implementation of bias. A declarative representation of the bias is required so that bias setting and shifting can be easily performed by a user or by the system. A shift of bias occurs when the language bias is changed by the ILP system, which may be necessary when there exists no solution within a certain language bias. Declarative bias may be used by ILP systems to be more adaptable to particular learning tasks. For an overview of declarative bias in ILP see [NRA+96].

A study of the impact of several bias constituents on the size of the search space is reported in [Tau94]. A conclusion of the study is that not all combinations of bias constituents are useful. If the bias is weak (not too restrictive) then the search space may become so large that the search becomes computationally intractable. If the bias is too strong, i.e., constrains the hypothesis language considerably, the language may be incapable of representing the solution. For instance, if the search is restricted to theories containing clauses of up to 4 literals, but all correct theories contain clauses with 5 or more literals, no solution will be found. Hence, there is a trade-off between the efficiency of an ILP system, which results from having a strong bias, and the quality of the theory it comes up with. It is therefore important to have a balanced bias (not too weak and not too strong), since an inappropriate bias may prevent the ILP system from finding the desired hypothesis.

2.3.8 Hypotheses Evaluation Measures

Several evaluation measures can be applied to estimate the quality of the hypotheses generated [LFZ99, FF03]. Such measures can provide useful support for interpreting and ranking the hypotheses. The diversity of measures is a consequence of the variety of learning tasks and applications. Therefore, different methods of evaluation are sought and applied according to the problem addressed.

The computation of evaluation measures often requires knowing the correct classification and the predicted classification for each class. With these values it is possible to build a confusion matrix that lists the correct classification against the predicted classification for each class. Each column of the matrix represents the instances in an actual class, while each row represents the instances in a predicted class. The number of correct predictions for each class falls along the main diagonal of the matrix (true positives and true negatives). The incorrect predictions can be of two types: a false negative is a positive example predicted as negative, whilst a false positive is a negative example predicted as positive. As an example, consider the problem of classifying a person as having an illness (positive) or not having the illness (negative). A false negative occurs when a person is diagnosed as healthy when in fact ill, whilst a false positive occurs when a person is healthy but is diagnosed as ill. The benefit of using a confusion matrix is that it is easy to see if and how the learning system is confusing two classes. Table 2.2 presents a generic confusion matrix for a two-class (binary) classification problem.

Table 2.3 summarizes some of the rule evaluation measures used in ILP. For a more in-depth discussion of the subject we refer to [LFZ99, FF03].

                            Class Positive (C+)     Class Negative (C−)
Prediction Positive (P+)    True Positives (TP)     False Positives (FP)
Prediction Negative (P−)    False Negatives (FN)    True Negatives (TN)

Table 2.2: Confusion matrix for a two-class (binary) classification problem.

The two most often used measures are coverage and rule accuracy. There are at least two different notions of coverage. One notion, presented in Table 2.3, defines coverage as the fraction of instances covered (predicted positive) by the rule [LFZ99]. Another definition can be found in [Sri03], where coverage is defined as the difference between true positives and false positives (i.e., TP − FP).

Name                 Formula
Coverage             P+ / (C+ + C−)
Rule Accuracy∗       TP / P+
Rule Set Accuracy    (TP + TN) / (C+ + C−)
Sensitivity†         TP / C+
Specificity          TN / C−
F-measure            2 · (recall · precision) / (recall + precision)

Table 2.3: Evaluation measures.
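A minimal sketch computing the measures of Table 2.3 from the confusion-matrix counts of Table 2.2 (recall that C+ = TP + FN and C− = TN + FP):

    def evaluation_measures(tp, fp, tn, fn):
        p_pos = tp + fp            # P+: instances predicted positive
        total = tp + fp + tn + fn  # C+ + C−: all instances
        precision = tp / p_pos     # rule accuracy
        recall = tp / (tp + fn)    # sensitivity
        return {
            "coverage": p_pos / total,
            "rule_accuracy": precision,
            "rule_set_accuracy": (tp + tn) / total,
            "sensitivity": recall,
            "specificity": tn / (tn + fp),
            "f_measure": 2 * (recall * precision) / (recall + precision),
        }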

Rule accuracy, also known as precision, can be used to measure the reliability of the rule in the prediction of positive cases, since it measures the correctness of the returned results. It is strongly dependent on the class distribution in the dataset rather than on the characteristics of the examples, and it ignores differences between error types.

Accuracy (rule set accuracy) reflects the overall correctness of the model (set of rules), and the overall error rate is (1 − accuracy). If the two types of errors, i.e., false positives and false negatives, are not to be treated equally, a more detailed breakdown of the error rates becomes necessary.

Sensitivity, also known as recall, is a measure frequently used in medical applications that measures the fraction of actual positives correctly classified. In medical terms, maximizing sensitivity means detecting as many ill patients as possible. Specificity is also a measure frequently used in medical applications; it can be interpreted as the recall of negative cases. Sensitivity and specificity describe the true performance with greater clarity than accuracy, but they also have disadvantages. For a particular hypothesis they represent two measures, one for the positive cases and the other for the negative ones. Ideally, one wants both precision and recall to be one. Unfortunately, improving recall and improving precision at the same time is often difficult to achieve: efforts to improve one often result in degrading the other. Depending on the application, different trade-offs can be sought. The F-measure is an evaluation measure that combines precision and recall.

∗Rule accuracy is also known as precision (in information retrieval), confidence, and reliability (in the prediction of positives).
†Sensitivity is also known as recall (recall of positive cases) in information retrieval.

2.3.9 Matching Hypotheses

The computation of the quality of a hypothesis, using any of the measures mentioned above, involves knowing how many positive and negative examples it explains, i.e., how many examples the hypothesis covers (TP and FP). Generally, a hypothesis is a (set of) clause(s). The overall time needed to compute how many examples a hypothesis explains depends primarily on the cardinality of E (i.e., E+ and E−) and on the effort required to match each example against the hypothesis, a (set of) clause(s), and the background knowledge. Thus, scalability problems may arise when dealing with a great number of examples and/or when the computational cost of evaluating a rule is high.

In general, matching (or testing) an example against a clause consists in finding a substitution such that the body of the clause is true given the example and background knowledge. An often used approach in ILP to match hypotheses is to use logical querying. The logical querying approach to clause matching involves the use of a Prolog engine to evaluate the clause with the examples and background knowledge. The execution time to evaluate a query depends on the number of examples, the number of resolution steps, and the execution time of individual literals. Prolog engines often use SLDNF∗ resolution to prove that the query is a consequence of the data (program). Query execution using SLDNF resolution grows exponentially with the query length [Str04] (number of literals in the query). Hence, evaluating a single example can take a long time.

Another approach to test examples on hypotheses is to perform the tests in a database. In Section 2.5, this approach and other techniques to improve the efficiency of hypotheses matching are described.

∗Abbreviation of linear resolution for definite clauses with a selection function, with negation as failure.

2.3.10 MDIE

Mode-Directed Inverse Entailment (MDIE) [Mug95] uses inverse entailment together with mode restrictions as the basis to perform induction. MDIE is exploited by several ILP systems and algorithms (e.g., [MF01, Sri03, Cam00, AF99, OdCDPC05]). The original MDIE algorithm belongs to the family of covering algorithms. The main difference to the generic covering algorithm presented in Section 2.3.1 resides in the learn_rule() procedure (step 4 of Figure 2.3). In MDIE the learn_rule() procedure executes the following operations:

1. Pick an example e from E+ (the seed).

2. Build a most specific clause (or bottom clause) ⊥e that entails the selected seed example. ⊥e is a clause that explains the example e relative to the background knowledge B (and H if the target predicate is recursive). The bottom clause∗ is usually a definite clause with several literals, i.e., it has the form ⊥e :- b1, b2, . . ., where the bi are ground consequences of B ∧ e. Since, in general, ⊥e can have infinite cardinality, the ground consequences are derived from B using a depth-bound proof procedure for some selected depth.

3. Find the best consistent rule(s) more general than e by performing a general-to-specific† search in the space of rules bounded below by ⊥e. The bodies of the clauses generated during the search are subsets of the literals of ⊥e.

Note that constraints are imposed on steps 2 and 3 in order to ensure that the algorithm terminates. The great advantage of using a bottom clause is that it bounds (anchors) the search lattice from below.

∗In general, a bottom clause can also be constructed as the relative least general generalization (RLGG) of two (or more) examples [MF92] with respect to the given background knowledge, or as the most specific resolvent of an example [Mug91] with respect to the given B.
†Although it is usual to perform a general-to-specific search, other directions may be pursued (see e.g., [OdCDPC05]).


Example 4 (Bottom clause)

Consider the set of examples and background knowledge given in Table 2.1. To build the bottom clause we start by picking a seed (positive) example: "henry is the grandfather of jane". From this example we can deduce the following:

⊥1 = grandfather(henry,jane) :- male(henry), father(henry,john),
     parent(henry,john), male(john), parent(john,jane),
     father(john,jane), female(jane).

Since the goal is to generalize, the bottom clause is variabilized by transforming the arguments of the predicates in the above clause into variables:

grandfather(X,Z) :- male(X), father(X,Y), parent(X,Y), male(Y),
     parent(Y,Z), father(Y,Z), female(Z).

where X = henry, Y = john, and Z = jane denote the constants they replace.

2.4 Randomized Algorithms

Randomized algorithms [MR95] make random choices during execution in the hope of achieving good performance in the "average case". The solutions produced by a randomized algorithm may or may not be the same as those produced by a deterministic algorithm. The so-called Las Vegas algorithms produce solutions that are correct (or optimal), but the time to obtain them is variable. On the other hand, Monte Carlo algorithms produce solutions that may not be correct (or optimal). However, by performing several or repeated random choices, the error (distance to the optimal solution) may be reduced at the cost of increasing the execution time.

In a sense, many ILP systems are Monte Carlo algorithms when using a covering algorithm with random example selection (for building bottom clauses in MDIE or RLGG based algorithms). Therefore, ILP systems may produce different theories depending on the selection order of the examples, although the same data and settings are used. Moreover, not only may the solutions be different, but the runtime may also exhibit large variability [vSP02].

This section succinctly introduces two randomized search algorithms for ILP that will be referred to later on.


2.4.1 Stochastic Clause Selection

Stochastic clause selection [Sri00] restricts the search space by sacrificing optimality. It consists of randomly selecting a fixed-size sample of clauses from the search space with a high probability of containing a good clause. This scheme has been shown to perform well in the Aleph system, even when compared to complex search methods [Sri00]. The quality of the solutions was comparable to the ones found by other search methods, but the time taken to find them was considerably lower.

The key idea of the algorithm (outlined in Figure 2.7) is to look for a clause that is, with high probability (for some minimum value α), in the top 100 × k percentile of clauses of the total ordering. The question is how many clauses must be drawn from the hypothesis space to obtain at least one good enough clause with high probability. In [Sri00] the formula

n ≥ ln(1 − α) / ln(1 − k)

is used to compute the number of clauses (n) that must be drawn for the probability of obtaining one in the top 100 × k% to be at least α. It is assumed that the sampling of clauses is done with replacement, an approximation that is good for large search spaces. It is as if all clauses were placed in an urn and then randomly selected (and then put back again). The formula only holds if we can guarantee that the clauses are obtained by uniform random selection.

Example 5

Suppose that we want to find a clause for a given problem that is in the top 1% (k = 0.01) with a probability of 99.9% (α = 0.999). Using the formula given above, the value of n would be 688, i.e., in order to obtain a good enough clause with a probability of 99.9% we would need to draw (randomly generate) 688 clauses.
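The bound of Example 5 can be checked directly:

    import math

    def scs_sample_size(alpha, k):
        # Clauses to draw so that, with probability at least alpha,
        # one of them is in the top 100*k percentile.
        return math.ceil(math.log(1 - alpha) / math.log(1 - k))

    print(scs_sample_size(0.999, 0.01))   # 688, as in Example 5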

It is important to note that this algorithm softens the goal of finding the best clause, i.e., the goal is transformed into finding a clause in the top 100 × k percentile of clauses with high confidence. However, even a solution in the top k% may be too far away from the best, i.e., a clause that is in the top k percentile of all clauses may still be significantly worse than the best one.

As pointed out in [Sri00], although the search goal is softened, from a practical point of view, due to time and computational constraints, a solution in the top k percentile is what is often targeted. However, if a true best solution is required, this technique can be used to find solutions "close" to the optimum, which are later used as starting points of searches using other exhaustive search methods. This is the approach followed by the next technique.


learn_rule_scs(nr, E, C, α, k)

Input: maximum number of rules to return (nr), a set of examples (E), constraints (C), and the probability α of a rule being in the top 100 × k percentile.
Output: a set consisting of the best nr rules found.

1. |L̂| is an estimate of the number of elements in L
2. n = ln(1 − α) / ln(1 − k)
3. N = {r1, . . . , rn}, n numbers randomly selected without replacement from 1 to |L̂|
4. Clauses = {c1, . . . , cn}, where each ci ∈ Clauses is obtained by mapping the number ri into a clause in L
5. evalOnExamples(E, Clauses)
6. return bestOf(nr, Clauses)

Figure 2.7: A high-level description of a learn_rule() procedure that performs stochastic clause selection.

what is often targeted. However, if a true best solution is required, this technique canbe used to �nd solutions �close� to the optimum, that are later used as starting pointsof searches using other exhaustive search methods. This is the approach followed bythe next technique.

In our work, the clauses generated by the ILP system are subsets of literals drawn from a most specific definite clause ⊥. An additional constraint is that each clause must be in L. To be able to uniformly and randomly generate a clause, it is useful to have a mapping between the elements of L and N (where N denotes the subset of natural numbers 1, . . . , |L|), so that by randomly selecting a number one randomly selects a clause. The tricky part of the randomized algorithm lies in devising a procedure for uniform random sampling of clauses from the hypothesis space without enumerating all its elements.

In [Sri00] a procedure is described that splits the process of randomly generating a clause into two phases. First, the clause length l is randomly selected (using the distribution of clauses by length given by the user or automatically computed [Sri00]). Then, the first clause of length l not drawn before is randomly drawn.
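The following Prolog sketch illustrates the two-phase idea. It is our own simplification with assumed helper names, not the exact procedure of [Sri00]: a length is drawn from a discrete distribution given as a list of Length-Probability pairs (assumed to sum to 1), and then a random subset of body literals of that length is drawn from the bottom clause. The sketch ignores the bookkeeping needed to avoid drawing the same clause twice, as well as the language constraints that make a subset a legal clause in L, which is precisely the hard part discussed next.

:- use_module(library(lists)).
:- use_module(library(random)).

% random_clause(+Head, +BottomBody, +LengthDist, -Clause)
random_clause(Head, BottomBody, LengthDist, (Head :- Body)) :-
    sample_length(LengthDist, L),
    random_subset(L, BottomBody, Lits),
    literals_to_body(Lits, Body).

% Phase 1: inverse-transform sampling over the length distribution.
sample_length(Dist, L) :-
    random(R),                                % R uniform in [0,1)
    cumulative_pick(Dist, R, L).
cumulative_pick([Len-P|_], R, Len) :- R < P, !.
cumulative_pick([_-P|Rest], R, Len) :- R1 is R - P, cumulative_pick(Rest, R1, Len).

% Phase 2: draw N distinct literals, uniformly, from the pool.
random_subset(0, _, []) :- !.
random_subset(N, Pool, [X|Xs]) :-
    length(Pool, Len), Len > 0,
    random(F), I is floor(F * Len),           % random index 0..Len-1
    remove_at(I, Pool, X, Rest),
    N1 is N - 1,
    random_subset(N1, Rest, Xs).
remove_at(0, [X|Xs], X, Xs) :- !.
remove_at(I, [Y|Ys], X, [Y|Rest]) :- I > 0, I1 is I - 1, remove_at(I1, Ys, X, Rest).

% Build a conjunction from a list of literals.
literals_to_body([], true) :- !.
literals_to_body([L], L) :- !.
literals_to_body([L|Ls], (L, Body)) :- literals_to_body(Ls, Body).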

Note that the size of the set L is not known without enumerating all clauses; therefore, for practical reasons, it is estimated instead [Sri00]. It may happen that the number of legal clauses is much lower than the estimated one (e.g., due to the use of bias). This,


in turn, makes the job of randomly generating a clause of a given length l more difficult than using a specialization or generalization refinement operator, because it becomes harder to find a random subset of the bottom clause with l literals that corresponds to a clause in L when most subsets of length l do not belong to L. In practice, an ILP system spends more time generating clauses randomly than if it generates clauses using a downward refinement operator (as shown in Figure B.2).

learn_rule_rrr(E, C, maxtries, maxtime)
Input: A set of examples (E), constraints (C), an upper bound on the number of rapid searches performed (maxtries), and an upper bound on the time that a rapid search may take (maxtime).
Output: A set with the acceptable rule found.

1. tries = 1
2. while tries ≤ maxtries do
3.   select a random clause c0 from L
4.   searchtime = 0
5.   while searchtime < maxtime and an acceptable clause c is not found do
6.     perform an exhaustive radial search starting at c0
7.   end while
8.   if c was found then return {c}
9.   tries = tries + 1
10. end while
11. return ∅

Figure 2.8: A high-level description of a procedure to perform randomized rapid restarts search (RRR).

2.4.2 Randomized Rapid Restarts

The usual approach to conducting the search is to perform either a top-down or a bottom-up search. However, starting the search with a different clause, from the interior of the lattice, may reduce the time needed to find a clause with the desired properties. The Randomized Rapid Restarts (RRR) [vSP02] search algorithm performs an exhaustive search up to a certain point and then, if a solution is not found, restarts at a different part of the search space. With this approach the search algorithm may avoid being trapped in a very costly path and exploits the high chance of finding a better path. The application of RRR in two applications yielded a drastic reduction of the search times at the cost of just a small loss in predictive accuracy [vSP02].

The RRR algorithm, outlined in Figure 2.8, proceeds as follows. It performs several (maxtries) short searches whose time is bounded by maxtime. Both maxtries and maxtime are user-defined parameters. A recent study has shown that a good value for the number of restarts is 100 [vSP04] and pointed out that there is no practical way of determining the optimal maxtime. Each search begins by randomly selecting a (starting) clause and then performing a deterministic radial best-first search∗.
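The restart loop of Figure 2.8 maps naturally onto Prolog. The following is a minimal sketch of the control structure only; random_start_clause/2 and bounded_radial_search/6 are hypothetical helpers (the latter is assumed to fail when its time budget expires), not April's actual predicates:

% rrr(+Examples, +Constraints, +Tries, +MaxTime, -Result)
rrr(_, _, 0, _, []) :- !.                  % maxtries exhausted: return the empty set
rrr(E, C, Tries, MaxTime, Result) :-
    random_start_clause(C, C0),            % pick a random starting clause from L
    statistics(runtime, [T0, _]),          % start the clock (milliseconds)
    (   bounded_radial_search(E, C, C0, T0, MaxTime, Clause)
    ->  Result = [Clause]                  % an acceptable clause was found
    ;   Tries1 is Tries - 1,               % time ran out: restart elsewhere
        rrr(E, C, Tries1, MaxTime, Result)
    ).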

2.5 Improving the Efficiency of ILP Systems

A problem with ILP systems is their long running times. There are several reasons to improve the efficiency of ILP systems. First, efficient ILP systems may be applied to larger and, therefore, a wider spectrum of problems. Second, one of the challenges for ILP is the improvement of human-computer interaction [PS03]. To this end, ILP systems need to be more interactive, which, in turn, means that they should be fast in order to make real-time interaction practical. Third, fast ILP systems are desirable because it is common to perform several runs to obtain good parameter settings and/or to produce model statistics (e.g., using cross-validation [Koh95]).

In this section we survey a wide range of techniques to improve the efficiency and scalability of ILP systems. Understanding which techniques, or combinations of techniques, contribute the most is quite a difficult task and is still an open question. The reason for this is that the techniques are scattered among a great number of ILP systems†. Hence, in order to make a fair comparison one has to implement and integrate the techniques on the same ILP engine (which can be a long engineering task). Next, we briefly outline the proposals and results reported that (attempt to) improve the efficiency of ILP systems.

We start by classifying the techniques regarding their correctness. In this context, a correct technique should be understood as one yielding the same results as some reference algorithm that does not employ the technique. A partially correct technique does not preserve correctness, but still gives results that are, with high probability, similar (in quality) to the results produced by the reference technique. Finally, a technique that is neither correct nor partially correct is termed not correct.

∗In [vSP02] the heuristic used was the number of positives covered minus the number of negatives covered. In our implementation, which we discuss in Chapter 3, it can be user defined.
†Srinivasan, in a presentation at the ILP 2005 conference, pointed out that around 100 ILP systems have been developed to date.

We further classified them into the following five main approaches:

• Optimization of algorithms or programs includes the use of efficient data structures, special improvements on the underlying platform where the ILP system is implemented (e.g., the underlying Prolog engine, if the ILP system is implemented in Prolog), the parallelization of ILP systems, and other engineering practices.

• Data handling approaches involve the representation and manipulation of data that, in some way, affects efficiency. Techniques that follow this approach can perform some kind of data partitioning (breaking the data into subsets and then learning from one or more subsets, possibly combining the results in the end), efficient data manipulation (such as the one used in learning from interpretations), or learning directly from databases.

• Reduction of the hypothesis space includes a wide variety of techniques that attempt to reduce the hypothesis space, therefore improving efficiency by reducing the number of hypotheses generated and evaluated.

• Faster hypotheses evaluation includes techniques that, in some way, attempt to reduce the time to evaluate hypotheses.

• Search algorithms includes search techniques that may be used to improve efficiency.

Table 2.4 summarizes the approaches and general methods to improve efficiency. In many cases the techniques can be exploited simultaneously. Although the survey is not exhaustive, we believe it is representative of the current state of the art on techniques that attempt, or have been proven, to have an impact on the efficiency of ILP systems. The methods are surveyed in some detail in the following subsections. We start with the algorithms/programs optimizations approach.

Approach                           Method
Algorithms/Systems Optimizations   Efficient data structures
                                   Algorithm optimizations
                                   Improvements on the underlying framework
Data handling                      Data representation
                                   Data partitioning
                                   Using Database Management Systems
Reduction of the                   Language bias
hypothesis space                   Layering the hypothesis space
                                   Feature selection
Faster hypotheses                  Transformation/optimizations
evaluation                         Approximate evaluation
                                   Storing results
Search algorithms                  Local search
                                   Randomized searches

Table 2.4: Approaches and methods for improving efficiency.

2.5.1 Algorithms/Program Optimizations

Performing optimizations in algorithms or at the implementation level can lead to good results. In this spirit, several proposals have been made, ranging from algorithm optimizations, the use of efficient data structures, and optimizations on the underlying inference engines, to the parallelization of ILP algorithms. Table 2.5 enumerates several proposed techniques for optimizing ILP algorithms.

Efficient Data Structures

In [FRCS03] the authors propose and evaluate two efficient data structures: the Trie, to represent lists and clauses, and the RL-Tree, to represent clause coverage. The evaluation of their performance showed a substantial reduction in memory usage without incurring extra execution time overheads.

Framework Improvements

Improvements of the underlying framework include improvements on fundamental libraries used by the ILP system or on engines (e.g., Prolog engines). Obviously, changes at this level should only affect performance, thus all techniques under this approach are correct.


General Method              Example Technique
Efficient Data Structures   Tries
                            RL-Trees
Algorithm Optimizations     Discretization
                            Optimizing for specific problems
Framework Improvements      Better Prolog indexing
Parallelism                 Parallel exploration of independent hypotheses
                            Parallel exploration of the search space
                            Parallel coverage tests
                            Data parallelism

Table 2.5: Algorithms or implementation optimizations.

It is usual for ILP systems to be developed on top of some logic programming system. The major reason to do so is that the inference mechanism implemented by Prolog engines is fundamental to most ILP learning algorithms. This way, ILP systems benefit from the extensive work that has taken place on improving the performance of Prolog engines. In [FCR+] the impact of recent optimizations in a logic programming system is empirically evaluated on an ILP system. The proposed optimizations significantly reduced the execution time of the ILP system.

Several ILP systems use the θ-subsumption test as the coverage test. In [MS04] a novel θ-subsumption algorithm is proposed that achieves good performance gains and scalability.
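For reference, a clause C θ-subsumes a clause D if there is a substitution θ such that Cθ ⊆ D. A naive test (a minimal sketch of the definition itself, not the optimized algorithm of [MS04]) can be written in Prolog by treating clauses as lists of literals:

:- use_module(library(lists)).

% subsumes_clause(+C, +D): C θ-subsumes D (both are lists of literals).
subsumes_clause(C, D) :-
    \+ \+ ( copy_term(D, D1),
            numbervars(D1, 0, _),     % freeze D's variables as constants
            map_literals(C, D1) ).    % bindings are undone by the double negation

% Every literal of C must unify with some literal of D under one substitution.
map_literals([], _).
map_literals([L|Ls], D) :-
    member(L, D),
    map_literals(Ls, D).

For example, subsumes_clause([p(X,Y)], [p(a,b), q(b)]) succeeds with θ = {X/a, Y/b}. This naive test is worst-case exponential, which is exactly why optimized algorithms such as [MS04] matter.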

Algorithm Optimizations

Work has been reported describing optimizations of previously existing algorithms or systems. For instance, the FFOIL [Qui96] system is a FOIL [Qui90] inspired system that is specialized in learning functional relations. More specifically, FFOIL is specialized to learn functions with a single output argument, and it generally requires less time for learning than the original FOIL. FOIL-D [BO04] is another example of an extension of the FOIL [Qui90] system, one that interfaces directly with a relational database system.

TILDE-LDS [BRJD99] is an extension of the TILDE system [BR98] that loads the examples one at a time. TILDE-LDS was empirically compared against the original system, which loads all data into memory [BRJD99], and the results showed that TILDE-LDS scales up to large datasets (with 100,000 examples), but it is slower than the original.

An optimization for building the bottom clause [Mug95] is proposed in [TMM03]. It consists in lazily constructing the bottom clause (using a heuristic). The authors claim that their system, with the proposed technique, was significantly more efficient than Aleph [Sri03] and mFoil [Dž93] on a single dataset.

Another technique that has been shown to improve efficiency in ILP is discretization [BR97]. This technique helps in handling continuous numerical values in the data by dividing the continuous domain of an attribute into several subsets, which can then be used as discrete values. The technique improved the efficiency of the TILDE system and the quality of the hypotheses found [BR97].
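As a simple illustration of the general idea (equal-width binning, our own example; the procedure of [BR97] chooses split points more carefully), the following Prolog sketch computes the cut points that divide a continuous attribute into N intervals:

:- use_module(library(lists)).

% equal_width_cuts(+Values, +N, -Cuts): the N-1 cut points over the value range.
equal_width_cuts(Values, N, Cuts) :-
    min_list(Values, Min),
    max_list(Values, Max),
    Width is (Max - Min) / N,
    N1 is N - 1,
    findall(C, (between(1, N1, I), C is Min + I * Width), Cuts).

For example, equal_width_cuts([1.0, 4.0, 9.0], 2, Cuts) yields Cuts = [5.0], splitting the domain [1.0, 9.0] into two discrete intervals.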

Parallelism

Parallelism provides an attractive alternative solution for improving efficiency, as it may both significantly decrease the whole execution time and allow the original data to be split among different processors/computers, therefore improving efficiency and increasing scalability. Here, we classify the methods to parallelize ILP systems described in the literature into four main classes: parallel exploration of independent hypotheses [OM99]; parallel exploration of the search space [DDR95, ONM00, OM99, Wie03]; parallel hypothesis evaluation [MISI98, Kon03]; and parallel execution of an ILP system over a partition of the data [OM99, GPK03]. A thorough review of the state of the art on parallelism in ILP is presented in Chapter 4.

2.5.2 Reducing the Hypothesis Space

The size of the hypothesis space (|L|) has a strong impact on the performance of an ILP system. Therefore, reducing its size usually improves performance. Obviously, one cannot restrict L too much, otherwise good hypotheses may be lost. Several approaches have been proposed to improve the efficiency of ILP systems by reducing the hypothesis space. Here we distinguish three methods (see Table 2.6). The language bias determines the hypothesis space L of possible hypotheses. Layering the hypothesis space attempts to reduce L by imposing some ordering on it and then incrementally considering larger subsets of L, but eventually the whole L is considered (e.g., [SKB03, Cam02]). The third method actually attempts to reduce L by performing a kind of relational feature selection (e.g., [SK05, BDR96]).

General Method                  Example Technique
Language bias                   Mode and type declarations
                                Grammars
                                Redundancy elimination
Layering the hypothesis space   Ordering the background knowledge
                                Incremental language level search
Feature selection               Bottom clause literal selection
                                Propositional feature selection

Table 2.6: Methods for reducing the hypothesis space.

Language bias

As seen in Section 2.3.7, one way for ILP systems to reduce the size of L is through the use of language bias, such as types and input/output modes [Mug95], grammars/schemes describing the hypothesis space [Coh93, JB95], etc. Some examples of (declarative) language bias are:

• Admissible vocabulary: The clauses generated may only contain predicates belonging to a predefined set called the vocabulary. One approach to define the vocabulary is through the use of determination declarations [DR87].

• Determinacy: Introduced by Muggleton and Feng [MF92], determinacy restricts the variables in the body of definite clauses to have exactly one instantiation. This means that for every example e and hypothesized clause c there must exist at most one valid substitution for the variables in the body of c. j-determinate clauses are constrained to have a maximum of j variables in any literal. ij-determinate clauses are further restricted so that each variable appearing in the clauses has a maximum depth of i.

• Types and input/output modes [Mug95]: The type and input/output mode declarations supply information concerning the arguments of each predicate that may appear in the hypotheses. There are two advantages in the use of modes. First, if the background predicates and their modes are specified when learning a single predicate, the learner can guarantee termination by ensuring that the queries it generates are mode-conform. Secondly, the search can be optimized by the learner when answering queries. The type declarations are useful because the learner needs only to consider the subset of the hypothesis space that is type-conform. Although mode declarations are usually provided by the user, they can be automatically extracted from the background knowledge [MS95].

• Linked clause: A clause is linked if all its variables are linked. A variable v1 is linked if it occurs in the head of a clause, or if there is a literal l in the clause that contains the variables v1 and v2 (v1 ≠ v2) and v2 is linked in l. This restriction helps to avoid some potentially useless literals in a clause (e.g., p(X) :- l(Z)). For instance, p(X) :- l1(X,Y), l2(Y,Z) is a linked clause.

• Antecedent Description Grammar: An Antecedent Description Grammar [Coh94] is a kind of definite clause grammar (DCG) used to describe the set of well-formed hypotheses. This is a general approach, but it may not be easy to use since the grammars may become very large [Coh93], hence being hard to write down and to understand.

Another method takes advantage of prior expert knowledge to reduce L. In [FCCS04] the authors propose a classification of redundancy in L and show how expert knowledge can be provided to an ILP system to reduce it. The technique is correct (if a complete search is performed) and yields good results. However, it has the drawback that the declarations must be given manually by the user/expert (a task that can be tedious and error prone). A possible approach to cope with this problem would be the automatic generation of redundancy declarations (preferably in a pre-processing stage).

Layering the hypothesis space

Layering the hypothesis space techniques do not really reduce L, but rather use incrementally larger subsets of L. Eventually, the whole L may be considered, if a model is not found using the smaller subsets.

Srinivasan et al. [SKB03] proposed a partially correct technique that explores human expertise to provide a relevance ordering on the set of background predicates. The technique follows a strategy of incrementally including sets of background knowledge predicates in decreasing order of relevance and has been shown to yield good results.

Page 74: Nuno Alberto Paulino da Fonseca - cracs.fc.up.ptcracs.fc.up.pt/~nf/pessoal/pubs/naf-phd.pdf · Nuno Alberto Paulino da Fonseca ... dissertation related to the exploitation of ablingT

72 CHAPTER 2. ON INDUCTIVE LOGIC PROGRAMMING

Another technique that incrementally increases the hypothesis space is the Incremental Language Level Search (ILLS) [Cam02]. It consists of using an iterative search strategy that, starting from 1, progressively increases the upper bound on the number of occurrences of a predicate symbol in the generated hypotheses. Substantial efficiency improvements were reported on several ILP applications [Cam02]. The technique is correct if a complete search is performed.

Feature Selection

A common method for reducing the size of the hypothesis space in propositional learning is the use of (relevant) feature subset selection (FSS) [JKP94]. It consists in finding a "good" set of features (attributes) under some objective function (e.g., predictive accuracy). The problem of feature selection can be seen as a search problem [Lan94], where each state in the search space specifies a subset of the possible features. Each subset of features needs to be evaluated, independently of the search strategy used to traverse the space of feature sets. There are several approaches to the problem of feature selection, the filter method and the wrapper method being two of the best known [JKP94]. Filter methods select relevant attributes before starting the induction process. Wrapper methods generate a set of candidate features and run the induction algorithm on the training data (using only the candidate features) to evaluate the accuracy of the resulting model. Obviously, the wrapper approach is computationally expensive, since it invokes the learning algorithm multiple times. Techniques that rely on some kind of feature selection are partially correct, although the final models may be better than the ones found without feature selection [AM04].

The challenge in applying feature selection to ILP is the definition of what an attribute is in ILP. For instance, one can consider as an attribute a full relation, some instances of a relation, or simply some attributes of a relation. Another issue is the applicability of wrapper approaches in ILP, due to the efficiency problems of the systems.

Several techniques related to feature selection have been explored in the context of ILP. In [SK05] the authors studied the applicability of general feature selection methods to reduce the size of the hypothesis space by reducing the size of the most-specific clause. The technique works as follows: a most-specific clause is constructed and transformed into a bit vector representation; a random sample of clauses is generated based on the bottom clause; then, a vector representation of each clause and its quality is stored in a table of features (representing the clauses); a dimension reduction method is applied to the table of features to rank the features (literals of the most-specific clause); and, finally, the top x percent of the features are selected. In this way, the number of literals in the bottom clause decreases and, consequently, so does the size of the search space. The results presented show that a significant part of the search space can be discarded; however, the impact on the execution time (the overhead of performing feature selection) and on the final models is not reported.

A previous attempt to perform feature selection in ILP involved a change in the representation from FOL to propositional logic [AM02, AM04]. The proposed approach acts as a filter, preprocessing the data prior to model building, and outputs the examples with empirically relevant literals. Experiments on four applications showed improvements in both execution time and accuracy.

Another technique, for relation selection, proposed in [Coh95], consists in discarding tuples from relations that occur below a defined threshold in the training examples. The author did not report the efficiency impact of the technique.

2.5.3 Faster Hypotheses Evaluation

Searching the hypothesis space involves generating and evaluating hypotheses with respect to a dataset. Several techniques have been proposed to improve the efficiency of hypotheses evaluation. Table 2.7 summarizes some of these techniques. They can be grouped into three general methods: i) transform the hypotheses in order to optimize their evaluation; ii) perform an approximate evaluation, instead of computing an exact evaluation on the available data; and iii) store computations for later reuse, therefore speeding up hypotheses reevaluation. Another approach would be to improve the testing process itself, i.e., the query execution system used (inference engine or Database Management System).

Transformations/Optimizations

The evaluation of a hypothesis can be optimized by transforming the hypothesis into an equivalent one that can be executed more efficiently [CSC00, CSC+03]. An important characteristic of these techniques is that the transformations between equivalent hypotheses are correct. The transformation can be done at the level of an individual hypothesis [CSC00, SB03] or at the level of sets of hypotheses [TUA+98, BDD+02].

General Method                Example Technique
Transformation/optimization   Individual hypothesis optimization
                              Optimizing sets of hypotheses
Approximate evaluations       Sample based evaluation
                              Neural network based evaluation
                              Lazy evaluations
Storing computations          Coverage Caching
                              Tabling

Table 2.7: Methods for improving the efficiency of the hypothesis evaluation.

In [CSC00] two simple transformations are proposed that convert a hypothesis into an equivalent one that can be executed more efficiently. The empirical evaluation showed that the transformations yield very good results, in particular when longer hypotheses were considered. The work was extended and two more transformations were later proposed [CSC+03]. The empirical evaluation of the transformations was made on real-world and synthetic datasets. The results showed that the overhead of performing the transformations was small, thus achieving good speedups. Another transformation, based on the reordering of dependent literals, was proposed and yielded good efficiency gains [SB03]. The idea was inspired by techniques used in the database community, and is based on the observation that hypotheses can be evaluated more efficiently if more "selective" literals are applied/evaluated first.
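To illustrate the kind of transformation involved (a simplified example of our own, in the spirit of [CSC00]; a/2, b/1, c/1 and d/1 are hypothetical background predicates): when testing coverage only the existence of a solution matters, so independent groups of body literals can be evaluated under once/1, avoiding useless backtracking between groups.

% Original hypothesis: the Y-literals and the Z-literals are independent given X.
h(X) :- a(X, Y), b(Y), c(X, Z), d(Z).

% Equivalent form for coverage testing: each independent group is proved
% at most once, so a failure of d/1 never re-enumerates a/2 and b/1.
h_transformed(X) :- once((a(X, Y), b(Y))), once((c(X, Z), d(Z))).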

A different, but also complete, method to speed up the evaluation of hypotheses explores the redundancy existing in the sets of hypotheses evaluated: similar hypotheses are generated and evaluated during a search. Query packs [BDD+02] are a strategy for executing sets of hypotheses that avoids the redundant computations. A technique similar to query packs is query flocks [TUA+98]. A flock is represented by a single query with placeholders for constants, and is equal to the set of all queries that can be obtained by instantiating the placeholders to constants. The main difference between the two techniques is that the set of queries in a pack is structurally less similar than the queries in a flock. The set of queries in a query flock have the same structure but differ in specific constants at certain positions. Empirical evaluation of both techniques yielded great improvements in execution time [TUA+98, DJV99, BDD+02].
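The intuition behind query packs can be seen in a small example of our own (simplified: real query packs use a dedicated disjunction operator that records which branches succeed, rather than plain ;/2). Two refinements sharing a prefix execute the shared prefix only once:

% Two similar hypotheses generated during the search:
%   h(X) :- a(X, Y), b(Y).
%   h(X) :- a(X, Y), c(Y).
% Packed form: the shared prefix a(X, Y) is computed once per binding,
% and the pack branches only on the differing literals.
h_pack(X) :- a(X, Y), ( b(Y) ; c(Y) ).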

Page 77: Nuno Alberto Paulino da Fonseca - cracs.fc.up.ptcracs.fc.up.pt/~nf/pessoal/pubs/naf-phd.pdf · Nuno Alberto Paulino da Fonseca ... dissertation related to the exploitation of ablingT

2.5. IMPROVING THE EFFICIENCY OF ILP SYSTEMS 75

Approximate Evaluation

Another method to reduce the execution time while evaluating a hypothesis is to perform some kind of approximate evaluation [SR97, Sri99, GSSB00, KSC01, DS04, BO04]. Stochastic matching [SR97], or stochastic theorem proving, was evaluated in the Progol [MF01] ILP system, yielding efficiency improvements, in one application, without sacrificing predictive accuracy or comprehensibility. This approach was further pursued [GSSB00], and the benefits of replacing deterministic matching with stochastic matching were clearly visible.

Another approach evaluates hypotheses on a random sub-sample of the data [Sri99]. The empirical evaluation of this technique on two applications has shown a significant reduction in the execution time while preserving the quality of the final models.
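A minimal sketch of this idea follows (our own helper names, not the procedure of [Sri99]; random_subset/3 is the predicate from the earlier sketch): coverage is counted on a random sample of the examples and scaled up to estimate the coverage on the full set.

% estimated_coverage(+Clause, +Examples, +SampleSize, -Estimate)
estimated_coverage(Clause, Examples, SampleSize, Estimate) :-
    random_subset(SampleSize, Examples, Sample),
    count_covered(Clause, Sample, 0, NCovered),
    length(Examples, NExamples),
    Estimate is NCovered * NExamples / SampleSize.  % scale up to the full set

count_covered(_, [], N, N).
count_covered(Clause, [E|Es], Acc, N) :-
    ( covers(Clause, E) -> Acc1 is Acc + 1 ; Acc1 = Acc ),
    count_covered(Clause, Es, Acc1, N).

% Plain intensional coverage test: the example unifies with the head and
% the body succeeds; the double negation discards the bindings.
covers((Head :- Body), Example) :-
    \+ \+ ( Example = Head, call(Body) ).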

Neural networks were used in [KSC01, DS04]. In [KSC01] the approximation was shown to be an alternative to exact evaluation as far as the quality of the hypotheses found is concerned, but the authors did not report the impact of the technique on efficiency. Another approximate evaluation technique that uses neural networks to reduce the computation time during the search for a hypothesis is described in [DS04]. The technique was empirically compared with the sampling technique described above [Sri99], in terms of the quality of the approximations, but the results were not conclusive.

In [BO04] the authors propose an approximate evaluation procedure based on histograms. The procedure was evaluated in the FOIL system, using three small applications, and the results showed that the quality of the solutions was equivalent to that of the solutions found by FOIL when using the default evaluation procedure, but at a lower computational cost.

Finally, another technique, called lazy evaluation∗ of examples [Cam03, FCR+], aims at speeding up the evaluation of hypotheses by avoiding the unnecessary use of examples in the coverage computations. The authors present three variants of lazy evaluation, some correct and others partially correct. Further details are given in Section 3.3.3. An evaluation of the three variants showed that one of them, lazy evaluation of negatives, reduces the execution time considerably [FCR+].

∗The term lazy evaluation is used in the sense of making the minimal computation to obtain useful information.


Storing results

Storing intermediate results during evaluation for later use (i.e., performing a kind of caching) can be a solution to reduce the time spent in hypothesis evaluation. The techniques that follow this method are categorized as improving the evaluation of the literals of a hypothesis, or as reducing the number of examples tested. All these techniques are correct and attempt to improve time efficiency at the cost of increasing memory consumption.

Materialization (or "extensionalization") of a predicate turns an intensionally defined predicate into an extensionally defined one. Obviously, this increases the memory requirements, but it is particularly useful when the solutions of the materialized predicates are computationally expensive to compute and there is only a limited (preferably small) number of solutions. ILP systems can extensionalize the predicates in the background knowledge automatically, either in a preprocessing stage or dynamically (lazily, during the ILP system's execution). In either case, the predicates to be unfolded should be carefully selected, to keep the memory requirements of the database at a tractable size. Usually it is up to the user to select the predicates to extensionalize, but it is quite feasible to automate the process.
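A minimal sketch of an eager materialization step is shown below (the predicate name is ours; a real implementation would also have to handle non-terminating or infinite predicates and predicates that are not declared dynamic, which this sketch does not):

:- use_module(library(lists)).

% materialize(+P/N): replace the intensional definition of P/N by the
% ground facts it computes.
materialize(P/N) :-
    functor(Goal, P, N),
    findall(Goal, call(Goal), Solutions),      % enumerate all solutions first
    abolish(P/N),                              % drop the intensional clauses
    forall(member(G, Solutions), assertz(G)).  % store the solutions as facts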

In fact, a preliminary application in the ILP context of a logic programming technique known as tabling [RFC05] was recently proposed and evaluated. Tabling performs a kind of dynamic, transparent, lazy extensionalization of predicates. The results showed that tabling can reduce the execution time at the cost of using large amounts of memory (a problem that needs to be solved to further explore tabling in ILP).
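In a tabling-enabled Prolog engine (e.g., YAP), activating tabling is a one-line declaration. In the sketch below (path/2 and edge/2 are hypothetical background predicates), answers to path/2 are stored in a table on first use and reused in subsequent coverage tests, instead of being recomputed:

:- table path/2.            % answers to path/2 are cached by the engine

path(X, Y) :- edge(X, Y).
path(X, Y) :- edge(X, Z), path(Z, Y).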

Another method to speed up hypothesis evaluation is the storage of results in order to later reduce the number of examples matched against a hypothesis. This can be done in top-down ILP systems by keeping the set of examples explained by a hypothesis (usually termed coverage lists), so that its refinements are only matched against the set of examples that matched the parent hypothesis. Consider a hypothesis cs generated by applying a refinement operator to another hypothesis cg. Let Cover(cg) = {e ∈ E | B ∧ cg ⊨ e}, where cg is a clause, B is the background knowledge, and E is the set of positive (E+) and negative (E−) examples. Since cg is more general than cs, then Cover(cs) ⊆ Cover(cg). Taking this into account, when testing the coverage of cs it is only necessary to consider the examples in Cover(cg), thus reducing the coverage computation time. Cussens [Cus96] extended this scheme by proposing a kind of coverage caching. The coverage lists are permanently stored and reused whenever necessary, hence avoiding the need to recompute the coverage of equivalent clauses. Coverage lists reduce the effort in coverage computation at the cost of significantly increasing memory consumption [FCSC03]. Therefore, efficient data structures should be used to represent coverage lists in order to minimize memory consumption [FRCS03].
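The following sketch (refinement/2 is a hypothetical refinement operator; covers/2 is as defined in the sampling sketch above) shows the essential saving: the coverage of a refinement is computed only over the coverage list of its parent, never over the full example set.

% coverage(+Clause, +Examples, -Covered): the subset of Examples covered.
coverage(_, [], []).
coverage(Clause, [E|Es], Covered) :-
    (   covers(Clause, E)
    ->  Covered = [E|Covered1]
    ;   Covered = Covered1
    ),
    coverage(Clause, Es, Covered1).

% Refine Cg and evaluate the refinement Cs on Cg's coverage list only,
% which is sound because Cover(Cs) ⊆ Cover(Cg).
refine_and_eval(Cg, CoverCg, Cs, CoverCs) :-
    refinement(Cg, Cs),
    coverage(Cs, CoverCg, CoverCs).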

In [BVM04] a caching strategy is explored to improve efficiency that: i) avoids the regeneration of parts of the search space; and ii) reduces the evaluation effort (by keeping coverage lists). This strategy was evaluated on a single domain, yielding gains in execution time ranging from 20% to 60%.

2.5.4 Data Handling

The approach followed to handle (represent, manipulate, and store) data can have a great impact on the performance of an ILP system. As depicted in Table 2.8, we consider three main methods for handling data, which include techniques for representing data, using subsets of data, and using database management systems to store and manipulate the data.

General Method                       Example Technique
Data representation                  Learning from interpretations
                                     Learning from entailment
Learning from subsets                Layered learning
                                     Subsampling
                                     Windowing
Learning from relational databases   Mapping tables to predicates
                                     Converting logical clauses to SQL statements

Table 2.8: Data handling methods.

Data Representation

There are two main settings in ILP that affect the way data is represented: learning from interpretations [Rae97] and learning from entailment [RD94]. The two settings differ mainly in the way the data is represented, i.e., in how examples and background knowledge are represented. In learning from entailment there is the assumption that several examples may be related to each other, so they cannot be handled independently. Therefore, there is no separation of the examples (apart from being positive or negative) nor of the background knowledge. Within the learning from interpretations framework for ILP this assumption is unnecessary, which allows existing ILP algorithms to scale up [BRJD99]. Each example is represented as a sub-database, i.e., a separate Prolog program encoding its specific properties as sets of interpretations, and the background knowledge is given in the form of another Prolog program.

Learning from interpretations is slightly less powerful than the learning from entailment setting [Rae97]. For instance, it has problems with learning recursive predicates. However, the setting is sufficient for most practical purposes and scales up better [BRJD99].
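As an illustration, in the model format used by TILDE-like systems (the exact syntax varies between systems; this Bongard-style example is hypothetical), each example is a small, self-contained set of facts together with its class:

begin(model(e1)).
pos.                 % the class of this example
circle(o1).
triangle(o2).
in(o1, o2).          % the circle is inside the triangle
end(model(e1)).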

Data Partitioning

In ILP some studies have also been made addressing the problem of learning from subsets of data. A procedure called layered learning constructs, in stages, increasingly correct theories [Mug93]. The procedure starts by using a small sample of the data to construct an approximately correct theory, which is improved in the following stages. The sample at each stage is a superset of the sample of the previous stage and is extended based on the errors of the previous theory. To the best of our knowledge, the technique's impact on the execution time of ILP systems was not studied.

Subsampling and logical windowing were studied in the ILP context by Srinivasan [Sri99]. Subsampling consists in repeating the holdout method several times; the estimated accuracy is derived by averaging the runs. Windowing is a well-known technique for learning from subsets of data. It tries to identify a subset of the original data from which a theory of sufficient quality can be learned. It has been shown [Qui93, Für98] that this technique increases the predictive accuracy and reduces the learning time. The empirical results of both sampling techniques showed a reduction in the execution time, while the theories obtained were comparable in predictive accuracy to those obtained without sampling [Sri99, Für97].

Data storage

ILP systems usually keep the data in Prolog databases, a consequence of being implemented in Prolog (e.g., [Sri03, RD97, RL95]) or of using Prolog libraries (e.g., [MF01]). However, in real-world problems the data is rarely in Prolog format. In fact, the data is often kept in relational database management systems (RDBMS). A trivial approach is to convert the database to a format acceptable to the ILP system. Another, more interesting, approach is to link ILP systems to relational databases [BDR96] by:

• Mapping predicates to database tables or views;

• Translating logical clauses into SQL statements.

Both schemes have the advantage of being more scalable, since the whole data does not need to be loaded by the ILP system. The mapping scheme is a more transparent solution for the designer of the ILP engine (if the system is implemented in a first-order language like Prolog). However, this scheme results in increased communication with the RDBMS, due to the many accesses made while evaluating a single clause. The second scheme reduces communication, at the cost of the ILP system not being able to learn recursive definitions. In the second scheme one can devise two ways of translating a hypothesis to SQL: i) transform the hypothesis into an SQL view; ii) transform the hypothesis into an SQL count statement. The second option reduces the communication to sending the SQL query to the RDBMS and receiving a number with the count value.
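For illustration, consider the hypothesis grandparent(X,Y) :- parent(X,Z), parent(Z,Y). Assuming a hypothetical schema in which the positive examples are stored in a table pos_example(arg1, arg2) and the parent relation in a table parent(parent, child), the second option could produce a count statement such as:

SELECT COUNT(*)
FROM pos_example e
WHERE EXISTS (SELECT 1
              FROM parent p1, parent p2
              WHERE p1.parent = e.arg1
                AND p1.child  = p2.parent
                AND p2.child  = e.arg2);

Only the resulting count travels back from the RDBMS to the ILP system.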

Several ILP implementations are coupled with relational databases: some use the mapping scheme, others transform the logical clauses into SQL queries [Mor97, BM96, SL96, BO04], and others use both [Web97]. The level of transparency is also quite variable, ranging from no transparency (e.g., the user manually defines the views for each literal that may appear in a hypothesis [Web97]) to complete transparency [BO04]. Despite the many reported works on linking ILP systems with relational databases, very little has been reported concerning the impact on efficiency of learning from a database.

2.5.5 Search Algorithms

The hypothesis space determines the set of possible hypotheses that can be considered while searching for a good hypothesis. Several approaches to reduce the hypothesis space were described above. The search algorithm, on the other hand, defines the order by which the hypotheses are considered and determines the search space (i.e., the hypotheses effectively considered during the search). A wide number of search techniques have been used in ILP systems, namely breadth-first search [Mug95], depth-first search [Sri03], beam search [Dž93, Sri03], heuristic-guided hill-climbing variants [QCJ93, Sri03], and simulated annealing [Sri03, SPR04], just to mention a few. The choice of one in detriment of another has several effects [RN03], namely on memory consumption, execution time, and completeness.


More advanced search techniques have also been exploited in ILP systems. A genetic search algorithm was proposed in [TNM00], but its impact on efficiency was not reported. Randomized search algorithms have also been evaluated in ILP systems (as described in Section 2.4). Probabilistically searching large hypothesis spaces [Sri00] restricts the search space by sacrificing optimality. The evaluation of the technique on three real-world applications showed reductions in the execution time without significantly affecting the quality of the hypotheses found. However, this approach has difficulties with applications with very few good hypotheses ("needle in a haystack" problems). Randomized rapid restarts [vSP02] combines (local) complete search with probabilistic search. It performs an exhaustive search up to a certain point (time constrained) and then, if a solution is not found, restarts at a randomly selected location of the search space. The application of the RRR technique in two applications yielded a drastic reduction of the search time at the cost of a small loss in predictive accuracy [vSP02].

2.6 Summary

In this chapter we introduced Inductive Logic Programming, including terminology, the problem description, and the stages of the process of solving the ILP problem. We described some algorithms and techniques more relevant to this work, such as Mode Directed Inverse Entailment (MDIE) and some randomized ILP algorithms. Finally, we presented a survey of the state of the art on improving the efficiency of ILP systems.

A conclusion from the survey is that, although a wide diversity of techniques has been proposed to improve the (sequential execution) efficiency of ILP systems, little is known about the impact of their combined integration in ILP systems. Understanding which techniques, or combinations of techniques, contribute the most is quite a difficult task and is still an open issue. The main reason is that the techniques are scattered among a large number of ILP systems; therefore, making a fair comparison requires implementing them on the same ILP engine, usually a long engineering task.


The only source of knowledge is experience.

Albert Einstein

3 The April ILP system

This chapter presents the April ILP system, focusing on its architecture and algorithm. We also evaluate April's performance and compare it with a well-established ILP system.

3.1 Introduction

A requirement to study the parallelization of ILP is to have an in-depth knowledge about a sequential ILP system that could be adapted for parallel execution. The sequential system should fulfill some requirements, namely it should be efficient and have a modular implementation. Efficiency is important since the process of evaluating a parallel algorithm involves comparing its running time against the running time of the "best" sequential algorithm/implementation that performs the same task. Therefore, the sequential ILP system should include and combine the maximum number of techniques that may contribute to minimizing the running time. A modular implementation of the sequential algorithm is important because it facilitates the job of developing a parallel version of the algorithm on top of the sequential one.

Thus, an obvious question is: which existing ILP system fulfills the requirements? Since the initial conceptual proposal of Inductive Logic Programming [Mug90] many ILP systems have been developed. In a presentation at the ILP 2005 conference, Ashwin Srinivasan pointed out that around 100 ILP systems have been developed to date. Thus, it is natural to find the many techniques that have been proposed to improve the efficiency of ILP systems (see Section 2.5 for an overview) scattered among many systems. Understanding which techniques, or combinations of techniques, contribute the most is therefore a rather difficult task, because one usually compares the techniques in different contexts (e.g., different ILP systems or programming languages). Moreover, the effects on performance of combining several techniques, or the impact of a new technique when used in conjunction with existing techniques, are often not studied. The Aleph [Sri03] ILP system is one system which includes a large number of techniques. Unfortunately, the Aleph system does not fulfill the modularity requirement.

Having not found an ILP system with the desired characteristics, we implemented a new ILP system named April. The main goals of the system are efficiency, flexibility, and scalability.

April aims to be an efficient system, i.e., to have a low memory usage and a low response time. To this end, it combines and integrates several techniques to maximize efficiency. Some of the techniques currently implemented in April are query transformations [CSC+03], tabling [RFC05], randomized searches [Sri00, vSP02], lazy evaluation of examples [Cam03], language level search [Cam02], coverage caching [Cus96, FCSC03], efficient data structures [FRCS03], relational database integration, and several parallel algorithms (described in the next chapters).

April aims to be flexible at two levels: the user level and the developer level. At the user level, April provides a large number of customizations, allowing, for instance, the modification of the search method, heuristic, etc. At the developer level, April's modular implementation facilitates the plug-in of (new and alternative) techniques. For instance, April's modular implementation made the development of parallel algorithms a faster task.

April's scalability is achieved by exploiting parallelism or by using relational databases to store the examples and ground background knowledge.

The remainder of the chapter is organized as follows. We start by describing the April system, namely the learning task addressed, the ILP semantics used, etc. Then we describe April's algorithm and its implementation. We then present the results of an empirical evaluation of the system. Finally, we place the system in the context of other systems.


3.2 System Description

The April system is now described in terms of the ILP task addressed, its language and semantics, and the traditional user dimensions, as well as through the presentation of its meta-language.

3.2.1 Characteristics

April can address predictive or descriptive learning tasks. It addresses predictive learning tasks by constructing classification rules (using a covering algorithm with MDIE). It can also be applied to find association rules (using an algorithm similar to that of the Warmr system [DT00]). However, most of the development done so far has focused on predictive ILP.

The ILP semantics used in April is the learning from entailment semantics. Therefore, when learning classification rules, April follows the normal semantics of ILP. The notion of coverage used in both tasks is the intensional coverage.

April receives as input prior knowledge B (the background knowledge) and examples E, and induces a theory H that describes (explains) the examples. The examples E are represented as Prolog ground facts and the background knowledge as Prolog programs. The predicates in B can therefore be defined either intensionally or extensionally. The hypothesis language is constrained through the use of meta-language declarations.

April implements a covering algorithm to build a set of classification rules. The rules are found by performing a search through an ordered space of rules. Similarly to other MDIE-based systems, the search space is bounded by a bottom clause and by constraints given by the user (e.g., maximum clause length). April has two search strategies, namely top-down and randomized, and different search methods (e.g., breadth-first, beam-search, randomized rapid restarts [vSP02]). Several metrics are also available to score the rules, namely coverage and accuracy, among others.

Since April is implemented (mostly) in Prolog, the data is stored in the Prolog database (i.e., in memory). However, April has some extensions that allow the system to learn directly from relational databases.


3.2.2 Dimensions

The characterization of an ILP system, from a user perspective, is usually done under several dimensions, apart from the characteristics mentioned in the previous section. The main user dimensions used to characterize ILP systems are:

• Empirical versus incremental
This dimension is concerned with the way the examples are given to an ILP system. An empirical ILP system (also designated as a batch-learning ILP system) receives all examples at the beginning of the execution. An incremental ILP system receives the examples one by one and adjusts the theory each time a new example is received. An incremental system typically performs a search using both generalization and specialization techniques, while empirical systems use only one of these techniques.

• Interactive versus non-interactive
An ILP system is interactive when it poses questions to an oracle (i.e., the user) to obtain some additional information. For instance, a system may pose queries whose answers may be used to prune large parts of the search space. A system that does not have the possibility of interacting with the user is called non-interactive. Interactive systems can be considered more user friendly, but they are often impractical due to the long running times of ILP systems, which hinder interaction in real time.

• Single versus multiple predicate learning
This dimension considers the number of predicates that an ILP system is capable of learning simultaneously. In single predicate learning, the examples E are composed of instances of a single predicate. In multiple predicate learning, the aim is to learn a set of possibly interrelated predicate definitions; therefore, the set of examples E contains instances of multiple predicates. Theory revision is usually a form of incremental multiple predicate learning, where the system starts with an initial approximation of the theory. Multiple predicate learning systems are more powerful than single predicate learning systems, but they are less efficient, since the problem is usually much harder [DRL96].

• Predicate Invention
Predicate invention [MB88, Kra95] is the process whereby new predicates (not present in E and B) are induced by extending the vocabulary of the learner. It aims at facilitating the learning task from the user's point of view, at the cost of increasing the complexity of the learning task.

• Noise handling
Some ILP systems can only deal with exact, noiseless data, whereas others contain elaborate mechanisms to deal with real-life, imperfect data (noise).

System                 Emp/Incr   Int/No   Inv/No   Sin/Mult   Noise/No
MIS [Sha83]            Incr       Int      No       Sin        No
Cigol [MB88]           Incr       Int      Inv      Mult       No
Golem [MF90]           Emp        No       No       Sin        Noise
Foil [Qui90]           Emp        No       No       Sin        Noise
Linus [DL91]           Emp        No       No       Sin        Noise
Progol [Mug95, MF01]   Emp        No       No       Sin        Noise
FORTE [RM95]           Emp        No       No       Mult       Noise
Claudien [RD97]        Emp        No       No       Mult       No
ICL [RL95]             Emp        No       No       Sin        Noise
Tilde [BR98]           Emp        No       No       Sin        Noise
Warmr [DT00]           Emp        No       No       Mult       No
Aleph [Sri03]          Emp&Incr   Int&No   No       Sin&Mult   Noise
April                  Emp        No       No       Sin&Mult   Noise

Table 3.1: Dimensions of ILP: April and other well known ILP systems.

In Table 3.1 we compare April with other ILP systems along the five dimensions mentioned above. In the table, the following abbreviations are used: Emp/Incr (Empirical/Incremental), Int/No (Interactive/Non-interactive), Inv/No (predicate Invention/no predicate invention), Sin/Mult (single predicate learning/multiple predicate learning), Noise/No (noise handling/no noise handling).

From Table 3.1 one can observe that most of the systems are empirical, non-interactive, single predicate learners. A remark should be made about the Aleph [Sri03] system: it can be an incremental or empirical, interactive or non-interactive, and single or multiple predicate learning system. April follows the dominant tendency, and can be classified as an empirical (non-incremental), non-interactive, single predicate learning system that does not perform predicate invention and is capable of handling noise. Since April can be used to learn association rules, in the line of the Warmr system and Aleph, we also classify it as a multiple predicate learning system.


3.2.3 Meta-Language

April's meta-language features include determination declarations [DR87], mode and type declarations [Mug95], background predicates' properties (redundancy declarations [FCCS04]), pruning, constraint declarations, and a set of system parameters that affect the way April operates.

Determination declarations

Determination declarations [DR87] specify, for each target predicate symbol, which other predicate symbols from B can appear in its definition. They take the form determination(TargetName/Arity1, BackgroundName/Arity2). The first argument is the name and arity of the target predicate (i.e., the predicate that appears in the head of hypothesized clauses). The second argument is the name and arity of a predicate that can appear in the body of such clauses.

Typically there will be many determination declarations for a target predicate, corresponding to the predicates thought to be relevant to its definition. April does not construct any clauses if no determination declarations are provided. April only accepts one target predicate at a time; if multiple target determinations are provided by the user, the first one is chosen.

Mode declarations

Mode declarations specify the mode of call for predicates that can appear in the hypotheses generated by April. These declarations specify the arguments' types and whether each argument is intended to be an input or an output argument. There may be more than one mode declaration for each predicate symbol, except for the head of the target predicate.

Mode declarations take the form modeh(1,PredicateMode) for the target predicate, and modeb(RecallNumber,PredicateMode) for the background knowledge. The number of possible outputs, for each combination of input arguments, is limited by the RecallNumber. RecallNumber can either be a number, specifying the number of successful calls to the predicate, or *, meaning that all answers are to be used∗. It is usually simpler to specify RecallNumber as *, with the side effect that the system may become slower.

∗In practice, to ensure termination, the * does not mean all answers but is limited to some system-defined number of answers.


PredicateMode specifies the argument modes of a predicate. It has the form:

predicatename(ModeType, ModeType, ...).

Each ModeType is either simple or structured. A simple ModeType is one of the form:

• +T specifying that when a literal with predicate symbol predicatename appears in a hypothesized clause, the corresponding argument should be an "input" variable of type T

• -T specifying that the argument is an "output" variable of type T

• #T specifying that it should be a constant of type T

A structured ModeType is of the form f(..), where f is a function symbol and each argument is either a simple or a structured ModeType.

Types have to be specified for every argument of all predicates that appear in mode declarations. This specification is done within a mode(...,...) statement. By default April does not perform type-checking, but it can be activated through the parameter typechecking.

Example 1 (Determination and Mode declarations)

Here are some examples of determination and mode declarations:

:- determination(grandparent/2,father/2).

:- determination(grandparent/2,male/1).

:- modeh(1,grandparent(+person,-person)).

:- modeb(1,father(+person,-person)).

:- modeb(1,male(+person)).


Background Knowledge Properties

April allows a domain expert to provide meta-knowledge about the predicates in the background knowledge by describing high-level properties. This information can be used to avoid the generation of redundant clauses [FCCS04]. Experimental results have shown that substantial performance improvements can be obtained by using redundancy declarations [FCCS04].

The meta-knowledge declarations are provided to the system as background knowledge in the form of Prolog rules. The user may use declarations such as tautology, commutative, equiv, and contradiction. Table 3.2 presents examples of background knowledge properties declarations together with, for each declaration, an example of a clause not generated due to it.

Declaration                                     Example of a clause not generated

:- tautology('≥'(X,X)).                         p(X) ← X≥X
:- contradiction(['>'(X,Y),'<'(X,Y)]).          p(X) ← X<2, X>2
:- commutative(mult(X,Y,R),mult(Y,X,R)).        p(X) ← mult(X,3,R), mult(3,X,R)
:- transitive('>'(X,Y),'>'(Y,Z),'>'(X,Z)).      p(X) ← X>Y, Y>Z, X>Z
semantic_rule('<'(A,B),'<'(A,C)) :- B<C.        p(X) ← X<0, X<2
:- equiv('≤'(X,Y),'>'(Y,X)).                    p(X,Y) ← X≤Y, Y>X

Table 3.2: Background knowledge properties declarations.

The declaration :- tautology('≤'(X,X)) informs the system that literals of the form '≤'(X,X) should be discarded.

The commutative declaration indicates that a given predicate is commutative. As an example, consider that we inform the ILP system that the predicate adj(X,Y) is commutative (e.g., :-commutative(adj(X,Y),adj(Y,X))). That information is used to prevent the generation of two equivalent literals such as adj(X,2) and adj(2,X).

The equiv declaration allows an expert to indicate that two predicates, although possibly different, generate equivalent literals. For instance, :-equiv('≤'(X,Y),'>'(Y,X)) informs the system that literals like '≤'(X,1) and '>'(1,X) are equivalent. An ILP system using equivalence declarations needs only to consider one literal of each equivalence class. Obviously, there is redundancy in having both commutative and equiv declarations, since commutativity can be defined using only the equiv declaration. The main reason for keeping both is to maintain compatibility with other systems (e.g., Aleph).

The contradiction declaration has the form contradiction([L1, ..., Lk]), where L1, ..., Lk is a conjunction of literals that, when appearing together in a clause, makes it inconsistent. For instance, contradiction([X<Y,X>Y]) states that both literals cannot occur in the same hypothesis because they would make it inconsistent.

The transitive declaration indicates that a given predicate is transitive. For instance, the rule :- transitive(lt(X,Y),lt(Y,Z),lt(X,Z)) informs the system that the lt (less than) predicate is transitive. With such information, a clause containing the literals lt(X,Y) and lt(Y,Z) is not extended with lt(X,Z), since the latter is implied by the former two. Hence, clauses with all three literals are not generated, since they are not reduced.

Further domain knowledge can be provided using semantic_rule declarations. This declaration takes the form semantic_rule(L1,L2):-RuleBody, meaning that literal L1 implies literal L2 if RuleBody evaluates to true. For instance, the rule semantic_rule(lt(A,B),lt(A,C)):-B<C. allows the refinement operator to avoid the generation of hypotheses containing a literal like lt(X,1) followed by a literal like lt(X,2) (e.g., p(X) ← lt(X,1),lt(X,2)).

Finally, there is a generic declaration that takes the form redundant(Literal,HypothesisBody):-Body, where Literal is the literal to be added to the HypothesisBody and Body is a conjunction of literals that specify the conditions under which the literal is redundant in a hypothesis. All previous declarations could be handled by using this generic declaration. There are two main reasons to provide a set of declarations instead of a single one. The first reason is efficiency: special cases can be implemented more efficiently than a generic redundancy declaration like redundant. The second reason is legibility: declarations whose names indicate the property simplify the reading and understanding of the background knowledge.
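As a hedged illustration of the generic form (it assumes, as a simplification, that HypothesisBody is available as a list of literals, which may not match April's internal representation), a declaration stating that a literal duplicating one already present in the body is redundant could be written as:

% adding a literal that already occurs in the body is redundant
redundant(Literal, HypothesisBody) :-
    memberchk(Literal, HypothesisBody).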

A final note concerning the implementation. Most of the declarations described (e.g., transitive, contradiction, commutative, and equiv) are implemented by performing a matching between the literals in a declaration and the ones in a clause. The semantic_rule declaration is the only exception to this scheme, since it involves the evaluation of the body of the semantic rule. Testing whether a declaration is applicable to a clause involves comparing (matching) the literals in the clause with the ones in the declaration. Since the refinement operator in April adds one literal to a clause, the newly added literal is always one of the arguments of the tests. The cost of such a test is linear in the size of the clause (number of literals), thus having a relatively low computational cost.

Pruning

Pruning is used to exclude clauses and their refinements from the search. It is very useful for stating which kinds of clauses should not be considered. The use of pruning greatly improves the efficiency of ILP systems since it leads to a reduction of the size of the search space. Note that the declarations described in the previous section also have a pruning effect.

Two types of pruning can be distinguished within April: built-in and user-defined pruning. Built-in, or internal, pruning refers to pruning implemented within April that performs admissible removal of clauses from a search, and is currently available for all evaluation functions.

User-defined pruning statements can be written to specify the conditions under which a user knows that a clause (and its refinements) cannot possibly be acceptable. Such clauses are pruned from the search. The prune definitions are written in the background knowledge file using rules of the form

prune((ClauseHead:-ClauseBody)) :- Body.

Example 2 (Pruning declaration)

The following example states that every extension of a clause with two or more "pieces" is unacceptable, and that the search should be pruned at such a clause.

prune((Head:-Body)) :-

has_pieces(Body,Pieces),

length(Pieces,N),

N>=2.

has_pieces(Body,Pieces):-...


User-defined constraints

April accepts the definition of Integrity Constraints (IC) that should not be violated by a generated hypothesis. Integrity constraints can be used to impose syntactic or semantic properties on the hypothesized clauses. They are written in the background knowledge file as rules of the form:

is_constraint(HypothesisBody):- ConstraintBody.

or

constraint(HypothesisHead,HypothesisBody):- ConstraintBody.

ConstraintBody is a set of literals that specify the condition(s) that should not be violated by hypotheses found by April.

Note that negative examples are a special case of integrity constraints. If April applies a constraint successfully to a hypothesis then the hypothesis is considered inconsistent and is not accepted. An integrity constraint differs from a pruning declaration in that it does not state that the refinements of a hypothesis that violates one (or more) constraint(s) will also be unacceptable. To achieve this, pruning should be used instead.

Example 3 (Constraint declaration)

The following example states that hypotheses with less than three "pieces" are unacceptable.

constraint(Head,Body):-

has_pieces(Body,Pieces),

length(Pieces,N),

N =< 2.

Parameters

The April system has a wide number of parameters. For instance, they can be used to change the verbosity level of the messages printed at runtime, or to choose the way the search is performed, i.e., by selecting a particular search strategy or evaluation function. The set(Parameter,Value) predicate is used for setting the values of the parameters.∗

∗Once April is installed, a list of April's parameters' descriptions can be obtained by running april_opts at the command line.

Example 4 (Parameters' declaration)

The following declarations define the noise level to 10, the maximum length of the clauses generated to 5, and that all clauses, to be accepted, should explain at least 5 positive examples.

:- set(noise,10).

:- set(clauselength,5).

:- set(mincover,5).

3.3 April's Algorithm

April's main algorithm is outlined in Figure 3.1. It accepts as input: a training set consisting of positive (E+) and, optionally, negative (E−) ground examples; background knowledge (B) in the form of definite clauses; and constraints (C) represented by means of meta-language declarations that include determination declarations, mode and type declarations, background predicates' properties, and facilities to change system parameters. As output, it generates a reduced theory H that is consistent and complete† considering the data provided.

†Consistency and completeness conditions may be relaxed by the constraints in C.

April's algorithm is a less greedy variant of the covering algorithm presented in Figure 2.3. The main difference concerns the structure of the algorithm: it now has two main stages. First, a set of hypotheses is found (clause generation); then, an algorithm is applied to select a subset of the clauses (clause selection). The positive examples made redundant by the selected subset of clauses are removed from the current set of positive examples (E+_cur). The outer cycle ends when there are no positive examples left or another stopping condition is satisfied. Finally, the final set of clauses (H) is reduced, i.e., redundant clauses that may exist are removed.

The clause generation cycle produces a number of clauses based on a given example e+, a.k.a. the seed. At each iteration, an example e+ is selected sequentially or randomly (the choice is made by the user) from E+_cur. The selected example e+ is used to construct the most specific hypothesis ⊥ - the bottom clause.


April(E+, E−, B, C)

Input: A training set of positive (E+) and negative (E−) examples, background knowledge (B), and a set of constraints and settings (C).
Output: A consistent and complete theory (H).

1.  Rules_Learned = ∅
2.  H = ∅
3.  Pool = ∅
4.  E+_cur = E+
5.  SampleSize = C(samplesize)                    /* Size of the sample */
6.  while not_finish_condition_ok() do            /* Default condition: E+_cur ≠ ∅ */
7.      BestSoFar = best(Pool)                    /* Get best clause value in the pool */
8.      i = 1
9.      do                                        /* Clause generation phase */
10.         e+_i = select_example(E+_cur, C, SampleSize, i)
11.         ⊥ = saturate(B, H, C, e+_i)           /* See Section 3.3.1 */
12.         C_i = learn_rule(⊥, B, H, C, E+_cur, E−, BestSoFar)   /* See Section 3.3.2 */
13.         if best(C_i) better than BestSoFar then BestSoFar = best(C_i) endif
14.         Pool = Pool ∪ C_i
15.         i = i + 1
16.     while i ≤ SampleSize and i ≤ |E+_cur|
17.     Cs = select_clauses(Pool, B, H, C, E+_cur, E−)            /* Clause selection phase */
18.     E+_cur = E+_cur − covered(B, H, E+_cur, Cs)               /* Remove redundant examples */
19.     H = H ∪ Cs                                /* Add clauses to theory */
20. end while
21. H = rem_redundant_clauses(B, H, E+)           /* Remove redundant clauses from H */
22. return H

Figure 3.1: April's main algorithm.


The most specific clause imposes a limit on the size of the search space, since it corresponds to the most specific hypothesis that may be generated during a search. The algorithm used in April to construct the bottom clause is basically the same as the one used in Progol [Mug95]. More details are given in Section 3.3.1. The learn_rule() procedure searches for a clause in a search space of clauses where the literals in the body are a subset of the literals that appear in the body of the bottom clause. The top k best acceptable clauses found during the rule generation cycle are added to a pool of clauses. A clause is acceptable if it does not violate any constraint defined in the meta-language (e.g., minimum coverage, noise value, clause length, language level, etc.). The clauses in the pool are ordered by their evaluation function value and length. The clause selection phase selects a subset from the clauses found previously in the generation phase.

The idea underlying the two phases is similar to the cautious induction method implemented in CILS [AF99]. As in CILS, April first generates a set of candidate clauses (clause generation). April then selects the clause(s) with highest quality and adds it (them) to the theory (clause selection). The difference to CILS is that April can apply different clause selection methods, namely best-first, beam search, and greedy search, while CILS provides two methods, greedy search and complete search. The greedy search reduces the execution time at the cost of possibly obtaining a final theory of slightly lower quality.

April has many configuration options, and many of them slightly modify the behavior of the algorithm presented. For instance, the covering algorithm presented in Section 2.3.1 can be simulated in April by using a selection strategy that picks the best clause in the pool and a samplesize of 1.

Next we describe in more detail the four main components of April's algorithm: example saturation, clause generation, clause evaluation, and clause selection.

3.3.1 Saturation

Saturation constructs the bottom clause ⊥ that corresponds to the bottom of a generalization lattice. The process consists in gathering all "relevant" ground atoms that can be derived from B ∧ e+_i and satisfy the constraints C∗. The bottom clause will contain all literals that may be found in the clauses generated during a search. This step removes useless clauses from consideration when searching for a clause.

∗When learning recursive predicates the partially found hypothesis H is considered as part of the program B.


April's saturation algorithm is based on Progol's saturation algorithm [MF01]. The algorithm works as follows: for each modeb declaration of some predicate p, it replaces the +type arguments with constants relevant to the example, and the −type and #type arguments with variables; the predicate p is then executed as a Prolog goal on the program B; if the goal succeeds with ground substitutions, an atom is recorded and its generalization added to the body of the bottom clause. The recall number controls the number of substitutions by limiting the amount of backtracking for non-deterministic predicates. The April parameter i is used to define an upper bound on the depth of variables.

All relevant atoms collected during the construction of the bottom clause are flattened∗. April flattens all function symbols by introducing equalities, internally represented as sat_eq(Var,Value) to distinguish them from other equalities that may be introduced in the background knowledge through determination and mode declarations.

∗Flattening is a method to make a theory function-free and was introduced in ILP by Rouveirol [RP89, Rou92, Rou94].
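For instance, in the bottom clause of Example 5 below, the equality X=henry shown to the user is internally kept as sat_eq(X,henry), so it cannot be confused with '='/2 literals coming from the background knowledge.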

The input or output arguments of the atoms collected are transformed into skolemized variables. Then the new atom (with skolem variables) is recorded, to be used later in the process of generating clauses, together with its i-depth and an identifier (an auto-number greater than 0). The head literal has the identifier 0. The identifier number imposes an ordering on the literals. Example 5 shows a bottom clause as seen by the user, and how it is internally represented using literal identifiers.

Example 5 (Bottom clause)

Consider the set of examples and background knowledge given in Table 2.1. By picking as seed the (positive) example "henry is the grandfather of jane" we get the bottom clause:

grandfather(X,Y):-male(X),father(X,Y),parent(X,Y),male(Y),
    father(Y,Z),parent(Y,Z),female(Z),X=henry,Y=john,Z=jane.

The same bottom clause represented by its literal identifiers:

0 :- 1,2,3,4,5,6,7,8,9,10,11,12.



3.3.2 Clause Generation

The bottom clause generated from an example e+_j is the most specific clause (⊥_j) that subsumes e+_j relative to the background knowledge B (possibly together with H). Thus, for the example e+_j, the search for an acceptable hypothesis c is limited to the bounded sub-lattice □ ⪰ c ⪰ ⊥_j. The bottom clause is often too specific to be of interest because it sometimes just subsumes e+_j. For this reason ⊥_j must be generalized. The generalizations considered by April are subsets of the bottom clause.

April's default approach to find an interesting clause more general than ⊥ involves a top-down search. The generalization lattice is traversed starting at the most general clause (which has the same predicate symbol as the target concept). The search then proceeds by repeatedly applying a refinement operator to a clause.

Refinement Operator

The refinement operator in April is designed with the primary concern of maintaining the relationship □ ⪰ c ⪰ ⊥ for each clause c generated. The operator avoids the use of sibling lists (for example, as used by Indlog and Aleph) by exploiting the ordering of the bottom clause literals.

April's refinement operator receives a clause (represented as a list of bottom clause literal identifiers) and all variables found in the clause (bound, unbound, and the clause's unbound head variables), and selects a literal from the bottom clause that can be added to the clause. Each selected literal must conform to the mode and language level declarations, and its identifier must be greater than the last identifier added to the clause. By imposing an ordering on the clauses' literals, based on the order by which the literals were added to the bottom clause, the operator eliminates combinations of literals that would lead to equivalent, although syntactically different, clauses. Example 6 presents some of the clauses generated by repeatedly applying the refinement operator to the head literal (0) of the bottom clause in Example 5. Example 7 shows some clauses that are not generated.


Example 6

Some clauses that the refinement operator may generate:

grandfather(X,Y):- male(X).

grandfather(X,Y):- father(X,Y).

grandfather(X,Y):- parent(X,Y).

grandfather(X,Y):- father(Y,Z).

grandfather(X,Y):- male(X),male(Y).

grandfather(X,Y):- male(X),father(X,Y).

grandfather(X,Y):- father(X,Y),parent(X,Y).

grandfather(X,Y):- parent(X,Y),father(Y,Z).

Example 7

Considering the generation of the clauses of Example 6, the ordering imposed on the literals prevents the generation of redundant clauses like:

grandfather(X,Y):- male(Y),male(X).

grandfather(X,Y):- father(Y,Z),father(X,Y).

grandfather(X,Y):- father(X,Y),male(X).

grandfather(X,Y):- parent(X,Y),father(X,Y).
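The ordered refinement step illustrated by the two examples above can be sketched in Prolog. This is a minimal sketch under stated assumptions, not April's actual code: it assumes a clause is kept as an ascending list of bottom clause literal identifiers, that bottom_literal(Id,Literal) facts map identifiers to literals, and it omits the mode and language level checks mentioned earlier.

% Refine a clause by appending one bottom clause literal whose
% identifier is greater than the last one already in the clause, so
% that each subset of the bottom clause is enumerated at most once.
refine(ClauseIds, NewClauseIds) :-
    last_id(ClauseIds, LastId),
    bottom_literal(Id, _Literal),   % enumerate candidate literals
    Id > LastId,                    % enforce the literal ordering
    append(ClauseIds, [Id], NewClauseIds).

% last_id(+Ids, -LastId): when the body is still empty, the starting
% point is the head literal, whose identifier is 0.
last_id([], 0).
last_id([Id|Ids], Last) :- last_id_aux(Ids, Id, Last).

last_id_aux([], Last, Last).
last_id_aux([Id|Ids], _, Last) :- last_id_aux(Ids, Id, Last).

On backtracking, refine/2 enumerates every valid one-literal extension of the given clause.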

Search

April can use different search methods to find a clause, namely beam-search, heuristic search, breadth-first search, and randomized search. To guarantee that the search ends, several constraints can be defined by the user. For instance, the user may define the maximum depth of the search through the parameter clause_length. This constraint can be turned off by setting it to 0. Other available constraints include limiting the number of clauses generated (through the parameter nodes) or limiting the maximum time that a search may take (through the parameter max_search_time).

Like in the Aleph [Sri03] system, the search methods are implemented using a generic branch-and-bound procedure (similar to the algorithm outlined in Figure 2.6). The choice of the node (which corresponds to a clause) to expand is based on comparisons of a dual search key (primary and secondary) associated with each clause. The value of the key depends on the search method and evaluation function. For instance, to perform a breadth-first search the primary key would be -L (where L is the number of literals in the clause), and the secondary key the value of the clause given by the selected evaluation function. This ensures that, by selecting the clause with the maximum first key value, clauses with fewer literals are selected first for expansion. The branching of a node is done by applying the refinement operator to the clause corresponding to the node.
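As a hedged sketch of the key computation just described (clause_length/2 and eval/2 are assumed helper predicates, not April's actual interface), the dual key for a breadth-first search could be computed as:

% The primary key prefers shorter clauses; the secondary key breaks
% ties using the user-selected evaluation function.
search_key(Clause, key(Primary,Secondary)) :-
    clause_length(Clause, L),    % L = number of literals in the clause
    Primary is -L,
    eval(Clause, Secondary).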

April automatically performs admissible pruning of clauses for several evaluation measures. For instance, a clause is pruned if its positive coverage is lower than the value defined in the parameter minpos. This parameter specifies a lower bound on the number of positive examples that a clause must cover. The user can define pruning statements by writing rules in the background knowledge that specify the conditions for a clause to be pruned.

The search method used in April becomes an iterative deepening search when used in conjunction with Incremental Language Level Search (ILLS) [Cam02] or Incremental Number of Variables Search (INVS). ILLS is an iterative search strategy that, starting from 1, progressively increases the upper bound on the number of occurrences of a predicate symbol in any clause. INVS is a similar iterative search strategy that, starting from 1, progressively increases the upper bound on the number of different variables that appear in any clause.

Both iterative procedures work as follows. If a clause is not legal in the current (language or number of variables) level it is placed on a list of suspended clauses (and is not evaluated). When the search at a level ends and a solution has not been found, the level is incremented and a new search is performed at the new level. The suspended clauses that no longer violate the current level are unsuspended and evaluated, and the search proceeds from them. The process repeats until a clause is found or the maximum level is reached.

3.3.3 Clause Evaluation

Clause evaluation consists in testing a clause c_i against the positive and negative examples. This is done by verifying, for each example e in E, if B ∧ H ∧ c_i ⊢ e. April uses a Prolog depth-bound SLDNF theorem prover to check if an example e can be derived from B ∧ H ∧ c_i. The time needed to compute the coverage of a hypothesis depends primarily on the cardinality of E (i.e., E+ and E−) and on the theorem proving effort required to check if an example succeeds in the clause c_i.


As will be shown in Section 3.6.5, most of the execution time is spent in clause evaluation. Therefore, efficient coverage computation is crucial for the performance of an ILP system. In Section 2.5.3 several approaches proposed to improve the efficiency of coverage computation were referred to. April exploits some of those techniques, namely query optimizations [CSC00, CSC+03], lazy evaluation of examples [Cam03], coverage caching [Cus96, FCSC03], tabling [RFC05], and parallel coverage computation (described in the next chapter).

Lazy evaluation

Lazy evaluation [Cam03, FCR+] of examples is a technique that consists in avoiding or postponing the evaluation of each clause against all examples. Three variants of lazy evaluation have been proposed: lazy evaluation of positive examples, lazy evaluation of negative examples, and total laziness. The three lazy evaluation modes are available in April and can be activated/deactivated through the parameter lazy_eval.

The rationale underlying lazy evaluation is the following. A hypothesis is allowed to cover a small number of negative examples (the noise level) or none. If a clause covers more than the allowed number of negative examples it must be specialized. Lazy evaluation of negatives can be used when we are interested only in knowing whether a hypothesis covers more than the allowed number of negative examples. Testing stops as soon as the number of negative examples covered exceeds the allowed noise level or when there are no more negative examples to be tested. Therefore, the number of negative examples effectively tested may be very small, since the noise level is quite often very close to zero. If the evaluation function used does not use the negative counting then this produces exactly the same results (clauses and accuracy) as the non-lazy approach, but with a reduction in the number of negative examples tested.
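A minimal sketch of this counting scheme follows (covers/2 stands for the theorem-proving test of the previous section and is assumed; this is not April's actual code). The traversal of the negative examples stops as soon as the count exceeds the noise level:

lazy_neg_cover(Clause, Negatives, Noise, Covered) :-
    lazy_neg_cover_(Negatives, Clause, Noise, 0, Covered).

lazy_neg_cover_([], _Clause, _Noise, N, N).
lazy_neg_cover_([E|Es], Clause, Noise, N0, N) :-
    (   covers(Clause, E)          % does B ∧ H ∧ Clause entail E?
    ->  N1 is N0 + 1,
        (   N1 > Noise             % noise level exceeded: stop early
        ->  N = N1
        ;   lazy_neg_cover_(Es, Clause, Noise, N1, N)
        )
    ;   lazy_neg_cover_(Es, Clause, Noise, N0, N)
    ).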

One may also allow the positive cover to be evaluated lazily (lazy evaluation of positives). A clause is either specialized (if it covers more positives than the best consistent clause found so far) or justifiably pruned away otherwise. When using lazy evaluation of positives it is only relevant to determine whether a hypothesis covers more positives than the current best consistent hypothesis. We may then evaluate the positive examples only until we exceed the best cover so far. If the best cover is exceeded we retain the hypothesis (either accepting it as final if it is consistent or refining it otherwise); otherwise we may justifiably discard it. We need to evaluate its exact positive cover only when accepting a consistent hypothesis. In that case we do not need to restart the positive coverage computation from scratch: we may simply continue the test at the point where we previously left it.

Lazy evaluation can be taken to the extreme by simply not evaluating the positive cover (total laziness). The evaluation of a hypothesis is divided in two steps. In the first step, we perform a lazy evaluation of the negative examples. If the clause is inconsistent then we are done: no extra evaluation effort is required and the clause is retained for specialization. On the other hand, if we find a consistent clause then an exact positive coverage computation is carried out. The advantage of total laziness is that each clause is tested on the negatives only until it covers the noise level. One should note that in systems that constrain the number of hypotheses generated, it is necessary to relax the nodes limit constraint (i.e., increase the upper bound on the number of generated hypotheses). Although we may generate more hypotheses, we may still gain from the increased speed of their evaluation, since the computational cost of generating hypotheses is usually much lower than the cost of evaluating them.

Coverage Caching

To reduce the time spent on computing clause coverage, April, like other ILP systems (e.g., Aleph [Sri03], Indlog [Cam00], and FORTE [RM95]), keeps lists of the examples covered (coverage lists) by each hypothesis generated during execution: one list for the positive and another for the negative examples covered by a clause. The rationale underlying coverage caching and coverage lists has been described in Section 2.5.3.

Coverage lists reduce the effort in coverage computation at the cost of significantly increasing memory consumption [FCSC03]. Therefore, coverage lists should be represented using efficient data structures to minimize memory usage [FRCS03]. The data structure used to maintain coverage lists in systems like Indlog or Aleph is the Prolog list. For each clause, two lists are kept: a list of positive examples covered and a list of negative examples covered. A number is used to represent an example in the list. The positive examples are numbered from 1 to |E+|, and the negative examples from 1 to |E−|. The systems mentioned reduce the size of the coverage lists by transforming a list of numbers into a list of intervals. For instance, consider the coverage list [1,2,5,6,7,8,9,10] represented as a list of numbers. Represented as a list of intervals, it corresponds to [1-2,5-10]. Using a list of intervals to represent coverage lists is an improvement over lists of numbers, but it still presents some problems. First, the cost of performing basic operations on the interval list is linear in the number of intervals and can be improved. Secondly, the representation of lists in Prolog is not very efficient as far as memory usage is concerned.
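As an illustration, a hedged sketch (not April's code) of compressing a sorted coverage list of example numbers into a list of intervals:

% numbers_to_intervals(+SortedNumbers, -Intervals):
% e.g., numbers_to_intervals([1,2,5,6,7,8,9,10], I) gives I = [1-2,5-10].
numbers_to_intervals([], []).
numbers_to_intervals([N|Ns], [Interval|Is]) :-
    run(Ns, N, N, Interval, Rest),   % grow the current run of numbers
    numbers_to_intervals(Rest, Is).

% run(+Numbers, +Lo, +Prev, -Interval, -Rest): extend the run while
% the next number is consecutive to the previous one.
run([N|Ns], Lo, Prev, Interval, Rest) :-
    N =:= Prev + 1, !,
    run(Ns, Lo, N, Interval, Rest).
run(Ns, Lo, Prev, Lo-Prev, Ns).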

April can use a list of intervals to represent coverage lists or other, more efficient, data structures, namely RL-Trees [FRCS03] and bitmaps. Coverage caching can be activated/deactivated through the parameter cache, while the parameter cache_storage allows the selection of the data structure used to represent the coverage lists.

Query Optimizations

Query optimization techniques [CSC00, CSC+03] consist in performing exact transformations on the generated clauses to make them more efficient to execute in a Prolog engine. The transformations involve the elimination of redundant literals, the addition of cuts between islands of independent literals (to reduce useless backtracking), and the dropping of literals from the clause by exploiting knowledge of previous evaluations (this works for top-down searches with coverage lists). This last transformation is not active in April's implementation. These techniques can be activated/disabled through the parameter optimise_clauses.
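As a hedged illustration of the second transformation (the clause is invented, and the actual transformation of [CSC+03] may differ in detail): in the clause below, the literals using Y and the literals using Z share no variables and thus form independent "islands"; since only the existence of a proof matters for coverage testing, each island can be committed to as soon as it succeeds:

% original clause
p(X) :- a(X,Y), b(Y), c(X,Z), d(Z).

% transformed clause used for coverage testing
p(X) :- once((a(X,Y), b(Y))), once((c(X,Z), d(Z))).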

Tabling

Tabling [Mic68] is a logic programming technique that performs a kind of dynamic and transparent lazy extensionalization of predicates. It consists in storing intermediate answers for sub-goals so that they can be reused when a repeated call appears, thus avoiding redundant recomputation. The results of exploiting tabling in the ILP context [RFC05] showed that it can reduce the execution time at the cost of using large amounts of memory (a problem that needs to be solved to further explore tabling in ILP). Tabling can be activated/disabled in April through the parameter tabling.
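As a brief illustration (a textbook predicate, not taken from April), a non-deterministic background predicate can be tabled in YAP with the table directive; repeated calls to it during clause evaluation then reuse the stored answers instead of recomputing them:

:- table ancestor/2.

ancestor(X,Y) :- parent(X,Y).
ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).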

3.3.4 Clause Selection

Clause selection, as the name suggests, selects a subset from the set of clauses found previously in the generation phase. Several selection methods can be envisaged. April already provides the following methods∗: single best clause selection; greedy clause selection; best-first clause selection search; and beam clause selection search. All methods receive as input a set of clauses and work as follows.

∗The best-first and beam-search clause selection methods are still under development.


• The single best clause selection method simply chooses the best clause (according to some evaluation function).

• The greedy clause selection method performs a greedy beam-search (with a beam-width of one) for a subset of clauses that maximizes some evaluation function. The search traverses a space of theories, where each theory is composed of a subset of the clauses passed to the clause selection phase.

• The beam-search clause selection method performs a beam search for a minimal subset of rules that maximizes some quality evaluation function.

• The best-first clause selection method performs a best-first search for a minimal subset of rules that maximizes some quality evaluation function.

April provides the parameter theory_search to indicate the clause selection methodthat should be used, and theory_heuristic to indicate the evaluation function.
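For instance (the parameter names theory_search and theory_heuristic come from the text above, but the value names greedy and accuracy are assumptions made for illustration), a greedy selection could be requested with:

:- set(theory_search, greedy).
:- set(theory_heuristic, accuracy).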

3.4 Coupling with Relational Databases

April has several strategies to couple to a relational database, using a deductivedatabase engine [SFR05], namely:

• Mapping predicates to database tables or views;

• Translating logical clauses into SQL views;

• Translating logical clauses into SQL count statements.

The mapping scheme results in increased communication with the RDBMS due to the many accesses made while evaluating a single clause. The second and third schemes reduce communication at the cost of the ILP system not being able to learn recursive definitions. The third scheme reduces the communication with the database management system to sending a SQL query and receiving a number with the count value.

In order to optimize query execution, April automatically creates indexes on the tables using the mode information. For each mode declaration (that affects some table) two indexes are automatically created: one index on the fields (arguments) indicated as input or constants, and another index on all the fields. The rationale is that all queries generated will be mode conform. The projections on the tables will always be performed using the input and/or constant fields. Sometimes, the input fields are used in conjunction, and for this reason an index is created on all fields.

Table 3.3 presents some results∗, namely the time taken in seconds to evaluate 688 randomly generated clauses using each scheme on four artificial applications [BGSS03]. The Translate-Count-widx scheme differs from Translate-Count by creating the above mentioned indexes. One can observe that the Mapping and Translate-View schemes are much slower than evaluating the clauses in the YAP [CDRA89] Prolog engine (noRDBMS). However, the Translate-Count-widx scheme is considerably faster than noRDBMS. In this case, the indexes created are fundamental for achieving good performance.

∗The relational database management system used was MySql [WA02] version 4.1 and the Prolog engine used was YAP [CDRA89] version 5.1 with indexing on all arguments.

Scheme                   p.m8.l27     p.m11.l15    p.m15.l29    p.m21.l18

noRDBMS                  12           40           21,490       340,279
Mapping                  35,583.984   >1 day       >1 day       >1 day
Translate-View           >1 day       >1 day       >1 day       >1 day
Translate-Count          39,783       153,834      >1 day       >1 day
Translate-Count-widx     7            19           256          883

Table 3.3: Time (in seconds) to evaluate 200 examples on 688 randomly generated clauses from four artificially generated problems [BGSS03].

3.5 Implementation

April is implemented mainly in Prolog and runs on the YAP [CDRA89] Prolog compiler. By using a Prolog compiler like YAP, April takes advantage of its tested and fast deductive engine. YAP implements some advanced Logic Programming techniques, such as tabling [RSC00], that could contribute to improving April's execution time.

The choice of using Prolog, and Prolog engines, also carries some drawbacks. One drawback is the inability to efficiently implement complex data structures in Prolog. To circumvent this limitation, some data structures have been implemented in the C language to improve response time and reduce memory usage. Examples of such data structures are the RL-trees and Tries [FRCS03]. The basic idea behind the trie [Fre62] data structure is to partition a set T of terms based upon their structure, so that looking up and inserting these terms can be done efficiently. An essential property of the trie structure is that common prefixes are represented only once. The efficiency and memory consumption of a particular trie data structure largely depend on the percentage of terms in T that have common prefixes. For ILP systems, and April in particular, this is an interesting property that we can take advantage of. More specifically, hypotheses in the search space have common prefixes (literals). Furthermore, not only are the hypotheses similar, but the information associated with them is also similar (e.g., the list of variables in a hypothesis is similar to the lists of variables of nearby hypotheses). April efficiently exploits the similarities among hypotheses by using tries to store hypotheses and related information.

The design and implementation of the April system follow a modular architecture. The main modules of April are depicted in Figure 3.2. This design has two main advantages. First, it simplifies the maintenance and development efforts when adding a new feature. Secondly, we hope that the modularity will allow other developers, with knowledge of the Prolog language, to create an ILP system suited to their needs by selecting a subset of the modules or by replacing an existing module with their own implementation.

Some modules were implemented in C for efficiency reasons. These "external" Prolog modules implement functionalities like data structures (e.g., Tries and RL-Trees [FRCS03]) and the LAM MPI interface. The remaining modules are fully implemented in Prolog. Usually each technique, rule evaluation measure, idea or feature is added to the system as a module.

The modules are divided into three major types: data modules, functional modules, and extension modules. The data modules are used to store data, while the functional modules implement an algorithm or some functionality. The third type, the extension modules, implement ideas available in other systems or described in papers. Some of the extension modules available are: the language level module, which implements Camacho's Incremental Language Level Search [Cam02]; the clause optimization module, which implements the optimizations described by Costa et al. [CSC+03]; the RRR search module, which implements the Randomized Rapid Restart search [vSP02]; the SCS search module, which implements stochastic clause selection [Sri00]; the Tabling module, which implements the functionalities needed to integrate tabling in April [RFC05]; and the RDBMS module, which implements the functionalities that allow April to learn directly from a relational database using a Deductive Database System.

Figure 3.2: April's main modules.

Some of the modules are tagged as driver modules. A driver module is a module that, for efficiency reasons, should be loaded at April's start-up and should not change afterward. This contrasts with the other modules, which are always loaded and are invoked, or not, at run time by testing some condition. For instance, the decision of using coverage caching is left to the user. Since the cache module is not a driver, every time an example is evaluated an if statement is evaluated to determine if the cache is being used. These tests can have a considerable impact on the execution time and should be reduced. At the moment we cannot remove the if statements because we would lose modularity. To tackle this problem we are currently working on incorporating conditional compilation and macro substitution in the underlying Prolog engine.
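The run-time test just described could look like the following hedged sketch (all predicate names are hypothetical):

% Every example evaluation pays for a test on the cache setting
% because the cache module is not a driver module.
evaluate_example(Clause, Example, Result) :-
    (   setting(cache, true)
    ->  evaluate_with_cache(Clause, Example, Result)
    ;   evaluate_no_cache(Clause, Example, Result)
    ).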

The main algorithm module, as the name suggests, implements April's main algorithm. The example selection module provides functionalities for selecting an example to be used by the saturation module. The clause generation module provides the functionalities to generate clauses, including refinement operators and search algorithms. In practice, the clause generation module is composed of several modules. The clauses generated during the search are evaluated to compute their coverage against the given examples. The coverage computation and the explicit calls to the Prolog interpreter are done in the evaluation module. The theory module processes the clauses found by the reduction module to generate the final theory that is presented to the user. The cache module implements the coverage caching scheme described in the previous section. The data structures used to represent the coverage lists are available in three modules (Interval Lists, RL-Trees, Bitmaps). The examples module stores the examples provided by the user. The examples may be stored in the YAP clausal database (ExamplesPDB) or in a relational database system (ExamplesRDBMS). The bias declarations provided by the user are stored in the bias module and settings module.

Since April exploits parallelism (as described in the next chapters), there are some modules that provide the functionalities to develop parallel ILP algorithms. In the scheme shown in Figure 3.2 two such modules are referred: parallel utils and MPI. The first module provides a set of high level predicates that facilitates the implementation of parallel algorithms in Prolog, namely a task pool, to be used in master-worker algorithms, and high-level communication functions. The task pool keeps the tasks that are scheduled to be run remotely, in another worker, or are unscheduled. The MPI module provides an interface to LAM MPI [SL03], an open-source implementation of the Message Passing Interface (MPI) specification, which can be used for applications running in heterogeneous clusters, grids or multi-processor computers.

The UserSpace module stores all background knowledge provided by the user and the clauses committed by the theory module. This special module is used to reduce predicate name clashes between the predicates in April's code and the ones encoded in the background knowledge.

3.6 Experimental Evaluation

April was developed aiming to achieve modularity, efficiency and scalability. The modularity of the system, at the user and developer levels, was succinctly described in the previous sections. The scalability and efficiency issues will be addressed in the next chapters by exploiting parallelism. The open question is: is April a competitive sequential ILP system as far as efficiency is concerned? To answer this question we empirically evaluated April on several applications and compared it with a mature ILP system.


3.6.1 Aims

Our main goal is to assess if April is a competitive sequential ILP system. To this end we selected a leading ILP system, Aleph [Sri03], and evaluated both systems on some well-known applications. The evaluation focuses on training time, memory usage, and predictive accuracy.

Another goal of the experiments was to assess how much time April spends in hypotheses generation, hypotheses evaluation, saturation, and hypotheses selection.

3.6.2 Materials

Data

Table 3.4 characterizes the applications used in the experiments in terms of the number of examples (positive and negative) and the background knowledge size (i.e., the number of relations used in the learning task). Carcinogenesis [SKMS97], mutagenesis [SMKS94], and pyrimidines [KMS92] are applications from Molecular Biology. Mesh [DBJ97] is an application from the area of Mechanical Engineering. None of the applications has a generally "accepted" target theory.

Application                  |E+|     |E−|     |B|

Carcinogenesis [SKMS97]      182      155      38
Mesh [DBJ97]                 2841     290      29
Mutagenesis [SMKS94]         125      63       21
Pyrimidines [KMS92]          1394     1394     45

Table 3.4: Applications characterization.

Algorithms and Machines

All experiments were performed on an Intel Pentium M 1600MHz computer with 1GB of RAM, running the Linux Fedora OS version 5 with kernel 2.6.15. The Aleph ILP system used was version 5∗, and the April ILP system version 0.9. Both systems are implemented in Prolog and ran on the YAP Prolog system version 5.1.

∗Available at http://www.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/aleph.pl


Settings

Table 3.5 shows the main settings used for each application. The parameter nodes specifies an upper bound on the number of rules (nodes-restriction) generated while searching for a rule. The i-depth [MF92] corresponds to the maximum depth of a literal with respect to the head literal of the rule. The parameter CL defines the maximum length that a rule can have, i.e., the maximum number of literals in a clause. MinAcc specifies the minimum accuracy that a rule must have in order to be accepted as good. Finally, the noise parameter defines the number of negative examples that a rule may cover and still be accepted.

Application    i-depth    Nodes    Noise    MinAcc    CL

Carc           4          50000    8        70%       10
Mesh           4          20000    3        85%       8
Mut            3          700      5        70%       4
Pyr            3          10000    12       85%       10

Table 3.5: Settings used by April and Aleph for each application.

Both ILP systems were configured to perform node-restricted breadth-first branch-and-bound top-down search to find a rule. The search was guided using a heuristic that relies on the number of positive and negative examples.

3.6.3 Methodology

We used a 10-fold cross-validation methodology∗. For each dataset an ILP system was applied and the time, the accuracy of the rule found, and other statistics were recorded.

Both systems use the same underlying algorithm to build a theory (an MDIE-based covering algorithm) and can be configured to use similar search techniques to find a clause. Therefore, with the same (or equivalent) settings the theories produced by both systems should be similar. However, small differences in the order by which clauses are added to the theory, or even small differences in the clauses, may significantly affect the execution time because they may increase or decrease the amount of search performed.

∗In a k-fold cross validation, the dataset is randomly partitioned into k subsets and k datasets (folds) are generated by using 1 subset as testing set and the remaining subsets as training set. The k results obtained from applying an algorithm to the data are averaged to produce a single result.


Since efficiency is the main concern of the comparison, both systems were configured to find a single rule instead of a theory. This avoids the above mentioned problem, and we effectively evaluate the performance of both systems since they perform almost the same search - in fact, this guarantees that they generate the same number of clauses, although not exactly the same ones due to the ordering imposed on the clauses. Moreover, in order to perform a fair comparison, we did not activate features that were not available in both systems and that could affect performance (e.g., tabling, use of efficient data structures, etc.).

3.6.4 Performance Evaluation: April vs Aleph

Figure 3.3 presents the average values of execution time, predictive accuracy and memory usage obtained by each system on each application using a 10-fold cross-validation. Further details are given in Appendix B.1. The results show that April has execution times comparable to Aleph in the Mut application, is slower in the Pyr application, and is considerably faster in the other two (the differences are presented in Table B.1).

There are still improvements that can be made. For instance, coverage caching in April is implemented using YAP's internal database, meaning that while evaluating a clause several inserts and deletes are performed in the database to update some values. It has been shown [FCSC03] that coverage caching effectively reduces the theorem proving effort, but the high cost of accessing YAP's internal database prevents April from achieving better results. An improvement, currently under development, is to use a special purpose data structure for the cache that avoids the use of YAP's internal database.

April not only compares favorably with Aleph as far as execution time is concerned, but also finds clauses of comparable quality. Details are given in Figure 3.3 b). This was expected since both systems use the same search technique to traverse the search space. A notable exception is observed in the Mut application, where April obtains considerably better accuracy. This could be explained by the different ordering of clauses used during the search. This, in conjunction with the fact that a low number of clauses is generated (due to the nodes value), may prevent Aleph from finding a better clause (similar or equal to the one that April finds).

Figure 3.3: Comparison of April and Aleph in terms of average execution time, accuracy, and memory usage (panels: a) Time; b) Accuracy; c) Memory usage).

Another issue concerning efficiency is memory usage. Figure 3.3 c) illustrates the memory usage of both systems (Table B.3 tabulates the values). Although both systems are implemented in YAP Prolog, April's memory usage is considerably lower than Aleph's. The main reason for the reduction must reside in the relatively low weight of the cache (coverage lists) in April's memory usage, as depicted in Figure 3.4. Two factors contribute to the reduction. First, April uses bitmaps to represent coverage lists∗ whilst Aleph uses Prolog lists. Second, April shares coverage lists between equivalent clauses whilst Aleph does not. These two factors combined explain the difference in memory usage between the two systems. However, Aleph's approach is faster since it does not perform lookups in the database to find the coverage list of a clause. We should point out that April's memory usage can be further reduced, with a small overhead in execution time (around 1%), by using tries to store information about each clause generated (e.g., the Prolog clause, the list of variables in the clause, etc.) [FRCS03].

∗Coverage lists can also be represented in April by the use of RL-Trees [FRCS03], which are slightly slower than bitmaps but, on average, consume less memory.

Figure 3.4: Average memory usage of April in coverage lists (cache), search space (clauses' data) and other components of the system (including the Prolog engine memory usage).

We may justifiably conclude, from the results presented, that April is comparable to Aleph regarding execution time and accuracy. However, as far as memory usage is concerned, April is considerably better than Aleph. This suggests that for larger applications (larger sets of examples or greater search spaces) the April system should behave much better, as it would be more scalable than Aleph.



Figure 3.5: Average distribution of April's execution time over the four main components (saturation, clause generation, clause evaluation, and clause selection) when performing a top-down search.

3.6.5 Performance Analysis of April

We carried out a profiling analysis to determine which components of the April system should be optimized for speed. We considered four main components of April's algorithm: saturation, clause generation, clause evaluation, and clause selection. Using the same methodology, we ran April on each dataset, but this time we recorded the time spent in each part of the code.

Figure 3.5 presents the average times obtained for each application. Overall, most of the time is spent in evaluating clauses, and a smaller amount of time in generating them. The time required to perform saturation and clause selection is negligible. The Carc application is an exception - most of its time is spent on clause generation. We conjecture that this is because the dataset is relatively small in terms of number of examples and the background knowledge has little non-determinism. Therefore, clause evaluation is done quickly for each example.

In conclusion, the results show that most of the execution time is spent in evaluating clauses and, in a smaller percentage, in generating clauses. This suggests that efforts to improve efficiency should focus mainly on clause evaluation. It is also relevant to point out that clause selection would have a bigger weight in the execution time if a theory were learned (instead of a single rule) and if the strategy to select clauses involved a search through a space of theories.


3.7 Related Work

April can address predictive ILP tasks (like FOIL, Golem, Progol, Aleph, Indlog, or MIS). April follows a learning from entailment semantics in the line of FOIL, Golem, Progol, Indlog, and Aleph, and uses an intensional notion of coverage (as opposed to the extensional notion used in FOIL).

April traverses the generalization lattice like MIS, FOIL and Progol. Like Progol, Aleph, and Indlog, April generates an initial clause to bound the generalization lattice, thus reducing the search space. Unlike FOIL and Golem, April can handle non-ground background knowledge. In the line of many other ILP systems (e.g., MIS, Indlog, Skilit [JB95], CILS [AF99], Aleph, ICL), April is implemented using the Prolog language.

April is especially related to the Aleph [Sri03] and Indlog [Cam00] systems. As in April, the core algorithm used in these systems is based on Mode Directed Inverse Entailment (MDIE), a technique initially used in the Progol [Mug95] system. April further differs from those two systems at the implementation level by using specific data structures, such as RL-trees and Tries, in order to reduce memory usage and improve execution time. April implements many of the features found in Aleph and Indlog. Due to this close relation, April maintains high compatibility with the parameters and input file formats of Indlog and Aleph.

The main differences between April and Indlog are in the strategy of the search algorithm and in bottom clause construction. Indlog uses iterative deepening or best-first search, while April uses the same search methods plus several others, including randomized search methods.

Indlog's Incremental Language Level Search [Cam02] strategy is also available in April and can be enabled by setting the language_level parameter.

Finally, a main difference between April and other systems lies in April's ability to explore several parallelization strategies (as described in the next chapters).

3.8 Summary

In this chapter we presented the April ILP system. April's performance was evaluated on four applications and compared to the Aleph ILP system. The results show that


April is competitive in terms of execution time and memory usage. This gives us confidence to use it as the sequential benchmark when comparing to implementations of parallel algorithms.

A performance analysis of April showed that most of the execution time, when performing a top-down search, is spent in evaluating clauses and, in a smaller percentage, in generating clauses. This result suggests that the efforts to improve efficiency should be focused mainly on clause evaluation.

April is still work in progress, and for this reason the list of future work is extensive. For instance, when performing a top-down search, April finds the shortest clause with the largest cover for each example saturated. However, there is no guarantee that the final set of clauses will have the minimum number of literals. Finding the best set of rules would require collecting all good clauses for each example saturated, and then selecting the best subset with the minimum number of literals. By changing some of April's parameters this could be done but, currently, the clause selection procedure is not very efficient. Future research includes improving the procedure that combines rules in order to find the best combination according to some evaluation metric.

At the implementation level, some improvements should be made to the module interfaces. Currently, some modules are too tightly coupled, so fixing a bug or adding a new feature may require changing several modules. Therefore, as future work we plan to improve some module interfaces and minimize their coupling (decoupling them completely, if possible). To improve the performance of driver modules we are working on incorporating conditional compilation and macro substitution in the Prolog engine.


The control of a large force is the same principle as the control of a few men: it is merely a question of dividing up their numbers.

Sun Tzu

4 Parallelization Strategies for ILP

In this chapter we survey the state-of-the-art research on parallel ILP. We implemented several parallel algorithms and studied their performance using well-known applications under the same test environment. We also conducted an empirical study to assess whether the parallel approaches produce better theories in less time than a sequential randomized search, i.e., to assess which approach (deterministic versus randomized) produces the best results.

4.1 Introduction

Many approaches have been proposed to scale up and improve the performance of ILP systems (as referred in Section 2.5). One of the approaches is to exploit parallelism. ILP may profit from parallelism by: i) significantly decreasing learning time and ii) improving the quality of the induced models. Still, the exploitation of parallelism introduces several challenges. Designing and validating a parallel algorithm is often harder than designing and validating a sequential one. Performance issues are complex: splitting work into too many tasks may introduce significant overheads, whereas using fewer tasks may result in load imbalance and bad speedups.


In this chapter we survey the state-of-the-art research on parallel ILP. The many implementations described in the literature are succinctly presented together with reported results. A comparison of the algorithms based solely on previously reported results is hard since they were observed on different systems, applications, and platforms. We therefore implemented several parallel algorithms, realizing the main parallelization strategies that we have identified, and studied their performance using well-known applications and the same test environment (i.e., the same underlying ILP system and the same parallel architecture). An empirical study was also conducted to assess whether the solutions of the parallel runs are better than the ones produced by a random search, i.e., which approach produces better results given the same amount of time. The results gathered are presented and discussed.

The remainder of this chapter is organized as follows. The next section briefly introduces parallelism concepts. Section 4.3 describes the main strategies to exploit parallelism in ILP systems and Section 4.4 presents a survey on parallel ILP implementations. Section 4.5 describes the implemented parallel algorithms. Finally, Section 4.6 presents and analyzes the results of an empirical evaluation of the algorithms described.

4.2 Parallelism

By parallelizing an algorithm one aims to improve its performance. Several tools and platforms have been produced that attempt to simplify the development of parallel programs. Despite such tools, designing efficient parallel algorithms is still, in general, a non-trivial task. The main challenges associated with the design of parallel algorithms include minimizing I/O, synchronization, and communication, effective load balancing, good data decomposition, and minimizing/avoiding duplication of work.

In order to clarify the discussion about parallel algorithms, we shall first briefly define common terms [GGKK03]. The process of dividing a computation into logically independent, high-level, smaller parts is called decomposition. Tasks are the smaller parts that result from the decomposition. Tasks can be of arbitrary size, but once defined, they are regarded as indivisible units of computation. Different tasks can have different sizes. The simultaneous execution of multiple tasks is the key to exploiting parallelism efficiently. A task is dependent on one or more tasks if it uses data produced by them, and thus needs to wait for those tasks to finish execution. Synchronization is used to coordinate the execution of tasks to ensure correctness and avoid deadlocks. Parallel tasks are independent tasks that can be executed simultaneously (concurrently). The


maximum number of tasks that can be executed simultaneously, at any time, in a parallel algorithm determines its degree of parallelism. The granularity of a task is a measure computed as the ratio between the amount of computation done in a parallel task and the amount of communication. The scale of granularity ranges from fine-grained (very little computation per communication byte) to coarse-grained (extensive computation per communication byte). The finer the granularity, the greater the limitation on performance, due to the amount of synchronization needed.

The first step to exploit parallelism in ILP is to devise a parallel algorithm. In general, we start with a sequential algorithm that performs some computation to obtain a result, and parallelize it by splitting the computation evenly among the available processors; each processor executes part of the computation, and the partial computations are then combined, in some way, to obtain the final result. The ultimate goal of the parallelization process is to improve performance, i.e., to minimize the execution time.

One would expect that increasing the number of processors would result in a proportional decrease of the execution time of a program. In practice, this is rarely observed due to the overheads associated with parallelism. There are three major sources of overheads: interprocess communication, idling (e.g., as a result of synchronization points in the computation or load imbalance), and extra computation (e.g., scheduling, packing/unpacking of messages, etc.).

A sequential algorithm is usually evaluated using its execution time (sometimes expressed as a function of the size of the input data). The execution time of a parallel algorithm depends on the number of processors used, on the amount and speed of interprocess communication, and on the size of the input data.

A number of performance metrics have been devised to be used in the study of parallel algorithms performance [GGKK03]. The serial runtime (T_S) of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. The parallel runtime (T_P) is the elapsed time from the beginning of the parallel computation until it ends. The speedup (S) is the most often used measure when studying the performance of parallel algorithms. It captures the relative benefit of solving a problem in parallel and is defined as the ratio between the time taken to solve a problem on a single processor and the time required to solve the same problem on a parallel computer with p identical processors:


S = T_S / T_P

Theoretically, the speedup can never exceed the number of processors p. This is because many algorithms are essentially sequential in nature∗. However, in practice, a speedup greater than p, called super-linear speedup, is sometimes observed. This happens when the work performed by the sequential algorithm is greater than that performed by its parallel version, or due to hardware features that slow down the sequential algorithm (for instance, as a result of using slower memory, i.e., disk).

A parallel algorithm is said to be scalable if the speedup increases with the number of processors. A parallel algorithm is also said to be scalable if it maintains a fixed execution time while the number of processors and the size of the problem increase proportionally.
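To make these metrics concrete, here is a small illustrative Python helper (ours, not part of April) that computes the speedup and the Amdahl bound discussed in the footnote on Amdahl's law; the example timings are hypothetical:

    def speedup(t_serial, t_parallel):
        # S = T_S / T_P: relative benefit of the parallel execution.
        return t_serial / t_parallel

    def amdahl_bound(s, p):
        # Maximum speedup on p processors when a fraction s of the
        # computation is inherently sequential (Amdahl's law).
        return 1.0 / (s + (1.0 - s) / p)

    if __name__ == "__main__":
        print(speedup(120.0, 40.0))   # e.g., 120s serial vs 40s parallel -> 3.0
        print(amdahl_bound(0.1, 5))   # the footnote's worked example: ~3.57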

4.3 Strategies

Parallel algorithms divide the work (computation) among the available processors with the goal of finding a solution more quickly. The main difficulty faced by developers is how to divide the work so as to maximize efficiency. Ideally, one would want to divide the computation and data evenly and, at the same time, minimize the communication among processors, striving for coarse-grained parallelism.

We classify the approaches to exploit parallelism in ILP systems described in the literature into four main strategies: parallel exploration of independent hypotheses; parallel exploration of the search space; parallel coverage tests; and parallel execution of an ILP system over a partition of the data. Surely, one could consider other views; however, we consider that these cover the main approaches to exploit parallelism in an ILP system.

∗Amdahl's law [Amd67] is used to estimate the maximum expected improvement of an overall system when only a part of the system is improved. In the case of parallelizing an algorithm, the law states that if S is the fraction of a computation that is sequential (i.e., cannot be parallelized) and (1 − S) is the fraction that can be parallelized, then the maximum speedup that can be achieved when using p processors is 1/(S + (1 − S)/p). In the limit, as p tends to infinity, the maximum speedup tends to 1/S. For instance, if 90% of a calculation can be parallelized (i.e., 10% is sequential) then the maximum speedup on 5 processors is 1/(0.1 + (1 − 0.1)/5), or roughly 3.6 (i.e., the program can theoretically run 3.6 times faster on five processors than on one). For this reason, parallel computing is only useful for either small numbers of processors, or problems with very low values of S: so-called embarrassingly parallel problems. A great part of the craft of parallel programming consists in attempting to reduce S to the smallest possible value.


A parallel algorithm may also combine several strategies. Each strategy is now described.

4.3.1 Parallel Exploration of Independent Hypotheses

The strategy of parallel exploration of independent hypotheses works as follows. Let n be the number of class values of the target concept. Learning each class value is an independent task and can be performed in parallel. This strategy requires that each processor owns a replica of the whole data.

Parallel exploration of independent hypotheses has a major drawback: it is not a general approach. It is adequate only for applications where the target concept is composed of several independent class values. Learning a definition of the target concept can be seen as learning several subconcepts, where each subconcept corresponds to a class value. Since the induction of subconcepts is inherently independent (the clauses are disjoint), it can easily be performed in parallel. For example, consider the task of learning a predicate that classifies emails into categories according to priority. Consider that priority is encoded by the predicate priority(+Email,-Priority), where Priority ∈ {low, medium, high}. The learning task can thus be divided into 3 subtasks: one learning task for priority(+Email,low), another for priority(+Email,medium), and a third for priority(+Email,high).

The degree of parallelism of this strategy corresponds to the number of subconcepts. The granularity is very high, since learning each subconcept corresponds to calling an ILP system for each of the n subconcepts. Although the size of the parallel tasks can be large, they may be considerably unbalanced, leading to lower speedups. Moreover, parallelism is always limited by the number of classes of the target concept.
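As a minimal illustration of this strategy (the systems discussed in this thesis are written in Prolog; this Python sketch and its learn_subconcept stub are ours), one independent learning task is launched per class value of the email-priority example above:

    from multiprocessing import Pool

    CLASS_VALUES = ["low", "medium", "high"]   # class values of the target concept

    def learn_subconcept(class_value):
        # Hypothetical stand-in for a complete ILP run restricted to one
        # class value, e.g., learning priority(+Email, low). Each such run
        # would need a replica of the whole data.
        return (class_value, f"priority(Email, {class_value}) :- ...")

    if __name__ == "__main__":
        # One independent, coarse-grained task per subconcept; parallelism
        # is limited by the number of class values.
        with Pool(processes=len(CLASS_VALUES)) as pool:
            theories = dict(pool.map(learn_subconcept, CLASS_VALUES))
        for class_value, rule in theories.items():
            print(class_value, "->", rule)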

4.3.2 Parallel Exploration of the Search Space

The search for a hypothesis involves traversing a search space in some way (e.g., top-down, bottom-up, bidirectional). The strategy of exploring the search space in parallel involves some division of the search space among the processors. Each processor then explores, in parallel, its part of the search space to find a suitable hypothesis.

The degree of parallelism and the granularity of this strategy depend on the approach adopted to divide the search space.


4.3.3 Data Parallelism

Data parallelism consists in partitioning the data into subsets and assigning each subset to a processor. Each processor applies an algorithm (or part of an algorithm, e.g., the coverage test), or the whole sequential ILP algorithm, to its local data. Data partitioning is usually performed only at the beginning of the execution, because it is often expensive to reassign the data during execution.

When a sequential ILP algorithm is applied to a subset of the training data a problem arises: the hypotheses may be locally consistent and complete, but they may not be globally consistent. A solution to this problem may involve sharing the locally good hypotheses among all processors to obtain a global view. Another problem that results from partitioning the set of positive examples is the possibility of not learning recursive rules [DDR95]. The only solution to this problem is to replicate the set of positive examples on all processors while dividing the set of negative examples.

The degree of parallelism of this strategy depends on the size of the data. The granularity depends on the algorithm applied to the dataset and on the size of the data.
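The two partitioning schemes mentioned above can be sketched as follows (a minimal illustration assuming in-memory example lists; the helper names are ours):

    def partition(items, p):
        # Split items into p nearly equal, disjoint subsets (round-robin).
        return [items[i::p] for i in range(p)]

    def split_all(pos, neg, p):
        # Plain data parallelism: positives and negatives are both divided.
        return list(zip(partition(pos, p), partition(neg, p)))

    def split_neg_replicate_pos(pos, neg, p):
        # Variant that replicates the positives on every processor and
        # divides only the negatives (preserving recursive rules).
        return [(pos, chunk) for chunk in partition(neg, p)]

    if __name__ == "__main__":
        pos, neg = list(range(10)), list(range(100, 106))
        print(split_all(pos, neg, 3))
        print(split_neg_replicate_pos(pos, neg, 3))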

4.3.4 Parallel Coverage Tests

As seen in the previous chapter, a considerable part of the execution time of April, and of ILP systems in general, is spent performing coverage tests. The time to compute a hypothesis' coverage depends on the cardinality of E+ and E−. Each example can be independently tested to determine if it is entailed by a rule c, the background knowledge B, and the partial theory H. The parallel coverage test strategy consists in performing the coverage test in parallel, i.e., for each example e ∈ E the coverage test (B ∧ H ∧ c ⊢ e) is performed in parallel.

The degree of parallelism depends on the number of examples evaluated in parallel by each processor. The granularity is relatively low but can be enlarged either by increasing the number of examples in each processor and/or by evaluating several rules in parallel instead of a single one.
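A minimal sketch of this strategy, in illustrative Python with a stub covers(rule, example) test (a real worker would instead ask the Prolog engine whether B ∧ H ∧ c ⊢ e):

    from functools import partial
    from multiprocessing import Pool

    def covers(rule, example):
        # Stub coverage test standing in for the entailment check.
        return example % rule == 0

    def count_covered(rule, examples):
        # Coverage of one rule on a local subset; each example can be
        # tested independently of the others.
        return sum(1 for e in examples if covers(rule, e))

    if __name__ == "__main__":
        rule, p = 3, 4
        examples = list(range(1, 1001))
        chunks = [examples[i::p] for i in range(p)]
        with Pool(p) as pool:
            # The partial counts from the p subsets are summed to obtain
            # the rule's global coverage.
            counts = pool.map(partial(count_covered, rule), chunks)
        print(sum(counts))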


4.4 Parallel ILP Systems

We next survey the parallel ILP implementations, focusing on the strategy used and on the reported results.

The first parallel ILP system we are aware of is Claudien [DDR95]. It follows a strategy based on the parallel exploration of the search space, where each processor keeps a pool of clauses to specialize and shares part of them with idle processors (processors with an empty pool). In the end, the p sets of clauses found (one set on each processor) are combined and returned as the solution. One should note that Claudien follows a non-monotonic setting of ILP instead of the normal ILP setting. The parallel system was evaluated on a shared-memory computer with two datasets and exhibited a linear speedup up to 16 processors.

Matsui et al. [MISI98] evaluated and compared two algorithms based on data parallelism (background knowledge and the set of examples) and, what they called, parallel exploration of the search space. The latter approach consisted in evaluating, in parallel, the refinements of a clause, therefore corresponding to a strategy based on parallel coverage tests. The two strategies were implemented in the FOIL [QCJ93] system and were evaluated on a distributed-memory computer using the trains dataset [Mic80]. The results of the "search space parallel approach" showed very low speedups. The reason advanced by the authors for the poor results was that the sizes of the divided tasks may not all be the same, hence reducing efficiency. The two approaches based on data parallelism (background knowledge and the set of examples) showed a linear speedup up to 4 processors. The speedup decreased above 4 processors as a result of an increase in communication due to the exchange of the training set.

Ohwada and Mizoguchi [OM99] implemented an algorithm (based on Inverse Entailment) using a parallel logic programming language that explored three types of parallelism: parallel coverage tests; parallel exploration of independent hypotheses; and parallel exploration of the search space (each processor followed a branch of the search space). The parallel system was applied to three variants of an email classification dataset and the experiments performed evaluated each strategy. The results on a shared-memory parallel computer showed a non-linear speedup for all strategies. The strategy that appeared to yield better results, on average, was parallel coverage tests.

Ohwada et al. [ONM00] implemented an algorithm that explores the search space in parallel. The job allocation (the set of nodes to be explored) was dynamic and was


implemented using contract-net [Smi80] communication. The parallel system was evaluated on two applications and showed an almost linear speedup on a ten-processor parallel machine.

Wang and Skillicorn [WS00] parallelized the Progol [MF01] system by partitioning the data and applying a sequential algorithm to each partition. The data partition method consisted in dividing the positive examples among all processors and replicating the negative examples. Each processor applies the sequential algorithm on its local data to find a locally good clause. Such a clause is then shared among all processors to evaluate its quality on the whole training set. If a processor considers that a clause is globally good, it exchanges this information with all processors, so that all processors may add the clause to the local theory and remove the examples explained by it. It is important to point out that this algorithm is not complete when compared to the sequential algorithm, i.e., the theory found by the parallel algorithm may be different from the one found by the sequential algorithm. The evaluation of the algorithm focused solely on the speedup; it did not evaluate the impact on accuracy. They reported linear and super-linear speedups in their experiments with three datasets. The experiments were performed on shared-memory machines (with 4 and 6 processors).

Graham et al. [GPK03] implemented a parallel ILP system using the PVM [pvm] message-passing library. They employed data partitioning and parallel coverage tests of parts of the search space on each processor. They reported an almost linear speedup up to 16 processors on a shared-memory machine.

Konstantopoulos [Kon03] implemented a data-parallel version of the Aleph ILP system [Sri03] using the MPICH [GLDS96] implementation of the MPI [For94] library. His algorithm performs the coverage test evaluation in parallel. This algorithm, although very similar to Graham et al.'s, evaluates only a single clause at a time in parallel, while Graham et al. evaluate a set of clauses. The smaller granularity of the parallel tasks in Konstantopoulos' algorithm is, probably, the main reason for the poor results reported. Another difference between the two concerns the target computer architecture: Graham et al. [GPK03] performed the experiments on a shared-memory computer while Konstantopoulos used a distributed-memory computer.

Wielemaker [Wie03] implemented a parallel version of Aleph for shared-memory machines. The strategy adopted was the parallel exploration of the search space. The algorithm exploits parallelism by concurrently executing several randomized local searches [vSP02]. The implementation was evaluated on the Carcinogenesis [SKMS97]


Strategy                    Arch.                Speedup/#procs.              Reference
Parallel exploration of     Shared Memory        2/6                          [OM99]
independent hypotheses
Parallel exploration of     Shared Memory        linear/16                    [DDR95]
the search space                                 3/6                          [OM99]
                                                 7/16                         [Wie03]
                                                 8/10                         [ONM00]
Data Parallelism            Distributed Memory   4/15 (linear up to 5)        [MISI98]
                                                 not reported                 [CK03]
                            Shared Memory        linear and super-linear/6    [WS00]
                                                 5/8                          [GPK03]
Parallel coverage tests     Distributed Memory   1/15                         [MISI98]
                                                 no speedup                   [Kon03]
                            Shared Memory        4/6                          [OM99]
                                                 5/8                          [GPK03]

Table 4.1: Summary of the parallel ILP implementations and reported results.

dataset. The Aleph system was configured to perform 16 random restarts and made 10 moves per restart on each processor. The reported speedups (e.g., 7 on 16 processors) can be considered low when compared to other shared-memory implementations. Notwithstanding the results, this is an interesting proposal that could accomplish better results if the granularity of the tasks were enlarged. This can easily be accomplished by increasing the number of moves or the number of restarts done by each thread.

PolyFarm [CK03] is a parallel ILP system for the discovery of association rules targeted at distributed-memory computers. The system follows a master-worker scheme. The master generates the rules and reports the results. The workers perform the coverage tests of the set of rules received from the master on the local data. The counts are aggregated by a special type of worker (Merger) that reports the final counts to the master. No performance evaluation of the system was presented.

Table 4.1 summarizes the survey by presenting, for each parallelization strategy, the targeted computer architecture, the reported results, and the reference where more details can


be obtained. The first observation concerns the fact that the majority of the parallel implementations were made for shared-memory architectures, where the cost of data transmission is lower when compared to distributed-memory architectures. Notwithstanding the high cost of communication, parallel ILP systems running on distributed-memory computers may still achieve good speedups. The worst results reported were observed when using the parallel coverage test strategy. The results reported with this strategy differ considerably depending on whether the target architecture is shared memory or distributed memory. The poor results on distributed-memory computers can be explained by the higher communication cost not being compensated by the granularity of the tasks.

Even though most implementations just described were targeted at shared-memory machines, we share the view of the recent work reported in [Kon03, GPK03], that is, to target distributed-memory architectures when parallelizing ILP systems. To this end, one should favor coarse-grained strategies to parallelize ILP systems.

4.5 Parallel Algorithms

In this section we describe several parallel ILP algorithms that we implemented and that illustrate the strategies overviewed in the previous sections. All algorithms follow a master-worker scheme. One of the processors is designated as master and the remaining ones as workers. At the beginning of the execution each worker enters a loop and waits for requests from the master. The tasks executed by a worker are received through messages. After receiving a message, a worker executes the task contained in it. Task results are sent back to the sender (the master) after task completion. For each algorithm, we refer the reader to Figure 4.1 on page 126 for a schema of the messages exchanged between the master and the workers.

4.5.1 Parallel Coverage Tests

The parallel coverage tests algorithm (PCT) exploits parallelism by dispatching clauses to workers for evaluation on the local subset of examples. The master's algorithm is similar to the covering algorithm of Section 2.3.1 with three main changes: first, the examples are divided evenly among the processors at the beginning of the execution (this could be done in the first line of the covering algorithm) and are then loaded by each worker; secondly, line 7 of learn_rule is changed to


broadcast(evalOnExamples(NewRule))

Val = collectAndCombine()

where broadcast() is a procedure that sends a command (in a message) to all workers to be executed; each worker executes the command received and returns the result to the master. This means that each worker evaluates a rule on its local set of examples and then returns the coverage value to the master. The master collects and combines the coverage information using the collectAndCombine() procedure. Thirdly, the removal of covered examples is performed in parallel on all workers (steps 5 and 6 of the covering algorithm).
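The message flow of PCT can be sketched with mpi4py (an illustration only: the actual implementation is written in Prolog on top of an MPI library, and covers() is a stub we introduce here):

    # Run with, e.g., mpirun -np 4 python pct_sketch.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    def covers(rule, example):
        # Stub coverage test; a real worker would query the Prolog engine.
        return example % rule == 0

    # The master (rank 0) divides the examples; each process loads one subset.
    chunks = None
    if rank == 0:
        examples = list(range(1, 1001))
        chunks = [examples[i::size] for i in range(size)]
    local = comm.scatter(chunks, root=0)

    # broadcast(evalOnExamples(NewRule)): the master sends the rule to all.
    rule = comm.bcast(3 if rank == 0 else None, root=0)

    # Each process evaluates the rule on its local examples; the master
    # combines the partial counts (collectAndCombine()).
    total = comm.reduce(sum(1 for e in local if covers(rule, e)),
                        op=MPI.SUM, root=0)
    if rank == 0:
        print("global coverage:", total)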

This algorithm is basically the algorithm implemented by Konstantopoulos [Kon03]. However, there are two main differences at the implementation level: i) we used asynchronous message-passing communication for all operations involving the sending of a message, while Konstantopoulos only used synchronous message-passing operations; ii) our implementation used LAM as opposed to the MPICH MPI implementation used by Konstantopoulos.

4.5.2 Parallel Stochastic Clause Selection

The Parallel Stochastic Clause Selection (PSCS) algorithm is a parallelized version of Srinivasan's Stochastic Clause Selection [Sri00]. The idea of the algorithm is to randomly select a fixed-size sample of clauses from the search space. The sample will contain a clause in the top k percentile with some probability α (both defined by the user). The best clause from the sample is selected. The sequential stochastic clause selection algorithm was described in detail in Section 2.4.1.

The master's algorithm is similar to the covering algorithm described in Section 2.3.1. The main differences are: i) at the beginning of the execution the master replicates the data among all workers; ii) the procedure outlined in Figure 2.7 on page 63 is changed in the following way: the master randomly draws a set of clauses from the search space and then distributes the clauses evenly among the workers. Each worker evaluates the subset of clauses received on the local data and, after evaluating all clauses, sends the best one to the master. The master then receives the best clauses found by each worker. The best rule received is then returned as the result of the learn_rule_scs() procedure.

The proposed parallel version of the SCS algorithm is similar to the PCT algorithm.


[Figure 4.1 appears here: message-passing diagrams between the master and the p workers for a) Parallel Coverage Tests (PCT), b) Parallel Stochastic Clause Selection (PSCS), and c) Data Parallel Learn Rule (DPLR).]

Figure 4.1: Simplified schemes of the messages exchanged by the parallel algorithms. Solid lines represent the execution flow, horizontal dashed lines represent message passing between the processes, and vertical dashed lines represent idleness. The algorithms are ordered by the granularity of their parallel tasks, from the finest-grained to the most coarse-grained.


[Figure 4.1, continued: d) Parallel Randomized Rapid Restarts (PRRR) and e) Data Parallel ILP (DPILP).]

Figure 4.1: (continued).


In the PCT algorithm a single rule is evaluated in parallel (on each subset of data), while in the PSCS algorithm a set of rules is evaluated in parallel. Therefore, the granularity of this algorithm is larger than that of the PCT algorithm.

Another possible approach to parallelize the SCS algorithm consists in generating and evaluating the set of rules in parallel. More specifically, each worker would keep a replica of all the data and generate N/p rules, where N is the number of rules generated by the SCS algorithm. The best clause found by each worker would then be sent to the master which, after receiving all rules, selects the best one. This approach has the drawback of possibly generating redundant work, since a clause may be generated and evaluated in more than one worker.
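The work division of PSCS can be sketched as follows (illustrative Python with a stub score(); in the real system each worker holds a replica of the data and evaluates Prolog clauses):

    import random
    from multiprocessing import Pool

    DATA = list(range(1, 501))          # stand-in for the replicated data

    def score(clause):
        # Hypothetical evaluation of one sampled clause on the local replica.
        return sum(1 for e in DATA if e % clause == 0)

    def best_of(clauses):
        # A worker scores its share of the sample and returns its local best.
        return max(clauses, key=score)

    if __name__ == "__main__":
        random.seed(0)
        sample = random.sample(range(2, 100), 40)   # clauses drawn at random
        p = 4
        shares = [sample[i::p] for i in range(p)]   # even split of the sample
        with Pool(p) as pool:
            local_bests = pool.map(best_of, shares)
        # The master keeps the best clause among the workers' candidates.
        print("best clause:", max(local_bests, key=score))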

4.5.3 Data Parallel Learn Rule

The data parallel learn rule algorithm (DPLR) is based on the Wang et al. [WS00] algorithm, referred to in the previous section, and is outlined in Figure 4.2.

The algorithm consists of the following steps: 1) divide the set of positive examples evenly among all processors and replicate the negative examples; 2) learn p rules in parallel (line 3 of the covering algorithm) starting at different points of the search space (using different seeds), where p is the number of workers available; 3) exchange the rules found among all processors to obtain their coverage values on the whole training set; 4) select a rule and mark the examples covered by the rule on each subset. The addRule2theory(R) procedure performs steps 5 and 6 of the covering algorithm of Section 2.3.1, i.e., it adds the rule R to the set of rules learned, removes the set of positive examples covered by R, and returns the number of examples covered on the local subset.

Note that whilst the first algorithm described (PCT) returns the same solution as the sequential algorithm, this algorithm may not return the same solution due, mainly, to the different order by which the rules are found and added to the theory.

4.5.4 Parallel Randomized Rapid Restarts

The Randomized Rapid Restarts (RRR) algorithm performs an exhaustive search up to a certain point and then, if a solution is not found, restarts at a different part of the search space [vSP02]. The algorithm performs several (maxtries) short searches, each time-bounded (by maxtime). Both maxtries and maxtime are parameters defined by the


dplr(E)
Input: a set of examples E.
Output: a set of consistent rules.

1.  Rules_Learned = ∅
2.  E+ = POSITIVES(E), E− = NEGATIVES(E)
3.  Remaining = |E+|
4.  <(E+_1, E−), ..., (E+_p, E−)> = partition E+ into p subsets
5.  broadcast(load_data())
6.  while Remaining > 0 and other stopping condition not met do
7.      RulesBag = collect(broadcast(learn_rule()))
8.      while RulesBag ≠ ∅ do
9.          Results = collect(broadcast(evaluate(RulesBag)))
10.         R = pickBest(RulesBag)
11.         RulesBag = RulesBag \ {R}
12.         Rules_Learned = Rules_Learned ∪ {R}
13.         Remaining = Remaining − Σ_{k=1..p} collect(broadcast(addRule2Theory(R)))
14.     end while
15. end while
16. return Rules_Learned

Figure 4.2: The data parallel learn rule algorithm (DPLR).


user. Each search for a clause begins by randomly selecting a (starting) clause and then performing a deterministic radial best-first search.

Parallel Randomized Rapid Restarts (PRRR) is a parallel version of the RRR algorithm (described in Section 2.4.2). The PRRR master's algorithm is similar to the sequential covering algorithm (outlined in Figure 2.3) with two differences: i) the master replicates the data to all workers at the beginning of the execution; ii) the removal of covered examples (steps 5 and 6 of the covering algorithm) is performed in parallel on all workers. The remaining changes are done in learn_rule_rrr() and are described next.

prrr(k, E, C, maxtries, maxtime)
Input: the maximum number of rules to return (k), a set of examples (E), constraints (C), an upper bound on the number of rapid searches performed (maxtries), and an upper bound on the time that a rapid search may take (maxtime).
Output: a set of good rules.

1.  tries = 1
2.  Rules = ∅
3.  while tries < maxtries do
4.      Worker = next_free_worker()
5.      select a random starting clause c0
6.      perform an exhaustive radial search starting at c0 on Worker during maxtime
7.      if rules_received() then break endif
8.      tries = tries + 1
9.  end while
10. Rules = Rules ∪ collect()
11. return bestOf(k, Rules)

Figure 4.3: A high-level description of a procedure to perform parallel randomized rapid restarts. The differences to the sequential version of the algorithm (Figure 2.8) are underlined.

Figure 4.3 outlines the PRRR algorithm, a parallel version of the RRR algorithm outlined in Figure 2.8 on page 64, which performs maxtries searches in parallel, each bounded in time by maxtime. Each search is performed by a different worker. The next_free_worker() procedure returns the identifier of an idle worker,


or it blocks until one worker becomes available. Then a random clause is selected from the search space and one message is sent to the selected worker to start a radial search using the selected clause as a starting point. The result of the search is sent to the master. The master checks if a rule has been received from one of the workers by invoking the rules_received() procedure. If a rule has been found, the loop is interrupted. Then, the master collects all rules found in the other searches that were launched in parallel. Finally, the best rule found is selected and returned as the output of the search.

The granularity of the parallel tasks of this algorithm varies according to the maxtime value: the bigger the value, the larger the granularity of the parallel tasks.

4.5.5 Data Parallel ILP

The data parallel ILP algorithm (DPILP) starts by partitioning the set of examples (positive and negative) and distributing the subsets among all processors. It then induces p theories in parallel, using the covering algorithm on each subset, and then combines the p theories found using the whole training set. The combination of the theories (i.e., of the rules that compose the theories) can be made using several strategies. In order to make the comparison with the sequential algorithm clearer, a simple strategy was selected, very similar to the one used by the sequential algorithm. The rules are ordered using some metric (in our implementation we used coverage). The best rule is added to the theory and the remaining rules are reevaluated and reordered (the rules that are no longer considered good are discarded). The process is repeated while there are good rules to add to the theory. Like the previous algorithms, the solution returned by this algorithm may not be the same as the sequential version's. It is straightforward to see that the DPILP algorithm has the largest granularity of the five algorithms described here.
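The greedy combination step just described can be sketched as follows (a simplification with stub coverage tests; the helper names are ours):

    def pos_neg_coverage(rule, pos, neg):
        # Stub coverage of one rule on the whole training set.
        covered = {e for e in pos if e % rule == 0}
        errors = sum(1 for e in neg if e % rule == 0)
        return covered, errors

    def combine_theories(rules, pos, neg, max_noise=0):
        # Repeatedly pick the rule with the largest positive coverage, add
        # it to the theory, remove the positives it explains, and
        # re-evaluate the rest; rules no longer good are discarded.
        theory, remaining = [], set(pos)
        while remaining and rules:
            scored = []
            for r in rules:
                covered, errors = pos_neg_coverage(r, remaining, neg)
                if covered and errors <= max_noise:
                    scored.append((len(covered), r, covered))
            if not scored:
                break                     # no good rules left to add
            _, best, covered = max(scored)
            theory.append(best)
            remaining -= covered
            rules = [r for r in rules if r != best]
        return theory

    if __name__ == "__main__":
        pos = list(range(2, 60, 2))       # toy positive examples
        neg = list(range(3, 60, 6))       # toy negative examples
        print(combine_theories([2, 4, 6, 10], pos, neg, max_noise=2))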

4.6 Experimental Evaluation

In the previous sections we summarized current state-of-the-art research on parallel ILP algorithms. However, it is hard to compare the results of the referred implementations since they were observed on different systems, platforms, and datasets. We therefore implemented the five parallel algorithms described in the previous section and evaluated them on a distributed-memory computer (a Beowulf cluster). The parallel


algorithms exploit the most general strategies to parallelize ILP systems, namely parallel exploration of the search space, parallel coverage tests, and data parallelism. No algorithm was implemented based on parallel exploration of independent hypotheses because, as discussed before, this strategy is not usable in all applications.

4.6.1 Aims

Our main goal is to make a fair comparison of the algorithms, by implementing them on the same platform, using the same techniques to distribute work among the processing units, and using the same applications.

The goals of the empirical study are two-fold:

1. to assess the speedup obtained with each algorithm and the impact on the estimated predictive accuracy of the theories found;

2. to assess if the solutions of the parallel runs are better than the ones returned by a sequential randomized search algorithm. Obviously, randomized searches can also be performed in parallel. Therefore, we consider two parallel algorithms that use randomized searches, as described in the previous section.

The evaluation was focused on training time and (predictive) accuracy. The accuracy is also evaluated, and not only the time, because some parallel algorithms may produce theories different from the ones obtained with the sequential version. We present the accuracy variation as the difference between the predictive accuracy observed when using p processors and the predictive accuracy observed when using a single processor.

4.6.2 Materials

Data

The parallel algorithms were empirically evaluated on 4 real-world applications. These applications have considerably large search spaces, and they are diverse in number of examples and in background knowledge size and complexity (see Table 4.2).

Table 4.2 characterizes the applications used in the experiments using the number of examples (positive and negative) as well as the background knowledge size (i.e., the number of relations used in the learning task). The AET is the average time required


to test if an example is explained by a rule. This value (presented in microseconds) was estimated by dividing the sequential execution time by the number of examples evaluated during execution. This estimate of the cost of evaluating a single example may be a useful indicator when choosing a parallelization strategy, as will become clear later on.

Application      |E+|    |E−|    |B|    AET (µs)
Carcinogenesis    182     155     38         305
Mesh             2841     290     29          46
Mutagenesis       125      63     21       20846
Pyrimidines      1394    1394     45          35

Table 4.2: Applications characterization. AET is the average estimated time to evaluate a single example.

Algorithms and Machines

All experiments were performed on a Beowulf cluster composed of 4 nodes. Each node is a dual-processor computer with 2 GB of memory, running the Linux Fedora OS. We used the YAP Prolog system version 5.1.

The parallel algorithms were implemented using the Prolog language. In the communication layer we used LAM [SL03] MPI. LAM is a high-quality open-source implementation of the Message Passing Interface (MPI) specification that can be used for applications running in heterogeneous clusters or in grids.

We started with a sequential implementation of the April ILP system. For simplicity of presentation, all algorithms follow a master-worker scheme. At the beginning of the execution each worker enters a loop and waits for work from the master. The master shares one processing unit with one of the workers. The ILP system performs a node-restricted breadth-first branch-and-bound top-down search to find a rule. The search was guided using a heuristic that relies on the number of positive and negative examples.

In the experiments we distinguish between: 1) BFBB, to denote April performing a sequential node-restricted breadth-first branch-and-bound search; 2) RRR, to denote April performing a sequential Randomized Rapid Restarts search; 3) SCS, to denote April performing sequential Stochastic Clause Selection to find a rule; 4) PCT, to


denote April performing parallel coverage tests; 5) DPLR, to denote April using the data parallel learn rule algorithm; 6) DPILP, to denote April using the data parallel ILP algorithm; 7) PRRR, to denote April performing a parallel randomized rapid restarts search; 8) PSCS, to denote April performing parallel stochastic clause selection.

Settings

The settings of the ILP system were tuned so that the BFBB runs did not take more than two hours to complete (except for the Mut application). Table 4.3 shows the main settings used for each application. The parameter nodes specifies an upper bound on the number of rules (nodes restriction) generated while searching for a rule. The i-depth [MF92] corresponds to the maximum depth of a literal with respect to the head literal of the rule. The parameter CL defines the maximum length that a rule may have, i.e., the maximum number of literals in a clause. MinAcc specifies the minimum accuracy that a rule must have in order to be accepted as good. Finally, the noise parameter defines the maximum percentage of negative examples that a rule may cover in order to be accepted.

Application   i-depth   Nodes   Noise   MinAcc   CL
Carc                4   50000      5%      70%   10
Mesh                4   20000      1%      85%    8
Mut                 3     700      5%      70%    4
Pyr                 3   20000      1%      85%   10

Table 4.3: Settings used by April in the experiments.

The randomized algorithms require some extra parameters. For the RRR and PRRR we used the following settings: restarts = 100, maxtime = 20s. The parameters of the SCS and PSCS were α = 0.999, k = 0.01, and s = 400, where k defines the top percentile composed of "good enough" clauses, α is the minimum probability of obtaining a good enough clause, and s is the sample size for estimating the size of the search space. The choices for these parameters, although admittedly arbitrary, were the same used previously by other authors [Sri00, vSP02]. For the RRR, PRRR, SCS, and PSCS algorithms the nodes parameter is ignored.


4.6.3 Methodology

We used a 5-fold cross-validation methodology. For each fold an algorithm was applied, and we recorded the time, the accuracy of the theories constructed, and some other statistics. Each dataset was processed using 2, 4, 6, and 8 processors with all parallel algorithms.

The following details are relevant. We used an evaluation function that returns the difference between the positive and negative coverage of a clause. The ILP system was configured to use several optimizations, namely coverage caching and lazy evaluation of positives (for the clauses with length equal to the maximum defined). These optimizations were used whenever applicable by the sequential and parallel algorithms. The SCS algorithm evaluates the clauses using a total laziness strategy.

4.6.4 Base (Sequential) Results

Figure 4.4 shows the average runtimes and speedups (when compared to BFBB) of the three sequential algorithms on each application. More detailed results are available in Appendix B.3.

Overall, the RRR algorithm achieves the best values. In three out of four applications, the RRR algorithm reduces the running times, seldom significantly affecting the accuracy negatively (see Table 4.4). The SCS algorithm also achieves good speedups but (frequently) at the cost of accuracy loss. A paradigmatic example is the substantial accuracy reduction with the Mesh application.

Application    RRR     SCS
Carc          +0.3    −2.4
Mesh          −0.5   −28.0
Mut           +2.1    +0.5
Pyr           −4.8    −2.7

Table 4.4: Comparison of SCS and RRR with BFBB as far as predictive accuracy is concerned. Significant changes in accuracy, for a p-value of 0.02, are marked in bold.

Table 4.5 ranks the three sequential algorithms. An entry of 1 under time (T) means that the corresponding algorithm achieved the lowest runtime for the given problem, while an entry of 1 under accuracy (A) means that the corresponding algorithm achieved the highest (predictive) accuracy for the given problem. The last row shows the median


[Figure 4.4 appears here: a) Time; b) Speedup.]

Figure 4.4: Sequential execution: time a) and speedup b) of the randomized algorithms in relation to the BFBB algorithm.


               BFBB       RRR        SCS
Application    T    A     T    A     T    A
Carc           3    2     1    1     2    3
Mesh           3    1     1    2     2    3
Mut            3    3     2    1     1    2
Pyr            3    1     1    3     2    2
Median         3    1     1    1     2    3

Table 4.5: Rankings of the three sequential algorithms. An entry of 1 under time (T) means that the corresponding algorithm achieved the lowest runtime for the given problem, while an entry of 1 under accuracy (A) means that the corresponding algorithm achieved the highest predictive accuracy for the given problem. The last row shows the median ranks of each algorithm.

ranks for each algorithm. The results confirm the conjecture that the use of complete search methods, like BFBB, that are more computationally demanding, often does not have a significant positive effect on the accuracy of the models found [Sri00].

To better understand the behavior of the two randomized algorithms we collected more statistics on each run, namely: the number of epochs, the number of clauses generated, and the number of examples tested on the generated clauses. Figure 4.5 shows the number of epochs and the number of clauses generated together with the speedup observed. One can observe that the speedup achieved by the RRR algorithm is not a consequence of a reduction in the number of epochs. In fact, in some cases (Mesh and Pyr) there is an increase in the number of epochs and, at the same time, speedups.

The number of epochs increased for all applications when using the SCS algorithm. The largest increase is observed in the Mesh application, the same application where SCS had the largest negative impact on accuracy (−28%). This suggests that, in many searches, the algorithm is unable to find a good clause, therefore increasing the number of epochs required to process the whole dataset. This is a problem of the SCS algorithm: it is unsuccessful in problems where finding a good rule can be seen as finding a needle in a haystack. The Mesh application, with the settings used (in particular the minacc of 85%), is an example of an application where good (or acceptable) clauses are rare.

In Figure 4.5.b) one can observe some interesting details: i) the RRR algorithm reduces


[Figure 4.5 appears here: a) Number of epochs (bars) and speedup (lines); b) Clauses generated (bars) and speedup (lines).]

Figure 4.5: Sequential execution: relation between the number of epochs a) and the number of clauses generated b) with the speedup observed.


the execution time notwithstanding the increase in the number of clauses generated; ii) SCS generates far fewer clauses than RRR and BFBB without achieving lower running times. The increased number of clauses generated by the RRR algorithm is not accompanied by an increase in the number of tested examples (see Figure 4.6). This is mainly an effect of being able to exploit other sequential techniques that improve efficiency, namely coverage caching, which has been shown to reduce the number of examples tested [FCSC03]. The SCS algorithm is not able to exploit this technique due to the lack of locality, and therefore more examples are evaluated. Figure 4.6 also gives an indication of the poorer results of the RRR algorithm in the Mut application: the number of examples evaluated increases considerably, justified by the increase in the number of clauses generated. As referred in Table 4.2, evaluating examples in the Mut application is far more costly than in the other applications, thus an increase in the number of examples evaluated has a considerable impact on performance.

Figure 4.6: Sequential execution: clauses generated (lines) and examples evaluated (bars).

In conclusion, the results presented show that the use of the RRR randomized algorithm significantly reduces the running times of ILP systems with little loss in accuracy.

4.6.5 Parallel Results

The main question that we address here is: what is the best parallel algorithm? We consider that a parallel algorithm is better than the others if it yields better speedups


with little or no loss in accuracy.

Figure 4.7 shows the speedups observed with each parallel algorithm on each application using 2, 4, 6, and 8 processors (more detailed results are available in Appendix B.3). The speedups were computed using the execution time of the corresponding sequential version of the algorithm, i.e., SCS for PSCS, RRR for PRRR, and BFBB for the remaining algorithms.

[Figure 4.7 appears here: panels a) Carc, b) Mesh, c) Mut, d) Pyr.]

Figure 4.7: Speedups observed with each parallel algorithm for 2, 4, 6, and 8 processors.

The effects of the parallel coverage tests approach (PCT) on the execution time show quite different behaviors. In the Carc, Mesh, and Pyr applications the parallel version is often slower than the sequential one, while in the Mut application a considerable speedup is observed. Since Mut has fewer examples than the other three, we conjecture that the higher cost of evaluating an example (see Table 4.2) is the main reason for the speedups. However, the speedups observed in Mut tend to decrease as the number of processors is increased. This is a result of the decrease in the granularity of the parallel tasks due to the reduced size of the subsets. The


poor results observed for the Carc, Mesh, and Pyr applications suggest that the work distribution, and consequent parallel evaluation of the examples, does not compensate for the cost of communication. Clearly, this fine-grained approach to parallelizing ILP systems seems only suited for datasets with: i) complex background knowledge, where the cost of evaluating an example is high, or ii) a large number of examples, where the parallel evaluation of the examples on a subset outweighs the overhead of parallelism.

A way to increase the task granularity is to evaluate a set of rules in parallel instead of evaluating a single rule, in the line of what was proposed in [CK03] and exploited in the PSCS algorithm.

The PSCS algorithm increases the task granularity at the cost of increasing the amount of communication (a set of clauses is sent to the workers to be evaluated). The increase in task granularity allows PSCS to achieve slightly higher speedups than the PCT algorithm, but the increase in communication overheads prevents the algorithm from achieving better speedups. Another reason for PSCS achieving low speedups is the low fraction of time spent in evaluating clauses, as compared to generating them, when clauses are generated randomly (see Figure B.1). Having this in mind, one can easily conclude that better results could be achieved if the random generation of clauses were performed in parallel.

The impact of the DPLR algorithm on the execution time is quite variable. In the Carc application we observe a super-linear speedup with two processors, after which it stabilizes around 3 up to eight processors. In the Mesh and Pyr applications the speedup steadily increases. Noteworthy is the speedup of 10 observed with 4 processors in the Mut application.

One should note that the order in which the rules are found and added to the theory is a crucial factor for the execution time, since it determines how much of the hypothesis space is traversed. Recall that each worker gets a subset of E+ but all of E−. If one of the workers does not find a globally "good" rule using its local subset, it will have to do a more extensive search. This may happen when a rule has an accuracy below some threshold in the subset of the data where it is being learned (and is thus not considered good) but is above the threshold when the whole dataset is considered. We also observed that the final set of rules found by the DPLR algorithm is far larger than the set found by the sequential algorithm. This suggests that the algorithm is unable to find a smaller set of rules.
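As a toy numeric illustration of this threshold effect (all numbers are invented for illustration and not taken from the experiments):

# A rule can fall below the minimum accuracy threshold on one subset
# yet clear it when evaluated on the whole dataset.
def rule_accuracy(pos_covered, neg_covered):
    return pos_covered / (pos_covered + neg_covered)

MIN_ACC = 0.70
subset_acc = rule_accuracy(pos_covered=8, neg_covered=4)    # 8/12 = 0.67 on one subset
global_acc = rule_accuracy(pos_covered=40, neg_covered=12)  # 40/52 = 0.77 on all data
print(subset_acc < MIN_ACC <= global_acc)  # True: locally rejected, globally good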

As mentioned in the previous section, DPLR is a master-worker implementation of the algorithm described by Wang et al. [WS00]. The results reported here are quite different from the ones previously reported. In [WS00] super-linear speedups (up to 6 processors) were reported, while the speedups we observed are not linear in most cases. The reason for this is two-fold. First, Wang et al. ran the experiments on a shared memory machine, whereas we ran them on a distributed memory machine (a Beowulf cluster). Second, in the experiments performed by Wang et al. no good rules were lost while learning, because they did not define parameters such as minimum accuracy or minimum coverage. These parameters are used when assessing whether a rule is good or not. However, when dealing with real world applications, these parameters are often used to make the learning process more tractable and to discard (undesirable) rules with very low coverage.

Since DPLR is not complete when compared to the sequential algorithm, the predictive accuracy of the theory found may vary. In Figure 4.8 we can see that its predictive accuracy is affected negatively. The reason for this is also related to losing "good" rules while looking for a rule in the subsets. The theories found by the algorithm are composed of much more specific and lengthier rules than the ones found by the sequential version.

The speedups observed with the DPILP algorithm are the best ones overall. Moreover, it sometimes slightly improves the predictive accuracy. The DPILP algorithm differs from the DPLR algorithm in the granularity of the parallel tasks and in the number of negative examples used. In DPILP each worker gets a percentage of the total negatives, whereas in DPLR each worker gets a percentage of the positives but all of the negatives. DPILP also needs much less communication among the processors. This confirms the conjecture that coarse-grained parallelism yields greater speedups. Noteworthy are the variable speedups observed in the Mut application. They reflect a great variation in the runtime that depends on the training data and seeds selected, a phenomenon also observed by other researchers [vSP02]. The variability in the runtimes is thus a possible explanation for the super-linear speedup observed with two processors.

Finally, the PRRR algorithm shows good speedups with the Carc application, while in the Mut and Pyr applications no speedup is observed. In fact, the best results were observed on the same application as for the sequential version (RRR).

Figure 4.8: Accuracy variation observed with each parallel algorithm for 2, 4, 6 and 8 processors (panels: a) Carc, b) Mesh, c) Mut, d) Pyr).

               PCT      DPILP    DPLR     PSCS     PRRR
Application   S    A   S    A   S    A   S    A   S    A
Carc          5    =   1    1   3    5   4    =   2    =
Mesh          5    =   2    1   1    5   4    =   3    =
Mut           3    =   2    2   1    1   4    =   5    =
Pyr           5    =   1    1   2    5   4    =   3    =
Median        5    =   1    1   2    5   4    =   3    =

Table 4.6: Rankings of the five parallel algorithms. An entry of 1 under speedup (S) means that the corresponding algorithm achieved the overall best speedup, while an entry of 1 under accuracy (A) means that the corresponding algorithm had the lowest or no loss in accuracy for the given problem ("=" means that the accuracy of the parallel algorithm is equal to the accuracy of the sequential one). The last row shows the median ranks of each algorithm.

Table 4.6 ranks the five parallel algorithms. The last row shows the median ranks for each algorithm. The results show that the overall best strategy to parallelize ILP systems on a distributed-memory computer is one of the simplest to implement: divide the set of examples into p subsets; run the ILP system in parallel on each subset; and combine the theories found into a single theory. This approach (DPILP) not only reduces the execution time but can also improve predictive accuracy. The second best approach, in terms of speedup, is DPLR, but it is ranked last when ranking by accuracy. The rankings of the algorithms based on the observed speedups almost match the ordering of the algorithms based on the granularity of the tasks. The main difference is that DPLR and PRRR switch positions. However, as stated earlier, the granularity of the PRRR algorithm depends on the maxtime parameter. Therefore, an increase in the value of this parameter results in longer local searches, thus increasing the granularity of the parallel computations.

4.6.6 Parallel versus Randomized Algorithms

Based on the results gathered we may assess whether the solutions of the parallel runs are better than the ones produced by a randomized search, i.e., assess which approach (deterministic versus random) produces the best results.

By observing Figures 4.4 b) and 4.7 it becomes clear that, in the Carc application, the sequential RRR (randomized algorithm) achieves greater speedups (more than 20 times) than the parallel algorithms with up to 8 processors. In fact, the speed of RRR can be further increased by parallelizing the algorithm (around 5 times for 8 processors).


In the Mesh application the sequential RRR algorithm has slightly lower speedups than the DPLR algorithm, which has the best performance among the parallel algorithms on this application. However, DPLR significantly decreased the quality of the theories found.

Parallelization clearly pays off in the Mut application. Here DPLR and other algorithms achieved greater speedups than the best sequential randomized algorithm (SCS). In the Pyr application a sequential randomized algorithm yields greater speedups than the parallel algorithms.

In conclusion, the results suggest that a sequential randomized algorithm (RRR) is able to achieve greater speedups than any of the parallel algorithms described, without sacrificing the quality of the solutions found. The running time of the sequential randomized algorithm (RRR) may be further reduced using parallelism (PRRR). The RRR algorithm requires the whole dataset to be loaded into memory, which for very large datasets can be a problem. On the other hand, data parallel algorithms such as DPILP can cope with this problem.

4.7 Summary

In this chapter we presented a survey of the state of the art in parallel ILP implementations and studied the performance of five parallel algorithms on a distributed memory computer using real-world applications.

The parallel ILP algorithms described in the literature were grouped into four main approaches: parallel exploration of independent hypotheses; parallel exploration of the search space; parallel coverage tests; and parallel execution of an ILP system over a partition of the data. Parallel exploration of independent hypotheses is not a general approach, since it is only adequate for applications where the target concept is composed of several independent subconcepts. However, when this approach is applicable it can be combined with other approaches to learn the subconcepts.

The parallel algorithms implemented were based on the three more general strategies: parallel exploration of the search space; parallel coverage tests; and parallel execution of an ILP system over a partition of the data. The results show that a good approach to parallelize ILP systems on a distributed-memory computer is one of the simplest to implement: divide the set of examples into p subsets; run the ILP system in parallel on each subset; and combine the theories found. This approach not only reduces the execution time but can also improve predictive accuracy.

We have also noticed a significant difference between the results reported for shared memory machines and the ones we collected on distributed memory machines. In shared memory machines the communication overhead is significantly reduced, and strategies like DPLR may yield super-linear speedups. However, in distributed memory machines, where the communication costs are higher, fine-grained strategies are severely penalized.

The results also suggest that a sequential randomized algorithm is able to achieve better speedups than the parallel algorithms that perform complete searches, without sacrificing the quality of the solutions found. The results also show that a parallel version of a randomized algorithm often further decreases the learning time. The RRR algorithm only has the drawback of requiring that the whole dataset be loaded into memory, which for very large datasets can be a problem. Data parallel algorithms, such as DPILP, can cope with this problem. In fact, we believe that we can have the best of both algorithms by using the RRR algorithm to perform the search in the DPILP algorithm. We plan to evaluate this approach in the near future.

A natural extension of this work is to perform a larger experimental evaluation on larger clusters and applications. It would also be interesting to extend the evaluation of the strategies to shared memory architectures. Finally, communication is always an important factor in the performance of parallel algorithms. Therefore, improvements in the communication layer implementation should yield greater reductions in the learning time.


In battle, there are not more than two methods of attack: the direct and the indirect; yet these two in combination give rise to an endless series of maneuvers.

Sun Tzu

5 A Pipeline Data-parallel Algorithm for ILP

In this chapter we propose a novel parallel covering algorithm based on data partitioning and pipelining. It takes advantage of data parallelism by dividing examples equally among the processors and learning rules from each subset in parallel. The best rules learned from each subset are streamed in a pipeline fashion. Our experiments show that the proposed algorithm achieves substantial performance improvements, even in a fully distributed environment, while preserving the quality of the models.

5.1 Introduction

The rule covering algorithm used in many ILP systems is intrinsically sequential. One can improve its efficiency using a strategy that divides the search graph and distributes the subgraphs among the available processors to be explored. Since all processors need to keep a copy of the whole data in memory, the disadvantage of this approach is that it does not scale well with the increase in data size. In contrast, data parallel approaches divide the data into subsets and perform similar computations on each subset simultaneously, thus making it possible to process larger datasets in main memory. However, training only on small subsets of the whole data might reduce the quality of the models due to overfitting, or because the subset does not contain a representative sample of the population.

In this chapter we present a novel parallel ILP algorithm that exploits data parallelism while preserving the quality of the models. The algorithm takes advantage of data parallelism by dividing the examples equally among p processors and running p learn_rule() procedures, one on each subset. We ensure model quality by streaming the best rules from each subset in a pipeline fashion. The experiments show that the algorithm achieves substantial performance improvements, even in a fully distributed environment.

In the remainder of this chapter we describe the proposed parallel algorithm and present the results of its evaluation on a Beowulf cluster using several applications. Finally, we put our work in the context of related research and draw some conclusions.

5.2 Pipelining and Parallelism

Pipelined data-parallel algorithms [KCN90] are a class of algorithms that use pipelined operations and data-level partitioning to achieve parallelism. This class of algorithms is ideal for distributed-memory parallelism, since it is possible to achieve balanced computations (fundamental to obtaining speedup) by controlling the granularity of data communication, using data partitioning, and overlapping the operations through pipelining.

Pipelining is an extremely general technique that can increase the throughput of a system (circuit, processor, computer, software, etc.). It is very effective in cases where one has to repeat a job on many pieces of data. For instance, it is used by processors at the hardware level to improve execution times by overlapping the various stages of instruction execution (fetch, schedule, decode, etc.).

Pipelining is usually explained using the example of a car assembly line. For instance, suppose that the assembling task can be divided into 5 pipeline steps and that the total time for assembling a car is 50 hours (10 hours per stage). Then one assembly line (with the five stages) will produce 1 car every 10 hours once the pipeline is full. In 500 hours one would produce 10 cars without pipelining, while with pipelining the number of cars produced would be 46 (the first car completes after 50 hours, and one more car is finished every 10 hours thereafter). This corresponds to an almost 5-fold increase in production.
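As a minimal sketch of this arithmetic (the function and its names are purely illustrative):

def cars_produced(total_hours, stages, stage_hours, pipelined):
    # Count cars finished within total_hours.
    if not pipelined:
        # Each car occupies the whole line for stages * stage_hours.
        return total_hours // (stages * stage_hours)
    # With pipelining the first car finishes after the full latency;
    # afterwards one car completes every stage_hours.
    latency = stages * stage_hours
    if total_hours < latency:
        return 0
    return 1 + (total_hours - latency) // stage_hours

print(cars_produced(500, 5, 10, pipelined=False))  # 10
print(cars_produced(500, 5, 10, pipelined=True))   # 46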

The concept of pipelining can also be used to design parallel algorithms. In the parallel pipeline model, a stream of data is passed through a succession of processes, each performing some task on the data. This simultaneous execution of potentially different programs on a data stream is called stream parallelism [GGKK03]. Except for the process initiating the pipeline, the arrival of new data triggers the execution of a new task by a process. Another way of seeing pipelining is in the context of the producer-consumer model. Each process in the pipeline can be viewed as a consumer of data provided by the preceding process in the pipeline, and as a producer of data for the next process in the pipeline.

Figure 5.1: A pipeline with 5 stages.

Figure 5.1 shows a generic pipeline with 5 stages. In each stage, as stated previously, the data received as input is transformed by some procedure. If the pipeline is a functional pipeline, then each stage simply applies its procedure to the received data. If the pipeline is a data pipeline, then each stage holds a subset of the data that is used by the procedure during the transformation of the input data. When considering parallelism, each stage corresponds to a processor. The flow of data along the processors (represented as arrows) forms a data stream, and it may vary between the stages. Since the procedure implemented in each stage transforms the data received from the stream, an increase in the amount of data received may lead to an increase in the time required to process it. Thus, the size of the data transferred between stages usually determines the granularity of the algorithm. The challenge is to minimize the communication overhead and maximize parallelism.

The process of "pipelining" a sequential algorithm that performs a particular task involves splitting the task into smaller subtasks that are executed in sequence on various pieces of data. Parallelism can be obtained either by executing each stage of the pipeline simultaneously or by having several pipelines (each processor, at a given time, executes a stage of one pipeline). A minimal sketch of this producer-consumer organization is given below.
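The following is a minimal, self-contained sketch of such a producer-consumer pipeline, using Python threads and queues; the stage functions and the sentinel are illustrative choices of ours, unrelated to any ILP system:

import threading, queue

STOP = object()  # sentinel marking the end of the stream

def stage(fn, inbox, outbox):
    # Apply fn to every item flowing through this pipeline stage.
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(fn(item))

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=stage, args=(lambda x: x + 1, q0, q1)),
    threading.Thread(target=stage, args=(lambda x: x * 2, q1, q2)),
]
for t in threads:
    t.start()
for item in [1, 2, 3]:
    q0.put(item)
q0.put(STOP)

result = []
while (item := q2.get()) is not STOP:
    result.append(item)
print(result)  # [4, 6, 8]
for t in threads:
    t.join()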

In pipelined data-parallel computations, the processors of the multicomputer are usually configured as arrays of pipelines. Thus, there may be multiple data streams flowing in different directions simultaneously. The computation can be viewed as a network of processors linked together by data streams. The total execution time can be determined as the time taken by the longest data stream to traverse all processors.

The most important characteristic of pipelined data-parallel algorithms is their asynchronous data flows: the flows of data connect the processors in the system into a network of one or more pipelines. Each pipeline operates on one data stream. Each processor may participate in several pipelines during the course of the computation. The initial data is partitioned and distributed among all processors at the beginning of the computation. No global synchronization is generally required in pipelined data-parallel algorithms.

The performance of these algorithms can be fine-tuned by controlling parameters such as the number of pipelines, the number of stages of the pipeline, and the size of the data exchanged on the streams (the pipeline width).

5.3 A pipelined data-parallel rule learning algorithm

This section presents∗ a new pipelined data-parallel covering algorithm, p2-covering algorithm for short. Our parallel covering algorithm combines data parallelism and pipelining. It exploits data parallelism by distributing the data (the set of examples E) among all available processors and by learning rules in parallel on each processor. Pipelining is achieved by breaking the process of learning the best rule into a sequence of stages. Each stage performs a search using a local subset of examples. The good rules found are sent to the next stage (worker) to be used as the starting points of a new search with a different subset of examples. Therefore, the search for a rule is sequentially performed using different sets of examples. When the pipeline completes, the newly found rules are sent to the master.

In order to maximize parallelism we need to keep all p processors busy. We ensure this by launching p pipelines (almost) simultaneously, one per worker, as shown in Figure 5.2. The example shows a depth-3 pipeline. The pipelines are initialized when the master divides the examples among the workers W1, W2 and W3. Each worker then starts a new pipeline, using its own set of examples to perform a first search. Figure 5.3 details the pipeline created by W1. The first pipeline stage is performed by W1 itself and finds a single interesting rule h1. At this point, as shown in Figure 5.2, W1 sends its good rules (only h1 in our running example) to W2. Similarly, W2 sends its good rules to W3, and W3 sends its good rules to W1. All workers restart using the new rules as seeds. Figure 5.3 shows this process for the pipeline started by W1; W2 now starts a search from h1 and finds two more rules (h2 and h3). The last pipeline stage is invoked by sending good rules from W2 to W3, W3 to W1, and W1 to W2. W3 is now in charge of W1's initial pipeline. Rule h1 is still a good rule but it has already been expanded. Rule h2 cannot be expanded, so all the work is performed for h3. The last step is to send the newly found rules back to the master. W1's pipeline has found 3 rules: h1 (found by W1) and h2 and h3 (found by W2).

∗The algorithm is described using a master-worker scheme, as described in Section 4.5.


Figure 5.2: Parallel pipelined rule search with 3 workers.

5.3.1 The algorithm

Figure 5.3: Example of a pipelined rule search. A squared box represents a node in the search space that is considered good. The good nodes are used as the starting points of the search in the next worker.

In order to simplify the description of the parallel algorithm and its understanding, we introduce the following definitions and notation. We represent the number of available processors/workers as p and follow a master-worker computational model. Since we assume a distributed environment, we use message passing to communicate between master and workers. Three communication operations are used to exchange messages: send, broadcast, and receive. The send procedure sends a message to a (specific) worker, and receive receives a message from a worker. The broadcast procedure is used to send a message to all workers. The broadcast and send operations are non-blocking operations for sending a message, while receive is blocking. The master sends task-messages to workers for execution. A worker, upon receiving a task-message, executes it and, once it has completed, sends a result-message to the master with the results of the task if necessary.
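As an illustration only, and assuming an MPI installation with the mpi4py binding (the actual implementation described later is in Prolog using LAM MPI), the three operations map roughly onto the following calls; note that MPI's bcast is a collective call, so this only approximates the non-blocking broadcast described above:

# Run with at least two processes, e.g. "mpirun -np 4 python script.py".
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # rank 0 plays the master, the rest are workers

if rank == 0:
    # broadcast: the same task-message reaches every worker
    comm.bcast({"task": "load_examples"}, root=0)
    # send: a task-message addressed to one specific worker
    comm.send({"task": "learn_rule", "step": 1, "w": 10}, dest=1, tag=0)
    # receive (blocking): wait for that worker's result-message
    result = comm.recv(source=1, tag=0)
    print("master received:", result)
else:
    task = comm.bcast(None, root=0)     # every worker receives the broadcast
    if rank == 1:
        msg = comm.recv(source=0, tag=0)
        comm.send({"rules": ["h1"]}, dest=0, tag=0)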

To explore pipelined data-parallelism we introduce some changes to the covering algorithm presented in Section 2.3.1, namely:

• partitioning the training set of examples into p subsets, where p corresponds to the number of workers. Each subset is assigned to one worker;

• "pipelining" the search for a rule; and

• modifying the covering algorithm to start p pipelines simultaneously.

Figure 5.4 outlines the p2-covering algorithm as executed by the master. As in the sequential algorithm, epochs are run sequentially, and execution consists of as many epochs as needed to have all positive examples covered, or until some other stopping criterion is verified (e.g., a time constraint).

p2-covering(E+, E−, p, w)

Input: set of positive (E+) and negative (E−) examples, number of workers (p), pipeline width (w)
Output: A set of rules (Rules_Learned)

1.  Rules_Learned = ∅
2.  Partition E+ and E− into p subsets
3.  broadcast load_examples()   /* Each worker loads its subset of the data */
4.  Remaining = |E+|
5.  while Remaining > 0 and other stopping condition not met do
6.     for k = 1 until p do
7.        send k learn_rule'(1, w, ∅)   /* Launch a pipeline in worker k */
8.     end for
9.     RulesBag = ∪_{k=1..p} (receive k Rules_k)
10.    broadcast evaluate(RulesBag)
11.    RulesQuality = Σ_{k=1..p} (receive k Quality_k)
12.    while RulesBag ≠ ∅ do
13.       R = pickBest(RulesBag, RulesQuality)
14.       RulesBag = RulesBag \ {R}
15.       Rules_Learned = Rules_Learned ∪ {R}
16.       broadcast mark_covered(R)   /* i.e., E+_k = E+_k \ {e ∈ E+_k | R explains e} */
17.       Remaining = Remaining − PosCovered(R, RulesQuality)
18.       broadcast evaluate(RulesBag)
19.       RulesQuality = Σ_{k=1..p} (receive k Quality_k)
20.       Rules2Delete = {r ∈ RulesBag | not good_rule(r)}
21.       RulesBag = RulesBag \ Rules2Delete
22.    end while
23. end while
24. return Rules_Learned

Figure 5.4: The p2-covering algorithm.

Each epoch is managed by the master. The master performs four major steps. First, the master randomly and evenly partitions the examples into p subsets. Each worker is then notified to load its subset of examples and other data. Second, the master starts p pipelines in parallel, one pipeline per worker. This is done by sending a task-message to each worker, invoking a modified version of the learn_rule() procedure. The master then receives result-messages from the workers with the rules found at the end of each pipeline. These are collected into a bag (RulesBag). Third, the master broadcasts the RulesBag to every worker in order for each worker to compute its quality. Then the master receives the quality values from all workers and computes the global quality of the rules (RulesQuality). Fourth, and last, the master consumes the rules from the bag and adds them to the learned rules, following an algorithm that closely emulates the sequential one:

1. select and remove the best rule (R) from the bag, according to a predefined criterion;

2. remove the positive examples covered by the rule in each subset;

3. update the Remaining number of examples by subtracting the positive examples covered by the rule R;

4. reevaluate, in parallel, the quality value of the rules in the bag (in each subset) and collect the results (RulesQuality);

5. remove the rules from RulesBag that are no longer "good".

The consumption of rules stops when there are no more rules in the bag (or some other criterion is verified). At this point the epoch ends. This is a main difference from the sequential algorithm, since several rules can be added to the theory in a single epoch, as opposed to a single rule in the sequential algorithm.
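A compact sketch of this consumption loop follows; pick_best, pos_covered, is_good, and the broadcast callbacks are placeholders standing in for the system's own procedures, not a real API:

def consume_rules(bag, quality, remaining,
                  pick_best, pos_covered, is_good,
                  broadcast_evaluate, broadcast_mark_covered):
    learned = []
    while bag:
        r = pick_best(bag, quality)            # 1. best rule according to quality
        bag.remove(r)
        learned.append(r)
        broadcast_mark_covered(r)              # 2. drop covered positives in each subset
        remaining -= pos_covered(r, quality)   # 3. update the global counter
        quality = broadcast_evaluate(bag)      # 4. re-score the bag in parallel
        bag = [h for h in bag if is_good(h, quality)]  # 5. prune rules no longer good
    return learned, remaining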

Pipelining is exploited in the learn_rule'() procedure, shown in Figure 5.5. It is similar to the learn_rule() described previously, with the following differences. First, it includes a parameter indicating the set of starting points of the search. Each search should traverse different parts of the search space in order to avoid redundant computation. This is achieved by randomly selecting different positive examples as seeds. The procedure then generalizes the seed as much as possible and, in the process, may traverse different parts of the search space. If no rule is found, the seed example is returned. Second, the procedure has a parameter that indicates the current pipeline stage (step). When the pipeline starts this value is set to 1, and it is incremented when the rules found are passed to the next stage of the pipeline. Third, the procedure outputs a set containing the "best" w rules instead of just the single "best" rule. The parameter w can be seen as defining the pipeline width. After performing the search, the best w rules are selected and sent to: i) the master, if the pipeline has reached the end (i.e., the current stage value is p); or ii) the next stage of the pipeline (i.e., the next worker).

learn_rule'(step, w, S)

Input: current pipeline step (step), pipeline width (w), initial set of rules (S)
Output: A set of rules

E+ and E− are, respectively, the local sets of positive and negative examples, and B is the background knowledge.
1.  Good = S
2.  while stop criteria not satisfied do
3.     Pick = pickRule(S)
4.     S = S \ {Pick}
5.     NewRules = genNewRules(Pick)
6.     Vals = evalOnExamples(E+, E−, B, NewRules)
7.     Good = Good ∪ {r ∈ NewRules | good_rule(r)}
8.  end while
9.  Good = bestOf(w, Good)
10. if is_last_step(step) then
11.    send master Good
12. else
13.    NextWorker = next_worker()
14.    send NextWorker learn_rule'(step+1, w, Good)
15. endif

Figure 5.5: A pipelined learn_rule'() procedure. The main difference when compared to learn_rule() concerns the S argument, which contains a set of rules that define the starting points of the search space. next_worker() computes the identifier of the next worker in the pipeline. genNewRules() generates a set of rules from a given rule.

As the computation evolves and rules are learned, the examples explained by them are removed from each subset. This may lead to unbalanced subsets, i.e., their sizes may vary considerably, which in turn may result in unbalanced computations. A possible solution to this problem, with a considerable cost in message communication, would be to always repartition the examples before starting the pipelines. However, we did not consider this approach, mainly because of the high communication cost of repartitioning.

5.3.2 Considerations

The proposed approach to exploiting parallelism in rule-learning covering algorithms raises some questions that we now discuss.

Data Partitioning

By partitioning the data, each worker learns (in each stage of the pipeline) using only a subset of the data. Intuitively, one might think that any rule found using a subset of the data would not be representative of the dataset as a whole but, in reality, this does not happen in typical datasets. There are two reasons why the proposed scheme is expected to work. First, a dataset tends to contain much redundancy, so that a random subset is likely to be representative of the entire dataset, provided it is not too small. Therefore, any concept acceptable on the full dataset will be acceptable on at least one disjoint subset of the full data. Second, the goal of the pipelining process is to ensure that, in the end, some rules are valid on the full (training) dataset. Therefore, although the training data is partitioned and learning is performed using each partition, the whole dataset is taken into account by "pipelining the search".


In our algorithm, the size of a subset depends on the pipeline length. The set of examples E is randomly divided into p mutually exclusive subsets of approximately equal size (keeping the proportion of positive and negative examples). For statistical purposes each subset should contain at least 30 examples∗, since subsets that are too small may prevent good rules from being found. This partitioning approach has the disadvantage of imposing an upper limit on the degree of parallelism. A solution to this problem might be to generate the subsets with replacement; however, we did not study this approach.
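A minimal sketch of such a stratified partition (the function and its round-robin dealing are illustrative choices of ours, not April's actual implementation):

import random

def partition(pos, neg, p, seed=0):
    # Shuffle each class, then deal examples round-robin so that the subset
    # sizes differ by at most one and the positive/negative ratio is preserved.
    rng = random.Random(seed)
    pos, neg = list(pos), list(neg)
    rng.shuffle(pos)
    rng.shuffle(neg)
    subsets = [([], []) for _ in range(p)]
    for i, e in enumerate(pos):
        subsets[i % p][0].append(e)
    for i, e in enumerate(neg):
        subsets[i % p][1].append(e)
    return subsets

subs = partition(range(100), range(40), p=4)
print([(len(P), len(N)) for P, N in subs])  # [(25, 10), (25, 10), (25, 10), (25, 10)]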

Handling parameters based on data size

It is usual for a user to set several parameter values based on the (whole) set of training examples, such as the noise level, the minimum number of examples that a rule must cover, the minimum accuracy, etc. For instance, let us consider the noise parameter, used in many systems to relax the consistency condition. The noise parameter indicates that a "good" rule can wrongly classify some examples (up to the noise value). In the proposed algorithm, since learning is performed from subsets of the training data, there is a possibility that a rule may be globally consistent but inconsistent in some subsets. Therefore, we should take this variation into account, otherwise globally good rules may be discarded because the noise level is slightly higher, in some subset, than the acceptable value.

For the sake of simplicity, let us assume that the parameters are given as percentages, i.e., their values range from 0 to 1. Since the algorithm randomly divides the set of training examples E into p subsets E1, . . . , Ep and then learns using one subset at a time, a question arises: how do we compute the value of a parameter to be used on each subset (par_s), given its value for the whole set (par_g)?

A first approach would be to use par_g itself. However, as stated previously, this approach does not take into account the variation that may be observed on each subset, which results in discarding globally good rules while learning from a subset. For instance, if the noise value is set to 0.10 for a whole set of 100 negative examples and p is 2, the expected noise on each subset would be 5 out of each subset of 50 examples. If the noise observed in subsets 1 and 2 was 4 and 6, respectively, the rule, although globally good, would be discarded because it is not good on subset 2.

∗Many statistics textbooks consider samples large when their size is greater than 30, and small otherwise.


Another approach, which takes into account the variation of the values on each subset, consists in computing an interval for par_s. For instance, in the case of the noise parameter, we could relax the noise value used in the subsets in order to avoid discarding (globally) acceptable rules while learning from a subset. How do we compute the interval associated with each parameter?

To compute the interval we will use a model based on the Binomial distribution. The Binomial distribution is well suited for problems where the probability of observing a property (usually called a "success") is some constant value and the results of each test are independent. The value of a parameter represents the proportion of elements in the set of examples with the property in question, and it is constant over all trials. The outcome of a trial (verifying a property on an example) does not affect the outcome of another.

Let n_t and n_s be, respectively, the size of the set of training examples and the size of a subset, and par_s the value of the parameter on a subset. The variance expected on each subset is σ²_par = n_x · par_g(1 − par_g), where x is t or s. The interval value can be computed as par_g ± √(σ²_par)/n_x. Note that the division by n_x is done to obtain the variation of the proportion rather than of the count.

Now that we have an interval, we must choose a value from it. If we select a lower value, we are trying to ensure that no globally good rules are discarded, at the cost of finding rules that are too general, as a result of stopping the search early. On the other hand, if we choose a higher value from the interval, we may avoid overgeneralizing, at the cost of not finding some globally good rules.

Example 7 (Noise range)

As an example, consider that we want to compute the interval for the noise parameter. noise_g is set to 0.10, i.e., a rule may cover up to 10% of the negative examples and still be considered consistent. The number of negative examples is 100 and the size of the pipeline p is 2. Therefore n_s, the number of negative examples in each subset, is 100/p = 50, and noise_s = 0.10 ± √(0.10 · 0.9 / n_s) = 0.10 ± 0.04.
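A small sketch of this computation under the Binomial model above (the function name is hypothetical):

import math

def parameter_interval(par_g, n_s):
    # Half-width is the standard deviation of a proportion over n_s trials.
    half = math.sqrt(par_g * (1.0 - par_g) / n_s)
    return par_g - half, par_g + half

# Example 7: noise_g = 0.10, 100 negatives split over p = 2 subsets
lo, hi = parameter_interval(0.10, 100 // 2)
print(round(lo, 2), round(hi, 2))  # 0.06 0.14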

Dealing with skewed distributions

It is often the case that applications have a skewed distribution of examples, i.e., a number of positive examples far greater than the number of negative ones, or vice-versa. This may raise some scalability issues when the size of a subset becomes too small, a consequence of using too many processors. If the number of examples is small we may have difficulties, because the random subset may not be representative.

In these cases, replication of the negative examples could be a solution. Since this set of examples is not large and is propagated only once, the computational cost of learning with the whole set of negative examples is not significant, so speedups are still expected to occur. If the total number of examples is too small, we should not need to parallelize at all.

5.3.3 Characteristics

The approach followed to parallelize the covering algorithm using pipelined data parallelism has the following characteristics:

• It fosters scalability in the number of examples. This results from partitioning the set of examples and then incrementally learning partially correct rules on each subset.

• The granularities of the tasks executed in parallel are very similar. This leads to balanced computations, hence simplifying load balancing. The granularity of the parallel tasks is medium or high, and depends primarily on the number of examples in each subset.

• It is possible to control the granularity of the data exchanged between the workers in the pipeline (i.e., by limiting the number of good rules passed at each stage of the stream).

• The algorithm may not return the same rules as the sequential covering algorithm. There are two main reasons for this. First, learning from subsets of data may prevent some rules from being found, or allow other rules to be found, because the data used in learning is different. Another cause of this "incompleteness" concerns the order in which rules are consumed from the bag, i.e., the order in which rules are added to Rules_Learned.


5.3.4 Reducing Overfitting by Exploring Sample Variance

Each rule kept in the "bag" (RulesBag) is evaluated using all subsets of data, originating the sets of values pos_1, . . . , pos_p and neg_1, . . . , neg_p, which are, respectively, the numbers of positive and negative examples covered by a rule on each subset of data. An open question is how to order the rules in the "bag".

A simple approach is to use the global values, i.e., the number of positive and negative examples covered by a rule over all subsets. The behavior of the algorithm using this approach is very similar to that of a sequential algorithm. In fact, this was the approach implemented.

Another approach is the use of mean estimates taking into account the sample variance, instead of computing the global values. The potential advantage of this approach resides in the reduction of overfitting.

For instance, consider that we are using rule accuracy to evaluate the quality of a rule. The global and average accuracies are estimates of the real accuracy of a rule; the real value is usually unknown because not all data is available. A rule r has an accuracy on the subset E_k of

acc(r, k) = r_k / |E_k|

where r_k is the number of examples in E_k (with E_k = E+_k ∪ E−_k) that are correctly classified by r, i.e., r_k = pos_k + (|E−_k| − neg_k). The global accuracy of a rule r, acc_g(r), would be computed as

acc_g(r) = (Σ_{i=1..p} r_i) / |E|

while the average accuracy of a rule r, acc_a(r), is computed as

acc_a(r) = x̂ − s,  with  s = √( Σ_{i=1..p} (acc(r, i) − x̂)² / p )

where s is the sample deviation and x̂ = (1/p) Σ_{i=1..p} acc(r, i) is the average value. The sample deviation is used to obtain estimates closer to the values observed on unseen data.


Example 8 (Reducing overfitting)

As an example, consider that p = 2 and that we have a rule r1 and a rule r2 that correctly classify {2, 2} and {4, 1} examples, respectively, out of 5 examples on each subset (10 in total).

Using global accuracies to select or order the rules, r2 is selected first, since acc_g(r2) = 0.5 is higher than acc_g(r1) = 0.4. However, one can see that the variance of rule r2 is higher than that of r1 (3 as opposed to 0). By using average values with the sample variance, the accuracies would be acc_a(r1) = 2/5 = 0.4, which means that rule r1 would be selected. By taking the sample variance into account, the predictive quality of rule r2 is probably worse than that of rule r1.
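A small sketch contrasting the two orderings on the numbers of Example 8 (the penalized score, mean minus sample deviation, reflects our reading of the formula above and is illustrative only):

import math

def acc_global(correct, sizes):
    # Globally correct examples over all examples.
    return sum(correct) / sum(sizes)

def acc_averaged(correct, sizes):
    # Mean per-subset accuracy penalized by the sample deviation.
    accs = [c / n for c, n in zip(correct, sizes)]
    mean = sum(accs) / len(accs)
    s = math.sqrt(sum((a - mean) ** 2 for a in accs) / len(accs))
    return mean - s

r1, r2, sizes = [2, 2], [4, 1], [5, 5]
print(acc_global(r1, sizes), acc_global(r2, sizes))                    # 0.4 0.5 -> r2 wins
print(round(acc_averaged(r1, sizes), 2), round(acc_averaged(r2, sizes), 2))  # 0.4 0.2 -> r1 wins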

5.4 p2-mdie: A pipelined data-parallel algorithm for ILP

In the previous section we described and discussed a general parallel covering algorithm. We next describe an instance of the p2-covering algorithm. The p2-mdie algorithm is a pipelined data-parallel covering algorithm based on MDIE. Figure 5.6 outlines the algorithm (the notation used is explained in Section 5.3.1).

The main differences between this algorithm and the generic p2-covering algorithm reside in the arguments of the procedure itself and in the pipelined search procedure. The algorithm starts by randomly and evenly partitioning the examples into p subsets, where p corresponds to the number of available workers. This task is performed only by the master. Each worker is then notified to load its subset of examples together with the remaining data (prior knowledge, constraints, . . . ). Next, p pipelines are started, one pipeline per worker. In the first stage of the pipeline (see Figure 5.7) the most specific rule ⊥ for the selected example is built, and then the pipelined search based on ⊥ is started.

In this algorithm we assumed that files can be shared by all workers and, for this reason, no messages containing the background knowledge (B), the constraints (C), and the examples (E+ and E−) are exchanged between the master and the workers. If file sharing is not possible, we can exchange messages containing this data. Note that the data may be partitioned before the computation. The transmission cost is kept low in both approaches because the data must be loaded only once. The proposed algorithm has the drawback of being unable to learn recursive concepts.

p2-mdie(E+, E−, B, C, p, w)

Input: set of positive (E+) and negative (E−) examples, background knowledge (B), constraints (C), number of workers (p), and the maximum number of consistent rules (w) passed on each stage of the pipeline.
Output: A set of complete and consistent rules.

1.  Theory = ∅
2.  Partition E+ and E− into p subsets (E+_1, E−_1), . . . , (E+_p, E−_p)
3.  broadcast load_examples()   /* Each worker loads its subset of examples */
4.  Remaining = |E+|
5.  while Remaining > 0 and other stopping condition not met do
6.     for k = 1 until p do
7.        send k start_pipeline(w)   /* Launch a pipeline in worker k */
8.     end for
9.     RulesBag = ∪_{k=1..p} (receive k Rules_k)
10.    broadcast evaluate(RulesBag)
11.    Results = Σ_{k=1..p} (receive k Result_k)
12.    while RulesBag ≠ ∅ do
13.       R = pickBest(RulesBag, Results)
14.       RulesBag = RulesBag \ {R}
15.       Rules_Learned = Rules_Learned ∪ {R}
16.       broadcast mark_covered(R)
17.       Remaining = Remaining − PosCovered(R, Results)
18.       broadcast evaluate(RulesBag)
19.       Results = Σ_{k=1..p} (receive k Result_k)
20.       Rules2Delete = notGood(RulesBag, Results)
21.       RulesBag = RulesBag \ Rules2Delete
22.    end while
23. end while
24. return Rules_Learned

Figure 5.6: p2-mdie - A pipelined data-parallel covering algorithm based on MDIE.

start_pipeline(w)

Input: The maximum number of consistent rules (w) passed on each stage of the pipeline.
1. k = ∅
2. e = select an example from E+
3. ⊥e = build_bottomclause(e, B, H_{i−1}, Train, C)
4. learn_rule'(1, B, C, w, ∅, ⊥e)

evaluate_rules(Rules)

Input: Set of rules (Rules)
1. Stats = ∅
2. foreach rule in Rules
3.    Stats = Stats ∪ evalOnExamples(rule)
4. endfor
5. send master Stats

load_examples()

1. me = processor id
2. <B, C, E+, E−> = load(me)   /* Loads the subset of examples and remaining data */

mark_covered(R)

Input: Rule (R)
1. B = B ∪ {R}
2. E+ = E+ \ {e ∈ E+ | R ∧ B ⊢ e}

Figure 5.7: Worker view of the pipelined data-parallel covering algorithm based on MDIE.

5.5 Experiments and Results

We carried out two sets of experiments. The first set evaluates the algorithm's speedup and the impact of the pipeline width on its performance. The second set tests the scalability of the algorithm in the number of examples.


5.5.1 Materials

We used four ILP applications in the experiments. Table 5.1 characterizes the applications used by the number of positive (|E+|) and negative (|E−|) examples, and the number of relations in the background knowledge (|B|).

It is important to point out that, although these applications are not very large in size, the time required to process them sequentially may be considerable. We therefore tuned the settings of the ILP system so that the sequential runs would not take more than two hours (e.g., by imposing a threshold on the number of rules generated in each search). Obviously, the side effect of constraining the search this way is a penalty on the performance of the models found.

Application        |E+|    |E−|    |B|
Carcinogenesis      182     155     38
Mesh               2841     290     29
Mutagenesis         125      63     21
Pyrimidines        1394    1394    244

Table 5.1: Applications characterization.

We implemented the pipelined data-parallel covering algorithm based on MDIE as described in the previous section. The complete algorithm was implemented in the Prolog language, as part of the April ILP system [FSCC03], version 0.9. The Prolog engine used was the YAP Prolog system [CDRA89], version 5.1. For the communication layer we used LAM MPI [BDV94, SL03].

Settings

Table 5.2 shows the main settings used for each application. The parameter nodes specifies an upper bound on the number of rules (nodes-restriction) generated while searching for a rule. The i-depth [MF92] corresponds to the maximum depth of a literal with respect to the head literal of the rule. The parameter CL defines the maximum length that a rule may have, i.e., the maximum number of literals in a clause. MinAcc specifies the minimum accuracy that a rule must have in order to be accepted as good. Finally, the noise parameter defines the maximum percentage of negative examples that a rule may cover in order to be accepted.


Application   i-depth    Nodes    Noise   MinAcc   CL
Carc             2      100000     30%     70%      6
Mesh             4       20000     10%     90%     10
Mut              2       50000      5%     70%      8
Pyr              3       20000      5%     90%      8

Table 5.2: p2: Settings used by April in the experiments.

5.5.2 Method

We ran the experiments on a Beowulf cluster with 4 dual-processor nodes, each with 2 GB of RAM and running the Linux Fedora OS. April was configured to perform a branch-and-bound breadth-first top-down search to find a rule at all stages of the pipeline. The rules generated during a search are ordered using a heuristic that relies on the number of positive and negative examples. We performed a 5-fold cross-validation. The results shown are the average values obtained over the 5 folds.

Two series of runs were performed. The aim of the first series was to collect runtime statistics in order to compute the speedups. The aim of the second series was to collect further statistics, namely the amount of data transferred, the number of epochs, etc.

5.5.3 Speedup Analysis

We ran the algorithm with three different pipeline widths (w): no limit imposed on the width of the pipeline (inf), i.e., all good rules were passed between stages of the pipeline; and limits of 1 and 10 rules between stages. Note that the number of good rules found at each stage of the pipeline can be quite large. For instance, in the Mesh application it is common to have some thousands of rules at the end of a pipeline. Therefore, the amount of communication exchanged between stages can be quite large, as will be shown.

Figure 5.8 plots the average speedups observed with the proposed algorithm. The results shown are the average values of a 5-fold cross-validation. The speedup results are overall very good, as they are almost linear and, in many cases, super-linear. The speedups improve with the increase in the number of processors and, for applications such as Mesh and Mut, limiting the width of the pipeline clearly improves performance.


Figure 5.8: p2-mdie: Average speedups observed with 2, 4, 6 and 8 processors and a pipeline width of 1, 10, and unlimited (inf) (panels: a) Carc, b) Mesh, c) Mut, d) Pyr).


Two factors may explain the super-linear performance. First, dividing the data among more processors speeds up the evaluation of each rule in a subset, a consequence of having fewer examples to test. Second, increasing the number of processors may also increase the number of rules each epoch generates, therefore further reducing the number of epochs. Figure 5.10 shows that this is indeed the case. It presents the average number of epochs for all applications. We can notice that, in all cases, there is a reduction in epochs as we increase the number of processors, indicating that we do benefit from learning several rules in parallel.

We also observed that the best speedups are obtained when constraining the width of the pipeline to 10. This happens because a wider pipeline leads to more data being exchanged between processors, while narrowing the pipeline increases the number of epochs. Figure 5.9 does show that the communication is much higher when the width of the pipeline is unconstrained. Namely, for the Mut and Pyr applications the speedup for 8 processors is below linear for an unlimited pipeline width, but super-linear when the pipeline is constrained. In fact, for these applications, the amount of communication exchanged increases considerably with the number of processors. In two applications, when going from 4 to 8 processors, the amount of data exchanged grows 10 times. Figure 5.10 allows us to observe that the number of epochs is lower with an unlimited pipeline width. Despite this, the execution time is considerably higher. Hence, we conclude that the large number of messages being transferred (and their size) is negatively affecting performance. On the other hand, using the unlimited pipeline width we obtain better results with the Carc application. With this application, the amount of data transferred between stages is quite small (more detailed results in Table B.12), suggesting that not enough rules are found at each stage of the pipeline to cause a (significant) communication overhead.

Figure 5.11 plots the average accuracy variation for each application. Our conclusion, up to a 98% confidence level, is that accuracy does not significantly change for most runs (see Table B.10). Only 4 of the 11 significant changes (approximately one third) are reductions in accuracy; the remaining changes are increases in accuracy. Moreover, the reduction in accuracy was observed only in the Pyr application when processed using six processors. Taking the collected results into account, we may state that, overall, p2 speeds up the execution without significantly (negatively) affecting the quality of the models.

Table 5.3 compares the speedups of the p2 algorithm (with a pipeline width of 10) with those of the best parallel algorithm from the previous chapter (DPILP).


Figure 5.9: p2-mdie: Average amount of communication exchanged (in MB) with 2, 4, 6 and 8 processors and a pipeline width of 1, 10 and unlimited (inf) (panels: a) Carc, b) Mesh, c) Mut, d) Pyr).


Figure 5.10: p2-mdie: Average number of epochs with 2, 4, 6 and 8 processors and a pipeline width of 1, 10 and unlimited (inf) (panels: a) Carc, b) Mesh, c) Mut, d) Pyr).


Figure 5.11: p2-mdie: Average accuracy variation with 2, 4, 6 and 8 processors and a pipeline width of 10 and unlimited (inf) (panels: a) Carc, b) Mesh, c) Mut, d) Pyr).


It is clear that p2 outperforms DPILP in all applications.

On the other hand, if we compare the speedups observed with the p2 algorithm using eight processors with those of the RRR algorithm, we have a tie. However, we should stress that the RRR algorithm requires the whole dataset to be loaded into memory, which for very large datasets can be a problem. The design of the p2 algorithm, in contrast, makes it capable of handling large datasets with considerable speedups. Therefore, greater speedups can be achieved by increasing the number of processors.

Application   Algorithm    2     4     6     8
Carc          DPILP       1.2   2.8   3.5   9.9
              p2 w10      1.2   3.0   8.0  11.9
Mesh          DPILP       0.4   1.8   1.9   3.1
              p2 w10      1.7   4.6   6.5   7.1
Mut           DPILP       5.7   2.4   2.2   5.5
              p2 w10      2.3   4.6   4.5   5.9
Pyr           DPILP       1.3   2.1   3.9   4.5
              p2 w10      2.0   4.2   6.5   8.3

Table 5.3: p2-mdie: Comparison of DPILP with p2-mdie (with a pipeline width of 10) as far as speedup is concerned.

Overall, the speedups observed are quite good, especially when compared with the other distributed ILP algorithms presented in the previous chapter. By constraining the pipeline width (to 10), increased speedups can be achieved without compromising the quality of the models.

5.5.4 Scalability Analysis

Another question that we addressed was: does the algorithm scale up with the number of examples? To answer this question we selected the two largest applications (Mesh and Pyr) and re-ran the experiments such that the number of examples used by each processor remained the same as the number of processors varied. The rationale is that the algorithm scales if the execution time remains approximately the same when the number of examples is increased proportionally with the number of processors. In other words, if we double the number of examples and the number of processors, then the execution time should remain approximately the same.



Application    2     4     6     8

Mesh          520   600   444   613
Pyr           707   522   603   698

Table 5.4: p2: Scalability results. Execution time (in seconds) observed when fixing the number of examples used by each processor.

Table 5.4 summarizes the results obtained. One can observe that, despite increasing the data 4-fold between two and eight processors, the execution time is not significantly affected. This suggests that the algorithm scales up on the number of examples.

5.6 Related Work

Our proposal is closely related, at the conceptual level, to Incremental Batch Learning (IBL) [CCHB89]. In IBL the set of examples E is randomly partitioned into p subsets of approximately equal size (maintaining the proportion of positive and negative examples). A learning procedure is applied to a subset of examples Ei to build an output Ci using Ci-1 as input. By breaking up the set of examples into p subsets, the process of learning a hypothesis is broken into a sequence of p stages (a "pipeline") such that once the first process is completed its output is passed as input to the second, and so on, until all p processes have executed. The main difference between our proposal and IBL concerns the pipelined operation: we advocate the parallel pipelining of the search procedure, while the IBL approach simply pipelines the entire algorithm.
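The staged scheme can be made concrete with a small Prolog sketch (a minimal sketch only; partition/3 and learn/3 are hypothetical predicates assumed to be supplied by the learning system, not part of IBL as published):

    % Split E into p subsets and learn in p stages, each stage feeding
    % its output theory to the next (the empty theory starts the chain).
    ibl(Examples, P, Theory) :-
        partition(Examples, P, Subsets),
        ibl_stages(Subsets, [], Theory).

    ibl_stages([], Theory, Theory).
    ibl_stages([Ei|Rest], Prev, Theory) :-
        learn(Ei, Prev, Ci),          % build Ci from Ei using Ci-1 as input
        ibl_stages(Rest, Ci, Theory).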

Our proposal combines, in a sense, some previous work in parallel ILP based on data partitioning. As in PolyFarm [CK03] and Konstantopoulos' proposal [Kon03], the data is partitioned among the workers and they evaluate the rules sent by the master. However, we go further and propose pipelining the search for a rule through all processors (therefore using all data). Unfortunately, training on smaller subsets of the whole data may reduce the quality of learning. We addressed this problem by streaming the best rules over every subset in a pipelined fashion.

Another approach often used to reduce learning time and handle large datasets is to learn from samples of the data (see, e.g., [Bre99, Für98]). It has been shown that learning from samples of data can significantly reduce the learning time without negatively affecting the quality of the models [Bre99]. However, even when learning from samples,



the training set size can still be considerable. Consider a dataset containing 50GB of data: learning from a sample of 10% can still be a considerable challenge. Even with a few hundred examples, ILP systems may still take unacceptably large amounts of time to find a model.

Notwithstanding the use of sampling methods, parallelism can be exploited to speed up the process of searching for a model. Parallelism and sampling are therefore orthogonal and can be combined to achieve greater reductions in the learning time. In [CMH+03] the authors concluded that large datasets can be handled by simple partitioning to form a committee of classifiers, whose performance can be expected to exceed that of a single classifier built from all the data. Parallelism was exploited by learning the classifiers in parallel. Another example of a successful combination of sampling and parallelism is the distributed version [CHB+02] of Breiman's Rvote and Ivote methods [Bre99]. In both approaches, several classifiers need to be learned (and this is done sequentially). Our parallel algorithm could be used to quickly learn those classifiers using a sample as the training data.

Some studies have also been made, within an ILP context, related to learning from subsets of data [Mug93, Sri99]. A procedure called layered learning [Mug93] constructs, in stages, increasingly correct theories. It starts by using a small sample of the data to construct an approximately correct theory, which is improved in the next stages. The sample at each stage is a superset of the sample of the previous stage. This is a major difference in relation to our approach. Subsampling and logical windowing were studied in the ILP context by Srinivasan [Sri99]. Subsampling consists of repeating the holdout method k times; the estimated accuracy is derived by averaging the runs. Our approach is orthogonal to subsampling, since it is applied to the training set generated by the subsampling procedure. The empirical results of applying these techniques [Sri99] showed a reduction in the execution time, while the theory obtained was comparable in predictive accuracy to the one found without sampling. A main difference between logical windowing and our proposal is that in windowing the sample at each stage depends on the theory generated at previous stages and is a superset of the sample of the previous stage, whereas in our proposal the dataset is simply partitioned into independent subsets and learning is performed at each stage using a single subset.

To conclude, our proposal is orthogonal to subsampling. It may, and should, be used in conjunction with subsampling to minimize the learning time (without compromising the results). A simple approach is to perform the desired subsampling method and then use a learning system equipped with a p2-like algorithm to quickly learn a



classifier in parallel from a sample.

5.7 Summary

We have proposed a novel parallel algorithm for predictive rule learning that combines two techniques: data parallelism and pipelining. The granularity of the tasks, together with the capability to scale up to many processors, makes this approach suitable for distributed computer systems. The algorithm is general and is applicable to other learning systems based on the covering algorithm approach.

The approach followed to parallelize the covering algorithm using pipelined data parallelism has the following characteristics: i) it fosters scalability on the number of examples; ii) the tasks executed in parallel have very similar granularity, which leads to balanced computations and hence simplifies processor load balancing; iii) it allows one to control the granularity of the data exchanged between the processors in the pipeline (i.e., by limiting the number of good rules passed on at each stage of the stream). By constraining the number of good rules streamed along the pipeline, one can further reduce the learning time at the cost of losing some quality in the models found. By setting the pipeline width to a number close to the maximum number of processors used, a compromise was achieved between learning time and quality of the models. However, further studies need to be performed in order to determine whether an ideal pipeline width exists.
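To make the pipeline-width mechanism concrete, the following schematic Prolog sketch renders one pipelined rule search sequentially (in the actual setting each stage runs on a different processor over its own subset; search_stage/3 and best/3 are hypothetical predicates, and this is a sketch rather than the April implementation):

    % Stream the W best rules found so far through the p data subsets.
    pipeline_search([], Rules, _W, Rules).
    pipeline_search([Ei|Rest], RulesIn, W, Best) :-
        search_stage(Ei, RulesIn, Candidates),   % refine/evaluate on subset Ei
        best(W, Candidates, RulesOut),           % keep only the W best rules
        pipeline_search(Rest, RulesOut, W, Best).

Constraining W bounds both the size of the messages exchanged between stages and the amount of search performed at each stage.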

The p2-covering algorithm was integrated in a widely used ILP technique - MDIE. An empirical evaluation of the modified MDIE algorithm was carried out on a Beowulf cluster. It showed very good results, achieving super-linear speedups in some cases, without significantly affecting the quality of the models.


"By three methods we may learn wisdom: First, by reflection, which is noblest; Second, by imitation, which is easiest; and third, by experience, which is the bitterest."

Confucius

6 Conclusion and Further Research

Inductive Logic Programming (ILP) has been successfully applied in many application areas, but one of its shortcomings is the long execution time [PS03]. It is important to solve this problem so that ILP systems become more interactive and able to handle larger problems. A possible solution is to exploit parallelism in ILP.

In this dissertation we studied and developed techniques to exploit parallelism in ILP systems. We now present a brief summary of the dissertation, point out the main contributions, and discuss directions for future research.

6.1 Summary

A new efficient and modular ILP system, called April, was designed and implemented. It combines and integrates several techniques to maximize efficiency. Among the techniques implemented in April are: query transformations [CSC+03], randomized searches [vSP02], coverage caching [Cus96], lazy evaluation of examples [Cam03], tabling [RFC05], and parallelism [FSCC05, FSC05]. April's ability to explore several parallelization strategies is its main difference from other systems. April has shown




to be competitive, in terms of sequential execution time and memory usage, when compared to the Aleph ILP system.

It was observed that when performing a top-down search, April spends most of the time evaluating clauses and, to a smaller extent, generating clauses. However, when performing a stochastic clause selection search, the time is mostly spent generating clauses.

The results gathered show that the use of a randomized algorithm (RRR), as opposed to a deterministic top-down branch-and-bound algorithm, significantly reduces the running times with little loss of accuracy.

There is undoubtedly a variety of parallel schemes and algorithms that can be exploited in ILP. A survey of the state of the art in parallel ILP was presented and the parallel ILP algorithms described in the literature were grouped into four main approaches: parallel exploration of independent hypotheses; parallel exploration of the search space; parallel coverage tests; and parallel execution of an ILP system over a partition of the data.

A comparative study of several parallel algorithms was conducted, on a distributed memory computer, to assess the best overall approach. The results gathered show that the overall best strategy to parallelize ILP systems is one of the simplest to implement: divide the set of examples into p subsets; run the ILP system in parallel on each subset using p processors; and combine the p theories found into a single theory.
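A minimal Prolog sketch of this strategy (partition/3, learn_theory/2 and combine/2 are hypothetical predicates; maplist/3 stands in for the p runs that would execute in parallel, one per processor):

    % Learn one theory per subset, then merge the p theories into one.
    dpilp(Examples, P, Theory) :-
        partition(Examples, P, Subsets),
        maplist(learn_theory, Subsets, Theories),
        combine(Theories, Theory).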

The performance results observed for the sequential randomized algorithms triggered a detailed comparison with the implemented parallel algorithms, not just in terms of execution time but also in terms of the quality of the theories produced. Obviously, randomized searches can also be performed in parallel, thus two parallel randomized algorithms were implemented.

The results gathered suggest that a sequential randomized algorithm (RRR) is able to achieve better speedups than the parallel algorithms, with the exception of the p2-covering algorithm. The running time of the RRR algorithm was further reduced using parallelism. The drawback of the RRR algorithm is that it requires the whole dataset to be loaded into memory, which can be a problem for very large datasets. On the other hand, data-parallel algorithms such as DPILP can cope with this problem.

We have proposed a novel parallel covering algorithm (p2-covering) that combines



data parallelism and pipelining. The algorithm fosters scalability on the number of examples and allows control of the granularity of the data exchanged between the processors in the pipeline.

The p2-covering algorithm was integrated in a widely used ILP technique - MDIE. An empirical evaluation of the modified MDIE algorithm was carried out on a Beowulf cluster. The algorithm yielded very good results, achieving super-linear speedups in some cases, without significantly affecting the quality of the models. Furthermore, the algorithm scaled up on the number of examples.

When comparing the speedups observed for p2-covering and RRR we come to a draw. However, the RRR algorithm requires the whole dataset to be loaded into memory, which for very large datasets can be a problem. The design of the p2-covering algorithm, on the other hand, makes it capable of handling large datasets with close to linear speedups. Therefore, greater speedups are achieved by increasing the number of processors.

Finally, for an ILP practitioner, we conclude that the RRR algorithm is a computationally "cheap" method of achieving substantial reductions in the execution time, which outperforms most parallel algorithms. However, given its main drawback, that the whole dataset must be loaded into memory, the p2-covering algorithm is a better choice when a parallel computer is available, since it scales on the number of examples and achieves close to linear or super-linear speedups without compromising the quality of the theories found.

6.2 Key Contributions

This dissertation makes the following major contributions:

• We designed and implemented a new ILP system - called April - capable of transparently (from a user's point of view) exploiting several parallelization strategies in distributed and shared memory computers. Moreover, we have shown that April's sequential execution performance is comparable to other state-of-the-art ILP systems.

• We showed, through an empirical study, that a good approach to parallelize ILP systems, on a distributed-memory computer, is one of the simplest to implement: divide the data among all available processors; run the ILP system in parallel



on each processor using a subset of the data; combine the theories found by each processor into a single theory. This approach not only reduced the execution time, but also improved the predictive accuracy of the models found.

• We showed, through an empirical study, that a sequential randomized algorithm (RRR) outperforms parallel algorithms doing complete searches running with up to eight processors, without sacrificing the quality of the solutions found. Furthermore, we have shown that a parallel version of a randomized algorithm further decreases the learning time.

• We designed, implemented and evaluated a new parallel algorithm (p2-covering) for ILP whose main innovation is the combination of two parallelization techniques: data parallelism and pipelining. We have shown that p2-covering

outperforms previous parallel algorithms and is capable of achieving linear and super-linear speedups, on a distributed memory computer, without affecting the quality of the models found. Furthermore, the algorithm has been shown to scale on the number of examples.

6.3 Future Work

Although we have contributed to the study of parallelism in ILP, further research is still required. In the following we discuss several directions for future work that could immediately follow the work described in this dissertation.

Parallel algorithms

The experimental evaluation has shown that DPILP is a good approach to exploit parallelism in ILP systems, being outperformed only by the p2-covering algorithm. We also observed that the randomized RRR algorithm achieves great speedups, but it requires loading all data into memory, which DPILP does not. Naturally, an interesting line of research is to combine the best of both algorithms. We plan to evaluate this approach in the near future.

Another direction to pursue is to use other approaches to combine the theories found in each subset (e.g., [PF05]).

Finally, a natural extension of this line of research is to evaluate the parallel algorithms on larger parallel computers and applications.



Extending the p2-covering algorithm

The p2-covering algorithm can be implemented in many different ways. One can use different search procedures (such as randomized rapid restarts), different strategies to partition the data (e.g., sampling with replacement), and other approaches to order/select the rules. Each combination of a search procedure, partitioning strategy, and rule ordering scheme is a variation of the algorithm that may lead to different results. We studied a few combinations, but we plan to study others in the future. Furthermore, we plan to assess whether the proposed approach to reduce overfitting is effective.

Optimizations

Communication is always an important factor in the performance of parallel algorithms, especially on distributed memory machines. Currently, all data is exchanged between processors as plain Prolog terms containing lists which, in some cases, can be considerably large (more than 20MB). Compression of the data stream (e.g., using a fast compression algorithm) could reduce the amount of data exchanged and, possibly, improve performance. Therefore, it is important to investigate and implement improvements in the communication layer in order to reduce parallelism overheads and, thus, improve performance.

April

The to-do list for April is too long to be enumerated here. First, and foremost, we plan to release the first version of the April system. The distribution will include the April executable to run sequentially or in parallel, example applications, and documentation of the system. We also plan to extend April to support learning from interpretations and to improve the procedures for rule selection, among other things.

6.4 Final Remarks

The implementation of the April system has been a hard but challenging and interesting task. The implementation and integration of different techniques often brings unexpected problems. Balancing modularity with efficiency can also be



tricky, in particular when using languages like Prolog. It is very easy to get run-time errors that are very hard to trace and solve (e.g., a goal simply fails when it should succeed and no error is reported). Developing parallel programs in Prolog is also a troublesome experience: if tracing a problem in a sequential Prolog program can be complicated, tracing several instances of the program is harder still (especially when one needs to wait a considerable amount of time for the problem to occur). Notwithstanding the challenges, the development of April gave the author a great deal of experience in a variety of areas, namely the implementation of ILP systems, parallel programming, Prolog, and MPI.


"Logic will get you from A to B. Imagination will take you

everywhere."

Albert Einstein

A Logic Programming and Prolog

This appendix introduces some basic notions and terms from logic programming and Prolog. These include the language (syntax) of logic programs, as well as the basic notions from model and proof theory. An in-depth introduction to Logic Programming can be found in [Llo87, Hog90, NS97].




A.1 Logic Programming

The logic programming language is a subset of First Order Logic (FOL) named Horn clause logic.

Language

A clause has the form

A1, . . . , Am ← B1, . . . , Bn

where Ai and Bi are positive literals. The above clause can be read as

A1 or . . . or Am if B1 and . . . and Bn

A positive literal has the form p(t1, . . . , tk), where p is a predicate symbol (represented by a lower case letter followed by a string of lower case letters, digits or underscores) and all ti are terms. The negation of a positive literal, represented with the negation symbol ¬ as a prefix, is called a negative literal. A term is a variable (represented by an upper case letter followed by a string of lower case letters and digits) or a function symbol (represented by a lower case letter followed by a string of lower case letters, digits or underscores) immediately followed by a bracketed n-tuple of terms. A variable represents an unspecified term to which a value can be assigned (usually designated as instantiated or bound). A variable can be instantiated only once, with another variable or a term. A term containing no variables is called a ground term. A function symbol followed by a bracketed n-tuple of terms is a function with arity n (an n-ary function, or compound term). If n is zero then the function is called a constant (or an atom) and the brackets may be omitted.

All variables in clauses are universally quantified*, although that is not explicitly stated. The conjunction "and" is represented as ∧ and the disjunction "or" as ∨. Since a ← b ≡ a ∨ ¬b, the above clause is also equivalent to

A1 ∨ . . . ∨ Am ∨ ¬B1 ∨ . . . ∨ ¬Bn

*Sometimes one may need to remove all existential quantifiers from a clause. Skolemization is the process of removing all existential quantifiers from a formula. This is done by replacing the existentially quantified variables by Skolem functions. A Skolem function symbol is a new symbol that does not occur in the sentence, and its parameters are the universally quantified variables. For instance, the formula ∀X1, . . . , Xn ∃Y father(Y) is rewritten to ∀X1, . . . , Xn father(skolem(X1, . . . , Xn)). In the special case where n = 0 we simply replace the variable with a new constant symbol, called a Skolem constant.

Clauses can be classified by the number of positive and negative literals they contain. A clause is a definite clause if it has exactly one positive literal (m = 1). A Horn clause is a clause with at most one positive literal (m ≤ 1). In a definite clause, the positive literal is called the head of the clause, and the set of all negative literals is called the body of the clause. An indefinite clause is a clause containing at least two positive literals (m > 1). A definite clause with no literals in the body is called a definite unit clause. A set of definite clauses is called a definite logic program. A set of definite clauses with the same predicate symbol p in the head forms a predicate definition of the predicate p. A negative clause is a clause which has zero or more negative literals and no positive literals. The empty clause is a negative clause and represents the contradiction false ← true. The empty clause is denoted by □ (which also represents the logical constant False).

Other classifications of clauses are: a recursive clause has at least one literal in the body with the same predicate symbol as the head literal; a ground clause does not have any variables; a range-restricted clause has all the variables in the head appearing at least once in the body.

A finite set of clauses is called a clausal theory and represents a conjunction of clauses. Thus, the clausal theory {c1, c2, . . . , cn} can be equivalently represented as (c1 ∧ c2 ∧ . . . ∧ cn). A clause or clausal theory is called function free if it contains only variables as terms, i.e., it contains no function symbols. An empty clausal theory is denoted by ⊤ (which also represents the logical constant True).

A Horn theory is a clausal theory containing only Horn clauses. A range-restricted definite program is a definite program in which all clauses are range-restricted.

The following concepts are often referred to in ILP. Let c be a clause and T a set of clauses such that c ∉ T. Without loss of generality we may consider each clause as a sequential definite clause [vdL95], i.e., a sequence of literals, of the form l0 ← l1, . . . , ln (n ≥ 1); li (1 ≤ i ≤ n) is a literal in the body of the clause and l0 is the head literal of the clause. Each literal li can be represented by pi(A1, . . . , Aia), where pi is the predicate symbol with arity ia and A1, . . . , Aia are the arguments. A literal li in a clause c is at depth i. A literal li is redundant in a clause c if (c \ li) ⊨ c. A clause c is redundant in



a set of clauses T if and only if T ⊨ c. A theory T is reduced if and only if T contains no redundant clauses. The depth of a variable v in a clause c, denoted by d(v), is the minimum of {k | lk contains the variable v}. For instance, in p(X) ← q(X, Y), r(Y, Z) we have d(X) = 0, d(Y) = 1, and d(Z) = 2.

Semantics

The Model Theory (semantics) of Logic Programming provides a way of assigning meaning to any given sentence of logic. The basic idea is to associate the sentence with some truth-valued statement about a chosen domain, a process known as interpretation.

An interpretation is determined by the set of ground facts (ground atomic formulae) to which it assigns the value true. Sentences involving variables and quantifiers are interpreted by using the truth values of the ground atomic formulae and a fixed set of rules for interpreting logic operations and quantifiers, such as: ¬l is true if and only if l is false.

The Herbrand Universe, or Herbrand Domain, of a clausal theory (or logic program) P, denoted HP, is the set of all ground terms formed from the function symbols found in P. The Herbrand Base of P, denoted B(P), is the set of ground atoms formed from the function symbols and predicates found in P.

A Herbrand interpretation I of a clausal theory P is an interpretation whose domain is the Herbrand Base of P. A Herbrand interpretation is usually seen as the subset of the Herbrand Base that is mapped by I to true.

A relevant concept is substitution. Let θ = {X1/t1, . . . , Xn/tn}; θ is said to be a substitution when each Xi is a variable, each ti is a term, and Xi = Xj → i = j. The application of θ to a term t, denoted by tθ, is the act of replacing every occurrence of Xi in t by ti. For example, applying θ = {X/tom} to the term father(X, Y) yields father(tom, Y).

Related to substitution is the concept of unification. Two atoms are unifiable when they can be made identical by replacing variables by terms in a consistent way. A substitution that makes two atoms identical is called a unifier. The most general unifier is the substitution which minimally instantiates the two atoms.
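Anticipating the Prolog notation of Section A.2, unification and the most general unifier can be observed directly with the built-in =/2:

    ?- p(X, f(Y)) = p(a, f(b)).
    X = a,
    Y = b.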

Inference

Logic Programming is concerned with deducing which clauses are logically implied by some program (set of clauses). The process of deriving new sentences from known



sentences is accomplished through the application of a set of rules known as inference rules.

An inference rule has the following schematic form: "from a set of sentences of that kind, derive a sentence of this kind". More formally, an inference system consists of an initial set S of sentences (axioms) assumed to be true and a set R of inference rules. The inferred (derived) sentences are called theorems (or syntactic consequences). The fact that a sentence s can be inferred from the set of axioms S using the inference rules of the set R is denoted by S ⊢R s. A proof is a sequence s1, s2, . . . , sn such that each si is either in S or derivable using R from S and from s1, . . . , si-1. Such a proof is also called a derivation or deduction. Note that the above notions are entirely of a syntactic nature.

The set of inference rules R defines the derivability relation ⊢. A set of inference rules is sound if the corresponding derivability relation is a subset of the logical implication relation, i.e., for all S and s, if S ⊢ s then S ⊨ s. A set of inference rules is complete if the other direction of the implication holds, i.e., for all S and s, if S ⊨ s then S ⊢ s. The properties of soundness and completeness establish a relation between the notions of syntactic (⊢) and semantic (⊨) entailment in logic programming and first-order logic. When the set of inference rules is both sound and complete, the two notions coincide.

According to Church's theorem, logical implication is undecidable for first-order logic, i.e., there is no algorithm that can determine whether c1 ⊨ c2 for arbitrary c1 and c2. However, proof procedures exist which are both sound and complete.

Resolution is an inference rule that produces a clause c (called the resolvent) that is a consequence of two other clauses c1 and c2, called the parent clauses. The parent clauses must have complementary literals, i.e., for some literal l1 in one of the clauses there must be a literal ¬l2 in the other clause such that l1 and l2 are unifiable. Therefore, if c1 is {p(X) ∨ ¬q(X)} and c2 is {q(b)} then the resolvent is p(b). Combining several resolution steps, starting from clauses in S and ending in S′, we get a derivation from S, denoted S ⊢r S′.

A.2 Prolog

Prolog [Bra90] is a Logic Programming language. Many ILP systems are implemented in Prolog and many examples in the ILP literature are given using the Prolog language.



Language

In the Prolog context, positive unit clauses are also called facts. All other definite clauses are called rules. Negative clauses are called either queries or goal clauses. A logic program is a conjunction of Horn clauses, where each clause defines some predicate. Program clauses are a generalization of Horn clauses that may contain negative literals in their body. A normal program or Prolog program is a finite set of program clauses. In practice, a Prolog program is an ordered list of clauses, and not a set of clauses.

The definition of a predicate p/k, where p is the predicate symbol of the head literal and k the arity of the predicate, is composed of one or more clauses. In order to represent logic programs we will use the notation of the logic programming language Prolog. The symbols for disjunction (∨) and conjunction (∧) are replaced by a semi-colon and a comma, respectively, and a clause ends with a period. The implication arrow ← is represented by :-, and negation by not.

Example 7

An example of a logic program with the definition of the father/2 predicate:

i) Facts: son(tom, parris). man(tom).

ii) Rule: father(X, Y) :- man(X), son(Y, X).

A particular case of a definite clause (program) is a Datalog clause (program). A Datalog clause (program) is a definite clause (program) in which all function symbols have zero arity. This means that only variables and constants can be used as predicate arguments.

Semantics

The meaning of a (definite) logic program P is the minimal (Herbrand) model MM(P). The minimal model of a definite program P corresponds to the set of ground atoms that are its logical consequences. These ground atoms are elements of the Herbrand Base of P. In other words, the meaning of P, MM(P), is the set of ground logical consequences of P. A fact q is a logical consequence of a program P if all models of P are also models of q. This is denoted by P ⊨ q.
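For instance, for the program P of Example 7, the minimal model contains only the two facts, since man(parris) is not derivable and hence no father/2 atom is a logical consequence:

MM(P) = {man(tom), son(tom, parris)}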



Inference

A logic program is executed by posing queries to it. A query q is a negative clause of the form :- l1, . . . , ln. Basically, posing the query q to a program P corresponds to checking whether q is a ground logical consequence of P, or whether it is possible to assign values to the variables in q such that q is a logical consequence of P after replacing the variables by the respective values. A query may succeed or fail. If it succeeds and contains variables, then a substitution (called an answer substitution) is also part of the answer. Several answer substitutions may exist. The process of substituting variables by terms is called instantiation.

Example 8

Consider the program given in Example 7. The query

:- man(X).

would succeed with the answer substitution {X/tom}. On the other hand, the query

:- man(parris).

would fail.

A query is solved in Prolog engines by using the SLDNF-resolution proof procedure. SLDNF-resolution extends SLD-resolution in order to handle negated literals with the negation as failure rule. Negation as finite failure is a derivation rule that states that if P ∪ {:- q} finitely fails then we can derive the ground atom not q. Typically, a query is evaluated to be false by virtue of not finding any positive rules or facts that support the statement. This is called the Closed World Assumption: if a fact is not known to be true (or false), it is assumed to be false.
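With the program of Example 7, where man(parris) is not derivable, negation as failure makes the corresponding negated query succeed (\+ is the standard Prolog negation-as-failure operator):

    ?- \+ man(parris).
    true.               % the goal succeeds because man(parris) finitely fails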

SLD-resolution is the abbreviation of linear resolution for definite clauses with a selection function. In most Prolog systems it works as follows. Given a query :- l1, . . . , ln and a program P, the leftmost literal of the query is selected (l1). Next, the first clause in P whose head can be unified with l1 is selected. Let c be such a clause, of the form h :- b1, . . . , bk, such that l1θ1 = h, where θ1 is the most general unifier of l1 and h, i.e., the substitution which minimally instantiates the two atoms. The resolvent R1 is:

:- (b1, . . . , bk, l2, . . . , ln)θ1

The procedure repeats for the remaining literals, thus generating more resolvents. It ends when the empty clause is obtained or when no unifier is found (in which case the query fails).
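As a worked illustration, consider a variant of Example 7 extended with the fact man(parris) (added here only so that the derivation succeeds):

    son(tom, parris).
    man(tom).
    man(parris).        % extra fact, added for this illustration only
    father(X, Y) :- man(X), son(Y, X).

    % SLD-derivation of the query  ?- father(F, tom).
    %   :- father(F, tom).         resolve with the rule, theta1 = {X/F, Y/tom}
    %   :- man(F), son(tom, F).    resolve man(F) with man(tom), F = tom
    %   :- son(tom, tom).          no unifier: fail, backtrack
    %   :- man(F), son(tom, F).    resolve man(F) with man(parris), F = parris
    %   :- son(tom, parris).       resolve with the fact son(tom, parris)
    %   []                         empty clause: the query succeeds, F = parris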



SLDNF-resolution is only sound if negation is applied to ground literals; however, in practice, Prolog engines do not distinguish between ground and non-ground literals. Prolog's computation rule can also lead to problems because it always selects the leftmost literal in a goal, which can in turn lead to infinite derivations.

The facts derivable using SLD-resolution from a logic program P correspond exactly to the minimal Herbrand model of P, i.e., SLD-resolution can be used to find all ground logical consequences of P.


"Data! Data! Data!" he cried impatiently. "I can't make bricks

without clay."

Sir Arthur Conan Doyle in Sherlock Holmes: The Adventure of

the Copper Beeches (1892)

B Supplementary Tables and Graphics

This appendix includes complementary tables and graphics that present results from the experiments performed.




B.1 Performance Evaluation: April vs Aleph

Application      Aleph           April          Diff (%)

Carc         35.93 (4.01)    24.68 (3.33)       -31%
Mesh         84.73 (4.01)    70.31 (5.33)       -17%
Mut          16.88 (4.06)    16.99 (4.16)        +1%
Pyr          58.40 (0.21)    61.07 (1.28)        +5%

Table B.1: Average sequential execution time (in seconds) taken by April and Aleph to search for a single rule. Standard deviation is presented within brackets. The relative value of the difference is computed as the ratio of April's minus Aleph's execution time with Aleph's execution time. Statistically significant differences, for a p-value of 0.02, are marked in bold.

Application   Aleph   April   Diff (%)

Carc          49.80   50.72     +2%
Mesh           9.85    9.91     +1%
Mut           35.06   60.39    +72%
Pyr           53.26   53.26      0%

Table B.2: Average predictive accuracies for the rules found by April and Aleph. The relative value of the difference is computed as the ratio of the difference of April's minus Aleph's accuracy and Aleph's accuracy. Statistically significant differences, for a p-value of 0.02, are marked in bold.

Application    Aleph     April    Diff (%)

Carc          160 326    71 215     -56%
Mesh          199 999    66 544     -67%
Mut            10 508    10 080      -4%
Pyr           151 564    43 969     -71%

Table B.3: Average memory usage (in kbytes) to perform a single search for a rule by April and Aleph.

B.2 April's Performance Analysis



Figure B.1: Average distribution of April's execution time over the four main components (saturation, clause generation, clause evaluation, and clause selection) when performing a Stochastic Clause Selection. The Stochastic Clause Selection involved the random generation and evaluation of 688 clauses uniformly distributed by length.

Figure B.2: Average distribution of April's execution time over the four main components (saturation, clause generation, clause evaluation, and clause selection) when performing a Stochastic Clause Selection (outer ring) or a top-down search (inner ring). The Stochastic Clause Selection involved the random generation and evaluation of 688 clauses uniformly distributed by length.



B.3 Strategies to Parallelize ILP

Base (Sequential) Results

Application     BFBB      RRR      SCS

Carc            2 509      107    1 168
Mesh            5 987    1 474    3 300
Mut            31 979   19 338    6 292
Pyr             7 393    1 037    3 293

Table B.4: Sequential execution time (in seconds) with the BFBB, RRR, and SCS algorithms.

Application      BFBB          RRR           SCS

Carc         86.58/55.17   58.90/55.47   84.42/52.79
Mesh         90.33/90.25   89.81/89.70   64.93/62.22
Mut          80.32/76.64   82.32/78.75   79.25/77.17
Pyr          92.51/88.45   90.95/83.65   91.97/85.73

Table B.5: Average accuracy (Training/Test) with the BFBB, RRR, and SCS algorithms.

Application          BFBB          RRR           SCS

Carc        C     2 246 793    1 218 163       39 080
            E     3 086 451      541 890    4 351 553
Mesh        C     2 675 805   46 340 327       64 433
            E   125 140 325    3 907 630    7 318 887
Mut         C        23 260    1 404 573       12 825
            E       155 959      124 594      107 532
Pyr         C     2 989 728   24 443 492      142 966
            E   137 816 529   40 227 624  106 334 268

Table B.6: Number of clauses generated (C) and number of examples evaluated (E) with the BFBB, RRR, and SCS algorithms.



Parallel results

a) PCT
Application         2               4               6               8
Carc         2 310 (1.09)   3 344 (0.75)    3 564 (0.70)    3 876 (0.65)
Mesh        17 623 (0.34)  12 904 (0.46)   17 817 (0.34)   18 539 (0.32)
Mut         20 199 (1.58)  10 816 (2.96)   11 015 (2.90)   10 590 (3.02)
Pyr         13 838 (0.53)  17 679 (0.42)   13 472 (0.55)   17 847 (0.41)

b) DPILP
Application         2               4               6               8
Carc         2 175 (1.15)     907 (2.77)      728 (3.45)      254 (9.89)
Mesh        16 951 (0.35)   3 308 (1.81)    3 096 (1.93)    1 932 (3.1)
Mut          5 625 (5.69)  13 508 (2.37)   14 568 (2.2)     5 789 (5.52)
Pyr          5 735 (1.29)   3 495 (2.12)    1 909 (3.87)    1 653 (4.47)

c) DPLR
Application         2               4               6               8
Carc           950 (2.64)   1 410 (1.78)    1 100 (2.28)      922 (2.72)
Mesh         3 218 (1.86)   1 338 (4.47)    1 121 (5.34)    1 164 (5.15)
Mut          6 910 (4.63)   5 889 (5.43)    5 929 (5.39)    4 091 (7.82)
Pyr          8 789 (0.84)   4 342 (1.70)    2 618 (2.82)    1 551 (4.77)

Table B.7: Execution time (in seconds) and respective speedup (within brackets) for the PCT, DPILP, and DPLR algorithms.

a) PSCS
Application         2               4               6               8
Carc         1 216 (0.96)   1 168 (1.00)    1 123 (1.04)    1 116 (1.05)
Mesh         3 433 (0.96)   3 296 (1.00)    3 375 (0.98)    3 274 (1.01)
Mut         16 404 (0.38)  10 363 (0.61)    7 536 (0.83)    5 882 (1.07)
Pyr          1 763 (0.97)   3 306 (0.52)    3 250 (0.52)    3 222 (0.53)

b) PRRR
Application         2               4               6               8
Carc            46 (2.34)      24 (4.48)       20 (5.26)       18 (5.71)
Mesh         1 916 (0.77)   1 056 (1.39)    1 011 (1.46)      969 (1.52)
Mut        167 416 (0.27)  84 487 (0.54)   58 012 (0.79)   45 394 (1.00)
Pyr          1 071 (0.76)   1 013 (0.81)    1 055 (0.78)    1 133 (0.72)

Table B.8: Execution time (in seconds) and speedup (within brackets) for the parallel randomized search algorithms (PRRR and PSCS).



Figure B.3: Average number of epochs by algorithm. (Panels: a) Carc, b) Mesh, c) Mut, d) Pyr.)


a) DPILP
Application         2               4               6               8
Carc         55.78 (+0.6)    58.43 (+3.3)    58.43 (+3.3)    58.73 (+3.6)
Mesh         91.09 (+0.8)    90.64 (+0.4)    90.89 (+0.6)    90.60 (+0.4)
Mut          84.03 (+7.4)    75.07 (-1.6)    78.29 (+1.7)    78.29 (+1.7)
Pyr          86.94 (-1.5)    87.73 (-0.7)    86.66 (-1.8)    87.26 (-1.2)

b) DPLR
Application         2                4                6                8
Carc         52.80 (-2.37)    54.61 (-0.56)    52.83 (-2.34)    56.10 (+0.93)
Mesh         43.81 (-46.44)   41.50 (-48.75)   40.54 (-49.71)   40.64 (-49.61)
Mut          77.71 (+1.07)    78.24 (+1.59)    79.29 (+2.65)    79.16 (+2.52)
Pyr          79.82 (-8.63)    77.71 (-10.74)   80.56 (-7.89)    79.99 (-8.46)

c) PRRR
Application         2               4               6               8
Carc         54.60 (-0.9)    54.60 (-0.9)    54.60 (-0.9)    54.60 (-0.9)
Mesh         89.86 (+0.2)    89.74 (0.0)     89.77 (+0.1)    89.70 (0.0)
Mut          78.75 (-0.5)    78.75 (-0.5)    78.75 (-0.5)    78.75 (-0.5)
Pyr          83.36 (+0.9)    83.57 (+1.1)    83.21 (+0.7)    83.32 (+0.8)

Table B.9: Average test accuracy and accuracy variation (within brackets) by algorithm. Statistically significant changes (using a t-test with a p-value of 0.02) are marked in bold.

B.4 p2 Results



w                     1                             10                            inf
p        S    Acc.   Acc.Var.  |H|      S    Acc.   Acc.Var.  |H|      S    Acc.   Acc.Var.  |H|

Carc
2       1.0  59.04    -2.7    11.6     1.2  56.96    -4.8    15.8     1.6  55.77    -5.9    20.8
4       2.4  56.34    -5.4     4.4     3.0  57.54    -4.2    10.2     3.7  60.51    -1.2    13.8
6       4.0  60.52    -1.2     4.8     8.0  57.85    -3.9     7.0    10.7  57.24    -4.5     9.4
8       7.8  58.72    -3.0     3.2    11.9  57.55    -4.1     6.8    13.7  56.68    -5.1     8.6

Mesh
2       1.5  64.59    +0.8     6.4     1.7  66.00    +2.2     7.2     1.8  67.22    +3.4     8.2
4       4.7  69.66    +5.9     6.0     4.6  68.34    +4.6     5.6     4.7  70.94    +7.2     7.0
6       6.5  70.75    +7.0     6.6     6.5  70.01    +6.2     6.0     4.2  69.63    +5.8     7.6
8       6.7  66.07    +2.3     5.4     7.1  68.41    +4.6     5.8     4.1  72.36    +8.6     5.8

Mut
2       1.8  83.6     -3.18    5.4     2.3  83.07    -3.7     5.6     2.4  84.13    -2.6     6.8
4       4.7  84.6     -2.11    3.6     4.6  84.11    -2.6     4.4     4.4  81.98    -4.8     5.2
6       4.0  83.5     -3.21    5.6     4.5  84.61    -2.1     4.2     2.6  80.91    -5.8     5.4
8       7.2  84.1     -2.64    4.4     5.9  85.14    -1.6     5.8     3.8  86.24    -0.5     5.8

Pyr
2       2.0  75.90    -0.6     5.0     2.0  76.47     0.0     5.4     2.0  76.47     0.0     5.4
4       3.9  74.64    -1.8     4.4     4.2  74.82    -1.7     4.4     3.7  74.64    -1.8     4.4
6       6.4  73.71    -2.8     4.0     6.5  74.07    -2.4     4.0     5.1  72.85    -3.6     4.2
8       8.2  74.03    -2.4     4.2     8.3  74.39    -2.1     4.2     5.0  75.54    -0.9     4.6

Table B.10: p2: speedup (S), accuracy and accuracy variation, and theory size (|H|) for 2, 4, 6 and 8 processors. Statistically significant differences in accuracy, for a p-value of 0.02, are marked in bold.

Application   p    w=1   w=10   w=inf

Carc          2     22     20      18
Carc          4     15     13      12
Carc          6     10      9       8
Carc          8      8      7       7
Mesh          2    332    321     311
Mesh          4    145    145     136
Mesh          6     97     97      96
Mesh          8     76     70      63
Mut           2      7      6       6
Mut           4      4      4       3
Mut           6      3      2       3
Mut           8      2      2       2
Pyr           2    239    236     236
Pyr           4    131    131     133
Pyr           6     93     93      97
Pyr           8     68     67      65

Table B.11: The impact of the p2 algorithm on the (average) number of epochs for 2, 4, 6 and 8 processors.



w               1                           10                          inf
p    Msgs (MB)  |Msgs|  Max. Msg.  Msgs (MB)  |Msgs|  Max. Msg.  Msgs (MB)  |Msgs|  Max. Msg.
2       3.07      704     0.33        4.54      696     0.49        8.14      698     0.95
4       7.51    1 154     0.68       11.07    1 136     1.01       19.79    1 128     4.96
6       9.27    1 498     0.77       13.64    1 424     1.15       23.87    1 398     2.25
8      12.07    1 882     1.18       17.74    1 784     1.77       31.93    1 826     3.46

Table B.12: Average number of messages (|Msgs|), average amount of communication exchanged in MB, and average size of the maximum message exchanged while executing p2 on the Carc application with a pipeline width of 1, 10, and unlimited (inf).

w               1                           10                          inf
p    Msgs (MB)  |Msgs|  Max. Msg.  Msgs (MB)  |Msgs|  Max. Msg.  Msgs (MB)  |Msgs|  Max. Msg.
2      15.90    5 282     1.69       23.32    5 072     2.51       42.82    4 956     4.96
4      99.89    7 856     4.84      148.64    7 844     7.24      272.63    7 172    14.40
6     538.37   10 786     5.83      805.02   10 760     8.73    1 560.74   10 776    17.34
8     951.30   13 758    10.81    1 422.70   12 802    16.18    2 773.58   11 468    32.20

Table B.13: Average number of messages (|Msgs|), average amount of communication exchanged in MB, and average size of the maximum message exchanged while executing p2 on the Mesh application with a pipeline width of 1, 10, and unlimited (inf).

w               1                           10                          inf
p    Msgs (MB)  |Msgs|  Max. Msg.  Msgs (MB)  |Msgs|  Max. Msg.  Msgs (MB)  |Msgs|  Max. Msg.
2       6.37    3 102     1.90        9.37    3 026     2.83       18.05    3 026     5.62
4      37.37    6 374     1.63       55.56    6 390     2.43      106.31    6 574     4.80
6     189.11   10 216     3.94      282.09   10 296     5.90      541.38   10 640    11.70
8     655.82   12 280     7.33      981.24   12 124    10.98    1 923.77   11 698    21.83

Table B.14: Average number of messages (|Msgs|), average amount of communication exchanged in MB, and average size of the maximum message exchanged while executing p2 on the Mut application with a pipeline width of 1, 10, and unlimited (inf).

w               1                           10                          inf
p    Msgs (MB)  |Msgs|  Max. Msg.  Msgs (MB)  |Msgs|  Max. Msg.  Msgs (MB)  |Msgs|  Max. Msg.
2       1.27      230     0.32        1.89      224     0.48        3.61      232     0.93
4       3.20      376     0.50        4.76      380     0.75        9.09      374     1.46
6       4.55      580     0.56        6.76      532     0.84       12.84      572     1.64
8       5.23      796     0.63        7.76      828     0.94       14.40      784     1.82

Table B.15: Average number of messages (|Msgs|), average amount of communication exchanged in MB, and average size of the maximum message exchanged while executing p2 on the Pyr application with a pipeline width of 1, 10, and unlimited (inf).




References

[AF99] Simon Anthony and Alan M. Frisch. Cautious induction: An alternative to clause-at-a-time induction in inductive logic programming. New Generation Computing, 17(1):25-52, January 1999.

[AM02] Erick Alphonse and Stan Matwin. A dynamic approach to dimensionality reduction in relational learning. In ISMIS '02: Proceedings of the 13th International Symposium on Foundations of Intelligent Systems, pages 255-264, London, UK, 2002. Springer-Verlag.

[AM04] Erick Alphonse and Stan Matwin. Filtering multi-instance problems to reduce dimensionality in relational learning. J. Intell. Inf. Syst., 22(1):23-40, 2004.

[Amd67] Gene Amdahl. Validity of the single processor approach to achieving large-scale computing capabilities. In AFIPS Conference Proceedings, volume 30, pages 483-485, 1967.

[BA96] Ronald J. Brachman and Tej Anand. The process of knowledge discovery in databases. In Advances in Knowledge Discovery and Data Mining, pages 37-57. American Association for Artificial Intelligence, Menlo Park, CA, USA, 1996.

[BDD+02] H. Blockeel, L. Dehaspe, B. Demoen, G. Janssens, J. Ramon, and H. Vandecasteele. Improving the efficiency of Inductive Logic Programming through the use of query packs. Journal of Machine Learning Research, 16:135-166, 2002.

[BDR96] H. Blockeel and L. De Raedt. Relational knowledge discovery in databases. In S. Muggleton, editor, Proceedings of the 6th International Workshop on Inductive Logic Programming, volume 1314 of LNAI, pages 199-211. Springer-Verlag, 1996.




[BDV94] Greg Burns, Raja Daoud, and James Vaigl. LAM: An Open Cluster Environment for MPI. In Proceedings of Supercomputing Symposium, pages 379-386, 1994.

[BGSS03] Marco Botta, Attilio Giordana, Lorenza Saitta, and Michele Sebag. Relational learning as search in a critical region. Journal of Machine Learning Research, 4:431-463, 2003.

[BKML+05] Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and David L. Wheeler. GenBank. Nucleic Acids Research, 33:235-242, 2005.

[BM95] I. Bratko and S. Muggleton. Applications of inductive logic programming. Communications of the ACM, 1995.

[BM96] P. Brockhausen and K. Morik. Direct access of an ILP algorithm to a database management system. In Proceedings of the MLnet Familiarization Workshop on Data Mining with Inductive Logic Programming, pages 95-110, 1996.

[BO04] Joseph Bockhorst and Irene M. Ong. FOIL-D: Efficiently Scaling FOIL for Multi-relational Data Mining of Large Datasets. In Proceedings of the 14th International Conference on Inductive Logic Programming, pages 63-79, 2004.

[BR97] Hendrik Blockeel and Luc De Raedt. Lookahead and discretization in ILP. In ILP '97: Proceedings of the 7th International Workshop on Inductive Logic Programming, pages 77-84, London, UK, 1997. Springer-Verlag.

[BR98] Hendrik Blockeel and Luc De Raedt. Top-down induction of first-order logical decision trees. Artificial Intelligence, 101(1-2):285-297, 1998.

[Bra90] Ivan Bratko. PROLOG Programming for Artificial Intelligence. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1990.

[Bre99] Leo Breiman. Pasting small votes for classification in large databases and on-line. Machine Learning Journal, 36(1-2):85-103, 1999.

[BRJD99] H. Blockeel, L. De Raedt, N. Jacobs, and B. Demoen. Scaling up inductive logic programming by learning from interpretations. Data Mining and Knowledge Discovery, 3(1):59-93, 1999.



[BVM04] Margherita Berardi, Antonio Varlaro, and Donato Malerba. On the effect of caching in recursive theory learning. In Proceedings of the 14th International Conference on Inductive Logic Programming, pages 44-62, 2004.

[Cam00] R. Camacho. Inducing Models of Human Control Skills using Machine Learning Algorithms. PhD thesis, Department of Electrical Engineering and Computation, Universidade do Porto, 2000.

[Cam02] Rui Camacho. Improving the efficiency of ILP systems using an incremental language level search. In Annual Machine Learning Conference of Belgium and the Netherlands, 2002.

[Cam03] Rui Camacho. As lazy as it can be. In P. Doherty, B. Tassen, P. Ala-Siuru, and B. Mayoh, editors, The Eighth Scandinavian Conference on Artificial Intelligence (SCAI'03), pages 47-58. Bergen, Norway, November 2003.

[CCHB89] S. H. Clearwater, T. P. Cheng, H. Hirsh, and B. G. Buchanan. Incremental batch learning. In Proceedings of the Sixth International Workshop on Machine Learning, pages 366-370, San Mateo, CA, 1989. Morgan Kaufmann.

[CDRA89] V. S. Costa, L. Damas, R. Reis, and R. Azevedo. YAP Prolog User's Manual. Universidade do Porto, 1989.

[CHB+02] Nitesh V. Chawla, Lawrence O. Hall, Kevin W. Bowyer, Thomas E. Moore, and W. Philip Kegelmeyer. Distributed pasting of small votes. In MCS '02: Proceedings of the Third International Workshop on Multiple Classifier Systems, pages 52-61, London, UK, 2002. Springer-Verlag.

[CK03] Amanda Clare and Ross D. King. Data mining the yeast genome in a lazy functional language. In Proceedings of the Fifth International Symposium on Practical Aspects of Declarative Languages, pages 19-36, 2003.

[CMH+03] Nitesh V. Chawla, Thomas E. Moore, Lawrence O. Hall, Kevin W. Bowyer, W. Philip Kegelmeyer, and Clayton Springer. Distributed learning with bagging-like performance. Pattern Recogn. Lett., 24(1-3):455-471, 2003.



[CN89] Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine Learning Journal, 3:261-283, 1989.

[Coh93] W. W. Cohen. Rapid prototyping of ILP systems using explicit bias. In F. Bergadano, L. De Raedt, S. Matwin, and S. Muggleton, editors, Proceedings of the IJCAI-93 Workshop on Inductive Logic Programming, pages 24-35. Morgan Kaufmann, 1993.

[Coh94] William W. Cohen. Grammatically biased learning: Learning logic programs using an explicit antecedent description language. Artificial Intelligence, 68:303-366, 1994.

[Coh95] W. W. Cohen. Learning to classify English text with ILP methods. In L. De Raedt, editor, Proceedings of the 5th International Workshop on Inductive Logic Programming, pages 3-24. Department of Computer Science, Katholieke Universiteit Leuven, 1995.

[CSC00] Vítor Santos Costa, Ashwin Srinivasan, and Rui Camacho. A note on two simple transformations for improving the efficiency of an ILP system. In Proceedings of the 10th International Conference on Inductive Logic Programming, pages 225-242, 2000.

[CSC+03] Vítor Santos Costa, Ashwin Srinivasan, Rui Camacho, Hendrik Blockeel, Bart Demoen, Gerda Janssens, Jan Struyf, Henk Vandecasteele, and Wim Van Laer. Query transformations for improving the efficiency of ILP systems. Journal of Machine Learning Research, 4:465-491, 2003.

[Cus96] James Cussens. Part-of-speech disambiguation using ILP. Technical Report PRG-TR-25-96, Oxford University Computing Laboratory, 1996.

[DBJ97] B. Dolsak, I. Bratko, and A. Jezernik. Machine Learning, Data Mining and Knowledge Discovery: Methods and Applications, chapter Application of machine learning in finite element computation. John Wiley and Sons, 1997.

[DDR95] L. Dehaspe and L. De Raedt. Parallel inductive logic programming. In Proceedings of the MLnet Familiarization Workshop on Statistics, Machine Learning and Knowledge Discovery in Databases, 1995.



[DJV99] B. Demoen, G. Janssens, and H. Vandecasteele. Executing query�ocks for ILP. In Proceedings of the 1999 Benelux Workshop on Logic

Programming (BENELOG'99), pages 1�14, 1999.

[DL91] S. Dºeroski and N. Lavra£. Learning relations from noisy examples:An empirical comparison of LINUS and FOIL. In L. Birnbaum andG. Collins, editors, Proceedings of the 8th International Workshop on

Machine Learning, pages 399�402. Morgan Kaufmann, 1991.

[DR87] T. Davies and Stuart Russell. A logical approach to reasoning byanalogy. In Proceedings of the 10th International Joint Conference on

Arti�cial Intelligence, pages 264�270, Los Altos, California, 1987.

[DR98] L. De Raedt. Attribute value learning versus inductive logic program-ming: The missing links (extended abstract). In D. Page, editor,Proceedings of the 8th International Conference on Inductive Logic

Programming, volume 1446 of LNAI, pages 1�8. Springer-Verlag, 1998.

[DRL96] L. De Raedt and N. Lavra£. Multiple predicate learning in twoinductive logic programming settings. Journal on Pure and Applied

Logic, 4(2):227�254, 1996.

[DS04] Frank DiMaio and Jude W. Shavlik. Learning an approximation toinductive logic programming clause evaluation. In Proceedings of the

14th International Conference on Inductive Logic Programming, pages80�97, 2004.

[DT00] Luc Dehaspe and Hannu Toivonen. Relational Data Mining, chapter Discovery of relational association rules, pages 189–208. Springer-Verlag, 2000.

[Dž93] S. Džeroski. Handling imperfect data in inductive logic programming. In Proceedings of the 4th Scandinavian Conference on Artificial Intelligence, pages 111–125. IOS Press, 1993.

[Dž01] Saso Džeroski. Relational data mining applications: An overview. In Saso Džeroski and Nada Lavrač, editors, Relational Data Mining, pages 339–364. Springer-Verlag, September 2001.

[Fü97] J. Fürnkranz. Dimensionality Reduction in ILP: A Call to Arms. In Luc de Raedt and S. Muggleton, editors, Proceedings of the IJCAI-97 Workshop on Frontiers of Inductive Logic Programming, pages 81–86, Nagoya, Japan, August 1997.

[Fay01] Usama Fayyad. Knowledge discovery in databases: An overview. In Saso Džeroski and Nada Lavrač, editors, Relational Data Mining, pages 28–47. Springer-Verlag, September 2001.

[FCCS04] Nuno A. Fonseca, Vitor Santos Costa, Rui Camacho, and Fernando Silva. On avoiding redundancy in Inductive Logic Programming. In Rui Camacho, Ross D. King, and Ashwin Srinivasan, editors, Proceedings of the 14th International Conference on Inductive Logic Programming, volume 3194 of Lecture Notes in Artificial Intelligence, pages 132–146, Porto, Portugal, September 2004. Springer-Verlag.

[FCR+] Nuno A. Fonseca, Vitor S. Costa, Ricardo Rocha, Rui Camacho, and Fernando Silva. Improving the efficiency of ILP systems.

[FCSC03] Nuno A. Fonseca, Vitor S. Costa, Fernando Silva, and Rui Camacho. Experimental evaluation of a caching technique for ILP. In Fernando Moura Pires and Salvador Abreu, editors, EPIA 03 - 11th Portuguese Conference on Artificial Intelligence, volume 2902 of Lecture Notes in Artificial Intelligence, pages 151–155, Beja, Portugal, December 2003. Springer-Verlag.

[FF03] Johannes Fürnkranz and Peter Flach. An analysis of rule evaluation metrics. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington, 2003. Morgan Kaufmann.

[FMPS98] Paul W. Finn, Stephen Muggleton, David Page, and Ashwin Srinivasan. Pharmacophore discovery using the inductive logic programming system PROGOL. Machine Learning Journal, 30(2-3):241–270, 1998.

[For94] Message Passing Interface Forum. MPI: A message-passing interface standard. Technical Report UT-CS-94-230, University of Tennessee, Knoxville, TN, USA, 1994.

[FPSS96] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, pages 1–34. AAAI/MIT Press, 1996.


[FRCS03] Nuno A. Fonseca, Ricardo Rocha, Rui Camacho, and Fernando Silva. Efficient Data Structures for Inductive Logic Programming. In T. Horváth and A. Yamamoto, editors, Proceedings of the 13th International Conference on Inductive Logic Programming, volume 2835 of Lecture Notes in Artificial Intelligence, pages 130–145, Szeged, Hungary, September 2003. Springer-Verlag.

[Fre62] E. Fredkin. Trie Memory. Communications of the ACM, 3:490–499, 1962.

[FSC05] Nuno A. Fonseca, Fernando Silva, and Rui Camacho. Strategies to Parallelize ILP Systems. In Stefan Kramer and Bernhard Pfahringer, editors, Proceedings of the 15th International Conference on Inductive Logic Programming, volume 3625 of LNAI, pages 136–153, Bonn, Germany, August 2005. Springer-Verlag.

[FSC06] Nuno A. Fonseca, Fernando Silva, and Rui Camacho. April - an inductive logic programming system. In Proceedings of the 10th European Conference on Logics in Artificial Intelligence (JELIA06), Lecture Notes in Artificial Intelligence, Liverpool, September 2006. Springer-Verlag. (to appear).

[FSCC03] Nuno A. Fonseca, Fernando Silva, Rui Camacho, and Vitor S. Costa. Induction with April - A preliminary report. Technical Report DCC-2003-02, DCC-FC & LIACC, Universidade do Porto, 2003.

[FSCC05] Nuno A. Fonseca, Fernando Silva, Vitor Santos Costa, and Rui Camacho. A pipelined data-parallel algorithm for ILP. In Proceedings of 2005 IEEE International Conference on Cluster Computing, Boston, Massachusetts, USA, September 2005. IEEE.

[Für98] Johannes Fürnkranz. Integrative windowing. Journal of Artificial Intelligence Research, 8:129–164, 1998.

[Für99] Johannes Fürnkranz. Separate-and-conquer rule learning. Artificial Intelligence Review, 13(1):3–54, February 1999.

[GGKK03] Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to Parallel Computing. Addison-Wesley, 2nd edition, 2003.


[GLDS96] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789–828, September 1996.

[GPK03] James Graham, C. David Page, and Ahmed Kamal. Accelerating the drug design process through parallel inductive logic programming data mining. In Proceedings of the Computational Systems Bioinformatics (CSB'03). IEEE, 2003.

[GSSB00] Attilio Giordana, Lorenza Saitta, Michele Sebag, and Marco Botta. Analyzing relational learning in the phase transition framework. In Proceedings of the 17th International Conference on Machine Learning, pages 311–318, San Francisco, CA, USA, 2000. Morgan Kaufmann.

[HJ86] W. Daniel Hillis and Guy L. Steele Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170–1183, 1986.

[HJZ+00] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000.

[Hog90] C. J. Hogger. Essentials of Logic Programming. Oxford University Press, 1990.

[HSM01] David J. Hand, Padhraic Smyth, and Heikki Mannila. Principles of Data Mining. The MIT Press, Cambridge, MA, USA, 2001.

[ILP02] ILP Applications. http://www-ai.ijs.si/~ilpnet2/apps/index.html, 2002.

[JB95] A. Jorge and P. Brazdil. Architecture for iterative learning of recursive definitions. In L. De Raedt, editor, Proceedings of the 5th International Workshop on Inductive Logic Programming, pages 95–108. Department of Computer Science, Katholieke Universiteit Leuven, 1995.

[JKP94] George H. John, Ron Kohavi, and Karl Pfleger. Irrelevant features and the subset selection problem. In Proceedings of the 11th International Conference on Machine Learning, pages 121–129, 1994.

[KB97] Aram Karalic and Ivan Bratko. First order regression. Machine Learning Journal, 26(2-3):147–176, 1997.


[KCN90] Chung-Ta King, Wen-Hwa Chou, and Lionel M. Ni. Pipelined data parallel algorithms: Part I - Concept and modeling. IEEE Transactions on Parallel and Distributed Systems, 1(4), 1990.

[Kin04] Ross D. King. Applying inductive logic programming to predicting gene function. AI Magazine, 25(1):57–68, 2004.

[KMS92] R.D. King, S. Muggleton, and M.J.E. Sternberg. Drug design by machine learning: The use of inductive logic programming to model the structure-activity relationships of trimethoprim analogues binding to dihydrofolate reductase. In Proceedings of the National Academy of Sciences, volume 89, pages 11322–11326, 1992.

[Koh95] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 1137–1145, 1995.

[Kon03] Stasinos K. Konstantopoulos. A Data-Parallel version of Aleph. In Proceedings of the Workshop on Parallel and Distributed Computing for Machine Learning, co-located with ECML/PKDD'2003, Dubrovnik, Croatia, September 2003.

[Kra95] S. Kramer. Predicate Invention: A Comprehensive View. Technical Report OFAI-TR-95-32, Austrian Research Institute for Artificial Intelligence, Vienna, 1995.

[KSC01] Boonserm Kijsirikul, Sukree Sinthupinyo, and Kongsak Chongkasemwongse. Approximate match of rules using backpropagation neural networks. Machine Learning Journal, 44(3):273–299, 2001.

[Lan94] P. Langley. Selection of relevant features in machine learning. In AAAI Fall Symposium on Relevance, pages 140–144, 1994.

[LFZ99] N. Lavrač, P. Flach, and B. Zupan. Rule evaluation measures: A unifying view. In S. Džeroski and P. Flach, editors, Proceedings of the 9th International Workshop on Inductive Logic Programming, volume 1634 of LNAI, pages 174–185. Springer-Verlag, June 1999.

[Llo87] J. W. Lloyd. Foundations of Logic Programming. Springer-Verlag, 1987.

[LvF02] Nada Lavrač, Filip Železný, and Peter A. Flach. RSD: Relational subgroup discovery through first-order feature construction. In Proceedings of the 12th International Conference on Inductive Logic Programming, LNAI, pages 149–165. Springer-Verlag, 2002.

[MB88] S. Muggleton and W. Buntine. Machine invention of first order predicates by inverting resolution. In Proceedings of the 5th International Workshop on Machine Learning, pages 339–351. Morgan Kaufmann, 1988.

[MDR94] S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods. Journal of Logic Programming, 19/20:629–679, 1994.

[MF90] S. Muggleton and C. Feng. Efficient induction of logic programs. In Proceedings of the 1st Conference on Algorithmic Learning Theory, pages 368–381. Ohmsma, Tokyo, Japan, 1990.

[MF92] S. Muggleton and C. Feng. Efficient induction in logic programs. In S. Muggleton, editor, Proceedings of the 2nd International Workshop on Inductive Logic Programming, pages 281–298. Academic Press, 1992.

[MF01] Stephen Muggleton and John Firth. Relational rule induction with cprogol4.4: A tutorial introduction. In Saso Džeroski and Nada Lavrač, editors, Relational Data Mining, pages 160–188. Springer-Verlag, September 2001.

[Mic68] D. Michie. Memo Functions and Machine Learning. Nature, 218:19–22, 1968.

[Mic69] R. S. Michalski. On the quasi-minimal solution of the general covering problem. In Proceedings of the V International Symposium on Information Processing (FCIP 69), volume A3, pages 125–128, October 1969.

[Mic80] R.S. Michalski. Pattern recognition as rule-guided inductive inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 349–361, 1980.

[MISI98] T. Matsui, N. Inuzuka, H. Seki, and H. Itoh. Comparison of three parallel implementations of an induction algorithm. In 8th Int. Parallel Computing Workshop, pages 181–188, Singapore, 1998.

[Mit80] Tom M. Mitchell. The need for biases in learning generalizations. Technical Report CBM-TR-117, Rutgers University, New Brunswick, New Jersey, 1980.


[Mit82] T. Mitchell. Generalization as search. Artificial Intelligence, 18:203–226, 1982.

[Mit97] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.

[MKS92] S. Muggleton, R.D. King, and M.J.E. Sternberg. Predicting protein secondary structure using inductive logic programming. Protein Engineering, 5:647–657, 1992.

[Mor97] Katharina Morik. Knowledge discovery in databases - an inductive logic programming approach. In Foundations of Computer Science: Potential - Theory - Cognition, to Wilfried Brauer on the occasion of his sixtieth birthday, pages 429–436, London, UK, 1997. Springer-Verlag.

[MP92] J. Marcinkowski and L. Pacholski. Undecidability of the Horn-clause implication problem. In Proceedings of the 33rd IEEE Annual Symposium on Foundations of Computer Science, pages 354–362. IEEE, 1992.

[MR95] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, New York, NY, USA, 1995.

[MS95] E. McCreath and A. Sharma. Extraction of meta-knowledge to restrict the hypothesis space for ILP systems. In X. Yao, editor, Proceedings of the Eighth Australian Joint Conference on Artificial Intelligence, pages 75–82. World Scientific, November 1995.

[MS04] J. Maloberti and M. Sebag. Fast Theta-Subsumption with Constraint Satisfaction Algorithms. Machine Learning Journal, 55(2):137–174, 2004.

[Mug90] S. Muggleton. Inductive logic programming. In Proceedings of the 1st Conference on Algorithmic Learning Theory, pages 43–62. Ohmsma, Tokyo, Japan, 1990.

[Mug91] S. Muggleton. Inductive logic programming. New Generation Computing, 8(4):295–317, 1991.

[Mug93] S. Muggleton. Optimal layered learning: A PAC approach to incremental sampling. In K. Jantke, S. Kobayashi, E. Tomita, and T. Yokomori, editors, Proceedings of the 4th Conference on Algorithmic Learning Theory, pages 37–44. Springer-Verlag, 1993.


[Mug95] S. Muggleton. Inverse entailment and Progol. New Generation Computing, Special issue on Inductive Logic Programming, 13(3-4):245–286, 1995.

[Mug96] S. Muggleton. Learning from positive data. In S. Muggleton, editor, Proceedings of the 6th International Workshop on Inductive Logic Programming, volume 1314 of LNAI, pages 358–376. Springer-Verlag, 1996.

[NCdW97] S.-H. Nienhuys-Cheng and R. de Wolf. Foundations of Inductive Logic Programming, volume 1228 of LNAI. Springer-Verlag, 1997.

[NRA+96] C. Nédellec, C. Rouveirol, H. Adé, F. Bergadano, and B. Tausend. Declarative bias in ILP. In L. De Raedt, editor, Advances in Inductive Logic Programming, pages 82–103. IOS Press, 1996.

[NS97] Anil Nerode and Richard A. Shore. Logic for Applications. Springer-Verlag, second edition, 1997.

[OdCDPC05] Irene M. Ong, Inês de Castro Dutra, David Page, and Vítor Santos Costa. Mode directed path finding. In Proceedings of the 16th European Conference on Machine Learning, volume 3720 of LNCS, pages 673–681, 2005.

[OM99] Hayato Ohwada and Fumio Mizoguchi. Parallel execution for speeding up inductive logic programming systems. In Proceedings of the 9th International Workshop on Inductive Logic Programming, number 1721 in LNAI, pages 277–286. Springer-Verlag, 1999.

[ONM00] Hayato Ohwada, Hiroyuki Nishiyama, and Fumio Mizoguchi. Concurrent execution of optimal hypothesis search for inverse entailment. In J. Cussens and A. Frisch, editors, Proceedings of the 10th International Conference on Inductive Logic Programming, volume 1866 of LNAI, pages 165–173. Springer-Verlag, 2000.

[Pag00] David Page. ILP: Just do it. In J. Cussens and A. Frisch, editors, Proceedings of the 10th International Conference on Inductive Logic Programming, volume 1866 of LNAI, pages 3–18. Springer-Verlag, 2000.

[PC03] David Page and Mark Craven. Biological applications of multi-relational data mining. SIGKDD Explorations Newsletter, 5(1):69–79, 2003.


[PF05] Ronaldo Cristiano Prati and Peter Flach. ROCCER: an algorithm for rule learning based on ROC analysis. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, 2005.

[Plo70] G.D. Plotkin. A note on inductive generalization. In Machine Intelligence, volume 5, pages 153–163. Edinburgh University Press, 1970.

[PS03] David Page and Ashwin Srinivasan. ILP: a short look back and a longer look forward. Journal of Machine Learning Research, 4:415–430, 2003.

[pvm] PVM: Parallel Virtual Machine. http://www.csm.ornl.gov/pvm/.

[QCJ93] J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. In P. Brazdil, editor, Proceedings of the 6th European Conference on Machine Learning, volume 667, pages 3–20. Springer-Verlag, 1993.

[Qui90] J. R. Quinlan. Learning logical definitions from relations. Machine Learning Journal, 5(3):239–266, 1990.

[Qui93] J. Ross Quinlan. C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., 1993.

[Qui96] J. Ross Quinlan. Learning first-order definitions of functions. Journal of Artificial Intelligence Research, 5:139–161, 1996.

[Rae97] Luc De Raedt. Logical settings for concept-learning. Artificial Intelligence, 95(1):187–201, 1997.

[RB97] Luc De Raedt and Hendrik Blockeel. Using logical decision trees for clustering. In Proceedings of the 7th International Workshop on Inductive Logic Programming, pages 133–140, London, UK, 1997. Springer-Verlag.

[RD94] Luc De Raedt and Saso Džeroski. First-order jk-clausal theories are PAC-learnable. Artificial Intelligence, 70(1-2):375–392, 1994.

[RD97] L. De Raedt and L. Dehaspe. Clausal discovery. Machine Learning, 26:99–146, 1997.

[RFC05] Ricardo Rocha, Nuno A. Fonseca, and Vitor Santos Costa. On Applying Tabling to Inductive Logic Programming. In Proceedings of the 16th European Conference on Machine Learning, ECML-05, volume 3720 of Lecture Notes in Artificial Intelligence, pages 707–714, Porto, Portugal, October 2005. Springer-Verlag.

[RL95] Luc De Raedt and Wim Van Laer. Inductive constraint logic. In ALT '95: Proceedings of the 6th International Conference on Algorithmic Learning Theory, pages 80–94, London, UK, 1995. Springer-Verlag.

[RM95] Bradley L. Richards and Raymond J. Mooney. Automated refinement of first-order Horn-clause domain theories. Machine Learning Journal, 19(2):95–131, 1995.

[RN03] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 2nd edition, 2003.

[Rou92] C. Rouveirol. Extensions of inversion of resolution applied to theory completion. In S. Muggleton, editor, Inductive Logic Programming, pages 63–92. Academic Press, 1992.

[Rou94] C. Rouveirol. Flattening and saturation: Two representation changes for generalization. Machine Learning Journal, 14(2):219–232, 1994.

[RP89] C. Rouveirol and J-F. Puget. A simple solution for inverting resolution. In K. Morik, editor, Proceedings of the 4th European Working Session on Learning, pages 201–210. Pitman, 1989.

[RSC00] R. Rocha, F. Silva, and V. Costa. YapTab: A tabling engine designed to support parallelism. In Proceedings of the 2nd Conference on Tabulation in Parsing and Deduction, TAPD'2000, pages 77–87, Vigo, Spain, September 2000.

[RU95] Raghu Ramakrishnan and Jeffrey D. Ullman. A survey of research on deductive database systems. Journal of Logic Programming, 23(2):125–149, 1995.

[SB03] Jan Struyf and Hendrik Blockeel. Query optimization in inductive logic programming by reordering literals. In Proceedings of the 13th International Conference on Inductive Logic Programming, pages 329–346, 2003.

[SFR05] T. Soares, M. Ferreira, and R. Rocha. The MYDDAS programmer's manual. Technical Report DCC-2005-10, Department of Computer Science, University of Porto, 2005.


[Sha83] E.Y. Shapiro. Algorithmic Program Debugging. The MIT Press, 1983.

[She00] Colin Shearer. The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing, Fall 2000.

[SK05] Ashwin Srinivasan and Ravi Kothari. A Study of Applying Dimensionality Reduction to Restrict the Size of a Hypothesis Space. In Stefan Kramer and Bernhard Pfahringer, editors, Proceedings of the 15th International Conference on Inductive Logic Programming (ILP 2005), volume 3625 of LNAI, pages 348–365, Bonn, Germany, August 2005. Springer-Verlag.

[SKB03] A. Srinivasan, R.D. King, and M.E. Bain. An empirical study of the use of relevance information in inductive logic programming. Journal of Machine Learning Research, 2003.

[SKMS97] Ashwin Srinivasan, Ross D. King, S. Muggleton, and M. J. E. Sternberg. Carcinogenesis predictions using ILP. In S. Džeroski and N. Lavrač, editors, Proceedings of the 7th International Workshop on Inductive Logic Programming, volume 1297, pages 273–287. Springer-Verlag, 1997.

[SL96] Wei-Min Shen and Bing Leng. Metapattern generation for integrated data mining. In Knowledge Discovery and Data Mining, pages 152–157, 1996.

[SL03] Jeffrey M. Squyres and Andrew Lumsdaine. A Component Architecture for LAM/MPI. In Proceedings, 10th European PVM/MPI Users' Group Meeting, number 2840 in LNCS, Venice, Italy, September 2003. Springer-Verlag.

[Smi80] R.G. Smith. The Contract Net Protocol: High-Level Communication and Control in a Distributed Problem Solver. IEEE Transactions on Computers, 29(12):1104–1113, Dec 1980.

[SMKS94] A. Srinivasan, S. Muggleton, R.D. King, and M.J.E. Sternberg. Mutagenesis: ILP experiments in a non-determinate biological domain. In S. Wrobel, editor, Proceedings of the 4th International Workshop on Inductive Logic Programming, volume 237 of GMD-Studien, pages 217–232, 1994.


[SPR04] Mathieu Serrurier, Henri Prade, and Gilles Richard. A simulated annealing framework for ILP. In Proceedings of the 14th International Conference on Inductive Logic Programming, pages 288–304, 2004.

[SR97] M. Sebag and C. Rouveirol. Tractable induction and classification in first order logic via stochastic matching. In Proceedings of the 15th International Joint Conference on Artificial Intelligence, pages 888–893. Morgan Kaufmann, 1997.

[Sri99] A. Srinivasan. A study of two sampling methods for analysing large datasets with ILP. Data Mining and Knowledge Discovery, 3(1):95–123, 1999.

[Sri00] A. Srinivasan. A study of two probabilistic methods for searching large spaces with ILP. Technical Report PRG-TR-16-00, Oxford University Computing Laboratory, 2000.

[Sri03] Ashwin Srinivasan. The Aleph Manual, 2003. Available from http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph.

[Str04] Jan Struyf. Techniques for Improving the Efficiency of Inductive Logic Programming in the Context of Data Mining. PhD thesis, Katholieke Universiteit Leuven, Department of Computer Science, 2004.

[Tau94] B. Tausend. Biases and their effects in inductive logic programming. In F. Bergadano and L. De Raedt, editors, Proceedings of the 7th European Conference on Machine Learning, volume 784 of LNAI, pages 431–434. Springer-Verlag, 1994.

[TMM03] Lappoon R. Tang, Raymond J. Mooney, and Prem Melville. Scaling up ILP to large examples: Results on link discovery for counter-terrorism. In Proceedings of the KDD-2003 Workshop on Multi-Relational Data Mining (MRDM-2003), pages 107–121, 2003.

[TMS98] M. Turcotte, S. H. Muggleton, and M. J. E. Sternberg. Application of inductive logic programming to discover rules governing the three-dimensional topology of protein structure. In D. Page, editor, Proceedings of the 8th International Conference on Inductive Logic Programming, volume 1446, pages 53–64. Springer-Verlag, 1998.

[TNM00] Alireza Tamaddoni-Nezhad and Stephen Muggleton. Searching the subsumption lattice by a genetic algorithm. In J. Cussens and A. Frisch, editors, Proceedings of the 10th International Conference on Inductive Logic Programming, volume 1866 of LNAI, pages 243–252. Springer-Verlag, 2000.

[TNM03] A. Tamaddoni-Nezhad and S. Muggleton. A genetic algorithm approach to ILP. In S. Matwin and C. Sammut, editors, Proceedings of the 12th International Conference on Inductive Logic Programming, volume 2583 of LNAI, pages 285–300. Springer-Verlag, 2003.

[TUA+98] Dick Tsur, Jeffrey D. Ullman, Serge Abiteboul, Chris Clifton, Rajeev Motwani, Svetlozar Nestorov, and Arnon Rosenthal. Query flocks: a generalization of association-rule mining. In SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 1–12, New York, NY, USA, 1998. ACM Press.

[vdL95] P.R.J. van der Laag. An analysis of refinement operators in inductive logic programming. PhD thesis, Erasmus Universiteit, Rotterdam, the Netherlands, 1995.

[vSP02] F. Železný, A. Srinivasan, and D. Page. Lattice-search runtime distributions may be heavy-tailed. In S. Matwin and C. Sammut, editors, Proceedings of the 12th International Conference on Inductive Logic Programming, volume 2583 of LNAI, pages 333–345. Springer-Verlag, 2002.

[vSP04] Filip Železný, Ashwin Srinivasan, and David Page. A Monte Carlo study of randomized restarted search in ILP. In Proceedings of the 14th International Conference on Inductive Logic Programming, pages 341–358. Springer-Verlag, 2004.

[WA02] Michael Widenius and David Axmark. MySQL Reference Manual: Documentation from the Source. O'Reilly Community Press, 2002.

[WD95] Stefan Wrobel and Saso Džeroski. The ILP description learning problem: Towards a general model-level definition of data mining in ILP. In K. Morik and J. Herrmann, editors, Proc. Fachgruppentreffen Maschinelles Lernen (FGML-95), 44221 Dortmund, 1995. Univ. Dortmund.

[Web97] Irene Weber. Discovery of first-order regularities in a relational database using offline candidate determination. In Proceedings of the 7th International Workshop on Inductive Logic Programming, pages 288–295, London, UK, 1997. Springer-Verlag.

[Wie03] Jan Wielemaker. Native preemptive threads in SWI-Prolog. In Catuscia Palamidessi, editor, Proceedings of the 19th International Conference on Logic Programming, volume 2916 of LNAI, pages 331–345. Springer-Verlag, 2003.

[Wro97] Stefan Wrobel. An algorithm for multi-relational discovery of subgroups. In Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, pages 78–87, London, UK, 1997. Springer-Verlag.

[Wro01] Stefan Wrobel. Inductive logic programming for knowledge discovery in databases. In Saso Džeroski and Nada Lavrač, editors, Relational Data Mining, pages 74–101. Springer-Verlag, September 2001.

[WS00] Y. Wang and D. Skillicorn. Parallel inductive logic for data mining. In Workshop on Distributed and Parallel Knowledge Discovery, KDD2000, Boston, 2000. ACM Press.


Index

A∗ search, 53
¬, 182
∨, 183
∧, 183
θ-equivalent, 48

accuracy, 58
Amdahl's law, 118
answer substitution, 187
attribute-value representation, 30

background knowledge, 30, 31
batch-learning ILP, 84
beam-search clause selection, 102
best clause selection, 102
best-first clause selection, 102
bias, 29
  admissible vocabulary, 70
  antecedent description grammar, 71
  declarative, 56
  determinacy, 70
  language bias, 56, 69
  linked clauses, 71
  modes, 70
  preference, 56
  restriction, 56
  search, 56
  shift, 56
  strong, 57
  validation, 56
  weak, 57
bottom clause, 60
branch-and-bound, 54
branching factor, 50

class value, 29
classification problem, 29
clause, 182
  acceptable, 94
  body, 183
  datalog, 186
  definite, 183
  definite unit, 183
  function free, 183
  goal, 186
  ground, 183
  Horn, 183
  indefinite, 183
  negative, 183
  program, 186
  range-restricted, 183
  recursive, 183
  reduced, 48
  redundant, 183
closed world assumption, 187
clustering, 29
coarse-grained, 117
complete, 42, 185
completeness, 50
concept learning, 39
confidence, 58
confusion matrix, 57
conjunction, 183
consistent, 42
constant, 182
correct, 42
  partial correct, 65
coverage, 58
coverage lists, 76, 100
covering algorithm, 46
covers
  intensionally, 42
CRISP methodology, 28
cross validation, 108
CWA, 43, 187

data mining, 28
data stream, 149
data summarization, 29
decomposition, 116
deduction, 37, 185
deductive database, 30
definite logic program, 183
degree of parallelism, 117
depth-limit search, 52
derivation, 185
descriptive ILP, 39, 41
descriptive learning, 39
descriptive modeling, 29
determination, 86
determination declarations, 70
discretization, 69

empirical ILP, 84
epoch, 46
extensional predicate, 40

F-measure, 58, 59
facts, 186
false negative, 57
false positive, 57
feature, 72
feature selection
  filter method, 72
  wrapper method, 72
feature subset selection, 72
fine-grained, 117
Flattening, 95
FOL, 31, 182
foothill, 53
function, 182
function symbol, 182

generalization, 38, 48
generate-and-test, 31
glb, 48
greedy clause selection, 102
ground term, 182

head, 183
Herbrand Base, 184
Herbrand interpretation, 184
Herbrand Universe, 184
heuristic, 52
hypothesis space, 45

IBL, 172
ILP, 27, 38
incremental batch learning, 172
incremental ILP, 84
induction, 37
inductive generalization, 38
inductive logic, 37
inference rules, 185
instantiation, 182, 187
intensional predicate, 40
interactive, 84
interpretation, 184

KDD, 28
knowledge discovery in databases, 28

language bias, 29, 40
Las Vegas algorithms, 61
lattice, 48
layered learning, 78, 173
lazy evaluation
  negatives, 99
  positives, 99
  total laziness, 100
learn, 29
learning from entailment, 41, 77
learning from interpretations, 41, 44, 77
literal
  depth, 183
  negative, 182
  positive, 182
  redundant, 183
local finiteness, 50
local maxima, 53
local minima, 53
logic program, 186
Logic Programming, 38
LP, 38
lub, 49

master, 124
materialization, 76
minimal model, 186
mining, 30
mode declarations, 86
model, 28
Monte Carlo algorithms, 61
most general instance, 49
most general unifier, 184, 187
most specific clause, 60
MPI, 133
MRDM, 30
Multi-Relational Data Mining, 30
multiple predicate learning, 84

negation symbol, 182
noise, 43
non-interactive, 84
non-monotonic ILP, 42
normal semantics, 41

parallel runtime, 117
parallel speedup, 117
parallel task, 116
parent clauses, 185
pattern, 28
pipeline width, 150
pipelined data-parallelism, 148
pipelining, 148
  data, 149
  functional, 149
precision, 58
predicate, 186
  definition, 183
  symbol, 182
predictive ILP, 39, 41
predictive learning, 39
predictive modeling, 29
premises, 37
program
  Datalog, 186
  definite
    range-restricted, 183
  normal, 186
  Prolog, 186
properness, 50
propositional algorithms, 30
propositional representation, 30
propositionalization, 30
pruning, 54

query, 186
  flocks, 74
  packs, 74

Randomized algorithms, 33, 61
randomized rapid restarts, 64, 128
RDBMS, 78
recall, 58
refinement, 48
  graph, 50
  operator, 49
    downward, 49
    ideal, 50
    upward, 50
regression problem, 29
resolution, 185
resolvent, 185
RLGG, 60
RRR, 64
rule accuracy, 58
rule set accuracy, 58
rules, 186

scalable, 118
search
  best-first, 52
  bottom-up, 51
  breadth-first, 52
  depth-first, 52
  greedy best-first, 52
  iterative depth-first, 52
  local, 53
    hill-climbing search, 53
    local beam search, 53
  top-down, 51
search bias, 29
search space, 56
seed, 60, 92
sensitivity, 58
separate-and-conquer, 46
serial runtime, 117
setting
  explanatory ILP, 41
  normal setting, 41
  strong ILP, 41
single predicate learning, 84
skolem constant, 182
skolem function, 182
skolemization, 182
SLD-resolution, 187
sound, 185
specialization, 48
specificity, 58, 59
stochastic clause selection, 62, 125
stream parallelism, 149
subsampling, 78, 173
substitution, 184
super-linear speedup, 118
synchronization, 116
syntactic consequences, 185

tabling, 76
task, 116
  dependency, 116
  parallel, 116
technique
  correct, 65
  not correct, 66
term, 182
theorems, 185
theory, 40
  clausal, 183
  Horn, 183
  reduced, 184
  revision, 84
training examples, 31
trie, 103

unification, 184
unifier, 184

valid, 41
variable, 182
  depth, 184

windowing, 78
workers, 124