

2º Simpósio Brasileiro de Automação Inteligente

CEFET-PR, September 13-15, 1995, Curitiba, Paraná

Learning and Generalization in Pyramidal Architectures

M. C. P. de Souto, K. S. Guimarães, and T. B. Ludermir

Universidade Federal de Pernambuco, Departamento de Informática

Caixa Postal 7851 - CEP 50.732-970 - Recife - PE - Brasil

{mcps,katia,tbl}@di.ufpe.br

Abstract

This paper describes a technique, called the probably approximately correct learning (PAC-learning) model, to cope with the intractability of the loading problem for pyramidal architectures. The PAC-learning model is a counterpart of the random algorithms used in the theory of NP-completeness. In this context, the paper provides an upper bound for the sample complexity of learning algorithms. Also, the computational complexity of the loading process is analyzed. Based on these studies, it is shown here that generalization in pyramidal architectures is as hard as learning.

1 Introduction

One of the most important features of neural networks is their ability to generalize to new situations. Once trained, a network will compute an input/output mapping which, if the training data was representative enough, will closely match the unknown function which produced the original data.

This paper deals with basic theoretical questions regarding learning and generalization in pyramidal networks. It is inspired by Judd [8], who shows the loading problem to be NP-complete. The loading problem can be described as follows:

"Given a neural networkand a set of training examples, does there exists a configu­ration of functiolls for the network so that the network produces the correct output for alI training examples?"

In this research, the term "neural network" always means binary input/output feedforward networks. Specifically, pyramidal architectures have been chosen to be analyzed because of their relevance in the research of weightless neural models [13], [2], [7]. The learning paradigm studied in this work is supervised learning.

Judd shows the loading problem to be NP-complete when general architectures and tasks (training sets) are considered. Judd's results regarding the computational complexity of the loading problem have been extended to pyramidal architectures in [6].

In order to develop this research, it is necessary to study the previous issues in the context of the PAC-learning model. The PAC-learning framework allows the phenomenon of generalization to be analyzed from a formal point of view.

The remainder of this paper is divided into four sections. Section 2 presents weightless neural models and their main characteristics. Section 3 describes the basic PAC-learning model. The main section of this paper is Section 4, where an upper bound on the sample complexity of PAC-learning algorithms is provided. This is done by defining an upper bound for the VC-dimension of pyramidal architectures. Also, this section presents results regarding the computational complexity of loading pyramidal architectures. Finally, the last section provides some brief remarks.


2 Weightless Neural Models

The architecture analyzed in this work has often been used with a kind of neuron called the weightless neuron, Boolean neuron, or logical neuron model [3]. The simplest weightless neuron (the RAM neuron) is based on look-up table operations, which are best implemented by random access memories (RAM). In such a model, knowledge is directly "stored" in the memory (the look-up tables) of the nodes during learning.

Definition 2.1 A RAM neuron (node) is a device capable of computing any Boolean function with a given number of inputs. A RAM neuron can be described as consisting of:

• input terminals, i_1, i_2, ..., i_k, which represent the input of the neuron;

• a set of cells which stores the contents, or learned information (local memory), each of which may be 0 or 1;

• an output terminal, r, which returns an addressed content given by the input terminals;

• a teach terminal, d, which provides the desired response;

• an operation mode terminal, wet (write-enable terminal), which indicates whether the neuron is in the learning phase (if wet is 1 then the binary value in d is written at the addressed memory location) or in the recalling phase.

Definition 2.2 A RAM network is an arrangement of a finite number of RAM nodes.

There are 221< different fUllctions which can be performed on k address lines and these cor­

respond exactly to 2k states that each RAM no de can be in. This way, a single RAM neuron can comput.e any Boolean function of its inputs and hence cannot generalize. However, RAM networks do generalize [3].

Seeing a RAM node as a truth table (look-up table), its output is described by r = C[p], where C[p] is the content of the address position associated with the input pattern p = i_1 i_2 ... i_k.
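To make the look-up-table behaviour concrete, the following is a minimal sketch of a RAM node following Definition 2.1 and the relation r = C[p]; the class and method names are illustrative only and do not come from the paper.

```python
# A minimal sketch of a RAM neuron as a look-up table (Definition 2.1).
# Names (RAMNeuron, step) are illustrative, not taken from the paper.

class RAMNeuron:
    def __init__(self, k):
        self.k = k                      # number of input terminals i_1 ... i_k
        self.memory = [0] * (2 ** k)    # one bit per addressable cell C[p]

    def _address(self, pattern):
        # The input pattern p = i_1 i_2 ... i_k addresses one memory cell.
        assert len(pattern) == self.k
        return int("".join(str(b) for b in pattern), 2)

    def step(self, pattern, d=0, wet=0):
        # wet = 1: learning phase, the desired response d is written at C[p].
        # wet = 0: recalling phase, the stored content r = C[p] is returned.
        p = self._address(pattern)
        if wet == 1:
            self.memory[p] = d
        return self.memory[p]

# Example: teach the node to output 1 for the pattern (1, 0), then recall it.
node = RAMNeuron(k=2)
node.step((1, 0), d=1, wet=1)
assert node.step((1, 0)) == 1 and node.step((0, 1)) == 0
```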

There are many variations and extensions of the RAM node, called RAM-based nodes. Some advantages of neural networks comprised of weightless nodes are [9]: (1) they share some of the main characteristics of Hopfield networks (seeking of energy minima at runtime) and error back-propagation (learning from errors); and (2) they are straightforward to implement in hardware using digital logic techniques.

In the context of weightless neural models, the most commonly used architecture is the pyramid [13], [2], [7], which tends to have high depth. A pyramidal architecture π is a q-tree of nodes (e.g., a binary tree, oct-tree, etc.), where all nodes have the same fan-in q. The root of the tree is the output node, and the leaves are the input nodes.
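A minimal sketch of how such a pyramid can be assembled from RAM nodes follows (it assumes the RAMNeuron sketch above, and that the number of inputs s is a power of the fan-in q; all names are illustrative):

```python
# A minimal sketch of a pyramidal (q-tree) arrangement of RAM nodes.
# Assumes the RAMNeuron class from the previous sketch.

class Pyramid:
    def __init__(self, s, q=2):
        assert s >= q and s % q == 0
        self.q = q
        # Build layers bottom-up: each layer has fan-in q until a single root remains.
        self.layers = []
        width = s
        while width > 1:
            assert width % q == 0, "s must be a power of q"
            width //= q
            self.layers.append([RAMNeuron(q) for _ in range(width)])

    def compute(self, inputs):
        values = list(inputs)            # the leaves are the input terminals
        for layer in self.layers:
            values = [node.step(tuple(values[i * self.q:(i + 1) * self.q]))
                      for i, node in enumerate(layer)]
        return values[0]                 # the root node is the output

# Example: a binary pyramid (q = 2) with 4 inputs has 3 RAM nodes.
pyr = Pyramid(s=4, q=2)
print(pyr.compute((1, 0, 1, 1)))         # returns 0 until the nodes are taught
```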

3 The basic PAC model of learning

The basic "probably approximately correct" (PAC) model of learning was int.roduced by Valiant [15]. The PAC-Iearning is applicable to neural networks with one output node which outputs either the value O or 1. Thus, such model applies to neural network classification problems. One suitable class of architectures to analyze in the context of PAC-Iearning is hence t.he class of pyramidal networks.

Instead of arbitrarily choosing a training set (training sample) in advance, the PAC-learning model selects such a set by randomly choosing items according to an unknown probability distribution. Hence the training sample is biased only by that distribution. The test set is selected in the same way. Since both sets are chosen according to the same distribution, the test set is


an unbiased sample of the training set, and vice-versa. Moreover, the size of the training sample is a decision of the learning algorithm rather than a determination of the experimenter.

In order to investigate learning concepts (i.e., sets) in neural networks, it is necessary to consider a framework that consists of [12]: (1) a domain X (it is assumed here that X = {0,1}^s, where s is the number of input terminals for the network); (2) a class K ⊆ 2^X of possible target concepts (the "concept class"); and (3) a class T ⊆ 2^X (the "hypothesis class").

The task of the learning algorithm is to compute, from given positive and/or negative examples for some unknown target concept K_T ∈ K, the representation of some hypothesis H ∈ T that approximates K_T. A pair (u, 1) with u ∈ K_T is called a positive example for K_T, and a pair (u, 0) with u ∈ X - K_T is called a negative example. Frequently, T is defined as the class of all sets that are computable by a network A for any configuration of node functions F from a fixed node function set Φ. The configuration assignment F is a representation for the H ∈ T that is computed by A.

In the PAC-learning model it is often assumed that the network architecture defines the hypothesis class T over the domain X. Additionally, a class K ⊆ 2^X of possible target concepts (the concept class) is fixed. The network is given a parameter ε > 0 (error parameter) and a parameter γ > 0 (confidence parameter). In order to train the network, the learning algorithm needs to determine a bound m(ε, γ) for the number of items in the training set and to solve the following problem [11]:

For any probability distribution D over X, any target concept K_T from the class K ⊆ 2^X, and any sample S = [(u_1, K_T(u_1)), ..., (u_m, K_T(u_m))] of m ≥ m(ε, γ) labeled examples for K_T, with the points u_i drawn independently according to D, the learning algorithm can compute from S, ε, and γ the representation of some hypothesis H ∈ T (in this case a suitable configuration F of node functions for π so that H = M_π^F) such that, with probability ≥ 1 - γ, D[{u ∈ X : M_π^F(u) ≠ K_T(u)}] ≤ ε.

In other words, this definition requires that the system usually be correct for most items in the test set. "Usually" is defined by the confidence probability parameter γ (this parameter protects against the small, but nonzero, chance that the examples happen to be very atypical), and "most" is defined by the accuracy probability parameter ε. Thus, based on these parameters one can control the network generalization.

The concept class K is said to be efficiently PAC-learnable with hypothesis class T if the following conditions are satisfied: (1) m(ε, γ) is bounded by a polynomial in 1/ε and 1/γ; and (2) the representation of H can be computed from S, ε, and γ by an algorithm whose computation time is bounded by a polynomial in 1/ε, 1/γ, and the length of S.

It is interesting to analyze issues with respect to the feasibility of this model. The feasibility of learning an unknown concept from examples depends on two questions [1]:

1. Do the examples convey enough information to determine the function (concept)?

2. Is there a speedy way of constructing the function from the examples?

The first question regards sample complexity, while the second one is related to the computational complexity of learning. In the sections that follow, both questions will be addressed.

4 Sample and computational complexity of pyramidal architectures

In this section, an upper bound on the sample complexity of PAC-learning algorithms for pyramidal architectures is defined. Since the concept of VC-dimension is used, sample complexity is provided in terms of that measure.


Definition 4.1 [10] Assume that π is a pyramidal architecture with s inputs and node function set Φ, and that S ⊆ {0,1}^s is an arbitrary set. Then the VC-dimension of π over S is:

VC-dim(π, S) = max{|S'| : S' ⊆ S has the property that for every function Π : S' → {0,1} there exists a configuration assignment F ∈ Φ such that, for all u ∈ S', M_π^F(u) = Π(u)}.
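As an illustration of Definition 4.1, the following brute-force sketch checks whether a finite set of {0,1}-valued functions realizes every dichotomy of a given set of points. The function names are illustrative, and the enumeration is exponential, so it is only meant to make the definition concrete.

```python
# Brute-force illustration of shattering and VC-dimension (Definition 4.1).
# `hypotheses` is any finite collection of {0,1}-valued Python callables, e.g.
# all functions M_pi^F obtained by enumerating configurations F of a small pyramid.
from itertools import combinations

def shatters(hypotheses, points):
    # S' = points is shattered if every function Pi : S' -> {0,1} is realized.
    realized = {tuple(h(u) for u in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

def vc_dimension(hypotheses, domain):
    # Largest |S'| with S' a subset of `domain` that is shattered.
    best = 0
    for r in range(1, len(domain) + 1):
        if any(shatters(hypotheses, subset) for subset in combinations(domain, r)):
            best = r
    return best
```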

In other words, the VC-dimension of a network gives the maximum size of a set of examples for which there is a set of functions that can achieve all possible classifications of those examples. The theorem below, found in [12], shows the importance of the VC-dimension of a neural network for the PAC-learning model.

Theorem 4.1 [12] Assume that T ⊆ 2^X satisfies VC-dim(T, X) < ∞ and that T is well-behaved (the latter is a rather benign measure-theoretic assumption that is always satisfied if T is countable).

Then any function L that assigns to a randomly drawn sample S of m ≥ m(ε, γ) examples (u, p) for some target concept K_T ∈ T (with u drawn according to some arbitrary distribution D over X) some hypothesis L(S) ∈ T that is consistent with S is a PAC-learning algorithm, since it then satisfies E_{u∈D}[|K_T(u) - L(S)(u)|] ≤ ε with probability ≥ 1 - γ.

This way, it is not possible to guarantee the conclusions of Theorem 4.1 unless a training sample of length at least proportional to the VC-dimension of the network is used. Hence, the VC-dimension quantifies the sample complexity of PAC-learning.
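For orientation, the usual sufficient sample-size bound of this kind for consistent learners can be written as follows (this asymptotic form is a standard result from the PAC literature; it is not stated in this paper, and the constants vary between sources):

```latex
% Standard sufficient sample size for any consistent learner; asymptotic form only,
% not taken from this paper, and the constants differ between sources.
m(\epsilon, \gamma) \;=\; O\!\left( \frac{1}{\epsilon} \left( \mathrm{VC\text{-}dim}(T, X) \, \log\frac{1}{\epsilon} \;+\; \log\frac{1}{\gamma} \right) \right)
```

In particular, a polynomial bound on the VC-dimension immediately gives a polynomial bound on m(ε, γ), which is condition (1) of efficient PAC-learnability above.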

Now, suppose that one wants to design an efficient PAC-learning algorithm for the class of pyramidal architectures. Computational complexity issues will be approached later; however, in this section simply observe that if the PAC algorithm is to run in time polynomial in the size of the network, then it must have polynomial sample complexity. This is true iff the VC-dimension of those networks is polynomial in the number of nodes in such networks.

Lemma 4.1 [4] If Π is a finite set of {0,1}-valued functions, then Π has VC-dimension at most log_2 |Π|.

Theorem 4.2 Let π be a pyramidal architecture with s inputs and arbitrary Boolean functions as node function set. Then VC-dim(π, {0,1}^s) is O(N), where N is the number of nodes in π.

Proof: A pyramidal architecture π can perform at most 2^(N · 2^q) (where q is the node fan-in) Boolean functions, since that value represents how many distinct ways there are to fill the memory locations in the pyramid. Hence, according to Lemma 4.1, VC-dim(π, {0,1}^s) ≤ log_2 2^(N · 2^q) = N · 2^q. Then, since the fan-in q is a fixed constant, VC-dim(π, {0,1}^s) is O(N). □
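The counting step above can be illustrated numerically (a small sketch; the helper name and the fact that a binary pyramid with s inputs has s - 1 RAM nodes are assumptions made for this example, not statements from the paper):

```python
# Numeric illustration of the counting argument in the proof of Theorem 4.2.
def vc_dim_upper_bound(num_nodes, fan_in):
    # A pyramid with N nodes of fan-in q realizes at most 2^(N * 2^q) Boolean
    # functions, so by Lemma 4.1 its VC-dimension is at most N * 2^q.
    return num_nodes * (2 ** fan_in)

# For a binary pyramid (q = 2) with s leaves there are N = s - 1 RAM nodes,
# so the bound N * 2^q = 4 * (s - 1) grows linearly in the number of nodes.
for s in (4, 8, 16):
    print(s, s - 1, vc_dim_upper_bound(s - 1, 2))
```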

Although a polynomial bound for the VC-dimension of pyramidal architectures has been provided, this section shows that there is no efficient PAC-learning algorithm to train them (unless RP = NP). Thus, the computational complexity of loading is the reason no such algorithm exists. Before showing a formal proof of that, however, it is necessary to define an important complexity class of problems, called the random polynomial (RP) problems.

A problem Q is in the complexity class RP with error parameter ε iff there exists a polynomial-time algorithm L such that for every instance Y of Q the following holds: if Y is a "YES" instance of Q then L outputs "YES" with probability at least 1 - ε for some constant 0 < ε < 1, and if Y is a "NO" instance of Q then L always outputs "NO". It is widely known that P ⊆ RP ⊆


NP; however, whether these inclusions are proper is an important open question in Computational Complexity Theory.

The proof below uses a technique applied to prove Corollary 3.1 in [5]. It is assumed that the functions computed by pyramidal architectures are learnable, and it is shown that such an assumption implies an RP algorithm for solving a known NP-complete problem, that is, that NP = RP.

Corollary 4.1 The class of Boolean functions computable by a pyramidal architecture π using {AND, OR} as node function set is not learnable in polynomial time, unless RP = NP.

Proof: Given an instance (Z, C) of the MSAT problem, an instance (π, T) of the loading problem (according to the algorithm in [6]) is created. The MSAT problem is defined by Sima in [14]. This problem is similar to the general satisfiability problem (SAT); however, in MSAT no clause may contain two literals derived from the same variable and, additionally, a positive literal always follows a negative one and vice-versa.

Let D be the uniform distribution over the items of T. Choose ε < min{1/|T|, 1/2}, and γ = 1 - ε. To prove the corollary it is sufficient to show that, for the above choice of ε, γ, and D, the learnability of pyramidal architectures in the context of PAC-learning can be used to decide the result of the MSAT problem in random polynomial time. So, there are the following cases:

• Suppose (Z, C) is an instance of the MSAT that has a solution n. Then, from the proof of Theorem 4.1 in [6], there exists a configuration for π which is consistent with the task T. Thus, if the concepts performed by pyramidal architectures using {AND, OR} are efficiently PAC-learnable, then due to the choice of ε and γ (and again by Theorem 4.1), the probabilistic learning algorithm must produce a solution which is consistent with T with probability at least 1 - ε, therefore providing a probabilistic solution of the MSAT. That is, if the answer to the MSAT question is "YES", then the algorithm answers "YES" with probability at least 1 - ε.

• Suppose that there is no solution for the given instance of the MSAT. Then, by Theorem 4.1 in [6], for (π, T) there is no configuration for π which is consistent with T. Thus, the learning algorithm must always either produce a solution which is not consistent with T, or fail to halt in time polynomial in n, 1/ε, and 1/γ. In either case it is possible to check that the learning algorithm was inconsistent with T or did not halt in the allotted time, and answer "NO". In other words, if the answer to the MSAT is "NO", the algorithm always answers "NO".

Since the MSAT problem is NP-complete, it follows that any problem in NP has a random polynomial-time solution, that is, NP ⊆ RP. However, it is widely known that RP ⊆ NP; hence RP = NP. □
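For concreteness, the decision procedure implied by the two cases above can be written schematically as follows. This is only a sketch: the reduction, the assumed efficient PAC learner, and the exact parameter choices are placeholders (a real polynomial-time learner is precisely what the corollary rules out).

```python
# Schematic sketch of the RP decision procedure for MSAT implied by Corollary 4.1.
# All helper names are placeholders; none of them refer to an existing implementation.

def reduce_to_loading(msat_instance):
    # Placeholder for the construction of the loading instance (pi, T) from [6].
    raise NotImplementedError

def pac_learn(pyramid, task, eps, gamma):
    # Placeholder for the assumed polynomial-time PAC learner for pyramids with
    # {AND, OR} node functions, run with D uniform over the items of T.
    raise NotImplementedError

def consistent(hypothesis, task):
    # Checks whether the returned configuration classifies every item of T correctly.
    return all(hypothesis(u) == label for (u, label) in task)

def decide_msat(msat_instance):
    pyramid, task = reduce_to_loading(msat_instance)
    eps = 0.5 / len(task)      # stands in for the choice of epsilon in the proof
    gamma = 1.0 - eps          # stands in for the choice of gamma in the proof
    hypothesis = pac_learn(pyramid, task, eps, gamma)  # polynomial time budget
    if hypothesis is not None and consistent(hypothesis, task):
        return "YES"           # case 1: a consistent configuration was found
    return "NO"                # case 2: inconsistent output or no answer in time
```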

5 Final remarks

This paper has addressed two important issues: the sample complexity and the time complexity of learning in pyramidal architectures. The results obtained here, together with the ones in [4], show the apparent independence between such issues.

In the first case, the VC-dimension of pyramidal architectures has been shown to be bounded by a polynomial. Hence, it is possible to define a number of training examples large enough (but still polynomial) to make a neural network generalize as expected. On the other hand, even with the need for only a polynomial number of examples, it has been proved that there is no random polynomial algorithm to solve the loading problem for pyramidal architectures (unless RP = NP). That is, even a sufficient number of examples is useless if the computational task of transforming such examples


into a hypothesis is intractable. Because of this, it is not possible to guarantee the conclusions of Theorem 4.1 of this paper. So, the network generalization may not be as expected (based on the γ and ε parameters). Therefore, generalization in pyramidal architectures in the context of the PAC-learning model must be as hard as learning.

References

[1] Y. S. Abu-Mostafa. The Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning. Neural Computation, 1:312-317, 1989.

[2] R. Al-Alawi and T. J. Stonham. A training strategy and functionality analysis of digital multi-layer neural networks. Journal of Intelligent Systems, 2:53-93, 1992.

[3] I. Aleksander. Emergent intelligent properties of progressively structured pattern recognition nets. Pattern Recognition Letters, 1:375-384, 1983.

[4] M. Anthony. Probabilistic Analysis of Learning in Artificial Neural Networks: The PAC Model and its Variants. NeuroCOLT Technical Report Series NC-TR-94-3, University of London, Surrey, England, June 1994.

[5] B. Dasgupta, H. T. Siegelmann, and E. Sontag. On the complexity of training neural networks with continuous activation functions. Technical report, Rutgers University, New Brunswick, NJ, December 1993.

[6] M. C. P. de Souto, K. S. Guimarães, and T. B. Ludermir. On the Intractability of Loading Pyramidal Architectures. Accepted for publication in the Proceedings of the IEE International Conference on Artificial Neural Networks 1995 (ANN 95), 1995.

[7] E. C. B. C. Filho, D. L. Bisset, and M. C. Fairhurst. A goal seeking neuron for Boolean neural networks. In Proc. International Neural Networks Conference, volume 2, pages 894-897, Paris, France, July 1990.

[8] J. S. Judd. Neural Network Design and the Complexity of Learning. The MIT Press, Cambridge, USA, 1990.

[9] T. B. Ludermir and W. R. de Oliveira. Weightless neural models. Computer Standards & Interfaces, 16:253-263, 1994.

[10] W. Maass. Bounds for the computational power and learning complexity of analog neural nets. In Proc. of the 25th Annual ACM Symposium on the Theory of Computing, pages 335-344, 1993. Extended abstract.

[11] W. Maass. Neural nets with superlinear VC-dimension. In Proc. of the International Conference on Artificial Neural Networks 1994 (ICANN 94), pages 581-584, Berlin, 1994.

[12] W. Maass. Perspectives of current research about the complexity of learning on neural nets. In V. P. Roychowdhury, K. Y. Siu, and A. Orlitsky, editors, Theoretical Advances in Neural Computation and Learning, pages 295-336. Kluwer Academic Publishers, 1994.

[13] C. Myers and I. Aleksander. Learning algorithms for probabilistic neural nets. In IEE International Conference on Artificial Neural Networks, pages 310-314, London, UK, October 1989.

[14] J. Sima. Loading Deep Networks is Hard. Neural Computation, 6:840-848, September 1994.

[15] L. G. Valiant. A theory of the learnable. Comm. of the ACM, 27(11):1134-1142, November 1984.

r."