Author: Braspenning P.J.   Thuijsman F.   Weijters A.J.M.M.

Tags: artificial intelligence

ISBN: 3-540-59488-4

Year: 1995

                    Lecture Notes in Computer Science 931
Edited by G. Goos, J. Hartmanis and J. van Leeuwen
Advisory Board: W. Brauer D. Gries J. Stoer


P.J. Braspenning, F. Thuijsman, A.J.M.M. Weijters (Eds.)

Artificial Neural Networks
An Introduction to ANN Theory and Practice

Springer
Series Editors
Gerhard Goos, Universität Karlsruhe, Vincenz-Priessnitz-Straße 3, D-76128 Karlsruhe, Germany
Juris Hartmanis, Department of Computer Science, Cornell University, 4130 Upson Hall, Ithaca, NY 14853, USA
Jan van Leeuwen, Department of Computer Science, Utrecht University, Padualaan 14, 3584 CH Utrecht, The Netherlands

Volume Editors
P.J. Braspenning, Department of Computer Science
F. Thuijsman, Department of Mathematics
A.J.M.M. Weijters, Department of Computer Science
University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands

CR Subject Classification (1991): F.1.1, I.2.6, G.1.6, I.5.1, J.1, J.2, J.6
1991 Mathematics Subject Classification: 92B20, 94C15, 68T05, 90C90

ISBN 3-540-59488-4 Springer-Verlag Berlin Heidelberg New York
CIP data applied for

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 1995
Printed in Germany
Typesetting: Camera-ready by author
SPIN: 10486258 06/3142-543210 - Printed on acid-free paper
Preface

This book is the result of a concerted action by the departments of Computer Science and Mathematics of the University of Limburg (Maastricht, The Netherlands) to develop a collection of lectures, specifically dedicated to informing the industrial world about the potential of using neural networks. For this reason, both departments worked together within an NN working group to set up an Autumn School for Neural Networks, which was held in 1990 in Maastricht. Participants came from different quarters: government, industry, small and medium-sized companies, and insurance and banking institutes. However, the participants were not arbitrarily chosen workers within those quarters. The target group of people addressed by the Neural Network School were technical managers, consultants, research associates, and software developers at the high end of the spectrum, whose employers expected innovative applications of new technologies within their own (industrial) setting. Hence, in our view the target group consisted of people with a reasonable level of formal education and, specifically, some basic background in mathematics and computer science. The contributions to this book were written with this group in mind, and the prerequisites for a fruitful understanding of the material are set accordingly. Nevertheless, some more specific knowledge of mathematics and/or computer science may be required at a few places.

The aim of this book is not to offer a systematic exposition of all kinds of neural networks or a collection of the most often used networks. Rather, the idea was to focus on two generic application domains, namely control and optimization, and to use these application domains to illustrate the concrete use of different kinds of neural networks. Put otherwise, these application domains were used to cluster and direct the NN School lectures to be particularly illustrative of how to apply different kinds of neural network architectures. In this way, we hoped to serve the needs of the participants, both regarding their need to (theoretically) understand the functioning of any particular network and regarding their need to really see a demonstrative example application. After the NN School was held we used the feedback from the participants to update and elaborate the course materials into a set of papers which together comprise this book. That is, the book is a compilation, not of the original course materials, but of carefully re-worked original papers. However, it should not be seen as detailing the newest developments within the rapidly evolving scientific discipline of neural networks. As explained already, this was not the goal of our efforts.

What can the reader expect of this book? First, it gives a representative selection of neural network architectures which have found widespread application. The level of exposition is such that the functioning of these neural networks can be understood, and often their functioning is also dealt with in a more analytical fashion. Secondly, quite a few applications are described for which a particular neural network architecture has been chosen. This choice is not always
based on purely objective criteria, because the field of neural networks is still of a rather experimental nature. However, where possible the actual choice of architecture is reasoned. In fact, one contributing paper in this book is exclusively dedicated to making a choice about the neural network architecture to use for a particular task within an actual application domain. Thirdly, reading the book as a whole certainly stimulates one's curiosity (and therefore one's innovativeness) about the applicability of neural networks within one's own field of work. Therefore, the team of authors considers itself to have been successful if many readers, after reading this book, seriously consider applying some neural network technology to the problem at hand, whether it is for a classification/recognition task, a control task, or a complex multiple constraint satisfaction task.

Of course, the ordering of the papers in this book is not arbitrary. It constitutes the route which we think to be most profitable for the serious reader. However, depending on the reader's pre-existing background knowledge about neural networks, we do not object at all to a reader who wants to dive into particular papers, especially because all papers are sufficiently self-contained. Nevertheless, we would like to finish this introduction with an outline of the route of papers in this book.

The first paper, by Braspenning, gives a general characterization of neural networks and puts the contributions to the book in perspective. In the contribution by Weijters and Hoppenbrouwers the back-propagation network is discussed. This is probably the most widespread and most popular architecture. The paper by Henseler addresses this architecture again, but in a more formal way and applied to a robot control task. The paper by Peters treats the forerunner of back-propagation networks, namely perceptrons, and analyzes mathematically their advantages/disadvantages. The contribution by Vrieze gives a basic treatment of another architecture, the Kohonen network, and analyzes basic expectations of what it does and can do for a number of application tasks. The paper by Postma and Hudson again introduces an architecture, namely adaptive resonance networks, and discusses a number of variants. The paper by Spieksma treats a neural network architecture which is inspired by physical phenomena treated by statistical mechanics, namely the Boltzmann machine architecture, and applies it to a combinatorial optimization problem. The contribution by Lenting discusses the same architecture, but now from the perspective of how to map (or represent) a particular problem on (with) such an architecture; a topic which, in fact, deserves careful attention with any of the architectures. The paper by Postma introduces another architecture inspired by a physical theory, namely the Hopfield-Tank network, but its main gist is to show how neural networks can help in solving optimization problems. The contribution by Crama, Kolen, and Pesch addresses a wide range of combinatorial optimization approaches including a neural network approach like the Boltzmann machine and a genetic algorithm inspired by a Darwinian framework. The paper by Boekhoudt turns to process identification and control, which is another important generic application domain of neural networks, and some already introduced NN architectures are evaluated regarding their promising use. The paper by Van Luenen again
deals with control tasks and discusses the neural network design and learning strategies appropriate for these tasks with a quite illustrative example application: the inverted pendulum. The paper by Cardon and Hoogstraten is written from the perspective of a large industry (Shell) and deals with practical criteria for choosing a neural network solution, illustrated by an application for an industrial classification task. The next paper by Braspenning discusses the relationship between NNs and Artificial Intelligence (with its strong emphasis on symbolic processing) and focuses on a high-level map of the many types of neural networks and their dynamics, thereby sketching a landscape wherein all treated architectures may be placed. Finally, the contribution by Hudson and Postma addresses the topic of choosing and using a neural net, providing suitable criteria in the context of different types of problems, outlining a general categorization of neural network architectures, and finally summing up the considerations that may matter in making a choice.

We would like to finish with a somewhat cautious remark. Although this book is critical at some points about the appropriateness or usefulness of neural network technology, it includes among its purposes that of saving this technology from the sometimes inordinate claims that its enthusiasts are making for it. In a sense, the neural network hype is over! Dressed in more modest but palpable working clothes, neural network technology may yet become a reasonably valuable collection of tools for addressing practical problems.

Finally, we wish to acknowledge those who assisted in making this volume possible. We thank all participants of the NN School and all contributors to this volume. We are grateful to P. Schoo for the computer assistance at the NN School and to J.J.M. Derks and E.J. Pesch for their assistance at various stages of this project. We are especially indebted to Mrs. M. Verheij and Mrs. M. Haenen for preparing this document in LaTeX.

P.J. Braspenning
F. Thuijsman
A.J.M.M. Weijters
Contents

P.J. Braspenning
Introduction: Neural Networks as Associative Devices 1

A.J.M.M. Weijters and G.A.J. Hoppenbrouwers
Backpropagation Networks for Grapheme-Phoneme Conversion: a Non-Technical Introduction 11

J. Henseler
Back Propagation 37

H.J.M. Peters
Perceptrons 67

O.J. Vrieze
Kohonen Network 83

E.O. Postma and P.T.W. Hudson
Adaptive Resonance Theory 101

F.C.R. Spieksma
Boltzmann Machines 119

J.H.J. Lenting
Representation Issues in Boltzmann Machines 131

E.O. Postma
Optimization Networks 145

Y. Crama, A.W.J. Kolen and E.J. Pesch
Local Search in Combinatorial Optimization 157

P. Boekhoudt
Process Identification and Control 175

W.T.C. van Luenen
Learning Controllers Using Neural Networks 205

H.R.A. Cardon and R. Hoogstraten
Key Issues for Successful Industrial Neural-Network Applications: an Application in Geology 235

P.J. Braspenning
Neural Cognodynamics 247

P.T.W. Hudson and E.O. Postma
Choosing and Using a Neural Net 273

Supporting General Literature 289

Addresses of the Authors 295
Introduction: Neural Networks as Associative Devices

P.J. Braspenning
Department of Computer Science, University of Limburg, Maastricht

1 Introduction

This introductory paper provides a short overview of the many contributions to this book. It may help you find your way through the following papers, while also providing the many-faceted reasons for putting them together in the first place. As stated already in the Preface, this book results from a concerted action by the departments of Computer Science and Mathematics of the University of Limburg (Maastricht, The Netherlands) to develop a collection of lectures, specifically dedicated to informing the industrial world about the potential of using neural networks. These lectures were thoroughly updated, elaborated and finalized in the collection of papers which together comprise this book.

The target group concerned technical managers and consultants, research associates and software developers at the high end of the spectrum. Their employers should have an active interest in innovative applications of new technologies within their own (industrial) setting. Therefore, the target group consists of people from such diverse environments as government, industry, small and medium-sized companies, insurance and banking, etc.

The idea was to focus on two generic application domains, namely control and optimization, and to use these application domains to illustrate the concrete use of different kinds of neural networks. Although these application domains were used to cluster the diverse contributions, they were in no way meant to be exhaustive for the application domains where neural networks may profitably be used. In fact, we were (and are) sure that neural networks can also be applied with success in many other domains. Nevertheless, focussing on a rather small set of (generic) application domains has the advantage that the reader gets a more lively picture of how to involve neural networks for a problem at hand. Most contributions to this book try to serve the reader in a two-fold way, namely to help in understanding the functioning of any particular (type of) network and to illustrate its use with an application which acts as a demonstrative example. Further, some contributions are possibly helpful in providing more context, practical industrial experience or concrete advice about neural networks. In the following paragraphs we will detail these contributions somewhat more.

2 Associative Devices and Heuristic Devices

First, however, we would like to state that the basic interest in neural networks comes from the fact that they may be considered as very flexible, associative devices.
Basically, they are devices able to perform pattern recognition, pattern construction and pattern retention, but in a way which is both more flexible and often less well theoretically understood than more classical pattern-recognition techniques. Sometimes they may be applied where older techniques just fail, since these techniques require more strict boundary conditions to be fulfilled for their proper use. However, the results of applying neural networks may be less theoretically justified, although, of course, still of considerable practical relevance. To give an example: being able, with the help of a neural network, to find a pattern in stock market dynamics may appal the theoretician (when the pattern is not well-founded enough to predict stock prices), and yet delight the investor who might use it (additionally) for his decisions at hand and apparently win money! The point here is, of course, that practical relevance never coincides with theoretical well-foundedness. Still, our picture would be too rosy if we did not add that practical results also don't coincide with practical relevance, since the latter concept only hints at the possibility of useful, practical results! Hence, neural networks may be seen as a particular class of heuristic devices.

3 Backpropagation Networks

Having said so, we are now ready to describe the many-faceted contributions to this book: the paper by Weijters and Hoppenbrouwers gives an introduction to what is probably the most widespread and popular architecture, namely a back-propagation network. This non-technical outline serves well to explain why neural networks may be considered to be a modelling technique in which the emphasis is on training (or learning) and not so much on explicit encoding of rules (or programming). Although there is a form of programming in setting up a particular network (architecture) and tuning certain parameters, the dominant role is for the resulting dynamics of the network, which in a training phase tries to compact input-output pairs of data into a partial mapping. After this phase, this mapping may be exploited as a rather flexible associative machinery, also able to map inputs to outputs which it has never 'seen' as pairs of data during the training phase. This paper also introduces what is commonly called a 'learning rule' (in this case the Delta rule); this is, in fact, an associative template (or scheme) for how a neuron's activity may influence other coupled neurons. Furthermore, it discusses the use of such a network in the conversion of (written) text into a phonetic representation (which, e.g., may be used to produce an oral transcription in the auditive domain).

The contribution by Henseler addresses this architecture again, but in a more formal way and applied to a robot control task. Moreover, it explains why older learning rules which were applied at the beginning of the sixties, though successful in many cases, also had serious drawbacks. They failed in certain problem instances where exceptional non-linearity of the input-output mapping was required. Much later a possible solution was offered by incorporating extra layers of neurons by which the required non-linear (partial) mapping, in principle, could be established. However, such a complicated network required a generalization
of older associative schemata (or learning rules), and only after the generalized delta rule was found did interest in such networks re-appear. This paper also addresses some difficulties with this generalized learning rule, and shows some ways to improve the performance of such networks. Moreover, it introduces recurrent networks, i.e. networks in which neurons may (indirectly) feed back their output to neurons from which they previously received inputs. It seems that this type of network may be required to find patterns in time (besides space). Finally, the back-propagation network is applied to the problem of robot arm movement control, whereas an appendix contains the code for the basic algorithm. The robot arm also requires discussion of the representation of the problem for a particular network; a topic which is often undervalued, but remains of utmost importance for successfully using any neural network.

4 Some Other Classical Networks

The paper by Peters treats the forerunner of backpropagation networks, namely perceptrons, and analyzes mathematically their advantages and disadvantages. Although this type of network is no longer really used, the analysis is still of great interest for a number of reasons. First, perceptrons may be considered to be building blocks for understanding the functioning of more intricate networks. As such, getting a clear picture of their properties helps in knowing globally what can (and cannot) be expected of neural networks. Secondly, this paper also discusses a dynamical building block (of perceptrons and also of many present-day neural networks), namely the activity of neurons as a linear threshold function. Although this dynamical property of neuron activity is not used everywhere, it acts as some sort of yardstick to assess the basic operation of a neuron's processing. Moreover, it is the easiest way to visualize the characteristic non-linear mapping of a sequence of neurons, dependent on each other's output to build up their own activity, and only producing their own output after reaching a certain threshold for their activity. Thirdly, this paper addresses the very important topic of which (complicated) predicates (built out of templates or masks, which are basic sub-patterns) can be computed (i.e., assessed to be true or false!) by a certain network type. Again, the perceptron network acts here as some sort of yardstick against which other network architectures can be measured. Finally, the topic of training (or 'learning') is treated again, but now in a way which allows one to remove the 'magic' of neural network learning by showing that the heart of the matter is convergence of the dynamics of the network to a stable state.

The contribution by Vrieze gives a basic treatment of another architecture, the Kohonen network, and analyzes basic expectations of what it does and can do for a sample of application tasks. Since this type of network has quite some other form and resulting dynamics than the previous perceptron-based family of treated architectures, the reader is slowly, yet thoroughly introduced to this kind of neural network. Formal neurons are introduced with a certain basic mathematical behaviour, and then their interaction within a lattice-like field of such
neurons is discussed. Since this interaction, depending on the distance to other neurons, may be excitatory or inhibitory for the other neurons' activity, the associative scheme (or learning rule) is much more intricate than in the case of the Delta rule. Still, the basic functioning of the network is as an associative device, but now in a 'self-organizing' (or unsupervised) way. Stripped of 'magic', this term means only that the associations found by the neural network of Kohonen type result from inherent dynamical adaptation of the network to the data set. Depending on the complexity of the network and corresponding inherent dynamics, the Kohonen type network generally projects a collection of input-output vector pairs onto a space of lower dimensionality while preserving as much as possible the topology of the original set of vector pairs in this lower dimensional space. The network 'incorporates' this nearby topology in its collective neuronal interaction. After appropriate proofs of basic properties this paper also discusses some applications, and provides the code of the basic algorithm in the appendix.

5 Stability versus Plasticity

The paper by Postma and Hudson again introduces an architecture, namely adaptive resonance networks, and discusses a number of variants. The theoretical background (Adaptive Resonance Theory) of these ART-networks is to be found in numerous publications by Grossberg and co-workers. However, this contribution focusses on more practical aspects of this family of neuronal networks. Just like the previous type of networks, the ART-networks operate in an unsupervised way. However, their basic functioning is quite different: they learn input patterns by trying to classify them under the heading of a most similar class pattern. If such a class pattern cannot be found, then the input pattern may be the seed of a new class pattern. The theoretical framework (ART) is a thoroughly worked out answer to what is called the stability-plasticity dilemma. This dilemma is faced by any neural network, but Grossberg and co-workers were particularly concerned about it, because their basic purpose was to mimic as far as possible the way in which our brain has solved that dilemma. It comes down to being plastic enough to be able to store new input patterns, yet stable enough to be able to maintain already properly stored (or classified) patterns. The family of ART-networks (ART1, ART2, ART3 etc.) is treated in just enough detail to get a basic understanding of their functioning and the kinds of input patterns (binary, analog) to which they can be applied. Finally, an evaluation of the whole family of networks is provided.

The contribution by Spieksma treats a neural network architecture which is inspired by physical phenomena treated by statistical mechanics, namely the Boltzmann machine architecture, and applies it to a combinatorial optimization problem. However, the Boltzmann machine may also be applied to problems from all areas of pattern recognition and pattern learning. The most basic properties of such machines, i.e. when they are implemented in the most proper way, are massive parallelism (any neuron calculates its activity and output irrespective of the calculations of any other neurons) and simulated annealing (collectively and
in time, the neurons slowly 'freeze' their acquired, yet plastic patterns into more stable patterns). Though a very promising type of network, the reader should keep in mind that most of the implementations until now are not full implementations, but mostly simulations of the massive parallelism attributed to Boltzmann machines. Still, even those simulations may provide useful results, although the user should be careful (see Lenting's contribution). This paper provides a detailed description of the Boltzmann machine and its variants. Moreover, it treats some combinatorial optimization problems (max-cut, Travelling Salesman) and how the Boltzmann machine may handle them. A summary and description of some possible future developments close this paper. It is to be mentioned that the Boltzmann machine model is not only widely used in theoretical physics, but is also one of the most well-understood and analyzable models in the field of neural networks.

6 Parallelism in Neural Networks

The contribution by Lenting discusses the same architecture, but now from the perspective of how to map (or represent) a particular problem on (with) such an architecture; a topic which, in fact, deserves careful attention with any of the architectures. However, the Boltzmann machine is particularly apt to treat representation issues, since the mathematics as such is very analyzable, so that not the machinery, but the interpretation issues may be brought into prominence. It is shown that, contrary to popular opinion, the network cannot be made responsible for getting the problem representation (the encoding of the problem in neural weights and/or inputs) right. Representation (and its inverse: interpretation) remain purely human affairs, and the success of applying neural networks will always also depend on the right problem representation choice. The issue is dealt with by zooming in on the Travelling Salesman Problem (TSP) on a Boltzmann machine for combinatorial optimization. The critical remarks are based on experimentation with a (simulated) Boltzmann machine with unlimited, synchronous parallelism. First, the quadratic assignment representation is discussed and weak spots are elicited. Then, a search for improved representations is undertaken, during which the size of the configuration space is treated too. The paper evaluates some representational improvements on the performance of the Boltzmann machine. Moreover, it is pointed out that even harder problems (than the TSP), such as job-shop scheduling problems, would also need careful consideration of the issues involved. An appendix details the experiments on which the main statements of this paper are based.

The paper by Postma introduces another architecture inspired by a physical theory, namely the Hopfield-Tank network, but its main gist is to show how neural networks can help in solving optimization problems. More traditional algorithmic approaches suffer from the fact that computational time increases exponentially with the problem size. Therefore, a solution may be to map the problem onto parallel hardware, and the Hopfield-Tank network, as a fully connected network, is at least a good candidate to implement in parallel hardware.
The theoretical framework comes from the area of spin-glasses within Solid State Theory (one of the Theoretical Physics sub-disciplines). As the mathematical model is well-known in those circles, and full connectivity of neurons often simplifies the analysis of the network, traditional statistical-mechanics techniques could be applied to Hopfield-Tank (HT) networks. In fact, being able to handle such networks within a well-understood theoretical framework caused a widespread and ever-increasing interest to re-appear in those 'good old' neural nets. This contribution treats the structure and dynamics of HT-networks in enough detail to understand their basic functioning. Further, it shows how the task assignment problem (as an example of an optimization problem) can be mapped onto this type of network. Furthermore, the performance of HT-networks and the special role of continuous activation functions together with the use of a sigmoid non-linearity to produce neural outputs is treated. A final discussion explains the shortcomings and current research alternatives.

7 Neural Networks versus Other Approaches

The contribution by Crama, Kolen and Pesch addresses the full range of combinatorial optimization approaches, including a neural network approach like the Boltzmann machine and a genetic algorithm inspired by a Darwinian framework. However, the real topic concerns local search in combinatorial optimization, since local search is the basic principle underlying many classical optimization methods. Connected with local search is a neighbourhood around every feasible solution of a (combinatorial optimization) problem. This again is an operationalization of the basic idea that slight perturbation of known feasible solutions may render the final solution by looking simultaneously for minima in the objective function which reflects the constraints (the function is defined over the space of feasible solutions). Hence, local search is crucial and, moreover, also well-suited to explain the dynamics of many types of neural networks (see also the contribution by Hudson and Postma). This paper is illuminating on many topics such as: how to pick an initial solution, how to define neighbourhoods and how to select (search) a neighbour of a given solution. Obviously, both the starting solutions and the choice of the size of the neighbourhoods are important for any local search procedure. There is a trade-off between quality of the solution and complexity of the algorithm. In addition, there is always a problem with local search, namely the existence of local optima (which are not global). This paper describes very well how recent extensions of local search with more possibilities to escape local optima (e.g., Boltzmann machine, Tabu search) come down to allowing for occasional degradations of the objective function. Moreover, it shows clearly why, in this context, genetic algorithms are also an interesting technique, because here the computation starts with a population of feasible solutions instead of a single one (as in more traditional approaches).

The contribution by Boekhoudt turns to process identification and control, which is another important generic application domain of neural networks, and some already introduced NN-architectures are evaluated regarding their promising use.
The issues of control cannot be tackled before the process to be controlled is understood. Therefore, the topic of process identification should be addressed first. Process identification consists of a number of steps to come to a mathematical model formulation of the process to be studied. This paper surveys "traditional" methods of identification and control (in the sense that they make no use of neural networks). After that, it discusses where neural networks may profitably be used. The basic ideas of process identification and control are treated with the help of a particular model of the diabetes mellitus process. It serves well to understand the basic issues, such as the parameter-identification problem, process control by pole placement, state estimation, etc. These basics are then extended with a discussion of stochastic systems, which may be applied to account for differences between model results and reality; differences which may result from such diverse sources as 1) changing process characteristics, 2) unmodelled non-linearities, 3) changing process parameters, and 4) sensor (measurement) errors and other disturbances. Moreover, control of processes based on a linear model of the dynamics sometimes requires other types of process control. Also, the fact that a model is not always completely given (i.e., lack of knowledge) requires other types of control. It is in these cases that neural networks may profitably be used, but more mathematical rigor in applying them is certainly needed.

8 Applying Neural Networks

The paper by Van Luenen again deals with control tasks and discusses the neural network design and learning strategies appropriate for these tasks with a quite illustrative example application: the inverted pendulum. Again, the emphasis is on preliminary results, although quite interesting results are claimed in the literature. However, the author quite rightly warns that all present-day work is experimental in nature and mostly conducted in research laboratories. Still, one may expect that in future more and more control tasks will be solved with the help of (particular types of) neural networks, and major problems (e.g., long learning times, computational capabilities, proofs of convergence and proofs of stability) will (at least partially) be solved. After an extended introduction which explains how neural networks may be applied to control problems, what kind of learning strategies may be used, how process identification (sometimes also in the form of a neural network model) is done, and how a priori knowledge may be embodied, the learning strategy of reinforcement learning is treated in more detail. In particular, the adaptive heuristic critic (AHC) algorithm can be used to learn the control of a process, and this algorithm is treated in quite some detail. An example application of this algorithm is the inverted pendulum; a very nice application, since this process is notoriously unstable and allows one to show the real power of this learning strategy. Experimental results are provided and discussed, also with respect to the real-time behaviour of the pendulum. Interestingly, an integration of Artificial Intelligence and neural networks is foreseen to be needed for practical use of neural controllers.
The contribution by Cardon and Hoogstraten is written from the perspective of a large industry (Shell) and deals with practical criteria for choosing a neural network solution, illustrated by an application for an industrial classification task. The experience reported here goes back to developments within the Shell Research laboratory in Rijswijk (The Netherlands), where neural network explorations were performed from the mid-eighties onwards. Shell Research was a forerunner in applying neural networks to concrete practical problems, and people in the laboratory became used to considering the use of Expert Systems, Standard Statistics, Genetic Algorithms and Rule Induction besides the possible application of neural networks to their problems. Therefore, this paper reports in a condensed way about their experiences, mostly in the form of answers to key questions, such as: When do you consider using a neural network? What are the critical issues when introducing a neural network in an operational environment? What are the most important stages in the development cycle? Subsequently, a practical application developed within the laboratory is discussed. In this context, a comparison with a well-known statistical technique (Linear Discriminant Analysis) is also given. Moreover, some improvements to the algorithm (Back-Propagation) used in the application are treated. The neural network developed for the identification of genetic geological facies types proved quite successful in practical terms. For example, when the neural network answer differs from the answers provided by the geological experts, the answer is mostly debatable in the first place.

9 Context, Choice and Use of Neural Networks

The next paper by Braspenning discusses the relationship between NNs and Artificial Intelligence (with its emphasis on symbolic processing) and focuses on a high-level map of the many types of neural networks and their dynamics, thus sketching a landscape wherein (nearly) all architectures may be placed. After an introduction about the relationship between (and corresponding critics of) Artificial Intelligence and Artificial Neural Networks, showing the many sources of renewed interest in ANNs, and a very short summary of ANN-basics, a general dynamical systems framework for ANNs is expounded. This framework may help in viewing all neural network architectures (i.e. those discussed in the book and many others) from a general vantage point. It functions as some sort of cognitive map on which the many architectures may be placed, and some 'white spots' located. Basically, the reader is introduced to two complementary spaces, namely activation dynamics and weight dynamics. The latter is only used during the training (or learning) phase of a neural network, but is otherwise absent when the weights are kept fixed. However, even then the activation dynamics may be of a convergent, oscillatory or even chaotic nature. The latter two are not further discussed, although one may expect that in the near future these forms of dynamics will also be used for information-processing. Convergent dynamics (i.e., converging to a stable state of the network) has, however, a quite natural interpretation from the perspective of information-processing. Accordingly, the
class of convergent dynamics is described under which most neural networks fall, and criteria for convergence are given that may be applied to actual nets. The topic of Liapunov functions is treated together with an equality that allows one to find criteria for global asymptotic stability. Finally, layered networks and cascades are discussed, because they form a natural way to build more complex networks.

At the end of the book, the contribution by Hudson and Postma addresses the topic of choosing and using a neural net, providing suitable criteria in the context of different types of problems, outlining a general categorization of neural network architectures, and finally summing up the considerations that may matter in making a choice. The tone set by this contribution is that neural networks provide very powerful ways of solving certain sorts of problems, yet they do not, nevertheless, provide a panacea. Therefore, it makes sense to detail some very general features of neural networks, and to categorize them on the basis of these (operational) features. First, however, some types of problems for which a neural network may be useful should be distinguished and described. Then a general classification of architectures is provided (moreover, a particularly helpful table of common neural networks and several references to public domain neural network simulators are provided). This classification aids enormously in subsequently treating considerations for choosing a network architecture and considerations for using a network. The first class of considerations addresses features like learning or non-learning, generalization, input type, output, stability, scalability and execution speed. The second class of considerations treats, mostly from a user perspective, issues like learning speed, learning algorithm, learning parameters, number of layers, connectivity, distributed or localized representations and locality of algorithm. The conclusions of this paper emphasize again that careful analysis of the problem at hand in terms of the features discussed is necessary to make a reasoned choice for either a particular network architecture or not using a neural network at all.

10 Concluding Remarks

In conclusion, the papers in this book together provide a broad and often deep-going survey of Artificial Neural Network land; a land which is as exciting as it is unexplored. Many more theoretical contributions are needed (and, in fact, more and more theory is being developed). However, for the time being we also need brave adventurers who are willing to experiment with particular network architectures and dynamics within concrete practical problems. It is our hope that this book stimulates those "adventurers" outside academia to explore the use of Artificial Neural Networks to solve their concrete problems for the benefit of their companies. Moreover, any feedback from readers is welcome for the benefit of the ANN-science.
Backpropagation Networks for Grapheme-Phoneme Conversion: a Non-Technical Introduction

A.J.M.M. Weijters1 and G.A.J. Hoppenbrouwers2
1 Department of Computer Science, University of Limburg, Maastricht
2 Dutch State School of Translation and Interpreting, Maastricht

1 Introduction

Until very recently, cognitive processes have typically been modelled by means of rule-based models. It appears, however, to be possible to model these processes by means of neural networks3. This modelling technique, inspired by the workings of the human brain, is distinguished from approaches based on symbol manipulation by the fact that the rules are not incorporated in the model explicitly: a neural network is not programmed for a particular task but is trained for it. Presenting it with examples enables it to acquire the skill which is to be modelled. Our contribution to the present volume is meant as a non-technical introduction to this modelling technique. It consists of three parts. In section 2 we discuss (in general terms) various modelling techniques in which neural networks play a major role. Section 3 discusses NETspraak, a neural network that can be trained to convert Dutch texts into a phonetic representation, thus providing a practical example of the approach. After a brief discussion of the model used for NETspraak, we deal with the learning material and the test material presented to NETspraak. Closer examination of the results at various stages of the learning process gives an indication of the results that can be achieved. Both in the choice of the name and in the technical realisation of NETspraak (Dutch for NETtalk) we have been inspired by the article "NETtalk: A Parallel Network That Learns to Read Aloud" (Sejnowski and Rosenberg, 1987). NETtalk can be trained to convert English texts into a phonetic representation. We conclude section 3 with a comparison of the results of NETtalk and NETspraak. In section 4 we attend to the question of whether the modelling technique using neural networks is really as promising as the results achieved so far might suggest.

3 The terminology in this field is still unsettled. In the literature on the subject the following terms can be found, all referring to the same thing: Neural Networks (NN), Artificial Neural Networks, Parallel Distributed Processing Networks (often referred to as PDP-networks), Connectionist Networks, Neural Circuits, Dynamical Computation Systems, etc. In this paper we will stick to the term "neural network".
2 Neural Networks: An Introduction

Modelling of cognitive processes by means of neural networks differs greatly from classical approaches. In this section, we will introduce a number of important notions such as that of a processing unit, threshold values, weights, a learning rule, and local and distributed representation. It aims to offer the reader a non-technical introduction. A more technical introduction is to be found in (Rumelhart and McClelland (eds.), 1986) and the paper by Henseler in the present volume. Readers interested in the application of neural networks in modelling linguistic cognitive processes are referred to chapters 18 (On Learning the Past Tenses of English Verbs) and 19 (Mechanisms of Sentence Processing) of the former publication. In subsection 2.1 we will present some background information on the cognitive process to be modelled: the conversion of (written) texts into a phonetic representation. The rest of the section will be devoted to a discussion of the classical method of modelling this process. In subsection 2.2 we will briefly discuss the neuro-physiological structure of the brain, since this formed the primary inspiration for the architecture of neural networks. In subsection 2.3 we will illustrate the workings of a very simple network, showing how this can be trained to perform various tasks. This training process uses a particular kind of learning rule: the Delta rule, which will also be discussed. We will end section 2 by presenting, in 2.4, the main characteristics of the so-called back-propagation networks, which have been used extensively in practical applications. This type of network was also used for NETspraak.

2.1 The Main Features of Classical Methods of Modelling

At the basis of classical methods of modelling, there is always a system of explicit rules according to which symbolic expressions are manipulated. This can be illustrated by taking a closer look at traditional approaches to the problem of grapheme-to-phoneme conversion. Let us begin by presenting some background information on this problem. It is a well known fact that the spelling systems for natural languages such as English and Dutch are far from providing one-to-one correspondences with the sounds they are supposed to represent. One letter may be used to represent several different sounds. Thus the e's in the word eleven [ile.v'n] all represent different sounds4. The opposite situation also occurs: the letters c and k in scorn and skip represent an identical sound. Often a combination of letters is used to represent a single sound, as in knight, where the sound corresponding to kn is identical to the n in night. Sometimes letters are used that are not pronounced at all, as in bomb, where the second b is not realized. If we do not restrict our attention to isolated words, but take them in their natural context, we should take into account all kinds of sandhi phenomena, as in bread and butter, where and is pronounced ['n].

4 We follow the convention used in linguistics of placing texts in phonetic script between square brackets.
An instance of sandhi that is very common in Dutch is the assimilation of voice as in is de [iz de] (English: is the) versus is te [is te] (English: is too). Similar examples in Dutch can be given for all the above examples, although the frequency with which these phenomena occur in each language may differ. English and Dutch do differ in the following respect (Bloomfield, 1933:114). In Dutch, a single consonant before the vowel of a stressed syllable always shares in the loudness, regardless of word-division or other factors of meaning: een aam (measure of forty gallons) and een naam (a name) are both [e'na:m]. In English we have an aim [en 'ejm] versus a name [e 'nejm]. In order to indicate how a text is pronounced we make use of the phonetic alphabet presented in Figure 10. Every phoneme is represented in our notation by means of two characters. How these codes relate to the International Phonetic Alphabet (IPA, 1949) can be seen from the same matrix in Figure 10: in the left-hand column of the matrix one finds the two-place code we have been using, while in the right-hand column its IPA equivalent is given. Using this phonetic alphabet, we can indicate that the name of the Dutch town of Enschede is pronounced as [E.n.s.x.&.d.e:].

In classical approaches, symbolic rule systems are used, with rules such as that in (1):

    (a) n (p, b, m) → [m.]
    (b) n (k, g)    → [ng]          (1)
    (c) n           → [n.]

This rule for converting the grapheme n is to be interpreted as follows. The grapheme n is pronounced as [m.] if followed by one of the graphemes p, b or m, and as [ng] if followed by one of the graphemes k or g. In all other cases it is pronounced [n.]. Here, our interest is not primarily whether (1) provides an adequate and correct account of the facts, but in illustrating the fact that modelling the skill in question consists of defining an adequate set of rules of the type given in (1). Such a rule system is considered to be an adequate one if it enables us to convert any Dutch text into its correct phonetic representation mechanically5, that is to say without having to depend on (implicit) knowledge on our part. If the rules are formulated in terms of a computer programme that is able to convert texts into their phonetic representation, we have a handy means to check whether, in converting a text, we are making use exclusively of the rules proposed. Such an approach would, moreover, provide us with a useful product, which could be used, for example, for automatically producing spoken newspapers for the blind. Providing such a system of adequate and explicit rules, however, is by no means a trivial matter. It would be very convenient indeed if a system were available that could learn the skill involved purely on the basis of an example consisting of a text and its correct phonetic transcription. Neural networks do in fact seem to provide us with the means to achieve this: they are capable of grasping the underlying rule system on the basis of examples given to them.

5 Rule systems in which reference is made to phonetic features to express various kinds of linguistic generalization, such as the rule in Dutch phonology known as Final Devoicing, are instances of symbol manipulating rules: [-son] → [-voice] / _#
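To make the mechanical character of such a rule system concrete, here is a minimal sketch of how rule (1) could be applied in code (the function name and data layout are our own illustration, not part of any actual converter; the phoneme codes follow the two-character notation used above):

    # A sketch of rule (1): the phoneme chosen for the grapheme "n"
    # depends only on the grapheme that follows it.
    def convert_n(next_grapheme):
        """Convert the grapheme n according to rules (1a)-(1c)."""
        if next_grapheme in ("p", "b", "m"):
            return "m."  # rule (1a): n before p, b or m
        if next_grapheme in ("k", "g"):
            return "ng"  # rule (1b): n before k or g
        return "n."      # rule (1c): all other cases

    print(convert_n("k"))  # "ng", as in a word like "bank"
    print(convert_n("e"))  # "n."

A full converter would need one such rule set per grapheme, written and maintained by hand; this is precisely the burden that the training approach described next tries to avoid.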
In section 3 we present a neural network which can be trained to convert written text into a phonetic representation on the basis of a few pages of sample texts.

2.2 Some Neuro-Physiological Facts about the Brain

As we have already said, the architecture of neural networks resembles in some respects that of the human brain, which was the original inspiration for them, although it should be stressed from the outset that neural networks are not meant to constitute a model for the workings of the human brain.

Fig. 1. The structure of a neuron

The human brain forms a massive communications network, consisting of billions of nerve cells, also known as neurons. Many different types of neurons are known. The structure of the individual neuron is rather simple. As can be seen from Figure 1, we can distinguish three main elements: the cell body, a (large) number of dendrites, and an axon. The offshoots of an axon are connected by so-called synapses to the dendrites of many other neurons. In functional terms a neuron can be seen as a processing unit receiving incoming impulses via the dendrites. These impulses or electrical currents may vary in frequency, but not in intensity. If the number of incoming impulses within a certain period of time exceeds a certain threshold value, the neuron will fire off an impulse via its axon. Both activating and inhibiting impulses can be fired. Since incoming signals are added together, activating and inhibiting impulses can cancel one another out partially or completely. When a neuron fires, it transmits its impulse via the axon and synapses to other neurons. The frequency and the nature of the impulse transmitted (activating or inhibiting) is largely determined by the synapses. The various individual building blocks of the brain are relatively simple units that decide, on the basis of incoming signals, whether or not to transmit a signal. Whereas these building blocks are relatively simple, the system as a whole is
incredibly complex, due to the enormous number of neurons, the number of interconnections between them (the number of dendrites for a single neuron may amount to 200,000) and the fact that all the neurons function autonomously and in parallel6. This great number of connections is essential, since the learning process in the brain depends on the growth of new connections or the breaking up of existing ones. In this process the synapses play an important role.

2.3 The Basic Architecture of a Neural Network

Many mathematical models for the (human) brain have been developed. Although they may differ considerably from one another in detail, they have the following minimum characteristics in common. The basic unit of a neural network is the processing unit. Below we give a brief description of the basic elements of the processing unit (cf. Figure 2). Between brackets we will, where possible, mention the analogous structure in the human brain.

Fig. 2. The components of a processing unit (inputs i1-i3, weights, summation, threshold value, output o1)

- A number of input values coming from another processing unit or from outside (dendrites);
- So-called weights indicating the degree of influence of the input value on the processing unit in question (the frequency and the nature of the signal transmitted via the synapses);

6 In this paper we will pay very little attention to the parallel aspect of neural networks. In many cases, including NETspraak, parallelism is simulated in non-parallel systems. In principle, parallel systems can be developed that can perform the same functions as NETspraak, but many times faster.
- A summation function, usually the weighted sum of the input values, here: w1 * i1 + w2 * i2 + w3 * i3 (the summation of incoming values in the neuron);
- A threshold value: if the resulting sum reaches this threshold value, the signal will be transmitted, otherwise it will not (the threshold value that the summation of the incoming signals must exceed if the neuron is to fire);
- An output signal (the signal exiting via the axon).

Let us illustrate this with the workings of a very simple neural network for modelling the logical connective AND. From classical predicate calculus we know that the conjunction of two propositions P1 and P2 by means of the logical connective AND yields a true proposition if and only if both P1 and P2 are true. If we represent TRUE by 1 and FALSE by 0, then only the input (1 1) to the neural network should result in a value of 1 for the output signal; the input pairs (1 0), (0 1), and (0 0) would have to result in an output signal with the value 0. The simple network in Figure 3 appears to be adequate for modelling the logical connective AND. This network consists of three processing units U1, U2 and U3. U1 and U2 are so-called input units, U3 is an output unit. The interconnections are indicated by lines connecting the units. The weight of the connection between U1 and U3 is 0.7, in other words w13 = 0.7; furthermore we have w23 = 0.7.

Fig. 3. A neural network for the logical connective AND (w13 = w23 = 0.7; threshold value = 1)

In all the networks in Figures 3-6, the threshold value in the processing units is equal to 1, as is the strength of the signals transmitted. Transmitting no signal at all can be regarded as sending a signal of strength 0. If, and only if, the sum of the weighted input signals is greater than or equal to 1, a signal (of strength 1) is transmitted.
Let us examine what happens in case of an input pair i1 = 1 and i2 = 1 (that is to say, both P1 and P2 are TRUE). In this case unit 1 receives a signal of strength 1; since the threshold value is 1, a signal of strength 1 is transmitted to unit 3. In a similar way, unit 2 will transmit a signal of strength 1 to unit 3. In order to determine whether or not unit 3 will transmit a signal, we must calculate the weighted sum of the input signals to unit 3:

    w1 * i1 + w2 * i2 = 0.7 * 1 + 0.7 * 1 = 1.4

We see that the weighted sum exceeds the threshold value 1 and so unit 3 will transmit a signal of strength 1. In case i1 = 1 and i2 = 0 the result will be 0, because the weighted sum of the input signals for unit 3 is

    0.7 * 1 + 0.7 * 0 = 0.7

The weighted sum is smaller than the threshold value and therefore unit 3 will not fire, resulting in an output value for unit 3 of 0. The reader may easily verify that the network yields correct results for the other possible input combinations (0 1) and (0 0). By merely adjusting the weights in the network in Figure 3, we can easily adapt the network for another logical connective such as the inclusive OR.

Fig. 4. A neural network for inclusive OR (w13 = w23 = 1.4; threshold value = 1)

The proposition P1 OR P2 is FALSE if, and only if, both P1 and P2 are FALSE. The network in Figure 4, which models the inclusive OR, differs only from the one in Figure 3 in that the weights w13 and w23 have been adjusted (they have both been set to the value 1.4).
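The behaviour of both networks is easy to reproduce in a few lines of code. The following minimal sketch (the function and variable names are our own, chosen purely for illustration) computes the output of the single output unit for the AND weights of Figure 3 and the OR weights of Figure 4:

    # A sketch of the threshold units of Figures 3 and 4.
    def unit_output(inputs, weights, threshold=1.0):
        """Fire (output 1) iff the weighted input sum reaches the threshold."""
        weighted_sum = sum(w * i for w, i in zip(weights, inputs))
        return 1 if weighted_sum >= threshold else 0

    AND_WEIGHTS = [0.7, 0.7]  # Figure 3
    OR_WEIGHTS = [1.4, 1.4]   # Figure 4

    for pair in [(1, 1), (1, 0), (0, 1), (0, 0)]:
        print(pair, unit_output(pair, AND_WEIGHTS), unit_output(pair, OR_WEIGHTS))
    # (1, 1): 1 1;  (1, 0): 0 1;  (0, 1): 0 1;  (0, 0): 0 0

Running the loop prints the AND and OR truth tables side by side, confirming that only the weights distinguish the two networks.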
Determining the adequate value for the weights in order to make the network suitable for a specific task is not difficult in the case of such simple networks, but for more complex cases it is far from easy. We have already alluded to the possibility of training a network for a particular task. This would free us from the task of determining the correct weights "manually". In order to train a network, so-called learning rules are used. By using such a rule, it becomes possible to transform the AND network of Figure 3 into the OR network of Figure 4 automatically. Training a network really amounts to adjusting the weights in response to incorrect results. At the basis of all this is the following principle: the degree to which a connection has contributed to a particular error determines to what degree the weight associated with the connection in question will be adjusted. The so-called Delta rule is one of the learning rules based on this principle. It can be formulated as follows:

    Δwij = c * (gj - aj) * ai

where

    Δwij : the change in the weight associated with the connection between the processing units i and j;
    c    : a so-called learning constant (for this a value of 0.35 has proved adequate in practice);
    gj   : the activity desired for unit j [goal];
    aj   : the current activity of output element [= unit] j;
    ai   : the current activity of input element [= unit] i.

Let us see how the Delta rule given above can be used to change the AND network of Figure 3 into an OR network; we will use the following learning material:

    Input:    Desired output (goal):
    1 1       1
    1 0       1
    0 1       1
    0 0       0

This means that if the input to the network consists of the pair (1 1), the desired output value is 1, etc. Starting from the AND network of Figure 3, we want to obtain an OR network using the above mentioned Delta rule by presenting the network with both the input values and the desired output values. For the network of Figure 3 the following holds true: w13 = 0.7 and w23 = 0.7. When presented with the first input pair of the learning material, (1 1), the network's output will equal 1: the value desired. Application of the Delta rule will result in no change to the network because gj - aj = 0. This need not surprise us,
since in the case of both propositions being true, there is no difference between the logical connectives AND and OR. As far as the input pair (1 1) is concerned, the network is correct and no weights need to be adjusted. In the case of the input pair (1 0) matters are different: the output value desired is 1, whereas the network will yield an output value of 0 (it is after all still an AND network!). Application of the Delta rule yields:

$$\Delta w_{13} = 0.35 \cdot (1 - 0) \cdot 1 = 0.35$$
$$\Delta w_{23} = 0.35 \cdot (1 - 0) \cdot 0 = 0$$

This will give us the following values: $w_{13} = 0.7 + 0.35 = 1.05$ and $w_{23} = 0.7 + 0 = 0.7$. Application of the Delta rule to the input pair (0 1) yields:

$$\Delta w_{13} = 0.35 \cdot (1 - 0) \cdot 0 = 0$$
$$\Delta w_{23} = 0.35 \cdot (1 - 0) \cdot 1 = 0.35$$

This will give us the weights $w_{13} = 1.05 + 0 = 1.05$ and $w_{23} = 0.7 + 0.35 = 1.05$. The result of presenting the input pair (0 0) is 0. In this case the weights are not adjusted since $a_i$ equals 0 in the expression $c \cdot (g_j - a_j) \cdot a_i$. After having presented the learning material once to the network, the weights have the values in Figure 5.

Fig. 5. An OR network trained with the help of the Delta rule (threshold value = 1)

It will be clear that the original weights for the AND network in Figure 3 have been changed in the right direction (i.e. they resemble more closely the weights in Figure 4), but one may ask whether we are already dealing with a real OR network.
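The whole worked example can be condensed into a few lines of Python; the following sketch (ours) repeats the calculation above and ends with the weights $w_{13} = w_{23} = 1.05$:

def output(w13, w23, i1, i2, threshold=1.0):
    return 1 if w13 * i1 + w23 * i2 >= threshold else 0

patterns = [((1, 1), 1), ((1, 0), 1), ((0, 1), 1), ((0, 0), 0)]
w13, w23 = 0.7, 0.7             # the AND network of Figure 3
c = 0.35                        # learning constant

for (i1, i2), goal in patterns: # one pass through the learning material
    a = output(w13, w23, i1, i2)
    w13 += c * (goal - a) * i1  # Delta rule: c * (g_j - a_j) * a_i
    w23 += c * (goal - a) * i2
print(w13, w23)                 # 1.05 1.05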
The reader may easily verify that this is indeed the case. Presenting the learning material to the network once again will no longer affect the weights of the network. From a comparison of the networks in Figure 4 and Figure 5, we conclude that the same functionality may be achieved by assigning different weights. This holds even more true for more complicated networks. In the above example, training the network was a very simple matter. We will see later that more learning material is often required and the same learning material may have to be presented many times in succession. Using the Delta rule we have been training a network in correctly performing a particular task. It is not possible, however, to adapt the network in Figure 3 so as to make it fit for modelling exclusive OR (henceforth XOR). The proposition P1 XOR P2 is true if, and only if, exactly one of its constituent parts is true. It can be shown that networks consisting solely of input units and output units are inadequate for the modelling of non-linearly classifiable problems7. If we do not restrict ourselves to the use of input and output units, and introduce one or more so-called "hidden layers", defining an XOR network no longer presents a problem. The processing units in the hidden layer perform the role of recognizing the relevant abstract characteristics. In the XOR network in Figure 6, U3 functions as the recognizer of a situation in which only $i_1 = 1$, and in which therefore $i_2 = 0$. The reader may easily verify that the network in Figure 6 correctly models the XOR connective. In defining the XOR network of Figure 6 the problem has been solved, however, only in part: we will also have to define a new learning rule. In 1986, Rumelhart, Hinton and Williams (1986) and Parker (1986) independently formulated an extension of the Delta rule, the so-called error propagation learning rule8, which plays an essential role in back-propagation networks.

2.4 Back-Propagation Networks

The error propagation learning rule can be applied in networks having the following characteristics: the network has one or more hidden layers; all units within a layer are connected to all units of the next layer; there are no connections between non-successive layers. These restrictions result in a network architecture as shown in Figure 7. An important feature of back-propagation networks is furthermore that signals of arbitrary strength between 0 and 1 can be fired; one is no longer restricted to the integer values 0 and 1. In modelling the logical connectives, the continuity of the input and output signals is not generally used: signals with a value less than 0.5 are interpreted as negative values, signals with a value greater than 0.5 as positive.

7 See (Minsky and Papert, 1969)
8 For details on the error propagation learning rule see the paper by Henseler in the present volume.
Fig. 6. An XOR network with one hidden layer (threshold value = 1)

Fig. 7. Back-propagation network with one hidden layer
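Since the weights of the network in Figure 6 are not reproduced here, the following Python sketch uses one possible weight assignment of our own: a hidden threshold unit detects the situation in which both inputs are 1 and then inhibits the output unit, which otherwise fires whenever at least one input is on.

def unit(inputs, weights, threshold=1.0):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def xor_net(i1, i2):
    h = unit((i1, i2), (1.0, 1.0), threshold=2.0)  # hidden unit: AND(i1, i2)
    return unit((i1, i2, h), (1.0, 1.0, -2.0))     # fires iff i1 + i2 - 2h >= 1

for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, "->", xor_net(i1, i2))       # 0, 1, 1, 0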
Thus, if the pair (1 1) is presented as input to an adequate XOR back-propagation network, the output value must be less than 0.5; a value of 0.61 would indicate that the network does not perform adequately. In back-propagation networks too, the value of the output signal is calculated by taking the weighted sum of the values of the input signals for all units in each of the layers. During the training stage, the result thus achieved can be compared to the result desired, after which the weights of the connections can be adjusted if necessary in accordance with the error propagation learning rule. Although there is no upper limit to the number of hidden layers, in practice a single hidden layer will usually suffice. This is also true of the skill we wish to model by means of a neural network: grapheme-phoneme conversion. As we will see later, a back-propagation network with a single hidden layer appeared to be adequate. The architecture eventually used for the network (the number of input and output units, and the number of units in the hidden layer) depends to a large extent on the choice of "translation" of the skill to be modelled, in terms of input and output signals. In the following section we will elaborate on the problem to be modelled, and we will discuss the architecture of the grapheme-phoneme network designed for this task.

3 NETspraak

The attempt to use the computer for automatic grapheme-phoneme conversion is by no means new. Wester and Kerkhoff from the research group "Language and Speech" at Nijmegen University have developed a conversion system for Dutch. An evaluation of this system, in which its performance was tested by presenting it with words in isolation, is given in (Willemse, 1987). An evaluation with the explicit aim of assessing the usefulness of this system in producing a spoken journal for the blind is to be found in (Bezooijen, 1989). From this source it is possible to gain a clear impression of the performance of this system when applied to running text. For the conversion of English texts into a phonetic representation, the rule-based expert system DECtalk, developed by Digital Equipment Corporation, is available commercially. As we saw before, NETtalk and NETspraak differ crucially from these traditional approaches in that no explicit rule system is used. The knowledge possessed by someone who can read a Dutch or English text aloud is not stated explicitly and cannot be traced to any unambiguously identifiable part of the network. In the learning stage, the network is presented with a text a number of times and during each cycle of this process it makes guesses as to the best way to represent any given grapheme of the text by a phonetic symbol. The result predicted by the network is then compared to the result desired, and the weights are adjusted if necessary. The type of back-propagation network discussed in subsection 2.4 can be made fit for grapheme-phoneme conversion by formulating the problem in terms of input to and output from the network.
How this can be done is discussed in subsection 3.1. As we have already mentioned, training a network takes place by means of a learning text. It is important that, after the training stage, the performance of the network is measured by testing it on text material other than that used in the learning stage, for we are not primarily interested in finding out whether the network is able to learn the peculiarities of the learning text. What we really want to know is whether the network is able to make significant generalizations. The choice of learning text and test text, the problems we encountered in transcribing the material, and the solutions we have chosen are the subject of subsection 3.2. It usually makes sense to present a particular learning text to a network many times in succession. NETspraak went through the learning text 55 times. In subsection 3.3 we discuss the learning path that NETspraak follows and we compare the results with those achieved by NETtalk and by two traditional systems: the INF-KUN system and GRAFON (Daelemans, 1985, 1988).

3.1 Input and Output for NETspraak

From the texts presented to NETspraak, 28 graphemes are taken into consideration: the 26 letters of the alphabet, the space and the period; all other characters are ignored. Each grapheme is assigned a unique binary representation consisting of 28 binary digits. From Figure 8 it can be seen how the 26 letters of the alphabet, the space and the period are represented as strings of 1's and 0's. This method of representing the graphemes enables one to enter the data from the learning text into the network.

'a' = (1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
'b' = (0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
...
'y' = (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0)
'z' = (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0)
' ' = (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0)
'.' = (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1)

Fig. 8. The binary representation of the graphemes

The input part of NETspraak can best be seen as a seven-character window that slides over the text to be transcribed. Henceforth we will refer to this window as a heptagram. The intention is that the fourth grapheme in the heptagram will eventually be transcribed by the network as a phonetic sign, whereas the first and last three graphemes offer the necessary contextual information.
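A sketch (ours) of this local coding in Python; the 28-bit codes of the seven graphemes in a window are simply concatenated, anticipating the 7 * 28 = 196 input units discussed below:

GRAPHEMES = "abcdefghijklmnopqrstuvwxyz ."       # 26 letters, space, period

def encode_grapheme(ch):
    code = [0] * len(GRAPHEMES)                   # 28 binary digits
    code[GRAPHEMES.index(ch)] = 1
    return code

def encode_heptagram(window):                     # window: 7 graphemes
    return [bit for ch in window for bit in encode_grapheme(ch)]

print(len(encode_heptagram("t was g")))           # 196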
Since the representation of each grapheme requires 28 cells, a total number of 7 * 28 = 196 input units is used. This method of representing the input for NETspraak offers a good example of a so-called local representation: different characters are represented by activities in separate input units. The hidden layer of NETspraak consists of 20 hidden units9. Each of the 196 units in the input layer has been connected to each of the 20 units in the hidden layer. The output component of NETspraak should be designed in such a way as to be able to represent the phonetic value of the grapheme in the fourth position of the input window. We have chosen 22 units corresponding to 21 phonetic features and one dummy feature introduced for the sake of representing the space and the null-phoneme that will be discussed later. The way the output of NETspraak has been designed is a good example of a distributed representation: each phonetic element has been represented by means of a pattern of active output units. Figure 9 provides a schematic representation of the design underlying NETspraak.

Fig. 9. Schematic design of the NETspraak network: input units i1 ... i196 (28 per grapheme position, i.e. i1 ... i28 for grapheme 1, i85 ... i112 for grapheme 4, i169 ... i196 for grapheme 7), hidden units h1 ... h20, and output units o1 ... o22

The knowledge that NETspraak acquires by passing through the learning cycle is stored in the real numbers representing the weights of each of the (196 * 20) + (20 * 22) = 4360 connections between the various units. Before the first training cycle, these 4360 weights of NETspraak are initialized at random values between -0.5 and +0.5.

9 The choice of the number of hidden units is somewhat arbitrary. The 20 chosen for NETspraak appear to be adequate for Dutch. In NETtalk (Sejnowski and Rosenberg, 1987) 120 hidden units have been used in a number of cases.
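The initialization just described can be sketched as follows (ours; plain Python lists, no particular library assumed):

import random

N_IN, N_HIDDEN, N_OUT = 196, 20, 22

w_in_hidden = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)]
               for _ in range(N_HIDDEN)]          # 196 * 20 weights
w_hidden_out = [[random.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)]
                for _ in range(N_OUT)]            # 20 * 22 weights

print(N_IN * N_HIDDEN + N_HIDDEN * N_OUT)         # 4360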
Fig. 10. The feature matrix: for each of the 56 phonemes used in the transcription, a row of 0's and 1's specifying the 21 phonetic features plus the dummy feature "space"; the two-place code (0., t., w., A., s., n., ..., ' ') is given in the left-hand column, a label (e.g. vow, cons) above each feature column, and the IPA equivalent in the right-hand column
As we have seen already, the output layer of NETspraak consists of 22 units whose activation value a (0 < a < 1) is interpreted as indicating the presence or absence of a phonetic feature. Figure 10 lists the feature specifications for each of the 56 phonemes needed in the transcription of the texts. In the left-hand column the two-place code we have been using is given, while its IPA equivalent (see IPA, 1949) is given in the right-hand column. Above each column there is a label indicating the phonetic feature in question. We can see for example from the third line that the phoneme coded as "t." is negatively specified for the feature vowel (the column vow has 0 in the third line), whereas it is positively specified for the feature consonant (the column cons has 1)10.

10 The data in the matrix of Figure 10 are taken from (Hoppenbrouwers and Hoppenbrouwers, 1993: 366).

The first line of the matrix in Figure 10 contains the specification of the null-phoneme ("0." in our coding system) discussed in the next section. Line 56 lists the specification of the space ("\" in our coding). Both the null-phoneme and the space are negatively specified for every phonetic feature. They are discriminated by their value for the dummy feature "space". On the basis of the information presented so far, we can illustrate what happens during each of NETspraak's learning cycles. Suppose that we have a Dutch text as in (1) with its transcription as in (2) which will be presented to NETspraak as a learning text. Since our phonetic transcription uses two-place codes, we have inserted hyphens in order to assist the reader in keeping track of graphemes and their corresponding phonetic symbols:

(1) D-a-t- -w-a-s- -g-e-k- ('That was funny')
(2) d.A.t. \W.A.s. \x.E.k.

During a learning cycle the heptagram window is slid along the learning text. At a certain point the window contains the following heptagram: |t was g|. The 28-place codes for t, space, w, a, s, space and g respectively are entered into the 7 * 28 input units of the network. Some of these 196 input units are activated. They therefore will fire and activate some of the units in the hidden layer. These, in turn, may activate to some degree the 22 units in the output layer. Since the network is in its learning stage, the information is available that the grapheme in the fourth position of the heptagram, "a", is to be transcribed as [A.]. The network now determines which output units differ from the feature specification for [A.] and the values for the weights in the network are adjusted according to the error propagation learning rule mentioned above. The heptagram window moves one position over the text, so it changes to | was ge|, and the process is repeated. In the test stage the heptagram window slides over the text too, but now the network must find the correct phoneme by itself on the basis of the values of the output units. In the test stage the weights are not adjusted anymore. In Figure 11 a possible constellation of output values is given. On the basis of this constellation a phoneme is searched for in the matrix of Figure 10 that most resembles the given constellation of feature values in the 22 output units.

.00 .00 .01 .00 .00 .00 .01 .00 .01 .00 .94 .23 .00 .01 .00 .00 .00 .02 .00 .00 .00 1.00

Fig. 11. A possible constellation of output values

The sum of squares of the differences between the values in the output units and those in the corresponding columns of the feature matrix is used as a similarity measure. Using this measure will yield [p.] in line 25 of the feature matrix as most closely resembling the constellation in Figure 11.
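The phoneme look-up can be sketched as follows (ours); feature_matrix is a hypothetical mapping from the two-place phoneme codes of Figure 10 to their 22 feature values:

def closest_phoneme(outputs, feature_matrix):
    # Choose the phoneme whose feature row has the smallest sum of
    # squared differences to the 22 output values.
    def ssd(row):
        return sum((o - f) ** 2 for o, f in zip(outputs, row))
    return min(feature_matrix, key=lambda code: ssd(feature_matrix[code]))

# e.g. closest_phoneme(network_outputs, feature_matrix) would yield "p."
# for the constellation of Figure 11.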
3.2 The Learning Text and the Test Text

Both theoretical considerations and the possible practical applications make clear that the use of plain running text is to be preferred to the use of isolated words, since only in the former case can assimilation phenomena such as those discussed in subsection 2.1 be taken into consideration. For this reason, we have chosen to present NETspraak with a learning text consisting of the first ten pages of De Avonden, a famous Dutch novel by Gerard Reve (1987), and its phonetic transcription (22071 graphemes, 4040 words). As a test text we have chosen the eleventh page of that novel (2355 graphemes, 457 words). Of course a transcription of the latter page had to be available in order to make an assessment of the performance of the network possible. The testing of NETspraak's results by presenting it with a test text was motivated by our wish to gain some insight into NETspraak's ability to reach significant generalizations. In this context it is useful to note that 32% of the heptagrams in the test text occurred in the learning text at least once, whereas 68% were new. The period was the only punctuation mark that was taken into account: it was included in the contextual information, but the network did not have to assign it a phonetic interpretation. A minor difficulty we encountered was posed by the necessity to enforce a one-to-one correspondence between the graphemes of a text and the phonetic symbols in the transcription: as in most orthographies, a cluster of graphemes in Dutch is often used to represent a single phoneme. We solved this problem by introducing a null phoneme, indicated as [0.]. As can be seen from Figure 10, this symbol is negatively specified for all features. Thus the word wekker ('alarm clock') is transcribed as in (4):

(3) W-e-k-k-e-r- ('alarm clock')
(4) W.E.k.0.&.r.

In order to avoid inconsistencies in the transcription process11, we have adhered to the following rule: if a sound is represented by a cluster of graphemes, the symbol representing that sound is assigned to the first grapheme in the cluster, the other graphemes being assigned the null phoneme.

11 It must be said, however, that one of the reasons that neural networks appear to be so promising has to do precisely with the fact that they are not very sensitive to exceptions, inconsistencies and errors.
In (3) and (4) we find an application of this rule12. The opposite situation, one grapheme that is to be represented by two phonemes, is rare. The example in (5)-(6) illustrates this:

(5) m-e-l-k- ('milk')
(6) m.E.l.&.k.

In our texts no such cases actually happened to occur. In (Sejnowski and Rosenberg, 1987) such cases were dealt with by adding a new symbol to the set of phonemes so as to represent the combination of signs. For the case illustrated in (5) and (6), therefore, the symbol [L.] could be introduced to represent the combination [l.&.].

3.3 The Learning Path

The learning text discussed in the previous section was presented to the network many times in succession. During each learning cycle, the heptagram window slid successively over the 22071 graphemes, calculated at each step the assumed phoneme specification for the fourth grapheme in the window, compared this specification with the specification of the correct phoneme provided by us, and adjusted the weights if necessary according to the error propagation rule. At the end of each cycle, the percentage of correct grapheme-phoneme assignments was calculated both for the learning text and for the test text. In Figure 12 the learning graph is shown for 55 cycles. The graph in Figure 12 is typical of back-propagation networks. One often finds that after a starting period with rapid improvement, a stage of stabilization sets in, in which the distance between the results on the test text and those on the learning text remains the same. At a later stage, the results on the learning text may improve a little more, whereas those on the test text deteriorate.

12 In some cases application of this rule led to solutions that are hardly plausible if one wishes to take into account information on syllabic structure. In the following example, assigning the phoneme [t.] to the grapheme d of the word dat ('that') appears to be more natural than assigning it to the grapheme d of the word vindt ('finds') as we now do by adhering to the rule given above.

H-i-j- -v-i-n-d-t- -d-a-t- -n-i-k-s- ('He does not like it')
h.Ei0. v.I.n.t.0. 0.A.t. n.I.k.s.

Two remarks are relevant here. Since neither the null phoneme nor the space is realized phonetically, the above transcription is equivalent to the transcription given below, which can be produced simply by leaving out the null phoneme and the space:

h.Eiv.I.n.t.A.t.n.I.k.s.

Furthermore, nothing indicates that the network would behave differently if a different transcription convention were adhered to. The network also learns the transcription conventions from its examples.
Fig. 12. Learning graph for NETspraak: the percentage of correct grapheme-phoneme assignments (75.0-100.0%) on the learning text and on the test text, plotted against the number of cycles (0-55)

A straightforward explanation for this phenomenon would be that the network initially makes significant generalizations that also lead to better performance on the test text, whereas the improvement on the learning text and the deterioration on the test text in the later stage might be ascribed to the fact that the network adapts itself more and more to the idiosyncrasies of the learning text. If, as in our case, many training examples are available, it is possible to cut off the learning process when the results on the training material are more or less stable (after the 40th cycle). If not very many training examples are available, a so-called validation set is used to evaluate the performance of the network during training: the training terminates when the maximum performance on the validation set is reached. The ultimate performance of the trained network is then calculated on the basis of its performance on the test material. For more information on rules and conventions for carrying out artificial neural network experiments, see (Prechelt, 1994). The fact that the drop in the learning graph for NETspraak is not very clear-cut might have something to do with the size of the learning text. Because of this size, the network will have some difficulty in adjusting itself to the idiosyncrasies of the learning text. Since the results of NETspraak on the training text are more or less stable after the 40th cycle, we will restrict our attention to the results achieved during this cycle.
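The training regime described above can be sketched schematically as follows (ours); train_one_cycle and score are hypothetical callables standing for one back-propagation pass over the learning material and for the percentage of correct grapheme-phoneme assignments:

def train_with_validation(train_one_cycle, score, net, learning_data,
                          validation_data, max_cycles=55, patience=5):
    # Keep cycling; stop when the validation score has not improved
    # for `patience` cycles, and report the best cycle found.
    best_score, best_cycle = 0.0, 0
    for cycle in range(1, max_cycles + 1):
        train_one_cycle(net, learning_data)
        s = score(net, validation_data)
        if s > best_score:
            best_score, best_cycle = s, cycle
        elif cycle - best_cycle >= patience:
            break
    return best_score, best_cycle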
A remarkable aspect of the learning graph in Figure 12 is the high score on the test text: 96.7%, where (Sejnowski and Rosenberg, 1987) reported a score of 80% on similar material in English. Apart from the nature of English orthography and the fact that NETtalk also takes account of stress assignment, the fact that our learning text was four times larger may well play a role. The performance of NETspraak is the more remarkable since in the learning text a number of French words occur that initially confused the network, but which it could eventually deal with very well: La favorite van Couperin (Reve, 1987, page 16). In order to enable the reader to gain an impression of the performance of NETspraak, we have provided a representative sample of the test text in Figure 13, together with the transcription provided by us and that provided by NETspraak (after 41 cycles through the learning text). Where the transcription produced by NETspraak differs from ours, it is printed in bold face. Note that NETspraak appears to know whether the word hij is to be transcribed as [h.Ei] or as [i.]13! The personal pronoun hij ('he') can be reduced to [i.] only if it is directly preceded by the finite verb. The mistakes that NETspraak makes are not usually very serious ones. None of the four mistakes shown in Figure 13 is wildly out. In three cases we are dealing with a difference of only one phonetic feature (s-z., z-s., g-x.); in the other case the grapheme d is transcribed as [t.] where the null phoneme should have been chosen. Once in a while, however, NETspraak really makes a blunder. In a part of the text not reproduced here, for example, the personal pronoun U ('you') is transcribed as [&.] instead of [y.]. In defence of NETspraak we note that the word U appears only once in the learning text. We were surprised to observe that the space was once transcribed as [p.]. The reason for this has to do with the fact that no other phoneme in Figure 10 has been negatively specified for so many features, whereas the space is negatively specified for all phonetic features. Closer inspection reveals that the value of the output unit corresponding to the feature consonant was high; a moderate value for the feature anterior proved sufficient to choose [p.].

4 Evaluation

In subsection 4.1 we discuss briefly a number of ways to further improve NETspraak's performance. In subsection 4.2 we address the question of whether the modelling approach using neural networks is indeed as promising as we may be inclined to believe on the basis of the results achieved so far.

4.1 Ways to Improve the Performance of NETspraak

As can be seen from the learning graph in Figure 12, NETspraak achieved its best score on the test text after 41 cycles: 96.7% of the graphemes in the test text were transcribed correctly.

13 We should take into account that the distinction between upper case and lower case has not been allowed for. It is, therefore, not on the basis of such information that NETspraak is able to distinguish between the two uses of the word hij.
... H-O-E- -Z-A-L- -H-E-T- -G-A-A-N- -D-A-C-H-T- -F-R-I-T-S-.- ...
... h.u.0. \z.A.l. \0.&.t. \x.a:0.n. \d.A.x.0.t. \f.r.I.t.s.2. ...
... h.u.0. \z.A.l. \0.&.t. \x.a:0.n. \d.A.x.0.t. \f.r.I.t.s.2. ...

H-O-E- -Z-A-L- -H-E-T- -G-A-A-N- -Z-E-I- -H-I-J-.- -E-R-
h.u.0. \z.A.l. \0.&.t. \x.a:0.n. \z.Ei0. \0.i.0.2. \&.r.
h.u.0. \z.A.l. \0.&.t. \x.a:0.n. \z.Ei0. \0.i.0.2. \&.r.

W-A-S- -E-V-E-N- -E-E-N- -S-T-I-L-T-E-.- -S-I-N-D-S- -J-O-O-P-
W.A.z. \e:v.&.0. \&.0.n. \s.t.I.l.t.&.2. \s.I.n.0.s. \j.o:0.p.
W.A.z. \e:v.&.0. \&.0.n. \s.t.I.l.t.&.2. \z.I.n.t.s. \j.o:0.p.

U-I-T- -H-U-I-S- -I-S- -V-A-D-E-R- -V-E-R-V-O-L-G-D-E-
Qy0.t. \h.Qy0.z. \I.s. \f.a:d.&.r. \v.&.r.v.O.l.g.d.&.
Qy0.t. \h.Qy0.s. \I.s. \f.a:d.&.r. \v.&.r.v.O.l.g.d.&.

H-I-J- -O-P- -E-E-N- -L-U-C-H-T-I-G-E- -T-O-O-N- -K-A-N-
0.i.0. \O.p. \&.0.n. \l.".x.0.t.&.g.&. \t.o:0.ng \k.A.n.
0.i.0. \O.p. \&.0.n. \l.".x.0.t.&.x.&. \t.o:0.ng \k.A.n.

I-K- -U-I-T-S-T-E-K-E-N-D- -M-E-T- -H-E-M-
I.k. \Qy0.t.s.t.e:k.&.n.t. \m.E.t. \0.&.m.
I.k. \Qy0.t.s.t.e:k.&.n.t. \m.E.t. \0.&.m.

O-P-S-C-H-I-E-T-E-N-.- -J-O-O-P- -G-L-I-M-L-A-C-H-T-E-.-
O.p.s.x.0.i.0.t.&.0.2. \j.o:0.p. \x.l.I.m.l.A.x.0.t.&.2.
O.p.s.x.0.i.0.t.&.0.2. \j.o:0.p. \x.l.I.m.l.A.x.0.t.&.2.

Z-I-J-N- -V-A-D-E-R- -S-C-H-A-K-E-L-D-E- -D-E- -R-A-D-I-O- -I-N-
z.&.0.m. \v.a:d.&.r. \s.x.0.a:k.&.l.d.&. \d.&. \r.a:d.i.o: \I.n.
z.&.0.m. \v.a:d.&.r. \s.x.0.a:k.&.l.d.&. \d.&. \r.a:d.i.o: \I.n.

E-N- -V-O-N-D- -E-E-N- -W-A-L-S-.- -H-I-J- -T-I-K-T-E- -I-N-
E.m. \v.O.n.t. \&.0.m. \W.A.l.s.2. \h.Ei0. \t.I.k.t.&. \I.n.
E.m. \v.O.n.t. \&.0.m. \W.A.l.s.2. \h.Ei0. \t.I.k.t.&. \I.n.

D-E- -M-A-A-T- -M-E-T- -Z-I-J-N- -H-A-N-D- ...
d.&. \m.a:0.t. \m.E.t. \s.&.0.n. \h.A.n.t. ...
d.&. \m.a:0.t. \m.E.t. \s.&.0.n. \h.A.n.t. ...

Fig. 13. A sample of the test text with the transcriptions provided by us and by NETspraak. The first line of each triple contains the text to be transcribed. The second line contains the transcription provided by us, and the third line contains the transcription provided by NETspraak.
This high score, together with the fact referred to earlier that 68% of the heptagrams occurring in this text do not occur in the learning text, appears to support the conclusion that during the learning phase significant generalizations are reached. As we have already seen in subsection 3.3, the score achieved by NETspraak is better than the score of 80% reported for NETtalk when applied to running text. As a possible explanation we suggested (apart from the more problematic nature of English orthography, and the fact that NETtalk can deal with stress phenomena) the fact that our learning text is approximately four times as large as that used to train NETtalk. It is possible that an even better result can be achieved by increasing the size of the learning text. Some indication of the relevance of this remark is provided by Rosenberg (1987), who in a later experiment achieved a much better score by using a set of 16,000 dictionary words to train NETtalk. On a set of 1000 words different from those in the learning material a score of 90% was achieved. A comparison of NETspraak's performance with that of its German counterpart NetzSprech (Dorffner, 1989) does not make very much sense. Dorffner not only restricts himself to the conversion of words in isolation, but, what is worse, he makes no distinction between learning text and test text. As regards the learning text Dorffner reports an "error rate going down to less than 3% for features and less than 10% for letters". A problem that we encountered in comparing the performance of NETspraak with that of traditional systems such as the INF-KUN system (Kerkhoff et al., 1984) and the GRAFON system (Daelemans, 1988) was that the available quantitative data on the performance of these systems are not easy to compare. In principle such a comparison would be possible and very interesting. In such a comparison not only the number of errors but also the nature of the errors should be taken into account. In (Daelemans, 1988, page 13) a score of 99.26% is reported for a text comparable to the test text used by us. One difficulty in comparing the results of the various systems is a direct consequence of the fact that in the case of NETspraak the correct transcription was provided beforehand, whereas in the other approaches the transcription provided by the system was evaluated afterwards. Both W. Daelemans (GRAFON) and W. Senders and J. Kerkhoff (INF-KUN system) have been kind enough to present the text we used in testing NETspraak to the systems developed by them. Using the transcription of this text provided by us beforehand, both traditional systems achieved a score of approximately 96%. The fact that this score is lower than that achieved by NETspraak (96.7%) need not surprise us: the learning text and the test text that we used for NETspraak were transcribed by the same person, giving NETspraak a definite advantage over the other systems. Apart from the possibility, indicated above, of improving NETspraak's performance by increasing the size of the learning text, the following suggestions for further research may prove useful.
For NETspraak we have chosen a 7-grapheme window. Although this choice is not free from arbitrariness, it was partly motivated by the fact that the context in most phonological rules is usually restricted to three segments or less in both directions. A seven-character window therefore appears to be sufficient to enable NETspraak to make significant generalizations. It is to be expected, however, that NETspraak's ability to master exceptions will improve if the size of the window is increased. Another approach that one could choose to improve the performance of the network may require some explanation. The hidden layer used in NETspraak consists of 20 processing units. It is worthwhile experimenting with this number, although such an approach would not be entirely unproblematic. Increasing the number of units in the hidden layer will have a positive effect on the learning capacity, especially as regards the learning text. There is, however, the danger of the weights being adjusted too specifically to the learning text, which would have a deteriorating effect on the performance of the network on the test text. Using a smaller number of units in the hidden layer has a negative influence on the ability of the network to make significant generalizations, whereas using too many hidden units increases the danger of the system becoming fixated on idiosyncrasies of the learning text. Since a sound theoretical basis for deciding on the optimum number of hidden units is not yet available, finding the best constellation will remain a matter of trial and error. This trial-and-error approach is not restricted to this particular aspect of neural networking: the choice of the value for the learning constant (see subsection 2.3) is a case in point. We found that a maximum score of 97.5% could be achieved using 80 units in the hidden layer. Adding more hidden units did not result in any further improvement. Further improvement can be achieved by changing NETspraak's input in the following way: instead of restricting oneself to a representation of the 7 graphemes in the heptagram window as described in subsection 3.1, the feature specification of the grapheme transcribed last could be added to the input to the network. The input would then consist of 196 input units for the heptagram window and 22 input units representing the phonetic features of the grapheme processed last, making a total of 196 + 22 = 218 input units (a sketch of this extended coding is given at the end of this subsection). This would result in an overall network architecture as given in Figure 14. Experiments with a network built according to this scheme with 100 units in the hidden layer resulted, after 28 learning cycles, in a score of 98% on the test text (see Weijters, 1990). Of course this feedback approach need not be restricted to the grapheme processed last, but may be extended to any number of graphemes already processed. In contrast with NETtalk, NETspraak does not pay attention to stress assignment. On the basis of the results reported for NETtalk, an extension of NETspraak in this sense would seem very promising, especially if combined with a feedback approach as discussed above.
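The extended input coding announced above can be sketched as follows (ours; it reuses the hypothetical encode_heptagram and feature_matrix helpers of the earlier sketches):

def encode_with_feedback(window, previous_phoneme, feature_matrix):
    # 196 heptagram bits plus the 22 feature values of the phoneme
    # produced for the previous grapheme: 196 + 22 = 218 input values.
    return encode_heptagram(window) + list(feature_matrix[previous_phoneme])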
Fig. 14. Network architecture for a neural network with feedback: input units i1 ... i196 for graphemes 1-7 of the heptagram, i197 ... i218 for the transcription of the grapheme preceding grapheme 4, a hidden layer h1 ... h100, and the output layer

4.2 The Usefulness of Neural Networks

The practical usefulness of systems such as NETspraak seems beyond doubt. One great advantage of such systems as opposed to symbolic approaches lies in the fact that we can train the same network for another language in a short time without having to change the system. Whereas the implementation of a classical symbol-oriented system for grapheme-phoneme conversion takes years, training a neural network is a matter of days. It is moreover very well conceivable that a neural network which is already operational keeps on learning on the basis of feedback on errors made. The fact that we have been able to model the conversion of graphemes to phonemes rather successfully using a neural network may come as a surprise. However, linguistics is more than phonology, and phonology is more than just the problem of grapheme-phoneme conversion. The question remains whether it is possible to model other linguistic processes with the help of neural networks. For an interesting discussion of these and related questions the reader is referred to (Rumelhart and McClelland, 1986), (Pinker and Prince, 1988) and (Reilly and Sharkey, 1992). As far as the theoretical usefulness of the modelling technique using neural networks is concerned, various positions can be taken. One view holds that this approach, although it might result in systems that may be of some practical use, is of no use whatsoever from a theoretical point of view, since it does not increase in any way our understanding of the cognitive process being modelled. The opposite view would hold that it is apparently possible to model cognitive processes adequately without relying on a system of explicit rules. It may well be that linguists (and scientists in other fields of research) have been looking for rule systems that have no basis in psychological reality.
In cognitive processes no use is made of symbolic representations, nor of rules to manipulate these. Linguistic abilities do not depend on a knowledge of rules (either explicit ones or implicit ones), but result from an intricate network of weight assignments. One might wonder whether this modelling technique is not just another variation on existing statistical approaches. We will not enter that discussion here. For an extensive and critical discussion we refer to (Fodor and Pylyshyn, 1988).

References

R. van Bezooijen (1989) Evaluation of the suitability of Dutch text-to-speech conversion for application in a spoken daily newspaper for the blind. Spinn/ASSP-Report 15, Institute of Phonetic Sciences, University of Amsterdam.
L. Bloomfield (1933) Language. George Allen and Unwin Ltd., London.
W. Daelemans (1985) GRAFON: a system for automatic grapheme to phoneme transliteration and phonological rule testing. Internal report, University of Nijmegen.
W. Daelemans (1988) GRAFON-D: A grapheme-to-phoneme conversion system for Dutch. Proceedings Twelfth International Conference on Computational Linguistics (COLING-88), Budapest, 133-138.
G. Dorffner (1989) Replacing symbolic rule systems with PDP networks. Netzsprech: a German example. Applied Artificial Intelligence, Vol. 3: 45-67.
J.A. Fodor and Z.W. Pylyshyn (1988) Connectionism and cognitive architecture: A critical analysis. In: Cognition 28, 3-71.
C. Hoppenbrouwers and G. Hoppenbrouwers (1993) Feature frequencies and the classification of Dutch dialects. Verhandlungen des Internationalen Dialektologenkongresses Bamberg 1990, Band 1. Wolfgang Viereck (ed.). Franz Steiner Verlag, Stuttgart.
IPA, International Phonetic Association (1949) The Principles of the International Phonetic Association. London (repr. 1978).
J. Kerkhoff, J. Wester and L. Boves (1984) A compiler for implementing the linguistic phase of a text-to-speech conversion system. In: H. Bennis and W.U.S. van Lessen Kloeke (eds.), Linguistics in the Netherlands. Foris Publications, Dordrecht.
M. Minsky and S. Papert (1969) Perceptrons. MIT Press, Cambridge, Mass.
D. Parker (1986) Comparison of algorithms for neuronlike cells. In: Denker (ed.), Neural Networks for Computing. AIP Proceedings 151, New York.
S. Pinker and A. Prince (1988) On language and connectionism: Analysis of parallel distributed processing of language acquisition. In: Cognition 28, 73-193.
L. Prechelt (1994) Proben1 - a set of neural network benchmark problems and benchmarking rules. Technical report 21/94, Fakultät für Informatik, Universität Karlsruhe.
G. Reve (1987) De Avonden. Een winterverhaal (34th edition). Bezige Bij, Amsterdam; originally published 1947.
R.G. Reilly and N.E. Sharkey (1992) Connectionist Approaches to Natural Language Processing. Lawrence Erlbaum Associates, Hillsdale, N.J.
C.R. Rosenberg (1987) Analysis of NETtalk's internal structure. Proceedings of the Ninth Annual Cognitive Science Conference. Seattle, WA.
D. Rumelhart, G. Hinton and R. Williams (1986) Learning internal representations by error propagation. In: Rumelhart and McClelland (eds.), Parallel Distributed Processing, Vol. 1: Foundations. MIT Press, Cambridge, MA, 318-362.
D.E. Rumelhart, J.L. McClelland and the PDP Research Group (eds.) (1986) Parallel Distributed Processing. MIT Press, Cambridge, MA.
T.J. Sejnowski and C.R. Rosenberg (1987) Parallel networks that learn to pronounce English text. Complex Systems, Vol. 1: 145-168.
A.J.M.M. Weijters (1990) NETspraak: a grapheme-to-phoneme conversion network for Dutch. Proceedings of the IEEE Symposium on Neural Networks. IEEE Student Branch, Delft: 59-68.
R. Willemse (1987) Performance assessment of the Dutch grapheme-to-phoneme module. Esprit-project 860, Report nr. NU-GRPHASS-0509.
Back Propagation

J. Henseler*

Forensic Science Laboratory of the Ministry of Justice, Rijswijk

1 Introduction

In the late 1950s two artificial neural networks were introduced that have had a great impact on current neural network models. The first one is known as the Perceptron (cf. Rosenblatt, 1958, 1962) and contains linear threshold units, i.e., outputs are either zero or one. The second network model is constructed from Adaline (Adaptive Linear) units which have a linear output, i.e., without a threshold. This network is known as the Madaline (cf. Widrow and Hoff, 1960). Both networks use a learning rule that is a variant of what is now called the delta rule (Rumelhart et al., 1986). The main drawback of these two neural network models is their restriction to one layer of adaptive connections. In their famous book Perceptrons, Minsky and Papert (1969) showed that such networks are only capable of associating linearly separable input classes. This means, for example, that neither the Perceptron nor the Madaline would ever be able to learn the exclusive-or (XOR) problem (cf. Section 2). Minsky and Papert also noted that these limitations could be overcome if an intermediate layer of adaptive connections is introduced. At that time, however, no efficient learning rule for networks with intermediate layers was known. In 1985 several learning schemes for adapting intermediate connections were reported (Parker, 1985; Le Cun, 1985). However, in this paper we will focus entirely on the generalized delta rule that was introduced in 1986 by Rumelhart et al. (1986). The application of the generalized delta rule requires two phases. In the first phase, input is propagated forward to the output units where the error of the network is measured. In the second phase, the error is propagated backward through the network and is used for adapting connections. Owing to the second phase this procedure is also known as Back Propagation of error. We note that this procedure is similar to an algorithm described much earlier by Werbos (1974). Section 2 describes the Perceptron learning rule. It also shows why a perceptron cannot solve the exclusive-or problem. In Section 3 the Madaline learning rule is described, as well as its relation to the Perceptron learning rule. In Section 4 the architecture of multi-layer neural networks is introduced, and it is described how these networks may be adapted using the generalized delta rule. Section 5 pays attention to a serious drawback of this learning rule, viz., the existence of local minima. In Section 6 some second-order improvements on the generalized delta rule are presented, viz., the momentum and adaptive back propagation.

* We thank IBM for their hardware support under the Joint Study Agreement DAEDALOS
In Section 7 a recurrent network is described that can be trained with Back Propagation. Such a network may be used for learning patterns with a temporal extent. Finally, in Section 8 an application of a multi-layer neural network for controlling a robot arm is discussed. There are two appendices to this paper. In Appendix A a vectorized version of the generalized delta rule is derived. Appendix B contains a pseudo-code description of the Back Propagation algorithm.

2 Perceptron learning rule

The Perceptron was introduced as a layer of neurons that receive input from a retina. The neurons are not interconnected and can only be activated by input from the retina. A neuron receives activation from a retina point if and only if (1) there exists a connection, and (2) the retina point itself is activated, e.g., black. The sum of this activation in a neuron is called the net input a. Neurons in a perceptron are threshold units, i.e., the output y of a neuron is 1 if a exceeds the threshold θ and 0 if not. Figure 1 depicts a perceptron with inputs $x_1, \ldots, x_n$.

Fig. 1. Block diagram of a Perceptron neuron processing model

The processing model of a neuron in a perceptron with inputs $x_1, \ldots, x_n \in \mathbb{R}$ and connection weights $w_1, \ldots, w_n \in \mathbb{R}$ can mathematically be described as follows:
$$a = \sum_{i=1}^{n} w_i x_i, \qquad y = \begin{cases} 1 & a \geq \theta \\ 0 & a < \theta \end{cases} \qquad (1)$$

Here $w_i \neq 0$ means a connection to the i-th input exists and $w_i = 0$ means it does not exist. If $w_i > 0$ the input contributes to the activation sum a in a positive way, i.e., it excites the neuron. If $w_i < 0$ the input decreases the activation sum, i.e., it inhibits the neuron. A neuron will only turn on in case the excitatory input is more than θ units stronger than the inhibitory input. Hence, a neuron, or a number of neurons, establishes a mapping from input activity onto the output. The nature of this mapping is entirely determined by the perceptron configuration, i.e., by connections and thresholds. In terms of a Perceptron, pattern recognition may be interpreted as a mapping of retina images onto a number of categories. A perceptron can perform this task if it has an appropriate configuration. If such a configuration exists then it can be obtained by adapting the perceptron using the procedure presented in Table 1. According to this procedure the connections and thresholds are adapted, based on the actual output y of a neuron (cf. Equation 1) and its desired output, or target, Y.

Table 1. Perceptron Learning Procedure

INPUT: x_1, ..., x_n   TARGET: Y
Calculate y according to Equation 1
if y ≠ Y then
  if y = 1 then
    θ = θ + 1
    for i = 1 to n do
      if x_i = 1 then w_i = w_i - 1
    endfor
  else
    θ = θ - 1
    for i = 1 to n do
      if x_i = 1 then w_i = w_i + 1
    endfor
  endif
endif
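The procedure of Table 1 translates directly into Python; the following sketch (ours) performs one presentation of an input/target pair:

def perceptron_step(x, target, w, theta):
    # One presentation: adjust weights and threshold only if y is wrong.
    y = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
    if y != target:
        step = -1 if y == 1 else 1   # fired wrongly: weaken; else strengthen
        theta -= step                # threshold moves opposite to the weights
        w = [wi + step if xi == 1 else wi for wi, xi in zip(w, x)]
    return w, theta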
The perceptron learning procedure can be described as a delta rule in mathematical form. Let the threshold change and the weight change be denoted by Δθ and $\Delta w_i$, respectively. It is easy to see that the delta rule presented in (2) corresponds to the procedure described in Table 1:

$$\Delta\theta = y - Y = \delta, \qquad \Delta w_i = -(y - Y)\, x_i = -\delta\, x_i \qquad (2)$$

For some mappings, however, an appropriate configuration does not exist. As was pointed out by Minsky and Papert (1969), no perceptron configuration can be found for the XOR function in Table 2. This is known as the exclusive-or problem since either the first input or the second input must be activated, but not both, in order to turn on the output.

Table 2. Mappings of the OR, AND and XOR function, respectively.

x1  x2  |  OR  AND  XOR
 0   0  |   0   0    0
 0   1  |   1   0    1
 1   0  |   1   0    1
 1   1  |   1   1    0

In the XOR case this problem can be analyzed as follows. A single neuron can only categorize inputs $x_1$ and $x_2$ in two classes, viz., Class 0 containing inputs for which $w_1 x_1 + w_2 x_2 < \theta$ and Class 1 containing inputs for which $w_1 x_1 + w_2 x_2 \geq \theta$. For any value of $w_1$, $w_2$ and θ this separation has the shape of a line, meaning that Class 0 and Class 1 have to be linearly separable. From Figure 2 it follows that the AND and OR functions are linearly separable but that the XOR function requires a non-linear separation. If, on the other hand, we were allowed to add another input feature $x_3$ to the perceptron that is the logical-and function of $x_1$ and $x_2$, it would be possible to solve the XOR problem. This input feature could be calculated by a second, intermediate, neuron. In that case, however, the perceptron learning procedure does not tell us how to configure this neuron, since no target for its output is known: it is hidden from the network output. In Section 4 a generalization of the delta rule will be presented that is capable of configuring intermediate, or hidden, neurons. First, we will deal with adapting continuous-valued weights in the next section.

3 Gradient descent

The Perceptron Learning Procedure described in Table 1 only works with discrete-valued connections, inputs and outputs. In the Madaline (Widrow and Hoff, 1960), however, connections are continuous, as are the outputs, since a linear neuron function is used.

Fig. 2. Geometric representation of the AND, OR and XOR functions
Hence, a new learning procedure is required that is capable of minimizing the error for continuous values. In the Madaline this problem was solved by applying a Least Mean Squares (LMS) procedure. This approach requires that the error of the system is measured as the sum of the squared errors. For a single target Y the squared error E is:

$$E = (y - Y)^2 \qquad (3)$$

The LMS procedure calculates how the weights needed to produce y should be changed in order to decrease E. Obviously, a correct configuration has been learned if and only if E = 0. After this adaptation y is calculated again and the process is repeated. In case more than one pattern must be learned, E is calculated as the sum of all pattern errors. The weights are adapted by cycling through the pattern set and for each pattern adapting the weights according to the individual pattern errors. The method used for finding the correct adaptation vector $(\Delta w_1, \ldots, \Delta w_n)$ is known as gradient descent. If we think of E as a function of $w = (w_1, \ldots, w_n)$, then the gradient of E with respect to w denotes the slope of the "error surface". By descending this surface downhill, i.e., in the direction of the negative gradient, we will finally reach the bottom of the surface. At that point the error can no longer be decreased and the procedure finishes. In Section 5 an example of gradient descent for a network with two weights is presented. The gradient of the error surface can only be calculated if the neuron-processing function is differentiable. Hence, the processing model in Equation 1 cannot be used, because the output function has an infinite gradient if a approaches θ. In the Adaline neuron-processing model the threshold is eliminated and a linear function remains, i.e., $y = \sum_{i=1}^{n} w_i x_i$. The corresponding error surface is smooth and the LMS procedure is applicable. Figure 3 depicts a typical error surface corresponding to the OR-mapping (cf. Table 2) for a single neuron with two weights.
Using the linear neuron function, the error function $E(w_1, w_2)$ can be written as the sum of the squared errors for each entry in the OR table:

$$E(w_1, w_2) = (1 - w_1)^2 + (1 - w_2)^2 + (1 - w_1 - w_2)^2 \qquad (4)$$

Fig. 3. "Bowl-shaped" error surface with a minimum at the center

In this case a gradient descent will finally lead to the minimum of E. For the linear neuron model the learning rule for adapting $w_i$ becomes:

$$\Delta w_i = -\eta\, \frac{\partial E}{\partial w_i} \qquad (5)$$

Constant η is called the learning rate and determines how much the surface will be descended in one step. Taking large steps, i.e., using a large learning rate, speeds up the learning process. In some cases, however, it may lead to unstable behaviour of the system, e.g. introducing oscillations. By substituting E in Equation 5 using Equation 3 and subsequently substituting $y = \sum_{k=1}^{n} w_k x_k$, the derivative of the error measure with respect to $w_i$ becomes:

$$\Delta w_i = -\eta\, \frac{\partial E}{\partial w_i} = -2\eta\, (y - Y)\, \frac{\partial y}{\partial w_i} = -2\eta\, \delta\, \frac{\partial \sum_k w_k x_k}{\partial w_i} = -2\eta\, \delta\, x_i \qquad (6)$$

This result is similar to the mathematical form of the Perceptron Learning Rule (cf. Equation 2). We conclude by noting that the LMS procedure is a useful alternative to the Perceptron Learning Procedure described in Table 1 for adapting continuous-valued weights.
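The LMS procedure just derived can be sketched in a few lines of Python (ours); a linear neuron is trained on the OR table of Table 2 with the rule of Equation 6:

patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = [0.0, 0.0]
eta = 0.1                                # learning rate

for epoch in range(100):                 # cycle through the pattern set
    for x, target in patterns:
        y = w[0] * x[0] + w[1] * x[1]    # linear neuron output
        delta = y - target
        w = [wi - 2 * eta * delta * xi for wi, xi in zip(w, x)]

print(w)   # settles near the minimum of the error surface of Figure 3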
4 Back Propagation

In Section 2 it was explained why a Perceptron cannot solve the XOR problem unless a hidden neuron is used. However, since no target output for such a neuron is specified, neither the Perceptron Learning procedure (cf. Table 1) nor the LMS adaptation for the Madaline (cf. Equation 6) is applicable. Hence, a correct configuration cannot be found. The generalized delta rule eliminates this problem by using the error gradient of the LMS procedure as a substitute target error for hidden neurons.

4.1 Multi-layer Network

A multi-layer network is a special case of a Perceptron with hidden neurons. It consists of a number of consecutive layers, i.e., an input neuron layer, zero or more hidden layers and an output layer. In case there are no hidden layers the multi-layer network is equivalent to a Perceptron; that is, neurons in the same layer are not interconnected and neurons in the input layer represent input features, e.g., pixels. The output of the input layer is presented to the first hidden layer, or, if there are no hidden layers, directly to the output layer. Neurons in a hidden layer that do not receive inputs from the input layer are connected to the neurons in the previous hidden layer. Hence, the output of a hidden neuron is sent to the next layer, which may either be another hidden layer or the output layer. Finally, the output layer sends its output to the environment. A multi-layer network consisting of N layers is depicted in Figure 4. We have denoted the number of neurons in layer p by $m_p$.

Fig. 4. A multi-layer network consisting of N layers (layer 1, 2, ..., N-1, N)

Just like in the Madaline, the connections have continuous-valued weights and the neuron input and output are also continuous-valued. The connection to the i-th neuron in layer p from the j-th neuron in layer p-1 has a weight
denoted by $w_{ij}^p$. Two connected layers p-1 and p with their connections and corresponding weights are shown in Figure 5.

Fig. 5. Organization of connection weights corresponding to a neuron

We note that if a multi-layer network were constructed from Adalines it would essentially be equivalent to a Madaline, i.e., a single-layer network. This is caused by the linearity of the neuron transfer function in the Adaline, implying that $y_i^p = a_i^p$. This can be shown by introducing $w_{ij}^{p(n)}$ as the n-th order weight connecting $y_j^{p-n}$ to $y_i^p$. We note that $w_{ij}^{p(1)} = w_{ij}^p$. A multi-layer network consisting of Adalines collapses into a single layer whose weights are the (N-1)-th order weights connecting the input layer directly to the output layer. We illustrate this by showing that $y_i^p$ is directly expressible as a linear summation of $y_1^{p-2}, \ldots, y_{m_{p-2}}^{p-2}$ using second-order weights between layer p-2 and p:

$$y_i^p = \sum_{j=1}^{m_{p-1}} w_{ij}^p\, y_j^{p-1} = \sum_{j=1}^{m_{p-1}} w_{ij}^p\, a_j^{p-1} = \sum_{j=1}^{m_{p-1}} w_{ij}^p \sum_{k=1}^{m_{p-2}} w_{jk}^{p-1}\, y_k^{p-2} = \sum_{k=1}^{m_{p-2}} \left( \sum_{j=1}^{m_{p-1}} w_{ij}^p\, w_{jk}^{p-1} \right) y_k^{p-2} = \sum_{k=1}^{m_{p-2}} w_{ik}^{p(2)}\, y_k^{p-2} \qquad (7)$$

The multi-layer structure does not collapse into a single layer if a non-linear output function is used, for instance, the hard-limiting threshold function used in the Perceptron neuron processing model (cf. Equation 1). However, as we indicated in the previous section, this step function is not differentiable and, hence, it does not allow the adaptation of weights using a gradient descent. Therefore, the step function is substituted by a sigmoid function having a similar
shape but with a continuous derivative (cf. Figure 6(a)). This results in a multi-layer network with neurons computing the sigmoidal function f of the weighted sum a of their inputs (Rumelhart et al., 1986).

Fig. 6. Two typical output functions used in multi-layer networks: (a) the sigmoid function $f(x) = 1/(1 + e^{-x})$, (b) the tanh(x) function

The network output is obtained by propagating the input through the consecutive layers in Figure 4 until it reaches the output layer. Hence, this procedure is called forward propagation. If the inputs to the i-th neuron in layer p are $y_1^{p-1}, \ldots, y_{m_{p-1}}^{p-1}$ (i.e., the outputs from layer p-1) with corresponding weights $w_{i1}^p, \ldots, w_{im_{p-1}}^p$, then the net input $a_i^p$ and the output $y_i^p$ for this neuron are:

$$a_i^p = \sum_{j=1}^{m_{p-1}} w_{ij}^p\, y_j^{p-1}, \qquad y_i^p = f(a_i^p) = \frac{1}{1 + e^{-a_i^p}} \qquad (8)$$

Sometimes other output functions are used, for instance, tanh(x) (cf. Figure 6(b)). The tanh(x) function is essentially identical to the sigmoid function and is particularly useful when the network output should range between -1 and 1. We note that the sigmoid function depicted in Figure 6(a) ranges between 0 and 1.

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{2}{1 + e^{-2x}} - 1 = 2f(2x) - 1 \qquad (9)$$

In the next section we describe how the gradient descent method underlying the delta rule used for the Madaline (cf. Section 3) may be generalized such that it is suitable for configuring multi-layer neural networks with non-linear output functions.
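Forward propagation according to Equation 8 can be sketched as follows (ours; plain Python, one weight matrix per layer of connections):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, x):
    # layers[p][i][j] is the weight from neuron j in layer p
    # to neuron i in layer p+1; x holds the input layer outputs.
    y = x
    for weights in layers:
        y = [sigmoid(sum(w * yj for w, yj in zip(row, y))) for row in weights]
    return y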
4.2 Generalized Delta Rule

The back-propagation procedure (Rumelhart et al., 1986) is essentially a gradient-descent method which minimizes an error $E$ by adapting weights (cf. Section 3). The error is measured as the sum of the squared errors of the actual responses $y_i^N$ and the desired (target) responses $Y_i$ of the neurons in the output layer. For a single example $E$ becomes:

$$E = \sum_{i=1}^{m_N} (Y_i - y_i^N)^2 \quad (10)$$

Although somewhat more complicated, this error function is essentially the same as the one presented in Equation 3 in Section 3. The error surface is defined as a function of the network parameters, i.e., the weights. The error $E$ is minimized by a change ($\Delta$) in the weights in the direction of the gradient descent, i.e., proportional to the negative gradient of $E$:

$$\Delta w_{ij}^p = -\eta \, \frac{\partial E}{\partial w_{ij}^p} \quad (11)$$

The constant $\eta$ is called the learning rate, and is a positive real number. Increasing the learning rate on the one hand speeds up the adaptation process, but on the other hand may cause the system to become unstable. The derivative of $E$ with respect to weights belonging to hidden layers is more difficult to determine, because $E$ is defined in terms of the error made by the output layer. However, it can be shown that the error in layer $p$ can be expressed in terms of the errors occurring in the next layer $p+1$, and so on. The full derivation of the generalized delta rule is presented in Appendix A. This derivation introduces a delta error $\delta_i^p$ for all neurons in the network, which is used to calculate the components of the error gradient:

$$\frac{\partial E}{\partial w_{ij}^p} = \delta_i^p \, y_j^{p-1} \quad (12)$$

The delta error $\delta_i^p$ is defined as the partial derivative of $E$ with respect to the net input $\sigma_i^p$ of neuron $i$ in layer $p$:

$$\delta_i^p = \frac{\partial E}{\partial \sigma_i^p} \quad (13)$$

The partial derivative $\partial E / \partial \sigma_i^p$ is in fact a measure of the desired change in the output of the specific neuron $i$ in order to minimize $E$. The delta error is spread back through the network, from the last layer towards the first hidden layer directly following the input layer, by a back-propagation process. The derivation of the procedure for calculating $\delta_i^p$ in Equation 13 is described in Appendix A.
It results in the generalized delta rule that can be used for calculating the weights $w_{ij}^p(t)$ based on their value at the previous iteration $t-1$ and the error:

$$w_{ij}^p(t) = w_{ij}^p(t-1) - \eta \, \delta_i^p \, y_j^{p-1} \quad (14)$$

$$\delta_i^p = \begin{cases} y_i^p (1 - y_i^p) \sum_{k=1}^{m_{p+1}} \delta_k^{p+1} w_{ki}^{p+1} & 1 \le p < N \\ y_i^p (1 - y_i^p)(y_i^p - Y_i) & p = N \end{cases} \quad (15)$$

We note that the factor $y_i^p (1 - y_i^p)$ in the calculation of the delta error corresponds to the derivative of the sigmoid function (cf. Equation 8). Namely, it can be shown that for the derivative of a sigmoid function $f(x)$ with respect to $x$, $f'(x) = f(x)(1 - f(x))$ holds. In case the $\tanh(x)$ function (cf. Equation 9) is used, the factor $y_i^p (1 - y_i^p)$ should be substituted by $1 - (y_i^p)^2$, since $\tanh'(x) = 1 - \tanh^2(x)$.

The Perceptron Learning procedure does not only adapt weights but also thresholds. It can be shown that the threshold adaptation rule in Equation 2 still applies when a threshold is entered in the sigmoidal function $f$ (cf. Equation 8). Introducing a threshold avoids the situation where training is not very successful when $\|\sigma\| \gg 0$. In that case $f$ is almost horizontal and $f'$ approaches 0. Hence, the weight change will also be close to zero. Moreover, a threshold may also avoid the emergence of local minima in the error surface (cf. Section 5). A threshold parameter $\theta$ is introduced for each neuron by using $g(\sigma, \theta) = f(\sigma - \theta)$ instead of just $f$ in the processing model:

$$g(\sigma, \theta) = f(\sigma - \theta) = \frac{1}{1 + e^{-(\sigma - \theta)}} \quad (16)$$

As in Equation 11 the error $E$ may be minimized by adapting $\theta$. From Equations 8 and 16 it follows that the derivative of $g(\sigma, \theta)$ with respect to $\theta$ equals $-f'(\sigma - \theta)$. In Appendix A it is shown that the mathematical form of the Perceptron Learning rule for adapting the threshold (cf. Equation 2) still holds, viz.:

$$\Delta\theta = \eta \, \delta \quad (17)$$

In many implementations of the Back Propagation procedure, neurons have no explicit thresholds. Instead, a so-called bias neuron is added to the network. The bias neuron receives no input and constantly has output $-1$. Each neuron has a connection to the bias neuron. The adaptation (cf. Equation 12) of the corresponding connection weight reduces to the threshold adaptation in Equation 17. Hence, the weight to the bias neuron has the same functionality as the threshold.
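Equations 14 and 15 translate almost literally into code. The sketch below is again merely an illustration under the assumptions of the previous fragment (outputs as returned by forward, one weight matrix per layer); it computes the delta errors backwards and then adapts the weights.

    def backprop_update(weights, outputs, target, eta=0.3):
        # Delta errors of the output layer, cf. Equation 15 (p = N):
        y = outputs[-1]
        delta = [yi * (1.0 - yi) * (yi - ti) for yi, ti in zip(y, target)]
        deltas = [delta]
        # Hidden layers N-1 .. 1, cf. Equation 15 (1 <= p < N):
        for p in range(len(weights) - 1, 0, -1):
            W, y = weights[p], outputs[p]
            delta = [y[i] * (1.0 - y[i]) *
                     sum(W[k][i] * delta[k] for k in range(len(W)))
                     for i in range(len(y))]
            deltas.insert(0, delta)
        # Weight adaptation, cf. Equation 14:
        for p, W in enumerate(weights):
            for i, row in enumerate(W):
                for j in range(len(row)):
                    row[j] -= eta * deltas[p][i] * outputs[p][j]

Repeated over all examples until the total error is small, this is exactly the loop given in Appendix B.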
5 Local minima

A gradient descent procedure searches for a minimal error on the error surface (cf. Figure 3). Once a minimum is reached there is no way out, regardless of the fact that other, better, minima may exist. If a better minimum exists, the current position is located in a local minimum of the error space. If, however, it is the lowest point among all, then we speak of a global minimum. The possible occurrence of non-global minima is a well-known problem that one has to be aware of when using a gradient descent procedure (see also the contribution by Lenting and the contribution by Crama et al.). In order to understand this phenomenon, a multi-layer neural network (cf. Section 4) with three neurons and two weights is studied (McClelland and Rumelhart, 1988). The network contains one input, one hidden and one output neuron; hence it is called a 1:1:1 network (cf. Figure 7).

Fig. 7. A 1:1:1 network with one input, one hidden and one output neuron, and two weights.

Here, the problem is to configure $w_1$ and $w_2$ such that the identity mapping is realized for binary input, i.e., the output should turn on if the input is turned on and it should turn off if the input is turned off. For the network in Figure 7 there exist two solutions to this problem. The first solution is straightforward. The hidden neuron simply propagates the unchanged on/off input signal and so does the output neuron. The other solution is that the hidden neuron transfers the opposite of the input neuron and the output neuron transfers the opposite of the hidden neuron. Obviously, taking two times the opposite of either on or off will result in the same signal. Both solutions correspond to global minima. In Table 3 weight configurations are presented that will finally converge and approach zero error when further adapted.

Table 3. Weight and bias configurations for the solutions of the identity mapping in a 1:1:1 network.

    Solution     w_1   w_2   bias_1   bias_2
    straight     +8    +8    -4       -4
    not-not      -8    -8    +4       +4

The gradient descent procedure can find appropriate configurations for the identity mapping without getting trapped in a local minimum. However, if the biases are fixed at zero, a local minimum appears in the error surface. The error function can be calculated by summing the errors of the network for the two possible situations, i.e., $x = 0, y = 0$ and $x = 1, y = 1$. Using the processing model of a multi-layer neural network (cf. Equation 8) we arrive at the following error function $E(w_1, w_2)$:
$$E(w_1, w_2) = \left(1 + e^{-w_2/2}\right)^{-2} + \left(1 - \frac{1}{1 + e^{-w_2/(1 + e^{-w_1})}}\right)^2 \quad (18)$$

The error surface is plotted in Figure 8 for $w_1, w_2 \in [-10, 10]$ from two different viewpoints. In figure (a) the saddle point is very clear. The global minimum is in the foreground. In figure (b) the view is changed to emphasize the left side of the "saddle", which is actually a descent to a local minimum. Although this is a very smooth descent, it makes it impossible for a gradient descent procedure to get to the other side of the "saddle".

Fig. 8. The error surface of the identity mapping for a 1:1:1 network with biases fixed at zero. Figure (a) shows a saddle point with the global minimum in the foreground; (b) shows a different view indicating the existence of a local minimum at the left side of the "saddle".

The existence of local minima can very easily lead to a failure of the gradient descent search. If such a situation occurs one could try starting from a different initial weight setting. Fortunately, it seems that the error surface of a network with many weights has very few local minima. Apparently, in such networks it is always possible to slip out of a local minimum along some other dimension. A more reliable method for escaping from local minima in a gradient search is called simulated annealing (Kirkpatrick et al., 1983). Normally, it is not possible to go uphill in a gradient descent. When applying simulated annealing every adaptation is performed with a certain probability. This introduces the possibility of going uphill, enabling an escape from local minima. Since it is more probable to escape from a shallow minimum by chance, the system is most likely to end in a global minimum instead of a local minimum. In simulated annealing this process converges by slowly "freezing" the system, i.e., by decreasing the probability of adaptation. A similar strategy is applied in the Boltzmann neural network (see the contribution by Spieksma).
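The acceptance rule that makes such uphill steps possible is easy to state in code. The fragment below sketches the usual Metropolis-style criterion with a temperature T that is slowly lowered ("frozen"); it illustrates the general idea of simulated annealing, not a specific procedure from this paper.

    import math, random

    def accept(error_old, error_new, T):
        # Downhill moves are always accepted; uphill moves only with a
        # probability that shrinks with the error increase and with T.
        dE = error_new - error_old
        return dE <= 0 or random.random() < math.exp(-dE / T)

    # "Freezing": after every epoch the temperature is lowered, e.g.
    # T = 0.95 * T, so escapes from (shallow) local minima become rarer.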
6 Enhancements

The Back Propagation procedure converges very slowly, which is typical for many gradient descent procedures. Moreover, when the number of neurons increases linearly, the speed decreases more than linearly. This is caused by the fact that the dimension of the gradient equals the number of weights in the network. In a multi-layer network the number of weights is roughly equal to the square of the number of neurons in the largest layer. On the other hand, however, we saw that if the dimension of the error space increases, there seems to be a smaller chance of getting trapped in a local minimum. If we were able to speed up the learning process, Back Propagation would be very useful for configuring large networks as well. Increasing the learning rate does not always speed up the learning process. As a matter of fact, if the learning rate becomes too large, the system will certainly begin to oscillate and the learning process halts. In this section we will describe three alternative methods for speeding up the learning process. The first method uses a so-called momentum (Rumelhart et al., 1986), the second is called the adaptive back-propagation algorithm (Silva and Almeida, 1990), and the third is called SuperSAB, a self-adapting Back Propagation algorithm (Tollenaere, 1990).

6.1 Momentum

The weight adaptation described in Equation 15 is very sensitive to small disturbances. Suppose the direction of the gradient changes due to, for example, a bump in the error surface. In that case the back-propagation procedure may just as well continue by going straight over the bump, since it will vanish quickly. If, however, the gradient change is persistent, the adaptation will take notice of it. This strategy is accomplished by taking into account the previous adaptations in the learning process, so that it gets a momentum. In practice this means that the weight adaptation calculated at step $t$ (cf. Equation 15) is combined with the adaptation from step $t-1$ multiplied by a so-called momentum parameter $\alpha$. The adaptation process then becomes:

$$\Delta w_{ij}^p(t) = -\eta \, \delta_i^p \, y_j^{p-1} + \alpha \, \Delta w_{ij}^p(t-1) \quad (19)$$

The momentum parameter $\alpha$ has to be in $[0, 1)$, otherwise the contribution of each $\Delta w_{ij}^p$ grows infinitely. On the other hand, if $\alpha$ is too small the momentum becomes insignificant. One should therefore set the value of $\alpha$ close to one, e.g., 0.9. A basic problem with the momentum method is that it assumes the gradient slowly decreases when arriving close to the minimum. If, however, this is not the case, the adaptation process will shoot through the minimum at high speed due to its momentum.
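In code, the momentum method of Equation 19 only requires remembering the previous adaptation of each weight. A minimal sketch (names ours):

    def momentum_step(w, gradient, dw_prev, eta=0.3, alpha=0.9):
        # Delta w(t) = -eta * dE/dw + alpha * Delta w(t-1), cf. Equation 19;
        # here gradient stands for the component delta_i^p * y_j^{p-1}.
        dw = -eta * gradient + alpha * dw_prev
        return w + dw, dw      # remember dw as dw_prev for step t+1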
6.2 Adaptive Back-Propagation algorithm

When looking at an error surface, see, for example, Figure 8, we see that slopes may be gentle in one direction but steep in another. If the gradient descent travels along a small gradient, learning is slow. In such cases it would be a good idea to use a higher learning rate. On the other hand, if the gradient is steep the learning rate should be kept small. This strategy is accomplished by assigning to each weight $w_{ij}^p$ an individual learning rate $\eta_{ij}^p$ that is increased if the sign of its gradient component remains the same for some iterations, and is decreased otherwise. If $\Delta w_{ij}^p(t-1)$ and $\Delta w_{ij}^p(t)$ are the weight changes at times $t-1$ and $t$ respectively, and $\eta_{ij}^p(t-1)$ is the corresponding learning rate at $t-1$, then the new learning rate may be calculated as follows:

$$\eta_{ij}^p(t) = \begin{cases} u \, \eta_{ij}^p(t-1) & \text{if } \Delta w_{ij}^p(t) \, \Delta w_{ij}^p(t-1) > 0 \\ d \, \eta_{ij}^p(t-1) & \text{if } \Delta w_{ij}^p(t) \, \Delta w_{ij}^p(t-1) < 0 \end{cases} \quad (20)$$

The constants $u$ and $d$ are an increase and a decrease factor respectively. Silva and Almeida (1990) note that in a wide range of tests performed with this technique, they found that a value of $u$ somewhere between 1.1 and 1.3 was able to provide good results. For the parameter $d$, a value slightly below $1/u$ enables the adaptive process to give a small preference to learning-rate decrease, yielding a somewhat more stable convergence process. As one might expect, this technique may cause problems due to the fact that gradient components are changed independently from each other. This problem may be avoided by testing the total output error after adaptation has taken place. If there is an increase in error, the new adaptation is rejected and a new set of learning rates is calculated using the gradient of the rejected adaptation. If this simple strategy does not work after a few trials, then it is always possible to simply reduce all the learning-rate parameters by a fixed factor and repeat the process.

6.3 SuperSAB

SuperSAB (from Super Self-Adapting Back propagation) (Tollenaere, 1990) is a combination of the momentum method and adaptive back propagation. This algorithm is based on adaptive Back Propagation with a momentum. In each step the learning rate is increased exponentially using $u$ (see adaptive back propagation). When the sign of a gradient component changes, the responsible adaptation is cancelled using the momentum. This is an important difference with the original adaptive back-propagation algorithm, where learning is only slowed down but where the last weight adaptation is not cancelled after a gradient change. Furthermore, the learning rate is decreased exponentially using $d$ (see adaptive back propagation). Before Back Propagation is continued, the momentum should be set to zero to avoid making the same mistake again. Experiments with SuperSAB indicate that in many cases the algorithm converges faster than gradient descent. In all cases the algorithm is less sensitive to parameter values than the original back propagation algorithm.
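The learning-rate update of Equation 20, and the extra cancellation step that distinguishes SuperSAB from it, can be sketched as follows. This is only an outline of the two rules, with illustrative names and the parameter values suggested by Silva and Almeida (1990).

    def new_rate(eta, dw, dw_prev, u=1.2, d=0.8):
        # Equation 20: grow the rate while the gradient component keeps
        # its sign, shrink it when the sign flips (d slightly below 1/u).
        return eta * u if dw * dw_prev > 0 else eta * d

    def supersab_step(w, dw, dw_prev, eta, u=1.2, d=0.8):
        # SuperSAB: on a sign change the previous adaptation is cancelled
        # and the momentum is reset, so the same mistake is not repeated.
        if dw * dw_prev < 0:
            return w - dw_prev, eta * d, 0.0   # undo step, decrease rate
        return w + dw, eta * u, dw             # keep going, increase rate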
7 Simple recurrent network

The networks described in the previous sections are limited to realizing static mappings. Once a network is configured, it only maps the input at time $t$ on the output according to some learned mapping. Hence, the network is not capable of taking into account inputs that it processed earlier, unless they were presented during the learning period. One way to eliminate this restriction is to use a shift register that consists of $N$ buffers, each capable of storing a single value. Each time a new input sample arrives the buffers are shifted, i.e., buffer $N$ becomes $N-1$, $N-1$ becomes $N-2$, etcetera. The contents of buffer $N$ are forgotten and the new sample is stored in buffer 1. A network with $N$ buffers as input neurons is capable of processing temporal information restricted to the last $N$ samples. First of all this solution is very awkward, since it requires shift registers that are physically limited, meaning that only a limited number of samples can be retained. Secondly, this solution introduces a translation problem, since a pattern can begin at $N$ different positions.

Another way to eliminate this restriction without using buffers is to use outputs at time $t-1$ as input at time $t$ (see also the contribution by Weijters and Hoppenbrouwers). These may either be outputs from neurons in the output layer, but can just as well be taken from any other neuron in the network. Such a network is called a recurrent network, and since its structure has remained the same it can still be trained using the Back Propagation procedure. One particular kind is called the Simple Recurrent Network (SRN) and has been studied by Elman (1988). This network contains an input layer, a hidden layer and an output layer. The input layer is divided into input neurons that actually serve as the network input and so-called context neurons that are connected to the hidden neurons. For each hidden neuron there exists exactly one context neuron, and after each iteration the output of a hidden neuron is copied to the output of its corresponding context neuron. The structure of this recurrent network is depicted in Figure 11.
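One processing step of such a Simple Recurrent Network can be sketched on top of the forward function of Section 4.1. The fragment is only meant to show the copying of the hidden outputs to the context neurons; all names are ours.

    def srn_step(weights, x, context):
        # The input layer consists of the actual inputs plus the context
        # neurons, which hold the hidden outputs of the previous step.
        outputs = forward(weights, x + context)
        hidden = outputs[1]               # outputs of the hidden layer
        return outputs[-1], hidden        # network output, new context

    # context starts out as, e.g., [0.5] * number_of_hidden_neurons and
    # is fed back on the next call: y, context = srn_step(W, x, context)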
8 Robot Arm

In this section the Back Propagation procedure is used to configure a multi-layer network for controlling a robot-arm system. After the network is adapted, or trained, it has an internal model enabling it to control the system. The robot-arm setup is drawn in Figure 9. It depicts a robot arm that is bent at the shoulder over $\phi_1$, and at the elbow over $\phi_2$ degrees. Hence, the robot arm is said to have two degrees of freedom. The problem is to find $\phi_1$ and $\phi_2$ such that the hand of the arm reaches a point that coincides with the crossing point of the looking directions $\alpha_1$ and $\alpha_2$ of the left and right eye respectively.

In a real situation the examples needed for training the system may be obtained by taking measurements. In the case of Figure 9 this could, for example, be accomplished by using a mechanical model. After having placed the arm and eyes such that the eyes are looking at the hand, the corresponding angles can be measured. By repeating this procedure a collection of examples may be obtained. The advantage of this approach is that it is very straightforward to obtain examples for, e.g., a robot arm with five degrees of freedom. Deriving an analytical model for such an arm is still feasible, although it is very difficult and certainly not cheap. We have chosen a robot arm with two degrees of freedom because it is relatively easy to derive an analytical solution. Looking at the complexity of the solution for this simple robot arm (cf. Table 4) gives a good idea of how complicated solutions for industrial robot arms may get.

Fig. 9. Two eyes are rotated over $\alpha_1$ and $\alpha_2$ degrees respectively, looking at the hand of a robot arm with two degrees of freedom $\phi_1$ and $\phi_2$. The distance between hand and shoulder is denoted $A$. The length of the lower arm is $B$ and of the upper arm is $C$. They determine the reach of the hand.

We will use some additional variables enabling us to partition the transformation into three steps. In the first step the coordinates $(x, y)$ of the hand are calculated relative to the shoulder. In the second step, the angles $\gamma_1$ and $\gamma_2$ inside the "arm-triangle" are calculated. Finally, in the third step, the shoulder and elbow angles $\phi_1$ and $\phi_2$ are calculated.

Table 4. Analytical form of the transformation for the robot arm.

    I    $x = w - d \tan\alpha_1 / (\tan\alpha_2 - \tan\alpha_1)$
         $y = h - d \tan\alpha_1 \tan\alpha_2 / (\tan\alpha_2 - \tan\alpha_1)$
    II   $A^2 = x^2 + y^2$
         $\cos\gamma_1 = (A^2 + C^2 - B^2) / 2CA$
         $\cos\gamma_2 = (A^2 - C^2 + B^2) / 2BA$
    III  $\phi_1 = 180 - \arctan(y/x) - \gamma_1$
         $\phi_2 = \gamma_1 + \gamma_2$
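For reference, the three steps of Table 4 in executable form. This is our own sketch: the parameters w, h and d are assumed to describe the position of the eyes relative to the shoulder, following Figure 9, and the code works in radians where the table uses degrees.

    import math

    def arm_angles(a1, a2, w, h, d, B, C):
        t1, t2 = math.tan(a1), math.tan(a2)
        x = w - d * t1 / (t2 - t1)               # step I: hand position
        y = h - d * t1 * t2 / (t2 - t1)          # relative to the shoulder
        A2 = x * x + y * y                       # step II: angles inside
        A = math.sqrt(A2)                        # the "arm-triangle"
        g1 = math.acos((A2 + C * C - B * B) / (2 * C * A))
        g2 = math.acos((A2 - C * C + B * B) / (2 * B * A))
        phi1 = math.pi - math.atan2(y, x) - g1   # step III: shoulder and
        phi2 = g1 + g2                           # elbow angles
        return phi1, phi2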
A multi-layer neural network can perform this mapping with reasonable accuracy after having learned a set of examples. In this case one may think of the network as an automatic interpolator. The development of this network may be divided into four phases, which may be considered as a simple methodology for designing a multi-layer neural network that controls a robot arm.

8.1 Representation

A neural network may be thought of as a computer that programs itself according to a set of examples. This is not as good as it sounds. A neural network will only be capable of solving a problem if it gets all the information that is required. This means that, most of all, a neural network engineer must analyze the problem and determine what information is relevant and how it should be fed into the network. Moreover, he or she should also decide what kind of information is delivered by the network. This is a representation problem. Essentially, the representation problem in neural networks is caused by the large variety of possible representations. Choosing a correct representation does not only depend on the problem domain, it also depends on the physical capabilities of the neural network. For example, a multi-layer neural network will never be able to reach 1 as output, owing to the sigmoidal function. If binary examples are to be learned, it is wise to take, e.g., 0.1 instead of 0 and 0.9 instead of 1.

In the robot-arm system there are two problems. The first problem is that angles range between 0 and 360 degrees. This can simply be solved by scaling the angles to the range $[0, 1]$, dividing the angles by 360. The second, more serious, problem is that angles are periodic, i.e., 0 equals 360, or, after scaling, 0 is 1. This makes it very difficult, if not impossible, for a network to learn a proper model, because 0 and 1 have opposite meanings by their very nature. There are two ways to deal with this problem:

1. The first solution is that by making sure that the hand can only reach in front of the eyes ($A < h$, cf. Figure 9), we ensure that $\alpha_1, \alpha_2 \in [0°, 180°]$. After scaling we obtain a representation in which 0 and 1 actually, and literally, have an opposite interpretation. However, when using this solution, it would mean that the angles of the arm are also restricted to this range, unless other output features are added that, for example, represent the signs of $\phi_1$ and $\phi_2$.

2. The second solution is to represent angles as a sine, cosine pair. This means, however, that instead of two there are four inputs and four outputs. Due to this solution the scaling problem has slightly changed, since the values of sine and cosine range between $-1$ and $1$. This can be solved by adding 1 and dividing the result by 2. Using this representation 0 and 1 have opposite interpretations. We note that the linear scaling to $[0, 1]$ is not necessary if the $\tanh(x)$ function is used instead of the sigmoid function.

We have chosen the second solution, since it is straightforward and the sine and cosine of the viewing directions can be measured directly when obtaining examples from a mechanical model. Now that we have established the representation we can proceed by generating a collection of examples.
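The chosen representation is easily captured in two helper functions; the scaling to $[0, 1]$ matches the sigmoid output range discussed above (names again ours):

    import math

    def encode(angle):
        # an angle becomes a (sine, cosine) pair, scaled from [-1, 1] to [0, 1]
        return ((math.sin(angle) + 1.0) / 2.0,
                (math.cos(angle) + 1.0) / 2.0)

    def decode(s, c):
        # invert the scaling and recover the angle from the pair
        return math.atan2(2.0 * s - 1.0, 2.0 * c - 1.0)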
8.2 Examples

In order to train the network with the Back Propagation procedure (cf. Section 4) it is necessary to have a collection of examples. A single example consists of an array with input values and an array with output, or target, values. When training the network, an example is selected and fed into the network. The output is compared to the target, and subsequently the error made by the network for this particular example is calculated. With this error the network can be adapted using the generalized delta rule (cf. Equation 15).

The shape of the examples is determined by the representation. Hence, in this case, examples will consist of eight floating-point numbers, i.e., $(\sin\alpha_1, \cos\alpha_1, \sin\alpha_2, \cos\alpha_2, \sin\phi_1, \cos\phi_1, \sin\phi_2, \cos\phi_2)$. Using the analytical solution described in Table 4 a set of examples can be generated. The robot arm should perform equally well for all points. Hence, it is necessary to select random pairs $(\alpha_1, \alpha_2)$ in the area that can be reached, so that the examples have a uniform distribution. If, for example, they are not uniformly distributed but all situated at the left side of the shoulder, the arm is not likely to learn positions at the right side of the shoulder. In addition to the example set it is useful to have a test set that is constructed in the same way but contains different examples. Calculating the sum of the squared errors of the examples in the test set provides us with a measure of the overall performance of the network. If this error is low, then there are good reasons for assuming that the network has created a generalized model, since the samples in the test set were not actually learned.
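Generating such a uniformly distributed example set can be sketched as follows, assuming a function sample_reachable() that draws a random $(\alpha_1, \alpha_2)$ pair from the reachable area, and the arm_angles and encode helpers from the previous fragments (with the geometry parameters fixed):

    def make_examples(n, sample_reachable):
        examples = []
        for _ in range(n):
            a1, a2 = sample_reachable()       # uniform over reachable area
            p1, p2 = arm_angles(a1, a2)       # analytical model of Table 4
            x = encode(a1) + encode(a2)       # four input values
            t = encode(p1) + encode(p2)       # four target values
            examples.append((x, t))
        return examples

    # learning and test set constructed in the same way, e.g. 100 each:
    # learning_set = make_examples(100, sample_reachable)
    # test_set = make_examples(100, sample_reachable)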
8.3 Configuration

Before training can begin we have to configure the network. The number of input and output neurons is already determined by the representation. Determining the number of hidden layers and the number of hidden neurons in the network is normally a difficult task that can only be accomplished by trial and error. However, in this case we are dealing with a continuous mapping (cf. Table 4) and only one hidden layer is needed. Namely, according to Lippmann (1987), a three-layer perceptron with $N(2N + 1)$ nodes using continuously increasing non-linearities can compute any continuous function of $N$ variables. Unfortunately, the theorem does not indicate how weights or non-linearities in the network should be selected, or how sensitive the output function is to variations in the weights and internal functions. Based on Lippmann's statement we assume that one hidden layer will suffice. However, it is not exactly clear how many neurons it should contain, since the network has four outputs. Below, we will show that by using the error of the test set it is possible to indicate when the number of hidden neurons is not sufficient.

8.4 Learning

The rules presented in this section are based on experience. Hence, their validity cannot be proven, but in general they may be used for developing a useful learning strategy. During the learning phase it is important to monitor the error on the example set and on the test set. Learning should always proceed if there is still a considerable reduction of the error. Only when the changes get very small, i.e., learning becomes tediously slow, should one start wondering whether the learning goal has been achieved or whether a problem has come up. We will discuss a number of situations that may occur; a monitoring sketch follows after this list.

1. Normally, the errors on both the example and the test set should decrease continuously, with small disturbances. If, however, these disturbances are large and very frequent, this probably means that the system is unstable. In that case the learning-rate parameter should be decreased.

2. If the error on the example set has become zero, or almost zero, then the learning phase should be halted, because no more can be learned from the current examples. In this case there are two possibilities: (1) the error on the test set is also zero or almost zero, and (2) the error on the test set is still considerable. In the first situation the network has successfully completed the learning task. In the second situation there is a good indication that there are not enough examples or that they are not representative for the problem domain. It is advisable to review the examples and try again.

3. If shortly after the beginning the error on the example set is decreasing very slowly but is still considerable, there may be a problem. First of all it may be that the learning rate is too small. A typical learning rate is 0.3. By increasing the learning rate the error should decrease more rapidly. However, if this leads to unstable behaviour, it may be that the examples are inconsistent or that an inefficient representation was chosen. For instance, mapping 0° on 0 and 360° on 1 is inefficient, because 0 and 1 are opposite neuron activity levels while 0° and 360° represent the same angle. Inconsistencies in the example set may be found by examining whether there are contradictions. In that case an alternative representation should be considered. Another explanation for not being able to learn the examples may be that the network does not have enough hidden neurons. Hence, if the representation seems correct and the error is really not going to become zero, then it is advisable to try again with more hidden neurons.

4. If the errors are non-zero and do not change, this could mean the learning process is trapped in a local minimum (cf. Section 5), or it could indicate that there is no solution. One way to deal with this problem is to use another initial set of weights and start all over again, hoping that a local minimum will be avoided.

5. It may be possible that both the example error and the test error are decreasing smoothly but that suddenly the learning rate drops drastically. After a while the example error decreases again but the test error only increases. A possible explanation is that until the moment the learning rate dropped, the network was perfectly well capable of creating a model. Then, suddenly, the network is not able to gain more accuracy and learning stops. Apparently the network tries to achieve the desired accuracy by learning a perfect mapping without generalizing. Hence, the error on the test set increases. By adding more hidden neurons this problem may be avoided.
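The monitoring advice above can be summarized in a small training loop that records both errors and stops when the test error starts to rise persistently. It is a sketch only; train_epoch and total_error stand for one pass of the Back Propagation procedure and the summed squared error over a set.

    def train(train_epoch, total_error, examples, test_set,
              max_cycles=1000, patience=20):
        history = []
        for cycle in range(max_cycles):
            train_epoch(examples)
            e_learn = total_error(examples)
            e_test = total_error(test_set)
            history.append((e_learn, e_test))
            # example error (almost) zero: nothing more to learn (case 2)
            if e_learn < 1e-6:
                break
            # test error rising while the example error still falls: the
            # network is memorizing rather than generalizing (case 5)
            if cycle >= patience and e_test > history[cycle - patience][1]:
                break
        return history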
8.5 Simulation Results

The following simulation results were obtained with a Back Propagation training procedure using only the momentum enhancement. Using the analytical solution stated in Table 4, a learning set and a test set were constructed. Both consist of 100 randomly generated arm-eye orientations. The mean error on the examples of the learning set and the test set during 1,000 training cycles is shown in Figure 10 (a) to (h) below.

Fig. 10. Simulation results obtained by executing 1,000 cycles over a randomly generated learning and test set. Figures (a)-(g) correspond to the robot-arm application: (a), (b), (c) and (d) correspond to a 4:4:4 network; (e) shows unstable behavior because the learning rate is too large; (f) and (g) correspond to a 4:8:4 and a 4:16:4 network respectively. Figure (h) shows the curves for a 2:1 network trained on the 'or' and 'and' problems, and for a 2:2:1 network on the 'xor' problem. (The panel legends, with learning rates between 0.3 and 4.0 and various momentum values, are omitted here.)

Figures (a) to (e) were obtained from a 4:4:4 network. First of all it should be noticed that the error on the test set reduces quickly, meaning that apparently
there is a strong correlation between the examples in the learning set and those in the test set. Figures (a) and (b) were obtained without a momentum. Compared to (c) and (d), where the momentum is 0.9, the curves in (a) and (b) are initially less steep. The best results after 1,000 cycles were obtained in case (d), due to the higher learning rate. However, in figure (e) the learning rate is set to 4.0, which is too large. The result is that the curves start oscillating and that learning is very slow or sometimes even negative. An adaptive Back Propagation algorithm like SuperSAB would detect the oscillations and immediately reduce the learning-rate parameter. However, when the curve is smooth (e.g., figure (a)), SuperSAB increases the learning rate (cf. figure (b)).

Figure (f) shows learning results for a 4:8:4 network and (g) for a 4:16:4 network. Although the learning-rate parameter is set to 0.3, learning is even quicker than in (d). It must be realized, however, that training a 4:16:4 network requires much more effort than training a 4:4:4 network, i.e., the true learning speed in (f) and (g) may actually be slower. It should be clear from these simulation results that in order to achieve a high learning speed it is important to tune the learning parameters. When looking at the performance of the robot arm when it is controlled by the network, it is clear that the network can only interpolate from the examples it has learned.

In Figure 10(h) the simulation results for the 'or', 'and' and 'xor' problems are shown. The 'or' and 'and' problems were trained on a 2:1 network and converged very quickly. The 'xor' problem was trained on a 2:2:1 network and required considerably longer to converge compared to the 'or' and 'and'. Very typical for the 'xor' problem is that it seems initially as if the network has got stuck in a local minimum, but after a few hundred cycles the error reduces quickly.

9 Conclusions

The most important conclusion of this paper is that multi-layer neural networks in combination with Back Propagation have a very wide area of application. Adaptive multi-layer neural networks form a useful technique for categorizing, recognizing and modeling data distributions. In such applications this technique must be considered as a serious competitor of classical (statistical) methods. One of the major advantages that makes this technique so desirable is that no a priori assumptions have to be made concerning the nature of the input distribution. However, in this paper it is also pointed out that some serious drawbacks exist. First of all, Back Propagation is a gradient descent in network weight space. Consequently, the search for solutions in this space may be hindered by local minima. Moreover, a good representation of the input and output of the network is essential. This usually involves a thorough analysis of the problem domain that may be just as difficult as designing an algorithmic solution. Furthermore, examples needed for training the network may not always be available. Another issue that has not yet been mentioned is that Back Propagation is rather difficult to realize in hardware, compared to other network paradigms that rely on local adaptation rules. This means that when using Back Propagation, learning has to be done in advance and the resulting weights can be stored in a physical target neural network machine. For Back Propagation this is the only way to benefit from the virtue of neural networks, i.e., their ability to process large numbers of inputs in parallel.
Fig. 11. A Simple Recurrent Network (Elman, 1988). The input pattern and the context pattern together feed the hidden pattern, which feeds the output pattern.
Appendix A: Derivation of the Generalized Delta Rule

In this Appendix a vectorized version of the generalized delta rule is derived. A vector notation is used because it enables us to exploit the regular structure of the network, which will finally result in simpler rules. Moreover, Henseler and Braspenning (1990) use this result to prove that the generalized delta rule can also adapt multi-layered neural networks with complex-valued weights. The net inputs to and the outputs from the neurons in layer $p$ are denoted by vectors $\sigma^p$ and $y^p$ respectively:

$$\sigma^p = \begin{pmatrix} \sigma_1^p \\ \vdots \\ \sigma_{m_p}^p \end{pmatrix}, \qquad y^p = \begin{pmatrix} y_1^p \\ \vdots \\ y_{m_p}^p \end{pmatrix} \quad (21)$$

The lower indices in Equation 21 run over the neurons in layer $p$. Correspondingly, the weights between layers $p-1$ and $p$ in a multi-layer network can be written in a weight matrix $W^p$:

$$W^p = \begin{pmatrix} w_{11}^p & \cdots & w_{1 m_{p-1}}^p \\ \vdots & & \vdots \\ w_{m_p 1}^p & \cdots & w_{m_p m_{p-1}}^p \end{pmatrix} \quad (22)$$

Let vector $y^0(t)$ be the input to the network at time $t$. Generally we shall leave out the $(t)$ index. The $i$-th component of $y^0$ denotes the activity of input neuron $i$. With $y^0$ the input to the network, the output $y^N$ may be obtained by iteratively calculating $y^1, \ldots, y^N$, i.e., the outputs of layers $1, \ldots, N$ respectively. This process is called forward propagation of the network input. Equation 8 may be rewritten as follows:

$$\sigma^p = W^p y^{p-1} \quad (23)$$
$$y^p = f(\sigma^p) \quad (24)$$

The generalized delta rule is based on a gradient-descent method which minimizes a total error $E$ by adapting weights in the opposite direction of the gradient of the error surface in weight space. The error is measured as the sum of the squared errors of the actual responses $y_1^N, \ldots, y_{m_N}^N$ and the desired (target) responses $Y_1, \ldots, Y_{m_N}$ of the neurons in the output layer. In vector notation, the error function is equal to the squared length of the error vector $Y - y^N$:

$$E = \sum_{i=1}^{m_N} (Y_i - y_i^N)^2 = \|Y - y^N\|^2 \quad (25)$$

Since the network output $y^N$ is determined by the connection weights, the error $E$ is a function of the connection weights and will therefore be represented by $E(W^1, \ldots, W^N)$, which we will denote $E(W)$ for short. The objective is to find a network weight configuration $W^1, \ldots, W^N$ such that the error $E(W)$ is minimal.
A very simple method that may be used to approximate this minimum is to adapt $W^1, \ldots, W^N$ in the direction of the negative gradient. This method is called gradient descent. The error gradient contains a component for each weight $w_{ij}^p$ in the network. In a gradient descent, each weight $w_{ij}^p$ is adapted proportionally ($\eta$) to the negative partial derivative of $E(W)$ with respect to $w_{ij}^p$:

$$\Delta w_{ij}^p = -\eta \, \frac{\partial E(W)}{\partial w_{ij}^p} \quad (26)$$

We note, however, that gradient descent does not guarantee that a global minimum is found. It is only guaranteed that a local minimum is found. The generalized delta rule may be derived by rewriting Equation 26. Firstly, the chain rule is used to rewrite the partial derivative on the right-hand side:

$$\frac{\partial E(W)}{\partial w_{ij}^p} = \frac{\partial E(W)}{\partial \sigma_i^p} \, \frac{\partial \sigma_i^p}{\partial w_{ij}^p} = \frac{\partial E(W)}{\partial \sigma_i^p} \, \frac{\partial}{\partial w_{ij}^p} \sum_{k=1}^{m_{p-1}} w_{ik}^p \, y_k^{p-1} = \delta_i^p \, y_j^{p-1} \quad (27)$$

The factor $\delta_i^p$ introduced in Equation 27 is called the delta error of the $i$-th neuron in layer $p$. The change for an entire weight matrix $W^p$ is obtained by removing the indices $i$ and $j$ from Equation 27. This means that the weight change equals the product of $\delta^p$ in layer $p$ and the output $y^{p-1}$ of the previous layer $p-1$. The superscript $T$ denotes the transpose of a vector, i.e., $y^{p-1\,T}$ is a row vector and the product of $\delta^p$ and $y^{p-1\,T}$ is a matrix.

$$\Delta W^p = -\eta \, \delta^p \, y^{p-1\,T} \quad (28)$$

The constant $\eta$ is called the learning rate, and is a positive real number. Increasing the learning rate on the one hand speeds up the adaptation process, but on the other hand may introduce oscillations in the learning process, meaning that the descent along the error surface is not effective. One method that is often used to speed up the adaptation process without introducing oscillations is to modify Equation 28 by including a second-order term which is called a momentum (Rumelhart et al., 1986). The momentum represents a fraction of the previous weight adaptation. Let $\Delta W^p(n)$ denote the weight adaptation at the $n$-th iteration; then Equation 28 is modified as follows:

$$\Delta W^p(n+1) = -\eta \, \delta^p \, y^{p-1\,T} + \alpha \, \Delta W^p(n) \quad (29)$$

The momentum coefficient $\alpha \in [0, 1)$ is a constant which determines the effect of past weight changes on the current direction of the gradient descent. The momentum term filters high-frequency oscillations, i.e., it suppresses oscillations, allowing the learning rate to be larger.
The derivative of the error $E(W)$ with respect to the net input of a neuron may be calculated by applying the chain rule again. Namely, we calculate the derivative of $E(W)$ with respect to the outputs $y_j^p$ and multiply these by the derivatives of the $y_j^p$ with respect to the net input $\sigma_i^p$ of neuron $i$ in layer $p$:

$$\delta_i^p = \frac{\partial E(W)}{\partial \sigma_i^p} = \sum_{j=1}^{m_p} \frac{\partial E(W)}{\partial y_j^p} \, \frac{\partial y_j^p}{\partial \sigma_i^p} \quad (30)$$

Although the partial derivative of $y_j^p$ with respect to $\sigma_i^p$ is zero if $i \ne j$, the summation over $j = 1, \ldots, m_p$ in Equation 30 is entered to formulate $\delta_i^p$ as the inner product of two vectors:

$$\delta_i^p = \left( \frac{\partial y_1^p}{\partial \sigma_i^p}, \ldots, \frac{\partial y_{m_p}^p}{\partial \sigma_i^p} \right) \begin{pmatrix} \partial E(W) / \partial y_1^p \\ \vdots \\ \partial E(W) / \partial y_{m_p}^p \end{pmatrix} \quad (31)$$

This equation can be extended to a matrix multiplication resulting in $\delta^{p\,T} = (\delta_1^p, \ldots, \delta_{m_p}^p)$:

$$\delta^{p\,T} = \left( \frac{\partial E(W)}{\partial y_1^p}, \ldots, \frac{\partial E(W)}{\partial y_{m_p}^p} \right) \Psi^p \quad (32)$$

The matrix $\Psi^p$, with components $\Psi_{ij}^p = \partial y_i^p / \partial \sigma_j^p$, is called the layer differential matrix, and the delta error $\delta^p$ in layer $p$ is calculated as follows:

$$\delta^p = \Psi^{p\,T} \, \frac{\partial E(W)}{\partial y^p} \quad (33)$$

Neurons in a multi-layer network that are located in the same layer are not interconnected; hence, matrix $\Psi^p$ is diagonal, since the $i$-th component of $y^p$ depends on $\sigma_i^p$ only, i.e., $\partial y_i^p / \partial \sigma_j^p = 0$ if $i \ne j$. It can be shown from the definition of $f$ in Equation 8 that if $y = f(\sigma)$ then $y' = f'(\sigma) = y(1 - y)$. Hence, $\Psi^p$ can be presented as:

$$\Psi^p = \begin{pmatrix} y_1^p (1 - y_1^p) & & 0 \\ & \ddots & \\ 0 & & y_{m_p}^p (1 - y_{m_p}^p) \end{pmatrix} \quad (34)$$

The derivative of $E(W)$ with respect to $y_i^p$ in Equation 30 can only be calculated directly for $p = N$, i.e., for the output layer. This was to be expected, since the error is measured in the output layer, for which the target values $Y$ are specified by the example. According to the definition of $E(W)$ in Equation 25, the derivative for the output layer is:

$$\frac{\partial E(W)}{\partial y^N} = -2 (Y - y^N) = 2 (y^N - Y) \quad (35)$$

The factor 2 will be left out by absorbing it into the learning rate $\eta$ (cf. Equation 29). The delta error in layer $N$ can be calculated using Equation 33.
The error $\delta^p$ in hidden layer $p$ is calculated in terms of the error made by the subsequent layer $p+1$ (up to $N$) as follows:

$$\frac{\partial E(W)}{\partial y_i^p} = \sum_{j=1}^{m_{p+1}} \frac{\partial E(W)}{\partial \sigma_j^{p+1}} \, \frac{\partial \sigma_j^{p+1}}{\partial y_i^p} = \sum_{j=1}^{m_{p+1}} \delta_j^{p+1} \, w_{ji}^{p+1} \quad (36)$$

This result enables the formulation of the generalized delta rule for $\delta_i^p$ in its usual form, by substituting Equation 36 into Equation 30 and denoting $\partial y_i^p / \partial \sigma_i^p = f'(\sigma_i^p)$:

$$\delta_i^p = f'(\sigma_i^p) \sum_{j=1}^{m_{p+1}} w_{ji}^{p+1} \, \delta_j^{p+1} \quad (37)$$

Apparently, the error propagation according to Equation 37 looks like forward propagation (cf. Equation 24), except that it is in the opposite direction; hence it is called back propagation. We will rewrite Equation 36 as a matrix multiplication resulting in $(\partial E(W)/\partial y^p)^T = (\partial E(W)/\partial y_1^p, \ldots, \partial E(W)/\partial y_{m_p}^p)$:

$$\left( \frac{\partial E(W)}{\partial y^p} \right)^T = (\delta_1^{p+1}, \ldots, \delta_{m_{p+1}}^{p+1}) \begin{pmatrix} w_{11}^{p+1} & \cdots & w_{1 m_p}^{p+1} \\ \vdots & & \vdots \\ w_{m_{p+1} 1}^{p+1} & \cdots & w_{m_{p+1} m_p}^{p+1} \end{pmatrix} \quad (38)$$

which reduces further to:

$$\frac{\partial E(W)}{\partial y^p} = W^{p+1\,T} \, \delta^{p+1} \quad (39)$$

Combining the results found in Equations 33, 35 and 39, the following equations can be formulated for calculating the delta error vector $\delta^p$:

$$\delta^p = \begin{cases} \Psi^{p\,T} \, W^{p+1\,T} \, \delta^{p+1} & 1 \le p < N \\ \Psi^{N\,T} (y^N - Y) & p = N \end{cases} \quad (40)$$

Equation 40 is also referred to as the generalized delta rule and is used when calculating the weight change $\Delta W$ defined in Equation 29. This rule is a generalized version of the delta rule (cf. Equation 2) used for adapting weights in layered networks without hidden layers, i.e., Perceptrons (Rosenblatt, 1958).

It can be shown that the generalized delta rule also applies to threshold adaptation when a threshold parameter is added to the sigmoidal function (cf. Equation 16), i.e., Equation 24 becomes $y^p = f(\sigma^p - \theta^p)$.
If the $j$-th neuron in layer $p$ has threshold $\theta_j^p$, then the threshold change ($\Delta$) to minimize the error $E(W)$ is given by:

$$\Delta\theta_j^p = -\eta \, \frac{\partial E(W)}{\partial \theta_j^p} = -\eta \sum_{i=1}^{m_p} \frac{\partial E(W)}{\partial y_i^p} \, \frac{\partial y_i^p}{\partial \theta_j^p} = \eta \, \frac{\partial E(W)}{\partial y_j^p} \, \frac{\partial y_j^p}{\partial \sigma_j^p} = \eta \, \delta_j^p \quad (41)$$

We note that the final transition is accomplished by substituting Equation 30, taking into account that $\partial y_i^p / \partial \sigma_j^p = 0$ if $i \ne j$. Using a vector notation and adding a momentum (cf. Equation 29), the following threshold adaptation rule is used:

$$\Delta\theta^p(n+1) = \eta \, \delta^p + \alpha \, \Delta\theta^p(n) \quad (42)$$
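In the vector notation of this Appendix, one complete adaptation step fits in a few lines of Python with NumPy. The sketch below implements Equations 23, 24, 28 and 40 together with the threshold rule, with the momentum terms of Equations 29 and 42 left out for brevity; Ws and thetas are assumed to be lists holding the matrices $W^p$ and vectors $\theta^p$.

    import numpy as np

    def backprop_step(Ws, thetas, x, Y, eta=0.3):
        ys = [x]
        for W, th in zip(Ws, thetas):                 # Eqs. 23-24, with threshold
            ys.append(1.0 / (1.0 + np.exp(-(W @ ys[-1] - th))))
        delta = ys[-1] * (1 - ys[-1]) * (ys[-1] - Y)  # Eq. 40, p = N
        for p in reversed(range(len(Ws))):
            delta_next = None
            if p > 0:                                 # Eq. 40, 1 <= p < N
                delta_next = ys[p] * (1 - ys[p]) * (Ws[p].T @ delta)
            Ws[p] -= eta * np.outer(delta, ys[p])     # Eq. 28
            thetas[p] += eta * delta                  # threshold rule, Eq. 42
            delta = delta_next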
Appendix B: Back Propagation algorithm

MAINLOOP
  randomize weights
  repeat
    total error = 0
    for all examples {X, Y}
      forward propagation(X)
      back propagation(Y, error)
      total error = total error + error
    endfor
  until total error < ε
END MAINLOOP

FORWARD PROPAGATION
INPUT: x_1, ..., x_{m_0}
OUTPUT: -
  y^0 = x
  for p = 1 to N
    for i = 1 to m_p
      σ = 0
      for j = 1 to m_{p-1}
        σ = σ + w_{ij}^p · y_j^{p-1}
      endfor
      y_i^p = f(σ)
    endfor
  endfor
END FORWARD PROPAGATION

BACK PROPAGATION
INPUT: Y_1, ..., Y_{m_N}
OUTPUT: error
  error = 0
  for i = 1 to m_N                          (delta errors of the output layer)
    δ_i^N = y_i^N (1 − y_i^N)(y_i^N − Y_i)
    error = error + (y_i^N − Y_i)^2
  endfor
  for p = N−1 downto 1                      (back propagation of the delta errors)
    for i = 1 to m_p
      σ = 0
      for j = 1 to m_{p+1}
        σ = σ + w_{ji}^{p+1} · δ_j^{p+1}
      endfor
      δ_i^p = y_i^p (1 − y_i^p) · σ
    endfor
  endfor
  for p = N downto 1                        (threshold and weight adaptation)
    for i = 1 to m_p
      Δθ_i^p = η δ_i^p + α Δθ_i^p
      θ_i^p = θ_i^p + Δθ_i^p
    endfor
    for i = 1 to m_p
      for j = 1 to m_{p-1}
        Δw_{ij}^p = −η δ_i^p y_j^{p-1} + α Δw_{ij}^p
        w_{ij}^p = w_{ij}^p + Δw_{ij}^p
      endfor
    endfor
  endfor
END BACK PROPAGATION
References

J.L. McClelland and D.E. Rumelhart (1988) Training hidden units: the generalized delta rule. Chapter 5 in Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises. MIT Press, Cambridge, MA.

J.L. Elman (1988) Finding structure in time. CRL Technical Report 8801. Center for Research in Language, University of California, San Diego.

J. Henseler and P.J. Braspenning (1990) Training complex multi-layer neural networks. Proceedings of the Latvian Signal Processing International Conference, Vol. 2, Riga, 301-305.

S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi (1983) Optimization by simulated annealing. Science 220, 671-680.

Y. Le Cun (1985) A learning procedure for asymmetric threshold networks. Proceedings of Cognitiva '85 (in French), Paris, 599-604.

R.P. Lippmann (1987) An introduction to computing with neural nets. IEEE ASSP Magazine 3 (4), 4-22.

M.L. Minsky and S.A. Papert (1969, 1988) Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA.

D.B. Parker (1985) Learning-Logic. TR-47, MIT, Center for Computational Research in Economics and Management Science, Cambridge, MA.

F. Rosenblatt (1958) The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65, 386-408.

F. Rosenblatt (1962) Principles of Neurodynamics. Spartan, New York.

D.E. Rumelhart, G.E. Hinton and R.J. Williams (1986) Learning internal representations by error propagation. Chapter 8 in Parallel Distributed Processing: Foundations, Vol. 1, MIT Press, Cambridge, MA, 318-362.

F.M. Silva and L.B. Almeida (1990) Acceleration techniques for the back-propagation algorithm. Lecture Notes in Computer Science: Neural Networks 412 (Eds. L.B. Almeida and C.J. Wellekens), 110-119.

T. Tollenaere (1990) SuperSAB: fast adaptive back propagation with good scaling properties. Neural Networks 3 (5), 561-573, Pergamon Press.

P. Werbos (1974) Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Applied Mathematics, Harvard University, Boston, MA.

B. Widrow and M.E. Hoff (1960) Adaptive switching circuits. Record of the 1960 IRE WESCON Convention, New York, IRE, 96-104.
Perceptrons

H.J.M. Peters

Department of Quantitative Economics, University of Limburg, Maastricht

1 Introduction

A perceptron is a neural network that is trained under supervision. This means that the perceptron's decisions during training are compared with the desired decisions; based on this comparison the internal weights of the network are adjusted until a satisfactory result is reached. Not only the (rate of) convergence of this learning process is important, but also the problem of representation: Which (practical) problems can be written in a form suited to application of the perceptron; that is, which problems can be written as a linear threshold function? What are the implications of such representations for the efficiency and the rate of convergence of the learning process, and for the necessary storage capacity? The by now classical work of Minsky and Papert (1969, 1988), Perceptrons: An Introduction to Computational Geometry, on which this paper is based, in particular provides a detailed study of representation problems in connection with the perceptron.

The organization of this paper is as follows. Section 2 gives a brief historical account of the development of perceptron theory. In Section 3 definitions and a few preliminary results are presented. Section 4 develops some theoretical results which can be seen as exemplary for what perceptrons can and cannot do. Section 5 is on training and convergence; in particular, the basic perceptron convergence theorem is stated and proved. Section 6 contains a few concluding remarks.

2 Historical overview

Perceptrons were introduced by Rosenblatt (1959). In his Principles of Neurodynamics (1962) Rosenblatt writes:

Perceptrons ... are simplified networks, designed to permit the study of lawful relationships between the organization of a nerve net, the organization of its environment, and the "psychological" performances of which it is capable. Perceptrons might actually correspond to parts of more extended networks and biological systems; in this case, the results obtained will be directly applicable. More likely they represent extreme simplifications of the central nervous system, in which some properties are exaggerated and others suppressed. In this case, successive perturbations and refinements of the system may yield a closer approximation.

Thus, the perceptron can be regarded as a highly simplified model of the human brain, or at least part of it. Rosenblatt's book revived interest in neural
68 "connectionistic" networks, but was not the first work in this area. Neurological networks had been introduced and discussed earlier by McCulloch and Pitts in their articles A Logical Calculus of the Ideas Immanent in Nervous Activity (1943) and How We Know Universals (1947). In these articles network architectures were described which in principle were capable of recognizing spatial patterns in a way invariant under certain groups of geometric transformations. Further, Hebb's book The Organization of Behavior (1949) must be mentioned within this development. Although, in the fifties, some further developments of neural networks occurred, things became quiet by the end of this decade. To a considerable extent, this was due to the success of the serial von Neumann computer. As an aside, note that neural networks were first developed in the forties, at a time when computers hardly existed, and programming languages above a minimal standard did not exist at all; in spite of this, neural networks are now often being offered as an alternative to "old fashioned" programming. Rosenblatt's perception brought new life to an almost extinct area; this per- ceptron, in all its simplicity, appeared to be capable of "learning" certain things. On the other hand, it turned out that perceptrons were not able to learn certain other things, in spite of all the effort put into extending and refining the training process, and building bigger machines. Namely, most researchers in the field were looking for more general methods which should make the perceptron capable of handling a large(r) class of problems. This is not true as far as Minsky en Papert in their book Perceptrons: An Introduction to Computational Geometry (1969) are concerned. Instead of looking for a method which would work in every possible situation, they provided a mathematical analysis and explanation of the fact that the particular method used by the perceptron performs well in some cases and badly in other cases. Consequently, the book reveals not only the possibilities but also the restrictions of the perceptron; for this reason, the limited interest in neural networks during the seventies has often been ascribed to the publication of this book. In the republication of the book in 1988, Minsky and Papert remark that research in the area of neural networks had come to a halt already at an earlier stage, due mainly to a lack of fundamental theories. Too much (vain) effort had been invested in the simple and somewhat ad hoc training process, at the expense of the more important problem of representation of knowledge. Indeed, during the seventies research in this last area has expanded enormously. The present revival in the field of neural networks and, more generally, of par- allellism (or "connectionism") is, among other things, perhaps due to the further development of multilayered perceptrons; these perceptrons will not be discussed in this paper (see, however, the contributions by Henseler and by Weijters and Hoppenbrouwers). It should be mentioned that Minsky and Papert have been rather sceptical concerning the possibilities of multilayered perceptrons, which makes the above reproach understandable. At this moment it is not yet clear whether multilayered perceptrons will lead to a breakthrough in parallel computing. The emphasis, however, tha.t Minsky and Papert give to the importance
of fundamental theories of knowledge representation remains justified.

3 Perceptrons and linear threshold functions

Perceptrons were introduced by Rosenblatt (1959, 1962). The present paper is based on Minsky and Papert (1969). The concept of a perceptron is illustrated by Figure 1. Figure 1 shows the principle of parallel computation in general, and of the perceptron in particular. In this figure the general principle of parallel computation is applied to a problem of pattern recognition. The letter "X" is drawn in a plane, which is being scanned by local sensors. The information of these sensors is passed on to functions $\varphi_i$, which assign a certain value to it. For example, the plane $R$ may be divided into small squares that are black or white depending on the pattern, in this case the letter "X". The function $\varphi_i$ assigns a certain value depending on the configuration of white and black squares in its domain, which is the part of $R$ covered by the corresponding sensor. In $\Omega$ the values of the $\varphi_i$'s are combined, leading to a certain value of the function $\psi$; from this value it may be inferred, for example, that the pattern under consideration is the letter "X", or a cross and not a circle. An essential feature is that by parallel processing a global statement is obtained from local information.

Fig. 1. Parallelism, and the perceptron.

In a perceptron, the function $\psi$ is a predicate which is itself a linear combination of predicates $\varphi_i$. A predicate is a function of subsets of $R$ that has two possible values.
We think of these values as representing truth or falseness, and it is customary to associate 1 with "true" and 0 with "false". Let $\Phi = \{\varphi_1, \varphi_2, \ldots, \varphi_n\}$ be a set of predicates. The predicate $\psi$ is called linear with respect to $\Phi$ if there exist a number $\theta$ and numbers $\alpha_1, \alpha_2, \ldots, \alpha_n$ such that, for every $X \subseteq R$, $\psi(X) = 1$ if and only if $\alpha_1 \varphi_1(X) + \ldots + \alpha_n \varphi_n(X) > \theta$. The number $\theta$ is the threshold and the $\alpha_i$'s are the coefficients or weights. The predicate $\psi$ is called a linear threshold function. More compactly, $\psi$ can be written as

$\psi(X) = [\,\sum_i \alpha_i \varphi_i(X) > \theta\,]$.

Here, $[\ldots]$ is a predicate assigning to the expression between the brackets the value 0 if the expression is false, and 1 if the expression is true. A perceptron is a device capable of computing all predicates which are linear with respect to some given set $\Phi$ of partial predicates. For instance, suppose the retina $R$ is divided into a (finite) number of squares, and associate with each square $i$ the predicate $\varphi_i$: "the square is black". Thus, the predicate $\varphi_i$ has value 1 if and only if the corresponding square is black. By taking all weights $\alpha_i$ equal to 1, and $\theta$ equal to 25, the linear threshold function $\psi$ assigns value 1 to a pattern $X$ in $R$ if and only if $X$ occupies more than 25 squares. Furthermore, in a perceptron the weights are adjustable by a learning process; see Section 5.

In what follows, a special kind of predicate called a "mask" plays an important role. Suppose, as above, that the (bounded) retina $R$ is divided into a finite number of squares. We identify these squares with points. With each point $p$ of $R$ the predicate $\varphi_p$ is associated, defined by $\varphi_p(X) := [p \in X]$ for every $X \subseteq R$. More generally, with each subset $A$ of $R$ a predicate $\varphi_A : X \mapsto [A \subseteq X]$, the mask of $A$, is associated. These masks can be used in definitions of predicates. For instance:

[X contains at least M points] $= [\,\sum_{p \in R} \varphi_p(X) > M - 1\,]$,
[X contains more points than Y] $= [\,\sum_{p \in R} (\varphi_p(X) - \varphi_p(Y)) > 0\,]$,
[X is for the larger part located in the right half of R] $= [\,\sum_{p \in \text{right half}} \varphi_p(X) - \sum_{p \in \text{left half}} \varphi_p(X) > 0\,]$.

Suppose $R$ contains $n$ points. Then each predicate, being a function that assigns 0 or 1 to every subset of $R$, can be identified with a vector in $2^n$-dimensional Euclidean space (with coordinates in $\{0, 1\}$). It is easy to verify that the $2^n$ masks form a basis of this space; consequently, each predicate can be written as a unique linear combination of masks. In particular, this implies the following theorem.

Theorem 1 If the retina $R$ is finite, then each predicate is a linear threshold function with respect to the set of all masks.

Consequently, if $R$ is finite, each predicate can be computed by a perceptron; therefore, problems that can be represented as a predicate can be computed by a perceptron.
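These definitions are concrete enough to execute. In the Python sketch below, the retina and the patterns are finite sets of points; predicate builds a linear threshold function ψ from weighted predicates, and mask builds φ_A. The names are ours, not notation from the paper.

    def predicate(weighted_phis, theta):
        # psi(X) = [ sum_i alpha_i * phi_i(X) > theta ]
        return lambda X: int(sum(a * phi(X) for a, phi in weighted_phis) > theta)

    def mask(A):
        # the mask of A: phi_A(X) = [ A is a subset of X ]
        return lambda X: int(set(A) <= set(X))

    # the example above: one degree-1 mask per square, all weights 1 and
    # threshold 25, so psi(X) = 1 iff X occupies more than 25 squares
    R = range(100)
    psi = predicate([(1, mask({p})) for p in R], 25)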
The performance of a perceptron mainly depends on the following two factors:

- How "local" are the predicates $\varphi$?
- How many predicates are needed, and what are the proportions of their weights?

Minsky and Papert (1969) distinguish between several measures of "localness" of a predicate. The most important of these are the maximal diameter of the area of $R$ to which the predicate is restricted, and the maximal number of points determining the value of the predicate. We will confine our attention to the latter measure. In order to give a formal definition, let the support $S(\varphi)$ of an arbitrary predicate $\varphi$ be the smallest subset $S$ of $R$ with $\varphi(X) = \varphi(X \cap S)$ for every subset $X$ of $R$. It is not hard to show that, if the support exists, then it is unique. The cardinality $|S(\varphi)|$ of $S(\varphi)$ is called the degree of $\varphi$. Predicates with small supports are, generally speaking, too local to be interesting. We are, however, interested in predicates which have $R$ as support but can be expressed as a combination of predicates with small supports. The order of a predicate $\psi$ is the smallest number $k$ such that there is a collection of predicates $\Phi = \{\varphi\}$ with respect to which $\psi$ is a linear threshold function and with $|S(\varphi)| \le k$ for all $\varphi \in \Phi$. Observe that the order of a predicate $\psi$ does not depend on its specific representation.

Masks have order 1, because for each subset $A$ of $R$

$\varphi_A(X) = [\,\sum_{p \in A} \varphi_p(X) > |A| - 1\,]$,

i.e., $\varphi_A$ is a linear threshold function with respect to predicates of degree 1. Note, however, that the degree of $\varphi_A$ is equal to $|A|$. An example of a predicate of order 2 is the "counting predicate"

$\psi_M(X) := [\,|X| = M\,]$,

where $M$ is a nonnegative integer at most $|R|$. This can be seen as follows. Assume the points of $R$ are numbered, with masks $\varphi_i$ for points $i$ and $\varphi_{ij}$ for two-point sets consisting of points $i$ and $j$ ($i, j = 1, 2, \ldots$). Then

$\psi_M(X) = [\,(2M - 1) \sum_i \varphi_i(X) + (-2) \sum_{i < j} \varphi_{ij}(X) > M^2 - 1\,]$.

For the right-hand side of this equality yields

$[\,(2M - 1)|X| - |X|(|X| - 1) - M^2 > -1\,] = [\,(|X| - M)^2 < 1\,]$,

which has value 1 if and only if $|X| = M$. This shows that the order is at most 2. That the order is exactly 2 follows from Corollary 3 in the next section. The counting predicate is an example of a predicate for which a perceptron would perform quite well. The number of local predicates is relatively small, namely $|R| + \frac{1}{2}|R|(|R| - 1)$, and the weights are not too large, so that we can expect a reasonable rate of convergence during the training phase.
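The order-2 representation of the counting predicate can be checked directly. A small sketch, using one mask per point and one per two-point set:

    from itertools import combinations

    def psi_M(R, M):
        # [ |X| = M ] as a linear threshold function of degree <= 2 masks:
        # (2M - 1) * sum_i phi_i(X) - 2 * sum_{i<j} phi_ij(X) > M^2 - 1
        def psi(X):
            s1 = sum(1 for p in R if p in X)
            s2 = sum(1 for p, q in combinations(R, 2) if p in X and q in X)
            return int((2 * M - 1) * s1 - 2 * s2 > M * M - 1)
        return psi

    R = set(range(6))
    psi = psi_M(R, 3)
    assert all(psi(X) == (len(X) == 3)
               for X in [set(), {0}, {0, 1}, {0, 1, 2}, {0, 1, 2, 3}, R])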
4 Easy and difficult predicates

Some predicates are "difficult" in the sense that they are of high order and that the weights in a representation are large. It is a consequence of theorem 1 that if the order of a predicate is equal to $k$, then the predicate can be written as a linear threshold function with respect to the set of masks of maximal degree $k$; namely, a predicate of order $k$ can be written as a linear threshold function of predicates of degree at most $k$, and these can be written as linear threshold functions of masks of degree at most $k$. Consequently, to find the order of a predicate one only needs to consider representations in terms of masks.

It is not always clear at first sight whether the order of a predicate is small or big. For example, the interesting predicate which tells us whether a certain pattern in $R$ is convex turns out to be of order at most 3, whereas the order of the predicate which recognizes connected patterns grows without bound with the size of $R$. In this section, among other things, these statements will be proved, together with a more general result, the group invariance theorem. This theorem applies to predicates which are invariant under certain permutations of the retina (for instance, the exact location of a pattern on the retina does not influence its convexity or connectedness), and states that the weights in a linear threshold function representation can be chosen independent of such permutations.

Throughout it is assumed that the retina $R$ is a finite approximation of a (bounded) subset of the Euclidean plane; for instance, think of the page in front of you as being divided into a finite number of small squares. A subset $X$ of $R$ is convex if with each pair of points in $X$ also all points on the connecting line segment are in $X$. (Of course, some caution is in order here because $R$ is finite, but this caution is presumed from now on.) Figure 2 shows some examples of nonconvex sets. Observe that in a nonconvex set $X$ there is always a pair of points $a, b$ of which the midpoint is not in $X$. Based on this observation, we can define
$$\psi_{\mathrm{CONVEX}}(X) = \Big[\sum \varphi_{\{x_i, x_j, x_k\}}(X) - \varphi_{\{x_i, x_k\}}(X) > -1\Big]$$
as the predicate recognizing convexity of $X$. Here, summation is over all triples $x_i, x_j, x_k$ with $x_j$ the midpoint of the other two points. Obviously, the order of $\psi_{\mathrm{CONVEX}}$ is at most three.

Before continuing with more "difficult" predicates we first formulate and prove the group invariance theorem. By way of an illustration, suppose we wish a predicate to recognize the letter "A" no matter where it is located on the retina. Then this predicate should not depend on certain permutations of the retina, e.g., certain translations. In order to formalize this notion, the concept of a group is important. Suppose $G$ is a set with an operation under which that set is closed. Thus, denoting the operation by juxtaposition, we have $gh \in G$ for all $g, h \in G$. Now $G$ is called a group if, additionally, the following conditions are satisfied:
(i) There is an element of unity, $e \in G$, with $eg = ge = g$ for all $g \in G$.
(ii) Each element $g \in G$ has an inverse element $g^{-1} \in G$ with $gg^{-1} = g^{-1}g = e$.
(iii) The group operation is associative: $(gh)i = g(hi)$ for all $g, h, i \in G$.
Fig. 2. Nonconvex sets.

In the present context the group of all permutations $P$ of the finite retina $R$ is of interest. For a subgroup $G$ of $P$, we say that two subsets $X$ and $Y$ of $R$ are $G$-equivalent, denoted by $X \equiv_G Y$, if there is an element $g$ in $G$ for which $X = g(Y)$. For instance, if $G$ consists of all horizontal translations of $R$ (think of $R$ as transformed into a cylinder by gluing the left and right ends together, for instance), then the letter "A" somewhere on $R$ is equivalent to the letter "A" shifted over any distance to the left or right. It is easy to verify that $\equiv_G$ is indeed an equivalence relation in the usual mathematical meaning of the word. We further say that two predicates $\varphi$ and $\varphi'$ are $G$-equivalent, denoted by $\varphi \equiv_G \varphi'$, if there is an element $g$ in $G$ for which $\varphi(g(X)) = \varphi'(X)$ for every $X \subseteq R$. Also this defines an equivalence relation in the usual sense. In the example above, the predicate recognizing an "A" located at certain fixed coordinates is equivalent under the group $G$ of horizontal translations to predicates recognizing the "A" located at different horizontal coordinates. Finally, we say that the set of predicates $\Phi$ is closed under the group $G$ if for every $\varphi$ in $\Phi$ and $g$ in $G$ the predicate $\varphi g : X \mapsto \varphi(g(X))$ is also in $\Phi$. For instance, the set of predicates such that each one recognizes the "A" at a different horizontal location is closed under the group of all horizontal translations. Now the group invariance theorem can be stated.

Theorem 2 Let $G$ be a subgroup of the group $P$ of permutations of the finite retina $R$, and let $\Phi$ be a set of predicates closed under $G$. Let the predicate $\psi$ be a linear threshold function with respect to $\Phi$ that is invariant under $G$. Then there exists a linear representation of $\psi$ for which the coefficients depend only on the equivalence classes of the predicates in $\Phi$, that is, $\beta(\varphi) = \beta(\varphi')$ whenever $\varphi \equiv_G \varphi'$.

Proof Let $\psi$ have a linear representation $[\sum_{\varphi \in \Phi} \alpha(\varphi)\varphi(X) > 0]$. (This is without loss of generality, for if the threshold is unequal to 0 we can always
normalize by adding the predicate $\varphi_0$, with constant value 1, to $\Phi$. At the end of the proof we can drop this additional predicate again.) For any $g \in G$, the map $\varphi \mapsto \varphi g$ is a bijection on $\Phi$, so that
$$\sum_{\varphi \in \Phi} \alpha(\varphi)\varphi(X) = \sum_{\varphi \in \Phi} \alpha(\varphi g)\,\varphi g(X)$$
for all $X$, because the same numbers are added in both sums. Let $X$ be a subset of $R$ with $\psi(X) = 1$. Then, for each $g \in G$, by $G$-invariance of $\psi$,
$$\sum_{\varphi \in \Phi} \alpha(\varphi g)\,\varphi g(g^{-1}(X)) > 0,$$
and therefore
$$\sum_{\varphi \in \Phi} \alpha(\varphi g)\,\varphi(X) > 0.$$
Summing over all $g$ in $G$ and interchanging summation signs, we obtain
$$\sum_{\varphi \in \Phi} \Big(\sum_{g \in G} \alpha(\varphi g)\Big)\varphi(X) > 0,$$
which can be written as
$$\sum_{\varphi \in \Phi} \beta(\varphi)\varphi(X) > 0,$$
with $\beta(\varphi) := \sum_{g \in G} \alpha(\varphi g)$ for all $\varphi \in \Phi$. The same argument for an $X$ with $\psi(X) = 0$ will show that $\sum_{\varphi \in \Phi} \beta(\varphi)\varphi(X) \leq 0$. Combining the two inequalities yields
$$\psi(X) = \Big[\sum_{\varphi \in \Phi} \beta(\varphi)\varphi(X) > 0\Big].$$
Finally, suppose that $\varphi \equiv_G \varphi'$, and let $h \in G$ with $\varphi = \varphi' h$. Then
$$\beta(\varphi) = \sum_{g \in G} \alpha(\varphi g) = \sum_{g \in G} \alpha(\varphi' h g) = \sum_{g \in G} \alpha(\varphi' g) = \beta(\varphi'),$$
where the third equality derives from the fact that the bijection $g \mapsto hg$ simply permutes the order of adding the same numbers. This concludes the proof. □

A first consequence of theorem 2 is the following corollary.

Corollary 3 Let $G$ be a group of permutations of $R$ with the property that for any pair of points $p, q$ of $R$ there is a $g \in G$ with $g(p) = q$. Then the only first-order predicates invariant under $G$ are $\psi(X) = [\,|X| > m\,]$, $\psi(X) = [\,|X| \geq m\,]$, $\psi(X) = [\,|X| < m\,]$, and $\psi(X) = [\,|X| \leq m\,]$, for some $m$.
Proof Let $p, q \in R$, $X \subseteq R$, and let $g \in G$ with $g(p) = q$. Then $\varphi_p(X) = \varphi_q(g(X))$, so $\varphi_p \equiv_G \varphi_q$. Therefore, in view of theorems 1 and 2 we may assume
$$\psi(X) = \Big[\sum_{p \in R} \alpha\,\varphi_p(X) > \theta\Big]$$
for a first-order predicate $\psi$ invariant under $G$. For $\alpha > 0$ this is equivalent to $\psi(X) = [\,|X| > \theta/\alpha\,]$. The other predicates are obtained for $\alpha < 0$ or $\alpha = 0$ and by rewriting. □

Another consequence of the group invariance theorem concerns the following predicate
$$\psi_{\mathrm{ODD}}(X) = [\,|X| \text{ is an odd number}\,].$$
We consider this predicate because it illustrates the mathematical methods used and the kind of questions they enable us to discuss. It turns out that this predicate is of maximal order:

Theorem 4 $\psi_{\mathrm{ODD}}$ is of order $|R|$.

Proof Obviously, $\psi_{\mathrm{ODD}}$ is invariant under the group of all permutations. Suppose the order of $\psi_{\mathrm{ODD}}$ is equal to $m$. By theorems 1 and 2 it can be written as
$$\psi_{\mathrm{ODD}}(X) = \Big[\sum_{j=0}^{m} \beta_j \sum_{\varphi \in \Phi_j} \varphi(X) > 0\Big],$$
where $\Phi_j$ contains all masks of degree $j$. (The threshold can be taken equal to 0 without loss of generality.) Observe that, for every $j$,
$$\sum_{\varphi \in \Phi_j} \varphi(X) = \binom{|X|}{j} = \frac{|X|(|X|-1)\cdots(|X|-j+1)}{j!},$$
which is a polynomial of degree $j$ in $|X|$. It follows that
$$\sum_{j=0}^{m} \beta_j \Big(\sum_{\varphi \in \Phi_j} \varphi(X)\Big)$$
is a polynomial of degree at most $m$ in $|X|$, say $P(|X|)$. Consider a sequence $X_0, X_1, \ldots, X_{|R|}$ of subsets of $R$ with $|X_i| = i$. Since $P(|X|) > 0$ if and only if $|X|$ is odd,
$$P(|X_0|) \leq 0, \quad P(|X_1|) > 0, \quad P(|X_2|) \leq 0, \ \ldots,$$
which is only possible if the degree of the polynomial $P(|X|)$ is at least $|R|$. But this implies $m \geq |R|$. □

The following theorem implies that the number of predicates needed in a representation of $\psi_{\mathrm{ODD}}$ is large.
Theorem 5 Suppose $\psi_{\mathrm{ODD}}$ is represented as a linear threshold function with respect to a set of predicates $\Phi$ containing only masks. Then $\Phi$ contains all masks.

Proof Suppose, to the contrary, that the mask $\varphi_A$ ($A \subseteq R$) is not an element of $\Phi$, and that $\psi_{\mathrm{ODD}} = [\sum_{\varphi \in \Phi} \alpha(\varphi)\varphi > \theta]$. For any predicate $\varphi$ define $\varphi^A$ by $X \mapsto \varphi(X \cap A)$. Then, for every $\varphi \in \Phi$, $\varphi^A = \varphi$ if $S(\varphi) \subseteq A$, and $\varphi^A$ is identically zero otherwise. Let $\Phi_A$ be the set of masks in $\Phi$ whose supports are subsets of $A$. Then $\psi_{\mathrm{ODD}}^A = [\sum_{\varphi \in \Phi_A} \alpha(\varphi)\varphi > \theta]$, and $|S(\varphi)| < |A|$ for all $\varphi \in \Phi_A$. This contradicts theorem 4, because it implies that the order of $\psi_{\mathrm{ODD}}^A$, viewed as a predicate on the retina $A$, is less than $|A|$. □

Summarizing, the predicate $\psi_{\mathrm{ODD}}$ has order equal to the cardinality of the retina, and in a linear threshold function representation with masks all masks are needed. Furthermore, by a combinatorial argument it can be shown that in such a representation the weights grow at least as fast as $2^{|S(\varphi)|-1}$ (see theorem 10.1 in Minsky and Papert). Such a representation is given by
$$\psi_{\mathrm{ODD}}(X) = \Big[-\sum_{\varphi} (-2)^{|S(\varphi)|}\varphi(X) > 1\Big],$$
where summation is over all masks with nonempty support. Consequently, a perceptron not only has to compute a large number of predicates, but the weights of these predicates also increase exponentially. For instance, for a relatively small retina of $5 \times 5$ squares the number of masks is $2^{25}$ and, in absolute value, the largest weight is $2^{25}$; thus, the internal proportions of the weights grow exponentially large.

As a final example the predicate $\psi_{\mathrm{CONNECTED}}$ will be considered. Call two points of the finite retina $R$ adjacent if they correspond to squares with a common edge. A subset $X$ of $R$ is connected if for any two points $p, q$ in $X$ there is a path of adjacent points in $X$ through $p$ and $q$. Connectedness is an important feature in pattern recognition. It will be shown that $\psi_{\mathrm{CONNECTED}}$ has arbitrarily large order as $R$ grows in size. We first prove the following theorem.

Theorem 6 Let $A_1, \ldots, A_m$ be disjoint subsets of $R$ with equal cardinalities $4m^2$, and define the predicate $\psi(X) = [\,|X \cap A_i| > 0 \text{ for every } i\,]$. Then the order of $\psi$ is at least $m$.

Proof Let $G$ be the group of all permutations $g$ of $R$ with $g(A_i) = A_i$, $i = 1, \ldots, m$, and $g(p) = p$ for every $p \in R \setminus \bigcup_i A_i$. Clearly, $\psi$ is invariant with respect to $G$. Let $\Phi$ be the set of masks of degree $k$ or less, where $k$ is some number at most $|R|$. Note that, for $\varphi, \varphi' \in \Phi$, $\varphi \equiv_G \varphi'$ if and only if $|S(\varphi) \cap A_i| = |S(\varphi') \cap A_i|$ for every $i$. Let $\Phi_1, \Phi_2, \ldots$ denote the corresponding equivalence classes. For every equivalence class $\Phi_j$ and every subset $X$ of $R$ let $N_j(X) := |\{\varphi \in \Phi_j : S(\varphi) \subseteq X\}|$. By a simple combinatorial argument,
$$N_j(X) = \binom{|X \cap A_1|}{|S(\varphi) \cap A_1|}\binom{|X \cap A_2|}{|S(\varphi) \cap A_2|}\cdots\binom{|X \cap A_m|}{|S(\varphi) \cap A_m|},$$
where $\varphi$ is an arbitrary element of $\Phi_j$. This implies that $N_j(X)$ is a polynomial $N_j(x_1, \ldots, x_m)$ of degree at most $k$, obtained by taking $x_i = |X \cap A_i|$. Suppose $[\sum_{\varphi} \alpha_\varphi \varphi > 0]$ is a representation of $\psi$ as a linear threshold function with respect to the set of masks of degree at most $k$; for what follows we can take the threshold equal to zero without loss of generality. By theorem 2, the group invariance theorem, we can write this representation as
$$\psi(X) = \Big[\sum_j \beta_j \sum_{\varphi \in \Phi_j} \varphi(X) > 0\Big],$$
and $\sum_{\varphi \in \Phi_j} \varphi(X) = N_j(X)$, which is itself a polynomial of degree at most $k$. Thus, we can write
$$\psi(X) = [\,Q(x_1, \ldots, x_m) > 0\,],$$
where $Q := \sum_j \beta_j N_j$ is a polynomial of degree at most $k$. Consequently, by the definition of $\psi$ and $x_i$, $Q(x_1, \ldots, x_m) > 0$ if and only if $x_i > 0$ for all $i$. By making the substitution $x_i = (t - (2i-1))^2$ in $Q(x_1, \ldots, x_m)$, $Q$ becomes a polynomial of degree at most $2k$ in $t$. Let $t$ take on the values $t = 0, 1, \ldots, 2m$. Then $0 \leq x_i \leq 4m^2$ for all $x_i$. Observe that one of the $x_i$'s equals zero for $t$ odd, and all $x_i$'s are positive if $t$ is even. So $Q$ is positive for even $t$ and nonpositive for odd $t$. By counting the number of sign changes we obtain $2k \geq 2m$, so $k \geq m$, which concludes the proof. □

Minsky and Papert call theorem 6 the "one-in-a-box" theorem, since the predicate investigated in this theorem is true for those patterns which have a nonempty intersection with each member of a given collection of disjoint subsets of $R$. Theorem 6 will be used to prove the announced result concerning $\psi_{\mathrm{CONNECTED}}$. Call a predicate, defined for differently sized retinas, of finite order if there is a number $k$ such that the order of the predicate is at most $k$ whatever the size of the (finite) retina.

Theorem 7 The predicate $\psi_{\mathrm{CONNECTED}}$ is not of finite order.

Proof Suppose the order of $\psi_{\mathrm{CONNECTED}}$ is uniformly bounded by $k$, and let $m > k$. Consider an array of $2m+1$ rows, each containing $4m^2$ squares; see figure 3. For each $i = 1, \ldots, m$ let $A_i$ be the set of points (squares) of the $2i$-th row. Let $R$ be the union of the even rows, i.e., of the $A_i$, and $\tilde{R}$ the union of the odd rows. Define the predicate $\psi$ on $R$ by $\psi(X) = 1$ if and only if $\psi_{\mathrm{CONNECTED}}(X \cup \tilde{R}) = 1$. Observe that $X \cup \tilde{R}$ is connected if and only if $X$ intersects every $A_i$, because two consecutive odd rows are joined only through points of $X$ in the even row between them; hence $\psi$ is exactly the predicate of theorem 6. Let $\psi_{\mathrm{CONNECTED}}$ have a representation $[\sum \alpha(\varphi)\varphi > \theta]$ where the $\varphi$'s are masks of degree at most $k$. Define, for a mask $\varphi = \varphi_A$ ($A \subseteq R \cup \tilde{R}$), the predicate $\varphi'$ by $X \mapsto \varphi_{A \cap R}(X)$ ($X \subseteq R$). Then $\varphi'(X) = 1$ if and only if $\varphi(X \cup \tilde{R}) = 1$; consequently, $[\sum \alpha(\varphi)\varphi' > \theta]$ is a representation for $\psi$ of order at most $k$. Because $k < m$, this contradicts theorem 6 applied to $\psi$. □

5 Learning and convergence

As is apparent from the preceding sections, the usefulness of perceptrons, and of neural networks in general, is intimately related to the representation of knowledge.
Fig. 3. Proof of theorem 7: the rows $A_1, A_2, \ldots, A_m$.

An essential feature of the perceptron is, however, that it can be trained. Because it is able to learn, one does not have to know the exact representation of a particular predicate in order to apply a perceptron. Recall that a perceptron computes predicates of the form
$$\psi(X) = \Big[\sum_{\varphi} \alpha_\varphi \varphi(X) > \theta\Big].$$
This is an exact representation of the predicate. For many complex problems, however, we do not know this exact representation; in particular, we do not know the weights $\alpha_\varphi$. A perceptron is programmed (in parallel, or by simulation) in such a way that these weights can be adapted. The learning process starts with a more or less arbitrary set of weights. Next, the perceptron is "fed" some examples, for instance the complete set of objects to be classified, or a representative subset. For each example the output of the perceptron is compared with the desired output, and if necessary the weights are adapted. This process is repeated until a reasonable result is obtained. For an example, see the contribution by Weijters and Hoppenbrouwers in this book.

The algorithm to adapt the weights $\{\alpha_\varphi : \varphi \in \Phi\}$ may be as follows. Suppose we have a collection of patterns $F = F^+ \cup F^-$ we wish to classify and, for convenience, assume $\theta = 0$. We will denote a set of weights $\{\alpha_\varphi : \varphi \in \Phi\}$ as a vector $A$ in $|\Phi|$-dimensional space. Further, for $X \in F$ the vector with the values $\varphi(X)$ as coordinates is denoted by $\Phi(X)$. The predicate $\psi$ classifying the patterns in $F$ can be written as
$$\psi(X) = [\,A \cdot \Phi(X) > 0\,]$$
for some weight vector $A$, where we assume that $A \cdot \Phi(X) > 0$ if $X \in F^+$ and $A \cdot \Phi(X) < 0$ if $X \in F^-$. Consider the following "learning algorithm":

Start Choose an arbitrary vector $A$.
Test Choose an $X \in F$.
If $X \in F^+$ and $A \cdot \Phi(X) > 0$: go to Test.
If $X \in F^+$ and $A \cdot \Phi(X) \leq 0$: go to Add.
If $X \in F^-$ and $A \cdot \Phi(X) < 0$: go to Test.
If $X \in F^-$ and $A \cdot \Phi(X) \geq 0$: go to Subtract.
Add Replace $A$ by $A + \Phi(X)$. Go to Test.
Subtract Replace $A$ by $A - \Phi(X)$. Go to Test.

Summarizing, if a pattern $X$ is classified in the right way, then the next test pattern is chosen; if a pattern $X$ is wrongly classified as belonging to $F^-$, then the corresponding $\Phi$-vector is added to $A$; if a pattern $X$ is wrongly classified as belonging to $F^+$, then the corresponding $\Phi$-vector is subtracted from $A$. Surprisingly enough, it turns out that this simple algorithm works. We prove this result for a simpler formulation of the learning algorithm. Instead of distinguishing between vectors $\Phi(X)$ for patterns $X$, we will simply distinguish between vectors $\Phi$ in a collection $F$ of zero-one vectors. Consider the following program.

Start Set $A$ to an arbitrary $\Phi$ of $F$.
Test Choose an arbitrary $\Phi \in F$.
  If $A \cdot \Phi > 0$ go to Test;    (P)
  otherwise go to Add.
Add Replace $A$ by $A + \Phi$. Go to Test.

Observe that this program can indeed replace the previous one by taking for $F$ in (P) the set $\{\Phi(X) : X \in F^+\} \cup \{-\Phi(X) : X \in F^-\}$. The following theorem is known as the perceptron convergence theorem.

Theorem 8 Assume there exists a vector $A^*$ for which $A^* \cdot \Phi > 0$ for all $\Phi$ in $F$; then program (P) will go to Add only a finite number of times.

Proof Let $\|\cdot\|$ denote the Euclidean norm, and let $m$ be the number of predicates $\varphi$, so that $\|\Phi\|^2 \leq m$ for every zero-one vector $\Phi$ in $F$. Since $F$ is a finite set, there is a number $\delta > 0$ with $A^* \cdot \Phi \geq \delta$ for all $\Phi$ in $F$. Define the map $C : A \mapsto (A^* \cdot A)/\|A\|$. The Cauchy-Schwarz inequality, $|A^* \cdot A| \leq \|A^*\|\,\|A\|$, implies $C(A) \leq \|A^*\|$ for all vectors $A$. We consider the behavior of $C(A)$ on successive passes of the program through Add. Then
$$A^* \cdot A_{t+1} = A^* \cdot (A_t + \Phi) = A^* \cdot A_t + A^* \cdot \Phi \geq A^* \cdot A_t + \delta,$$
so that, after the $n$-th application of Add we obtain
$$A^* \cdot A_n \geq n\delta. \qquad (1)$$
Because $A_t \cdot \Phi$ must be nonpositive (or the program would not have gone through Add), we further have
$$\|A_{t+1}\|^2 = (A_t + \Phi) \cdot (A_t + \Phi) = \|A_t\|^2 + 2A_t \cdot \Phi + \|\Phi\|^2 \leq \|A_t\|^2 + m,$$
so that, after the $n$-th application of Add we obtain
$$\|A_n\|^2 \leq nm. \qquad (2)$$
Combining equations (1) and (2) yields
$$C(A_n) = \frac{A^* \cdot A_n}{\|A_n\|} \geq \frac{n\delta}{\sqrt{nm}}.$$
Because $C(A) \leq \|A^*\|$, the program can pass through Add only as long as $n \leq m\|A^*\|^2/\delta^2$. This completes the proof. □

Remark It is easy to verify that theorem 8 still holds if $F$ is a compact set instead of a collection of zero-one vectors.

The algorithm in theorem 8 will after finitely many steps result in a vector $A^0$ which has the property that $A^0 \cdot \Phi > 0$ for all $\Phi$ in $F$; the proof of the theorem actually gives an indication of the rate of convergence. In terms of the original problem, the predicate $\psi = [A^0 \cdot \Phi > 0]$ will have the following (desired) property:
$$X \in F^- \Rightarrow \psi(X) = 0, \qquad X \in F^+ \Rightarrow \psi(X) = 1.$$
This is often expressed as "the predicate $\psi$ separates the sets $F^+$ and $F^-$." Of course, the vector $A^0$ does not have to be equal to $A^*$.

There exist some variations on this algorithm, for which variations of the perceptron convergence theorem hold. An important variation is classification into more than two classes. We conclude this section by formulating the corresponding algorithm. Let $F_1, F_2, \ldots$ be classes of patterns and assume that there exist a number $\delta > 0$ and vectors $A_i^*$ for which, for all $j \neq i$,
$$X \in F_i \Rightarrow A_i^* \cdot \Phi(X) \geq A_j^* \cdot \Phi(X) + \delta.$$
The corresponding training program is as follows.

Start Choose $A_1, A_2, \ldots$ ($\neq 0$) arbitrary.
Test Choose $i, j$ and $X \in F_i$.
  If $A_i \cdot \Phi(X) > A_j \cdot \Phi(X)$, go to Test;
  otherwise go to Change.
Change Replace $A_i$ by $A_i + \Phi(X)$, and $A_j$ by $A_j - \Phi(X)$; go to Test.

Under the mentioned conditions, this program will go to Change only a finite number of times.
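To make program (P) concrete, here is a minimal sketch in Python. The training set is a hypothetical example of our own choosing, constructed from $F^+$ and $F^-$ exactly as described above; cycling through $F$ is one admissible way of "choosing an arbitrary $\Phi$":

import numpy as np

def program_P(F, max_passes=1000):
    # F is a list of vectors: {Phi(X) : X in F+} united with {-Phi(X) : X in F-}.
    A = np.array(F[0], dtype=float)        # Start: set A to an arbitrary Phi of F
    for _ in range(max_passes):
        changed = False
        for Phi in F:                      # Test: choose a Phi in F
            if np.dot(A, Phi) <= 0:        # not yet correct: go to Add
                A = A + Phi                # Add: replace A by A + Phi
                changed = True
        if not changed:                    # A . Phi > 0 for all Phi in F: done
            return A
    raise RuntimeError("no separating vector found; F may not be separable")

# Hypothetical patterns: F+ = {(1,1,0), (1,0,1)}, F- = {(0,1,1), (0,0,1)}.
F_plus = [np.array([1, 1, 0]), np.array([1, 0, 1])]
F_minus = [np.array([0, 1, 1]), np.array([0, 0, 1])]
F = F_plus + [-Phi for Phi in F_minus]
A0 = program_P(F)
assert all(np.dot(A0, Phi) > 0 for Phi in F)   # A0 separates F+ and F-

By theorem 8 the loop terminates whenever a separating vector $A^*$ exists, and the bound $m\|A^*\|^2/\delta^2$ on the number of additions also bounds the running time.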
6 Concluding remarks

The main conclusion of the preceding sections is that the usefulness of the simplest of neural networks, the perceptron, depends essentially on the representation of the problem to be handled. If many masks are needed, or if the internal proportions of the weights in an exact representation are large, then the perceptron might not perform very well. Examples were given in section 4. For instance, for the predicate $\psi_{\mathrm{ODD}}$ it can be shown that the number of "learning" examples must grow exponentially with the number of squares, i.e., with the size of the problem. On the other hand, Minsky and Papert present some examples where the perceptron does perform well, such as the recognition of convexity, the recognition of hollow versus solid squares, and some others.

More complex problems may sometimes be solved by so-called multilayered perceptrons; an example is the exclusive-or (XOR) predicate, see the contribution by Weijters and Hoppenbrouwers. These neural networks are also trained under supervision, according to the so-called generalized delta rule, which is an extension of the perceptron learning algorithm. Both are based on the "steepest descent" optimization principle. See the contribution by Henseler on back propagation.

References

D.O. Hebb (1949) The Organization of Behavior. Wiley, New York.
W.S. McCulloch and W. Pitts (1943) A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5, 115-133.
M.L. Minsky and S.A. Papert (1969, 1988) Perceptrons: An Introduction to Computational Geometry. The MIT Press, Cambridge, MA.
W. Pitts and W.S. McCulloch (1947) How we know universals. Bulletin of Mathematical Biophysics 9, 127-147.
F. Rosenblatt (1959) Two theorems of statistical separability in the perceptron. Proceedings of a Symposium on the Mechanisation of Thought Processes, Her Majesty's Stationery Office, London, 421-456.
F. Rosenblatt (1962) Principles of Neurodynamics. Spartan Books, New York.
Kohonen Network

O.J. Vrieze
Department of Mathematics, University of Limburg, Maastricht

1 Introduction

One of the main problems in informatics concerns the representation of data, including their mutual relations. The more economically such a representation is organized, the more "intelligent" the resulting knowledge system can be made. Mutatis mutandis, this holds for the human brain. In thinking processes and in the information handling that takes place in the unconscious parts of the brain, there is an urge to represent knowledge in reduced form while retaining the overall picture. Intelligent human information handling is generally aimed at the construction of simplified maps of the observable world. Depending on the way real-world data present themselves, different abstraction levels may be discerned.

It has long been known that the various parts of the brain are ordered with respect to their sensitivity to modalities. This holds especially for the cerebral cortex. Further, there are areas that perform special tasks, like speech control or the analysis of sensory perceptions such as sight, hearing, touch, etc. More recently it appeared that within these specialized areas a further hierarchical structure exists. A striking observation was that for visual and for somatosensory response signals topographically the same ordering was found on the cortex as holds for the associated sense itself. In such a case one speaks of a somatotopic map. This feature forms the starting point of a Kohonen network.

Though the main structures of the human brain network are genetically determined, experiments show that this sensory projection can also be learned by experience. For instance, the loss of a sense, the loss of brain tissue, or the suppression of sensory stimulation during youth may result in a complete absence of the above-mentioned projection. In such cases it appears that the corresponding areas of the brain are used for other projections. A Kohonen network can be interpreted as a mechanism to simulate the learning process that enables certain areas of the brain to handle sensory perceptions in an orderly manner. Notice that a Kohonen network is not a physical analogy of an expected or registered neuronal configuration, but merely a system that functionally simulates the learning and processing functions of certain areas of the cortex.

In Section 2 the principles of the Kohonen network will be outlined. Next, in Section 3, the mathematical aspects of Kohonen networks will be treated; for instance, for the one-dimensional case it will be proved how a Kohonen network can represent a probability distribution of one variable. In Section 4 two applications are worked out, one concerning the control of a robot and one concerning a heuristic for the travelling salesman problem.
This paper will be concluded with a few remarks. Finally, in the Appendix we give an application to the ordering of color signals, which is meant as an intuitive illustration of the operation and the principles of the self-organizing property of the Kohonen mechanism.

2 The Principle of the Kohonen Network

Usually the input-output relation of a model of a neuron is given by a threshold function.

Fig. 1. The formal neuron, with input signals $\xi_1, \xi_2, \ldots, \xi_n$, weights $\mu_1, \mu_2, \ldots, \mu_n$, and output signal $\eta$.

In Figure 1, $\xi_j$, $j = 1, 2, \ldots, n$, are the input signals and $\eta$ represents the output signal. Further, $\mu_1, \mu_2, \ldots, \mu_n$ can be interpreted as synaptic efficiency coefficients, also called the weights. Using a threshold function, the relation between the input signals and the output signal can be expressed as:
$$\eta = k \cdot \delta\Big(\sum_{j=1}^n \mu_j \xi_j - \theta\Big). \qquad (1)$$
In expression (1), $\delta$ represents the Heaviside (unit step) function, i.e., $\delta(x) = 1$ when $x \geq 0$ and $\delta(x) = 0$ when $x < 0$. Hence the neuron fires at level $k$ whenever $\sum_{j=1}^n \mu_j \xi_j \geq \theta$.

In practice a neuron often behaves as a leaking integrator, so a representation as a dynamical system leads to the following state equation:
$$\frac{d\eta}{dt} = \sum_{j=1}^n \mu_j \xi_j - \gamma(\eta) \quad \text{if } \eta > 0,$$
and
$$\frac{d\eta}{dt} = \max\Big\{\sum_{j=1}^n \mu_j \xi_j - \gamma(\eta),\ 0\Big\} \quad \text{if } \eta = 0, \qquad (2)$$
where $\gamma$ is a loss term, which is generally nonlinear.
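As a small aside, equation (1) translates directly into code; the weights, threshold and inputs below are purely illustrative values of our choosing:

def formal_neuron(xi, mu, theta, k=1.0):
    # Equation (1): eta = k * delta(sum_j mu_j*xi_j - theta), delta the Heaviside step.
    s = sum(m * x for m, x in zip(mu, xi))
    return k if s - theta >= 0 else 0.0

print(formal_neuron(xi=[1, 1, 0], mu=[0.5, 0.3, 0.9], theta=0.6))   # fires: 1.0
print(formal_neuron(xi=[0, 0, 1], mu=[0.5, 0.3, 0.9], theta=1.0))   # silent: 0.0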
The unsupervised learning behavior of an artificial neuron as described above can be modeled by making the efficiency coefficients $\mu_j$, $j = 1, \ldots, n$, time dependent. As early as 1949, Hebb postulated the hypothesis that the first derivative of $\mu_j$ is proportional to both the incoming and the outgoing signal. In addition, it is likely that a kind of forgetting takes place, or some other effect that reduces the learning quality. Usually in a Kohonen network the rate of forgetting is taken proportional to $\mu_j$ and also to some function $\beta$ of the output signal $\eta$:
$$\frac{d\mu_j}{dt} = \alpha\eta\xi_j - \beta(\eta)\mu_j. \qquad (3)$$
In the Taylor expansion of $\beta(\eta)$, the constant term normally equals 0. The factor $\alpha$ is called the adaptation parameter or learning parameter. Equation (3) forms the basis of a Kohonen network. The asymptotic behavior of the vector $\mu = (\mu_1, \mu_2, \ldots, \mu_n)$ associated with a stochastic input $\xi_1, \ldots, \xi_n$ determines the properties of the network. The next proposition, which holds under quite general conditions, can be found in Kohonen (1988).

Proposition 1: Let $\xi_1, \xi_2, \ldots, \xi_n$ be time-invariant stochastic variables and let $C$ be the associated $(n \times n)$ correlation matrix. Then the dynamical system (1), (2) and (3) will converge to the asymptotic value $\mu^*$, where $\mu^*$ is the eigenvector corresponding to the largest eigenvalue of $C$.

In other words, the neuron becomes specifically sensitive to a special form of the statistical input data, namely a form that represents a fundamental property of the statistical input $\xi_1, \ldots, \xi_n$.

We will now consider a neural network as described in Figure 2. Two basic properties are essential. In the first place, every neuron receives the same input. In the second place, lateral feedback is included, i.e., every neuron receives as additional input the output of all neurons, including itself. In practice, the functional interaction levels of the lateral feedback signals often are "Mexican hat" shaped, cf. Figure 3. A "Mexican hat" interaction behaves as follows. When a neuron is firing, nearby neurons are stimulated, with a strength diminishing with increasing distance to the firing neuron. From a certain distance on, inhibition takes place, which gradually vanishes when the distance becomes very large. In Kohonen networks usually only the inner stimulation area is used, i.e., when a neuron $i$ fires, a positive feedback takes place for all neurons $i'$ whose distance to $i$ is smaller than some given number $\rho$. With $N_i = \{i' \mid d(i, i') < \rho\}$, the stimulus pattern over $N_i$ will have the form of the inner area of the "Mexican hat".
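Proposition 1 can be illustrated numerically. The sketch below discretizes equation (3) with the particular choice $\beta(\eta) = \eta^2$ (Oja's rule; this choice is our assumption, consistent with the requirement that the constant term of the Taylor expansion of $\beta$ vanishes) and checks that $\mu$ aligns with the dominant eigenvector of the empirical correlation matrix:

import numpy as np

rng = np.random.default_rng(0)
n, T, alpha = 2, 20000, 0.01
direction = np.array([0.8, 0.6])                       # built-in dominant direction
xis = rng.normal(size=(T, n)) + 2.0 * rng.normal(size=(T, 1)) * direction

mu = rng.normal(size=n)                                # initial weights
for xi in xis:
    eta = mu @ xi                                      # output of the linear neuron
    mu += alpha * (eta * xi - eta**2 * mu)             # discretized equation (3)

C = xis.T @ xis / T                                    # empirical correlation matrix
w, V = np.linalg.eigh(C)
top = V[:, np.argmax(w)]                               # eigenvector, largest eigenvalue
print(abs(mu @ top) / np.linalg.norm(mu))              # close to 1: mu aligns with top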
Fig. 2. Operational module for neural systems: every neuron receives the input $x$ and, through lateral feedback, the outputs $y$ of all neurons.

Fig. 3. The "Mexican hat" function of lateral interaction.

Such an artificial neural network can be used as a learning mechanism to quantize vectors by approximation. The way this can be done will now be described. Let $I$ be a raster of points, indicating neurons. The learning process proceeds along discrete time moments $t = 1, 2, \ldots$. To every point $i \in I$, at every time moment $t$, a vector $m_{ti} = (m_{ti1}, m_{ti2}, \ldots, m_{tin})$ is associated. Here, $m_{tij}$ can be interpreted as the updated efficiency coefficient of neuron $i$ for input channel $j$. The learning process is fed with input data $x_t = (x_{t1}, x_{t2}, \ldots, x_{tn})$, $t = 1, 2, \ldots$. Let $X$ be the set of all possible inputs $x$. At every time moment $t$ an $x \in X$ is selected according to a specified chance experiment. This results in a training series $x_t$, $t = 1, 2, \ldots$. Next fix a $\rho > 0$ and a sequence $\alpha_t$, $t = 1, 2, \ldots$, with $\lim_{t\to\infty} \alpha_t = 0$. At every time moment $t$ the vectors $m_{ti}$ are adapted along the following scheme:

(i) Select $i_c$ such that $\|x_t - m_{t i_c}\| = \min_{i \in I} \|x_t - m_{ti}\|$.
(ii) Let $N_t := \{i \mid \|i_c - i\| \leq \rho\}$.
(iii) Set $m_{t+1,i} = m_{ti} + \alpha_t h_{i i_c}(x_t - m_{ti})$ for $i \in N_t$; set $m_{t+1,i} = m_{ti}$ for $i \notin N_t$.

Initially, $m_{1i}$ can be arbitrarily chosen. The interpretation of the above scheme is as follows.
In step (i) the neuron $i_c$ that fits the input best is selected. This neuron fires. Next, those neurons are selected whose distance to $i_c$ is at most $\rho$, and these neurons are stimulated by the lateral feedback mechanism. This results in an update of the efficiency coefficients. The constants $h_{i i_c}$ determine the magnitude of this lateral stimulation. In concordance with the "Mexican hat" one could take $h_{i i_c} = 1/\|i - i_c\|$. However, easier to handle is the choice $h_{i i_c} = 1$, which is just as effective and sometimes even better. Tacitly, in the above procedure, we have assumed that both $I$ and $X$ are metric spaces. Observe that the scheme (i)-(iii) can be considered to be the discrete version of the dynamical system (1)-(3).

Fig. 4. Illustration of "projection" and dimension reduction.

Repeated application of the procedure (i)-(iii) results in a map of $I$ into $X$. This map is determined by the probabilities with which the different points of $X$ appear in the series. For example, in Figure 4, $X$ is a distorted block and $I$ is a rectangular raster of points. If we take the uniform distribution on $X$, then the $m_{ti}$'s, $i \in I$, will asymptotically approach the values as indicated in Figure 4. In fact, to every $i \in I$ a segment of $X$ can be associated. Notice that the shape of $X$ and the shape of the image of $I$ coincide, but may differ in dimension. This phenomenon turns out to be a general property. In the literature on Kohonen networks this property is often called the "self-organizing feature map" or "topology-conserving learning map".

Two other examples that show the topology-preserving property can be found in Figure 5 and Figure 6. In Figure 5, $X$ is a square and the training series $x_t$ is again selected according to the uniform distribution. $I$ is a square raster of points with starting vectors $m_{1i}$ concentrated around the center of $X$. The image of $I$ in $X$ becomes more and more representative for $X$ with an increasing number of iterations. In Figure 6, $X$ is a triangle. Again the uniform distribution is used. Here the points of $I$ are located on a line, so the network is one-dimensional. Also in this case the image of $I$ tries to copy $X$ as closely as possible. Thus, Figure 6 shows an example of a learning process in which the representation of the object under study occurs in a lower dimension than the dimension of the object itself.
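The scheme (i)-(iii) is short enough to implement directly. The following sketch reproduces the situation of Figure 5 qualitatively; the grid size, the decay schedules for $\alpha_t$ and $\rho$, the maximum-norm grid distance, and the choice $h_{i i_c} = 1$ are our illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
side = 10                                     # I: a side x side raster of neurons
grid = np.array([(i, j) for i in range(side) for j in range(side)])
m = 0.5 + 0.05 * rng.standard_normal((side * side, 2))   # start near the center of X

T = 20000
for t in range(T):
    x = rng.random(2)                         # uniform input from the unit square X
    i_c = np.argmin(((m - x) ** 2).sum(axis=1))           # step (i): winning neuron
    rho = max(1.0, 5.0 * (1.0 - t / T))                   # shrinking neighborhood
    alpha = 0.01 + 0.5 * (1.0 - t / T)                    # decaying learning parameter
    N = np.abs(grid - grid[i_c]).max(axis=1) <= rho       # step (ii), max-norm distance
    m[N] += alpha * (x - m[N])                            # step (iii) with h = 1

# The m_i now form a topology-preserving lattice covering the unit square.
print(m.min(axis=0), m.max(axis=0))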
Fig. 5. Weight vectors during the ordering process of a uniformly distributed square into a lattice: (a) after 500 cycles, (b) after 2000 cycles.

3 Mathematical Aspects of the Kohonen Network

The proof that the above procedure (i)-(iii) has the mentioned asymptotic property will only be given for the case that $X$ is a line segment endowed with the uniform distribution and where $I = \{1, 2, \ldots, n\}$ is linearly ordered. The neighborhood $N_t$ of a point $i_c \in I$ that is "hit" at time point $t$ is chosen as follows: if $i_c \in \{2, 3, \ldots, n-1\}$ then $N_t = \{i_c - 1, i_c, i_c + 1\}$; if $i_c = 1$, then $N_t = \{1, 2\}$; if $i_c = n$, then $N_t = \{n-1, n\}$. Now, if we apply the procedure (i)-(iii) with $h_{i i_c} = 1$ and $\lim_{t\to\infty} \alpha_t = 0$, the following proposition can be proved.
Fig. 6. Weight vectors during the ordering process of a uniformly distributed triangle into a curve, after 0, 20, 100, 1000, 10000, and 25000 iterations.

Proposition 2: With probability 1 the sequence $m_{t1}, m_{t2}, \ldots, m_{tn}$ is either monotone decreasing or monotone increasing for all $t$ sufficiently large.

Whether the sequence $m_{t1}, m_{t2}, \ldots, m_{tn}$ is ordered can be measured by the expression
$$D = \sum_{i=2}^{n} |m_{ti} - m_{t,i-1}| - |m_{tn} - m_{t1}|.$$
According to the triangle inequality we have $D \geq 0$, and $D = 0$ if and only if the sequence $m_{t1}, m_{t2}, \ldots, m_{tn}$ is monotone. The self-organizing effect of the algorithm follows from the fact that at each stage the probability that $D$ decreases is larger than the probability that $D$ increases. At every stage at most 4 terms of $D$ will change, in which $i_c - 1$, $i_c$, and $i_c + 1$ are involved. For every configuration of the points $i_c - 2$, $i_c - 1$, $i_c$, $i_c + 1$, $i_c + 2$ it can easily be shown that the probability of a reduction of $D$ is larger than the probability of an increase. For instance, with $\alpha_t = \frac{1}{2}$, the result of an iteration based on $x_t$ can be found below:
(Illustration: an example configuration of the points $m_{ti}$ before and after the iteration.)

The following assertion can easily be verified:

Assertion 1: Once the sequence $m_{t1}, m_{t2}, \ldots, m_{tn}$ is monotone for a certain $t$, then it is also monotone for all larger $t$.

Namely, if $|m_{ti} - x_t|$ is minimal for $i = i_c$, then the images of $i_c - 1$, $i_c$ and $i_c + 1$ slide a bit towards each other while the other images remain unchanged. Hence the ordering will not be disturbed. So the proposition is proved if, with probability 1, the sequence $m_{t1}, m_{t2}, \ldots, m_{tn}$ is monotone for a certain $t$.

Assertion 2: When the sequence $\alpha_t$, $t = 1, 2, \ldots$ is appropriately chosen, the sequence $m_{t1}, m_{t2}, \ldots, m_{tn}$ becomes monotone with probability 1 for a certain $t$.
This assertion can be proved using Markov chain theory. A specific ordering $m_{t1}, m_{t2}, \ldots, m_{tn}$ represents a state of the system. States may or may not change after each iteration. It can be proved that all non-monotone states are transient states, that exactly one of the two monotone states (decreasing or increasing) will be reached with probability 1, and that these states are absorbing.

In Figure 7 the asymptotic result of an application of the procedure to a uniform distribution on [0,1] is shown. Figure 8 shows an application to a normal distribution on [0,1].

Fig. 7. Kohonen self-organizing map of a uniform distribution ($m_i^*$ against neuron number).

Both figures indicate that the limit values of the images $m_{ti}$ are related to the original distribution. This relation can be stated as follows. Let $m_i^*$ be the limit values in the case of a uniform distribution. In the limit, $m_{ti}$ no longer changes in expectation; hence:
$$E\{x \mid x \in S_i\} - m_i^* = 0, \quad \text{for each } i \in I, \qquad (4)$$
where $S_i \subseteq X$ is the set of inputs $x$ for which $m_{ti}$ is adapted, i.e., for which $i$ belongs to the selected neighborhood. It can straightforwardly be verified that
$$S_i = [\tfrac{1}{2}(m_{i-2}^* + m_{i-1}^*),\ \tfrac{1}{2}(m_{i+1}^* + m_{i+2}^*)] \quad \text{for } 3 \leq i \leq n-2,$$
$$S_1 = [0,\ \tfrac{1}{2}(m_2^* + m_3^*)], \qquad S_2 = [0,\ \tfrac{1}{2}(m_3^* + m_4^*)],$$
$$S_{n-1} = [\tfrac{1}{2}(m_{n-3}^* + m_{n-2}^*),\ 1], \qquad S_n = [\tfrac{1}{2}(m_{n-2}^* + m_{n-1}^*),\ 1].$$
The set of functional equations (4) thus reduces to a matrix equation with two solutions, which are symmetric to each other. In Table 1 and Figure 9 the solutions are numerically and graphically presented for different values of $|I|$.
Fig. 8. Kohonen self-organizing map of a normal distribution ($m_i^*$ against neuron number).

Table 1. Asymptotic values of $m_i^*$ for the cases $|I| = 5, 6, 7, 8, 9$ and $10$.

|I|  m_1*  m_2*  m_3*  m_4*  m_5*  m_6*  m_7*  m_8*  m_9*  m_10*
 5   0.2   0.3   0.5   0.7   0.8
 6   0.17  0.25  0.43  0.56  0.75  0.83
 7   0.15  0.22  0.37  0.5   0.63  0.78  0.85
 8   0.13  0.19  0.33  0.44  0.56  0.67  0.81  0.87
 9   0.12  0.17  0.29  0.39  0.5   0.61  0.7   0.83  0.88
10   0.11  0.16  0.27  0.36  0.45  0.55  0.64  0.73  0.84  0.89

Fig. 9. Asymptotic values of $m_i^*$ for different values of $|I|$.
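The values in Table 1 can be reproduced directly from equation (4): for the uniform distribution $m_i^*$ must be the midpoint of $S_i$, and a plain fixed-point iteration on this condition converges quickly. The sketch below makes the monotone starting configuration and the number of sweeps our own illustrative choices:

def equilibrium(n, sweeps=1000):
    # Solve m_i = midpoint(S_i), with S_i as given above, by fixed-point iteration.
    m = [(i + 0.5) / n for i in range(n)]      # monotone starting configuration
    for _ in range(sweeps):
        left = [0.0 if i <= 1 else 0.5 * (m[i - 2] + m[i - 1]) for i in range(n)]
        right = [1.0 if i >= n - 2 else 0.5 * (m[i + 1] + m[i + 2]) for i in range(n)]
        m = [0.5 * (a + b) for a, b in zip(left, right)]
    return m

for n in (5, 10):
    print(n, [round(v, 2) for v in equilibrium(n)])
# n = 5 gives (0.2, 0.3, 0.5, 0.7, 0.8), the first row of Table 1.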
It appears that, with the exception of the border points, the points $m_i^*$ are symmetrically distributed along [0,1]. To every $i \in I$ an interval of $X$ can be associated, consisting of those points $x$ for which $|x - m_i^*| \leq |x - m_{i'}^*|$ for each $i' \in I$. Let $X_i = \{x \in X : |x - m_i^*| \leq |x - m_{i'}^*| \text{ for each } i' \in I\}$; then $m_i^*$ is the relative center of gravity of $X_i$, weighted according to the probability mechanism used. Furthermore, the probabilities $p(X_i)$ are equal for each $i$, with a small deviation at the endpoints. This deviation decreases, both relatively and absolutely, with increasing $n$. That all probabilities $p(X_i)$ are equal is reflected in Figure 8 by the phenomenon that around the average, where most of the probability mass is concentrated, the intervals $X_i$ are smaller. Hence, in order to represent such a probability distribution, relatively more points of $I$ are used for the area around the average than for the extreme ends.

In Figure 10 the asymptotic values for a broken ring are depicted. Furthermore, the so-called Voronoi mosaic is given, being the subdivision of $X$ into the areas $X_i$. Notice that the areas $X_i$ are separated by hyperplanes that are orthogonal to the lines connecting the points $m_i^*$.

Fig. 10. Equilibrium state in self-organization: the influence regions $X_i$ of the units within the support of $p(x)$.

4 Practical Applications

In this section we discuss two practical applications. The first one concerns an application of Ritter, Martinetz and Schulten (1989) to a learning process for the kinematics of a robot arm. The second one concerns an application of Angeniol, de la Croix Vaubois and le Texier (1988) to the travelling salesman problem.

4.1 Control of a robot arm

Two cameras observe a point object on a table, as well as the position of the gripper of the robot. The cameras can only "see" two-dimensionally, resulting in the two
coordinates $(x_{11}, x_{12})$ and $(x_{21}, x_{22})$. The goal of the learning process is to teach the robot to move its gripper to the right position when it observes an object. The gripper can be positioned by tuning three angles $\theta_1$, $\theta_2$ and $\theta_3$, as indicated in Figure 11. Hence the learning process aims at approximating the map
$$\mu : ((x_{11}, x_{12}), (x_{21}, x_{22})) \mapsto (\theta_1, \theta_2, \theta_3)$$
such that $\mu((x_{11}, x_{12}), (x_{21}, x_{22}))$ makes the gripper move to the desired position on the table.

Fig. 11. The camera positions and the gripper of the robot.

We consider the discretized version of the problem. To that purpose, imagine a raster on the table; if the gripper is positioned in such a way that it can pick up a certain raster point, then we suppose that all positions on the table for which this raster point is closest can also be picked up. The network $I$ consists of as many points as there are raster points, and the points of $I$ are logically arranged like the raster points. To every network point $i \in I$, at every time point $t$, two vectors are attached: $m_{ti}$ and $v_{ti}$. The vector $v_{ti}$ consists of 15 components. The first three are the angles $(\theta_{ti1}, \theta_{ti2}, \theta_{ti3})$, being the current tuning choices associated with table raster point $i$. This tuning prescription results in a certain position of the gripper, which is observed by the two cameras: $m_{ti} = ((m_{ti11}, m_{ti12}), (m_{ti21}, m_{ti22}))$. The other 12 components of $v_{ti}$ are the 12 elements of an adaptation matrix of size $3 \times 4$. This matrix is used when the angles associated with this raster point have to be adapted; it will be denoted by $A_{ti}$. The adaptation process goes as follows. Randomly a table raster point is chosen, say $i$, with coordinates $x_i = ((x_{i11}, x_{i12}), (x_{i21}, x_{i22}))$.
Next,
$$\Theta_{t+1,i} = \Theta_{ti} + \alpha_t A_{ti}(x_i - m_{ti}) \qquad (5)$$
and
$$A_{t+1,i} = A_{ti} + A_{ti}(x_i - m_{t+1,i})(m_{t+1,i} - m_{ti})^{\mathsf{T}}\,\|m_{t+1,i} - m_{ti}\|^{-2}. \qquad (6)$$
Here $\alpha_t \in [0,1]$ is the learning factor at iteration $t$, while the components of $m_{t+1,i}$ correspond to the position of the gripper as seen by the cameras when the gripper is tuned according to $\Theta_{t+1,i}$. Equation (5) can be interpreted as the combination of a rough tuning $\Theta_{ti}$ and a fine correction tuning $\alpha_t A_{ti}(x_i - m_{ti})$. Equation (6) can be rewritten as
$$A_{t+1,i} = A_{ti} + (A_i^* - A_{ti})\,\Delta m_i \Delta m_i^{\mathsf{T}}\,\|\Delta m_i\|^{-2}, \qquad (7)$$
where $\Delta m_i = m_{t+1,i} - m_{ti}$ and $A_i^*$ is a matrix satisfying $A_i^* \Delta m_i = A_{ti}(x_i - m_{ti})$. When in equation (7) $A_i^*$ is considered to be the true $A_i$, we recover the usual adaptation scheme for Kohonen networks. Based on the general theory of Kohonen networks it can be stated that the above scheme converges to an optimal control of the gripper. As lateral feedback mechanism one could take, for instance, a Gaussian function as discussed in Section 2.

The model presented here concerns the most simple version of gripper control. The next step is, of course, to treat the situation in which the total surface of the table is taken as $X$. We will not elaborate this problem here, but merely indicate that when a certain area of the table asks for greater precision, this can be achieved by putting more probability weight on that area. Figure 12 gives an illustration.

4.2 Self-organizing Maps and the Travelling Salesman Problem

Angeniol et al. (1988) applied the idea of a self-organizing map in the form of a heuristic for the travelling salesman problem (TSP). They confined themselves to TSPs in a plane surface. We consider $M$ cities with coordinates $(x_{i1}, x_{i2})$ for city $i$. The distance from city $i$ to $j$ equals $\sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2}$. The network is of the Hamiltonian type: each city is connected to all other cities.

The application of the idea of a self-organizing map goes via the construction of a certain closed curve. This curve is formed by nodes that move at each iteration towards the cities. If the nodes and the cities coincide, or if the distances between the nodes and the cities are small enough, the iteration stops and an approximating solution is found.
Fig. 12. A non-uniform map between raster points and the table positions.

During an iteration, nodes can be generated or can disappear. The way this happens is described below. Suppose that at a certain moment $N$ nodes are present and that city $i$ is selected for the next iteration step. This iteration step consists of two parts:

(i) Compute $\min_j v_j$ with $v_j = (x_{i1} - c_{j1})^2 + (x_{i2} - c_{j2})^2$. Here $(c_{j1}, c_{j2})$ are the coordinates of node $j$. This results in node $j_c$, the node closest to city $i$.

(ii) Move all nodes in the direction of city $i$ according to the following rule:
$$c_{jk} \leftarrow c_{jk} + f(G, n)(x_{ik} - c_{jk}), \quad \text{for } k = 1, 2.$$
Here $f(G, n) = (1/\sqrt{2})\exp(-n^2/G^2)$, with $G$ the learning factor and $n = \min\{j - j_c \ (\mathrm{mod}\ N),\ j_c - j \ (\mathrm{mod}\ N)\}$ the distance from node $j$ to node $j_c$ along the ring.

In the above scheme it is assumed that the nodes are numbered according to a fixed pattern. The learning factor $G$ can be adapted after each iteration, for instance $G = (1 - \alpha)^t K$, with $\alpha \in (0, 1)$ and $K \in \mathbb{R}$. According to a fixed sequence every city is selected, and after the last one the procedure continues again with the first one, etc.

The generation and disappearance of nodes goes as follows. If a node is selected as closest node by two cities, then this node is duplicated. The extra node gets the same coordinates but a different rank number, namely as immediate neighbor of its creator. Thus these two identical points split up after the next iteration. A node disappears if, during three consecutive rounds along all cities, it is never selected as the closest node.
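A compact implementation of this heuristic is sketched below. The problem instance, the decay rule for $G$, the number of sweeps, and the simplified deletion rule (one round instead of three) are our illustrative assumptions:

import numpy as np

rng = np.random.default_rng(2)
cities = rng.random((30, 2))                   # hypothetical TSP instance in the plane

nodes = [np.array([0.5, 0.5])]                 # the ring starts as a single node
G = 10.0
for sweep in range(200):
    chosen = []
    for x in cities:                           # survey the cities in a fixed sequence
        N = len(nodes)
        jc = int(np.argmin([np.sum((x - c) ** 2) for c in nodes]))   # part (i)
        chosen.append(jc)
        for j in range(N):                                           # part (ii)
            n_ring = min((j - jc) % N, (jc - j) % N)
            f = (1 / np.sqrt(2)) * np.exp(-n_ring**2 / G**2)
            nodes[j] = nodes[j] + f * (x - nodes[j])
    new_nodes = []                             # duplication and deletion of nodes
    for j, c in enumerate(nodes):
        hits = chosen.count(j)
        if hits >= 1:
            new_nodes.append(c)                # keep nodes that were chosen
        if hits >= 2:
            new_nodes.append(c.copy())         # duplicate: same coordinates, neighbor rank
    nodes = new_nodes
    G *= 0.97                                  # decaying learning factor
print(len(nodes), "nodes on the ring after", sweep + 1, "sweeps")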
Again using the general theory of Kohonen networks, it can be shown that if after a certain iteration the nodes are close enough to the cities, then the allocation will not change anymore. In order to check whether this situation has been reached, an arithmetic stopping criterion can be built in. Figure 13 serves as an example.

Fig. 13. Application of a Kohonen network to a TSP with 1000 cities.

5 Conclusions

A Kohonen network as a self-organizing mechanism supplies an important contribution to the development of neural networks. The learning aspect is mainly aimed at the quantization of vectors, which can be accompanied by a reduction of the dimension. Further, the property that "shapes" are preserved by self-organizing feature maps makes the Kohonen network a very strong instrument. A striking fact is that both anatomically and functionally certain cortex areas can be discerned that have properties similar to those of Kohonen networks. Examples are the processing of sound and light stimuli.

With respect to the theoretical properties of Kohonen networks little is known. Especially the rate of convergence related to the time-dependent learning factor is an uncertain aspect. For the application to the TSP no theoretical foundation is present indicating the quality of the heuristic. In conclusion, a Kohonen network appears to be a promising technique, for which mathematical foundations still to come will have to show its quality as well as future directions of development.
In the Appendix a further application of the self-organizing property of a Kohonen network is worked out. This example is given by Henseler and Postma (1990).

Appendix

Recently Henseler and Postma reported an application of the self-organizing principle of Kohonen networks. A screen, subdivided into grids or squares, is covered with paint daubs of different colors. A square is selected randomly. Next, with the aid of the basic colors yellow, blue and red, a color is mixed that fits the color of the selected square best. Such a color can be represented by a vector $(y, b, r)$ representing the proportions of the three constituent colors. Now, a daub of this color is thrown at the screen, aiming to hit the selected square. However, not only this square will be hit, but also its neighbors, with decreasing intensity for squares further away. Subsequently, a next random square is chosen, etc. Using a computer program, Henseler and Postma show that this procedure orders the colors on the screen. Figure 14 gives an example. The left part of the figure is the starting position; the right part gives the ordering after the self-organizing process.

Fig. 14. The self-organizing process of ordering colors.

Below, a pseudo-code listing is given of the computer program that enables experimenting with the above described phenomenon.
THE ALGORITHM

Program SOFM
{Input}
  m, n : natural number {dimensions of the map}
  B : natural number {dimension of the neighborhood}
  a : rational number {adaptation rate}
{Output}
  V(1..m, 1..n, 1..3) : rational number
{Local}
  i, j, k : natural number
  wi, wj : natural number
  K(1..3) : rational number
  d, minimum : rational number
  b : natural number
begin
  {random initialization of V}
  for (i, j, k) in [1..m] x [1..n] x [1..3] do
    V(i, j, k) = random number in [3/8..5/8]
  end for
  {self-organization}
  repeat till V nearly does not change any more
    {take a daub K of arbitrary color}
    for k in [1..3] do
      K(k) = random number in [0..1]
    end for
    {determine square (wi, wj) whose color is closest to K}
    minimum = infinity
    for (i, j) in [1..m] x [1..n] do
      {compute distance between V(i, j, .) and K}
      d = 0
      for k in [1..3] do
        d = d + (V(i, j, k) - K(k)) * (V(i, j, k) - K(k))
      end for
      {save (i, j) if d is the smallest distance so far}
      if d < minimum then
        minimum = d
        wi = i
        wj = j
      end if
    end for
    {adapt V(wi, wj, .) and its neighbors}
    for (i, j) in [wi - B .. wi + B] x [wj - B .. wj + B] do
      {do not surpass the defined screen!}
      if (i, j) in [1..m] x [1..n] then
        {b in [0..B] indicates in which neighborhood ring (i, j) lies}
        b = MAX(ABS(i - wi), ABS(j - wj))
        {adapt color V(i, j, k) dependent on the distance to (wi, wj) and on a}
        for k in [1..3] do
          V(i, j, k) = V(i, j, k) + a * (K(k) - V(i, j, k)) / (2 * b + 1)
        end for
      end if
    end for
  end repeat
end program

References

B. Angeniol, G. de la Croix Vaubois and J.Y. le Texier (1988) Self-organizing feature maps and the travelling salesman problem. Neural Networks 1, 289-293.
D.O. Hebb (1949) The Organization of Behavior. Wiley, New York.
H. Henseler and E. Postma (1990) Self-organization as computational principle (in Dutch). Convex 6, 16-19.
T. Kohonen (1988) An introduction to neural computing. Neural Networks 1, 3-16.
T. Kohonen (1988) Self-Organization and Associative Memory (2nd ed.). Springer-Verlag, Berlin.
H.J. Ritter, T.M. Martinetz and K.J. Schulten (1989) Topology-conserving maps for learning visuo-motor-coordination. Neural Networks 2, 159-168.
Adaptive Resonance Theory

E.O. Postma and P.T.W. Hudson
Department of Computer Science, University of Limburg, Maastricht

1 Introduction

This paper provides an overview of the family of ART (Adaptive Resonance Theory) neural networks. Theoretical backgrounds of these networks can be found in the numerous publications by Grossberg and his coworkers (see Grossberg, 1982, 1986, 1987a, and 1987c for some summarizing overviews). In this contribution, we focus on the practical aspects of ART to help the reader grasp the approach without being bothered too much by theoretical details. ART networks learn input patterns by classifying them in an unsupervised way: there is no external teacher telling the network under which category an input pattern should be stored. The fact that learning proceeds unsupervised imposes restrictions on the way input patterns are treated in ART networks. These restrictions follow from consideration of the stability-plasticity dilemma. According to this dilemma, a neural network should be plastic, in order to store novel input patterns; however, it should also be stable, in order to protect stored patterns from being erased (cf. Grossberg, 1987b). ART networks cope with the stability-plasticity dilemma by treating novel input patterns differently from old (earlier learned) input patterns.

Section 2 provides a concise introduction to the Adaptive Resonance Theory that helps the reader to understand certain design choices in the networks. Section 3 treats the ART1 network in detail. This network classifies binary patterns in arbitrarily fine categories without the need for an external teacher (i.e., learning is unsupervised). In addition, a variant of ART1 is outlined that simplifies the architecture considerably. Section 4 discusses the modifications needed to extend the network to classify analog patterns (ART2), and Section 5 discusses supervised learning in ART networks. Finally, Section 6 evaluates ART networks.

2 Adaptive Resonance Theory

The formulation of the Adaptive Resonance Theory goes back to the first publications of Grossberg (compiled in Grossberg, 1982). It is based on many notions drawn from biology, psychology, and mathematics. Below we shortly describe three central notions: (a) the notion of two stages, (b) the notion of two-component input, and (c) the notion of integration of bottom-up and top-down processes.
(a) Two Stages The theory pertains to networks of two interconnected layers, designated F1 and F2 in Figure 1a. F1 is an input layer and F2 is an output layer. A pattern in F2 represents a categorization of a pattern at F1. The connections between F1 and F2 (represented by the arrows in Figure 1a) enable both layers to communicate.

(b) Two-Component Input In ART, input signals are composed of two components: a specific informational and a nonspecific arousal component (see Figure 1b). The informational component contains the patterned information or content embodied in the input ("what is it?"). The arousal component represents the general activation that is generated by the presence of the input alone, irrespective of its content ("there is something!").

(c) Integration of Bottom-up and Top-down Processes The integration of input from the environment and internally generated expectations based on knowledge of the environment is an important notion in ART (see Figure 1c). The integration of the processing of bottom-up input signals with the knowledge-based (top-down) signals provides a key to the stability-plasticity dilemma. It enables ART networks to differentiate between novel and old patterns: a novel pattern does not match expectation, whereas an old one does.

Fig. 1. Basic notions of the Adaptive Resonance Theory. (a) The network consists of an input layer F1 and an output layer F2. (b) Input consists of two components: one component contains the patterned information, the other signals the presence of input. (c) Pattern processing comes about by the integration of what comes in (the environmental input) and what is known about it (the knowledge about the environment).
3 ART1: Classifying Binary Patterns

In this paper, we will refer to the processing elements in the network as nodes (i.e., neurons or units), and to the connections as links (i.e., synapses). The nodes are organized in layers. The excitation of a node is defined as the weighted sum of its inputs. If the excitation exceeds the node's threshold, the node becomes active. In ART1, this threshold is implemented by using a sigmoid activation function. The weight of a link connecting node A with node B is adapted using a learning rule that takes into account the activations of both A and B.

3.1 Major Components

Figure 2 shows the architecture of ART1, revealing its major components. Circles represent nodes, rectangles represent layers, and lines represent connections. Thick lines indicate nonspecific connections, whereas thin lines indicate specific connections. A line ending in an arrow excites the target layer or node; a line ending in a disk inhibits the target layer or node. Below, we discuss each of these components in detail.

Fig. 2. Architecture of the ART1 network.

The Clamping Field I The bottom rectangle in Figure 2 represents an external clamping layer. Nodes in this layer are clamped by the input pattern and are not part of the ART1 network. Its function is to provide a buffer for input patterns: the nodes (circles) are directly activated by externally presented input
patterns. The two-component nature of input patterns is reflected in the specific connections the clamping layer makes with F1 (small arrows) and the nonspecific connections (large arrows) it makes with the nodes designated by G and A (see below).

The Input Field F1 As shown in Figure 2, F1 represents the layer directly fed (in a one-to-one fashion) by the nodes of the clamping layer I, i.e., F1 receives the informational component of the input.

The Output Field F2 Each node in F2 represents a category. An active node in F2 indicates the category under which the input pattern at F1 falls. Only one F2 node is active at a time. For this reason, the internal structure of F2 forms a Winner-Take-All (WTA) network, i.e., the nodes are competitively coupled so that only one node (the winner) can be active at a time (cf. Grossberg, 1973, 1987b). Figure 3 shows the architecture of the WTA layer F2: each node excites itself (self-excitation) and inhibits all others (lateral inhibition).

Fig. 3. Winner-Take-All structure of output layer F2.

F1-F2 Connections F1 and F2 are fully interconnected (i.e., every node in F1 is connected with every F2 node) by separate adaptive feedforward (i.e., from F1 to F2) and feedback (i.e., from F2 to F1) links. The bidirectional nature of these connections reflects the notion of integration of bottom-up and top-down processing in Adaptive Resonance Theory.

Gain Control Associated with F1 is a gain-control node (G, see Figure 2). This gain-control node receives a direct but nonspecific signal from the input: the arousal component of input I. Additionally, it receives a nonspecific inhibitory signal from F2. An active gain-control node activates all F1 nodes equally. Several alternative gain-control architectures have been shown to be formally equivalent to the one shown in Figure 2 (see the Appendix in Carpenter and Grossberg, 1987a).
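As an aside, the winner-take-all behavior of F2 can be sketched as an iterated competitive update in which self-excitation competes with lateral inhibition; the coefficients below are illustrative assumptions of ours, not values from the ART literature:

import numpy as np

def winner_take_all(excitation, self_exc=1.2, inh=0.2, steps=60):
    # Each node excites itself and inhibits all others until one winner remains.
    a = np.array(excitation, dtype=float)
    for _ in range(steps):
        lateral = a.sum() - a                  # inhibition received from the other nodes
        a = np.maximum(0.0, self_exc * a - inh * lateral)
        if a.max() > 0:
            a /= a.max()                       # keep the activations bounded
    return a

print(winner_take_all([0.3, 0.8, 0.5]))        # ~ [0, 1, 0]: only the winner survives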
Attentional Subsystem An attentional subsystem A receives nonspecific signals from two layers: nonspecific excitation from I and nonspecific inhibition from F1 (large filled circle in Figure 2). The A node compares the size of the F1 activations with the size of the input I. If the F1 activations are (to some degree) smaller than the input activations, the A node is activated. It then affects the nodes in F2 in a state-dependent manner: only the winning F2 node is suppressed by an active A node, leaving the other F2 nodes unaffected.

3.2 Matching at F1 with the 2/3 Rule

The integration of top-down expectation and bottom-up data in Adaptive Resonance Theory is implemented in ART1 by matching a weighted F2 pattern with an I pattern at layer F1. For such a matching, a special rule is introduced. F1 nodes obey the 2/3 rule: "two out of three signal sources must activate an F1 node in order for that node to generate suprathreshold output signals" (Carpenter and Grossberg, 1987a, p. 65). As is evident in Figure 2, F1 receives its three signals from: (1) the input; (2) the gain-control node G; and (3) the feedback connections from F2. According to the 2/3 rule, F1 is able to send its pattern to F2 only when it is simultaneously activated either by I and G, or by I and F2 (simultaneous activation by F2 and G cannot occur because F2 inhibits G). Whenever there is activation at F2, a node in F1 becomes active only when it receives simultaneous activity from F2 and from its node in I. This matching operation at F1 essentially implements the binary intersection of the transformed F2 pattern and the input pattern I.
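For binary patterns the 2/3 rule reduces to a simple computation; a minimal sketch follows, in which the function name and the encoding of patterns as 0/1 tuples are our own choices:

def f1_activation(I, f2_expectation=None):
    # 2/3 rule at F1. Without an active F2 node, G is on and every F1 node that
    # receives input fires: the F1 pattern equals I. With an active F2 node, G is
    # inhibited and an F1 node needs input from both I and F2: the binary intersection.
    if f2_expectation is None:                 # active sources: I and G
        return tuple(I)
    return tuple(i & e for i, e in zip(I, f2_expectation))   # sources: I and F2

I = (1, 0, 1)
print(f1_activation(I))                        # (1, 0, 1): pattern passed unchanged
print(f1_activation(I, (1, 1, 0)))             # (1, 0, 0): attenuated by the mismatch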
3.3 Processing Patterns in ART1
The aim of the ART1 network is to classify input patterns, made of binary data, into one of the categories formed by the nodes in F2. This comes down to mapping a set of similar input patterns (i.e., instances of a category) onto the same node (i.e., category). As an illustration of processing in ART1, the course of events after presentation of an input pattern I (i.e., a learning trial) is described in a qualitative way.

The Search for a Category
Suppose that the binary input pattern (1,0,1) is presented to a network that has already learned (and classified) some binary patterns of length 3. As shown in Figure 4a, this pattern feeds into F1. Simultaneously, node G activates all nodes in F1, ensuring that the 2/3 rule is obeyed (i.e., F1 activations exceed their thresholds). Node A receives a nonspecific excitatory signal from the input, but is not activated because it receives a nonspecific inhibitory signal from F1 that is stronger by definition. The suprathreshold F1 pattern excites the competitively interacting nodes in F2. Which node is the winner depends on the amount of excitation it receives from F1. When the winner is determined (the middle F2 node in Figure 4b), its activation results in a feedback pattern to F1 and inhibition of G. (We note that the 2/3 rule is still obeyed, but that the inputs now originate from I and F2 instead of I and G.) F1 combines the F2 pattern (the learned expectation) with its actual input I. As a result the F1 pattern is attenuated (see Figure 4c). The magnitude of this attenuation depends on the mismatch between the expectation and the input signal: as dictated by the binary intersection operation, only the F1 nodes activated by both the expectation pattern and the input pattern remain active. If this mismatch exceeds a (previously fixed) threshold, the inhibition of F1 on the A node is no longer sufficient to prevent it from becoming activated. As a result, the A node sends a reset wave towards layer F2. This reset wave suppresses the winning node in F2 selectively during the whole learning trial (see Figure 4d). Consequently, the expectation generated by this node is inhibited and the original input pattern is restored at F1 (G is no longer inhibited by F2). It should be noted that the reason for a mismatch to occur is the selection of an inappropriate category: the expectation generated by this category is not able to enhance the input pattern. The course of events described above is called a search cycle.

Fig. 4. Processing in ART1. Filled circles indicate active nodes. Black lines indicate excitatory (arrows) or inhibitory (circles) signal-carrying links. (a) Presentation of input pattern I activates F1. (b) The F1 pattern activates the winning F2 node, which sends a learned expectation pattern back to F1. (c) The expectation pattern does not match the input pattern (gray circles). The A node is disinhibited and a reset wave selectively inhibits the active F2 node. (d) The original F1 pattern is restored and a new F2 node can be selected.

A learning trial may involve several search cycles until an appropriate F2 node is encountered. This F2 node generates an expectation that preserves and enhances the F1 pattern so that node A remains disabled. There are two ways for this to happen: (i) the input pattern is recoded and treated as a new member of an earlier learned category (recoding is the attenuation of the input pattern during input-expectation matching at F1, small enough to prevent the A node from becoming active), or (ii) an uncommitted F2 node is selected; the input pattern is then not transformed during matching and represents the first instance of a new category. (In fact, it is the category, until another input pattern is recoded under the same F2 node.) In the case that no uncommitted F2 node is left, the input pattern is rejected and not learned at all. This occurs solely when all F2 nodes represent categories
that are incompatible with the input pattern. In this case encoding the input pattern would exceed the storage capacity of the network and, therefore, erase the memory of earlier learned patterns. This illustrates the important feature of protection of earlier learned information, providing a solution to the stability-plasticity dilemma.

The Storage of Information
It is assumed that consecutive search cycles proceed very rapidly. During the search, the weights of the connections (operating on a much slower time scale than the activations) are not changed. When the search stops, the activations at F1 and F2 are stabilized. If a category is found (the input is not rejected) both the F1 and F2 patterns are stable and in a state called resonance. The feedforward links are then adapted with a special learning rule. In order to appreciate the need for this learning rule, the following example illustrates what would happen if a simple Hebb rule (i.e., incrementing weights of links connecting active nodes) were used. Suppose that an F1 pattern consists solely of active nodes and that all (bottom-up) weights have small random values. When a category is found, then, according to the Hebb rule, the weights of the links from all F1 nodes to the F2 node are increased, elevating them above the other random low-valued weights. Suppose now that, following this, another F1 pattern is presented, containing only a few active nodes. Obviously, this pattern is part of the first pattern, i.e., a subpattern. This subpattern will be projected onto the same F2 node as the first F1 pattern, because all the weights of the links from the active nodes (constituting the subpattern) were increased during presentation of the first pattern. Generally this means that each pattern that is a subpattern of an earlier pattern will be mapped onto the category associated with that earlier pattern (e.g., after classifying the pattern 'E', the patterns 'I' and 'F' will inevitably be classified onto the same F2 node). In the extreme case of a pattern with only active elements, all succeeding patterns will be lumped into the same category. Obviously, this has to be prevented. For this reason, in ART, a special non-Hebbian learning rule is introduced. Without going into much detail here (see Section 3.4), this special rule ensures that F1 patterns with a large number of active elements lead to smaller weight increments than F1 patterns with a small number of active elements. This can be achieved by letting the links that converge onto the F2 node compete for a limited amount of "total weight". When there are many competitors (converging links) each one receives only a small increment, whereas when there are only a few competitors a larger increment per link is allowed.

Vigilance: the Coarseness of Categories
In ART, learning proceeds without supervision. Which F2 node will be selected for the first pattern in a category is, therefore, arbitrary. This is not, however, true for the size of the categories. Whether the patterns of an APPLE and an APE are to be classified into the same category depends on the size of the category (e.g., the categories FRUIT and ANIMAL vs. the category TO BE FOUND IN TREES). Therefore, in ART,
a special vigilance parameter is introduced. This parameter determines the magnitude of F1-activity attenuation that is allowed before the A node is activated. A small vigilance value allows large mismatches between expectation and input, while a large vigilance value tolerates only small mismatches. The size of the categories after learning is, therefore, inversely related to the value of the vigilance parameter.

3.4 Formal Description of ART1
This section treats ART1 in a formal way. The following table summarises the notations for easy reference.

  symbol      description
  N           number of nodes in F1
  M           number of nodes in F2
  n_i         node in F1 (i ∈ {1, .., N})
  n_j         node in F2 (j ∈ {1, .., M})
  a_i, a_j    activation value of node i, node j respectively
  I_i         external input feeding into node i
  f(·)        sigmoid function
  w_ij        weight value of a link from F1 to F2 (bottom-up)
  w_ji        weight value of a link from F2 to F1 (top-down)
  λ_a         rate parameter for activation dynamics
  λ_w         rate parameter for weight dynamics

Activation Dynamics
The network dynamics are described by differential equations in "dimensionless form" (cf. Carpenter and Grossberg, 1987a). The variable a_i represents node i's (intrinsic) activation value, whereas its output (as "seen" by other nodes) is given by f(a_i) (i.e., the sigmoid of the intrinsic activation). The activation change of node n_i in F1 obeys the equation

  λ_a^{-1} da_i/dt = −a_i + (1 − A a_i)(I_i + D Σ_{j=1}^{M} f(a_j) w_ji) − (B + C a_i) Σ_{j=1}^{M} f(a_j)    (1)
                      decay   excitation                                  inhibition

with A, B, C, D constant parameter values (A > 0; max(1, D) < B < 1 + D; C > 0). This equation consists of three parts: (1) an autonomous decay term; (2) an excitation term, and (3) an inhibition term. These three terms are treated in detail below.
(1) The decay is proportional to the activation of n_i. Hence a node receiving no input will decay exponentially towards its minimum value;
(2) The excitation of n_i is gated by the term (1 − A a_i). According to this term, no matter how large the excitation, a_i saturates at its maximum value 1/A. (When a_i becomes larger than 1/A the excitation gain term becomes negative.) The second factor (I_i + D Σ_j f(a_j) w_ji) accounts for the two excitatory inputs of n_i: the external input I_i and the weighted top-down input from F2;
(3) The inhibition of n_i is gated by the term (B + C a_i), keeping the minimum activation value at −B/C. The second factor Σ_j f(a_j) represents the lumped inhibitory input from F2 (acting via the gain-control node G). We note that this equation represents a shortcut with respect to the wiring of the gain-control node. It is, however, formally equivalent to the architecture shown in Figure 2 (see Carpenter and Grossberg, 1987a).
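For readers who want to experiment, a minimal numerical sketch (our own, not part of the original exposition) integrates equation (1) with a forward Euler step. The parameter values, the step size, and the sigmoid shape below are arbitrary choices satisfying the stated constraints:

    import numpy as np

    # Hypothetical parameter values satisfying A > 0, max(1, D) < B < 1 + D, C > 0.
    A, B, C, D = 1.0, 1.5, 1.0, 0.9
    lam_a, dt = 10.0, 0.001                    # activation rate parameter, Euler step

    f = lambda a: 1.0 / (1.0 + np.exp(-20.0 * (a - 0.5)))   # steep sigmoid output

    def f1_step(a, I, a_f2, W_td):
        """One Euler step of equation (1) for all F1 nodes at once."""
        top_down = D * (f(a_f2) @ W_td)        # weighted top-down input, D * sum_j f(a_j) w_ji
        inhib = f(a_f2).sum()                  # lumped inhibition via gain control
        da = -a + (1 - A * a) * (I + top_down) - (B + C * a) * inhib
        return a + dt * lam_a * da

    a = np.zeros(3)                            # three F1 nodes, initially at rest
    I = np.array([1.0, 0.0, 1.0])              # the input pattern used in Section 3.3
    a_f2 = np.zeros(2)                         # two (inactive) F2 nodes
    W_td = np.ones((2, 3))                     # hypothetical top-down weights
    for _ in range(2000):
        a = f1_step(a, I, a_f2, W_td)
    print(np.round(a, 3))                      # -> about [0.5 0. 0.5]: driven nodes settle below the ceiling 1/A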
The activation change of node n_j in F2 obeys an analogous equation

  λ_a^{-1} da_j/dt = −a_j + (1 − A a_j)(f(a_j) + D Σ_{i=1}^{N} f(a_i) w_ij) − (B + C a_j) Σ_{k≠j} f(a_k)    (2)

with A, B, C, D constant parameter values (for valid values see Grossberg, 1973). The excitation term in this equation implements the self-excitation of F2 nodes (needed for competitive interactions) and the weighted input from F1. The inhibition term represents the lateral inhibition each F2 node receives from all other F2 nodes. As said before, the self-excitation and lateral inhibition endow F2 with the Winner-Take-All property (see Grossberg, 1973). When λ_a is large the F2 layer behaves approximately like a choice circuit (cf. Carpenter and Grossberg, 1987a). At every time instant in a choice circuit, the node with the largest excitation has an activation value that equals 1.0, while the rest of the nodes have activation value 0.0.

Weight Dynamics
Variable w_ij represents the weight value of the bottom-up link connecting n_i (in F1) with n_j (in F2). This link is adapted according to

  λ_w^{-1} dw_ij/dt = f(a_j) [ L (1 − w_ij) h(a_i) − w_ij Σ_{k≠i} h(a_k) ]    (3)

with L a constant parameter value (see below), and h(·) a "binary switch" function:

  h(a) = 1 if a > 0,
         0 if a ≤ 0.    (4)

The learning-rate parameter λ_w is small compared to λ_a (i.e., weights vary slowly with respect to activation changes). The first (excitatory) term between the brackets, L(1 − w_ij)h(a_i), describes the weight increase when a_i > 0. The factor (1 − w_ij) ensures that w_ij saturates at (maximum) value 1. The second (inhibitory) term, w_ij Σ_{k≠i} h(a_k), represents the negative effect of the summed "other" F1 nodes, gated by the magnitude of w_ij. (Larger weights receive more inhibition.) The initial factor, f(a_j), indicates that learning (weight change) takes place at a rate proportional to the output of the F2 node. Weights of connections that feed into an inactive F2 node are, therefore, not changed.
A simple differential equation describes the dynamics of the top-down weights:

  λ_w^{-1} dw_ji/dt = f(a_j) [ −w_ji + h(a_i) ].    (5)

In this equation, the decay of weight w_ji is gated by the output of the F2 node. The joint activation of the F1 and F2 nodes n_i and n_j, represented by the product f(a_j)h(a_i), may compensate this decay. It can easily be seen from this equation that the top-down weights diverging from an active F2 node simply follow the F1 activations: at equilibrium the above equation reduces to w_ji = h(a_i).

Fast Learning
ART1 can operate in two learning modes: fast learning (i.e., patterns are directly encoded at their first presentation) or slow learning (patterns are gradually encoded over several presentations). Here the fast learning case will be discussed, allowing considerable simplification of the learning equations. Fast learning (large λ_w) implies even faster activation dynamics. Therefore, the outputs of F1 and F2 nodes can be considered as rapidly switching binary variables (as described by the binary switch function h(·)). In the fast learning case, weights are changed in a single time step. The bottom-up weights change according to (cf. Simpson, 1989)

  w_ij = L / (L − 1 + Σ_k h(a_k))  if h(a_i) = h(a_j) = 1,
         0                         otherwise.    (6)

The top-down weights are adapted by

  w_ji = 1  if h(a_i) = h(a_j) = 1,
         0  otherwise.    (7)

Initially, the bottom-up weights should obey the inequality 0.0 < w_ij < L/(L − 1 + N), and the top-down weights 0.5 < w_ji < 1.0, to ensure that initial input patterns are mapped onto uncommitted F2 nodes.

Vigilance
The A node compares the full inhibitory signal from F1 with the full excitatory signal it receives from the input. When the difference between these signals exceeds a certain threshold, the A node becomes active. The vigilance parameter ρ is inversely related to this threshold. Whenever the ratio |F1|/|I| of the size of the F1 activations |F1| and the size of the input activations |I| becomes smaller than ρ ∈ [0,1], the A node is activated and the active F2 node is reset.
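In code, the fast-learning updates (6)-(7) and the vigilance test reduce to a few lines. The sketch below is our own rendering, not from Carpenter and Grossberg, and assumes that the resonating F1 pattern has already been determined:

    import numpy as np

    def fast_learn(W_bu, W_td, f1, j, L=2.0):
        """Apply equations (6) and (7) for winning F2 node j; f1 is the 0/1 F1 pattern."""
        active = f1 > 0
        W_bu[:, j] = np.where(active, L / (L - 1.0 + active.sum()), 0.0)
        W_td[j, :] = active.astype(float)

    def vigilance_ok(f1, I, rho):
        """Resonance test: the ratio |F1| / |I| must not fall below the vigilance rho."""
        return f1.sum() / I.sum() >= rho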
3.5 An Example Learning Session
The following example illustrates the dynamics of search and storage in an ART1 network during a number of learning trials in a learning session. Suppose that the learning set consisting of the patterns P1, P2 and P3 (shown in Figure 5) is to be learned. The ART1 network used in this example consists of an F1 layer with 2 x 2 nodes (the input grid) and an F2 layer of 3 nodes representing the categories X, Y and Z. The vigilance parameter is given a large value to ensure that each input pattern will be mapped onto a distinct category node. The weight values of the F1-F2 links are randomly initialized with small bottom-up values and large top-down values.

Fig. 5. A learning session in ART1 (columns: trial, pattern, expectation, mismatch, category).

At the first trial, pattern P1 is presented. This pattern will be mapped onto the F2 node Z that receives the largest excitation.
As shown in Figure 5, the expectation matches P1 perfectly since the top-down weights are large. Consequently P1 is preserved at F1 (mismatch 0). Then F1 and F2 are in resonance and the pattern P1 is stored under Z in F2. Following the presentation of P1, a second pattern, P2, is presented to F1 at trial 2. Since P2 is a subpattern of P1, its F1 pattern will be preserved by the top-down expectation from F2 node Z (mismatch 0). The patterns resonate and P2 is stored under Z. As a result the top-down expectation pattern is transformed to P2. (Recall that the top-down traces simply follow the F1 activations.) Because learning proceeds rapidly, the memory for the first pattern P1 is now lost. At trial 3, a third pattern P3 will not map onto F2 node Z, because it is not a subpattern of P2. Instead it is directly mapped onto the uncommitted F2 node X. Following this, the first pattern (P1) is presented again at trial 4. Clearly, it will not be mapped onto its original category node Z. Neither will it be mapped onto F2 node X. The reason for this is that the expectation patterns generated by both these categories do not suffice to preserve the F1 pattern. The sequence of events is shown in Figure 5. First the network tries to store P1 under category X, because its bottom-up weights match the pattern maximally (P3 and P1 share two active elements). Unfortunately, F2 node X sends back an expectation pattern that fails to maintain the F1 pattern (mismatch 1). Hence a reset wave selectively inhibits the X node during the remainder of this trial. Next, the network maps P1 onto category node Z, because its bottom-up weights provide the second-best match for P1 (P2 and P1 share one active element). The resultant feedback pattern from Z is, however, even worse at preserving the F1 pattern (mismatch 2), and again a reset takes place. Now that both committed nodes are reset, P1 is automatically stored under the uncommitted category node Y. Now that each of the patterns P1 to P3 has been categorized, future presentations will lead to direct access of the corresponding category.

Remarks
We remark that the number of recodings in ART depends on the order in which patterns are presented. In particular, when patterns are presented in such a way that each successive pattern contains more active elements than its predecessor (e.g., in the example above P2, P3, P1), no recoding will occur. Furthermore, it should be noted that when a new pattern is presented, all committed F2 nodes are searched, starting with those whose bottom-up weights match the new pattern maximally. If the pattern is rejected for the first node, then the remaining committed F2 nodes are searched in turn, until the pattern can be stored under one of them or under a new uncommitted F2 node. (See Carpenter and Grossberg (1987a) for a more detailed treatment of search order in ART1.) When ART1 is presented with an arbitrary set of input patterns, learning stabilizes in at most min(N, M − 1) presentations of the entire set, provided that L is small (Georgiopoulos, Heileman, and Huang, 1990; 1991; 1992).
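The whole session can be replayed with a compact fast-learning simulation. The sketch below is ours, not from the original text; the 2 x 2 patterns are hypothetical choices that satisfy the relations described above (P2 is a subpattern of P1, and P3 shares two active elements with P1):

    import numpy as np

    def present(I, W_bu, W_td, committed, rho=0.9, L=1.5):
        """One fast-learning trial: search, match, resonate; returns the chosen F2 node."""
        reset = []
        while True:
            T = W_bu.T @ I                       # bottom-up excitation of F2 nodes
            T[reset] = -np.inf                   # reset nodes stay suppressed this trial
            j = int(np.argmax(T))                # winner-take-all choice
            match = I * (W_td[j] > 0.5)          # 2/3 rule: intersection with expectation
            if match.sum() / I.sum() >= rho:     # vigilance test: resonance
                W_bu[:, j] = np.where(match > 0, L / (L - 1.0 + match.sum()), 0.0)
                W_td[j] = match                  # fast learning, equations (6) and (7)
                committed[j] = True
                return j
            reset.append(j)                      # reset wave; the search continues

    # Hypothetical 2x2 patterns consistent with the text.
    P1 = np.array([1.0, 1.0, 0.0, 1.0])
    P2 = np.array([0.0, 1.0, 0.0, 0.0])
    P3 = np.array([1.0, 0.0, 0.0, 1.0])

    N, M = 4, 3
    W_bu = np.full((N, M), 0.1)                  # small initial bottom-up weights
    W_td = np.full((M, N), 0.9)                  # large initial top-down weights
    committed = [False] * M

    for trial, P in enumerate([P1, P2, P3, P1], start=1):
        j = present(P, W_bu, W_td, committed)
        print(f"trial {trial}: stored under F2 node {j}")
    # P2 recodes P1's category at trial 2; at trial 4, P1 ends up under a fresh node.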
3.6 Simplifying ART: SMART
Tapang (1989) proposed a Simplified/Modified ART1 (SMART1) network. The structure of SMART is shown in Figure 6. In SMART1, the gain-control node and the 2/3 rule are omitted without losing ART1's functionality. In ART1, the matching operation at F1 entails a binary intersection of the top-down expectation pattern and the input pattern: only F1 nodes that receive simultaneous activation from F2 and I remain activated. Alternatively, in SMART1 the matching is represented by the binary union of these patterns. When this union results in an activation that is larger than the activation at input I, F1 sends a nonspecific excitation signal to the A node that exceeds the nonspecific inhibition the A node receives from the input I. Consequently, a reset wave is released and the active F2 node is inhibited.

Fig. 6. Comparison of ART and the Simplified/Modified ART network (SMART).

The initial weight values in SMART1 differ from those in ART1. Specifically, the initial top-down weights should all be zero. If they had large values (as in ART1), the union of the expectation and the input would be the same irrespective of the input pattern. Zero-valued weights will preserve initially presented input patterns. A functional difference between ART and SMART concerns the preferable order of pattern presentation preventing recoding. As discussed in the "Remarks" section, the preferred presentation order for ART is from small-sized to large-sized patterns. In SMART it is just the reverse, from large- to small-sized patterns. In SMART a large-sized pattern, encoded under an F2 node, yields a large feedback expectation. This expectation prevents small-sized patterns from being mapped under the same F2 node, because it will raise the F1 activation above the vigilance threshold, causing a reset wave. Consequently, after a large pattern has been stored under one F2 node, a subsequent smaller pattern will be stored under another F2 node in SMART.
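The two reset criteria can be contrasted directly; a small sketch of our own, with made-up binary patterns:

    import numpy as np

    I = np.array([1, 0, 1, 1])   # input pattern (hypothetical)
    E = np.array([1, 1, 0, 1])   # stored expectation pattern (hypothetical)

    # ART1: reset when the intersection shrinks too much relative to I.
    art_ratio = (I & E).sum() / I.sum()       # 2/3 here; reset if below the vigilance
    # SMART1: reset when the union grows beyond I.
    smart_excess = (I | E).sum() - I.sum()    # 1 extra node here; reset if too large
    print(art_ratio, smart_excess)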
4 ART2: Classifying Analog Patterns
In ART1, the matching of a top-down expectation pattern and an input pattern occurs by comparing the total activation of F1 with the total activation of input I. Obviously, this method is possible only because in ART1 input patterns are binary. Therefore, Carpenter and Grossberg (1987b) extended their original network to ART2 in order to enable classification of analog (i.e., continuous-valued) patterns. Below, a qualitative description is given of the necessary modifications. In ART2, the matching of categories against analog input patterns requires a rather complicated circuitry of three F1 sublayers (see Figure 7). For each F1 node in ART1, seven nodes are needed in ART2. Moreover, in order to normalize the activation of four of these nodes, five nonspecific inhibitory interneurons are needed (not shown in Figure 7). Figure 7 shows the architecture for one F1 node. The aim of this circuit is to compare the normalized input value I with the normalized top-down expectation. The circuit contains a lower loop k-l-m-n that circulates a single input value, and an interacting upper loop o-p-m-n that circulates a single expectation value. Input value I_i feeds into k. Node k feeds this value into node l, being simultaneously normalized (indicated by the gray arrow) with respect to the total pattern I. The corresponding F1 value feeds into node o. This node sends a normalized value to node p. The outputs of nodes p and l feed into node m. The normalized value of m is fed into node n. Node n, in turn, feeds the result back to nodes o and k. In addition, n and o feed into node q, constituting the normalized mismatch between input and expectation. The resultant functional behaviour of this intricate circuitry is that input and expectation are simultaneously normalized and matched.

Fig. 7. Modified F1 structure in ART2. The figure shows the ART2 equivalent of one ART1 F1 node. Redrawn after Simpson (1989).
In ART2, both bottom-up and top-down weights are changed according to the ART1 top-down equation. Competitive interactions between the bottom-up weights are not necessary in ART2, due to the normalization of patterns at F1.

5 Supervised Learning with ARTMAP
ART networks form categories in an unsupervised way. There is no teacher telling the network to which class a particular instance belongs. Some tasks require, however, the explicit linkage of examples to classes. ARTMAP (Carpenter, Grossberg, and Reynolds, 1991), a model incorporating two ART networks coupled by an associative network, extends adaptive resonance theory to supervised learning. In ARTMAP, one ART network (ARTa) is presented with an input pattern (example) and the other (ARTb) is presented with the associated output pattern (teaching pattern). Both the input and output patterns are classified by the respective F2 fields of ARTa and ARTb, yielding active F2 nodes in both modules. A so-called MAP field contains nodes that receive inputs from the F2 fields. The F2 field of ARTa is connected by unidirectional adaptive links to the MAP nodes, while the F2 field of ARTb connects bidirectionally in a one-to-one fashion with the MAP nodes. An internal control system conjointly maximizes predictive generalization and minimizes predictive error by adjusting the category size (vigilance) autonomously. ARTMAP self-organizes its weights in real time and stabilizes its weights after learning. Previously learned classes are not overwritten by new patterns. Consequently, unlike backpropagation, ARTMAP does not exhibit catastrophic forgetting.

6 Evaluation
The main feature of the ART networks described in this paper is that they provide a solution to the stability-plasticity dilemma. Learned information is automatically protected and ART shuts off its learning when its capacity is used up (i.e., no uncommitted F2 nodes are left). For this reason the networks can readily be applied to problems such as pattern recognition without the danger of previously learned information being overwritten. Clearly, ART1 provides an architecture that is more attractive and probably more easily realizable than ART2. In computer simulations, ART1 (or SMART) can be implemented using the simple learning equations given in this paper. For ART2, a major simplification for the simulations might be the (algorithmic) normalization and matching of patterns at F1, avoiding the complications involved in implementing the computationally expensive circuitry of Figure 7. An updated version of ART2, called ART2-A (Carpenter, Grossberg, and Rosen, 1991a), incorporates some computational shortcuts leading to an improved learning speed. An extension of adaptive resonance theory to hierarchical networks has also been proposed (Carpenter and Grossberg, 1990). Also, "fuzzy" reformulations of ART1 (Carpenter, Grossberg, and Rosen, 1991b) and ARTMAP
(Carpenter et al., 1992) have been proposed, using concepts from fuzzy-set theory. ART networks have some limitations that should be taken into account when considering their application. One such limitation is that ART is not capable of translation-, scale- and rotation-invariant pattern processing. The matching operation employed in ART is a first-order measure and, therefore, not capable of detecting the higher-order relations in the input necessary for such invariant processing (cf. Moore, 1989). A second limitation of ART1 networks is that they are sensitive to noise in their input patterns. As remarked by Hertz, Krogh and Palmer (1989), when random bits are missing from input patterns the recoding operation may degrade the category representations significantly. A third limitation of ART is that its category representations are localised (i.e., represented by a single F2 node) instead of being distributed over multiple nodes. As a result, damage to a single F2 node leads to the loss of an entire category. Although F2 may be redefined to hold distributed representations of categories, an entirely different learning scheme would be needed to preserve the properties of the original architecture. Although application of ART networks may require one or more alterations of the standard algorithm, its self-stabilizing property makes it an attractive candidate for application to real-world problems (see, e.g., Caudell, Smith, Escobedo, and Anderson, 1994).

References
G.A. Carpenter and S. Grossberg (1987a) A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, Vol. 37, 54-115.
G.A. Carpenter and S. Grossberg (1987b) ART 2: self-organization of stable category recognition codes for analog input patterns. Applied Optics, Vol. 26, 4919-4930.
G.A. Carpenter and S. Grossberg (1990) ART 3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks, Vol. 3, 129-152.
G.A. Carpenter, S. Grossberg and J.H. Reynolds (1991) ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, Vol. 4, 565-588.
G.A. Carpenter, S. Grossberg and D.B. Rosen (1991a) ART2-A: An adaptive resonance algorithm for rapid category learning and recognition. Neural Networks, Vol. 4, 493-504.
G.A. Carpenter, S. Grossberg and D.B. Rosen (1991b) Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, Vol. 4, 759-771.
G.A. Carpenter, S. Grossberg, N. Markuzon, J.H. Reynolds and D.B. Rosen (1992) Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, Vol. 3, 698-713.
T.P. Caudell, S.D.G. Smith, R. Escobedo and M. Anderson (1994) NIRS: Large scale ART-1 neural architectures for engineering design retrieval. Neural Networks, Vol. 7, 1339-1350.
M. Georgiopoulos, G.L. Heileman and J. Huang (1990) Convergence properties of learning in ART1. Neural Computation, Vol. 2, 502-509.
M. Georgiopoulos, G.L. Heileman and J. Huang (1991) Properties of learning in ART1. Neural Networks, Vol. 4, 751-757.
M. Georgiopoulos, G.L. Heileman and J. Huang (1992) The N-N-N Conjecture in ART1. Neural Networks, Vol. 5, 745-753.
M. Georgiopoulos, J. Huang and G.L. Heileman (1994) Properties of learning in ARTMAP. Neural Networks, Vol. 7, 495-506.
S. Grossberg (1973) Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, Vol. LII, 213-257.
S. Grossberg (ed.) (1982) Studies of mind and brain: neural principles of learning, perception, development, cognition, and motor control. Reidel Press, Boston.
S. Grossberg (ed.) (1986) The adaptive brain I: Cognition, learning, reinforcement, and rhythm. Elsevier/North-Holland, Amsterdam.
S. Grossberg (ed.) (1987a) The adaptive brain II: Vision, speech, language, and motor control. Elsevier/North-Holland, Amsterdam.
S. Grossberg (1987b) Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, Vol. 11, 23-63.
S. Grossberg (1987c) Nonlinear neural networks: principles, mechanisms, and architectures. Neural Networks, Vol. 1, 17-61.
B. Moore (1989) ART1 and pattern clustering. In D.S. Touretzky, G. Hinton and T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School. Morgan Kaufmann, San Mateo, CA, 174-185.
P.K. Simpson (1989) Artificial Neural Systems: Foundations, Paradigms, Applications, and Implementations. Pergamon Press, New York, NY.
C.C. Tapang (1989) An alternative matching mechanism: Getting rid of attentional gain control and its consequent 2/3 rule in ART-1. Technical Report, Syntonic Systems, Inc.
Boltzmann Machines

F.C.R. Spieksma
Department of Mathematics, University of Limburg, Maastricht

1 Introduction
The purpose of this paper is to introduce the reader to a specific type of neural network called the Boltzmann Machine. The Boltzmann Machine is a model originally proposed in a paper of Hinton and Sejnowski (1983) (see also Ackley, Hinton and Sejnowski (1985)), and it possesses attractive properties for solving problems from such diverse areas as pattern recognition, combinatorial optimization and learning. The combination of massive parallelism from neural computing and simulated annealing is the characteristic feature of Boltzmann Machines and results in a promising computational tool. Potential benefits of the Boltzmann Machine include:
- the model can be used in different problem areas; it is generally applicable.
- there is a sound mathematical background available which facilitates the analysis of the model.
- it is relatively easy to implement.
This paper is organized as follows. The next section is devoted to a detailed description of Boltzmann Machines. Different types of Boltzmann Machines are reviewed and some examples are given. Also, that section briefly addresses theoretical aspects of Boltzmann Machines. Section 3 deals with applications. First, we describe in a general way how problems from combinatorial optimization may be solved using Boltzmann Machines; then this is illustrated by two examples. The paper concludes with Section 4, where a brief summary is given and some possible future developments are stated. Let us close this introduction by mentioning that this paper is based on the book of Aarts and Korst (1989), in which a rigorous treatment of the subject of Boltzmann Machines is presented.

2 The Boltzmann Machine
This section describes how Boltzmann Machines work (Subsection 2.1), the state transition mechanism (2.2) and different types of Boltzmann Machines (2.3).

2.1 Description
A Boltzmann Machine B can be seen as a set of elements V (the neurons) and a set C of pairs of elements of V (the connections). All connections of the
form (v, v), with v ∈ V, called loops, are assumed to be elements of C, that is, {(v, v) | v ∈ V} ⊆ C. Each element can be in one of two states; more precisely, to each element v ∈ V one of the two values {0, 1} is associated. This corresponds to an element being 'on' (it is associated with 1) or 'off' (it is associated with 0). A configuration k of a Boltzmann Machine is determined by a 0-1 vector of length |V|, such that the component of the vector corresponding to element v, denoted k(v), represents the state of element v in this configuration. Thus, for each v ∈ V we have k(v) = 1 or k(v) = 0. Each connection (v1, v2) ∈ C has a certain weight or connection strength denoted by w_{v1v2} ∈ ℝ. Connections with a positive weight are called excitatory, those with a negative weight are called inhibitory. The weight of a loop, w_{vv}, is called the bias of element v. A Boltzmann Machine is bidirectional, that is, w_{v1v2} = w_{v2v1} for all (v1, v2) ∈ C. Now, let (v1, v2) be a connection in C. We define (v1, v2) to be activated in a given configuration k if both elements v1 and v2 are 'on', i.e., if k(v1) = 1 and k(v2) = 1. Finally, let us define a function F that assigns to each configuration of the Boltzmann Machine a certain value. This value can be interpreted as a measure of the quality of that particular configuration of the Boltzmann Machine. Define

  F(k) = Σ_{(v1,v2) ∈ C} w_{v1v2} k(v1) k(v2).

We will refer to F as the consensus function and to its value as the consensus. The Boltzmann Machine strives to maximize the consensus function; in other words, it wants to find a configuration with maximal consensus. It then follows from the definition of F that in the Boltzmann Machine excitatory connections will tend to be activated while activation of inhibitory connections tends to be avoided.
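As a small sketch (ours, with made-up weights), the consensus of a configuration and an exhaustive search for its maximum can be written as:

    from itertools import product

    # weights[(u, v)] = w_uv; loops (v, v) carry the bias. All values are made up.
    weights = {(0, 0): 1.0, (1, 1): 1.0, (0, 1): -2.0, (1, 0): -2.0}

    def consensus(k, weights):
        """F(k): sum of w_uv * k(u) * k(v), counting each connection once."""
        return sum(w * k[u] * k[v] for (u, v), w in weights.items() if u <= v)

    best = max(product([0, 1], repeat=2), key=lambda k: consensus(k, weights))
    print(best, consensus(best, weights))   # -> (0, 1) with consensus 1.0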
Let us for a moment consider the small example given by Figure 1, where B = (V, C) with

  V = {v1, v2, v3, v4},
  C = {(v_i, v_i) : i = 1, ..., 4} ∪ {(v1, v2), (v1, v3), (v1, v4), (v2, v3), (v3, v4)},

with w_{v_i v_i} = 1 for all i = 1, ..., 4 and with all other weights as indicated in the figure.

Fig. 1. Example of a Boltzmann Machine.

As there are 4 elements in the example, the number of possible configurations equals 2^4 = 16. For instance, if all elements are on (implying that k = (k(v1), k(v2), k(v3), k(v4)) = (1,1,1,1)), the consensus equals the sum of the weights of all connections, which in this case turns out to be 0. The reader may convince him/herself of the fact that there are 2 configurations reaching a maximal consensus, namely (1,0,0,1) and (0,1,1,0), with consensus 3. Note that the configuration (0,1,0,1) has the following property: if the state of exactly one of the elements is changed, giving rise to a configuration l, the consensus of configuration l will not be larger than the consensus of (0,1,0,1). This property will turn out to be important in the sequel.

2.2 The Transition Mechanism
How does a Boltzmann Machine try to reach a maximal consensus? This is done by allowing the elements of the Boltzmann Machine to change states. Obviously, a change of the state of an element affects the consensus of the Boltzmann Machine. The way in which elements change states is governed by a so-called state transition mechanism. In order to describe this mechanism we have to make a distinction between two kinds of Boltzmann Machines, namely:
- the sequential Boltzmann Machine, in which elements may change their state one at a time;
- the parallel Boltzmann Machine, in which elements can change states simultaneously.
Let us first consider the sequential Boltzmann Machine. For each configuration k, we define a neighborhood N_k as the set of configurations obtained by changing the state of exactly one element. So, in our example in Figure 1 the neighborhood of (0,1,0,1) is

  N_k = {(0,1,0,0), (0,1,1,1), (0,0,0,1), (1,1,0,1)}.

(Notice that if l ∈ N_k then k ∈ N_l.)
Let us have a look at the difference in consensus between two configurations k1, k2 in the same neighborhood. Assume that k1(v) = 1 and k2(v) = 0. Then the difference ΔF_{k1}(v) (= F(k1) − F(k2)) is given by

  ΔF_{k1}(v) = Σ_{(v,u) ∈ C, u ≠ v} w_{vu} k1(u) + w_{vv}.

Obviously, if k1(v) = 0 and k2(v) = 1 then

  ΔF_{k1}(v) = −( Σ_{(v,u) ∈ C, u ≠ v} w_{vu} k1(u) + w_{vv} ).

This shows that the elements of the Boltzmann Machine are relatively independent; more precisely, the effect on the consensus of a change of state of element v is determined only by the states of the elements with a connection to v and the corresponding weights. This means that parallel implementation is possible, as will be discussed in 2.3. In the sequel ΔF_{k1}(v) denotes the difference in consensus between configurations k1 and k2, where k2 differs from k1 only with respect to the state of element v. Now, define a configuration k to be locally maximal if ΔF_k(v) ≥ 0 for all v ∈ V. In other words, if the consensus of a configuration k cannot be improved by changing the state of a single element, then this configuration k is called locally maximal. Thus, (0,1,0,1) in the example of Figure 1 is locally maximal. The state transition mechanism may now be described in the following way. Suppose a certain configuration k1 is reached. Then, an element v is randomly selected and ΔF_{k1}(v), the effect on the consensus of changing the state of element v, is computed. Now, depending on (i) the value of ΔF_{k1}(v), and (ii) a control parameter c (c > 0), the transition is accepted or not. To be more precise, as in the simulated annealing algorithm, these two factors are used to compute an acceptance probability P for the transition from k1 to k2. This probability equals

  P_{k1}(v, c) = 1 / (1 + exp(ΔF_{k1}(v) / c)).

In Figure 2 the relation between P_{k1}(v, c) and ΔF_{k1}(v) is depicted for different values of the control parameter c. Concluding: the Boltzmann Machine starts by randomly picking an initial configuration, then repeatedly selects a random element v and computes, for a relatively large value of the control parameter c, the probability of a state transition of element v. This process is carried out iteratively while c is slowly decreased. (Notice that this implies that the probability of accepting a deterioration in the consensus decreases as time proceeds.) At each value of c a fixed number of iterations (T) is performed, and the process stops when during L consecutive iterations no change in the consensus has occurred.
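A single proposed transition can be sketched as follows (our own Python rendering; the weight and neighbor dictionaries are assumed to be given as in the earlier sketch):

    import math
    import random

    def delta_F(k, v, weights, neighbors):
        """dF = F(current) - F(flipped): the consensus lost by switching v."""
        local = sum(weights[(v, u)] * k[u] for u in neighbors[v]) + weights[(v, v)]
        return local if k[v] == 1 else -local

    def try_flip(k, v, c, weights, neighbors):
        """Accept the state transition with probability 1 / (1 + exp(dF / c))."""
        x = min(delta_F(k, v, weights, neighbors) / c, 700.0)   # overflow guard
        if random.random() < 1.0 / (1.0 + math.exp(x)):
            k[v] = 1 - k[v]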
Fig. 2. Threshold functions of acceptance probabilities.

Obviously, the values T, L and, more importantly, the way in which c is decreased (the so-called cooling schedule) determine the outcome of this process. In particular, there are different ways of prescribing such cooling schedules. Usually, a cooling schedule is specified beforehand (for different variants, see Aarts and Korst (1989)), but it is also possible to design cooling schedules which depend on the behavior of the process itself (see Andresen (1991)). In any case, when specifying T, L and the cooling schedule, one has to find a balance between on the one hand cooling too fast (i.e., choosing T too small or letting c decrease too fast; this may cause the Boltzmann Machine to get stuck in a local maximum of poor quality), and on the other hand cooling too slowly, which results in excessive computation times (for more comments along this line we refer to the contribution by Crama et al.). Mathematically it can be proved that, given enough time, the Boltzmann Machine stabilizes in a global maximum. More practically, when less time is available, it is easy to see that the Boltzmann Machine at least always stabilizes in a local maximum, since as c approaches zero, only transitions with an improvement in consensus are accepted.
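Putting the pieces together, a sequential annealing loop with a geometric cooling schedule might look as follows. This is a sketch under assumed parameter values, not a prescription from the paper; T, L, the start value of c, and the cooling factor all remain to be tuned:

    import math
    import random

    def anneal(n, weights, neighbors, c=10.0, alpha=0.95, T=100, L=500):
        """Sequential Boltzmann Machine: T proposals per value of c, geometric cooling.
        Stops once L consecutive proposals leave the consensus unchanged."""
        k = [random.randint(0, 1) for _ in range(n)]   # random initial configuration
        unchanged = 0
        while unchanged < L:
            for _ in range(T):
                v = random.randrange(n)
                local = sum(weights[(v, u)] * k[u] for u in neighbors[v]) + weights[(v, v)]
                dF = local if k[v] == 1 else -local    # F(current) - F(flipped)
                x = min(dF / c, 700.0)                 # guard against overflow
                if random.random() < 1.0 / (1.0 + math.exp(x)):
                    k[v] = 1 - k[v]
                    unchanged = 0 if dF != 0 else unchanged + 1
                else:
                    unchanged += 1
            c *= alpha                                 # geometric cooling schedule
        return k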
2.3 Different Types of Boltzmann Machines
In the previous subsection we described the state transition mechanism for a sequential Boltzmann Machine. However, we noted there that, due to the relative independence of the elements, parallelism is possible. Here, we will explore this subject further. In a parallel Boltzmann Machine elements are allowed to change states simultaneously. To give an exact description we should distinguish between so-called synchronous and asynchronous parallelism. In synchronous parallelism sets of state transitions are evaluated consecutively, while the state transitions in one set are evaluated simultaneously. The accepted state transitions of a particular set are then communicated through the Boltzmann Machine. This implies that for the next set of state transitions the exact configuration of the Boltzmann Machine is known. The use of synchronous parallelism implies the availability of a global clocking scheme, which is not required in asynchronous parallelism. There, elements continuously generate state transitions, which are evaluated on the basis of not necessarily up-to-date information, as the states of connected elements may have changed in the meantime. Another important characteristic of a parallel Boltzmann Machine is whether the parallelism is limited or unlimited. In limited parallelism only unconnected elements may change states in parallel; this restriction does not apply to unlimited parallelism. Let us consider a synchronous Boltzmann Machine with limited parallelism. Then we may want to partition the elements of the Boltzmann Machine into sets of maximal size, such that no two connected elements belong to the same set. In this way the greatest speed-up is achieved, since all elements of one set can simultaneously propose a state transition (a simple way to compute such a partition is sketched below). For instance, consider again our example in Figure 1. There we may partition the elements as follows: {{v1}, {v3}, {v2, v4}}. Now, elements v2 and v4 may change their states simultaneously, realizing an increase in the number of state transitions per time unit. In the unlimited case, the following may happen. Suppose two connected elements v1 and v2 are off and they both switch to on. Then the connection (v1, v2) is activated, although w_{v1v2} was not considered in the calculation of ΔF_{k1}(v). In principle, this could make the Boltzmann Machine accept unwanted state transitions; however, the probability of a transition based on an erroneously calculated ΔF_{k1}(v) decreases as the control parameter c decreases, simply because the probability of any transition decreases as c decreases. This may explain the fact that, in practice, unlimited parallelism does not seem to affect the quality of the configurations obtained by Boltzmann Machines. However, for Boltzmann Machines with limited parallelism it can be proved (under some minor assumptions) that the final configuration converges to the optimal configuration, whereas this statement is not (yet) proved for Boltzmann Machines with unlimited parallelism, due to the fact that state transitions based on erroneously calculated ΔF_{k1}(v) may occur.
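Computing a partition into sets of mutually unconnected elements is a graph coloring problem. A simple greedy heuristic (our own sketch, not from the paper; it need not find sets of maximal size) is:

    def greedy_partition(n, edges):
        """Greedily color elements 0..n-1 so that no connection joins two equal colors;
        each color class may then propose state transitions in parallel."""
        adj = {v: set() for v in range(n)}
        for u, v in edges:
            if u != v:                       # ignore loops
                adj[u].add(v)
                adj[v].add(u)
        color = {}
        for v in range(n):
            used = {color[u] for u in adj[v] if u in color}
            color[v] = next(c for c in range(n) if c not in used)
        classes = {}
        for v, c in color.items():
            classes.setdefault(c, []).append(v)
        return list(classes.values())

    # Figure 1's connection structure (elements renumbered 0..3):
    print(greedy_partition(4, [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]))
    # -> [[0], [1, 3], [2]]: v2 and v4 may switch simultaneously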
3 Solving Combinatorial Optimization Problems with Boltzmann Machines
In this section we describe how Boltzmann Machines can be used to solve problems from combinatorial optimization. A general explanation is given in (3.1), while (3.2) and (3.3) focus on specific combinatorial optimization problems, namely the max-cut problem and the traveling salesman problem, respectively. Some familiarity with problems from combinatorial optimization is assumed.

3.1 General Description
An instance of a combinatorial optimization problem can be viewed as a finite set of feasible solutions, with a certain cost associated to each feasible solution, represented in a concise manner. Of course, the problem is to find a feasible solution with minimal cost. Usually, due to the enormous number of feasible solutions, complete enumeration to find a best solution is impracticable. How can a Boltzmann Machine be used to solve problems from combinatorial optimization? Obviously, we have to design our Boltzmann Machine in such a way that it represents the problem we want to solve. This can be achieved in the following way. A combinatorial optimization problem can be formulated with binary variables, that is, with variables x_i belonging to {0,1} for all i. Let us now define a Boltzmann Machine such that each binary variable is represented by exactly one element. In this way, a configuration of the Boltzmann Machine defines a (not necessarily feasible) solution of the combinatorial optimization problem; we simply set x_i = k(v_i) for all i. Now, the set of connections C and the corresponding weights w should be designed in such a way that the consensus function is feasible and order preserving. Feasibility of the consensus function is achieved when each local maximum of the Boltzmann Machine corresponds to a feasible solution of the combinatorial optimization problem. Thus, when the consensus function is feasible, the Boltzmann Machine is guaranteed to find a feasible solution. An order-preserving consensus function is a consensus function such that the quality of the local maxima of the Boltzmann Machine reflects the quality of the solutions of the combinatorial optimization problem. More precisely, if solution a is better than solution b, then for the corresponding local maxima the same relation should hold. This implies that for an order-preserving consensus function a best solution of the combinatorial optimization problem corresponds to a global maximum of the Boltzmann Machine. As the Boltzmann Machine maximizes its consensus function, it looks for a configuration corresponding to an optimal solution of the combinatorial optimization problem. In the following we discuss two kinds of combinatorial optimization problems with a feasible and order-preserving consensus function.
3.2 The Max-Cut Problem
Consider the following problem. Given is a graph G = (N, E), where N denotes the set of vertices, with |N| = n, and E the set of edges. To each edge (i, j) a positive weight d_ij is associated. The max-cut problem is to partition the vertices of N into two disjoint sets N1 and N2 such that the sum of the weights of the edges having one endpoint in N1 and the other in N2 is maximal. In Figure 3 a small example is depicted.

Fig. 3. A max-cut problem.

The following 0-1 mathematical programming formulation describes the max-cut problem:

  max Σ_{i=1}^{n} Σ_{j=i+1}^{n} d_ij [ (1 − x_i) x_j + x_i (1 − x_j) ]
  x_i ∈ {0,1} for i = 1, ..., n

with x_i = 1 if vertex i is in N1, and x_i = 0 otherwise. How is the corresponding Boltzmann Machine defined? First, we realize that the Boltzmann Machine should consist of n elements, where each element corresponds to a vertex in the graph G. Second, consider the following sets of connections:

  C1 = {(v_i, v_i) : i ∈ N} (the loops),
  C2 = {(v_i, v_j) : (i, j) ∈ E}.

The weights for connections in C1 are given by w_{v_i v_i} = Σ_{j ≠ i} d_ij for all i ∈ N (where d_ij = 0 if (i, j) ∉ E), and the weights for connections in C2 are given by w_{v_i v_j} = −2 d_ij for all (i, j) ∈ E.
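Constructing this machine from an edge list is mechanical; a sketch of our own, producing the weight and neighbor dictionaries used in the earlier sketches:

    def maxcut_machine(n, edge_weights):
        """Build Boltzmann Machine weights for max-cut.
        edge_weights: {(i, j): d_ij} with i < j, vertices numbered 0..n-1."""
        weights = {}
        neighbors = {i: [] for i in range(n)}
        for i in range(n):
            # bias (loop weight): sum of the weights of the edges incident to i
            weights[(i, i)] = sum(d for (a, b), d in edge_weights.items() if i in (a, b))
        for (i, j), d in edge_weights.items():
            weights[(i, j)] = weights[(j, i)] = -2.0 * d    # inhibitory edge weight
            neighbors[i].append(j)
            neighbors[j].append(i)
        return weights, neighbors

    # A triangle with unit weights: any cut separating one vertex has value 2,
    # and indeed the consensus of, e.g., (1, 0, 0) equals 2.
    weights, neighbors = maxcut_machine(3, {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0})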
Now it is not hard to prove that the resulting consensus function is feasible and order preserving; consider the consensus of a configuration k:

  F(k) = Σ_{i=1}^{n} ( Σ_{j≠i} d_ij ) k(v_i)^2 + Σ_{(i,j) ∈ E} −2 d_ij k(v_i) k(v_j)
       = Σ_{i=1}^{n} Σ_{j=i+1}^{n} d_ij ( k(v_i)^2 + k(v_j)^2 ) + Σ_{i=1}^{n} Σ_{j=i+1}^{n} −2 d_ij k(v_i) k(v_j),

which, using k(v_i)^2 = k(v_i) for 0-1 variables, is equivalent to the original formulation. Hence it follows that the Boltzmann Machine for the max-cut problem, with the connections and corresponding weights as defined earlier, has a feasible and order-preserving consensus function. The Boltzmann Machine belonging to the example of Figure 3 is given in Figure 4.

Fig. 4. A Boltzmann Machine for the max-cut example.

Unfortunately, the problem of choosing the connections and their weights so as to obtain a feasible and order-preserving consensus function is not always as easy as for the max-cut problem. For the following combinatorial optimization problem, this becomes more difficult.

3.3 The Traveling Salesman Problem (TSP)
Consider the following problem. Given n vertices (or cities) and a distance c_ij for each pair of vertices (i, j), i = 1, ..., n, j = 1, ..., n, determine the minimal length of a tour visiting each city precisely once. Problems of this type may occur as subproblems in the area of scheduling. A 0-1 mathematical programming formulation for this problem is given by
  min Σ_{i,j,p,q} d_ijpq x_ip x_jq

  subject to  Σ_{i} x_ip = 1 for all p,
              Σ_{p} x_ip = 1 for all i,
              x_ip ∈ {0,1} for all i, p,

with variables

  x_ip = 1 if the tour visits city i at the p-th position,
         0 otherwise,

and with parameters

  d_ijpq = c_ij if q = (p + 1) mod n,
           0 otherwise.

So for any round tour along the cities as determined by the variables x_ip, we have that d_ijpq x_ip x_jq = c_ij if and only if city i immediately precedes city j on this tour, and d_ijpq x_ip x_jq = 0 otherwise. Thus Σ_{i,j,p,q} d_ijpq x_ip x_jq is the total length of the tour given by x. In order to use a Boltzmann Machine to solve the TSP we need an element v_ip for each variable x_ip. Furthermore, we choose the set of connections as C1 ∪ C2 ∪ C3, where C1, C2 and C3 are as follows:

  C1 = {(v_ip, v_ip)} (the loops),
  C2 = {(v_ip, v_jq) | i ≠ j ∧ q = (p + 1) mod n},
  C3 = {(v_ip, v_jq) | (i = j ∧ p ≠ q) ∨ (i ≠ j ∧ p = q)}.

If we choose the weights of the connections in the following way:

  for all (v_ip, v_ip) ∈ C1, we take w_{v_ip v_ip} > max_{k≠m} (c_ik + c_im), so that activating an element is always worthwhile (every city embarks on the tour);
  for all (v_ip, v_jq) ∈ C2, we take w_{v_ip v_jq} = −c_ij;
  for all (v_ip, v_jq) ∈ C3, we take w_{v_ip v_jq} < −min{w_{v_ip v_ip}, w_{v_jq v_jq}}, to ensure that the machine settles on a tour;

then it can be proved that the consensus function is feasible and order preserving. We shall omit this proof; for details see Aarts and Korst (1989).
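The corresponding weight construction can be sketched as follows (our own rendering, not from Aarts and Korst; an element (i, p) stands for "city i at tour position p", and the machine requires n >= 3):

    def tsp_machine(dist):
        """Boltzmann Machine weights for the TSP encoding above.
        dist[i][j] = c_ij; each (symmetric) connection is stored once."""
        n = len(dist)
        bias = {i: 1.0 + max(dist[i][k] + dist[i][m]
                             for k in range(n) for m in range(n)
                             if k != m and i not in (k, m))
                for i in range(n)}
        penalty = -(1.0 + max(bias.values()))      # C3 weight, below -min of the biases
        weights = {}
        for i in range(n):
            for p in range(n):
                weights[((i, p), (i, p))] = bias[i]              # C1: the loops
                for j in range(n):
                    for q in range(n):
                        if (j, q) <= (i, p):
                            continue                             # store each pair once
                        if i != j and (q == (p + 1) % n or p == (q + 1) % n):
                            weights[((i, p), (j, q))] = -dist[i][j]   # C2: tour length
                        elif i == j or p == q:
                            weights[((i, p), (j, q))] = penalty       # C3: feasibility
        return weights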
4 Conclusions
In this paper we have described the Boltzmann Machine and its potential use for problems from combinatorial optimization. Two remarks are worth making here. Firstly, the traveling salesman problem shows that it is not always easy to find a consensus function that is feasible and order preserving. In other words, the translation of the problem formulation into a provably equivalent Boltzmann Machine is generally nontrivial. In fact, for more complicated combinatorial optimization problems (e.g., job shop scheduling), one has not yet succeeded in designing a satisfactory Boltzmann Machine. Secondly, the "cooling down scheme", i.e., the way in which the control parameter c is decreased, appears to be critical; finding a satisfactory cooling down scheme is a problem in itself. In the past years, however, a number of applications have appeared in the literature for which Boltzmann Machines were successfully used. For an overview of these applications we refer to Aarts and Korst (1989). It is to be expected that Boltzmann Machines or similar approaches will play a major role in solving problems from combinatorial optimization. In fact, partly due to the increasing availability of special-purpose hardware designed for Boltzmann Machines, their general importance will become more significant.

References
E.H.L. Aarts and J. Korst (1989) Simulated Annealing and Boltzmann Machines. John Wiley and Sons, Chichester, England.
D.H. Ackley, G.E. Hinton and T.J. Sejnowski (1985) A learning algorithm for Boltzmann machines. Cognitive Science 9, 147-169.
B. Andresen (1991) Parallel implementation of simulated annealing using an optimal adaptive schedule. Proc. European Simulation Multi-conference, 296-300.
G.E. Hinton and T.J. Sejnowski (1983) Optimal perceptual inference. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Washington DC, 448-453.
Representation Issues in Boltzmann Machines

J.H.J. Lenting*
Department of Computer Science, University of Limburg, Maastricht

1 Introduction
Finding a suitable representation for a problem in the context of specific hardware and software environments is commonly acknowledged to be an important — and far from easy — area of research. Whereas most traditional approaches within the field of symbolic AI explicitly address this issue, it is sometimes suggested in neural network research that the task of getting the representation right can be left to the network. This is untrue. For some problems, neural networks succeed remarkably well. For others, however, results are rather disappointing. Both in learning and non-learning networks, the problem representation (the encoding of the problem in neural weights or inputs) often appears to be instrumental in this respect. As an example, we shall look at the representation of a Traveling Salesman Problem (TSP) on a Boltzmann machine for combinatorial optimization. Boltzmann machines generally perform well on many graph problems (e.g., the max-cut problem described in the contribution by Spieksma), but rather disappointingly on other combinatorial optimization problems, like the Traveling Salesman Problem, which has occurred a number of times in earlier chapters (viz., the contributions by Vrieze, Crama et al., Postma and Spieksma). In this paper, we investigate the role of the TSP representation in this matter. Statements on the effect of representational variations are based on experimentation, using a (simulated) Boltzmann machine with unlimited, synchronous parallelism (see the contribution by Spieksma). An account of the experiments is presented in the appendix.

2 Performance Evaluation of the Boltzmann Machine

2.1 Performance Criteria
Important promises of neural networks lie in the area of computation speed and robustness. As for the usefulness of Boltzmann machines for combinatorial optimization, we consider computation speed to be by far the most important criterion. Robustness is less of an issue, at least if we use the performance of

* Research was supported by the Netherlands Foundation for Scientific Research NWO under grant number 612-322-014
simulated annealing as our point of reference. No major difference in robustness is to be expected between sequential simulated annealing on a Von Neumann computer and parallel simulated annealing on a Boltzmann machine, since simulated annealing itself is relatively insensitive to problem changes due to its probabilistic nature. Summarizing, apart from solution quality, we shall use but one criterion for the performance of the Boltzmann machine on TSPs: computation time.

2.2 Boundary Conditions for Representation Research
Whereas the performance of Boltzmann machines on many graph problems compares favorably with both simulated annealing and tailored algorithms, the results of Boltzmann machine simulations on TSPs have been much less impressive (Aarts and Korst, 1989a;b). One has tried to improve on this in many ways, trying out adaptive cooldown schedules (Andresen, 1991), non-digital Boltzmann machines with continuous cell values (Gutzmann, 1987), and machines with additional, weightless connections (Aarts and Korst, 1989a). In this paper we shall concentrate on alternative problem representations that do not require adaptation of the Boltzmann machine architecture. The motivation for this restriction is twofold. Firstly, the impact of problem representation on performance, which we are currently focusing on, cannot be properly determined if we use different architectures. Secondly, the (commercial) feasibility of producing Boltzmann machine hardware is endangered if different problems require different architectures. Consequently, we simply stick to the conceptual description of the Boltzmann machine that was provided in the contribution by Spieksma. Encoding a TSP on the Boltzmann machine is thus tantamount to determining the connection matrix. Of course, this can only be done properly after one has decided how to map the solution space of the TSP onto the configuration space of the Boltzmann machine.

3 The Quadratic Assignment Representation

3.1 Mapping the Solution Space onto the Configuration Space
We recall the TSP encoding as a quadratic assignment problem from the contribution by Spieksma. A TSP solution (a cyclic permutation of the cities to be visited) is mapped onto the cells x_ij of the Boltzmann machine by defining x_ij = 1 if city i is at position j in the tour, and x_ij = 0 otherwise. We remark that this mapping is far from surjective: only a small fraction of the configurations corresponds to an actual tour. More precisely, picturing the configuration (x_ij) as a matrix, the configuration represents an actual tour if and only if each row and each column contains exactly one nonzero element. A row full of zeros would imply that the associated city is skipped, a zero column indicates an empty position in the tour, and more than one nonzero element in a row or column corresponds to a city occurring at different positions, or more than one city at one position, respectively.
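This validity condition — the configuration matrix must be a permutation matrix — is easy to state in code (our own sketch):

    import numpy as np

    def is_tour(x):
        """A 0/1 configuration matrix encodes a tour iff every row and column sums to 1."""
        x = np.asarray(x)
        return (x.sum(axis=0) == 1).all() and (x.sum(axis=1) == 1).all()

    print(is_tour(np.eye(4)))           # True: city i at position i
    print(is_tour(np.zeros((4, 4))))    # False: all cities skipped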
3.2 Mapping the Objective Function onto the Connection Matrix
In the following, we describe how the objective function (minimal tour length) is translated into a definition of the connection matrix of the Boltzmann machine. We shall refer to connection strengths as "weights", using w_ijkl to denote the weight between cells x_ij and x_kl. The non-zero entries of the connection matrix (w_ijkl) are divided into three categories:

city weights: the entries connecting cells with themselves;
distance weights: the entries connecting pairs of cells denoting adjacent cities in the tour;
permutation weights: the entries connecting pairs of cells that must not both be "on" if the configuration of the machine is to represent a solution (i.e., pairs of cells denoting different cities at the same position, or the same city at different positions).

Denoting these categories with C1, C2, and C3, respectively, and the distance between cities i and j with d(i, j), the weight matrix (w_ijkl) is defined by:

  w_ijkl = −d(i, k)                        if w_ijkl ∈ C2
  w_ijij > max_{k≠m} (d(i, k) + d(i, m))   (the C1 entries)
  w_ijkl < −min(w_ijij, w_klkl)            if w_ijkl ∈ C3
  w_ijkl = 0                               otherwise

This assignment of the weight matrix can be understood as follows. First of all, the Boltzmann machine strives for maximal consensus, so we need to ensure that high consensus values correspond to short tours. This is achieved by the assignments to the C2-category weights. If we left it at that, consensus would be maximal if all cells of the machine were set to zero, representing a "tour" visiting no cities at all. To counteract this, cities are stimulated to embark on the tour by the bias connection weights in class C1. The stimulation of each city is chosen sufficiently high to compensate for the associated distance weights in the worst possible situation (in which a city's neighbors in the tour are exactly those cities which are most distant from it). If we left it at that, consensus would be maximal if all cells were set to one, that is, if each city occurred at all positions. To counteract this, the "permutation weights" (C3) are chosen sufficiently negative to guarantee that any configuration with more than one nonzero cell in some row or column will lead to a lower consensus value than a neighboring configuration with one nonzero cell in that row or column.
This weight assignment guarantees that the encoding is "feasible": the consensus function has a local maximum if and only if the configuration represents a solution, that is, an actual tour.

4 Weak Spots in the Quadratic Assignment Representation

Comparing the annealing process on a Boltzmann machine with that on a Von-Neumann machine, we distinguish a number of "weak spots" which could be responsible for the disappointing performance in terms of solution quality.

search space size: In sequential simulated annealing, the solution space equals the search space², whereas on the Boltzmann machine the solution space is only a tiny subspace of the search space.
neighborhood structure: A transition in the Boltzmann machine comprises one cell switching from 0 to 1, or from 1 to 0. This "bit switch" neighborhood structure is much less efficient than (for example) a 2-exchange neighborhood structure (see the contribution by Crama et al.) in sequential simulated annealing.
objective function: The objective function used in sequential simulated annealing is an accurate measure of the quality of the current state, whereas the consensus function used in the Boltzmann machine is dominated by 'noisy' terms that reflect the distance of the current configuration to the closest solution, rather than the distance to the best (or a reasonably good) solution.³

4.1 The Size of the Search Space

As for the first weak spot, the configuration space involved is much larger than the solution space of the TSP. For an n-city TSP there are (n-1)!/2 distinct non-equivalent tours⁴, whereas the n² binary cells of the corresponding Boltzmann machine lead to a configuration space of 2^{n²} configurations, only n! of which correspond to actual tours. What this amounts to in terms of "tour density" is pictured for various values of n in Table 1.

4.2 The Neighbourhood Structure

In sequential simulated annealing, the neighbourhood structure can be chosen so as to minimize the diameter (maximum distance between two elements) of the search space.

² in the sense that all search space elements correspond to solutions.
³ Since at least four non-tour configurations lie between any two solutions, this flaw is liable to "tell the machine to turn left where it should turn right".
⁴ There are n! permutations, but it does not matter in what city we start, nor in what direction the tour is traversed, leading to division by n and 2, respectively.
Table 1. The density of tours in the Boltzmann configuration space

    n      n!             2^{n²}           tour density
    6      720            6.9 x 10^10      1 x 10^-8
    10     3.6 x 10^6     1.27 x 10^30     2 x 10^-24
    12     4.8 x 10^8     2.2 x 10^43      2 x 10^-35
    20     2.4 x 10^18    2.6 x 10^120     1 x 10^-102
    100    0.9 x 10^158   2.0 x 10^3010    5 x 10^-2853

On the Boltzmann machine, we are forced to use the rather unattractive neighborhood structure of bitwise switches. Consequently, a 2-exchange, which requires one step on a Von-Neumann machine with the appropriate neighbourhood structure, will at best require 4 steps and at worst a number close to the dimension of the problem (the number of cities) on a Boltzmann machine. The probabilistic nature of state space transitions implies that these numbers are lower-bound estimates of the efficiency decrease incurred.

4.3 The Combined Effects of the Weak Spots

Summarizing, the quadratic assignment encoding of a TSP is 'unfortunate' as a consequence of three interacting aspects: the low density of solutions in the state space, the primitive neighborhood structure, and the low distinctive ability of the consensus function. Apart from these representation problems, there are also architectural ones (e.g., involving errors due to unlimited parallelism and difficulties in determining an appropriate cooldown schedule) which can interact with the representation problems. In view of our focus on representation, we choose not to discuss these additional problems here.

5 Searching for Improved Representations

In this section we explore potential remedies for each of the representational flaws postulated in the previous section. It will appear that none of these "remedies" renders truly adequate improvements. The representational flaws thus appear to constitute a serious problem.

5.1 Adjusting the Objective Function

To counteract the 'noisy' influence of terms unrelated to solution quality in the consensus function, Aarts and Korst (1989a;b) suggest decreasing the city weights, thus egalizing the "consensus surface" defined by the weight matrix. They propose to define the entries w_{ij,ij} equal to the mean value of the sum of distances from two different cities to city i, instead of the maximum value. This would diminish the difference in magnitude between the distance weights on the one hand and the city and permutation weights on the other, thus reducing the dominance of the latter two in the consensus function.
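The egalized city weight is easy to state in code. The sketch below is ours, not Aarts and Korst's; as the comment notes, the mean over all pairs of other cities collapses to a simple sum.

    def egalized_city_weight(d, i):
        # Mean, over all pairs {k, m} of cities other than i, of d(i,k) + d(i,m).
        # Each d(i,k) occurs in (n-2) of the (n-1)(n-2)/2 pairs, so the mean
        # equals 2/(n-1) times the sum of all distances from city i.
        n = len(d)
        return 2.0 * sum(d[i][k] for k in range(n) if k != i) / (n - 1)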
However, this adjustment does imply that one gives up on feasibility. It is no longer guaranteed that the consensus is locally maximal in each configuration which denotes a solution. Consequently, the Boltzmann machine is not guaranteed to settle down in a configuration representing a TSP solution. For some TSPs, the machine is even bound to end up in non-tour configurations. As an example, an "eccentric" TSP involving ten cities in Holland and two in Korea (see Figure 1) will most likely result in two Dutch cities being left out of the tour altogether in the final solution.

Fig. 1. The Korean-Dutch connection, with intercountry distance d(H,K): a difficult case for the egalized TSP representation

The consensus change incurred by excluding a Dutch city adjacent to a Korean one from the tour equals the city weight of the Dutch city minus the sum of the distances to its neighbors. The latter sum will approximate d(H,K), because the distances between Dutch cities are negligible in comparison with the distance d(H,K) between the two countries. In the egalized representation of an n-city TSP, the city weight of city i equals

    w_{ij,ij} = (2/(n-1)) Σ_{k≠i} d(i,k).

For Dutch cities, the sum in the above expression will approximately equal 2·d(H,K).⁵ Consequently, the city weights of Dutch cities approximate (4/(n-1))·d(H,K), in this case (4/11)·d(H,K). It appears that the city weight is dominated by the distance weights. In other words, the consensus will increase if the Dutch "border" city is removed from the tour. As a consequence, feeding the Boltzmann machine with an "eccentric" TSP like the Korean-Dutch connection will render an incomplete configuration with a probability that approaches 1 as the cooldown speed approaches zero.

⁵ We assume that the sum of distances to cities within Holland is negligible with respect to twice the distance between the two countries.
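The dominance argument can be verified with a quick back-of-the-envelope computation; the distance value below is an assumption of ours, chosen only to illustrate the magnitudes involved.

    d_HK = 8500.0                           # assumed distance between the countries
    n = 12                                  # ten Dutch cities plus two Korean ones
    sum_dist = 2 * d_HK                     # distances from a Dutch city: ~0 + 2*d(H,K)
    city_weight = 2.0 * sum_dist / (n - 1)  # egalized weight: (4/11)*d(H,K), about 3091
    neighbour_sum = d_HK                    # a "border" city sits next to a Korean city
    print(city_weight - neighbour_sum)      # negative: dropping the city raises consensus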
This is hardly a reason to reject the representation, however. The number of cities left out is likely to be sufficiently low to easily add them to the tour afterwards, either by greedy heuristics or, as we did in our experiments, by switching to a feasible connection matrix (while retaining the configuration) and performing a quench⁶. Our experiments confirm the results mentioned in (Aarts and Korst, 1989a), indicating that the mean quality of solutions found is indeed improved. Cooldown did end up in non-tours much more often than they reported, however⁷. We attribute this discrepancy to the difference in degree of extremity (resemblance to the Korean-Dutch TSP) between our respective test problems. In any case, their "try-again" strategy in case of non-tour final configurations will perform very badly on TSPs like the Korean-Dutch connection, which is why we prefer our own, safe strategy of "postmortem quenching".

In addition, the non-feasibility of the consensus function requires adaptation of the stop criterion to prevent persistent oscillatory behavior in cases where the consensus of some configuration equals that of a neighboring configuration⁸. This can (and, in several of our experiments, indeed did) occur if the feasibility of the consensus function is no longer guaranteed, as in the case of the "egalized" representation. The criterion advocated by Aarts and Korst, "stop if no cell switch has occurred during a certain number of sweeps", should be changed into "stop if no consensus change has occurred during a certain number of sweeps".

5.2 Counteracting the Inefficiency of the Neighbourhood Structure

Though the neighbourhood structure itself cannot be tampered with, its adequacy can be enhanced by encoding the problem as a linear assignment problem, x_{ij} = 1 denoting that city i follows city j (immediately) in the tour. The weight matrix elements w_{ij,ij} now refer directly to the cost of traveling from city i to city j, obviating the need to separate the city weight from the distance weight. We may simply define the bias connections w_{ij,ij} by

    w_{ij,ij} > max_k {d(i,k), d(j,k)} - d(i,j).

The first term in this expression represents the largest distance from either of the two cities i and j to a third city, the second denotes the distance between the two cities. The permutation weights are defined, analogously to those in the quadratic assignment encoding, to compensate for any of the associated bias connection weights in case of "conflicting adjacency⁹":

    w_{ij,kl} < -min{w_{ij,ij}, w_{kl,kl}}.

The efficiency of solution space traversal is greatly enhanced when using the linear assignment representation instead of the quadratic assignment representation.

⁶ A quench is a cooldown starting at a very low temperature.
⁷ viz., in 96% of all cases, instead of the 20% implied by their account.
⁸ implying a transition probability of 0.5 irrespective of temperature.
⁹ that is, in case i = k xor j = l.
While four cell switches are needed in the quadratic assignment representation to exchange two cities in a tour, four cell switches in the linear assignment representation are sufficient to "cut loose" a city from the tour and insert it elsewhere at any desired position. In the quadratic assignment representation, such an operation would require a number of cell switches between four and about half the total number of cities, due to the fact that each of the cities between the old and the new position has to shift one position to the left. Other rearrangement operations, like 2-exchanges, also require fewer cell switches in the linear assignment representation. Therefore, it is not too surprising that the representational shift appeared to be very beneficial to the performance of Hopfield-Tank networks on the TSP (Joppe et al., 1990).

Unfortunately, the improvement is not nearly as impressive for the Boltzmann machine (Aarts and Korst, 1989a). As for the reason why, we can only guess. The small difference between the representation on the Hopfield-Tank network and that on the Boltzmann machine¹⁰ may be responsible for the discrepancy. Alternatively, the cell switch neighbourhood structure, to which we remain condemned, may, in combination with the probabilistic nature of the search, prevent any representational shift from rendering more than marginal improvement.

A linear assignment representation for TSPs on the Boltzmann machine is less attractive for other reasons than lack of impact on performance. Like the quadratic assignment representation, the linear assignment representation does not provide a surjective mapping onto the configuration space. Again, many configurations do not correspond to solutions, in this case because closed subtours may occur. Unlike the non-solutions in the quadratic assignment representation, however, these subtour configurations cannot be suppressed by adjusting the weight matrix. Information on the presence of subtours is essentially non-local.

Aarts and Korst appear to be rather lighthearted with respect to this disadvantage. To cope with subtours, they simply propose the addition of a second network of connections to the Boltzmann machine. The added connections do not contribute to the consensus function like the "normal" connections do. They only serve as communication channels for subtour detection. Once detected, subtours are combined (and thus resolved) by tampering directly with transition probabilities, that is, by circumventing the consensus function. In our view, this amounts to proposing a special-purpose Boltzmann-like machine for TSPs. We feel that such a Boltzmann machine variant is bound to be commercially unattractive, unless it is shown to perform well on a much larger class of problems than the TSP. In any case, the fact that architectural changes are required in shifting from a quadratic to a linear assignment representation implies that performance comparisons between the quadratic and linear assignment representations are not suited to investigating the impact of representation on performance. A sketch of the linear assignment weights and of the (non-local) subtour check is given below.

¹⁰ The Hopfield-Tank representation involves adjacency, whereas the Boltzmann representation involves successorship.
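The sketch below makes the definitions of Section 5.2 concrete and illustrates why subtour information is non-local: detecting a subtour requires following the successor chain through the entire configuration. As before, it is our own illustration (the slack eps and the successor-map convention are assumptions), not the authors' code.

    import itertools

    def linear_weights(d, eps=1.0):
        # Cell (i, j) = "city i follows city j"; the bias rewards short edges.
        n = len(d)
        w = {}
        for i in range(n):
            for j in range(n):
                if i != j:
                    third = max(max(d[i][k], d[j][k])
                                for k in range(n) if k not in (i, j))
                    w[((i, j), (i, j))] = third - d[i][j] + eps
        cells = [(i, j) for i in range(n) for j in range(n) if i != j]
        for a, b in itertools.combinations(cells, 2):
            (i, j), (k, l) = a, b
            if (i == k) != (j == l):          # conflicting adjacency
                w[(a, b)] = -min(w[(a, a)], w[(b, b)]) - eps
        return w

    def cycles(successor):
        # successor[j] = i means that city i follows city j in the configuration;
        # a valid tour is a single cycle through all cities.
        remaining, found = set(successor), []
        while remaining:
            cycle, c = [], next(iter(remaining))
            while c in remaining:
                remaining.discard(c)
                cycle.append(c)
                c = successor[c]
            found.append(cycle)
        return found                          # len(found) > 1 signals subtours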
5.3 Coping with the Size of the Configuration Space

In view of the tour densities in Table 1, it appears that the encoding of a TSP on a Boltzmann machine transforms the original problem of finding the largest needle in a pin-cushion into the problem of finding the largest needle in a haystack. Essentially the same trouble is encountered in other neural networks for combinatorial optimization. As a remedy for the Hopfield-Tank network, one conceived "graded neurons", a translation of the concept of the "Potts-glass model" from thermodynamics. The remedy consists of adjusting the equations determining the state evolution in the Hopfield-Tank network so as to confine the evolution of the (real-valued!) cells x_{ij} to a relatively small subspace defined by (Peterson and Soderberg, 1989; see also the contribution by Postma)

    { (x_{ij}) | (∀i) Σ_j x_{ij} = 1 }.

In the case that cell values are either 0 ("passive") or 1 ("active"), this enforces each row of the configuration matrix to contain exactly one active cell. This adjustment appeared to incur a substantial improvement of solution quality, the resulting quality matching that of optimally tuned simulated annealing algorithms. Unfortunately, it cannot be translated straightforwardly into the formalism of the Boltzmann machine, due to the digital and probabilistic nature of the latter.

However, we can try to approximate the "Potts-glass approach" by a suitable choice of permutation weights in the connection matrix. We observe that the solution space of the TSP on the Boltzmann machine is embedded in the subspace of the configuration space comprising the configurations in which at most one cell is active in each row and column of the configuration matrix:

    { (x_{ij}) | (∀j) Σ_i x_{ij} ≤ 1 ∧ (∀i) Σ_j x_{ij} ≤ 1 }.

It is important to note that, contrary to the solution space itself, this subspace is completely accessible with respect to the "bit switch" neighborhood structure imposed by the Boltzmann machine. In other words, if the Boltzmann machine configuration is confined to the subspace, it is still capable of reaching any solution state. We shall henceforth refer to this subspace as the civilized search space (CSS), and to its complement as the jungle search space (JSS).

The CSS is much smaller than the entire configuration space. Instead of 2^{n²} elements, it contains only

    Σ_{k=0}^{n} C(n,k)² (n-k)!

elements. The formula can be easily derived upon realizing that a construction procedure for an arbitrary configuration with k empty columns is: Starting with the n x n unit matrix,
1. remove k unit vectors (C(n,k) possibilities),
2. permute the remaining n - k unit vectors ((n - k)! possibilities),
3. insert k zero vectors (again C(n,k) possibilities).

For n = 10, this comes down to 9,864,101 configurations instead of 1.27 x 10^30, corresponding with a solution density in the CSS of about 0.36.

It would be somewhat misleading to estimate the relative efficiency of CSS-restricted search by bluntly comparing the CSS solution density with the overall solution density 2 x 10^-24 in Table 1, because the permutation weights will push the Boltzmann machine configuration towards the solution subspace in any case, as the temperature decreases. Monitoring the consensus during cooldown, however, has demonstrated to us that, in the usual¹¹ quadratic assignment encoding, highly negative consensus values dominate during a substantial part of the cooldown, indicating that the machine is wandering around extensively in the JSS¹².

In view of the above, it may prove worthwhile to restrict the evolution of the Boltzmann machine to the CSS. The question, of course, is "how?". Whereas a strict confinement is hard to conceive within the probabilistic context of the Boltzmann machine, an approximate one can be achieved by enlarging the permutation weights. This will speed up convergence from a random initial configuration towards the solution space at high temperatures. Furthermore, it will suppress excursions into the JSS at lower temperatures. Unlike the Potts-glass confinement in the Hopfield-Tank network, however, the adjustment applies to the "test" part of a generate-and-test cycle. It suppresses excursions into the JSS by stiffening the criteria for transition acceptance, rather than by actively proposing more promising transitions. Consequently, the weight adjustment in the Boltzmann machine is bound to be less effective than the shift to a Potts-glass model in the Hopfield-Tank architecture. This was confirmed by our experiments.

6 Summary and Conclusions

We have postulated three major causes for the disappointing performance of the Boltzmann machine on a TSP in comparison with sequential simulated annealing. For each of these, we proposed, analyzed, and tested representational variations that could be expected to lead to some improvement. Whereas the experiments indeed indicate some improvement, the outcome is not overwhelming. Changes to the representation seem to have much less effect on the Boltzmann machine than they do on the Hopfield-Tank network.

¹¹ Usually, the city and permutation weights are chosen as low as possible under the inequality constraints imposed by the feasibility condition.
¹² It can be inferred that consensus is definitely non-negative in the CSS.
Nevertheless, a conclusion that the Hopfield-Tank network is better suited to deal with combinatorial optimization problems than the Boltzmann machine would be premature. Firstly, Boltzmann machine simulations did perform well on combinatorial optimization problems other than the TSP. Secondly, there is, as yet, considerable uncertainty with respect to the speed and construction cost that can be expected of different kinds of neural network hardware in the future. At present, it is expected (Aarts and Korst, 1989b) that Boltzmann machines will be easier to implement in hardware than Hopfield-Tank networks.

We hope to have shown that massive parallelism in neural networks, promising as it may be, is not guaranteed to solve computational problems, that representation does matter in neural networks, and that finding a better representation is difficult. In this respect, we remark that the TSP should be qualified as a relatively easy problem in comparison with, for instance, job-shop scheduling problems. Attempts to solve nontrivial scheduling problems on the Boltzmann machine would stumble upon the same representation problem we encountered with the TSP, only (much) more vehemently (cf. Sadeh, 1991, p. 8).

Appendix: Experimental Results

The TSP that was used in the experiments involved 10 cities in Holland. The associated distance matrix is shown in Table 2.

Table 2. The distance matrix of the 10-city TSP

      0   19   57  186   87  215   38   99  111  184
     19    0   42  182  103  231   54  115  136  209
     57   42    0  134  104  221   61  116  156  229
    186  182  134    0  145  244  170  207  261  334
     87  103  104  145    0  125   53   62  131  204
    215  231  221  244  125    0  181  153  221  294
     38   54   61  170   53  181    0   62   93  166
     99  115  116  207   62  153   62    0   67  140
    111  136  156  261  131  221   93   67    0   74
    184  209  229  334  204  294  166  140   74    0

All of the experimental results pertain to a Boltzmann machine featuring unlimited synchronous parallelism. This was simulated by allowing the cells of the machine to attempt a state transition in sweeps. In each sweep, each individual cell is selected with probability 2/3 to attempt a transition. The transition probabilities of selected cells are computed on the basis of the configuration prior to the sweep. This simulation practice was derived from (Aarts and Korst, 1989b).
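The sweep regime is straightforward to mimic in software. The following sketch is our reading of that practice; it uses the standard acceptance probability 1/(1 + e^{-ΔC/T}) of the Boltzmann machine, written in an overflow-safe form, and lets every selected cell compute its consensus change from the configuration as it stood before the sweep.

    import math
    import random

    def flip_effect(x, w, cell):
        # Consensus change if `cell` were flipped, all other cells held fixed.
        s = 0.0
        for (a, b), wt in w.items():
            if a == b == cell:
                s += wt
            elif a == cell:
                s += wt * x[b]
            elif b == cell:
                s += wt * x[a]
        return (1 - 2 * x[cell]) * s

    def sweep(x, w, T, p_select=2/3):
        # Unlimited synchronous parallelism: the transition probabilities of
        # all selected cells are based on the configuration prior to the sweep.
        frozen = dict(x)
        for cell in x:
            if random.random() < p_select:
                delta = flip_effect(frozen, w, cell)
                accept = 0.5 * (1.0 + math.tanh(delta / (2.0 * T)))  # = 1/(1+e^(-delta/T))
                if random.random() < accept:
                    x[cell] = 1 - x[cell]
        return x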
The probability of 2/3 was chosen in view of the fact that the machine tends to get stuck in "blinking" (alternation between "all cells on" and "all cells off" states) for (substantially) higher values (Aarts, 1992). The corollary of this value for the information backlog is that at high temperatures¹³ two thirds of the information on cell values is presumed "potentially outdated".

The results of the experiments which we performed with the three representations for TSPs on the Boltzmann machine are summarized in Figure 2.

Fig. 2. Solution quality of a 10-city TSP for various representations (mean tour length, x 1000, against the mean number of sweeps on a logarithmic scale; curves for the normal, egalized, and Potts representations)

Each (marked) data point in Figure 2 represents the outcome of a "sample" of 50 computations on the Boltzmann machine. The mean number of sweeps in the sample is indicated (logarithmically) along the horizontal axis, whereas the mean tour length of the final solutions is indicated along the vertical axis. The range of the vertical axis corresponds with the actual range of tour lengths in the entire "population" of tours in the TSP that was used in the experiment, that is, the best (shortest) tour has length 963 and the worst one has length 1734. The data points of the three curves do not have common "mean-sweeps" values, since it is not possible to preset the number of sweeps in advance: the Boltzmann machine must be allowed to settle down in a final state in its own time.

¹³ At low temperatures, the expected number of sweeps between consecutive switches of an individual cell is much higher than one, diminishing the probability that other cells switch erroneously.
In our experiments, we used "no consensus change during 100 sweeps" as a criterion for quiescence.

As for the conclusions we can draw from the three curves in Figure 2, it seems warranted to state that the normal (quadratic assignment) representation performs worst, the Potts-glass representation somewhat better, and the egalized representation best. However, in view of the fact that all three curves lie well above the best tour level (963), it seems questionable whether the improvement induced by replacing the normal with the egalized representation is worth the trouble. After all, the problem is so small (ten cities) that we would expect the result to be near-optimal. This is hardly true, at least not with respect to the mean value of the solution lengths. In most cases, the best solution in a sample of 50 computations is near-optimal.

The resulting impression that performance is poor with each of the three representations is strengthened if we compare the frequency distribution of tour length in a sample of Boltzmann solutions to that in the entire population of tours.

Fig. 3. Distribution of tour length in the TSP and a Boltzmann solution sample thereof (population frequencies, x 1000, on the left vertical axis; sample frequencies on the right vertical axis; tour length, x 1000, along the horizontal axis)

Figure 3 shows the distribution of tour length in the sample associated with
data point (5928, 1226) in Figure 2 (the fifth data point of the normal representation) in comparison with the distribution of tour length in the total population of tours. Frequency data of the Boltzmann solution sample are marked with small, solid dots. The corresponding frequency values are indicated at the rightmost vertical axis. Frequency data of the total population are marked with open squares, with associated frequency values at the leftmost vertical axis. Though the figure shows the number of near-optimal tours to be low, there appear to be thousands of tours with a length below 1200, whereas more than half of the solutions resulting from Boltzmann machine simulation are longer tours. In other words, the performance of the Boltzmann machine on this 10-city TSP is rather disappointing.

References

E. Aarts and J. Korst (1989a) Boltzmann Machines for Traveling Salesman Problems. European Journal of Operational Research 39, 79-95.
E. Aarts and J. Korst (1989b) Simulated Annealing and Boltzmann Machines. John Wiley and Sons.
E. Aarts (1992) Personal communication.
B. Andresen (1991) Parallel Implementation of Simulated Annealing Using an Optimal Adaptive Annealing Schedule. Proceedings of the 1991 European Simulation Multiconference, 296-300.
K.M. Gutzmann (1987) Combinatorial Optimization Using a Continuous State Boltzmann Machine. Proceedings of the IEEE First International Conference on Neural Networks, Vol. III, 721-734.
A. Joppe, H.R.A. Cardon and J.C. Bioch (1990) A Neural Network for Solving the Traveling Salesman Problem on the Basis of City Adjacency in the Tour. In: Proceedings of the International Neural Network Conference, Paris, Vol. 1, 254-257.
C. Peterson and B. Soderberg (1989) A New Method for Mapping Optimization Problems onto Neural Networks. International Journal of Neural Systems, Vol. 1, No. 1, 3-22.
N. Sadeh (1991) Look-Ahead Techniques for Micro-Opportunistic Job Shop Scheduling. Report CMU-CS-91-02, Carnegie-Mellon University.
Optimisation Networks

E.O. Postma

Department of Computer Science, University of Limburg, Maastricht

1 Introduction

A particular class of optimisation problems has a solution time that grows exponentially with the problem size. For large problems within this class, exhaustive search for the optimal solution becomes infeasible. Therefore, knowledge concerning the structure of these problems is often exploited to perform an intelligent search for a solution. Nevertheless, computation time may still grow out of bounds for sufficiently large problems. For these cases, a method yielding good solutions (not necessarily the best) within a limited time may provide an attractive alternative. In this paper a neural-network approach initiated by Hopfield and Tank (1985) is discussed that does just this.

Hopfield and Tank's (1985) network is a fixed-weights version of the adaptive-weights network proposed earlier by Hopfield (1982, 1984). Although Hopfield was not the first to study such a network (see Cohen and Grossberg, 1983, and Grossberg, 1982; 1988, for an overview), his papers initiated a widespread interest in fully-connected networks, especially in the physics community. Hopfield pointed out that fully-connected networks can be analysed with mean-field theory, a standard statistical-mechanics technique. With the application of mean-field theory many properties of fully-connected networks have been established (see, e.g., Amit, 1989; Hertz, Krogh, and Palmer, 1991).

In Section 2 the structure and dynamics of the Hopfield-Tank network are outlined. Section 3 discusses the main shortcomings of the network and provides an overview of some recent approaches that try to deal with these shortcomings, mostly by applying ideas from statistical mechanics. Finally, Section 4 concludes with some remarks.

2 The Hopfield-Tank Network: Structure and Dynamics

In this section, we start with a description of the architecture of the Hopfield-Tank network. Then, the relation between the network dynamics and a global energy function is pointed out. Subsequently, an example of how a specific optimisation problem can be mapped onto a fully-connected network is described. Finally, the performance and nature of computation in Hopfield-Tank networks is discussed.

2.1 Basic Structure

The Hopfield-Tank (HT) network consists of a certain number of elements or neurons. This network is fully connected: each neuron is connected by links with
all other neurons in the network. These links subserve the signal transmission in the network. Associated with each link is a connection strength called the weight. Weights can be set to zero to limit the connectivity of the network.

2.2 Activation Dynamics

At each moment in time, the state of a neuron is characterised by an activation value. The following differential equation describes the activation dynamics of a single neuron i:

    C (d/dt) v_i(t) = -v_i(t) + Σ_{j=1}^{N} w_{ji} f(v_j(t)) + I_i     (1)

with f representing a sigmoid function (see below), v_i the activation of the i-th neuron, w_{ji} the weight of the link from neuron j to neuron i, N the number of neurons in the network, and I_i an externally applied input to neuron i. C is a positive constant. As is evident from this equation, at each point in time the activation value of a neuron is determined by its autonomous decay term -v_i(t), the weighted sum of the output signals of all neurons Σ_j w_{ji} f(v_j(t)), and an externally applied input I_i. Take note of the constant C mediating the velocity by which the input signals affect the activation value. For small values of C this effect is (almost) instantaneous, whereas for larger values the "charging time" of the neuron increases.

2.3 The Sigmoid Function

A crucial feature of the HT network is the function f in equation (1). This function describes the neuron's mapping of input signals onto its output. It appears (see Section 2.7) that a good choice for f is a sigmoid function

    f(x) = 1 / (1 + e^{-x/λ}).     (2)

Here, λ ∈ (0, ∞) is a parameter controlling the "steepness" (or gain, cf. Hopfield and Tank, 1985) of the sigmoid function. For small values of λ, the function f approximates a step function (high gain: abrupt transition from 0 to 1), whereas for large values of λ it approaches a linear function (low gain: smooth transition from 0 to 1). Figure 1 shows the decreasing steepness of the 0-to-1 transition accompanying an increasing value of the parameter λ in (2). For small values (λ → 0) of this parameter, (2) approaches a step function (left). For large values (λ → ∞) it approaches a linear function (right).
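In code, the effect of the gain parameter is immediate; the following minimal sketch evaluates (2) for one steep and one shallow setting of λ (the values are ours, chosen for illustration).

    import math

    def f(x, lam):
        # Sigmoid input-output function (2); lam controls the gain.
        return 1.0 / (1.0 + math.exp(-x / lam))

    for lam in (0.05, 5.0):  # high gain versus low gain
        print([round(f(x, lam), 3) for x in (-1.0, -0.1, 0.0, 0.1, 1.0)])
    # lam = 0.05 is nearly a step at 0; lam = 5.0 is almost linear on this range.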
Fig. 1. The decreasing steepness of the 0-to-1 transition.

2.4 Energy Function

The states V_i = f(v_i) of the neurons in a fully-connected network can be expressed in a state vector V = (V_1, V_2, ..., V_N). An energy (or cost) function can be defined that specifies a scalar energy value E(V) for each possible network state V (Amari, 1977). The energy function can be thought of as spanning a curved surface (in N dimensions). The shape of this surface is determined by the energy values specifying the height at any point in the landscape. Valleys and hills represent minima and maxima of the energy function. Like physical systems, properly-defined network dynamics follow the negative energy gradient to end up in a (local) minimum of the energy function. In the landscape metaphor these network dynamics can be thought of as a ball that is subject to gravitational force and rolls downhill into a valley.

Hopfield (1984) showed that in the high-gain limit, i.e., λ → 0, the stable states of a network with dynamics (1) are the minima of

    E(V) = -(1/2) Σ_i Σ_j w_{ij} V_i V_j - Σ_i I_i V_i     (3)

provided that w_{ij} = w_{ji} for all i, j and w_{ii} = 0 for all i. (It should be remarked that in the high-gain limit, the V_i can be interpreted as binary variables, i.e., V_i = 0 or V_i = 1. For other gain values, (1) leads to stable states that are minima of an energy function with an additional term.) To solve an optimisation problem with a fully-connected network of neurons with dynamics (1), an appropriate energy function has to be defined and cast into the form (3).

2.5 The Task Assignment Problem

The mapping of an optimisation problem onto a fully-connected network is best explained by considering a specific example. We take the problem of assigning
six assistants to the task of shelving six collections of books as an example (cf. Tank and Hopfield, 1987). Each assistant differs in the rate at which he or she can shelve books of a particular collection. For instance, as can be seen in Figure 2a (showing the shelving rates per minute for each assistant on a particular collection), Sarah is superior in shelving geology books (shelving rate 10 per minute) while she performs badly on art books (shelving rate 1 per minute). The problem is to find the highest shelving rate by assigning each assistant to a single collection. The total number of solutions to this problem is 720 (6 factorial).

Hopfield and Tank proposed using a network in which the neurons are organized in a square. In this square the rows represent the assistants and the columns represent the collections.

                 Sarah  Jessica  George  Karen  Sam  Tim
    Geology       10      6        1       5     3    7
    Physics        5      4        8       3     2    6
    Chemistry      4      9        3       7     5    4
    History        6      7        6       2     6    1
    Poetry         5      3        4       1     8    3
    Art            1      2        6       4     7    2

Fig. 2. The task assignment problem. (a) Shelving rates per minute for each assistant-task combination. (b) Two solutions found by the network (rate = 40 and rate = 44, respectively).

An active neuron in this square network indicates the assignment of an assistant to a particular collection. The basic idea is that the constraints of the problem are encoded in the inputs and the weights of the links connecting the neurons. We now proceed by defining an appropriate energy function for the task assignment problem. Subsequently, by defining the inputs and weights in a particular way, we show that this energy function can be written in the form (3). Consequently, a network with these inputs and weights and
neuron dynamics (1) evolves towards minima of the task assignment energy function.

We define n_ab as the neuron that represents the assignment of the a-th assistant to the b-th book collection (a, b ∈ {1, 2, ..., K}). The shelving rate of the a-th assistant on the b-th collection is denoted as r_ab. Solutions of the task assignment problem obey the following hard constraints:

- a single assistant should be assigned to each book collection, and
- a single book collection should be assigned to each assistant.

The hard constraints ensure that a valid solution is obtained. To get good solutions the additional soft constraint

- assistants should be assigned to tasks they perform well

has to be imposed. The following energy function incorporates the hard and soft constraints:

    E_TAP(V) = (1/2) Σ_a Σ_b Σ_{c≠a} V_ab V_cb + (1/2) Σ_a Σ_b Σ_{d≠b} V_ab V_ad
               + (1/2) (Σ_a Σ_b V_ab - K)² - p Σ_a Σ_b r_ab V_ab.     (4)

The first two right-hand terms represent the hard constraints. The first right-hand term of (4) becomes positive if and only if more than one assistant is assigned to some book collection. The second right-hand term becomes positive if and only if more than one book collection is assigned to some assistant. Because the first two terms do not ensure that there is any assignment at all, the conservation term (1/2)(Σ_a Σ_b V_ab - K)² is added, which becomes positive if the number of assignments is not exactly K. The last term represents the soft constraint (p > 0 is a parameter that balances the soft constraint against the hard constraints). Combined, these terms ensure that minima of E_TAP(V) correspond to states V that represent (good) solutions to the task assignment problem.

As a first step to bring (4) into the form (3) we rewrite it as

    E_TAP(V) = (1/2) Σ_a Σ_b Σ_{c≠a} V_ab V_cb + (1/2) Σ_a Σ_b Σ_{d≠b} V_ab V_ad
               + (1/2) (Σ_a Σ_b V_ab)² + (1/2) K² - K Σ_a Σ_b V_ab - p Σ_a Σ_b r_ab V_ab.     (5)

The basic idea is to incorporate all quadratic terms in (5) in the weights and all linear terms in the external inputs. To do this we define w_{ab,cd}, the weight of the link connecting neuron n_ab with neuron n_cd (n_ab ≠ n_cd), as

    w_{ab,cd} = -δ_bd (1 - δ_ac) - δ_ac (1 - δ_bd) - 1,
where δ_xy = 1 if x = y and δ_xy = 0 otherwise. The three right-hand terms defining each weight account for the three respective quadratic right-hand terms in (5). In the network, the three terms represent inhibition between neurons within each column (i.e., collections), inhibition between neurons within each row (i.e., assistants), and a global inhibition for each pair of neurons, respectively. The external inputs I_ab incorporate the last two (linear) terms of (5):

    I_ab = K + p r_ab.

In the network, the right-hand terms represent, for each neuron, an excitatory bias and a "data" term (i.e., the shelving rates), respectively. With these definitions, (5) can be written as

    E_TAP(V) = -(1/2) Σ_a Σ_b Σ_c Σ_d w_{ab,cd} V_ab V_cd - Σ_a Σ_b I_ab V_ab + (1/2) K².     (6)

Except for the constant term and a change in indices, this energy function has the same form as (3). Remark that minimizing (6) corresponds to minimizing (3), since the constant term does not change the energy gradients. Therefore the activation dynamics forcing the network to minima of E_TAP(V), i.e., to valid solutions of the task assignment problem, is given by

    C (d/dt) v_ab(t) = -v_ab(t) + Σ_c Σ_d w_{ab,cd} f(v_cd(t)) + I_ab,     (7)

which is (1) with new (double) indices. The above analysis indicates that a HT network with the activation dynamics of (7) moves towards a state that represents a solution to the task assignment problem. Figure 2b shows two solutions that were found by such a network. Filled squares represent activated neurons. The left panel shows a good solution with a total shelving rate of 40 (obtained by adding the individual rates in (a) that correspond to the filled squares). The right panel shows the optimal solution that was also found by the network: total shelving rate of 44 (cf. Tank and Hopfield, 1987).
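As an illustration of how these definitions fit together, the sketch below assembles the weights and inputs from (5) and relaxes the network with a simple Euler discretisation of (7). It is a toy version of our own making: the step size, gain, number of iterations, and the random symmetry-breaking initialisation are assumptions, not part of the original formulation, and solution quality depends on these choices.

    import itertools
    import math
    import random

    def tap_weights_inputs(r, p=1.0):
        # r[a][b]: shelving rate of assistant a on collection b (K x K).
        K = len(r)
        dlt = lambda x, y: 1.0 if x == y else 0.0
        w = {(a, b, c, e): -dlt(b, e) * (1 - dlt(a, c))
                           - dlt(a, c) * (1 - dlt(b, e)) - 1.0
             for a, b, c, e in itertools.product(range(K), repeat=4)
             if (a, b) != (c, e)}
        inp = {(a, b): K + p * r[a][b] for a in range(K) for b in range(K)}
        return w, inp

    def relax(w, inp, lam=0.1, C=1.0, dt=0.01, steps=3000):
        # Overflow-safe form of the sigmoid (2).
        f = lambda u: 0.5 * (1.0 + math.tanh(u / (2.0 * lam)))
        v = {ab: random.uniform(-0.1, 0.1) for ab in inp}  # break symmetry
        for _ in range(steps):
            out = {ab: f(v[ab]) for ab in v}
            for a, b in v:
                net = sum(w[(c, e, a, b)] * out[(c, e)]
                          for (c, e) in v if (c, e) != (a, b))
                v[(a, b)] += (dt / C) * (-v[(a, b)] + net + inp[(a, b)])
        return {ab: round(f(v[ab])) for ab in v}  # near-binary final state

With r set to the shelving rates of Figure 2a, the active cells in a returned state can be read off in the same way as the filled squares of Figure 2b.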
2.6 Performance

The appealing feature of the HT network is its speed. After initializing the network (by applying the input activations to the neurons) it settles very quickly into a stable state. But what about the quality of the solutions found by the network? Hopfield and Tank (1985) mapped the Travelling Salesman Problem (TSP, i.e., the problem of finding the shortest tour which visits each of a given number of cities once and ends at the starting point) onto their network. They employed a representation similar to the one used for the task assignment problem. Instead of assigning each assistant to a task, in their TSP network each city was assigned to a position in the tour. In contrast to the task assignment problem, the distances between cities (the constraints) were encoded in the weights and not in the initial activations.

Twenty simulation runs on a 10 x 10 HT network were performed by Hopfield and Tank. In 16 runs, tours within about the 500 best (shortest) ones were found. One of the two best tours (the shortest and next-to-shortest tour) was found in about half of the simulations. Given the total number of 181,440 valid tours for the 10-city TSP, the network seems to have a strong tendency to select one of the best ones. For 4 simulation runs, the HT network did not converge to a valid solution (see below).

2.7 Analogue Computation of Discrete Problems

Below we discuss three features of the HT network that elucidate how it performs its computations: the continuous activation values, the sigmoid nonlinearities, and the interpretation of network states.

Continuous activations
The state space of the HT network is formed by an N-dimensional hypercube. For small values of λ in (2), individual neurons behave approximately like binary elements. As a result, possible network states are on the 2^N corners of the hypercube. In the task assignment problem, valid solutions are binary vectors. Why, then, not use truly binary neurons? The main point is that the actual computation occurs within the continuous 0 to 1 range. Neurons connect to other neurons in effect to communicate their probability of being part of the final solution. This probability is expressed in their activation value (cf. Hopfield and Tank, 1985). Restricting activation values to binary values seriously hampers the possibility of neuron-to-neuron communication. In fact, Hopfield and Tank (1985) found that a network that operated only on the corners of the hypercube (binary activation values) performed little better than random. In contrast, when network dynamics are stochastic instead of deterministic, binary neurons are able to communicate intermediate values by their time-averaged state. In fact, the HT network may be cast in such a stochastic form by using Glauber dynamics (see, e.g., Hertz, Krogh, and Palmer, 1991).

Sigmoid nonlinearity
In addition to graded output values, the HT network employs a sigmoid (nonlinear) output function. The reason for using such a function can be intuitively understood as follows. On the one hand, each neuron has to decide whether it will be "on" or "off" in the final solution. This requirement suggests a step function. On the other hand, neurons need to communicate with each other about their activation value within the 0 to 1 range. This requirement suggests a continuous increasing function. Both requirements can be dealt with by using a sigmoid function for the input-output mapping of neurons. (For a formal treatment of sigmoid functions in WTA networks, the interested reader is referred to Grossberg (1973; 1982).)

Interpretation of network states
Although stable states of the HT network can be readily identified with solutions
of the task assignment problem (or the TSP), the interpretation of the relaxation towards these states is not that apparent. As suggested in our discussion of continuous activations, intermediate activation values might be loosely interpreted as representing probabilities. For instance, an activation value of 0.5 for a neuron coding the assignment of assistant X to task Y corresponds to a fifty percent probability that this assignment is part of the final solution. The winner-take-all interactions within the network tend to normalize the total activation within rows and columns to 1.0, as is required for the probability interpretation to hold.

3 Limitations and Alternative Approaches

There are two major obstacles to serious application of the HT network. One concerns the validity of the solutions obtained. The other occurs when applying the network to problems of a large size.

The HT network has no means of avoiding local minima (i.e., apparently good solutions) that are not global minima (i.e., optimal solutions). In the landscape analogy, a local minimum represents a hanging valley while a global minimum is the lowest point in the landscape. The network dynamics in principle always lead to a minimum. However, there is no guarantee that the minimum found is also a global minimum. This entails that the HT network does not yield the optimal solution in all cases. More importantly, due to practical constraints on implementation, it might end up in a situation that does not represent a solution at all. The finding that four of the twenty simulations performed by Hopfield and Tank (1985) produced invalid solutions is a case in point.

While the HT network can yield solutions very quickly for problems of a moderate size (e.g., TSPs up to 30 cities), its applicability to larger problems has been questioned (Wilson and Pawley, 1988). With larger problems the HT network often does not converge to a valid solution of the optimisation problem. This failure to scale up well is due to the practical problems associated with the number of comparisons each neuron must participate in. This represents a serious problem, because it is in conflict with the main advantage of the HT network (i.e., processing speed). When the network performs badly on larger problems this advantage becomes valueless.

3.1 Current Research

Given the serious limitations of the HT approach, many researchers have attempted to modify and improve upon the original approach. In the following, several current lines of research are discussed.

Simulated annealing
In order to deal with the problem of local minima, a technique called simulated annealing (Kirkpatrick, Gelatt and Vecchi, 1983) can be employed. This optimisation technique entails the use of a global "temperature" that is gradually lowered (see the contribution by Crama et al.). In Hopfield and Tank (1985) the
slope of the sigmoid function f is argued to represent the deterministic analogue of the temperature. Hopfield and Tank (1985) reported that their interpretation of simulated annealing, i.e., slowly increasing the steepness of the sigmoid during relaxation, yields better results. Representing the temperature by a noise level is more appropriate in stochastic networks. Akiyama, Yamashita, Kajiura and Aiso (1989) proposed the Gaussian machine. This network combines the graded output responses of the neurons in the HT network with the stochastic properties of the Boltzmann machine (see the contribution by Spieksma). By introducing random (Gaussian) noise superimposed on the input of individual neurons, in combination with an annealing procedure, the Gaussian machine performs better than either the Boltzmann machine or the HT network. Several similar approaches have been proposed, e.g., the Cauchy machine (Takefuji and Szu, 1989) and annealing networks (Van den Bout and Miller III, 1989).

Alternative mapping
The way in which an optimisation problem is mapped onto a network may affect its efficiency to a large extent. An interesting example is the mapping proposed by Joppe, Cardon and Bioch (1990). These researchers employed an alternative mapping of the TSP onto the HT network. Instead of representing a city-position combination, each neuron in their network denotes the adjacency of two cities. In each row of their network the number of active neurons equals two when a solution is found (each city in a closed tour has two neighbouring cities). This alternative mapping is based on the observation that insertion of a city between two adjacent cities in a partially established tour might result in a shorter tour length. In terms of the original mapping, insertion of a city requires a rather large increase in energy because all cities in the partially established tour have to be shifted one position. When mapping in terms of adjacency, only the three involved cities have to change positions. To cope with the occurrence of solutions involving closed subtours, Joppe et al. (1990) added a second layer to the network that detects such invalid solutions. This approach has several advantages when compared to the original approach of Hopfield and Tank. Firstly, the energy function is simplified. Secondly, convergence towards a solution turns out to be faster. Thirdly, larger problems can be solved. Finally, in contrast to the original TSP implementation, the distances do not have to be represented in the weights but can be externally applied as inputs.

Potts networks
Probably the most promising line of research in applying neural networks to optimisation problems is based on "Potts" networks. The basic notion underlying these (and related) networks is that in many optimisation problems the hard constraints can be translated into a conservation of the total activity of a subset of neurons in the network to one. For example, in the TAP energy function (4), the first three right-hand terms conserve the total activity in the network. If the network dynamics are defined in such a way that the total activation within each row (or column) is always equal to one, one of the first two and the last of these
three terms are always zero. Consequently, the complexity of the TAP energy function is reduced to two terms, which may lead to improved solution quality and speed.

In Potts networks an entire row (or column) is represented by a single Potts neuron. (The Potts neuron is based on the Potts spin from the statistical mechanics of Ising models.) The state of such a neuron is defined as follows:

    V_ab = e^{U_ab} / Σ_c e^{U_ac}     (8)

where U_ab is a local field defined as

    U_ab = -∂E(V)/∂V_ab.     (9)

The hard (conservation) constraint implicit in the definition of the Potts neuron reduces the N-dimensional state space to an (N-1)-dimensional hyperplane (Peterson and Soderberg, 1989). A benchmark study (Peterson, 1990) compared the performance of the Potts network with several other approaches on a 50-city TSP. For this 50-city TSP, the average tour length was 6.61 for the Potts network, while a length of 6.80 was obtained using simulated annealing (Kirkpatrick, Gelatt and Vecchi, 1983). For a 200-city TSP, average tour lengths of 12.66 and 12.79 were obtained for the network and simulated annealing approaches, respectively. Evidently, the network seems to perform rather well on larger problems.

The Potts neuron equation (8) applies only to conservation of the total activation to one. However, some optimisation problems require the total activation to be conserved to a value K ≠ 1. For these cases, a special type of dynamics, called activity-conserving dynamics, may be employed (see, e.g., Postma, van den Herik, and Hudson, 1993).

An approach closely related to the one described above (Simic, 1990) is the elastic net (Durbin and Willshaw, 1987). Application of the elastic net to TSP problems constitutes a mapping of a circle to points in the plane. Initially, the projection of the circle on the plane is positioned near the centroid of the cities. In the course of processing, the projection is expanded (hence the name) to go through all cities. In the end all cities are contacted and the projection constitutes a solution to the TSP. In the benchmark studies cited above, the elastic net performs slightly better than the (non-optimized) Potts network (Peterson, 1990).

4 Conclusions

As may be evident from the previous section, optimisation networks provide an attractive method for solving combinatorial optimisation problems. Although the original HT network has a limited applicability, new formulations may extend this applicability considerably. Combination with traditional approaches
(e.g., Burke, 1994) may also improve solution quality and speed. The combination of the Potts networks with simulated annealing seems to provide the best of both worlds: improved speed over the HT network and the performance (ability to escape from local minima) of simulated annealing techniques (e.g., as in Boltzmann machines). The Potts network and elastic net approaches show good performance up to large-sized problems. In addition, they can be treated formally in terms of statistical mechanics (Peterson and Soderberg, 1989; Simic, 1990). Further improvements of these optimisation networks are to be expected in the near future. Additionally, the availability of parallel hardware may facilitate real-time application of optimisation networks (e.g., Wang, 1994).

References

Y. Akiyama, A. Yamashita, M. Kajiura and H. Aiso (1989) Combinatorial optimization with Gaussian machines. In: Proceedings of the International Joint Conference on Neural Networks, Vol. I, San Diego, CA, 533-540.
S-I. Amari (1977) A neural theory of association and concept formation. Biological Cybernetics 26, 175-185.
D.J. Amit (1989) Modeling brain function: The world of attractor neural networks. Cambridge University Press, Cambridge.
L.I. Burke (1994) Neural methods for the traveling salesman problem: insights from operations research. Neural Networks 7, 681-690.
M.A. Cohen and S. Grossberg (1983) Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics 13, 815-826.
R. Durbin and D. Willshaw (1987) An analogue approach to the travelling salesman problem using an elastic net method. Nature 326, 689-691.
S. Grossberg (1973) Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics LII, 213-257.
S. Grossberg (Ed.) (1982) Studies of mind and brain: neural principles of learning, perception, development, cognition, and motor control. Reidel Press, Boston.
S. Grossberg (1988) Nonlinear neural networks: principles, mechanisms, and architectures. Neural Networks 1, 17-61.
J. Hertz, A. Krogh and R.G. Palmer (1991) Introduction to the theory of neural computation. Addison-Wesley Publishing Company, Redwood City, CA.
J.J. Hopfield (1982) Neural networks and physical systems with emergent collective computational properties. Proceedings of the National Academy of Sciences U.S.A. 79, 2554-2558.
J.J. Hopfield (1984) Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences U.S.A. 81, 3088-3092.
J.J. Hopfield and D.W. Tank (1985) "Neural" computation of decisions in optimization problems. Biological Cybernetics 52, 141-152.
A. Joppe, H.R.A. Cardon and J.C. Bioch (1990) A neural network for solving the travelling salesman problem on the basis of city adjacency in the tour. In: Proceedings of the International Neural Network Conference, Paris, Vol. 1, 254-257.
S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi (1983) Optimization by simulated annealing. Science 220, 671-680.
C. Peterson (1990) Parallel distributed approaches to combinatorial optimization: benchmark studies on the Traveling Salesman Problem. Neural Computation 2, 261-269.
C. Peterson and B. Soderberg (1989) A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems 1, 3-22.
E.O. Postma, H.J. van den Herik and P.T.W. Hudson (1993) Activity-conserving dynamics for neural networks. In: S. Gielen and B. Kappen (Eds.), Proceedings of the International Conference on Artificial Neural Networks, ICANN'93. Springer-Verlag, London, 539-544.
P.D. Simic (1990) Statistical mechanics as the underlying theory of 'elastic' and 'neural' optimisations. Network 1, 89-103.
Y. Takefuji and H. Szu (1989) Design of parallel distributed Cauchy machines. In: Proceedings of the International Joint Conference on Neural Networks, Vol. I, San Diego, CA, 529-532.
D.W. Tank and J.J. Hopfield (1987) Collective computation in neuronlike circuits. Scientific American 257 (6), 62-70.
D.E. Van den Bout and T.K. Miller III (1989) Graph partitioning using annealing neural networks. In: Proceedings of the International Joint Conference on Neural Networks, Vol. I, San Diego, CA, 521-528.
J. Wang (1994) A deterministic annealing neural network for convex programming. Neural Networks 7, 629-641.
G.V. Wilson and G.S. Pawley (1988) On the stability of the traveling salesman problem algorithm of Hopfield and Tank. Biological Cybernetics 58, 63-70.
Local Search in Combinatorial Optimization

Y. Crama¹, A.W.J. Kolen², E.J. Pesch³

¹ Department of Economics and Business Administration, University of Liège
² Department of Quantitative Economics, University of Limburg, Maastricht
³ Department of Economics and Business Administration, University of Bonn

1 Introduction

Consider the minimization problem min{ f(x) | x ∈ S }, where f is the objective function and S is the set of feasible solutions of the problem. One of the most intuitive solution approaches to this optimization problem is to start with a known feasible solution and slightly perturb it while decreasing the value of the objective function. In order to operationalize the concept of slight perturbation, let us associate with every x ∈ S a subset N(x) of S, called the neighbourhood of x. The solutions in N(x), or neighbours of x, are viewed as perturbations of x. Now the idea of a local search algorithm is to start with some initial solution and move from neighbour to neighbour as long as possible while decreasing the objective value.

This local search approach can be seen as the basic principle underlying many classical optimization methods, like the gradient method for continuous nonlinear optimization or the simplex method for linear programming. More importantly, maybe, in connection with the main topic of this book, it also best explains the dynamics of many classes of neural networks, like, e.g., the sequential iterations of Hopfield nets. In this framework, the objective function corresponds to the energy (Lyapunov) function of the network, the feasible solutions are the different configurations, and two configurations are neighbours if they differ in the state of exactly one neuron (that is, the neuron is excited in one of the configurations and inhibited in the other).

Some of the important issues that have to be dealt with when implementing a local search procedure are how to pick the initial solution, how to define neighbourhoods, and how to select a neighbour of a given solution. In many cases of interest, finding an initial solution creates no difficulty. But obviously, the choice of this starting solution may greatly influence the quality of the final outcome. Therefore local search algorithms are usually run several times on the same problem instance, using different (e.g., randomly generated) initial solutions.

Whether or not the procedure will be able to significantly improve a poor solution often depends on the size of the neighbourhoods. Small neighbourhoods (in the limit, empty ones) are easy to search, but offer little room for improvement. Large neighbourhoods (in the limit, encompassing all solutions) raise the odds of reaching an optimal solution, but may be very tedious to explore. The choice of neighbourhoods for a given problem is conditioned by this trade-off between quality of the solution and complexity of the algorithm, and is generally to be resolved by experimentation. Another crucial issue in the design of a local search algorithm is the selection of a neighbour which improves the
Which neighbour should be picked? The best one (greedy strategy)? Or the first one found in the search of the neighbourhood which improves upon the current solution? Or still some other candidate? This question can rarely be answered through theoretical considerations. In particular, the effect of the selection criterion on the quality of the final solution, or on the number of iterations of the procedure, is often hard to predict (although, in some cases, the number of neighbours can rule out an exhaustive search of the neighbourhood, and hence, the selection of the best neighbour). Here again experimentation with various strategies is required in order to make a decision (see the vast literature on the selection of entering variables in the simplex method). The attractiveness of local search procedures stems from their wide applicability and (usually) low empirical complexity (see Johnson et al. (1988) and Yannakakis (1990) for more information on the theoretical complexity of local search). Indeed, local search can be used for highly intricate problems, for which analytical models would involve astronomical numbers of variables and constraints, or about which little theoretical knowledge is available. All that is needed here is a reasonable definition of neighbourhoods, and an efficient way of searching them. When these conditions are satisfied, local search can be implemented to quickly produce good solutions for large scale instances of the problem. Running the procedure many times, with various initial solutions, adds to its quality and flexibility. These features of local search explain that the approach has been applied to a wide diversity of situations (see Pesch and Voß (1995) for applications to real world problems). This will be illustrated, in the next section, on combinatorial optimization problems arising in the area of scheduling. Nevertheless, local search also has its drawbacks. Most notably, the procedure stops as soon as it encounters a local optimum, i.e., a solution x such that f(x) ≤ f(y) for all y in N(x). In general, such a local optimum is not a global optimum. Even worse, there is usually no guarantee that the value of the objective function at an arbitrary local optimum comes close to the optimal value. This inherent shortcoming of local search can be palliated in some cases by the use of multiple starts. But, because NP-hard problems often possess many local optima, even this remedy may not be potent enough to yield satisfactory solutions. In view of this difficulty, several extensions of local search have recently been proposed, which offer the possibility to escape local optima by accepting occasional degradations of the objective function. This is the case for certain types of neural networks, e.g. Boltzmann machines with probabilistic update rules. In Sections 3 and 4, we discuss two other successful approaches based on related ideas, namely simulated annealing and tabu search. Another interesting extension of local search works with a population of feasible solutions (instead of a single one) and tries to detect properties which distinguish good from bad solutions. These properties are then used to construct a new population which hopefully contains a better solution than the previous one. This technique, known under the name of genetic algorithm, will be discussed in Section 5. But before this, we will first illustrate the concepts introduced above for a few well-known combinatorial optimization problems.
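To fix ideas, the generic procedure can be sketched in a few lines of Python. Everything in the sketch is an illustrative assumption rather than part of the exposition above: the problem is represented by an objective function f, a routine neighbours(x) enumerating N(x), and a starting solution x, and the greedy (best-neighbour) selection strategy is used.

    def local_search(f, neighbours, x):
        """Greedy local search: repeatedly move to the best improving
        neighbour; stop as soon as a local optimum is reached."""
        value = f(x)
        while True:
            candidates = [(f(y), y) for y in neighbours(x)]
            if not candidates:
                return x                     # empty neighbourhood
            best_value, best = min(candidates, key=lambda t: t[0])
            if best_value >= value:
                return x                     # x is a local optimum
            x, value = best, best_value      # improving transition

Run from several (e.g. randomly generated) initial solutions, one would retain the best of the returned local optima.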
2 Combinatorial Optimization Problems

In combinatorial optimization problems the set S of feasible solutions is finite. The problem is to find an element s in S of minimum objective function value, i.e. f(s) = min{f(x) | x ∈ S}. Usually the number of elements in S (the cardinality of S) is extremely large, so that complete enumeration is computationally impossible. Combinatorial optimization is the field of mathematics which tries to solve combinatorial optimization problems by exploiting their structure as much as possible in order to make them computationally tractable.

Example 1. The Traveling Salesman Problem

Consider n jobs which have to be processed on one machine which can handle only one job at a time. Let p_j denote the processing time of job j, j = 1, ..., n. Furthermore, assume there is a switch-over time c_ij required between jobs i and j, i, j = 0, 1, ..., n, where 0 corresponds to the rest state of the machine. The objective is to complete all jobs as soon as possible. An instance with n = 4 jobs is presented in Figure 1.

Fig. 1. A one machine schedule for 4 jobs.

Since the sum of the processing times Σ_{j=1}^{n} p_j is always included in the total processing time, the latter is determined by the switch-over times. Therefore the problem can be viewed as the problem of finding a permutation π : {0, 1, ..., n} → {0, 1, ..., n} which minimizes

    f(π) = Σ_{i=0}^{n-1} c_{π(i)π(i+1)} + c_{π(n)π(0)}

over the set S of all permutations. The latter combinatorial optimization problem is called the (symmetric) traveling salesman problem; when c_ij (= c_ji) is viewed as the distance between two cities i and j, the problem translates into finding the shortest tour which visits each city exactly once. The tour corresponding to the schedule in Figure 1 can be represented by the edges [0, 2], [2, 1], [1, 4], [4, 3], and [3, 0], as is illustrated in Figure 2.

Probably the best known neighbourhood structure for the symmetric traveling salesman problem is determined by the concept of r-exchange. Two tours are neighbours with respect to an r-exchange if they differ in exactly r edges. Figure 3 describes a 2-exchange where the edges [1, 2] and [3, 4] are replaced by the new edges [1, 3] and [2, 4].
Fig. 2. The tour corresponding to the schedule of Figure 1.

The change in the length of the tour is easy to calculate as c_13 + c_24 - c_12 - c_34. The deletion of three edges does not uniquely determine three new edges which have to be inserted in order to get a feasible tour. In Figure 4, four possible 3-exchanges result when the edges [1, 2], [3, 4], and [5, 6] are replaced by three new ones. The newly introduced edges are either [1, 3], [2, 5], [4, 6], or [1, 4], [2, 5], [3, 6], or [1, 4], [2, 6], [3, 5], or [1, 5], [2, 4], [3, 6].

Fig. 3. A 2-exchange of edges [1, 2], [3, 4] by [1, 3], [2, 4].

Example 2. Job Shop Scheduling

A job shop consists of a set of different machines that perform operations on jobs. Each job has a specified processing order through the machines; that is, a job is an ordered list of operations, each of which is determined by the machine it requires and by its processing time. Operations cannot be interrupted (non-preemption), each machine can handle only one job at a time, and each job can be performed on only one machine at a time.
Fig. 4. All possible 3-exchanges for the edges [1, 2], [3, 4], and [5, 6].

The operation sequences on the machines are unknown and have to be determined so as to minimize the makespan, i.e. the time required to complete all jobs. An illuminating problem representation is the disjunctive graph model due to Roy and Sussman (1964). Let V = {0, 1, ..., n} denote the set of operations, where 0 and n are considered as dummy operations "start" and "end", respectively. Let M denote the set of machines; A is the set of pairs of operations constrained by the precedence relations for each job. For each machine k, the set E_k describes the set of all pairs of operations to be performed on machine k, i.e. operations which cannot overlap. In the disjunctive graph there is a vertex for each operation i ∈ V, with vertices 0 and n representing the start and the end, respectively, of a schedule. For every two consecutive operations of the same job there is a directed arc; the start vertex 0 is considered to be the first operation of every job and the end vertex n is considered to be the last operation of every job. For each pair of operations {i, j} ∈ E_k that require the same machine there are two arcs (i, j) and (j, i) with opposite directions. Thus, single arcs between operations represent the precedence constraints on the operations, and opposite directed arcs between two operations represent the fact that each machine can handle at most one operation at the same time. Each arc (i, j) is labeled by a positive weight p_i corresponding to the processing time of operation i. All arcs emanating from vertex 0 have label 0. Figure 5 illustrates the disjunctive graph for a problem instance with 3 machines M1, M2, M3 and 3 jobs J1, J2, J3. The machine sequences of jobs J1, J2, and J3 (see the directed start and end connecting paths of continuous arcs in Figure 5) are M1 → M2 → M3, M3 → M2, and M2 → M1 → M3, respectively. Broken arcs join operations competing for the same machine. The vertex label indicates the machine (index) on which the corresponding operation has to be processed. The processing times are presented in Table 1. The job shop scheduling problem requires to find an order of the operations on each machine, i.e. to select one arc among all opposite directed arc pairs such that the resulting graph is acyclic (i.e. there are no precedence conflicts between operations) and the length of the maximum weight path between the start and end vertex is minimal.
The length of a maximum weight (or longest) path determines the makespan.

Table 1. Processing times of a 3 job, 3 machine instance.

          M1   M2   M3
    J1     3    2    3
    J2     -    4    3
    J3     3    6    2

Fig. 5. The disjunctive graph for the problem instance of Table 1.

In order to improve a current schedule, we have to modify the machine order of jobs (i.e. the sequence of operations) on longest paths. Therefore a neighbourhood structure can be defined by (i) reversing an edge (between operations on the same machine) on a longest path in the graph, and (ii) reversing an edge on a longest path in the graph such that this edge is incident to an edge of the arc set A. For details we refer to Matsuo et al. (1988), Aarts et al. (1994), van Laarhoven et al. (1992), and Dell'Amico and Trubian (1993). For the problem instance of Figure 5, let us consider the schedule defined by the job processing sequence J1 → J3 on machine M1, and J1 → J2 → J3 on machines M2 and M3; see the arc selection in Figure 6. Here all jobs are lying on a longest path of length 26. Reversing the processing order of jobs J2 and J3 on machine M2 yields a reduced makespan of 16 for the new schedule.
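As an illustration, the makespan of a fixed arc selection can be computed as the length of a longest path in the resulting acyclic graph. The Python sketch below encodes the instance of Table 1 with the arc selection of Figure 6; the data layout and the simple relaxation scheme are our own choices for illustration. The completion-time estimates increase monotonically towards the longest-path values, so a bounded number of sweeps suffices.

    # Instance of Table 1: machine routing and processing time per job.
    jobs = {
        "J1": [("M1", 3), ("M2", 2), ("M3", 3)],
        "J2": [("M3", 3), ("M2", 4)],
        "J3": [("M2", 6), ("M1", 3), ("M3", 2)],
    }
    # Arc selection of Figure 6: the processing order of jobs per machine.
    machine_order = {
        "M1": ["J1", "J3"],
        "M2": ["J1", "J2", "J3"],
        "M3": ["J1", "J2", "J3"],
    }

    def makespan(jobs, machine_order):
        """Longest-path (makespan) computation for a fixed arc selection."""
        finish = {}                  # (job, machine) -> completion time
        n_ops = sum(len(route) for route in jobs.values())
        for _ in range(n_ops):       # n_ops sweeps reach the fixed point
            for job, route in jobs.items():
                prev_end = 0         # completion of the job predecessor
                for machine, p in route:
                    pos = machine_order[machine].index(job)
                    mach_end = 0     # completion of the machine predecessor
                    if pos > 0:
                        pred = machine_order[machine][pos - 1]
                        mach_end = finish.get((pred, machine), 0)
                    finish[(job, machine)] = max(prev_end, mach_end) + p
                    prev_end = finish[(job, machine)]
        return max(finish.values())

    print(makespan(jobs, machine_order))    # prints 26

Setting machine_order["M2"] = ["J1", "J3", "J2"] (the reversal discussed above) makes the function return 16.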
Fig. 6. A schedule with makespan 26.

Example 3. Minimizing the Sum of Weighted Completion Times

Consider n jobs which have to be processed on one machine. Each job has a processing time p_j and a weight (e.g., a cost factor) w_j, j = 1, ..., n. All jobs are available at the start of the planning period, say at time 0. The completion time C_j of job j, j = 1, ..., n, is defined as the time at which the processing of job j is finished. The objective is to find a sequence of the jobs which minimizes the weighted sum of completion times Σ_{j=1}^{n} w_j · C_j. Consider an instance involving 5 jobs, the processing times and weights of which are presented in Table 2. A schedule with job sequence 1, 2, 3, 4, 5 is represented in the Gantt chart of Figure 7. Its objective function value is 2·8 + 3·2 + 6·9 + 10·10 + 13·5 = 241.

Table 2. Processing times and weights of a 5 job problem instance.

    job    1   2   3   4   5
    p_j    2   1   3   4   3
    w_j    8   2   9  10   5

Two job sequences are defined to be neighbours if one can be obtained from the other by interchanging two consecutive jobs. If job i is an immediate predecessor of job j, then interchanging i and j does not affect the completion times of the other jobs. Therefore the new schedule, where j precedes i, is an improvement if

    w_i · (T + p_i) + w_j · (T + p_i + p_j) > w_j · (T + p_j) + w_i · (T + p_j + p_i),

where T is the starting time of job i in the first schedule. Thus the interchange gives an improvement if w_j/p_j > w_i/p_i. For instance, interchanging jobs 2 and 3 in the schedule of Figure 7 leads to an improvement since 9/3 > 2/1 (the schedule 1, 3, 2, 4, 5 has an objective function value of 238). As a matter of fact, an optimal solution can always be found by ordering the jobs in non-increasing order of w_j/p_j (this is Smith's ratio rule (1956)).
Fig. 7. A schedule for the problem instance of Table 2.

Hence, in contrast with the preceding examples, the neighbourhood structure defined above guarantees that every local minimum is a global minimum, and that local search always leads to an optimal solution.

3 Simulated Annealing

Simulated annealing was proposed as a framework for the solution of combinatorial optimization problems by Kirkpatrick, Gelatt and Vecchi (1983) and, independently, by Cerny (1985). It is based on a procedure originally devised by Metropolis et al. (1953) to simulate the annealing (or slow cooling) of solids, after they have been heated to their melting point. In simulated annealing procedures, the sequence of solutions does not roll monotonically down towards a local optimum, as was the case with local search. Rather, the solutions trace an up-and-down random walk through the feasible set S, and this walk is loosely guided in a "favourable" direction. To be more specific, let us now describe the k-th iteration of a typical simulated annealing procedure, starting from a current solution x ∈ S. First, a neighbour of x, say y ∈ N(x), is selected (usually, but not necessarily, at random). Then, based on the amplitude of Δ = f(x) - f(y), a transition from x to y (i.e., an update of x by y) is either accepted or rejected. This decision is made nondeterministically: the transition is accepted with probability p_k(Δ), where p_k is a probability distribution depending on the iteration count k. The intuitive justification for this rule is as follows. In order to avoid getting trapped early in a local optimum, transitions implying a deterioration of the objective function (i.e., with Δ < 0) should be occasionally accepted, but the probability of acceptance should nevertheless increase with Δ. Moreover, the probability distributions are chosen so that p_{k+1}(Δ) ≤ p_k(Δ). In this way, escaping local optima is relatively easy during the first iterations, and the procedure explores the set S freely. But, as the iteration count increases, only improving transitions tend to be accepted, and the solution path is likely to terminate in a local optimum. The procedure stops if the value of the objective function remains constant during L (a termination parameter) consecutive iterations, or if the number of iterations becomes too large.
In most implementations, and by analogy with the original procedure of Metropolis et al. (1953), the probability distributions p_k take the form

    p_k(Δ) = 1 if Δ ≥ 0,  and  p_k(Δ) = e^{c_k Δ} if Δ < 0,

where c_{k+1} ≥ c_k > 0 for all k, and c_k → ∞ when k → ∞. A popular choice for the parameter c_k is to hold it constant for a fixed number T of consecutive iterations, and then to increase it by a constant factor:

    c_{lT+t} = a^l · c_0  for t = 1, 2, ..., T and l = 0, 1, 2, ...

Here, c_0 is a small positive number, and a is slightly larger than 1. It is clear that the choice of the termination parameter and of the distributions p_k (k = 1, 2, ...) (the so-called cooling schedule) strongly influences the performance of the procedure. If the cooling is too rapid (e.g. if T is small and a is large), then simulated annealing tends to behave like local search, and gets trapped in local optima of poor quality. If the cooling is too slow, then the running time becomes prohibitive. Under some reasonable assumptions on the cooling schedule, theoretical results can be established concerning convergence to a global optimum and the complexity of the procedure (see van Laarhoven and Aarts (1987), Aarts and Korst (1989)). In practice, determining appropriate values for the parameters is part of the fine tuning of the implementation, and still relies on experimentation. We refer to the extensive computational studies by Johnson et al. (1989, 1991) for a wealth of details on this topic. Simulated annealing has been applied to several types of combinatorial optimization problems, with various degrees of success (see van Laarhoven and Aarts (1987), Aarts and Korst (1989), and Johnson et al. (1989, 1991)). In particular, many researchers have tested the performance of simulated annealing approaches to the traveling salesman problem (the seminal papers by Kirkpatrick et al. (1983) and Cerny (1985) already handled this problem; see van Laarhoven and Aarts (1987) and Johnson (1990) for an overview of the literature). The neighbourhood structure used in these implementations is generally based on 2- or 3-exchanges (see Section 2). This seems to lead to algorithms which are more effective than repeated applications of simple local search, but less effective than the Lin-Kernighan heuristic (1973). Simulated annealing has also been applied to job shop scheduling, with neighbourhood structures of the type described in Section 2 (Matsuo et al. (1988), Aarts et al. (1994), and van Laarhoven et al. (1992)). The resulting algorithms again perform better than multiple-start local search or simple-minded heuristics. Given sufficient (very high) running time, they can produce better solutions than the efficient bottleneck procedure due to Adams et al. (1988). As a general rule, one may say that simulated annealing is a reliable procedure to use in situations where theoretical knowledge is scarce or appears difficult to apply algorithmically. Even for the solution of complex problems, simulated annealing is relatively easy to implement, and usually outperforms a local search procedure with multiple starts.
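The scheme is easily made concrete. The Python sketch below applies simulated annealing to the symmetric traveling salesman problem, with random 2-exchanges (Section 2) as neighbourhood moves and the geometric cooling schedule given above; the distance-matrix representation and the parameter values c0, a, T are arbitrary illustrative choices of ours.

    import math
    import random

    def tour_length(tour, c):
        n = len(tour)
        return sum(c[tour[i]][tour[(i + 1) % n]] for i in range(n))

    def simulated_annealing(c, tour, c0=0.05, a=1.05, T=100, max_iter=100000):
        """Accept improvements (delta >= 0) always and deteriorations
        (delta < 0) with probability exp(c_k * delta)."""
        n, ck = len(tour), c0
        for k in range(1, max_iter + 1):
            if k % T == 0:
                ck *= a                     # cooling step
            i, j = sorted(random.sample(range(n), 2))
            if i == 0 and j == n - 1:
                continue                    # reversal would give the same tour
            # 2-exchange: drop edges [i-1, i] and [j, j+1]; delta is the
            # decrease in tour length, i.e. f(x) - f(y).
            delta = (c[tour[i - 1]][tour[i]] + c[tour[j]][tour[(j + 1) % n]]
                     - c[tour[i - 1]][tour[j]] - c[tour[i]][tour[(j + 1) % n]])
            if delta >= 0 or random.random() < math.exp(ck * delta):
                tour[i:j + 1] = reversed(tour[i:j + 1])   # accept transition
        return tour

    # Example usage: 30 random points in the unit square.
    pts = [(random.random(), random.random()) for _ in range(30)]
    c = [[math.dist(p, q) for q in pts] for p in pts]
    print(tour_length(simulated_annealing(c, list(range(30))), c))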
4 Tabu Search

Tabu search is a general framework for the solution of discrete optimization problems, which was originally proposed by Glover and subsequently expanded in a series of papers (Glover (1977, 1986, 1989, 1990), Glover and McMillan (1986), etc.). One of the central ideas in this proposal is to guide the local search process out of local optima deterministically (in contrast with the non-deterministic approach of simulated annealing). This can be done using different criteria, which ensure that the loss incurred in the value of the objective function in such an "escaping" step is not too important, or is somehow compensated for. For instance, assume that several numerical criteria, say f_1, ..., f_m, are relevant to evaluate the quality of candidate solutions. A weighted combination of these criteria, f = Σ w_i · f_i, can then be used as objective function in a classical local search procedure to produce a local optimum x with respect to f. If the weights w_i are now modified, another combination of the criteria, say f', is obtained, for which x is (in general) no longer a local optimum. Local search can then proceed with this new objective function, to generate alternative solutions to the problem. This type of approach was used by Glover and McMillan (1986) in their solution of very large employee scheduling problems. Another, more straightforward criterion for leaving local optima is to replace the improvement step in the local search procedure by a "least deteriorating" step. One version of this principle was proposed by Hansen (independently of Glover's work on tabu search) under the name steepest descent mildest ascent (see Hansen and Jaumard (1990), as well as Glover (1989)). In its simplest form, the resulting procedure replaces the current solution x by a solution y ∈ N(x) which maximizes Δ = f(x) - f(y). If during L (a termination parameter) consecutive iterations no improvements are found, the procedure stops. Notice that Δ may be negative, thus resulting in a deterioration of the objective function. Now, the major defect of this simple procedure is readily apparent. If Δ is negative in some transition from x to y, then there will be a tendency in the next iteration of the procedure to reverse the transition and go back to the local optimum x (since x improves on y). Such a reversal would cause the procedure to oscillate endlessly between x and y. To prevent this phenomenon (which is likely to occur in every version of tabu search), Glover and Hansen propose to maintain throughout the search a (dynamic) list of forbidden transitions, called a tabu list (hence the name of the procedure). The purpose of this list is not to rule out cycling completely (this would in general result in heavy bookkeeping and loss of flexibility), but at least to make it improbable. In the framework of the steepest descent mildest ascent procedure, we may for instance implement this idea by placing a solution x in a tabu list T after every transition away from x. In effect, this amounts to deleting x from S. But, for reasons of flexibility, a solution should only remain in the tabu list for a limited number of iterations, and then be freed again. Another possible implementation would be to create a tabu list T(y) for every solution y ∈ S. After a transition from x to y, x would be placed in the list T(y), meaning that further transitions from y to x are forbidden (in effect, this amounts to deleting x from N(y)).
Here again, x should be dropped from T(y) after a number of transitions. For still other possible definitions of tabu lists, see e.g. Glover (1986, 1989), Glover and Greenberg (1989), Hansen and Jaumard (1990), and Hertz and de Werra (1990). Tabu search encompasses many features beyond the possibility to avoid the trap of local optimality and the use of tabu lists. Even though we cannot discuss them all in the limited framework of this survey, we would like to mention two of them, which provide interesting links with artificial intelligence and with genetic algorithms (to be discussed in the next section). In order to guide the search, Glover suggests recording some of the salient characteristics of the best solutions found in some phase of the procedure (e.g., fixed values of the variables in all, or in a majority, of those solutions, recurring relations between the values of the variables, etc.). In a subsequent phase, tabu search can then be restricted to the subset of feasible solutions presenting these characteristics. This enforces what Glover calls a "regional intensification" of the search in promising "regions" of the feasible set. An opposite idea may also be used to "diversify" the search. Namely, if all solutions discovered in an initial phase of the search procedure share some common features, this may indicate that other regions of the solution space have not been sufficiently explored. Identifying these unexplored regions may be helpful in providing new starting solutions for the search. Both ideas, search intensification and diversification, require the capability of recognizing recurrent patterns within subsets of solutions. Techniques developed in the fields of pattern recognition or learning may clearly be relevant for this purpose (Glover (1986, 1989, 1990)). Variants of tabu search have been successfully applied to a large diversity of optimization problems: scheduling, clustering, generalized bin packing (see Glover (1986, 1990), Glover and McMillan (1986)), graph coloring (Hertz and de Werra (1987)), maximum satisfiability (Hansen and Jaumard (1990)), etc. Sophisticated neighbourhood structures for some scheduling problems are proposed by Brucker et al. (1993, 1994) and Glass et al. (1992). Applications to the traveling salesman problem are reported in Malek et al. (1989); various neighbourhood structures are presented by Glover (1991, 1992). Taillard (1994) has implemented a parallel tabu search approach for the job shop scheduling problem, while Widmer (1991) applied tabu search to a generalized version of the same problem. Taillard uses a neighbourhood structure defined by reversing an edge between operations on the same machine on the longest path in the disjunctive graph; see Section 2. Such a reversed edge becomes tabu for a certain number of iterations. Widmer, on the other hand, defines a neighbour of the current schedule by selecting a machine M and an operation J, and by shifting the position of J in the operations sequence of machine M. If J is shifted from position i to position k in the sequence, then the couple (J, i) is included in the tabu list, meaning that operation J may not return to position i in the operations sequence of machine M (until (J, i) is removed from the tabu list). Ever more powerful methods are described by Dell'Amico and Trubian (1993) as well as Nowicki and Smutnicki (1993).
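In its simplest list-based form, the steepest descent mildest ascent principle can be sketched as follows in Python; storing complete solutions (assumed to be hashable objects such as tuples) in a fixed-length tabu list, as well as the parameter values, are simplifications made here for illustration only.

    from collections import deque

    def tabu_search(f, neighbours, x, tenure=7, L=50):
        """Always move to the best non-tabu neighbour, even if it is worse;
        recently visited solutions remain tabu for `tenure` iterations."""
        tabu = deque(maxlen=tenure)       # oldest entries are freed again
        best, best_value = x, f(x)
        stall = 0                         # iterations without improvement
        while stall < L:
            candidates = [y for y in neighbours(x) if y not in tabu]
            if not candidates:
                break                     # every neighbour is tabu
            y = min(candidates, key=f)    # least deteriorating transition
            tabu.append(x)                # forbid an immediate return to x
            x = y
            if f(x) < best_value:
                best, best_value = x, f(x)
                stall = 0
            else:
                stall += 1
        return best

Practical implementations usually store move attributes rather than full solutions, and add an aspiration criterion that overrides the tabu status of a move leading to a new overall best solution.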
Like simulated annealing (or, maybe, more than it), tabu search has established itself as a successful general-purpose heuristic for combinatorial optimization problems. But the full potential of the method, as well as its theoretical properties, largely remain to be understood.

5 Genetic Algorithms

As the name suggests, genetic algorithms are motivated by the theory of evolution; they date back to the early work of Rechenberg (1973), Holland (1975), and Schwefel (1977); see also Goldberg (1989a,b), Michalewicz (1992), and Liepins and Hilliard (1989). They have been designed as general search strategies and optimization methods working on populations of feasible solutions. Working with populations makes it possible to identify and explore properties which good solutions have in common (this is similar to the regional intensification idea mentioned in our discussion of tabu search). Solutions are encoded as strings consisting of elements chosen from a finite alphabet. Roughly speaking, a genetic algorithm aims at producing near-optimal solutions by letting a set of strings, representing random solutions, undergo a sequence of unary and binary transformations governed by a selection scheme biased towards high-quality solutions. Therefore the quality or fitness value of an individual in the population, i.e. a string, has to be defined. Usually it is the value of the objective function or some scaled version of it. The transformations on the individuals of a population constitute the recombination steps of a genetic algorithm and are performed by three simple operators. The effect of the operators is that implicitly good properties are identified and combined into a new population which hopefully has the property that the value of the best individual (representing the best solution in the population) and the average value of the individuals are better than in previous populations. The process is then repeated until some stopping criterion is met. It can be shown that the process converges to an optimal solution with probability one (cf. Eiben et al. (1991)). The three basic operators of a genetic algorithm used when a new population is constructed are reproduction, crossover and mutation. Via reproduction a new temporary population is generated in which each member is a replica of a member of the old population. A copy of an individual is produced with probability proportional to its fitness value, i.e. better strings probably get more copies. The intended effect of this operation is to improve the quality of the population as a whole. However, no genuinely new solutions and hence no new information are created in the process. The generation of such new strings is handled by the crossover operator. In order to apply the crossover operator the population is randomly partitioned into pairs. Next, for each pair, the crossover operator is applied with a certain probability by choosing a position randomly in the string and exchanging the tails (defined as the substrings starting at the chosen position) of the two strings (this is the simplest version of a crossover). The effect of the crossover is that certain properties of the individuals are combined into new ones, or other properties are destroyed. The mutation operator, which makes random changes to single elements of the string, only plays a secondary role in genetic algorithms. Mutation serves to maintain diversity in the population (see the remarks on diversification in the previous section on tabu search).
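The interplay of the three operators is summarized by the following Python sketch for bit-string individuals; the population size, the crossover and mutation probabilities, and the assumption that fitness(s) returns a positive value to be maximized are all illustrative choices of ours.

    import random

    def genetic_algorithm(fitness, n_bits, pop_size=50, p_cross=0.7,
                          p_mut=0.01, generations=200):
        pop = [[random.randint(0, 1) for _ in range(n_bits)]
               for _ in range(pop_size)]
        for _ in range(generations):
            # Reproduction: replicas drawn with probability proportional
            # to fitness.
            weights = [fitness(s) for s in pop]
            pop = [list(s) for s in
                   random.choices(pop, weights=weights, k=pop_size)]
            # Crossover: random pairs exchange tails at a random position.
            random.shuffle(pop)
            for i in range(0, pop_size - 1, 2):
                if random.random() < p_cross:
                    cut = random.randrange(1, n_bits)
                    a, b = pop[i], pop[i + 1]
                    a[cut:], b[cut:] = b[cut:], a[cut:]
            # Mutation: random changes to single string elements.
            for s in pop:
                for j in range(n_bits):
                    if random.random() < p_mut:
                        s[j] = 1 - s[j]
        return max(pop, key=fitness)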
Besides unary and binary recombination operators, one may also introduce operators of higher arities, such as consensus operators, which fix variable values common to most solutions represented in the current population. Selection of individuals during the reproduction step can be realized in a number of ways: one could adopt the scenario of Goldberg (1989a) or use deterministic ranking. Further, it matters whether the newly recombined offspring compete with the parent solutions or simply replace them. The traditional genetic algorithm, based on a binary string representation of solutions, is often unsuitable for combinatorial optimization problems, because it is very difficult to represent a solution in such a way that substrings have a meaningful interpretation. So it is no surprise that the first attempt by Grefenstette et al. (1985) (and Grefenstette (1987)) to solve the traveling salesman problem by a traditional genetic algorithm, based on a so-called ordinal representation of solutions, led to solutions as far as 25% above the optimum, even for small problem sizes. However, choosing a more natural representation of solutions, for instance a permutation of the cities for the traveling salesman problem or a list of operation sequences per machine for job shop scheduling, involves more intricate recombination operators, in particular crossover operators, in order to obtain feasible offspring; this trade-off has been noticed, for instance, by Aarts et al. (1994) for the job shop scheduling problem and by Mühlenbein et al. (1987, 1988), Gorges-Schleuter (1989), and Kolen and Pesch (1994) for the traveling salesman problem. The construction of a crossover operator should also take into consideration that fitness values of offspring should not be too far from those of their parents, and that offspring should be closely genetically related to their parents. Let us illustrate this discussion with some examples. For the traveling salesman problem, the Grefenstette crossover (Grefenstette (1987)) constructs one new tour from two parent tours as follows. (i) Randomly choose a city as the current city of the tour and label it "visited". (ii) Consider all the edges incident to the current city in both parents and choose among these edges a shortest one leading to an unvisited city. If all such edges lead to already visited cities, randomly choose an edge (which is not in one of the parents) to one of the unvisited cities. Say j is the unvisited endpoint of this edge. Label j "visited", and repeat (ii) with j as the new current city, until all cities have been visited. The procedure can be repeated to generate two offspring from the two parents. Variations are possible; for instance, in step (ii) we may select edges at random or with a probability inversely proportional to their length. The Mühlenbein-Gorges-Schleuter crossover (Mühlenbein et al. (1988) and Gorges-Schleuter (1989)) chooses a path in one of the parents and incorporates this path in the other parent while leaving as many as possible of the edges undisturbed. The length of the path is randomly chosen within the interval [n/3, n/2]; the first vertex of the path is also randomly chosen. We illustrate the Mühlenbein-Gorges-Schleuter crossover by an example. Assume that we wish to implant the path (1, 2, 3) from parent (1, 2, 3, 4, 5, 6, 7, 8) into parent (1, 8, 4, 6, 3, 5, 2, 7), called the receiving parent.
The first step to perform is to create a new tour such that both endpoints of the path, city 1 and city 3 in our case, are adjacent. Adjacency can be reached by either of two 2-exchanges (see Figure 3). In the first one, the edges [1, 8] and [3, 5] are replaced by the new edges [1, 3] and [5, 8], while in the other 2-exchange the edges [1, 7] and [3, 6] are replaced by the new edges [1, 3] and [6, 7]. Thus two tours are obtained in which cities 1 and 3 are adjacent. In both of them all cities of the path that has to be implanted are removed from their positions, while the order of all other cities remains untouched. In our case city 2 will be dropped from both tours and an edge [5, 7] is introduced in both. Finally the path is implanted between the two endpoints, i.e. city 2 becomes adjacent to cities 1 and 3 in both tours. Hence, we get two new tours (1, 2, 3, 6, 4, 8, 5, 7) and (1, 2, 3, 5, 7, 6, 4, 8), the better of which is chosen as the result of the crossover. Similarly we get a second offspring when we start by choosing a path in the other parent. The crossover operator used by Aarts et al. (1994) in the case of job shop scheduling is also based on a natural solution representation. The idea is to implant a subset of arcs from one parent into the receiving parent. More specifically, an arc (i, j) sequencing two jobs on the same machine in the first parent is randomly chosen. If this arc occurs on a longest path in the receiving parent, then it is reversed in the latter and the longest paths are recomputed. This process is repeated k times, where k is at most the number of operations in the underlying job shop scheduling problem. A second offspring is obtained by interchanging the roles of the parents. Problems from combinatorial optimization are well within the scope of genetic algorithms, and early attempts closely followed the scheme of what Goldberg (1989a) calls a simple genetic algorithm. Compared to standard heuristics, for instance for the traveling salesman problem (cf. Lawler et al. (1985), Grötschel and Holland (1991)) or the job shop scheduling problem (cf. Baker (1974), French (1982), Adams et al. (1988)), "genetic algorithms are not well suited for fine-tuning structures which are very close to optimal solutions" (Grefenstette (1987)). Therefore it is essential, if a competitive genetic algorithm is desired, to compensate for this drawback by incorporating (local search) improvement operators into the basic scheme; see Mühlenbein et al. (1987, 1988), Mühlenbein (1989), Jog et al. (1989) and Suh and Van Gucht (1987). The resulting algorithm has been called a genetic local search heuristic or genetic enumeration; for the traveling salesman problem we refer to the papers of Ulder et al. (1991), Johnson (1990), and Kolen and Pesch (1994); for the job shop scheduling problem we refer to Dorndorf and Pesch (1995). For instance, local search improvement algorithms that have been used for the traveling salesman problem and applied to some or all of the individuals in the population are 2-opt (i.e. repeated most improving 2-exchanges), tabu search, and the varying r-exchange algorithm of Lin and Kernighan (1973). Each individual of the population is then replaced by a locally improved one, or by an individual representing a locally optimal solution, i.e. an improvement procedure is applied to each individual either partially (for a certain number of iterations) or completely.
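As an illustration of such representation-specific operators, the edge-based Grefenstette crossover described above can be sketched as follows in Python; representing tours as lists of city indices and distances as a matrix c are our own choices.

    import random

    def grefenstette_crossover(parent1, parent2, c):
        """Construct one offspring tour, preferring short parental edges."""
        n = len(parent1)

        def adjacency(tour):
            adj = {city: set() for city in tour}
            for i, city in enumerate(tour):
                adj[city].update((tour[i - 1], tour[(i + 1) % n]))
            return adj

        adj1, adj2 = adjacency(parent1), adjacency(parent2)
        current = random.choice(parent1)      # step (i): random start city
        tour, visited = [current], {current}
        while len(tour) < n:
            # Step (ii): edges incident to the current city in both parents.
            options = [v for v in adj1[current] | adj2[current]
                       if v not in visited]
            if options:
                nxt = min(options, key=lambda v: c[current][v])
            else:
                # All parental edges lead to visited cities: jump to a
                # randomly chosen unvisited city.
                nxt = random.choice([v for v in parent1 if v not in visited])
            tour.append(nxt)
            visited.add(nxt)
            current = nxt
        return tour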
Some type of improvement heuristic may also be incorporated into the crossover operator (see Kolen and Pesch (1994)). In any case the improvement step as well as the crossover operator heavily depend on the representation of the solution. Usually a simple representation requires more sophisticated recombination operators, and vice versa. To overcome these difficulties, Dorndorf and Pesch (1995) proposed a completely different encoding scheme for the job shop scheduling problem. In this scheme, each individual of the population is a string of n - 1 entries (p_1, p_2, ..., p_{n-1}), where n - 1 is the number of operations in the underlying problem instance. The entry p_i represents a rule from a set of priority rules (see Panwalkar and Iskander (1977)); this rule is then used to determine the i-th operation to be processed. Such a solution representation makes it possible to use the simplest type of crossover as well as to incorporate problem specific knowledge, i.e., as Davis (1985) claimed, "to examine the workings of a good deterministic program in that domain"; the resulting algorithm is competitive with special purpose heuristics. Putting things in a more general framework, a genetic meta-strategy controls a sequence of local decisions (such as priority rules, or even more complicated ones, see Dorndorf and Pesch (1995)) in order to find the best combinations.

References

E.H.L. Aarts and J. Korst (1989) Simulated Annealing and Boltzmann Machines. John Wiley and Sons, Chichester.
E.H.L. Aarts, P.J.M. van Laarhoven, J.K. Lenstra and N.L.J. Ulder (1994) A computational study of local search algorithms for job shop scheduling. ORSA Journal on Computing 6, 118-125.
J. Adams, E. Balas, and D. Zawack (1988) The shifting bottleneck procedure for job shop scheduling. Management Science 34, 391-401.
K.R. Baker (1974) Introduction to Sequencing and Scheduling. Wiley, New York.
P. Brucker, J. Hurink and F. Werner (1993) Improving local search heuristics for some scheduling problems. Working paper, University of Osnabrück; Discrete Applied Mathematics (to appear).
P. Brucker, J. Hurink and F. Werner (1994) Improving local search heuristics for some scheduling problems, Part II. Working paper, University of Osnabrück.
V. Cerny (1985) Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm. Journal of Optimization Theory and Applications 45, 41-51.
L. Davis (1985) Job shop scheduling with genetic algorithms. Proc. 1st Int. Conf. Genetic Algorithms and Their Applications (J.J. Grefenstette, ed.), Lawrence Erlbaum Ass., 136-140.
M. Dell'Amico and M. Trubian (1993) Applying tabu-search to the job shop scheduling problem. Annals of Operations Research 41, 231-252.
U. Dorndorf and E. Pesch (1995) Evolution based learning in a job shop scheduling environment. Computers & Operations Research 22, 25-40.
A.E. Eiben, E.H.L. Aarts and K.H. van Hee (1991) Global convergence of genetic algorithms: a Markov chain analysis. Proc. 1st Int. Workshop on Parallel Problem Solving from Nature (H.-P. Schwefel and R. Männer, eds.), Lecture Notes in Computer Science 496, 4-9.
S. French (1982) Sequencing and Scheduling: An Introduction to the Mathematics of the Job Shop. Wiley, New York.
C.A. Glass, C.N. Potts and P. Shade (1992) Genetic algorithms and neighbourhood search for scheduling unrelated parallel machines. Working paper, University of Southampton.
F. Glover (1977) Heuristics for integer programming using surrogate constraints. Decision Sciences 8, 156-160.
F. Glover (1986) Future paths for integer programming and links to artificial intelligence. Computers and Operations Research 13, 533-549.
F. Glover (1989) Tabu search - Part I. ORSA Journal on Computing 1, 190-206.
F. Glover (1990) Tabu search - Part II. ORSA Journal on Computing 2, 4-32.
F. Glover (1991) Multilevel tabu search and embedded search neighbourhoods for the traveling salesman problem. Working paper, University of Colorado, Boulder.
F. Glover (1992) Ejection chains, reference structures and alternating path methods for traveling salesman problems. Working paper, University of Colorado, Boulder.
F. Glover and H.J. Greenberg (1989) New approaches for heuristic search: A bilateral linkage with artificial intelligence. European Journal of Operational Research 13, 119-130.
F. Glover and C. McMillan (1986) The general employee scheduling problem: an integration of MS and AI. Computers and Operations Research 13, 563-573.
D.E. Goldberg (1989a) Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading.
D.E. Goldberg (1989b) Zen and the art of genetic algorithms. Proc. 3rd Int. Conf. Genetic Algorithms (J.D. Schaffer, ed.), Morgan Kaufmann Publ., 80-85.
M. Gorges-Schleuter (1989) ASPARAGOS, a parallel genetic algorithm and population genetics. Proc. 3rd Int. Conf. Genetic Algorithms (J.D. Schaffer, ed.), Morgan Kaufmann Publ., 422-427.
J.J. Grefenstette (1987) Incorporating problem specific knowledge into genetic algorithms. Genetic Algorithms and Simulated Annealing (L. Davis, ed.), Pitman, 42-60.
J.J. Grefenstette, R. Gopal, B. Rosmaita, and D. van Gucht (1985) Genetic algorithms for the traveling salesman problem. Proc. 1st Int. Conf. Genetic Algorithms and Their Applications (J.J. Grefenstette, ed.), Lawrence Erlbaum Ass., 160-168.
M. Grötschel and O. Holland (1991) Solution of large-scale symmetric travelling salesman problems. Mathematical Programming 51, 141-202.
P. Hansen and B. Jaumard (1990) Algorithms for the maximum satisfiability problem. Computing 44, 279-303.
A. Hertz and D. de Werra (1987) Using tabu search techniques for graph coloring. Computing 39, 345-351.
A. Hertz and D. de Werra (1990) The tabu search metaheuristic: How we use it. Annals of Mathematics and Artificial Intelligence 1, 111-121.
J.H. Holland (1975) Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor.
P. Jog, J.Y. Suh, and D. van Gucht (1989) The effects of population size, heuristic crossover and local improvement on a genetic algorithm for the traveling salesman problem. Proc. 3rd Int. Conf. Genetic Algorithms (J.D. Schaffer, ed.), Morgan Kaufmann Publ., 110-115.
D.S. Johnson (1990) Local optimization and the traveling salesman problem. Proc. 17th Colloq. Automata, Languages, and Programming, Springer-Verlag, 446-461.
D.S. Johnson, C.R. Aragon, L.A. McGeoch, and C. Schevon (1989) Optimization by simulated annealing: An experimental evaluation; Part I, Graph partitioning. Operations Research 37, 865-892.
D.S. Johnson, C.R. Aragon, L.A. McGeoch, and C. Schevon (1991) Optimization by simulated annealing: An experimental evaluation; Part II, Graph coloring and number partitioning. Operations Research 39, 378-406.
D.S. Johnson, C.H. Papadimitriou, and M. Yannakakis (1988) How easy is local search? Journal of Computer and System Sciences 37, 79-100.
S. Kirkpatrick, C.D. Gelatt Jr., and M.P. Vecchi (1983) Optimization by simulated annealing. Science 220, 671-680.
A. Kolen and E. Pesch (1994) Genetic local search in combinatorial optimization. Discrete Applied Mathematics 48, 273-284.
P.J.M. van Laarhoven and E.H.L. Aarts (1987) Simulated Annealing: Theory and Applications. Reidel, Dordrecht.
P.J.M. van Laarhoven, E.H.L. Aarts, and J.K. Lenstra (1992) Job shop scheduling by simulated annealing. Operations Research 40, 113-125.
E.L. Lawler, J.K. Lenstra, A.H.G. Rinnooy Kan, and D.B. Shmoys (eds.) (1985) The Traveling Salesman Problem. John Wiley and Sons.
G.E. Liepins and M.R. Hilliard (1989) Genetic algorithms: foundations and applications. Annals of Operations Research 21, 31-57.
S. Lin and B.W. Kernighan (1973) An effective heuristic algorithm for the traveling salesman problem. Operations Research 21, 498-516.
Z. Michalewicz (1992) Genetic Algorithms + Data Structures = Evolution Programs. Springer, Berlin.
M. Malek, M. Guruswamy, M. Pandya, and H. Owens (1989) Serial and parallel simulated annealing and tabu search algorithms for the traveling salesman problem. Linkages with Artificial Intelligence (F. Glover and H.J. Greenberg, eds.), Annals of Operations Research 21, 59-84.
H. Matsuo, C.J. Suh, and R.S. Sullivan (1988) A controlled search simulated annealing method for the general jobshop scheduling problem. Working paper 03-04-88, Department of Management, University of Texas, Austin.
N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller (1953) Equation of state calculations by fast computing machines. Journal of Chemical Physics 21, 1087-1092.
H. Mühlenbein (1989) Parallel genetic algorithms, population genetics and combinatorial optimization. Proc. 3rd Int. Conf. Genetic Algorithms (J.D. Schaffer, ed.), Morgan Kaufmann Publ., 416-421.
H. Mühlenbein, M. Gorges-Schleuter, and O. Krämer (1987) New solutions to the mapping problem of parallel systems: the evolution approach. Parallel Computing 4, 269-279.
H. Mühlenbein, M. Gorges-Schleuter, and O. Krämer (1988) Evolution algorithms in combinatorial optimization. Parallel Computing 7, 65-85.
E. Nowicki and C. Smutnicki (1993) A fast taboo search algorithm for the job shop problem. Working paper, Technical University of Wroclaw.
S.S. Panwalkar and W. Iskander (1977) A survey of scheduling rules. Operations Research 25, 45-61.
E. Pesch and S. Voß, eds. (1995) Applied Local Search. OR Spektrum (special issue, to appear).
I. Rechenberg (1973) Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Problemata, Frommann-Holzboog.
B. Roy and B. Sussman (1964) Les problèmes d'ordonnancement avec contraintes disjonctives. SEMA, Note D.S. No. 9, Paris.
H.-P. Schwefel (1977) Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie. Birkhäuser, Basel.
W.E. Smith (1956) Various optimizers for single-stage production. Naval Research Logistics Quarterly 3, 59-66.
J.Y. Suh and D. van Gucht (1987) Incorporating heuristic information into genetic search. Proc. 2nd Int. Conf. Genetic Algorithms (J.J. Grefenstette, ed.), Lawrence Erlbaum Ass., 100-107.
E. Taillard (1994) Parallel taboo search technique for the job shop scheduling problem. ORSA Journal on Computing 6, 108-117.
N.L.J. Ulder, E.H.L. Aarts, H.-J. Bandelt, P.J.M. van Laarhoven, and E. Pesch (1991) Genetic local search algorithms for the traveling salesman problem. Proc. 1st Int. Workshop on Parallel Problem Solving from Nature (H.-P. Schwefel and R. Männer, eds.), Lecture Notes in Computer Science 496, 109-116.
M. Widmer (1991) Job shop scheduling with tooling constraints: a tabu search approach. Journal of the Operational Research Society 42, 75-82.
M. Yannakakis (1990) The analysis of local search problems and their heuristics. Proc. 7th Annual Symposium on Theoretical Aspects of Computer Science (C. Choffrut and T. Lengauer, eds.), Lecture Notes in Computer Science 415, 298-311.
Process Identification and Control

P. Boekhoudt

Department of Mathematics, University of Limburg, Maastricht, The Netherlands

1 Introduction

The control of processes is common in nature and technology. In the human body, for example, the blood pressure, the diameter of the eye pupil, the blood glucose level, the pH, etc. are controlled by biological mechanisms. Technological examples of control are the regulation of the room temperature and the automatic flight control of an airplane. Natural processes usually (but not necessarily) are autonomously controlled, i.e., without human interference. Technological processes often are directly or indirectly affected by human action. For example, the flight of an airplane may be controlled by the pilot (direct control) or the autopilot (indirect control). Man has an irresistible need to understand and to control natural and technological processes, and searches incessantly for more or less automated means of control. The automatic control of the aforementioned processes usually needs a model. This model may either be physical (a scale model) or mathematical, i.e., a description of the process variables in terms of mathematical expressions; we will focus on the latter in the sequel. Before stating a model formulation, a profound study of the process usually needs to be performed. This study consists of collecting physical data of the process and measurements of the process variables. The first step in model formulation is the determination of the different process variables and their mutual interactions. The next step is to establish the extent to which the process variables interact, such that model and reality fit as well as possible. These steps towards a mathematical model formulation of the process are part of what is called process identification. In this treatise many aspects of identification and control of processes are reviewed. The aim of this paper is firstly to give a survey of "traditional" methods of identification and control, and secondly to point out where neural networks might prove useful. In order to avoid a high degree of abstraction, we illustrate the different features with an interesting application. The application concerns the modelling, simulation and regulation of the disease diabetes mellitus. Before discussing the identification and control aspects, we first give a concise view of the process (diabetes) under study.

2 A Simple Diabetes Model

The disease diabetes mellitus is characterized by a disturbed regulation of the blood glucose level. The hormone insulin, which is secreted by the pancreas, plays a major role in the control of the blood glucose level. One of the functions of insulin is influencing the entry of glucose into cells.
When there is a shortage of insulin, glucose is unable to enter the cells and cannot be utilized. This leads to an excess of sugar in the blood (hyperglycemia), with a consequent excretion of large volumes of urine, which leads to dehydration and intense thirst. Even though the blood glucose level is elevated, glucose is unable to enter the appetite regulating cells of the hypothalamus. A diabetic person therefore tends to be eating constantly. The deficiency of the insulin response function of the pancreas is nowadays considered the main contributor to diabetes. There are two major forms of diabetes:

juvenile-onset diabetes (I) This form of diabetes is estimated to afflict 1 in every 600 children, and appears mostly before age 20. The cells of the pancreas that manufacture insulin are destroyed.

maturity-onset diabetes (II) This form of diabetes usually arises in adults and is frequently related to the obesity of the individual. It is the most common form of diabetes.

The cause of type I diabetes is unknown; a viral infection or an auto-immune reaction are considered to be possible causes. Type II diabetes manifests itself by normal (or even high) insulin concentrations. The surface membranes of the cells are, however, less sensitive to insulin. A therapy for individuals with type I diabetes is to provide them with insulin, extracted from cattle and pig pancreatic tissue. The use of nonhuman insulin may cause allergic reactions. Promising results are reported on gene splicing (recombinant DNA) for the production of human insulin. Of the type II diabetes patients, 90% of those who lose weight do not require any medication to control their disease. In the sequel we focus on type I diabetes and in particular on the regulation of the blood glucose level. Knowledge of the process that controls the blood glucose level has already contributed to establishing dose strategies for portable insulin pumps and the development of an artificial pancreas. We will first present a simple mathematical model, which describes the relations between the insulin and glucose concentrations. This model is the starting point for a guided tour through the field of process identification and control. The mathematical model describes the time courses of the insulin and blood glucose concentrations. The aim of the model is to learn the dynamics of the process and to arrive at suitable model-based insulin injection strategies. Define the time dependent variables

    G(t) = blood glucose concentration at time t,
    H(t) = insulin concentration at time t.

Assume, without precisely specifying how, that we know a relation which describes the changes in time of the glucose and insulin concentrations. That is, assume that the change of the glucose concentration depends on the glucose concentration and the insulin concentration:
    Ġ(t) = f_1(G(t), H(t)).    (1)

Analogously, the change of the insulin concentration depends on the glucose concentration, on the insulin concentration, and on the rate u(t) at which insulin is administered, i.e.,

    Ḣ(t) = f_2(G(t), H(t)) + u(t).    (2)

Although we have no precise description of f_1 and f_2, at this point we assume such a relation to exist. Later we will see that a precise description of f_1 and f_2 is irrelevant for the formulation of a useful model. We first define the levels G_0 and H_0 at which no change in the glucose and insulin concentrations occurs (fasting levels). Then apparently

    f_1(G_0, H_0) = f_2(G_0, H_0) = 0.    (3)

The concentrations G_0 and H_0 are equilibrium concentrations. The difference g(t) between G(t) and the equilibrium level G_0 we define as

    g(t) = G(t) - G_0.    (4)

The difference h(t) between H(t) and the equilibrium level H_0 we define as

    h(t) = H(t) - H_0.    (5)

As long as g(t) and h(t) are small enough, that is, as long as G(t) and H(t) are close to their equilibrium values, it follows from (1) and (2) after linearization that

    ġ(t) = -m_1 g(t) - m_2 h(t)    (6)
    ḣ(t) = -m_3 h(t) + m_4 g(t) + u(t),    (7)

where

    m_1 = -(∂f_1/∂G)(G_0, H_0),  m_2 = -(∂f_1/∂H)(G_0, H_0),
    m_3 = -(∂f_2/∂H)(G_0, H_0),  m_4 = (∂f_2/∂G)(G_0, H_0).    (8)

The constants m_1, m_2, m_3 and m_4 are all positive. The system of linear differential equations (6), (7) (which we will refer to as "the model") may be interpreted as follows:
The change of the glucose concentration is proportional to the glucose concentration; in other words, the more glucose there is in the blood, the more glucose will be metabolized. This explains the term -m_1 g(t) in (6). Similarly, glucose metabolization is proportional to the insulin concentration, i.e., the higher the insulin concentration, the faster the glucose concentration decreases. This explains the term -m_2 h(t) in (6). A high insulin concentration implies a faster metabolism of the insulin, explaining the term -m_3 h(t) in (7). A normal functioning of the blood glucose and insulin regulation implies that an increase in the glucose concentration is followed by a higher insulin production, explaining the term m_4 g(t) in (7). Finally, the term u(t) is, as defined before, the amount of insulin that is injected per time unit. Diabetic individuals have an impaired ability to produce endogenous insulin, and in this situation the parameter m_4 is usually assumed to have the value zero. An alternative representation of the model (6), (7) is

    [ġ(t); ḣ(t)] = [-m_1  -m_2; m_4  -m_3] [g(t); h(t)] + [0; 1] u(t),    (9)

or more compactly

    ẋ = Ax + Bu
    y = Cx    (10)
    x(0) = x_0,

where

    u(t) = the "input signal",
    x(t) = (x_1(t), x_2(t))^T = (g(t), h(t))^T = the "state",    (11)
    y(t) = the "output signal",    (12)
    x_0 = (x_1(0), x_2(0))^T = (g(0), h(0))^T = the "initial state".    (13)

The set of differential equations (10) is called a linear system (referring to the proportionalities in the differential equations). The matrices A, B and C in (10) are, as is easily verified, given by

    A = [-m_1  -m_2; 0  -m_3],  B = [0; 1],  C = [1  0; 0  1],    (14)

where we assumed m_4 = 0. Note that the state x(t) and the output y(t) are identical here. This need not always be the case, as we will see later when we discuss the state estimation problem.
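The qualitative behaviour of (10) can be examined by numerical integration. The following Python sketch uses a simple forward Euler scheme; the parameter values and the initial state are invented for illustration and carry no physiological authority.

    import numpy as np

    # Illustrative (not physiological) parameter values, with m4 = 0.
    m1, m2, m3 = 0.01, 0.05, 0.05
    A = np.array([[-m1, -m2],
                  [0.0, -m3]])
    B = np.array([0.0, 1.0])

    def simulate(x0, u, t_end, dt=0.1):
        """Forward Euler integration of x' = A x + B u(t)."""
        x = np.array(x0, dtype=float)
        trajectory = [x.copy()]
        for k in range(int(t_end / dt)):
            x = x + dt * (A @ x + B * u(k * dt))
            trajectory.append(x.copy())
        return np.array(trajectory)

    # Deviation from the fasting levels, without insulin injections:
    traj = simulate(x0=[300.0, 0.0], u=lambda t: 0.0, t_end=100.0)
    print(traj[-1])    # both components decay towards the equilibrium 0

Since both eigenvalues of A are negative, the deviations g(t) and h(t) decay to zero, i.e. the concentrations return to their fasting levels.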
At this point the model (6), (7), or equivalently (10), only gives the general form of the relationship between the process variables g(t) and h(t). Only when the parameters m_1, m_2 and m_3 are known can the model be used for simulation and for determining an input signal u(t) (the insulin injection rate) which effects a desired time-course of the output signals (the insulin and glucose concentrations). For more detailed information on the model we refer to Swan (1984). In the next section we discuss how to determine the unknown parameters in the model.

3 The Parameter-Identification Problem

The starting point is the model

    ẋ = Ax + Bu
    y = Cx    (15)
    x_0 = x(0),

where the matrix A contains the unknown parameters m_1, m_2 and m_3. It is common sense to choose the parameters such that the difference between model and reality is as small as possible. Let the real (measured) glucose and insulin concentrations be

    y_r(t) = (g_r(t), h_r(t))^T,    (16)

where the subscript r refers to the real process. Let the model outcome be

    y_m(t) = (g_m(t), h_m(t))^T,    (17)

where the subscript m refers to the model. We define an error function

    e(t) = y_r(t) - y_m(t),    (18)

which is the difference at time t between reality and model outcome. The "total error" is the sum of the errors at all time instants. To avoid cancellation of positive and negative errors, we usually sum squares of errors. As the time t is a continuous variable, we do not compute a sum but an integral, viz.¹

    E = ∫_0^∞ e^T(t) e(t) dt,    (19)

which seems to be a reasonable measure of error. The parameter-identification problem now is to determine m_1, m_2 and m_3 such that E in (19) is minimized.

¹ T denotes transpose.
A method that may be used is to keep all parameters but one constant, and to vary the remaining parameter such that the minimum of (19) is found with respect to this parameter. This process may be repeated for all parameters, until E is minimal with respect to all parameters. This method is computationally extremely demanding, rather unsystematic, and not guaranteed to converge to the optimal parameter values (due to so-called local minima). More advanced methods for minimizing (19) turn out to be even more demanding. To solve the minimization of (19), we choose to use a discretized version of (15). Assume that u(t) in (15) is an arbitrary input signal; then (cf. Friedland (1986))

    x(t) = e^{At} x_0 + ∫_0^t e^{A(t-τ)} B u(τ) dτ
    y(t) = C e^{At} x_0 + C ∫_0^t e^{A(t-τ)} B u(τ) dτ,    (20)

where the time dependent matrix e^{At} is defined as

    e^{At} = I + At + A²t²/2! + A³t³/3! + ....    (21)

Assume furthermore that u(t) is "piecewise" constant, i.e.,

    u_d(t) = u(kT),  kT ≤ t < kT + T.    (22)

In other words, the signal u(t) is considered at discrete time points (lying T time units apart). At intermediate time points the signal is assumed to be constant. As the sampling time T is chosen smaller, the original signal u(t) is better approximated by the discrete time signal u_d(t). In other words: as the sampling frequency (= 1/T) is higher, the signal u_d(t) more closely follows the original signal u(t). Sampling of electrical signals is in practice accomplished by zero-order-hold devices. From (20) it follows, after application of (22), that

    x(kT + T) = e^{AT} x(kT) + (∫_0^T e^{Aη} dη) B u(kT)
    y(kT) = C x(kT).    (23)

Now define²

    A_d = e^{AT},  B_d = (∫_0^T e^{Aη} dη) B,    (24)

then from (23) (after omitting T) we have

    x(k + 1) = A_d x(k) + B_d u(k)
    y(k) = C x(k).    (25)

The model (25) is a discrete time ("sampled data") system, in contrast with the continuous time system (15).

² The subscript d refers to discrete time.
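The pair (A_d, B_d) in (24) can be computed without truncating the series (21), using the standard identity that the matrix exponential of the augmented matrix M = [A B; 0 0] contains A_d and B_d as its top blocks. A Python sketch (the parameter values are again purely illustrative):

    import numpy as np
    from scipy.linalg import expm

    def discretize(A, B, T):
        """Zero-order-hold discretization: Ad = e^(AT) and
        Bd = (integral of e^(A eta) d eta over [0, T]) B."""
        n, m = B.shape
        M = np.zeros((n + m, n + m))
        M[:n, :n] = A
        M[:n, n:] = B
        E = expm(M * T)     # top-left block: Ad, top-right block: Bd
        return E[:n, :n], E[:n, n:]

    A = np.array([[-0.01, -0.05],
                  [0.0, -0.05]])        # illustrative values, m4 = 0
    B = np.array([[0.0], [1.0]])
    Ad, Bd = discretize(A, B, T=1.0)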
For the diabetes model (9), the discretized version is of the form (25), where

    x(k) = \begin{pmatrix} x_1(k) \\ x_2(k) \end{pmatrix} = \begin{pmatrix} g(kT) \\ h(kT) \end{pmatrix}, \quad u(k) = u(kT),    (26)

with

    A_d = e^{AT} = \begin{pmatrix} \phi_{11} & \phi_{12} \\ \phi_{21} & \phi_{22} \end{pmatrix}, \quad B_d = \begin{pmatrix} \gamma_1 \\ \gamma_2 \end{pmatrix}.

From (14) and from A_d = e^{AT} it follows that \phi_{11} > 0, \phi_{22} > 0 and \phi_{21} = 0. For the moment the problem seems to have grown, since now we have 5 unknown parameters (\phi_{11}, \phi_{12}, \phi_{22}, \gamma_1 and \gamma_2) instead of the original 3 unknown parameters m_1, m_2 and m_3 to be determined. The reason, however, to rewrite the model is that the parameter estimation problem may be converted to a standard "least squares problem", as will be explained next. We note that from (26) it follows that

    x_1(k+2) = \phi_{11} x_1(k+1) + \phi_{12} x_2(k+1) + \gamma_1 u(k+1)
             = \phi_{11} x_1(k+1) + \phi_{12} (\phi_{21} x_1(k) + \phi_{22} x_2(k) + \gamma_2 u(k)) + \gamma_1 u(k+1),

but also

    x_2(k) = \frac{x_1(k+1) - \phi_{11} x_1(k) - \gamma_1 u(k)}{\phi_{12}},

so that, since x_1(k) = y_1(k), it is easily seen that

    y_1(k+2) - a_1 y_1(k+1) - a_2 y_1(k) - b_1 u(k+1) - b_2 u(k) = 0,    (27)

with

    a_1 = \phi_{11} + \phi_{22}
    a_2 = -\phi_{11}\phi_{22} + \phi_{12}\phi_{21}
    b_1 = \gamma_1
    b_2 = \phi_{12}\gamma_2 - \phi_{22}\gamma_1,    (28)

and \phi_{11} > 0, \phi_{22} > 0, \phi_{21} = 0. The parameter identification problem now is transformed to the identification of (27). The model (27) is of the so-called ARMA-type³, which is also widely used for economic time-series analysis.

³ ARMA = autoregressive moving average.
Define the parameter vector \theta_1 = (a_1, a_2, b_1, b_2)^T (T denotes transpose); then (27) reads as

    y_1(k) = \psi_1^T(k) \theta_1 + e_1(k; \theta_1),

where the entries of the vector

    \psi_1^T(k) = (y_1(k-1), y_1(k-2), u(k-1), u(k-2))    (29)

are the observations (measurements). For a certain choice of the parameter vector \theta_1 there is a difference between the real y_1(k) and \psi_1^T(k)\theta_1; this difference defines the error e_1(k; \theta_1). Taking the process variable y_1 at a number of (discrete) time points, it follows immediately from (29) that

    y_1(n)   = \psi_1^T(n) \theta_1 + e_1(n; \theta_1)
    y_1(n+1) = \psi_1^T(n+1) \theta_1 + e_1(n+1; \theta_1)
    \vdots
    y_1(N)   = \psi_1^T(N) \theta_1 + e_1(N; \theta_1),

or (more concisely)

    y_1 = \Psi_1 \theta_1 + \epsilon_1(N; \theta_1),    (30)

with

    y_1 = \begin{pmatrix} y_1(n) \\ \vdots \\ y_1(N) \end{pmatrix}, \quad \Psi_1 = \begin{pmatrix} \psi_1^T(n) \\ \vdots \\ \psi_1^T(N) \end{pmatrix}, \quad \epsilon_1(N; \theta_1) = \begin{pmatrix} e_1(n; \theta_1) \\ \vdots \\ e_1(N; \theta_1) \end{pmatrix}.

We now want to minimize the error

    J_1(\theta_1) = \sum_{k=n}^{N} e_1^2(k; \theta_1) = \epsilon_1^T(N; \theta_1) \epsilon_1(N; \theta_1)    (31)

with respect to the parameter vector \theta_1. This is a standard least squares minimization problem, where the optimal parameter vector \hat{\theta}_1 is found from the so-called normal equation

    \Psi_1^T \Psi_1 \hat{\theta}_1 = \Psi_1^T y_1.    (32)

Provided the matrix \Psi_1^T \Psi_1 is non-singular (i.e., has an inverse), the optimal parameter vector follows from

    \hat{\theta}_1 = (\Psi_1^T \Psi_1)^{-1} \Psi_1^T y_1.    (33)

For determining the optimal parameter vector, measurements of the real process need to be available. These data are collected in the matrix \Psi_1 and the vector y_1.
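As an illustration of (29)-(33), a hedged sketch: the routine below builds the regression matrix \Psi_1 from recorded samples and solves the normal equation (32) with a least-squares solver rather than the explicit inverse in (33), which is numerically preferable. All names are ours.

```python
import numpy as np

def identify_arma(y, u):
    """Least-squares fit of the ARMA model (27):
    y(k) = a1*y(k-1) + a2*y(k-2) + b1*u(k-1) + b2*u(k-2).
    Returns theta = (a1, a2, b1, b2)."""
    rows, rhs = [], []
    for k in range(2, len(y)):
        rows.append([y[k-1], y[k-2], u[k-1], u[k-2]])  # psi_1^T(k), cf. (29)
        rhs.append(y[k])
    Psi = np.array(rows)
    # Solves the normal equation (32) in a numerically stable way.
    theta, *_ = np.linalg.lstsq(Psi, np.array(rhs), rcond=None)
    return theta

# Measurements from Section 4 (glucose g and input u). Note: this short
# record gives only three equations for four unknowns; a longer record
# is needed for a unique estimate.
g = [300.0, 299.4245, 297.9771, 295.7406, 293.1109]
u = [200.0, 381.7082, 172.1760, 141.2215, 191.4969]
print(identify_arma(g, u))
```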
For the diabetes model the measurements are related to the input signal u (the insulin injection rate) and the signal y_1 (the glucose concentration). Not every choice of the input signal, however, guarantees that the matrix \Psi_1^T \Psi_1 in (32) has an inverse. In the literature this problem is related to so-called "persistent excitation", which means that the system has to be activated at a sufficiently high level in order to learn its characteristics. For this, sinusoid input signals are usually taken. More details on the system identification problem may be found in Franklin and Powell (1980). In the next section we return, after this more or less general discussion of the identification of models of the type (15), to the diabetes model.

4 Parameter Identification and Simulation

For the parameter identification of the diabetes model (9) we apply a rather arbitrary input signal

    u(t) = 200 + 100 \left( \sin\frac{2\pi t}{3} + \sin\frac{2\pi t}{5} \right), \quad t \ge 0, \quad [mg/dl per minute]    (34)

which excites the system sufficiently. The sampling time is chosen T = 1 minute. The signals g and u are measured for four minutes, giving

    g(0) = 300, g(1) = 299.4245, g(2) = 297.9771, g(3) = 295.7406, g(4) = 293.1109 [mg/dl]
    u(0) = 200, u(1) = 381.7082, u(2) = 172.1760, u(3) = 141.2215, u(4) = 191.4969 [mg/dl per minute].

The measurements are depicted in Figure 1. Using the method in Section 3, we find

    \hat{\theta}_1 = \begin{pmatrix} 1.9584 \\ -0.9585 \\ -0.0015 \\ -0.0015 \end{pmatrix}.    (35)

From (28) it follows that

    \phi_{11} = 0.9991, \quad \phi_{12} = -0.0030, \quad \phi_{21} = 0, \quad \phi_{22} = 0.9593,
    \gamma_1 = -0.0015, \quad \gamma_2 = 0.9795,

so that

    A_d = \begin{pmatrix} 0.9991 & -0.0030 \\ 0 & 0.9593 \end{pmatrix}, \quad B_d = \begin{pmatrix} -0.0015 \\ 0.9795 \end{pmatrix}.    (36)
Fig. 1. The measurements.

Since T = 1, we have A_d = e^{AT} = e^A, and it follows that

    A = \log A_d = \begin{pmatrix} -0.0009 & -0.0031 \\ 0 & -0.0415 \end{pmatrix}.    (37)

We furthermore have

    B_d = \int_0^1 e^{A\eta} \, d\eta \, B,    (38)

from which we may derive that

    B = (e^A - I)^{-1} A B_d = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.

For the diabetes model we thus find

    \frac{d}{dt} \begin{pmatrix} g \\ h \end{pmatrix} = \begin{pmatrix} -0.0009 & -0.0031 \\ 0 & -0.0415 \end{pmatrix} \begin{pmatrix} g \\ h \end{pmatrix} + \begin{pmatrix} 0 \\ 1 \end{pmatrix} u.    (39)

We note, without further explanation, that we did not use the insulin concentration h(t) for the identification of the diabetes model.

The identified model may be used for simulation on a digital computer. For this, a simulation package is needed, which in fact generates a numerical solution to differential equations like (39). We used MATLAB for these simulations. To illustrate matters we perform two simulation experiments.
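The text used MATLAB; an equivalent sketch in Python (our own code and names) integrates (39) numerically for the two experiments that are described next.

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[-0.0009, -0.0031],
              [0.0,     -0.0415]])
B = np.array([0.0, 1.0])

def glucose_insulin(t, x, u):
    # Right-hand side of (39): x = (g, h), u = insulin injection rate.
    return A @ x + B * u(t)

# Experiment 1: no insulin injection, g(0) = 300, h(0) = 0 (cf. Figure 2).
sol = solve_ivp(glucose_insulin, (0.0, 200.0), [300.0, 0.0],
                args=(lambda t: 0.0,), max_step=1.0)
print(sol.y[0, -1])  # glucose concentration after 200 minutes

# Experiment 2: constant injection rate u(t) = 100 (cf. Figure 3).
sol2 = solve_ivp(glucose_insulin, (0.0, 200.0), [300.0, 0.0],
                 args=(lambda t: 100.0,), max_step=1.0)
```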
Assume, for instance, that the initial glucose concentration is 300 mg/dl and that the insulin concentration is 0 mg/dl (note: these concentrations are with respect to fasting levels). At this moment we refrain from injecting insulin, so that

    g(0) = 300, \quad h(0) = 0, \quad u(t) = 0.

Figure 2 shows that the glucose concentration slowly decreases. Next we study the effect of a constant insulin injection rate of 100 mg/dl per minute, or u(t) = 100. We take g(0) = 300 and h(0) = 0. The result of simulation of (39) is shown in Figure 3. The validity of the model is limited and, as always, one needs to be careful with the interpretation of simulation results.

Fig. 2. The glucose concentration for u(t) = 0.

In the next section we address the problem of determining an input signal u(t) which effects a desirable system response.

5 Process Control by Pole Placement

In nature and in technology the control of processes is usually based on feedback. The feedback principle is easily explained by means of a block diagram as in Figure 4. In this block diagram u is the input signal and y the output signal. The dynamic behaviour of the system (the plant P) is for example described by a set of differential equations like (39).
Fig. 3. The insulin and glucose concentrations for u(t) = 100.

Fig. 4. A block diagram of the system.

We pose the question how to choose u such that the output signal y has a desired shape, say r(t). It seems reasonable to adjust the input signal on the basis of the error e(t) = y(t) - r(t). Now consider Figure 5. Here the output signal y is compared with the (reference) signal r. The difference e is an input signal for the so-called controller K (this is also a system!), which generates an input signal u for the plant P. In Figure 5 we see that the output signal y is used to establish the input signal u, which explains the feedback principle (in fact, the output is fed back to the input). The problem of determining a suitable input signal u now boils down to the design of a controller K. We return to the diabetes model (39):

    \frac{d}{dt} \begin{pmatrix} g \\ h \end{pmatrix} = \begin{pmatrix} -0.0009 & -0.0031 \\ 0 & -0.0415 \end{pmatrix} \begin{pmatrix} g \\ h \end{pmatrix} + \begin{pmatrix} 0 \\ 1 \end{pmatrix} u,    (40)
or (cf. (15))

    \dot{x} = Ax + Bu, \quad y = Cx.    (41)

Fig. 5. The feedback system.

The "speed" (bandwidth) of the output signal y depends on the eigenvalues of the matrix A. For the diabetes model these eigenvalues are \lambda_1 = -0.0009 and \lambda_2 = -0.0415, implying that the output signal y consists of terms e^{-0.0009t} and e^{-0.0415t}. These terms decrease very slowly for increasing values of t, so that the uncontrolled system has a slow response (cf. Figure 2).

Now assume that in Figure 5 the reference signal is taken r(t) = 0, the output signal is the state x (i.e., y = x, or C = I in (41)) and the controller K is a constant matrix (to be determined). It follows from

    u = Ke = -Ky = -Kx,    (42)

and (41), that

    \dot{x} = (A - BK)x.    (43)

The speed of the output signal y is now determined by the eigenvalues of the matrix A - BK. By making an appropriate choice for K, we can (under some additional controllability conditions) locate the eigenvalues of the feedback system at desirable places. Algorithms for determining the control matrix K, given the desired eigenvalues of A - BK, are found in the literature (cf. Friedland (1986)). This method of changing the eigenvalues of the open loop system matrix A via feedback is called pole placement.

By taking K = (0 0), the eigenvalues of A - BK just equal those of A, i.e., for the diabetes model \lambda_1 = -0.0009 and \lambda_2 = -0.0415. If we decide to choose the eigenvalues of A - BK equal to \lambda_1 = -1 and \lambda_2 = -2, it follows (by use of a pole placement algorithm) that

    K = K_1 = (-644.2906, \; 2.9576).

If we take, as before, for the initial conditions g(0) = 300 and h(0) = 0, we find (cf. Figure 6) that g and h rapidly decrease to a constant zero level.
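For readers who wish to reproduce such designs numerically, a sketch (our code, not part of the original text): scipy's pole placement routine computes K for prescribed closed-loop eigenvalues. Small deviations from the printed gains are to be expected, since the model entries in (39) are rounded.

```python
import numpy as np
from scipy.signal import place_poles

A = np.array([[-0.0009, -0.0031],
              [0.0,     -0.0415]])
B = np.array([[0.0],
              [1.0]])

# Place the eigenvalues of A - BK at -1 and -2 (design K1 in the text).
res = place_poles(A, B, [-1.0, -2.0])
K1 = res.gain_matrix
print(K1)                              # approx. (-644.29, 2.9576)
print(np.linalg.eigvals(A - B @ K1))   # check: -1, -2
```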
Fig. 6. Response of the feedback system with controller K_1.

In other words, the reference signal r = 0 is tracked rapidly by the closed-loop system. There is however a price: the input signal (the insulin injection rate) takes excessively large values, up to about 2 \cdot 10^5 mg/dl insulin per minute! This does not seem to be a very realistic choice for the input signal.

It is even possible to destabilize the diabetes model (note that the diabetes model by itself has negative eigenvalues and, hence, is stable). If we choose K such that the eigenvalues of A - BK are \lambda_1 = 0.01 and \lambda_2 = -0.01, then the output signal will have terms e^{0.01t} and e^{-0.01t}; the first term will grow without bound. For this choice of K, the system is unstable. The pole placement algorithm yields

    K = K_2 = (0.0320, \; -0.0424).

The result is plotted in Figure 7, which shows the instability of g and h. This design clearly is not a realistic control for the feedback system.

The first design (K_1) resulted in a stable, fast feedback system, but the excessive input signal values made the design useless. We now try to design a slower (stable) feedback system, by choosing the eigenvalues smaller in magnitude, for instance \lambda_1 = -0.05 and \lambda_2 = -0.1. The pole placement algorithm gives

    K = K_3 = (-1.5696, \; 0.1076).

The simulation result is given in Figure 8; the maximum value of the input signal is about 471 mg/dl per minute, which is a much better result than for the first controller design.

The question arises whether it is possible to choose K in some sense optimally. Of course, what is optimal needs to be defined. To this end, we
define a performance criterion, which measures the performance of the system. The aim is to choose a controller K which accomplishes an optimal (usually: minimal) performance.

Fig. 7. Response of the feedback system with controller K_2.

For the diabetes model, assume that we want to keep the glucose concentration close to a constant level (say g_d), without administering excessive quantities of insulin. The criterion could take the following form:

    c = \int_0^\infty \left[ (g - g_d)^2 + \rho u^2 \right] dt,    (44)

where the weighting factor \rho determines the extent to which large volumes of insulin are penalized. We want to determine the input signal u such that the criterion c in (44) is minimal. More generally, the problem of determining an optimal control law for a system as in (41) is formulated as

    \min_u \int_0^\infty \left[ (x - x_d)^T Q (x - x_d) + u^T R u \right] dt,    (45)

where Q and R are suitably chosen weighting matrices and x_d is the desired level of the state variable x. The solution of this so-called linear quadratic optimal control problem (the regulator problem) turns out to be a feedback control law:

    u_{opt}(t) = -Kx(t) - k,    (46)

where

    K = R^{-1} B^T P, \quad k = R^{-1} B^T (A^T - PBR^{-1}B^T)^{-1} Q x_d,    (47)

and P a positive definite solution of the algebraic Riccati equation:

    PA + A^T P - PBR^{-1}B^T P + Q = 0.    (48)
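A numerical sketch of the design equations (46)-(48) (our code, not the authors'; we let Q weight only the glucose state, as in (44), and form the affine term k as in (47)):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[-0.0009, -0.0031],
              [0.0,     -0.0415]])
B = np.array([[0.0], [1.0]])
Q = np.diag([1.0, 0.0])            # penalizes (g - g_d)^2 only, cf. (44)
R = np.array([[10.0]])             # rho = 10
x_d = np.array([[100.0], [0.0]])   # desired glucose level g_d = 100 mg/dl

# Solve the Riccati equation (48) and form K and k as in (47).
P = solve_continuous_are(A, B, Q, R)
K_opt = np.linalg.solve(R, B.T @ P)
S = P @ B @ np.linalg.solve(R, B.T)          # P B R^{-1} B^T
k = np.linalg.solve(R, B.T @ np.linalg.solve(A.T - S, Q @ x_d))
print(P, K_opt, k)  # compare with the values reported below
```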
Fig. 8. Response of the feedback system with controller K_3.

As an example we take in (44) \rho = 10 and g_d = 100 mg/dl. From (47) and (48) it follows that

    P = \begin{pmatrix} 58.5694 & -2.9909 \\ -2.9909 & 0.1830 \end{pmatrix}, \quad K_{opt} = (-0.2991, \; 0.0183), \quad k = 31.5998.

Starting with g(0) = 300 and h(0) = 0, we find a result as given in Figure 9. This result is highly satisfactory: g(t) \to 100 for large values of t, and u(t) remains within reasonable bounds.

The control law we used in (46) is of a feedback type, which presumes the availability of measurements of all the state variables. In practice this is hardly ever the case. For the diabetes model it is conceivable that g(t) is available for measurement but h(t) is not. The question is how to control the process (optimally) under this restriction. An answer to this question is given in the next section, where we deal with state estimators (observers).
Fig. 9. Optimal response of the controlled system.

6 State Estimation

Consider again the diabetes model (39) and assume that the output signal y is the glucose concentration, i.e.,

    \frac{d}{dt} \begin{pmatrix} g \\ h \end{pmatrix} = \begin{pmatrix} -0.0009 & -0.0031 \\ 0 & -0.0415 \end{pmatrix} \begin{pmatrix} g \\ h \end{pmatrix} + \begin{pmatrix} 0 \\ 1 \end{pmatrix} u
    y = (1 \;\; 0) \begin{pmatrix} g \\ h \end{pmatrix} = g.

Note that now C = (1 0) in (15). We introduce, related to the state equations (41), the state estimation equation (observer)

    \dot{\hat{x}}(t) = A\hat{x}(t) + Bu(t) + L(y(t) - C\hat{x}(t)).    (49)

\hat{x}(t) is an estimation of the state x(t). The term y(t) - C\hat{x}(t) is the difference of the output signal and the estimated output signal. The error, that is the difference of real and estimated state, then is

    e(t) = x(t) - \hat{x}(t).    (50)

Combining (41), (49) and (50) it follows immediately that the error signal e(t) satisfies the differential equation

    \dot{e}(t) = (A - LC)e(t).    (51)
Note that the dynamics of the error signal are determined by the matrix A - LC and in particular by its eigenvalues. As with the pole placement principle of Section 5, we can (under certain observability conditions) locate the eigenvalues of A - LC at desirable places. For example, assume that the eigenvalues of A - LC are placed at -0.1 and -0.2. Then the pole placement algorithm gives

    L = L_1 = \begin{pmatrix} 0.2576 \\ -2.9910 \end{pmatrix}.

For our simulation experiment we choose an arbitrary input signal and we assume that the (unknown) initial condition is given by

    x(0) = \begin{pmatrix} 300 \\ 500 \end{pmatrix}.

The second component of the state is unknown; we take as an estimation of the initial condition

    \hat{x}_0 = \begin{pmatrix} 300 \\ 0 \end{pmatrix}.

The result of the state estimation is shown in Figure 10.

Fig. 10. Estimation of the concentrations with Observer 1.

Likewise, we can place the eigenvalues of A - LC at -0.4 and -0.45, by taking

    L = L_2 = \begin{pmatrix} 0.8076 \\ -47.2410 \end{pmatrix}.
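Observer gains can be obtained with the same pole placement routine through duality: the eigenvalues of A - LC equal those of A^T - C^T L^T. A sketch, with our own variable names:

```python
import numpy as np
from scipy.signal import place_poles

A = np.array([[-0.0009, -0.0031],
              [0.0,     -0.0415]])
C = np.array([[1.0, 0.0]])   # only the glucose concentration is measured

# Duality: placing the eigenvalues of A - LC is pole placement for (A^T, C^T).
res = place_poles(A.T, C.T, [-0.1, -0.2])
L1 = res.gain_matrix.T       # observer gain, a 2x1 column vector
print(L1)                              # approx. (0.2576, -2.99)^T
print(np.linalg.eigvals(A - L1 @ C))   # check: -0.1, -0.2
```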
With the same input signal, the same initial conditions and the same initial estimate as in the previous design, we find what is plotted in Figure 11. Observe that the second observer is faster than the first. This is obvious, since the second observer has "faster" eigenvalues.

Fig. 11. Estimation of the concentrations with Observer 2.

It is therefore natural to ask whether we can design an arbitrarily fast observer (by choosing negative eigenvalues with sufficiently large magnitude). The answer is negative, for at least two reasons:

1. a fast observer requires sampling of the signals at a high rate, which is practically limited (hardware limitations);
2. a fast observer amplifies model and measurement errors (noise).

For these reasons a limited speed (bandwidth) of the estimator is required. In the next section we consider the modelling of error sources and their implications for the design of an optimal state estimator.

7 Stochastic Systems

This section deals with the modelling of possible differences between model results and reality. These differences may be caused by the following error sources:

- changing process characteristics
- unmodelled non-linearities
- changing process parameters
- sensor (measurement) errors and other disturbances.

We expand the linear model

    \dot{x} = Ax + Bu, \quad y = Cx    (52)

by adding noise terms, v and w, to the state differential equation and the output equation, respectively, so that

    \dot{x} = Ax + Bu + v, \quad y = Cx + w,    (53)

where v and w are stochastic processes⁴. Though mathematically far from trivial, it is assumed for computability reasons that the processes v and w are stationary, gaussian distributed and mutually uncorrelated white noise processes. The statistical properties of these processes are assumed to have zero mean and intensities (say, variances) Q_v and Q_w. Furthermore the initial state x_0 of the stochastic system has mean \bar{x}_0 and variance P_0, and the initial state is uncorrelated with v and w. Summarizing these assumptions we have⁵ ⁶

    E\{v(t)\} = 0, \quad E\{w(t)\} = 0,
    R_v(\tau) = E\{v(t)v^T(t-\tau)\} = Q_v \delta(\tau), \quad R_w(\tau) = E\{w(t)w^T(t-\tau)\} = Q_w \delta(\tau),    (54)
    E\{x(0)\} = \bar{x}_0, \quad E\{(x(0) - \bar{x}_0)(x(0) - \bar{x}_0)^T\} = P_0.

A further explanation of the characteristics of the stochastic processes v and w is beyond the scope of this treatise; we refer for more details to Kwakernaak and Sivan (1972). For the diabetes model we choose

    Q_v = \begin{pmatrix} 1 & 0 \\ 0 & 100 \end{pmatrix} \quad and \quad Q_w = 2500,

which gives a time course of the glucose and insulin concentrations as in Figure 12. The state x and the output signal y are disturbed under the influence of

⁴ The main characteristic of a stochastic process is that the present knowledge of the state of the process is not enough to predict the future evolution of the process in time. For lack of better, stochastic processes are often modelled by statistical parameters such as the mean and the variance (or better: the mean vector and the covariance matrix). The reason is that it is not possible to obtain the probability densities or distributions in practice. Often, "gaussian" processes are assumed, for which the above mentioned parameters apply.
⁵ E\{v(t)\} denotes the expected value of the process v at time t. If p_{v(t)}(v) is the probability density function of the stochastic process v, then E\{v(t)\} = \int_{-\infty}^{\infty} v \, p_{v(t)}(v) \, dv.
⁶ \delta(\cdot) is the Dirac-delta function. For more details we refer to Kailath (1980).
the noise terms, which impedes adequate estimation with an estimator as described in Section 6. The disturbed state now is to be estimated on the basis of the disturbed output signal y, which is available for measurement. Actually, we are searching for an estimator \hat{x}(t) of the state that minimizes the difference between the real state x(t) and \hat{x}(t). Since x and y are stochastic processes, it is the expectation of the difference, given the measured output signal, that is minimized. Apparently, the problem is the minimization of the conditional variance of the estimation error

    E\{(x(t) - \hat{x}(t))(x(t) - \hat{x}(t))^T \mid y(\tau), \tau \le t\}.    (55)

R.E. Kalman, one of the founders of modern control theory, found in the early 60's that, under certain conditions (not explicitly stated here), the state estimator is of the observer type as discussed in Section 6. This state estimator, which is called the Kalman filter, is described by

    \dot{\hat{x}} = A\hat{x} + Bu + L_{opt}(y - C\hat{x}), \quad \hat{x}(0) = \bar{x}_0,    (56)

where

    L_{opt} = PC^T Q_w^{-1},    (57)

and where P is the non-negative definite solution of the Riccati equation

    0 = AP + PA^T - PC^T Q_w^{-1} C P + Q_v.    (58)

We apply this result to the process as plotted in Figure 12. From (57) and (58) we find

    L_{opt} = \begin{pmatrix} 0.1992 \\ -0.0062 \end{pmatrix}.

In Figures 13 and 14 the effect of the Kalman filter is clear: noise influences are strongly filtered.

The next question is whether it is possible to optimally control a system which is afflicted by noise. In analogy with (45) we define a stochastic optimal control problem

    \min_u E\left\{ \int_0^\infty \left[ (x - x_d)^T Q (x - x_d) + u^T R u \right] dt \right\}.    (59)

The solution to this problem is surprisingly simple. This is due to the so-called separation principle:

1. Compute an optimal control law for the system while neglecting the noise terms. Differently stated: determine an optimal control law based on (52) instead of (53). The optimal control law with respect to (59) is given by (46), (47) and (48). Note that we use state feedback in (46).
2. Estimate the state x with the Kalman filter and use the estimate \hat{x} for the optimal control law in 1.
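The filter equations (57), (58) are the duals of the control equations (47), (48), so the same Riccati solver can be used. A sketch, under the noise intensities assumed above (our code):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[-0.0009, -0.0031],
              [0.0,     -0.0415]])
C = np.array([[1.0, 0.0]])
Qv = np.diag([1.0, 100.0])     # state noise intensity
Qw = np.array([[2500.0]])      # measurement noise intensity

# Solving the dual Riccati equation for (A^T, C^T, Qv, Qw) yields P with
# AP + PA^T - PC^T Qw^{-1} CP + Qv = 0, i.e. equation (58).
P = solve_continuous_are(A.T, C.T, Qv, Qw)
L_opt = P @ C.T @ np.linalg.inv(Qw)   # Kalman gain (57)
print(L_opt)
```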
Fig. 12. Noise-disturbed glucose and insulin concentrations.

Fig. 13. The real and the estimated glucose concentration.
Fig. 14. The real and the estimated insulin concentration.

Thus, the stochastic system

    \dot{x} = Ax + Bu + v, \quad y = Cx + w    (60)

is optimally controlled by

    u_{opt} = -K_{opt}\hat{x} - k,    (61)

where

    \dot{\hat{x}} = A\hat{x} + Bu_{opt} + L_{opt}(y - C\hat{x}), \quad \hat{x}(0) = \bar{x}_0.    (62)

Thus we find for the diabetes model, after combination of the optimal controller of Section 5 and the Kalman filter of the present section, a result as in Figure 15. The glucose level is reasonably close to 100 mg/dl, as desired. The input signal remains within acceptable bounds.

In this and the previous sections we studied the identification and control of processes for which it was possible to formulate a reasonably adequate model, in terms of linear differential equations (stochastic or not). In the next section we present some other identification and control principles.
Fig. 15. Optimal response of the stochastic system.

8 Other Types of Process Control

In the previous sections we discussed how to control a process, based on a linear model of the dynamics of the process. So far, the main control objectives were stability of the feedback system and reference signal tracking. In this section we give a concise view of other types of process control. Many details will be missing, for which we refer to the references.

In practice, the control of (mainly chemical) processes is seldom based on a model of the process. Instead, the process variables are tuned individually (that is, independently of each other) by so-called PID-controllers⁷. Such controllers produce a control signal u for the process, which is based on the error e (the difference of the process variable and a desired setpoint), according to the equation

    u(t) = K_P e(t) + K_I \int_0^t e(\tau) \, d\tau + K_D \dot{e}(t).    (63)

The parameters K_P, K_I and K_D in (63) are tunable and they are to be determined on the basis of the specifications and characteristics of the process. The term K_P e(t) in (63) describes a proportional action, which is similar to that imposed by pole placement as described in Section 5 (cf. for example (42)). Large values of K_P may destabilize the system. Destabilization is avoided by including a derivative action, which is the term K_D \dot{e}(t) in (63). The derivative action might prevent the error from becoming zero, which would cause a permanent difference between setpoint and process variable. This effect is countered by adding an integral action, represented by the term K_I \int_0^t e(\tau) \, d\tau in (63). The parameters of the PID-controller are chosen such that specifications such as the settling time (time to reach the setpoint), overshoot (largest error) and stability margins (how far is the system from instability?) are met. Industrial PID-controllers usually apply the Ziegler-Nichols procedure for tuning the parameters K_P, K_I and K_D. PID-controllers are commercially available in digital form.

⁷ "Proportional-Integral-Derivative" controllers are tunable controllers with one input signal and one output signal.
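A discrete-time rendering of the PID law (63) is a common implementation; the sketch below is ours, with hypothetical gain values, approximating the integral by a running sum and the derivative by a backward difference.

```python
class PID:
    """Discrete approximation of the PID law (63) with sampling time T."""
    def __init__(self, Kp, Ki, Kd, T):
        self.Kp, self.Ki, self.Kd, self.T = Kp, Ki, Kd, T
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measurement):
        e = setpoint - measurement
        self.integral += e * self.T                  # approximates the integral term
        derivative = (e - self.prev_error) / self.T  # approximates de/dt
        self.prev_error = e
        return self.Kp * e + self.Ki * self.integral + self.Kd * derivative

# Hypothetical tuning; in practice Kp, Ki, Kd follow from e.g. Ziegler-Nichols.
controller = PID(Kp=2.0, Ki=0.1, Kd=0.5, T=0.1)
u = controller.step(setpoint=100.0, measurement=95.0)
```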
Often, system specifications are easier to state in terms of the frequency response function of the feedback system. The frequency response function of a (linear!) system can be obtained, for example, by measuring the response of the system to different sinusoid-like signals. The input-output behaviour of the system then is not described by a model like (10), but by its frequency response function. Based on the frequency response of the open loop system many (mainly graphical) tools exist to design stabilizing controllers. Also other specifications, like the bandwidth (speed) of the feedback system, are easily translated to specifications for the open loop frequency response. For "shaping" a frequency response, simple tunable dynamic controllers such as "lead" and "lag" controllers are commonly used. The frequency response techniques belong to the domain of "classical control", since these techniques were used on a large scale far before the advent of "modern control", as described partly in the previous sections.

The design of so-called robust controllers renewed the interest of modern control in classical techniques. A robust controller is a controller which is based on a nominal model of the system to be controlled. This controller stabilizes the system even under significant model uncertainties. These model uncertainties are usually more easily modelled in terms of frequency response functions. The field which is concerned with the development of robust controllers is denoted by "H_\infty optimal control". It is in fact a symbiosis of classical and modern control: "neoclassical control".

Another branch of control technology deals with the design of adaptive controllers. These controllers act on (linear) systems that operate in a large uncertainty band (caused by, for example, changing parameters). For this type of system a unique parameter identification is insufficient for obtaining satisfactory controllers. In adaptive control one distinguishes two approaches to the control problem:

1. Indirect control: the parameters of the system are estimated on line, and the control parameters are tuned accordingly.
2. Direct control: there is no parameter identification. The parameters of the controller are adapted according to a criterion.

Widely used adaptive control systems are so-called model reference adaptive control systems (MRAS) and self-tuning controllers. Model reference adaptive control systems are based on a mathematical model of the desired system behaviour. The parameters of the controller are adapted such that the behaviour
of the real system is directed towards the desired (reference) behaviour. Self-tuning controllers are essentially indirect controllers: the parameters of the controller are adapted according to an on-line parameter identification. Here, at every instant, the control is based on some optimal or other control principle.

For the control of robots, where it is possible to derive reasonably detailed nonlinear kinetic equations, results from nonlinear control are applied. Linearization of these equations for this kind of application usually is unsatisfactory, because of the large range of the process variable values. Much attention has been paid to linearizing controllers (dynamic or not), that make the feedback system behave linearly. For the control of these linearized feedback systems the linear methods previously discussed are available.

So far we discussed control and identification methods that have, to a certain extent, reached maturity. In the next section we go, in bird's-eye view, into the role that neural networks may play in process control and identification. As this research area has only recently started to develop, we restrict ourselves to a brief discussion of some striking journal papers.

9 Identification and Control by Neural Networks

Apart from other types of artificial intelligence, like expert systems and the use of fuzzy sets, neural networks have in recent years obtained an increasing interest of control engineers. This interest may primarily be explained from the sometimes restricted usefulness of more traditional control and identification methods. Traditional methods fail where the process is difficult to model, as in vision, speech and pattern recognition. These tasks are easily performed by a human being, but are difficult to embody in an algorithm.

There is a notable difference between human control and machine control. The human being uses an enormous amount of sensorial information to plan and execute his control tasks. This is in contrast to industrial controllers. The reason is not a lack of suitable sensors, but the restricted processing capacities of the industrial controllers. Furthermore, a human being is able to process with high speed enormous amounts of information in parallel. Industrial controllers are sequential and slow. Finally, and this is perhaps the most striking difference between man and machine, human control is based on learning, whereas machine control is based on a predefined algorithm. The design of an algorithmic controller demands a thorough knowledge of the process under control, which in practice often is difficult to obtain.

Neural networks were developed in analogy to the functioning of the human brain. The following three factors of an artificial neural network are particularly interesting for identification and control:

1. its capacity to process large amounts of sensorial information,
2. parallel processing, and
3. adaptation (learning).
In the previous section we gave a brief discussion of adaptive controllers. These adaptive controllers usually assume that the system to be controlled and the controller are linear. A neural controller, as a matter of fact, is a nonlinear adaptive controller. It is nonlinear because it is built up in layers of nonlinear elements (neurons). It is adaptive because the parameters (the weights of the interconnects) are adapted according to a learning rule.

In order to give a flavour of the applications of neural networks for identification and control, we give a concise view of a number of recent publications in this field. Of special interest are those in the special issues of the IEEE Control Systems Magazine, April 1988, 1989, 1990 and 1992.

In Kraft and Campagna (1990) a comparison is made of a special type of neural network (cerebellar model articulation controller, CMAC), a self-tuning adaptive controller and a model reference adaptive controller (MRAC). A superior controller is not found in this comparative study. A positive characteristic of the neural controller is its robustness in the presence of model errors (like nonlinearities). A negative characteristic of the neural controller, in comparison with the other adaptive controllers, is its slow learning of the system's behaviour.

In Chu et al. (1990) examples of system identification are given for the Hopfield model. In Bavarian (1988) a Hopfield network is likewise applied for the implementation of analog-to-digital signal conversion (this A/D-conversion is very common in control technology).

Most publications on applications of neural networks for identification and control are based on the backpropagation algorithm. An extremely illuminative presentation of the algorithm may be found in Narendra and Parthasarathy (1990), where many simple non-linear identification and control problems are solved with neural networks. In Li et al. (1989) a neural network is taught how to control the shape of a robot hand for grasping an object. The learning is based on object characteristics like width and diameter. In Passino et al. (1989) the significance of neural networks for discrete-event systems is discussed, stressing the conversion of numeric to symbolic data. In Nguyen and Widrow (1990) the backing up of a trailer truck is investigated. First, the dynamics of the truck and the trailer are learned by the neural network (identification). Next a neural controller is trained to back up the truck and trailer correctly. In Bhat et al. (1990) some nonlinear static and dynamic chemical processes are modelled with neural networks.

The book "Neural Networks for Control" by Miller et al. (1990) is a collection of papers that are organized in three major sections: General Principles, Motion Control and Application Domains. The emphasis of the book is on artificial neural network methods for optimization over time and on reinforcement, with applications to control. The focus is mainly on robotic control; however, other domains are covered as well. Very recently a special issue of the IEEE Control Systems Magazine (1992) appeared, consisting of papers that present as varied and current as possible a picture of the research in the field. The papers are introduced by Antsaklis (1992).
In all the publications just mentioned the benefits of the neural networks became apparent, viz., their robustness against model uncertainties, their ability to generalize, and their speed (provided they are implemented on parallel hardware). As a serious objection against neural networks, many publications stress the lacking theoretical grounds. As a matter of fact, a guarantee that the network does what it should do is generally missing. This is a consequence of the complicated analysis of nonlinear dynamic systems. Furthermore, the convergence of the weights in the network is usually slow and often cannot be guaranteed in advance. The optimization of the weights often stops in a non-global minimum.

So far, the use of neural networks for process identification and control is rather arbitrary; a general concept does not exist (yet). It is too early to estimate the significance of neural networks for process identification and control. They are, however, very challenging for researchers, laying the message of Antsaklis (1990) to heart: "Neural networks in control must be studied by using mathematical rigor in the tradition of our discipline [control theory]. Only in this way can we harvest the full benefits of these powerful new tools. Only in this way can we create something lasting and useful for the years to come."

10 Conclusion

In this treatise we considered many aspects of process identification and control. The identification problem is one of determining an appropriate mathematical model of the process under study and of determining the parameters of the model. Often, one assumes that the process (which generally may be modelled as a set of nonlinear differential equations) operates in the neighbourhood of an operating (equilibrium) point. In this neighbourhood the process behaves linearly. By linearization about an operating point, the nonlinear differential equations convert to linear differential equations: a linear system. The identification and control of linear systems is well-understood. The parameter identification often follows from a signal discretization and the solution of a least squares problem. An identified model (linear or not) may serve as a basis for simulation. Furthermore, linear system control laws may be designed to influence the system's behaviour. Feedback is a widely used principle to control the system behaviour. Under certain conditions, pole placement might prove useful for achieving desirable stability, bandwidth and response properties of the feedback system.

A feedback system with a favourable response (for instance high speed reference tracking) has, however, its price: the control action may cause unacceptably high control signal values. This leads to the formulation of a mathematical optimization problem which is related to the underlying optimal control problem. In this mathematical optimization problem, the performance criterion consists of weighted control variables and controlled variables. This criterion is to be minimized with respect to the control variables. Optimal control is based on
the availability of all state variables, which condition is not always met in practice. In this case observers are applied to estimate the state variables. The speed of an observer is limited by possible amplification of model and measurement errors. This necessitates the modelling of these errors, by using white noise processes with specific stochastic properties. The control of linear stochastic systems turns out to be surprisingly simple as a consequence of the separation principle. The system is controlled by a "non-stochastic" optimal controller which uses optimal state estimation from the Kalman filter.

We mentioned different types of control, like PID control, frequency response control, robust control, adaptive control and nonlinear control. Finally, we glanced at the use of neural networks for process identification and control. Many professional journals have neural networks in their focus. Some promising results have been reported already. It is, however, too early to give a final judgement of the impact that neural networks may have on process identification and control.

References

P.J. Antsaklis (1990) Neural Networks in Control Systems. IEEE Control Systems Magazine 10 (3), 3-5.
P.J. Antsaklis (1992) Neural Networks in Control Systems. IEEE Control Systems Magazine 12 (2), 8-10.
B. Bavarian (1988) Introduction to Neural Networks for Intelligent Control. IEEE Control Systems Magazine 8 (2), 3-7.
N.V. Bhat, P.A. Minderman, Jr., T. McAvoy and N.S. Wang (1990) Modeling Chemical Process Systems via Neural Computation. IEEE Control Systems Magazine 10 (3), 24-30.
S.R. Chu, R. Shoureshi and M. Tenorio (1990) Neural Networks for System Identification. IEEE Control Systems Magazine 10 (3), 31-35.
G.F. Franklin and J.D. Powell (1980) Digital Control of Dynamic Systems. Addison-Wesley, Reading etc.
B. Friedland (1986) Control System Design: an Introduction to State-Space Methods. McGraw-Hill, New York etc.
T. Kailath (1980) Linear Systems. Prentice-Hall, Englewood Cliffs, N.J.
L.G. Kraft and D.P. Campagna (1990) A Comparison Between CMAC Neural Network Control and Two Traditional Adaptive Control Systems. IEEE Control Systems Magazine 10 (3), 36-43.
H. Kwakernaak and R. Sivan (1972) Linear Optimal Control Systems. Wiley-Interscience, New York etc.
H. Li, T. Iberall and G.A. Bekey (1989) Neural Network Architecture for Robot Hand Control. IEEE Control Systems Magazine 9 (3), 38-43.
W.T. Miller, R.S. Sutton, and P.J. Werbos (eds.) (1990) Neural Networks for Control. MIT Press, Cambridge, MA.
K.S. Narendra and K. Parthasarathy (1990) Identification and Control of Dynamical Systems Using Neural Networks. IEEE Trans. Neural Networks 1 (1), 4-27.
D.H. Nguyen and B. Widrow (1990) Neural Networks for Self-Learning Control Systems. IEEE Control Systems Magazine 10 (3), 18-23.
K.M. Passino, M.A. Sartori and P.J. Antsaklis (1989) Neural Computing for Numeric-to-Symbolic Conversion in Control Systems. IEEE Control Systems Magazine 9 (3), 44-52.
G.W. Swan (1984) Applications of Optimal Control Theory in Biomedicine. Marcel Dekker, Inc., New York etc.
D.A. White and D.A. Sofge (eds.) (1992) Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. Van Nostrand, New York.
Learning Controllers Using Neural Networks

W.T.C. van Luenen
Unilever Research Laboratorium Vlaardingen

1 Introduction

In the following sections the applicability of neural networks for the control of dynamic systems is considered. It should be clear to the reader that neural networks are one of the research topics in which a large interest exists at this moment. However, despite the fact that promising results are claimed, most of them are preliminary. The methods and algorithms presented here are not (yet) ready for practical applications in industry. All work is experimental and conducted in research laboratories. It will still take a lot of research on both theoretical and practical topics to solve major problems in this field. Such problems are long learning times, computational capabilities, proofs of convergence and proofs of stability. Nevertheless, the following sections give an introduction to an interesting new research area with possibly great prospects for the future.

1.1 Motivation

Conventional control algorithms are based on the use of mathematical models of the process which needs to be controlled (see the contribution by Boekhoudt). Creating such a model for a complex system takes time. Conventional control theory also puts restrictions on the models used in controller design. The model should for instance be linear, contain gaussian noise, and quadratic performance criteria should be used. However, processes are often non-linear and their parameters may not even be time invariant. As a result of this, for some processes only ill-defined models are available to the control engineer. Despite such problems, these processes need to be controlled in some way.

Neural networks have learning capabilities and they can be used to realize non-linear mappings. These are attractive features which could make them useful building blocks for non-linear adaptive controllers. Neural networks may be useful here because they are able to learn a non-linear control law. They do not require us to fully define the structure of the process or the structure of the controller beforehand. However, some amount of a priori knowledge is required to design a neural network controller. After learning, the controller structure is represented by the structure, the non-linearities and the weights of the neural network. The weights of the neural network may be considered as the parameters of the controller.

Until now, only well-known processes have been used in research to test learning controllers. The reason for this is that much of the behaviour of neural networks is unknown. Convergence and stability of network learning algorithms
have not been proved. The simple problems regarded in research are needed to study learning behaviour and to interpret the knowledge obtained by learning. This is necessary in order to estimate the complexity of problems which can be solved using neural control techniques.

1.2 How to use Neural Networks in Control

The use of neural networks in the control of dynamic systems can be explained easiest by regarding the conventional approach. Figure 1 depicts a conventional control system. The process is controlled by means of the controller output u. The process outputs, in the best case the states of the process, are measured and used as inputs for the controller, together with a set-point (if provided). Imagine the control of a simple servo-system, e.g. a DC-motor driving a rigid manipulator arm. In that case the position and velocity of the motor may be used for feedback. Notice that the structure of this conventional linear controller has much in common with the model of a neuron in a neural network. The controller output is created using a sum of weighted inputs fed through an activation function. In fact, this so-called state feedback controller is equal in structure to an adaptive linear neuron (Adaline) as proposed by Widrow and Stearns (1985).

Fig. 1. Conventional controller configuration.

Suppose that a neural network replaces the controller in Figure 1. In order to make the neural network controller learn (by means of supervised learning), the desired output of the neural network should be known. However, in control applications, the desired output of the controller is generally unknown. Instead, the desired output of the process may be given (explicitly) as a reference trajectory (e.g. the desired motion of a motor). A reference trajectory, specifying the desired process outputs, may be used in a performance measure for the learning controller. It is possible to obtain the error signal between the reference trajectory and the outputs of the process (controlled by the neural network controller). This error signal is available at the process output but not at the controller output. The error signal can therefore not be used for learning by the neural network controller. The first question clearly is how to translate the
output error into a controller error. This is shown in Figure 2. If this can be achieved, the next question is how to design a proper neural network controller. These two questions will be treated separately in the next sections.

Fig. 2. The learning problem for neural controllers.

1.3 Neural Network Design

The design of a neural network controller is a topic of current research. So far there is no design strategy, just some rules of thumb. The choice of the appropriate neural network structure heavily depends on the available a priori knowledge about the process. Notice that the structure of a neural network is determined a priori, while learning concerns the adjustment of the weights in the network.

In the most simple case the process which needs to be controlled is linear and all its states are available for the controller. In that case a linear controller will be able to solve the problem. A single linear neuron may be used which generates a controller output signal out of a linear combination of its inputs. It does not make sense to build networks of linear neurons because the overall network output would still remain a linear combination of its inputs. The number of weights in the linear neuron is equal to the number of controller inputs (states and reference value).

A more complex case appears when the process is known to be linear but not all the states are available to the controller. In that case extra information should be made available. In such a case conventional control algorithms use observers to estimate the unmeasured states. These observers use the known states and (in digital implementations) the time delayed values of these known
states to estimate the unknown states. Therefore it seems reasonable to use time delayed values of the measured states as inputs of the neural network controller. In this way the neural network receives data in which information about the unmeasured states is contained.

If the process regarded is non-linear (or it is not known to be linear), the use of a non-linear neural network for the controller could be considered. In the case that all states of the process are measured, they can be used as inputs of the neural network. If not all states are measured, time delayed values may be used as inputs. The type of network could be a single- or a multi-layer feed forward network with neurons containing differentiable activation functions. The latter type is often used in combination with backpropagation. The multi-layer network may be taught a non-linear mapping between state space and controller output space. According to Kolmogorov's theorem (Kolmogorov, 1957), this network needs two layers of non-linear neurons. The number of neurons in the output layer is determined by the number of controller outputs. The (single) hidden layer should be sufficiently large, but there are no practical rules saying how large. The use of single layer networks using for instance non-linear Gaussian or polynomial functions has the advantage of fast learning algorithms, while requiring more neurons. A sketch of such a two-layer controller network is given after the note below.

The proper size of a non-linear neural network is hard to determine a priori. A network which is too small for its task will not be able to learn the task properly. If the network is too large, its learning speed will slow down and the learning procedure may not converge. Results of trial and error experiments show that the performance of neural networks in learning a mapping degrades gracefully if an initially large network is made smaller. Therefore, it is wise to choose the network a little oversized and to try smaller and larger versions to see whether this results in convergence and faster learning behaviour.

Note. Neural network learning algorithms are used today with roughly two different approaches, considering them either as an associative memory or as a function approximator. The first approach is common in pattern recognition, where networks should recognize a restricted number of patterns. The second is common in system identification and control. There, networks are used to approximate a particular function or process. In the memory approach, each pattern is learned separately, and there is a serious demand that the network should remember all prelearned patterns after learning a new one. In the approximation approach, a series of data comes in (sometimes in real time), which is used for approximation, in which interpolation is important.
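As an illustration of the kind of two-layer controller network described above, a minimal sketch (our code; the layer sizes, the tanh activation and the random initialization are illustrative assumptions, not prescriptions from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

class TwoLayerController:
    """Feedforward controller: states (and reference) in, control signal out."""
    def __init__(self, n_inputs, n_hidden, n_outputs):
        self.W1 = 0.1 * rng.standard_normal((n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.W2 = 0.1 * rng.standard_normal((n_outputs, n_hidden))
        self.b2 = np.zeros(n_outputs)

    def forward(self, x):
        hidden = np.tanh(self.W1 @ x + self.b1)   # non-linear hidden layer
        return self.W2 @ hidden + self.b2         # linear output layer

# Example: two measured states plus a set-point mapped to one control signal.
net = TwoLayerController(n_inputs=3, n_hidden=10, n_outputs=1)
u = net.forward(np.array([0.5, -0.1, 1.0]))
```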
1.4 Learning Strategies

Once the neural network structure has been chosen, the next step is to find a strategy for learning the correct weight values in the network. Since this problem has been investigated intensively, it deserves considerable attention.

When neural networks are learning in a dynamic environment, two aspects of learning need to be separated. The first aspect is structural learning. Suppose that actions of neurons in a neural network controller result in a successful output action. How should the credit for this action be distributed among the various neurons in the network? A solution to this problem is given by for instance gradient techniques (like the backpropagation algorithm) or correlation techniques (like reinforcement algorithms). The second aspect of learning is temporal learning. Suppose a series of output actions of a neural controller results in achieving a certain goal successfully over a period of time. To which actions should the credit for the success be attributed? An algorithm which combines both structural and temporal learning is the so-called adaptive heuristic critic (AHC) algorithm, a reinforcement learning algorithm which will be explained later on in this text. Another example of a temporal learning algorithm is dynamic backpropagation.

Two learning strategies will be considered here. The first method uses supervised learning and a model of the process for learning the neural controller. The second method uses reinforcement learning. With some versions of this method, including the one treated here, there is no need for a process model.

1.5 Neural Network Control Using Identification

The first strategy, using a gradient learning algorithm, consists of two stages. In the first stage a dynamic model of the process is identified; this model is used in the second stage to determine a controller. The structure of the model may be either a conventional model structure (e.g. a differential equation) or a neural network. In the literature, the use of a neural network controller is generally combined with a neural network model.

If a neural network is chosen to model the process, the identification may take place by means of (dynamic) backpropagation, a supervised learning algorithm. Identification with neural networks requires a series of input signals which are used as input for the real process. The output signals of the process as a result of these inputs are measured and stored. If the same input signals are used for a neural network, the outputs of the neural network can be compared with the outputs of the real process, resulting in an error signal. This error between the outputs of the real process and the outputs of the neural network model is used in a dynamic backpropagation procedure to find weight values which make the neural network behave like the real process. The weights in this neural network model may be considered as the parameters of the model; however, these parameters do not have a physical meaning.

Once the parameters of the neural network model have been identified, the configuration of Figure 3 is used for learning the controller. The basis of this procedure is the calculation of a gradient. The error at the output of the process, obtained by comparing the process outputs with the reference trajectory, is backpropagated through the neural network model towards the input of the neural network model without adjusting the weights of the model network. This requires
the calculation of the sensitivity of model outputs with respect to changes in the model input, involving dynamic sensitivity models.

Fig. 3. Indirect learning using supervised learning.

The error at the model input is interpreted as the error at the output of the controller and is used to backpropagate it through the controller neural network. The weights of the controller network are again adjusted using (dynamic) backpropagation.

Note. For identification and control of dynamic processes, the use of tapped delay lines, dynamic backpropagation and sensitivity models is needed. For various reasons these topics have not been treated here. The reader may wish to refer to Narendra and Parthasarathy (1990) for more details.

1.6 Reinforcement Learning

The second strategy uses reinforcement learning and is depicted in Figure 4. In this case the behaviour of the process is evaluated by a critic. The result is a so-called reinforcement signal which gives an indication of the performance quality. The reinforcement signal may best be compared with the outcome of a criterion function of a dynamic programming problem as described in optimal control theory. The reinforcement signal r is used in a reinforcement learning algorithm to adjust the weights in the neural network controller.

There is an essential difference between reinforcement and gradient learning. The latter strategy uses information about the size and the direction in which learning has to take place. This information is present in the error signal (its size and its sign). The reinforcement strategy only gives information about the quality in an absolute way.
The algorithm has to find out for itself (by trial and error) in what direction the learning should go.

The critic may have various shapes. Essential parts within the critic are an evaluation of the current state of the process and a predictor of future evaluations. The evaluation may be carried out by means of a traditional (differentiable!) criterion as it is used for optimal control. However, it may also be a (non-differentiable!) range detector, as will be treated later. The predictor within the critic is often implemented by a neural network, just like the controller. However, other representations like tables and fuzzy logic have been used as well. It learns by the method of temporal differences (Sutton, 1988), as will be treated later.

Fig. 4. Reinforcement learning control.

1.7 A Priori Knowledge

At this stage it is important to notice that learning and a priori knowledge are strongly connected. Learning controllers cannot be designed without a priori knowledge of the process. A priori knowledge is for instance the assumption that a process model is linear or that its structure is known. A more advanced type of a priori knowledge is a process model, used in the controller design. An important piece of a priori knowledge is the structure of the controller. This may be a linear structure, a table or a neural network. The choice of a particular controller structure determines in a major sense the type of learning which is possible. If for instance our controller allows gradient calculation (that is, it contains differentiable elements), the backpropagation algorithm may be used. If, however, the controller contains a hard limiter, this is not possible.

If knowledge of the process is available, this knowledge may be used to speed up learning. One possibility is to use the a priori knowledge in the structure of a
model or in the structure of the controller. The presence of all the process states or a set of incomplete states is a simple example of a variation of such knowledge. It is possible that the structure of the model is known but the parameters are not. In that case it is preferable to use this model and identify its parameters instead of trying to identify a neural network model which does not use the available model structure at all. Another example is that a non-linear part of a model, e.g. a friction curve, is not well known or time varying. In all cases, the introduction of a priori knowledge reduces the complexity of the learning problem a great deal. Therefore, it will also help in obtaining faster learning and convergence to the appropriate solution.

If we consider the two approaches to neural network control, differences appear in the requirements for using them. The identification approach requires us to create a model of the process by means of experiments with the real process. Usually, extensive sets of measurements are required for this purpose. Therefore, this approach will not always be flexible in practice. The reinforcement learning approach does not need a model. However, trial and error learning can be even more troublesome in practice. The motivation for using reinforcement learning may be found in the fact that human learning is like trial and error learning. Humans do not learn a model from which they derive a control strategy. By subsequently trying something and observing the effects, they learn a direct relation between control actions and process outputs. This is what the reinforcement learning approach does, and why it is under investigation in the remainder of this paper.

2 Reinforcement Learning

Reinforcement learning is a technique which is well known in the field of learning automata (Narendra and Thathachar, 1989). It also appears in some early papers on learning control (Mendel and McLaren, 1970). This type is called nonassociative reinforcement learning (NRL). In later literature (e.g. Barto and Anandan, 1985) different approaches to reinforcement learning appear in the context of neural networks. These are denoted as associative reinforcement learning algorithms (ARL).

Figure 5 shows a block diagram for nonassociative reinforcement learning. The objective in this scheme is to let the automaton select a single action u which optimizes a reinforcement signal r generated by a critic. The reinforcement signal is generated after evaluation of the response x of the environment. As a result of this learning procedure, the automaton learns a single action or probability distribution which results in an optimal reinforcement. If the environment is deterministic, its output depends on the action of the automaton only. In that case the learning procedure is in fact a function optimization problem. It may be compared with hill climbing. If the environment is not deterministic but stochastic, the learning procedure becomes a stochastic optimization problem in which the expectation of r or the success probability is to be maximized. A sketch of a simple nonassociative learning scheme is given after Figure 5.

Fig. 5. Nonassociative reinforcement learning.
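To make the nonassociative scheme concrete, a minimal sketch (our code): an automaton over a finite action set that shifts probability mass towards successful actions, in the spirit of the linear reward-inaction schemes from the learning automata literature cited above; the environment's success probabilities and the learning rate are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def learn_automaton(success_prob, alpha=0.05, steps=5000):
    """Nonassociative reinforcement learning over a finite action set.
    success_prob[i] is the (unknown) probability that action i yields r = 1.
    Linear reward-inaction: move p towards the chosen action on success only."""
    p = np.full(len(success_prob), 1.0 / len(success_prob))  # action probabilities
    for _ in range(steps):
        a = rng.choice(len(p), p=p)          # select an action
        r = rng.random() < success_prob[a]   # binary reinforcement from the critic
        if r:
            e = np.zeros_like(p)
            e[a] = 1.0
            p += alpha * (e - p)             # reward: shift probability mass to a
    return p

print(learn_automaton([0.2, 0.8, 0.5]))  # mass should concentrate on action 1
```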
Fig. 5. Nonassociative reinforcement learning.

The second type of algorithm, associative reinforcement learning as proposed by Barto et al. (1981), is shown in Figure 6. The essential difference with the nonassociative version is that the automaton receives both a reinforcement signal and a so-called context input. The context input is in general equal to the response of the environment. The automaton has to learn to associate the response of the environment with the reinforcement signal which it obtains. The objective here is to learn a mapping from the response space to the space of actions while optimizing the reinforcement signal. As a result of this learning procedure an optimal action $u_i$ is learned for each response $x_i$. The ARL automaton can therefore be regarded as a set of NRL automata, each learning an optimal action $u_i$ for a particular response $x_i$.

Fig. 6. Associative reinforcement learning.

ARL algorithms exist in two versions. The first type maximizes the expectation of the reinforcement signal at each time step. In this case, at each time step an
evaluation of the current state of the process is generated by the critic by means of a criterion or range detector. This evaluation is only meaningful in a static sense (the current state). The second version maximizes the cumulative value of the reinforcement over time. In this case the critic is provided with a predictor which forecasts future evaluations. This type of evaluation is meaningful in a dynamic sense. In the next section this will be shown by an ARL algorithm which maximizes the cumulative reinforcement over time.

Reinforcement algorithms were originally developed and used in problems where the response space consisted of a finite number of elements (responses) $x_i$. Therefore these learning algorithms were first used together with a table-lookup type of data storage (Barto et al., 1983). This has been called the memory approach of knowledge storage in Section 1.3. The AHC algorithm described in Section 3 is an example of such a case, using a so-called state space decoder which divides the state space into a finite number of subspaces. Later versions of these ARL algorithms use continuous mappings (Anderson, 1987), in which case multi-layer feedforward networks and backpropagation are used. What the learning algorithm ultimately requires, an approximation of the control or critic function, can be provided by various types of structures. It may for instance use single-layer neural networks or even fuzzy logic based structures.

3 The Adaptive Heuristic Critic (AHC) Algorithm

The AHC learning algorithm has been proposed by Barto et al. (1983). The algorithm presented by Barto et al. had not been developed primarily for use in the control of dynamic systems under the conditions usually assumed by control engineers. However, it indicated the capabilities of reinforcement learning neural networks for such control tasks using a minimum amount of a priori knowledge. The article by Barto was one in a series of publications in which several similar approaches towards "learning control" were described. The best known of these papers are the "boxes approach" (Michie and Chambers, 1968) and the continuous multi-layer neural network approach to the cart-pole balancing problem by Anderson (1987). The first presented a pattern recognition approach to the cart-pole balancing problem; the latter described the implementation of the AHC algorithm by means of multi-layer feedforward networks and backpropagation. It was Werbos (1990) who made the connection to the classical optimal control problem (see also the contribution by Boekhoudt).

In this section the AHC algorithm will be described and analyzed. This is done using the table-look-up (memory approach) version of the algorithm, but it is similar for versions using a neural network. The reason for using the table-look-up version is the real time constraint, which had to be satisfied when the algorithm was tested on an experimental setup.

3.1 Global Description

A block diagram of the AHC algorithm as originally proposed and used for pole balancing is outlined in Figure 7. The process is controlled by the AHC output
u, while the state vector of the process, $\underline{x}$, is measured and used as input for the AHC. The AHC algorithm itself can be divided into three parts: the action network (called Adaptive Search Element or ASE by Barto), the evaluation network (Adaptive Critic Element or ACE) and a range detector. Finally, in the original publication a state space decoder has been added. The state space decoder is typical for the table-look-up version. The parts of the block diagram will be described below. So far the AHC algorithm has been shown to work for the inverted pendulum (Barto et al., 1983). Other applications are under investigation.

There are a few restrictions when applying the AHC algorithm. First of all, the process should contain a single input signal. This demand is caused by the relation between the reinforcement r and the AHC output u (as will be explained). If more output signals are to be generated by the action network, its structure needs to be changed. The number of process outputs is not limited. However, it is known from classical control theory that the process should satisfy certain demands in order to be controllable. An important demand, if the control law should determine the process behaviour, is that all states of the process need to be available to the controller. It requires some a priori knowledge of the process to determine the states and to include them in the vector $\underline{x}$.

Fig. 7. Block diagram of the AHC controller algorithm.

As stated, the decoder is not a part which is required by the AHC algorithm. However, there are advantages in using it. The decoder divides the range of each of the measured state variables into a number of intervals. In this way the state space is effectively divided into a number of non-overlapping subspaces. The process state can therefore only be in one subspace at a time. As a result, the vector of measured state variables is converted into a binary-valued vector indicating which subspace is currently visited. The determination of the correct subspace takes time. However, the binary character of the converted state vector simplifies the calculations in the remainder of the algorithm. The algorithm learns an action u for each subspace.
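Such a state space decoder is easy to express in code. The sketch below is a hypothetical implementation for illustration (it is not taken from the original publication): each state variable gets a list of interval boundaries, the joint interval index selects one subspace, and the output is the binary vector with a single 1 marking the subspace currently visited. The example boundary values are loosely modelled on the decoder boundaries of Figure 13.

```python
import numpy as np

class BoxDecoder:
    """Divides the state space into non-overlapping boxes (subspaces)."""

    def __init__(self, boundaries):
        # boundaries: one list of interval edges per state variable
        self.boundaries = [np.asarray(b) for b in boundaries]
        self.sizes = [len(b) + 1 for b in self.boundaries]
        self.n = int(np.prod(self.sizes))     # total number of subspaces

    def encode(self, state):
        # interval index per state variable, combined into one box index
        idx = [int(np.digitize(s, b)) for s, b in zip(state, self.boundaries)]
        box = int(np.ravel_multi_index(idx, self.sizes))
        x = np.zeros(self.n)
        x[box] = 1.0                          # exactly one subspace is active
        return x

# Two state variables, e.g. an angle and an angular velocity:
decoder = BoxDecoder([[-12, -6, -1, 1, 6, 12], [-5, 5]])
print(decoder.encode([0.5, -7.0]))            # one-hot vector of length 21
```

The one-hot output is what makes the table-look-up version cheap: the dot products in the ASE and ACE described below reduce to selecting the single weight of the current subspace.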
The division into subspaces can be made arbitrarily fine, approximating the continuous case (without decoder) to an arbitrary degree of accuracy. However, due to computational limits, the size of the decoder and therefore its accuracy is limited in practice.

The action network (ASE) in the AHC algorithm calculates the control action u. Its inputs are the state vector $\underline{x}$ and a scalar reinforcement signal. The reinforcement is an externally provided signal which criticizes the system's performance in order to optimize the ASE's actions. A detailed analysis of its working will be given later. Because the aim is to build a learning system using a minimum amount of a priori knowledge, a simple and intuitive reinforcement signal is used. The reinforcement is generated by means of a range detector. The process should remain in a certain part of the state space, say A, and as long as the process remains within A, r is zero. Upon failure, caused by the process leaving A, a reinforcement is given and r becomes unequal to zero as in relation (1). The value $-1$ is due to Barto et al. (1983); its choice is arbitrary.

$$r = \begin{cases} 0 & \text{as long as the process remains within } A, \\ -1 & \text{upon failure (the process leaves } A). \end{cases} \qquad (1)$$

Failure happens after a number of actions generated by the action network. The goal of the action network is to generate output signals u in such a way that a reinforcement (failure) is postponed longer and longer. In the end this should result in an optimally controlled behaviour, optimal according to the reinforcement signal. The algorithm can thus be regarded as a heuristic optimization algorithm.

Barto et al. (1983) first tried to reach this goal with a single action network and the external reinforcement of relation (1). However, this did not result in good learning behaviour. This can be explained with reference to Section 2. This evaluation is only meaningful in the static sense, while the system which we would like to control is a dynamic system. For this reason we need to maximize the cumulative reinforcement over time. The evaluation network (ACE) has been introduced in the algorithm to improve the learning behaviour. It acts as a predictor for the cumulative reinforcement. The idea is to let the ACE use the external reinforcement r to calculate an improved internal reinforcement $\hat{r}$, based on a prediction of the cumulative reinforcement r. The ACE uses a Temporal Difference (TD) method (Sutton, 1988) to learn from its own predictions. A detailed analysis will be given later. Briefly, the ACE should learn to predict the value of r that will eventually be received if the action of that subspace is carried out. The difference between successive predictions is used to calculate $\hat{r}$.

The ACE in fact learns to generate an improved reinforcement which is no longer incidental (present just in case of a failure) but continuously present. This gives the ASE better information on how to learn. Because of the predictive
meaning of $\hat{r}$, the AHC algorithm is able to optimize its actions from the start of a trial until failure. Therefore, exchanging the ACE for a reinforcement signal containing more information than (1) (using for instance an optimal control criterion) is not equivalent. Barto et al. (1983) reported that the use of the ACE significantly improved the learning behaviour of the system.

In the next sections the details of the AHC algorithm will be described. At first reading, the rather technical Sections 3.2 to 3.4 may be skipped and the reader may continue with Section 4.

3.2 Action Network

The structure of the action network (ASE) is given by equations (2) through (5), which will be explained below. The equations have been transformed into block diagrams as well, because this provides a better visual understanding of the algorithm.

Action network eligibility:
$$e_i(k+1) = \delta\, e_i(k) + (1-\delta)\, u(k)\, x_i(k) \qquad (2)$$

Action network weight factors:
$$w_i(k+1) = w_i(k) + \alpha\, \hat{r}(k)\, e_i(k) \qquad (3)$$

Action network output:
$$u(k) = F\Big\{ \sum_{i=1}^{n} w_i(k)\, x_i(k) + \mathrm{noise}(k) \Big\} \qquad (4)$$

Threshold function:
$$F(x) = \begin{cases} +1 & \text{if } x \geq 0, \\ -1 & \text{if } x < 0. \end{cases} \qquad (5)$$

Where:
$\alpha$ is a learning factor;
$\delta$ determines the decay rate of the eligibility.

The meaning of the parameters will be explained with reference to the block diagram in Figure 8, which gives the information flow of the action network. Because of the discrete time character of the algorithm, a special notation has been used in which the forward-shift operator is denoted as q. Its inverse is called the backward-shift operator, denoted by $q^{-1}$. This operator has the following properties:

$$q f(k) = f(k+1), \qquad q^{-1} f(k) = f(k-1)$$

The operator is related to the complex variable z in the z-transform, well known from digital control theory (e.g. Franklin and Powell, 1980).

In the action network the controller output u is calculated by taking the dot product of the input state vector $\underline{x}$ and the weight factor vector $\underline{w}$ of the network. The result, to which noise is added, is fed through a threshold function F, which results in either a positive or a negative output u. The noise term has been added for probing; it helps to explore the state space while searching for the correct weights.
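Equations (2) through (5) translate almost directly into code. The sketch below is an illustrative reading of the table-look-up ASE, not code from the original work; the values of alpha and of the noise standard deviation follow the ranges reported in Section 5.2, while delta and the Gaussian noise distribution are assumptions.

```python
import numpy as np

class ActionNetwork:
    """Table-look-up ASE, equations (2)-(5)."""

    def __init__(self, n, alpha=1000.0, delta=0.9, sigma=0.01, seed=0):
        self.w = np.zeros(n)       # weights, initialised at zero (Section 3.2)
        self.e = np.zeros(n)       # eligibilities e_i
        self.alpha, self.delta, self.sigma = alpha, delta, sigma
        self.rng = np.random.default_rng(seed)

    def act(self, x):
        # eq. (4): weighted sum plus probing noise; eq. (5): hard threshold
        s = self.w @ x + self.rng.normal(0.0, self.sigma)
        return 1.0 if s >= 0.0 else -1.0

    def learn(self, x, u, r_hat):
        # eq. (3): correlate the eligibility with the internal reinforcement
        self.w += self.alpha * r_hat * self.e
        # eq. (2): low-pass filter the product of action and input
        self.e = self.delta * self.e + (1.0 - self.delta) * u * x
```

With the one-hot decoder output, the dot product of $\underline{w}$ and $\underline{x}$ simply selects the weight of the currently visited subspace, while the eligibilities keep recently visited subspaces available for correlation with a reinforcement that arrives later.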
Fig. 8. Block diagram of the action network.

Notice that the controller lacks a reference signal which could be used as a learning stimulus. At the start of the learning procedure, the initial weights are zero valued and the noise results in random actions. These actions lead to failure, a reinforcement is given and the weights are adjusted. As the weight values grow, the actions related to the weights become more likely. In this way the influence of the probing signal is large at the start and becomes smaller during learning, as the weights become larger in magnitude and more deterministic.

The learning procedure of the algorithm should lead to an optimal set of weight values. The reinforcement signal determines when the weights are optimal. The problem of finding the optimal control law can thus be formulated as the problem of finding an optimal set of weights which implements the desired controller, i.e. when $\hat{r}$ is zero. The quality of the learning procedure, and thus the quality of $\hat{r}$, determines the usefulness of the control law after learning.

Learning takes place by means of periodic updates. A general learning rule, for the equation of a weight update procedure, is given by relation (6).

$$w_i(k+1) = w_i(k) + \alpha\, \Delta w \qquad (6)$$

Here, the learning factor $\alpha$ plays a role in the adjustment speed of the weights and the convergence of the adjustment. In supervised learning (e.g. the delta rule by Widrow and Stearns (1985) or backpropagation) usually a steepest descent approach is used, in which $\Delta w$ stands for the gradient. In the case of reinforcement learning a gradient cannot be calculated. Therefore a correlation technique is used. In the action network $\Delta w$ is equal to the product of the input vector $\underline{x}$, the (controller) output u (this product is filtered) and the reinforcement $\hat{r}$ (see relations (2) and (3)). If none of these signals is equal to zero, a nonzero correlation is obtained. Because of the threshold F, the output u just influences the sign of $\Delta w$. The usual expression for the gradient $\Delta w$ in the delta rule is
$\Delta w = (d-u)x$, where $(d-u)$ is the error between the desired and the actual output and x is the input. We see that in the reinforcement algorithm the error is replaced by the reinforcement signal $\hat{r}$, while the proper sign of $\Delta w$ is obtained by multiplication with u.

The block diagram of Figure 8 and relations (2) and (3) show that the product of $\underline{x}$ and u is filtered before it is used for the calculation of $\Delta w$. This filtering element $(1-\delta)/(q-\delta)$ delivers an output e which is called an eligibility by Barto et al. (1983). The reason for the introduction of the eligibility will not be treated here, but it has been extensively motivated in its originally biological context (Sutton and Barto, 1981; Klopf, 1988). Here we shall concern ourselves with its functionality.

This filter can be interpreted as a low pass filter. Its use can be motivated by the use of the non-overlapping subspaces and the reinforcement learning rule which is based on correlations. In the AHC algorithm with decoder, the process state follows a certain path in state space, visiting several subspaces after one another. Each subspace corresponds to an input of the action network. A visit to a subspace therefore corresponds to a pulse on the corresponding input of the ASE ($x_1$ and $x_2$ in Figure 9). However, at the same time all other inputs $x_i$ are zero. Therefore, without the filter, the correlation product $\Delta w = x\,u\,\hat{r}$ would be zero for all subspaces except the one the process is in at that time. As a result of this, only the weight belonging to the current subspace would be adjusted. This is precisely what we do not want to happen. For, due to the dynamic character of the process, actions in the past were (each for some part) responsible for the presence of the process in the current subspace, and therefore the current reinforcement $\hat{r}$ is an evaluation of these past actions. By low pass filtering the product of $x_i$ and u, an output $e_i$ results which remains non-zero for a while, and non-zero correlations become possible even when the process is not in the related subspace. This eligibility $e_i$ exponentially builds up when a subspace is entered and decays after the subspace has been left. The build-up and decay rates depend on the parameter $\delta$ of the filter.

The effect of the filter is shown in Figure 9. Here $x_1$ and $x_2$ indicate subsequent visits of two mutually disjoint subspaces, as a function of time. The reinforcement signal r becomes active after $x_2$ has become zero. It should be clear that the correlations between $x_1$ and r and between $x_2$ and r are zero (the control signal u is not shown for simplicity). On the right hand side of Figure 9, the eligibilities $e_1$ and $e_2$ are shown. Note that $e_i$ is the output resulting from filtering the product of $x_i$ and u. The correlations between $e_1$ and r and between $e_2$ and r are non-zero (indicated by the black area). The area $e_1 \cdot r$ is smaller than the area $e_2 \cdot r$. This indicates a weaker correlation. This is intuitively pleasing, because $x_1$ happened longer ago than $x_2$ and therefore $x_1$ is held less responsible for causing the reinforcement signal to occur.

One should be aware that the presence of the filters is connected to the use of a discrete state space. If the decoder is not used, and the measured states $\underline{x}$ are directly used as input for the ASE, the filters are not necessary (see Anderson (1987)).
Fig. 9. Correlation as a result of eligibilities.

Viewing the eligibility as the result of a low pass filter leads to an important conclusion about the parameter $\delta$, which determines its cut-off frequency. The cut-off frequency should not be chosen lower than the bandwidth of the incoming signals, because this would result in a loss of information. If the cut-off frequency is chosen too high, the effects of discontinuities due to the division of the state space may become too strong.

As can be seen in relation (6), the weight factors are calculated from a periodical update and an old value. At the start of the learning procedure an initial value needs to be chosen. This is a problem on its own. In neural networks the weights are usually initialised at zero or at a small random value. For the AHC algorithm the zero value is used. In the context of neural control a more appropriate choice would be to use a priori knowledge of the process (if available) to initialise the weights. For some subspaces the sign of the correct action is not hard to determine. This is possible only because the weights of the one-layer architecture allow interpretation. The integration of a priori knowledge in neural networks is much more difficult in practical engineering environments, especially when complex multi-layer networks are used.

For the interpretation of the knowledge in the single-layer AHC algorithm, relation (4) should be regarded. Suppose the decoder and the threshold device are not used and the inputs and output are continuous real valued variables (see Anderson, 1987). In that case the input vector $\underline{x}$ of the algorithm is the real valued vector of the state variables. In the dot product of the state vector and the weight vector, the weight vector can just as well be interpreted as the vector of state feedback controller parameters. This implies that if the state vector
contains all state variables and if correct values of the weight factors are found, the classical state feedback control law may be realized with a linear action network.

If a decoder is used in the algorithm, a controller output is learned for each of the subspaces. In the case of a differentiable function F, a real valued output is learned for each subspace. Because the AHC algorithm uses a threshold function for F, it just learns to take a decision on steering plus or minus for each subspace. Assuming that the weight factor of a particular subspace has a large enough value (either positive or negative), the noise term in the action network can be neglected. In that case the decision taken in this subspace has become deterministic and only depends on the sign of the weight and not on its absolute value. This has important consequences for the learning behavior. Because the algorithm only uses the sign of the weights, the factor $\alpha$ may be chosen relatively large. This will, after a few updates, result in a large value of the weight and thus a deterministic action u.

3.3 Evaluation Network

The structure of the evaluation network (ACE) is given by equations (7) through (10). The block diagram of the evaluation network is shown in Figure 10. It can be seen that the action network (Figure 8) and the evaluation network are conceptually equal.

Evaluation network eligibility:
$$\bar{x}_i(k+1) = \lambda\, \bar{x}_i(k) + (1-\lambda)\, x_i(k) \qquad (7)$$

Evaluation network weight factors:
$$v_i(k+1) = v_i(k) + \beta\, \hat{r}(k)\, \bar{x}_i(k) \qquad (8)$$

Evaluation network product:
$$p(k) = \sum_{i=1}^{n} v_i(k)\, x_i(k) \qquad (9)$$

Evaluation network output:
$$\hat{r}(k+1) = r(k+1) + \gamma\, p(k) - p(k-1) \qquad (10)$$

Where:
$\beta$ is a learning factor;
$\lambda$ determines the decay rate of the eligibility;
$\gamma$ determines the prediction horizon.

The dot product of the state vector $\underline{x}$ and the weight vector $\underline{v}$ is determined in order to calculate a prediction p of the cumulative external reinforcement r over time. Since the external reinforcement r is always zero except on failure, the extreme values of this prediction are zero and the value of r on failure.
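Again the equations map almost line by line onto code. The sketch below is an illustrative reading of equations (7) through (10), not the original implementation; the timing of r, p and the internal reinforcement follows the reconstruction of (10) above (which builds in the one-sample delay discussed in Section 3.4), and the parameter values are assumptions (beta is chosen within the 1% to 10% of the reinforcement magnitude reported in Section 5.2).

```python
import numpy as np

class EvaluationNetwork:
    """Table-look-up ACE, equations (7)-(10)."""

    def __init__(self, n, beta=0.05, lam=0.8, gamma=0.95):
        self.v = np.zeros(n)       # prediction weights v_i
        self.xbar = np.zeros(n)    # filtered input, eq. (7)
        self.p = 0.0               # p(k)
        self.p_prev = 0.0          # p(k-1)
        self.beta, self.lam, self.gamma = beta, lam, gamma

    def step(self, x, r):
        """One sample: x is the decoded state, r the external reinforcement."""
        # eq. (10): internal reinforcement from the temporal difference
        # of the two most recent predictions
        r_hat = r + self.gamma * self.p - self.p_prev
        # eq. (8): correlate r_hat with the trace of recent inputs
        self.v += self.beta * r_hat * self.xbar
        # eq. (7): update the input trace
        self.xbar = self.lam * self.xbar + (1.0 - self.lam) * x
        # eq. (9): prediction for the current input, used at later steps
        self.p_prev = self.p
        self.p = self.v @ x
        return r_hat
```

Because r is $-1$ only on failure and zero otherwise, a well-trained prediction p drifts towards the failure value as the state approaches the border of A, so the internal reinforcement warns the action network before the failure actually occurs.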
The adaptation of $\underline{v}$ happens in a way similar to the action network. Here the filter has only $\underline{x}$ as input. Comparing the functionality of Figure 8 and Figure 10, $\delta$, $e$, $\alpha$ and $w$ in Figure 8 correspond to $\lambda$, $\bar{x}$, $\beta$ and $v$ respectively in Figure 10. Because the absolute magnitude of the external reinforcement is bounded, the magnitude of its prediction p should also be bounded. Therefore the learning factor $\beta$ should be chosen small compared to the magnitude of r.

As stated before, the evaluation network uses a Temporal Difference (TD) method (Sutton, 1988) to calculate an internal reinforcement $\hat{r}$. An extensive explanation of TD methods is beyond the scope of this paper. In short, the TD method uses an infinite horizon prediction in which predictions are calculated of the reinforcement r in the future. With the parameter $\gamma$ a form of exponential discounting is realized. The value of this parameter, $0 < \gamma < 1$, determines the effective length of the prediction horizon that is used. The algorithm calculates the weights in such a way that p(t) approximates $r(t+1) + \gamma\, p(t+1)$. In this way it produces an early internal indication of the chance that an external reinforcement (equivalent to failure) is to be expected in the future. If $\hat{r}$ has a positive value, the system performs better than it expected and a reward is given. A negative result means the system performs worse than expected. With $\gamma$, the prediction method prevents the extinction of the internal reinforcement in case of prolonged correct behaviour of the system.

Fig. 10. Block diagram of the evaluation network.

3.4 Implementation-Dependent Timing

The weight factors in the algorithm are obtained by correlating various signals. In order to obtain correct weights, the timing of the signals is important, as will be explained. Suppose the AHC algorithm is implemented in a computer controlled system. At a certain moment in time the process is sampled and the state vector $\underline{x}(k)$ is obtained. The presence of a decoder does not matter here. In the ideal case the computer outputs the calculated action u(k) at the same instant, while the evaluation $\hat{r}(k+1)$ of this action is calculated one time step later. As a result of this, the multiplication of $\underline{x}(k)$, u(k) and $\hat{r}(k)$ in Figure 7 and Figure 8 is not correct. The correct implementation is to multiply x(k) and u(k) with $\hat{r}(k+1)$, and therefore x and u(k) should be delayed by one time step. In practical implementations the computer takes almost the entire sample time to calculate its response. In that case a sample x(k) results in an action u(k+1) one time step later and its evaluation $\hat{r}(k+2)$ two time steps later. Hence, it is important to account for the corresponding time delays.
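Combining the sketches given earlier, one control cycle with this delayed correlation could look as follows. This is hypothetical glue code: decoder, ase and ace refer to the illustrative classes sketched above, while sample_state, external_reinforcement, apply_action and n_steps stand for the process interface and are not defined in the text.

```python
x_prev, u_prev = None, None
for k in range(n_steps):
    x = decoder.encode(sample_state())               # x(k)
    r_hat = ace.step(x, external_reinforcement())    # evaluation of the past
    if x_prev is not None:
        # correlate x(k-1) and u(k-1) with the r_hat arriving one step later
        ase.learn(x_prev, u_prev, r_hat)
    u = ase.act(x)                                   # u(k)
    apply_action(u)
    x_prev, u_prev = x, u
```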
3.5 Cooperation in Learning of Action and Evaluation Network

Within the AHC algorithm, the action and evaluation networks cooperate in learning. In fact these two networks each have a separate learning mechanism, but there is mutual interference. Starting the learning (with no a priori knowledge), the action network will try to control the process. At the start, bad actions will cause repeated external reinforcements. These will not result in major changes in the action network. However, the evaluation network will learn to predict the external reinforcements. Only as a result of the improvement of these predictions is the action network able to improve its actions. The action network will be able to keep the process within the allowed range A longer and longer.

Suppose the process has remained within the allowable range A of state space and there has not been an external reinforcement for a while. For that part of the state space the weights of the evaluation network will exponentially decay and become zero. This only happens if the actions performed by the action network result in optimal behavior according to the external reinforcement r. For if they are optimal, r will remain zero and consequently the prediction of the ACE should also be zero. As a result of this, $\hat{r}$ will become zero and the action network will stop learning. This ends the learning procedure, until a disturbance or change in the process makes additional learning necessary. In that case the algorithm continues the described procedure.

4 Application: the Inverted Pendulum

Various papers have been published in which the AHC algorithm has been demonstrated in a simulation applied to a cart-pole system. The beauty of this type of process is its instability, on which a learning control algorithm can be demonstrated nicely. Therefore, in the research presented here, a similar process has been used. For practical reasons, the actual device has been constructed slightly differently. The device used in the experiments is shown in Figure 11. It consists of two connected links, of which one is driven on a rotating shaft by a motor and the second one is able to rotate freely. The aim is to drive the first link in such a way as to balance the second link. A special construction is used to pick up the second link if the controller fails to balance it. In that way the pendulum can be used in learning experiments, as will be described later.
Fig. 11. The inverted pendulum.

4.1 System Architecture

A system has been built that enables both simulation and practical experiments. It is shown in Figure 12. A PC-AT is provided with a number of transputers (Bakkers and Van Amerongen, 1990) in order to provide sufficient computational resources. Both the simulation of the pendulum and the calculations for the controller are performed on separate transputers. The PC acts as a user interface and as a timer. The PC, the simulation transputer and the controller transputer communicate by means of a third transputer, the guard, which performs signal scaling, error checking and data routing. In real time operation, the PC relays data between the pendulum and the guard transputer by means of interrupts. Both in simulation and in real time, the pendulum state variables and trial length statistics are visualized on the PC screen. For the controller a special monitor and transputer graphics card have been added to visualize the weight factors and eligibilities. The system is able to perform real time simulations. It can also execute the AHC algorithm for the practical pendulum in real time.

5 Experimental Results

A number of experiments have been carried out to investigate the capabilities of the AHC algorithm. In order to test the algorithm in simulations, a model of the pendulum has been made (Oosterveen, 1990). Based on this model, a state feedback controller has been realized to investigate the possibilities for the control of the pendulum. This state feedback controller required the nonlinear model of the pendulum to be linearized.
Fig. 12. Hardware system architecture.

This first step was useful because the complexity of the control problem became clear: the pendulum contains a number of non-linearities. The state feedback controller showed that stabilization and control of the pendulum was possible using this model and the pole placement technique (see the chapter by Boekhoudt). When the AHC algorithm is used in combination with a state space decoder, it is always a question whether the accuracy of the state space division is enough to enable stabilization.

The reason for the development of a detailed model is that little is known about the capabilities of neural networks and learning algorithms. Since convergence and stability of these algorithms have not been proved, experiments are needed to study learning behaviour and to interpret the control results obtained by learning. This is necessary in order to estimate the complexity of problems which can be solved using neural control techniques.
5.1 Experiment Design

The basic experiment which has been carried out is directed towards the learning behavior of the system. It is identical to the experiment presented by others for pole balancing. The weights and eligibilities of the controller are initialised at zero. The inverted pendulum is put in the upright position. The system is started by setting free the second link of the pendulum. The link starts to fall and the controller performs (initially random) actions. At the start this leads to an early failure. The link is considered to have fallen when either the driving link or the second link has moved out of the allowable range A. This is the end of a learning trial, and a reinforcement is given upon this failure to enable learning. The fallen link is picked up again and the second trial starts. As the system proceeds and learns, the trials should last longer and longer until, eventually, the link is balanced.

The experiment described above has been carried out in simulation as well as in practice. During these experiments the influence of several parameters of the AHC algorithm has been investigated. Because the experiments contain a certain randomness (due to noise added to the output of the controller), a single experiment is not representative of the average behavior. Therefore series of experiments had to be carried out in order to obtain reliable data.

5.2 Simulation

The first goal of the simulations was to investigate the effect of the parameters in the AHC algorithm on the learning behaviour. The second aim was to obtain a set of parameters which would yield successful learning sequences in simulation, before starting the experiments on the experimental setup.

An important aspect of the algorithm is the decoder, especially the way in which the state space is divided into subspaces. In Section 3.1 some (dis-)advantages of the decoder have been enumerated. Simulations show that the learning behaviour of the algorithm is sensitive to the choice of subspaces. This can be expected, because a coarse division of state space was used in order to limit the computational effort required for the algorithm. A division which resulted in appropriate learning is shown in Figure 13. This division was found with the help of the monitor on which the weights of the network were displayed.

The cut-off frequencies of the low pass filters needed to be tuned by means of the parameters $\delta$ and $\lambda$. Experiments showed that the choice of these parameters is important to obtain the required learning behavior. Having obtained a correct value, however, a 10% change did not dramatically affect the results. The parameters were tuned by regarding the values of the eligibilities on the screen during operation. If the pendulum starts in the upright position and falls out of the range A, the eligibilities of the subspaces the pendulum has passed should show an exponential decay from the border of the allowed range A to the upright position point.

The second investigation concerned the learning factors $\alpha$ and $\beta$. Here it should be taken into consideration that the maximal weight magnitude is
unbounded for the action network and bounded for the evaluation network (see Section 3).

Fig. 13. The decoder boundaries.

An important parameter related to $\alpha$ is the noise variance $\sigma$. The probing function of the noise at the start should decrease as the weights become more deterministic (larger). Therefore the value of the noise variance should be such that the learned weight value becomes noticeable in the performed action within a few updates of a weight. The relation between the two has not been studied in experiments so far. It appeared that for the action network $\alpha$ could be chosen between $10^2$ and $10^4$ with a value for $\sigma$ equal to 0.01. For the evaluation network the value of $\beta$ could be chosen between 1% and 10% of the reinforcement magnitude. If the learning factors were chosen too small, little or no learning took place. If they were chosen too large, the weights became too large and their values showed oscillating behavior.

In order to improve learning, a number of experiments with minor modifications of the algorithm have been carried out. One considered the problem of the extinction of the internal reinforcement in case of prolonged correct behavior followed by a failure (see Section 3). Experiments have shown that this extinction causes problems in the adaptation procedure if the actions are not yet optimal in all subspaces (Potma, 1990). One solution would be to scale down the learning factor (for the weight adaptation of the action network) proportionally with time. This was not successful. A successful solution is to make the weight update proportional to the evaluation weight $v_i$, by multiplying this weight with the second part of relation (3). Finally, the correction of the relation in time between the signals (Section 3.4) has been implemented. This did not result in significant changes. Perhaps this is due to the relatively slow speed of the pendulum compared with a relatively high sample rate.

Figure 14 shows a typical sequence of trials illustrating the learning behavior of the system. This lifetime plot on a logarithmic scale shows that there is a slow
but (on average) steady improvement at the start, and at a certain moment a dramatic increase occurs. The simulation has been stopped after several hours of balancing.

Fig. 14. A series of trials and their length in simulation.

In Figure 15, a 3D-plot is shown in which the knowledge of the evaluation network has been represented after learning. The plot of the action network is not shown. Despite the successful learning procedure, the contents of that table look chaotic in 3D. The plot of the evaluation network shows the predicted reinforcement after learning. The plot shows that the predicted reinforcement increases near the border of the allowed range A, for the maximal allowed values of the state variables $\theta$ and $\dot{\theta}$. Some deficiencies remain in the plot, indicating that learning has not resulted in a perfect prediction everywhere. The fact that the evaluation network predicts a large reinforcement near the border indicates that repeated correlations have been found between the presence of the process in these subspaces along the border and external reinforcements. Detailed examination shows that the predicted reinforcement gradually increases when moving from the central area of state space towards the border. Therefore, the control system will obtain an increasing internal reinforcement when it tends to fall from the upright position towards the border of the allowed state space. This information is used by the action network during learning to adjust its weights and, as a result of the changing weights, its control actions.
Fig. 15. The knowledge in the evaluation network.

5.3 Real Time Behaviour

The real time behaviour of the controller has been tested in two ways. In the first way, the system learns in simulation using the detailed model of the pendulum. After learning, the knowledge obtained can be used to control the real system. This method has been shown to work reasonably well. However, the length of a trial in practice is not as long as in simulation (a few minutes against hours).

The most illustrative experiment has been to let the controller learn using the real system. This is of course the ultimate goal of the research. So far the algorithm has not been able to balance the link for longer than about 15 seconds (peak value). It also takes more time to learn this trial length on the real system than in simulation. The lifetime plot for a real time learning experiment is given in Figure 16. It shows an increasing trial length in the beginning, indicating the system has been learning in real time. However, in the experiments the increase in trial length could not be maintained. Performance peaks occur now and then, but the behaviour of the controlled system was not as smooth on the real system as it was in simulation. The real time system was also sensitive to external disturbances on the links of the inverted pendulum.

5.4 Discussion

The experiments show that the AHC algorithm can be used as a learning control algorithm for the inverted pendulum, both in simulation and in practice. It is hard to compare the algorithm with other algorithms at this time, since the way in which the control problem is formulated here is difficult to compare to classical control algorithms.
Fig. 16. A series of trials and their length in real time.

A discussion of the performance of the AHC algorithm should at least concern two aspects: the learning quality and the control performance after learning. It appears that real time learning is more difficult than learning in simulation. The reasons for this are diverse. One reason is that the model used in simulation differs from the real pendulum. This is reasonable, since usually a model is a simplification of reality. It seems that there are dynamical aspects which are present in reality but have not been incorporated in the model. The use of a binary controller output (positive or negative) will excite certain resonance frequencies in the mechanism. In addition, hysteresis and backlash will have a negative influence on the learning due to their hard non-linear nature. These effects result in vibrations which cannot be registered by the system due to the coarse division of the measurements by the decoder. Especially the division for the angular velocities may be to blame. Due to the decoder, the information available to the algorithm reduces to a velocity signal which is either negative, approximately zero or positive. This also explains the slow learning in practice, because a lot of information from the sensors of the real system is not taken into account. A solution to this may be to implement the AHC algorithm using a real valued controller output for each subspace. The alternative is to eliminate the decoder and use the original state measurements for the ASE.

The control performance of the AHC algorithm may be compared with the performance of a state feedback controller. One should be aware that the former has hardly used any a priori knowledge, while the latter has been designed with
all the knowledge available about the process. The neural controller has been shown to be able to stabilize the inverted pendulum both in simulation and in practice. However, in contrast with the state feedback controller, its sensitivity to noise and disturbances is considerable and the variance in the stick position is also larger. Regarding the algorithm, this can be explained using the same arguments as before: the binary controller output in combination with poor velocity feedback. The relatively coarse state space division allows the pendulum to move freely within a subspace without the controller noticing this. Therefore, stabilization of the pendulum requires the system to create a kind of limit cycle in which it will continue to move around. Due to the possibility of free motion within a subspace, a certain randomness will remain. In practice disturbances, noise and higher order dynamics will eventually cause the system to fail.

6 Relations to Other Work and Conclusions

The results described so far are influenced by the use of the state space decoder in the controller. The AHC algorithm needs to be evaluated in combination with feedforward networks containing differentiable activation functions and a continuous real valued output. Such results have been published by Van Luenen et al. (1993). Evidence is provided in this paper that, from a classical control point of view, the adaptive critic algorithm suffers from a number of limitations. The algorithm was not able to balance a second-order system, consisting of a single-link inverted pendulum, with an error converging to zero. In addition, the meaning of the AHC algorithm in relation to optimal control and, more specifically, dynamic programming will have to be studied. More details can be found in Van Luenen (1994), where various approaches for using neural networks in control are evaluated. There, the idea is posed that the critic network can in fact be considered as a substitute for a process model. Learning to predict the reinforcement is comparable to learning (or identifying) a process model in real time, in the sense that both require trials or experiments on the real setup in order to improve the quality of the controller. However, classical control engineers will prefer the process model as a representation of the knowledge to be learned rather than the criterion predictor.

The two approaches mentioned at the beginning of this chapter, reinforcement learning control and indirect learning control using identification, both have their limitations, especially when considered from a practical point of view (Van Luenen, 1994). For the latter case, the identification of non-linear processes is a computational problem as well as an algorithmic one. For the non-linear optimisations which multi-layer neural networks require, it will be very hard to prove convergence. In the case of reinforcement algorithms, most successes have been achieved on higher-level control tasks such as navigation and strategies for peg-in-hole insertion with inaccurate robots.

In classical feedback control, neural networks are to be considered as structures with distributed parameters which can be used to approximate a control function, a process model or a criterion prediction. Other structures, like tables,
splines and fuzzy logic may be used for similar purposes. The challenge is to explore this property, for instance by using it for non-linear systems in which the function to be approximated (or learned) is not known a priori. An example is a learning feedforward controller implemented for tracking control of an autonomous vehicle (Van Luenen, 1994). The learning controller uses a neural network based on spline functions and is capable of correcting for parameter errors and errors in the friction model. Another example is the use of neural networks when controller outputs are hard to calculate in real time. An example is the optimal control problem defined for the inverted pendulum by Van Luenen (1994). Here, the controller has to bring the second link of the pendulum to the upright position from any initial position. The results show that this is feasible in simulation as well as on the experimental setup.

References

C.W. Anderson (1987) Strategy learning with multilayer connectionist representations. Proc. 4th Int. Workshop Machine Learning, Univ. California, Irvine, 103-114.
A.W.P. Bakkers and J. van Amerongen (1990) Transputer based control of mechatronic systems. Control, Systems and Computer Engineering Group (BSC), University of Twente, Enschede, The Netherlands, Proc. of the 11th World Congress of IFAC in Tallinn, USSR.
A.G. Barto and P. Anandan (1985) Pattern recognizing stochastic learning automata. IEEE Trans. Syst. Man Cybern. 15 (3), 360-375.
A.G. Barto, R.S. Sutton and P.S. Brouwer (1981) Associative search network: a reinforcement learning associative memory. Biol. Cybern. 40, 201-211.
A.G. Barto, R.S. Sutton and C.W. Anderson (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Systems, Man Cybern. Vol. SMC-13, No. 5, 834-846.
G.F. Franklin and J.D. Powell (1980) Digital control of dynamic systems. Addison-Wesley Publishing Company, Reading, Massachusetts.
A.H. Klopf (1988) A neuronal model of classical conditioning. Psychobiology 16 (2), 85-125.
A.N. Kolmogorov (1957) On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition [in Russian]. Dokl. Akad. Nauk USSR 114, 953-956.
W.T.C. van Luenen, P.J. de Jager, J. van Amerongen and H.M. Franken (1993) Limitations of adaptive critic control schemes. Proc. of the Int. Conference on Artificial Neural Networks, Amsterdam, The Netherlands.
W.T.C. van Luenen (1994) Neural networks for control, on knowledge representation and learning. PhD thesis, Control Laboratory, Dept. of Electrical Engineering, University of Twente, The Netherlands.
J.M. Mendel and R.W. McLaren (1970) Reinforcement learning control and pattern recognition systems. In Adaptive learning and pattern recognition systems: theory and applications, Mendel, J.M. and Fu, K.S. (eds.), 287-318, New York, Academic Press.
D. Michie and R.A. Chambers (1968) 'Boxes' as a model of pattern-formation. Towards a Theoretical Biology, Vol. 1, Prolegomena, C.H. Waddington, Ed., Edinburgh: Edinburgh Univ. Press, 206-215.
K.S. Narendra and M.A.L. Thathachar (1989) Learning Automata, an introduction. Prentice Hall, Englewood Cliffs NJ.
K.S. Narendra and K. Parthasarathy (1990) Identification and control of dynamical systems using neural networks. IEEE Trans. Neural Networks 1 (1).
H. Oosterveen (1990) Design and implementation of a state feedback and a neural controller for inverted pendulum. Master thesis, Control, Systems and Computer Engineering Group (BSC), Reportnr. 89R140, University of Twente, Enschede, The Netherlands.
H.T.A. Potma (1990) Analysis of the adaptive heuristic critic algorithm applied to pole balancing. Master thesis, Control, Systems and Computer Engineering Group (BSC), Reportnr. 90R068, University of Twente, Enschede, The Netherlands.
R.S. Sutton (1988) Learning to predict by the methods of temporal differences. Machine Learning 3, 9-44.
R.S. Sutton and A.G. Barto (1981) Towards a modern theory of adaptive networks: Expectation and prediction. Psychol. Rev. 88, 135-171.
B. Widrow and S.D. Stearns (1985) Adaptive signal processing. Prentice Hall, Englewood Cliffs NJ.
Key Issues for Successful Industrial Neural-Network Applications: an Application in Geology

H.R.A. Cardon and R. van Hoogstraten

Shell Internationale Petroleum Mij. B.V., The Hague

1 Introduction

Neural networks are starting to find their way towards practical applications. Although the number of actual, fully operative neural networks is still relatively small, more and more successful applications have been reported over the past few years. In this article we will discuss the issues that are critical for developing successful, applicable neural networks. These issues are based on experience with practical applications developed at the Shell Research laboratory in Rijswijk, The Netherlands. We will discuss these key issues in a chronological sequence: When do you consider using a neural network? What are the critical issues when introducing a neural network in an operational environment? What are the most important stages in the development cycle? Subsequently, these issues will be illustrated with a practical application. This example concerns a neural network developed to perform a pattern-recognition task in geology.

2 Criteria for Choosing a Neural-Network Solution

When one is confronted with a problem, there are usually several possible solutions to choose from. If a neural network is considered, techniques such as Expert Systems, Standard Statistics, Genetic Algorithms or Rule Induction are often considered as well. Each technique has its advantages and disadvantages. This also applies to neural networks. Some specific advantages of neural networks are:

- automated learning from examples;
- no need to make assumptions about the form of the relationship between input and output;
- fast learning (if networks have fewer than 50 neurons).

Some specific disadvantages are:

- a neural network behaves as a "black box", i.e. it is hard to interpret the neural network solution;
- it is difficult to incorporate knowledge of a given problem.
This combination of advantages and disadvantages makes neural networks particularly useful in situations with (ample) training examples but no clear relationship between input and output. Expert systems are likely to be more successful in situations where an expert is able to give a set of clear rules and criteria for solving a problem. Standard Statistics are more applicable when a clear model for the problem exists; then, only a best fit to this model has to be made. Genetic Algorithms seem to have their strongest capability in solving optimisation problems with non-linear constraints. Of course, it should be mentioned that for many problems a hybrid approach is optimal. Because of complexity, most problems need to be split up into modular parts, and each module will need its own solution method.

3 Cooperation of Problem-Area and Neural-Network Experts

Industrial neural-network applications are aimed at solving or automating a problem occurring in a specific area. Often experts from this specific area (in our example geology) ask for the help of neural-network experts because they have heard or read about the capabilities neural networks can offer. After the problem-area expert has explained his problem to the neural-network expert, and they have agreed that it is interesting to try to solve the problem with neural networks, it is essential that problem-area and neural-network experts cooperate closely. Many neural network applications have failed in the past because a problem was explained and some data were simply handed to the neural-network expert, who then worked unattended for the rest of the project. This can easily lead to neural networks that minimise all kinds of criteria and have good scores on (artificial) test sets, but are simply not fit for their purpose. Therefore, it is absolutely crucial that there is frequent communication between the neural-network and the problem-area experts. This cooperation will encourage the generation of more data, will lead to better ideas about what criteria the neural network should optimise, and will make it much easier to incorporate knowledge about the problem. This incorporation of knowledge is very important: often knowledge built up by experts is simply discarded when constructing a neural network. This makes the task of a neural network even harder. The knowledge can be incorporated by, for example, constructing a good training and testing set, selecting and including relevant input parameters, and applying appropriate pre-processing. The selection of input parameters can well be an iterative process: input parameters are proposed, validated and then discarded or accepted, and sometimes, at a later stage, replaced by new parameters.

4 Development Stages

The first step in the development of a neural network is the gathering and selection of proper training and testing data. The data should be representative
of the problem. The data also have to cover the application space in order to avoid a high degree of extrapolation. It is difficult to verify these conditions but, again, frequent contact with the problem-area expert can be very helpful. In cases where the neural network produces mistakes, it can be useful to try to find out why this happens, for example by looking at the examples in the training set most similar (e.g. according to Euclidean distance) to the pattern for which the network fails.

Once a good data set has been constructed, the prototyping phase can start. In this prototyping phase the feasibility of a neural-network solution to the problem is investigated. The most adequate pre-processing of the data, the most relevant input parameters and good optimisation criteria (usually the number of correct classifications) are determined. It is very important to spend a significant effort on a good visualisation of the results. Figures indicating, for example, a low root-mean-squared error of 0.02 are often less meaningful than a graphical overview of the neural net's response to the various patterns. In our example at the end of this article, geologists were well pleased to see the neural-net identification of the geological layers (visualised by the use of different colours). This enabled them to interpret the results in a way they were used to, and they were also better able to indicate errors which definitely had to be corrected. Sometimes, however, they also adjusted their own responses because of the answers produced by the neural net!

After this extensive investigation it needs to be decided whether the neural network will be made operational. It should be realised that usually a neural network is not used as an autonomously operating part, but that in most cases it is used as an assistant. This assistant can verify the expert's answers, can indicate difficult patterns, or can do a lot of "boring" work, leaving the hard problems to the expert. Only after extensive testing in practice could the network eventually take over and operate autonomously.

When a network has to be installed in an operational environment, one has to consider whether it is going to be installed as a fixed network, a network that allows or needs fine-tuning, or a network with on-site learning capabilities. A fixed network, i.e. a network with a fixed architecture and frozen weights, has the advantage that the neural-network expert can monitor all the training in the laboratory and that the user can use the network as a black box. The user does not need to know anything about the operation of the neural net. A disadvantage, however, is that conditions in the operational environment often appear (slightly) different from those assumed during the development of the neural net. If the differences are small enough (for instance, in the case of the same problem, but in different countries), some fine-tuning will solve the problem. This, however, puts some extra demands on the flexibility of the exported network. When the differences between the conditions assumed in the laboratory and the operational ones are expected to be large, a very flexible network with learning capabilities is necessary. This will allow, for instance, the inclusion of new input parameters. In this case, far more complex neural network software is needed, as well as some neural-network knowledge at the location where the neural network is implemented.
As an example in which all of the above-mentioned considerations played a role, we give a more technical description of a project in which neural networks have been used for performing a task in the area of geology. The description of this project was given in a slightly modified version in Cardon et al. (1991).

5 Problem Description

The properties that determine the flow of oil and gas in subsurface reservoirs can vary over short distances, and yet they can usually be measured on actual rock samples only from a limited number of locations (wells). Geologists therefore model the distribution of these reservoir properties in the region between the wells by relating the properties to genetic characteristics of rocks. For this purpose, five rock classes (facies) were distinguished in a group of North Sea reservoirs that originated in a coastal plain environment during Jurassic times: (1) channel-fill, (2) sheet-sand, (3) coarsening-upwards sand (mouthbar), (4) coal and (5) shale (see Figure 1). This classification was based on the characteristics observed on continuous rock samples from the reservoir interval (cores). For economic reasons, however, cores are only available for a limited number of wells. Information from other wells must therefore be restricted to wireline logs. Wireline logs are produced by tools which are lowered into an exploration well to obtain information about the formation. The identification process of genetic facies relies on the recognition of characteristic log signatures, which are typified by shape, vertical trend and value. A neural-network approach seems well suited to this intuitive type of pattern recognition.

In this study, a neural network has been trained to recognise the five genetic facies types mentioned above. Implementation of this and other similar networks in an existing knowledge-based log subdivision and correlation system would significantly add to the system's capabilities, since the interpretation of genetic facies types is presently carried out by the system user. Consistent and correct identification of genetic facies types is crucial for a 3-D reservoir geological modelling system (Davies, 1990), in which genetic units must be properly identified in order to be modelled correctly.

6 Extraction of Features and Training of the Neural Network

A schematic representation of our back-propagation network is given in Figure 2, and further details on the subject are found in (Lippmann, 1987; Rumelhart, 1987; Stinchcombe, 1987). It was decided not to let the neural network operate on the raw data, but to apply some pre-processing first. For each segment, the following 13 features, which are considered relevant by geologists, were extracted: thickness, average values and trends of the gamma ray log (GR), formation density log (FDC), compensated neutron log (CNL) and borehole compensated sonic log (BCSL), plus the positive and negative separations between the FDC and CNL and between the GR and BCSL.
neutron log (CNL) and borehole compensated sonic log (BCSL), plus the positive and negative separations between the FDC and CNL and between the GR and BCSL. Standardised values of these parameters were fed into a neural network with 13 input units. Since we fed the aforementioned parameters instead of the raw log values into the neural network, we were able to exploit a-priori knowledge and the experience of geologists, and obtained a significant data reduction.

[Fig. 1. An example of wireline log responses in an exploration well.]

Several network configurations were tried to find the topology with optimal performance. A back-propagation network with 13 input neurons, 5 hidden neurons and 5 output neurons appeared optimal. The network was trained with 334 examples. It reached a root-mean-squared error (rms) of 0.16 and a correctness ratio of about 96% on the training set within 1000 cycles. The generalisation capability of the network was tested on 137 other examples. On this test set the network performed very well: an rms of 0.2 and a correctness ratio of about 92%. (See the table below in Figure 3.) It should also be kept in mind that in many cases where the neural-net answer differs from that provided by its trainer, the network is not necessarily wrong!

7 Operation of the Neural Network

A standard back-propagation network with three layers is used. The layers are connected by feedforward, weighted connections; no feedback or communication
within a layer takes place. The function of the input layer is merely to distribute the values of the input parameters to the next layer, which consists of the hidden neurons. The input layer of the network does not perform any transformation on the inputs; i.e. output $o_i$ is equal to input $x_i$ for each neuron $i$ in this layer:

$o_i = x_i$   (1)

[Fig. 2. Schematic representation of a back-propagation network. For each subdivision a number of features are extracted from the wireline logs. Only five such possible features are drawn. These features are fed into the input layer. The output layer indicates the calculated genetic facies type.]

In contrast, the hidden and output neurons compute a weighted sum of their inputs, biased by a threshold:

$u_i = \sum_j w_{ij} o_j - \theta_i,$   (2)

where $w_{ij}$ denotes the connection weight from neuron $j$ to neuron $i$ and $\theta_i$ is the threshold of neuron $i$. The output of hidden and output neurons is a non-linear
function of the summed input. In our models we used

$f(u) = [1 + e^{-u}]^{-1}$   (3)

During the learning process, the network tries to minimise an error function $E$. Errors are defined per training pattern $p$:

$E_p = h(t_p, o_p)$   (4)

with $t$ the "target" vector (the vector of desired output values given by the supervisor). For most purposes, the sum squared error

$E_p = 0.5 \sum_j (t_{pj} - o_{pj})^2$   (5)

is used. Here $o_{pj}$ is the value of the $j$th neuron of the output layer when pattern $p$ is presented to the network as input, and $t_{pj}$ denotes the $j$th component of the ideal target output of example $p$. The overall measure of error for the set of training patterns is:

$E_{total} = \sum_p E_p$   (6)

The objective of the training phase is to find an optimal set of weights and thresholds such that $E_{total}$ is minimised. In the following, we will treat the threshold of a neuron as the weight on a connection from a neuron $j$ with (constant) output $o_j = 1$. After presentation of a pattern $p$, the weights are updated according to

$w_{ij}^{new} = w_{ij}^{old} + \Delta_p w_{ij}^{new}$   (7)

and

$\Delta_p w_{ij}^{new} = -\eta \, \partial E_p / \partial w_{ij} + \alpha \, \Delta_p w_{ij}^{old},$   (8)

where $\eta > 0$, the learning rate, gives the rate of weight change, and the term with $\alpha > 0$ can be added to introduce momentum. The learning and momentum parameters $\eta$ and $\alpha$ had to be chosen. Often values of about 0.1 for $\eta$ and close to 0.6 for $\alpha$ were used. Also an initial choice for the weights had to be made. We initialised the weights by choosing them randomly from a uniform distribution between -0.5 and 0.5. It was found that in most cases the choice of parameters did not have much influence on the network's performance, though this was definitely not always so. To test the training accuracy of a network, the root-mean-squared error

$rms = \sqrt{E_{total} / (p \cdot d)}$   (9)

is commonly used, with $d$ the dimension of the output (i.e. the number of neurons in the output layer) and $p$ the number of examples in the training set.
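As an illustration of equations (2)-(3) and (5)-(9), the following Python sketch implements one on-line back-propagation step with momentum for a 13-5-5 network of the kind used in this study. It is a minimal reconstruction from the formulas above, not the original implementation; the layer sizes match the study, but the random seed and all helper names are our own.

```python
import numpy as np

def f(u):
    # Logistic transfer function, equation (3)
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
# 13 inputs, 5 hidden, 5 output neurons; the trailing column of each
# weight matrix is the threshold, modelled (as in the text) as the
# weight on a connection from a constant-output (o = 1) node.
W1 = rng.uniform(-0.5, 0.5, (5, 14))
W2 = rng.uniform(-0.5, 0.5, (5, 6))
eta, alpha = 0.1, 0.6                        # learning rate and momentum
dW1_old, dW2_old = np.zeros_like(W1), np.zeros_like(W2)

def forward(x):
    x1 = np.append(x, 1.0)                   # append the constant node
    h = f(W1 @ x1)                           # equations (2) and (3), hidden layer
    h1 = np.append(h, 1.0)
    o = f(W2 @ h1)                           # equations (2) and (3), output layer
    return x1, h1, o

def train_pattern(x, t):
    """One on-line back-propagation step with momentum, eqs. (7)-(8)."""
    global W1, W2, dW1_old, dW2_old
    x1, h1, o = forward(x)
    # dEp/du for the output neurons; f'(u) = f(u)(1 - f(u)) for the logistic
    delta_o = (o - t) * o * (1.0 - o)
    delta_h = (W2[:, :-1].T @ delta_o) * h1[:-1] * (1.0 - h1[:-1])
    dW2 = -eta * np.outer(delta_o, h1) + alpha * dW2_old
    dW1 = -eta * np.outer(delta_h, x1) + alpha * dW1_old
    W2 += dW2
    W1 += dW1
    dW1_old, dW2_old = dW1, dW2
    return 0.5 * np.sum((t - o) ** 2)        # Ep, equation (5)

def rms(E_total, p, d):
    return np.sqrt(E_total / (p * d))        # equation (9)
```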
8 Comparison with Linear Discriminant Analysis

A standard statistical technique for classification problems is discriminant analysis (Duda and Hart, 1973; Norusis, 1985). In this analysis, linear combinations $D$ of the input variables $X$ are formed:

$D_i = B_{0i} + B_{1i}X_1 + \cdots + B_{ni}X_n$   (10)

where the $B$'s are coefficients estimated from the data. The $B$'s are chosen so that the classes (of examples) are separated as well as possible, while at the same time the volume of the clusters (or classes) is kept to a minimum. The discriminant functions describe the hyperplanes that separate the groups. If the distribution of features is normal and the covariance matrices per group are identical, it can be proven that the discriminant analysis technique is Bayes optimal. Although the conditions required to guarantee Bayes optimality do not hold in general, one can nonetheless perform a discriminant analysis. For our problem, the discriminant analysis gave a correctness ratio of 82%. This is clearly worse than the score of the neural network (cf. Figure 3). The reason seems to be the capability of a neural net to model non-linear relationships in the data.

9 Improvements on the Back-Propagation Algorithm

A number of ideas to improve both the speed of training and the generalisation capability of the neural net were also investigated. The training procedure aims at obtaining an optimal set of weights $w_{ij}$ such that an error criterion $E$, representing a distance measure between the answers calculated by the neural net and the target answers, is minimised. In standard back-propagation, (5) and (6) are used. Two ideas for improving on standard back-propagation training were tried.

First, a comparison was made between two strategies for changing weights in the learning procedure. One method is to calculate the weight change $\Delta_p w_{ij}$ for every pattern and then to change the weights so that $\Delta w_{ij} = \sum_p \Delta_p w_{ij}$ (batch learning). The other method is to immediately change a weight by $\Delta_p w_{ij}$ after each presentation of a pattern (on-line learning). Simulations consistently indicated a clearly faster convergence for on-line learning. The performance on the test data was comparable (see Figure 4).

Secondly, closer attention was paid to the training set. Traditionally, training patterns are selected randomly from the set, so that each pattern has a 1/p chance of being chosen. In error-dependent training, the patterns (examples) with a larger error are given a greater chance of being selected as training patterns: they are chosen with a chance proportional to $E_p$. This leads to a training procedure in which more attention is paid to patterns which have not been learned very well, rather than fine-tuning the network with patterns that are already well classified. Applied to this problem, simulations of error-dependent training indicated a slightly faster convergence and a slightly improved generalisation.
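A minimal Python sketch of the error-dependent pattern selection just described, assuming a vector of per-pattern errors $E_p$ has been kept from the previous cycle (the function name and bookkeeping are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def pick_pattern(errors, error_dependent=True):
    """Select the index of the next training pattern.

    With uniform selection each of the p patterns has a 1/p chance of
    being chosen; in error-dependent training a pattern is chosen with
    probability proportional to its last known error Ep.
    """
    n_patterns = len(errors)
    if not error_dependent:
        return int(rng.integers(n_patterns))
    probs = np.asarray(errors, dtype=float)
    return int(rng.choice(n_patterns, p=probs / probs.sum()))
```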
Results on test set with discriminant analysis (rows: actual class; columns: predicted class):

actual       total |  ch.  shs.  mb.  co.  sha. | correct
channel         10 |   7    2    1    0    0    |   70%
sheet-sand      34 |   1   29    0    0    4    |   85%
mouthbar         7 |   1    2    4    0    0    |   57%
coal            28 |   0    1    0   21    6    |   75%
shale           58 |   2    2    1    2   51    |   88%
total score: 82%

Results on test set with neural network (rows: actual class; columns: predicted class):

actual       total |  ch.  shs.  mb.  co.  sha. | correct
channel         10 |  10    0    0    0    0    |  100%
sheet-sand      34 |   1   30    1    0    2    |   88%
mouthbar         7 |   1    1    5    0    0    |   71%
coal            28 |   0    1    0   25    2    |   89%
shale           58 |   0    1    0    1   56    |   97%
total score: 92%

Fig. 3. Comparison between linear discriminant analysis and neural networks.

10 Conclusions and Future Research

We have developed a neural network that is capable of identifying genetic geological facies types with a very high accuracy (up to 92% on the test set). In fact, where the neural-net answers differ from those of the geological experts, the true answer is mostly debatable. The performance of the neural network in this case was clearly better than that of linear discriminant analysis. We also demonstrated the possibility of incorporating a-priori knowledge of a problem into the application of a neural network. Good results were obtained in a study aimed at developing a neural network for the automatic partitioning of the logs into segments; a sliding-window approach appeared to be suitable for this. This neural-network approach has now also been applied to similar identification problems in fields in South-East Asia. The network has been incorporated in a large 3D geological analysis computer program.
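The per-class percentages and total scores quoted here and in Figure 3 are ordinary confusion-matrix statistics. A small illustrative sketch of how they can be computed (all names are ours; classes are assumed coded 0-4):

```python
import numpy as np

def scores(actual, predicted, n_classes=5):
    """Confusion matrix (rows: actual, columns: predicted) plus the
    per-class and overall correctness ratios shown in Figure 3."""
    M = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        M[a, p] += 1
    per_class = np.diag(M) / M.sum(axis=1)   # e.g. 70%, 85%, ... per row
    total = np.trace(M) / M.sum()            # e.g. 82% and 92% overall
    return M, per_class, total
```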
[Fig. 4 appears here: two plots. The upper plot ("Training") shows the rms error (scale up to 0.40) against the number of training cycles (0-1000); the lower plot ("Generalisation") shows the percentage of correct classifications (0-100%) against the number of cycles, for both on-line and batch learning.]

Fig. 4. Comparison of on-line versus batch learning. In the upper plot the convergence during training is compared; in the lower plot the classification performance on the test set is monitored during training.
Acknowledgements

The authors wish to thank Mark Hooijkaas (University of Eindhoven) and Sandra Oudshoff (University of Utrecht) for their large contribution to this project, and Willem Epping, Harry Joosten and Hans Rieuwerts of KSEPL for their valuable discussions. Furthermore, the authors wish to thank Paul Davies, Frances Abbots, Mark Budding, and Harry Soek of KSEPL for their cooperation on the geological aspects of this project.

References

H.R.A. Cardon, R. van Hoogstraten and P. Davies (1991) A neural network application in geology: identification of genetic facies. In: Artificial Neural Networks: Proceedings of ICANN-91, Espoo, Finland, Volume 1. North-Holland, 809-814.
P. Davies (1990) Integrated reservoir characterisation of Cycle III, Brent Group, Brent Field, U.K. North Sea. In: Proceedings of the Archie Conference.
R.O. Duda and P.E. Hart (1973) Pattern Classification and Scene Analysis. Wiley, New York.
K. Hornik, M. Stinchcombe and H. White (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
R.P. Lippmann (1987) An introduction to computing with neural nets. IEEE ASSP Magazine, April, 4-22.
M.J. Norusis (1985) SPSS-X Advanced Statistics Guide. SPSS Inc., Chicago.
D.E. Rumelhart and J.L. McClelland (1987) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volumes 1-2. MIT Press, Cambridge, MA.
Neural Cognodynamics

P.J. Braspenning
Department of Computer Science, University of Limburg, Maastricht

1 Introduction

The main title of this paper is a paraphrase of the term 'quantum cognodynamics', which was humorously used by Feigenbaum to denote the field of classical Artificial Intelligence (AI). Although Feigenbaum's nickname should not be taken very seriously, the title of this paper is apt to denote a frontier of exploration which is regarded by many as essentially the final one: the brain. The idea of emulating the brain by means of models of its neural network(s) and corresponding implementations (in software or in hardware) in fact revives age-old dreams about building artificial brains. In this sense the field of Artificial Neural Networks (ANNs) shares a large part of its impetus with the field of classical Artificial Intelligence. There are, however, also significant differences in emphasis. Many of these differences remind us of a similar clash of opinions about the way human cognition should be studied by the discipline of Psychology. For example, there are (and have been) many behavioural analysts, who try to explore cognition by studying human behaviour in a variety of problem-solving tasks. Methodologically, they presume that in the scientific enterprise only intersubjective observables (i.e., behavioural data) are allowed, and usually they strongly oppose psychologists of a more speculative bent. At the other side of the road, so to speak, are psychologists proclaiming that introspection is not the same as just speculating. They view introspection as a quite peculiar source of knowledge, also to be tapped when it comes to understanding the phenomena of cognition. In their opinion cognition cannot be understood by studying solely the (outside) behaviour of human beings.

Classical AI versus Artificial Neural Networks

The introductory remarks help in characterizing the difference in emphasis between classical AI and the field of ANNs. The power of the breed of systems built within AI is viewed as residing in the (amount of) explicit knowledge contained in the system. Furthermore, the unit of knowledge is commonly considered to be the rule, i.e. a conceptual unit expressing a kind of regularity or law-like connection between states (or events, processes etc.) of the world. In contrast, in the field of ANNs the power of neural networks is attributed to their complex, yet highly flexible behaviour. Moreover, there is no real unit of knowledge, because the behaviour is generated by the collective interaction of a huge number of neurons. In fact, in so far as any generative mechanism (including the behaviour generated by an ANN) is brought about by knowledge, one is forced to say that
the 'knowledge' of the network is distributed. Be that as it may, if one wants to speak about knowledge represented by the network, it should still be quite clear that ANNs contain no rule-like knowledge. It makes much more sense to view them as highly complex, associative devices. Such an associative device uses a web of associations between data of the outside world; a web that is dynamically built in interaction with the outside world. Therefore, it comes as no surprise that one of the most well-known applications of ANNs is as associative memory of world data. Metaphorically, picking up one part of such a web triggers all other content-related data to appear too. A first way of thinking about such devices is as systems moving along an (abstract) trajectory in the space defined by the representation chosen for the underlying dynamical system, and in a direction given by some optimality criterion for associative linkage.

Renewed Interest in ANNs

The main purpose of our introduction is to state clearly and succinctly that the many over-enthusiastic claims regarding the field of ANNs, together with a corresponding difficulty in distinguishing between genuine results and wishful thinking, should be seen as indicating a shift in scientific attention. For example, in many cases the knowledge that should be captured for building a classical AI system appears as yet too difficult to articulate. Associative devices such as ANNs seem to promise easier ways of capturing the necessary knowledge. However, one often forgets that the knowledge to be articulated in order to build a classical knowledge-based system may be (and indeed often is) quite different from the web of associations generated by an ANN. The shift in scientific attention from knowledge-based towards behaviour-based systems mainly reflects increased technological possibilities for computing with ANNs. However, the quality of the (partial) solutions towards truly artificially intelligent systems (based on these different kinds of systems) is presently not very different, and is anyway still the subject of much discussion.

The resurgence of interest in neural networks has been fuelled by several factors. We mention here only a few:

a) New search techniques such as simulated annealing and its deterministic approximation can be embodied very naturally by these networks. Thus parallel hardware implementations promise to be extremely fast at performing the best-fit searches required for associative (content-addressable) memory and real-world perception.

b) New learning procedures have been developed which allow networks to learn from examples. The learning procedures automatically construct the internal representations that the networks require to be effective in particular domains. Hence, they may remove the need for explicit programming in ill-structured tasks which contain a mixture of regular structure, partial regularities, and exceptions.

c) There has also been considerable progress in developing ways of representing complex, articulated structures in neural networks. The style of representation
is tailored to the computational abilities of the networks and differs in important ways from the style of representation that is natural in serial von Neumann machines. It allows networks to be damage resistant, which makes it much easier to build massively parallel networks.

Still, with respect to each of these factors some critical remarks are in order. First, the possibilities for parallelism are the subject of many grandiose claims. Yet, generally due to the huge number of neurons needed, only hardware implementations might fulfil the many claims. Although substantial progress in hardware implementation is indeed reported, it is only fair to say that such implementations constitute only the most primitive (i.e., unstructured) networks. Secondly, the internal representations constructed automatically by new learning procedures need not at all represent humanly accessible knowledge. Many useful associations may be generated by the network for ill-structured tasks, but it is fair again to say that ill-structured essentially means 'lacking sufficient knowledge'. Therefore, although any useful web of associations is better than nothing at all, one should keep in mind that regularities as discovered by human beings are outside the scope of ANNs (except for some accidental commonalities). Besides, the (meaning of the) term 'partial regularities' is exemplary of the many misnomers appearing in ANN terminology. In fact, good old 'correlations' are meant. Nonetheless, for a domain containing mainly such 'partial regularities' it is rather obvious that ANNs could indeed be helpful in building internal representations of its correlational structure. Thirdly, there is a difference between using neural networks to implement complex, articulated structures and claiming that the network may represent such structures. The latter requires that the network has identifiable means to compose these structures from their constitutive relations. The former, however, only requires a mapping of the relational structure onto the units and links of the network. This is quite similar to the usual distinction between a physical database (being an implementation of a database) and a logical database (representing the information contained in the database). As a matter of fact, a new implementation style is often confused with new ways of representing knowledge. Clearly, damage resistance of the network can in this context be an asset, which may profitably be used in implementing such high-level structures.

The Purposes of this Contribution

The previous remarks provide the context for what this paper is aiming at. First of all, we shall not deal any further with questions regarding the knowledge contained in the network. We hope that our cautionary remarks will prevent any superficial and unwarranted comparison between knowledge-based and behaviour-based systems. In any case, one should first try to understand their behaviour. After that, a systematic comparison between knowledge-based and behaviour-based systems might show many more complementary features than are usually advertised. Secondly, due to the customary emphasis on the behaviour (thus dynamics) of these systems, our main topic will be a high-level map of the many types of neural
networks and their dynamics. Thirdly, by providing a framework to locate particular types of neural dynamical systems, such a map could support the reader in assimilating other papers in this book. What's more, it is a cognitive map also outlining types of systems which are not treated at all within the present book. Areas on the map corresponding to such types still have a function in reminding us of very interesting systems indeed. Notwithstanding the fact that these systems are not yet technologically feasible and belong to relatively unexplored domains of research, they deserve much more attention because of their presumably quite unexpected, even unforeseeable, behaviour. Fourthly, surveying our high-level map of 'neural cognodynamics', some types of systems will be named in particular. These names should be used as 'anchors', i.e., as providing a point of reference, while we treat some general knowledge concerning these types of systems. Finally, we would like to suggest that re-reading this paper after understanding some particular types of neural dynamics (as presented in the different contributions of this book) in more detail may indeed prove to be fruitful.

2 Artificial Neural Networks Revisited

As this paper is placed at the end of the book, and since so many different networks have been described in previous contributions, it may be useful to review the basics of ANNs before we sketch the promised high-level map of such network systems. The generic ANN or connectionist architecture is a network of very large numbers of simple but highly interconnected active nodes. Each node is assumed to receive real-valued activity (either excitatory or inhibitory or both) along its input lines. Typically (but not necessarily), the processing of the nodes consists of summing this activity and changing the state of the node as a function of this sum. The connections modulate the activity that they transmit as a function of an intrinsic (but modifiable) property called their weight. The weighted sum of the activity along the input lines is then fed into a mostly non-linear output function, which usually includes some threshold for the total activation of any node. In general there is a non-linear functional relation between the activity on an input line and the state of activity of its source. The behaviour of the network as a whole is a function of the initial state of activation of the nodes and of the weights associated with its connections, which constitute the (only form of) memory of the architecture.

This generic architecture can be specialized in quite a number of ways, e.g. by

1) introducing stochastic mechanisms, which determine the level of activity or the state of a node,
2) connecting nodes to outside environments, in which case they are sometimes assumed to have a certain receptive field in parameter space (a narrow range of combinations of parameter values),

3) encoding environmental properties by the pattern of states of entire populations of nodes, called distributed representation, instead of in terms of a single node state (local representation),

4) building networks in terms of modules that are themselves connectionist networks functioning as (super-)nodes (i.e., so-called cascade systems).

Thus an ANN denotes a family of mechanisms which are similar regarding a number of architectural and dynamical commitments. The networks may exhibit interesting collective properties such as pattern recognition, the appearance of rule-like behavioural regularities, and the realization of many desired multi-parameter, multi-valued (input-output) mappings. Moreover, such networks can be made to learn by modifying the weights on the connections as a function of certain kinds of feedback. This is usually done in a way that reduces the discrepancy between an actual output (in response to some input) and a pre-determined output contained in an independent set of input-output pairs, the so-called learning set. In learning mode the networks can be seen as servo-mechanisms trying to realize optimal memory traces for the collection of input-output pairs to be learned. The feedback mechanism of, e.g., "back propagation" is well known.

Many people are initially surprised at how much can be accomplished by computing with even a uniform network of simple interconnected nodes, in order to realize the aforementioned aggregate or, as they are mostly called, emergent properties. Yet physicists, chemists and even population biologists would be able to tell an interesting story about the way natural 'many-body' systems establish ('compute') aggregate properties. Quite another factor is, in fact, contributing to the fascination for these amazing networks, i.e. their superficial analogy to the neural system. Presumed neural plausibility has been the initial driving force behind claims about ANNs as models of mental processes and about connectionist (distributed) representations as mental representations. We will not discuss these issues, but explain why ANNs are very interesting systems even without any claims in respect of their neural plausibility. Therefore, we now discuss the activation dynamics and possible trajectories of dynamical systems such as ANNs, to draw the contours of a high-level map of such systems.

3 Activation Dynamics

The term activation dynamics refers here to the dynamics of a dynamical system (e.g., an ANN) with fixed weights, for which, given initial values of the activations of all the nodes, all future activations of the nodes can be computed. Parameters of the activation dynamics are the weights, the activation thresholds (or biases), and the actually used input vectors. A complementary view is based on weight dynamics, referring to adaptive schemes for realizing a particular form of activation dynamics. For example,
establishing an activation dynamics for classifying input patterns according to some classification scheme requires finding a weight dynamics for a mapping of input onto output vectors constrained by some partial specification. Such a partial specification is the so-called learning set, ideally consisting of the most prototypical examples of input vector classifications. Finding a trajectory through weight space such that every initial weight distribution converges to an equilibrium is what has generally been called learning in the neural network literature. The utmost generality is reached by allowing dynamics in both activation and weight spaces. This would define a dynamical system on the Cartesian product of the weight space and the activation space. However, this generality is mostly too difficult to be treated rigorously.

Our consideration of activation dynamics assumes continuous time. The reason for this is two-fold. First, dynamical systems running in continuous time can in general be described by sets of differential equations. Secondly, these equations function as a mnemonic to categorize the different types of activation dynamics and thus aid in maintaining a high-level map of such dynamics. On the other hand, systems running in discrete time are certainly no less useful, so that our choice of continuous time should not be seen as a particular bias, but only as a particular heuristic approach.

Convergent, Oscillatory, and Chaotic Dynamics

An ANN may be identified with a dynamical system described by a set of ordinary differential equations based on a continuously differentiable vector field. The activation dynamics of such a system (assuming weights and external inputs are clamped) may be categorized in three broad categories:

a) convergent: every trajectory of the activation vector (the vector of activations of all nodes) moves finally (maybe after a very long time!) towards some equilibrium or stationary state.

b) oscillatory: every trajectory (of the activation vector) moves asymptotically towards a periodic succession of states (or 'periodic orbit'), which could be stationary.

c) chaotic: very many trajectories do not reach some periodic (though not necessarily stationary) orbit. What is generally called the 'butterfly effect' refers to the extreme sensitivity of the long-term behaviour of trajectories to their starting values.

Most theoretically treated or practically realized ANNs have convergent activation dynamics (see Section 4). From the standpoint of biology such behaviour is highly implausible for natural networks, especially if the nodes are identified with nerve cells. Yet, in such networks coherently acting collections of cells have been found for which convergent behaviour is less unlikely. In that case, such collections are taken as nodes of a network.

Types of dynamics: an information-processing view

Equally important, convergent activation dynamics is conceptually the easiest
way to understand the information-processing capabilities of such systems. Indeed, if activation dynamics means e.g. retrieving some (cluster of) information, then the end of the trajectory (the stationary state) can be taken to refer to that particular (cluster of) information. However, it is much more difficult to understand conceptually how, when using an oscillatory network, the asymptotically reached cycle (a non-constant periodic orbit) may stand for some (cluster of) information. Moreover, how should one retrieve that information? Should we think of a global invariant like the cycle's period, or its amplitude, or the average of some function defined over the orbit (e.g. the activation vector velocity)?

In fact, a stationary state (or equilibrium) is a point in the activation state space. If this state space is finite-dimensional then there are as many stationary states belonging to particular activation dynamics as there are points within the state space. Formulated differently, any state may be reached by finding an appropriate activation dynamics. On the other hand, the set of possible cycles is an infinite-dimensional space, while time is also needed for determining the eventual information-reference function of the cycle (besides the infinite set of points making up the cycle). Yet, from the standpoint of information processing, oscillatory activation dynamics is of much interest, though not very well explored.

It is even more difficult to think about the information-processing possibilities of chaotic networks. However, as stated earlier, the assumed neural plausibility of ANNs has been a strong driving force for neural net research. It is therefore quite amazing that, as a matter of experimental fact, brain dynamics seems to be more of an oscillatory and/or chaotic flavour than a convergent one (Freeman and Viana Di Prisco, 1986a). Thus, the question arises how the limit set of a chaotic orbit (often some sort of fractal) may be used to represent information. These very interesting, but also rather difficult, questions concerning chaotic networks will not be treated here. However, the interested reader may expect to hear much more about such chaotic systems in the near future. Our main concern in the rest of this paper will be systems with convergent activation dynamics.

4 The Class of Convergent Dynamics

Although in the course of time many ANNs have been implemented without their activation dynamics always being very well understood, neural network research has now reached a state in which networks can be constructed that have known theoretical properties. A bunch of mathematical methods, like gradient descent, Liapunov functions, probability theory, linear algebra, group theory, dynamical systems theory, differential equations and combinatorics, have been proven useful in analyzing the dynamics. However, this holds mostly for convergent dynamics, whereas the analysis of oscillatory networks is somewhat underdeveloped or too complex to be used. An example is given by the so-called Liapunov function, whose existence is a criterion for every trajectory to
converge to a stationary state. A comparable criterion for convergence to a cycle is, however, unknown. Before delving into more details it is useful to treat ANNs within a more formal framework. The formal framework here is inspired by Hirsch (1989), though our treatment is necessarily shorter and not nearly as rich as his paper. However, our goal is somewhat different, since we do not strive for completeness, but for some handy theoretical background so as to be able to assess what can be expected from particular types of networks.

4.1 Mathematical Models for ANNs

First, we give two basic equations which cover many of the actually used ANN-models. Of course, these equations are rather general and therefore somewhat abstract. However, carefully elaborating them provides precisely the kind of insights that we are aiming at. Think of a network of $n$ nodes, where each node has its activation $a_i = a_i(t)$ at time $t$, output function $o_i$, activation threshold $s_i$ and output signal $O_i = o_i(a_i + s_i)$. Moreover, the weight on the link from node $j$ to node $i$ is generally a real number $w_{ij}$. We use the convention that a value of zero means that there is no link between $i$ and $j$. The incoming signal from node $j$ to node $i$ is $S_{ij} = w_{ij} O_j$. Moreover, a vector $I$ can be introduced, denoting a vector of any number of external inputs feeding into some or all nodes. As we treat activation dynamics, the weights and thresholds are fixed. Now, the future activation states are assumed to be determined by a system of $n$ differential equations (one for each node $i$) of the form:

$da_i/dt = G_i(a_i, S_{i1}, \ldots, S_{in}, I)$,   $i = 1, \ldots, n$ and $I = (I_1, \ldots, I_m)$   (1)

where the independent variable $t$ represents time. As the weights $w_{ij}$, thresholds $s_i$ and external inputs $I_k$ are assumed to be known, we may write

$da_i/dt = F_i(a_1, \ldots, a_n)$,   $i = 1, \ldots, n$   (2)

The output functions $o_j$ are taken to be continuously differentiable and nondecreasing: $o_j' \ge 0$. Further, the state transition functions $G_i$ are assumed to satisfy $\partial G_i / \partial S_{ij} > 0$; i.e., an increase in the weighted signal $w_{ij} o_j(a_j)$ from node $j$ to node $i$ tends to increase the activation of node $i$. In many cases one can assume non-negative outputs: $o_j \ge 0$. In this case the condition $w_{ij} > 0$ can be interpreted as "node $j$ excites node $i$", since an increase in the output $O_j$ will cause the activation $a_i$ to rise if other outputs are held constant; similarly, $w_{ij} < 0$ means "node $j$ inhibits node $i$".

It is important to distinguish a dynamical system and the way it is represented in particular coordinates. Equation (2) represents the network in so-called network coordinates, which are convenient because $a_i$ is the activation of node $i$ of the network. However, if we choose the outputs $o_i$ as coordinates, then the underlying dynamical system may be represented by other differential equations. In practice, only mathematical convenience determines the
choice of coordinates, whilst any invertible function of the activation states may be chosen. The usual dynamical features of solutions to (2), like convergence of the dynamics, attractors (ends of trajectories), periodic orbits, limit cycles, etc., are invariant under coordinate transformations, since they are properties of the underlying dynamical system. Not all systems that are models for ANNs are included in (2) (see further on), but since the equations are rather general one can use them to illustrate the mathematics involved. Of special significance is the fact that inputs are often held constant ('clamped') during a particular run of the activation dynamics. That means that the inputs are parameters determining the activation dynamics. As such they should be specified before one can legitimately speak of stationary states, attractors, and so on.

In vector notation we write (2) as $da/dt = F(a)$, where $F$ is the vector field on Euclidean space $R^n$ whose $i$-th component is $F_i$. $F$ is always assumed continuously differentiable. The underlying dynamical system can be characterized as the collection of mappings $\{\zeta_t\}_{t \in R}$ defined as follows. For each $b \in R^n$ there is a unique solution $a$ to (2) with $a(0) = b$; we set $\zeta_t(b) = a(t)$.

A special case of (2), and a class of network dynamics which is much studied, are the additive networks:

$da_i/dt = -c_i a_i + \sum_j w_{ij} o_j(a_j + s_j) + I_i$,   $i = 1, \ldots, n$   (3)

with constant decay rates $c_i > 0$ and external inputs $I_i$ (Amari, 1972, 1982; Aplevich, 1968; Cowan, 1967; Grossberg, 1969; Hopfield, 1984; Malsburg, 1973; Sejnowski, 1977). A closely related type of network (which is, however, not covered in general by (3)) is composed of nodes which are differentiable analogs of linear threshold elements, with a dynamics given by:

$db_i/dt = -c_i b_i + \theta_i\left(\sum_j w_{ij} b_j + s_i\right)$,   $i = 1, \ldots, n$   (4)

where each $\theta_i$ is a sigmoid function. Just in case all $c_i$ are equal, one may substitute $a_i = \sum_j w_{ij} b_j$, by which a system of type (3) with $o_i = \theta_i$ is obtained. Sometimes the weight matrix is invertible, in which case the inverse transformation is also possible. Note, however, that if one of these equations represents a physical network, then the other one is indeed only conceptual, showing that the underlying dynamics is not dependent on the representation as such. Moreover, it suggests that studying network-type equations in different representations may provide additional insights. Equations (2) (special case equation (3)) and (4) cover many of the actual ANN-models introduced in this book.
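To get a feeling for the additive dynamics (3), the following Python sketch integrates a small network with the forward-Euler method. It is only an illustration: the logistic output function, the random symmetric weights and all parameter values are our own choices, not prescribed by the text.

```python
import numpy as np

def simulate_additive(W, c, s, I, a0, dt=0.01, steps=5000):
    """Forward-Euler integration of the additive network, equation (3):
       da_i/dt = -c_i a_i + sum_j w_ij o_j(a_j + s_j) + I_i
    with the logistic output function o(x) = 1/(1 + exp(-x))."""
    o = lambda x: 1.0 / (1.0 + np.exp(-x))
    a = a0.copy()
    for _ in range(steps):
        a += dt * (-c * a + W @ o(a + s) + I)
    return a   # for convergent dynamics this approximates a stationary state

# A small example with a symmetric weight matrix (symmetry is one of the
# convergence conditions listed in Section 5):
n = 4
rng = np.random.default_rng(2)
W = rng.normal(size=(n, n))
W = 0.5 * (W + W.T)
a_inf = simulate_additive(W, c=np.ones(n), s=np.zeros(n),
                          I=np.zeros(n), a0=rng.normal(size=n))
```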
4.2 Input-Output Behaviour for ANN-Models

Now what is the role of network input and output in establishing a particular activation dynamics? If we take a network described by (1), it is clear that one must specify the initial activation vector $a(0)$ and the external input vector $I$. However, although both provide ways of feeding data into the network, their dynamical role is different. In fact, the external input vector $I$ determines a particular dynamical system, whereas a different $a(0)$, assuming $I$ fixed, determines another trajectory of the same dynamical system. This is precisely the reason that one can view a particular dynamical system determined by an external input vector $I$ as reflecting the outside world described by $I$. Within a particular network architecture in which no feedback lines appear and with a process model using discrete time (i.e., so-called feed-forward ANNs), by definition only initial values of input nodes are required, since future activation values of any other nodes are determined solely by functions of the input values. In contrast, a process model using continuous time and based on differential equations, even with a feed-forward architecture, requires all activations to have an initial value. Of course, this results from the fact that differential equations are not determined with some initial values missing. Even so, the initial values of non-input nodes may be reset to zero or some other conventional value each time the network is run. However, resetting is not very plausible from a biological viewpoint, though mathematical analysis becomes much easier with such a conventional procedure.

To imagine the dynamics, suppose that (almost) every initial value lies in the basin of some point attractor. Suppose further that one wants to establish a mapping from inputs to attractors. Moreover, suppose that resetting of initial values is not done. Then the following scenario may happen:

- After the first input vector is fed in, the activation is in some initial state. If this state is in the basin of some stationary state $p = (p_1, \ldots, p_n)$, the initial dynamics (based on the input vector) leads to a trajectory (starting from the initial state) that will approach $p$.
- Feeding in a second input vector (unequal to the first one) disturbs the activation dynamics such that $p$ is generally no longer a stationary state of the new dynamics. Assume that $q$ is a (new) attractor corresponding to the new dynamics, while $p$ still lies close enough to be in the basin of $q$. Then the activation vector moves along a trajectory leading towards $q$.
- Suppose that a third input vector is injected which is equal to the very first one. In that case the system jumps back to the initial activation dynamics. However, the system still tends to move along a trajectory based on $q$, rather than a trajectory based on the initial activation state.
- Now, while there is in general no reason to suppose that the initial activation state and $q$ lie within the same basin of the attractor $p$ of the initial dynamics, the activation state will evolve to some new attractor $r \neq p$.

It is clear that in this way no required mapping will be established. Obviously, such a network cannot be used as a classifier for the input vectors $I_i$, or as an
associative content-addressable memory from which stored items can be retrieved by injecting a (maybe partially given) input vector used previously. Instead it behaves like a drunkard attracted by any newly appearing stationary state.

In passing we mention only a less widely used alternative to clamped external inputs. The alternative is to give $I(t)$ a single-pulse character, i.e., specified during a particular time interval and clamped afterwards to some conventional value. In that case the dynamics during the time interval is different from the dynamics afterwards. Although we do not treat such networks in any detail, it may be of interest to know that they may be used to mould a particular dynamics by shooting the system during a particular time interval towards a particular region of activation space (say a basin of an attractor) using suitable input vectors. When the input pulse is shut off, the location of the activation vector $a(t)$ in activation space guarantees a 'free-wheeling' movement towards the nearest attractor. Such networks do not depend on initial activation vectors, provided the 'guns' of external input vectors do their work properly. That is, resetting activation values is no issue here.

In the next section some convergence issues will be dealt with and the importance of so-called Liapunov functions will be explained. In connection with these issues the main types of ANN-models will be characterized.

5 Convergence and Liapunov Functions

Nearly all networks used as models for ANNs are convergent (or, if not really known, assumed to be so). Especially the frequently used feed-forward networks are convergent. As said earlier, this holds in particular for networks running in discrete time. However, networks running in continuous time corresponding to equation (1) require the condition $\partial G_i / \partial a_i = 0$, i.e., $G_i$ is purely longitudinal along surfaces in activation space. The class of additive nets (equation (3)) is known to be convergent in certain cases, namely:

1) if the weight matrix $W = [w_{ij}]$ is symmetric,
2) if the state transition functions $G_i$ are of a special algebraic form,
3) if the derivatives $o_j'$ and the weights fulfil certain inequalities, and
4) finally, compositions (or networks) of convergent networks (so-called cascades; see Section 6) are sometimes provably convergent.

Each of these cases will be discussed in somewhat more detail. In view of the particular information-processing role of convergent networks, these cases are of major practical interest too.

5.1 Robust and Simple Stationary States

Convergence of activation dynamics is not always easy to prove. However, in practice somewhat weaker conditions are useful too, and frequently less difficult to verify. These conditions often guarantee at least that no cycles or recurrent
trajectories will be found, or that such nonconvergent orbits are not stable enough to be observed.

Bounded Dynamics

Without going into much detail, some of the conditions relevant for dynamical systems (appearing as models for ANNs) will be outlined. For any of the systems considered below it is assumed that there is a bounded set $\Gamma$ attracting all trajectories. Essentially, it means that after a specific time point in their evolution all states of all trajectories can be proven to be elements of this set. Another way of expressing this is that every trajectory $a(t)$ approaches a nonempty, closed, bounded, connected set of limit points (roughly the end-points of all trajectories). If $a(0) = q$, the set of limit points is, by definition, contained in $\Omega(q)$, the so-called limit set of the point $q$. Clearly, all points on the orbit of $q$ (i.e., the trajectory starting at the point $q$) share the same limit set. The limit set is an invariant under the activation dynamics, meaning that if $b(t)$ is a trajectory starting at a point $b(0) \in \Omega(q)$, then $b(t) \in \Omega(q)$ for all $t$ for which $b(t)$ is defined. Now, convergence of a trajectory is equivalent to a singleton limit set (one stationary state).

Stability

Of particular importance are stable stationary states. Stationary states $p$ for a vector field $H$ are characterized by $H(p) = 0$, i.e. the states $p$ are zeros of the field $H$. However, stable stationary states require additionally that every eigenvalue of the linearized field $DH(p)$ has a negative real part. When $p$ is stable, the basin of $p$ is the union of all trajectories tending to $p$. Furthermore, all trajectories lying in the basin approach $p$ at an exponential rate. Moreover, when $p$ is stable then it is also robust, meaning that for small perturbations of $H$ the corresponding stable states are near $p$. Such robustness is especially valued in physical models, because it means that experimental measurement uncertainty is bounded by a region around $p$. In fact, the absence of robustness is seen in physics as giving rise to spurious results, which are either not observable or not meaningful at all. A more generic type of stationary state is a hyperbolic one, for which the eigenvalues of $DH(p)$ have at least nonzero real parts. Sufficiently small perturbations of $H$ lead to hyperbolic stationary states if and only if $H$ itself has only hyperbolic equilibria. As a consequence, when $p$ is a hyperbolic stationary state, then either $p$ is a stable equilibrium, or else the set of trajectories tending to $p$ forms a smooth manifold of lower dimension than the state space. A physical example of the latter case are the so-called scattering states within metals (or metallic alloys), giving rise to energy bands (i.e., smooth manifolds of energy states) through which electrons may move rather freely, thus explaining their low electrical resistance. In a sense, one can still use the concept of robustness, as shown by, for instance, shifts of these bands in the presence of one or more (metallic) impurities (as a consequence of small perturbations of the original Hamiltonian of the system). That is, a vector field close enough to $H$ must have
a hyperbolic stationary state rather near the hyperbolic equilibrium $p$ of $H$. It is of some interest that a symmetric weight matrix $W = [w_{ij}]$ in equation (3) renders the corresponding vector field Hermitian, meaning that its stationary states are at least hyperbolic.

Simplicity

A stationary state is simple if and only if the linearized field $DF(p)$ of $F$ can be inverted. This is equivalent to proving that an eigenvalue equal to zero cannot appear. Indeed, hyperbolic stationary states are simple besides robust. Being simple is the key to getting segregated equilibria, so that simpleness is a generic condition for all stationary states. Assuming bounded dynamics, the end result is a finite set of stationary states. In practice, however, vector fields may be used for which it is not quite certain that the set of stationary states is finite. However, lacking any indication to the contrary, one may often assume that the particular vector field used has a finite equilibrium set. The aforementioned dynamical concepts are quite general, though not so well known within the community of ANN-users. The same holds for a particular type of function, the Liapunov function, which plays a peculiar role in determining the convergence of trajectories of particular activation dynamics.

5.2 Liapunov Functions

A Liapunov function (henceforth denoted by L-function) is a continuous function $V$ on the state space with the property that it does not increase along trajectories. For the set of limit points of a trajectory the function equals a constant. A strict L-function strictly decreases along nonstationary trajectories. If $V$ is strict, then the limit set contains only stationary states (see e.g. Hirsch and Smale, 1974). Furthermore, one should know that any strictly increasing function of an L-function $V$ is a function of the same signature. Moreover, because of the assumption of bounded dynamics, any L-function is bounded from below, whilst the foregoing function composition guarantees boundedness again for, e.g., $e^V$. What's more, addition of a sufficiently large positive constant to a bounded L-function leads to a bounded positive L-function (which is strict if $V$ is strict).

Assuming a vector field $F$ on $R^n$ (see equation (2)) and $V$ a continuously differentiable real-valued function on $R^n$, one can show (using the chain rule) that if $a(t)$ is a trajectory, then

$d/dt \, \{V(a(t))\} = \nabla V(a(t)) \cdot F(a(t))$

where $\nabla V$ is the gradient vector field of $V$ and the dot means the usual inner product. The equality allows one to state that $V$ is an L-function if and only if $\nabla V \cdot F \le 0$ everywhere, while $V$ is strict if and only if $\nabla V \cdot F < 0$ at every point $z$ such that $F(z) \neq 0$. Suppose $F(z) \neq 0$ and set $V(z) = c$. In that case the vector $F(z)$ is transverse to the level surface $V^{-1}(c)$ at $z$, whilst pointing toward the set where $V < c$.
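The defining inequality $\nabla V \cdot F \le 0$ can be checked numerically for a candidate L-function. The sketch below is an illustration, not a proof technique: it approximates the gradient with central finite differences and tests the inequality at random sample states (all names and the toy example are ours).

```python
import numpy as np

def looks_like_liapunov(V, F, samples, eps=1e-6):
    """Numerically test grad V . F <= 0 at a set of sample states."""
    for x in samples:
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = eps
            g[i] = (V(x + e) - V(x - e)) / (2 * eps)   # central difference
        if g @ F(x) > 1e-8:                            # tolerance for round-off
            return False
    return True

# Toy example: V(x) = |x|^2 / 2 is a strict L-function for dx/dt = -x.
rng = np.random.default_rng(3)
ok = looks_like_liapunov(lambda x: 0.5 * x @ x,
                         lambda x: -x,
                         rng.normal(size=(100, 3)))
```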
Because of the boundedness (from below) of L-functions, one can conclude that a strict L-function directs every trajectory to approach asymptotically a set of stationary states. Technically, the system is quasiconvergent, meaning that the velocity of (or tangent vector to) every trajectory tends to zero: the trajectory appears to converge. A strict L-function excludes cycles (or recurrent trajectories). If the set of equilibria is finite or countably infinite, then a strict Liapunov function guarantees every trajectory to have a unique equilibrium, so that the system is indeed also convergent. Even if strictness of an L-function is not given, it may be possible to prove quasiconvergence. Using the fact that the limit set of a trajectory must be contained in the largest invariant set in which the L-function is constant over orbits, one may be able to demonstrate by LaSalle's invariance principle (LaSalle, 1968) that this invariant set consists purely of stationary states. Hence, the system will be quasiconvergent. Again, if this set is discrete, then the system is convergent. In this way Golden (1986) was able to prove convergence of Anderson's Brain-State-In-A-Box ANN-model (a well-known model which is, however, not treated in this book). Although there is no general method for finding L-functions, the following findings may prove useful in constructing one.

A. Usual Physical Systems and Gradient Systems

Liapunov functions are sometimes called energy functions, because for dissipative mechanical systems energy is a strict L-function. However, entropy (used in classical thermodynamic systems) is also a strict L-function. For a gradient system $da_i/dt = -\partial U/\partial a_i$, the real-valued function $U$ on the state space is a strict L-function. For this reason adaptive learning systems use an error function, which behaves as an L-function for the weight dynamics. By minimizing the error function one is searching for a (local) minimum in the error surface (or error landscape). In learning algorithms for adapting weights, many approximations to gradient descent on the error function have already been used. In case a vector field $F$ can be written as the product of a positive continuous function on the state space and another vector field $G$, for which an L-function $V(a)$ exists, one can show that $V(a)$ acts also as an L-function for $F$. In this case the trajectories of $F$ are re-parametrizations of those of $G$, i.e. the trajectories are only rescaled in time.

B. Survival-of-the-Competitively-Fittest Systems

For systems of the form

$dx_i/dt = a_i(x)\left[b_i(x_i) - \sum_k c_{ik} d_k(x_k)\right] = F_i(x)$   (5)

where the factor $a_i > 0$, the constant matrix $[c_{ik}]$ is symmetric, and $d_k' > 0$, Cohen and Grossberg (1983) have constructed Liapunov functions. Here, the matrix element $c_{ii}$ can be taken to be equal to zero, since the term $c_{ii} d_i(x_i)$ can be absorbed into $b_i(x_i)$. Interestingly, these systems are a generalization of ecological
systems describing interacting species having symmetric community matrices (Gause-Lotka-Volterra systems). Many specializations of (5) have been used for particular ANN-models, with $x_i$ the activity level of node $i$, $d_k(x_k)$ the output of node $k$, $c_{ik}$ the weight (or strength) of the connection from node $k$ to node $i$, and $a_i(x)$ an amplification factor. If all $x_j$ and $d_j$ have positive values, then the connection from node $k$ to node $i$ is inhibitory if $c_{ik} > 0$, while being excitatory if $c_{ik} < 0$. The summation in (5) represents the net input to node $i$. Assuming the amplification factor to have a positive value, the activity of node $i$ decreases if and only if the net input to node $i$ exceeds a certain intrinsic function $b_i$ of the node's activation. One may think about the nodes as competing among themselves in case all connections between different nodes are inhibitory. The competition is then modulated by the state-dependent amplification factors $a_i$, the self-excitation rates $b_i$, and the inhibitory interactions $c_{ik} d_k$. Cohen and Grossberg discovered the following L-function for system (5):

$V(x) = -\sum_i \int_0^{x_i} b_i(\xi)\, d_i'(\xi)\, d\xi + \frac{1}{2} \sum_{j,k} c_{jk}\, d_j(x_j)\, d_k(x_k)$   (6)

Moreover, they showed that if $a_i > 0$ and the derivatives $d_k' > 0$, then $V$ is a strict Liapunov function, so that the system is quasiconvergent. In some general circumstances quasiconvergence could also be proved for particular cases by means of LaSalle's invariance principle.

C. Hopfield's Neural System with Graded Response

Hopfield (1984) has provided, quite independently, the same Liapunov function for a special case of (2) with

$F_i(x) = -c_i x_i + \sum_j T_{ij}\, g(x_j)$   (7)

where $[T_{ij}]$ is a constant symmetric matrix and the derivative $g' > 0$. In that case the L-function is

$V(x) = -\frac{1}{2} \sum_{i,j} T_{ij}\, g(x_i)\, g(x_j) + \sum_i c_i \int_0^{x_i} \xi\, g'(\xi)\, d\xi$   (8)

which in Hopfield's electrical circuit interpretation is precisely the energy. On the whole, all these findings suggest a deep analogy between the concept of energy in physics and strict Liapunov functions forcing every trajectory in activation space to approach asymptotically a set of equilibria, so that the dynamics is at least quasiconvergent. In fact, there is every reason to suspect that the special algebraic form required of the state transition functions may profit from quite general considerations about Hamiltonians in physics. Also noteworthy, it has been shown (Golden, 1988) that a broad class of networks with strict L-functions can be interpreted as maximizing a posteriori estimates of the probability distribution of the input-output pairs representing the actual environment of the system. This lends even somewhat more credibility to our earlier statement that ANNs are to be seen as associative devices optimally closing the environmental loop.
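As an illustration of case C, the following sketch evaluates the energy (8) for the choice $g = \tanh$ along an Euler-integrated trajectory of (7) with a symmetric $[T_{ij}]$; up to discretisation error the recorded values should be non-increasing. The sigmoid choice, step size, weights and all names are our own assumptions, not Hopfield's original circuit.

```python
import numpy as np

g  = lambda x: np.tanh(x)             # a sigmoid with g' > 0
gp = lambda x: 1.0 - np.tanh(x) ** 2  # its derivative

def energy(x, T, c):
    """Energy function (8); the integral term per node is evaluated
    with a simple trapezoid rule."""
    quad = np.empty(len(x))
    for i, xi in enumerate(x):
        s = np.linspace(0.0, xi, 200)
        quad[i] = np.trapz(s * gp(s), s)
    return -0.5 * g(x) @ T @ g(x) + c @ quad

rng = np.random.default_rng(4)
n = 5
T = rng.normal(size=(n, n))
T = 0.5 * (T + T.T)                   # symmetric, as equation (7) requires
c = np.ones(n)
x = rng.normal(size=n)
energies = []
for _ in range(2000):
    x = x + 0.01 * (-c * x + T @ g(x))   # Euler step of equation (7)
    energies.append(energy(x, T, c))
monotone = np.all(np.diff(energies) <= 1e-4)  # non-increasing up to discretisation error
```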
5.3 A Remarkable Inequality

The introduction to Section 4 mentioned a quite peculiar circumstance in which certain types of additive networks can be shown to be convergent, namely in case the derivatives of the output functions and the weights satisfy certain inequalities. These types of additive networks are of the form

$dx_i/dt = a_i(x)\left[b_i(x_i) - C(x_1, \ldots, x_n)\right]$   (9)

with $a_i > 0$ and $\partial C/\partial x_i > 0$ for all $i$ (or $\partial C/\partial x_i < 0$ for all $i$). One should observe that $b_i$ is a function of $x_i$ only. Equally important, the function $C$, which maps from $R^n$ to $R$, is independent of $i$. For $\partial C/\partial x_i > 0$ these systems are competitive systems in which the competition between the $x_i$ is mediated by the scalar field-like quantity $C(x)$ based on the interaction of all $x_i$. Although no L-functions are known for systems like (9), Grossberg (1978, 1982) has shown that piecewise monotonicity of the $b_i$ renders equation (9) convergent. Furthermore, without such monotonicity the system can still be shown to be quasiconvergent. A specialization of (9), and a nice example, is given by

$dx_i/dt = r_i x_i \left[B_i - x_i - K \sum_j o_j(x_j)\right]$,   $0 < x_i < B_i$

where $o_j' > 0$, while $r_i$, $B_i$ and $K$ are positive constants. In this special type of network the summation of the neural outputs uses no weighting (there is a uniform weight of $-K$ for all interconnections between nodes) to provide the net input to any node $i$ from all nodes of the network (including $i$ itself). Due to the sign, all connections are inhibitory (i.e., including the self-excitations). Breaking all interconnections (and thus competitive, inhibitory interactions) by putting $K = 0$ reduces the equation to

$dx_i/dt = r_i x_i [B_i - x_i]$,   $0 < x_i < B_i$

That is, by shutting off the inhibitory field, any activation $x_i$ is now allowed to reach its upper bound $B_i$. Interestingly enough, systems of type (9) are well known in physics, e.g. as multiple-scattering type equations, where the scalar field consists of linear combinations of weighted products of regular functions (like e.g. spherical Bessel functions) plus a product (representing the self-excitation) of a regular and non-regular function (like e.g. an outgoing spherical Hankel function). As multiple-scattering equations (see e.g. Braspenning and Lodder, 1994; Braspenning et al., 1982, 1984) presumably represent cooperative systems, we have an active interest in exploring their connection to Grossberg's results. Moreover, such a connection may yield a Liapunov function that up till now has been considered unknown.
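A small simulation of the uniformly weighted competitive example above (Euler integration; the logistic output function and all constants are illustrative choices of ours):

```python
import numpy as np

def competitive_step(x, r, B, K, o, dt=0.01):
    """One Euler step of dx_i/dt = r_i x_i [B_i - x_i - K sum_j o_j(x_j)].
    With K = 0 the inhibitory field is shut off and each x_i relaxes
    logistically towards its upper bound B_i."""
    field = K * np.sum(o(x))          # the scalar field mediating competition
    return x + dt * r * x * (B - x - field)

rng = np.random.default_rng(5)
n = 6
x = rng.uniform(0.1, 0.9, n)
r, B = np.ones(n), np.ones(n)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))   # same output function for all nodes
for _ in range(5000):
    x = competitive_step(x, r, B, K=0.3, o=sig)
```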
5.4 A Criterion for Global Asymptotic Stability

Returning again to the shorthand equation (2), but writing it now with its explicit dependency on clamped external inputs as:

$$dx_i/dt = F_i(x_1, \ldots, x_n, I_1, \ldots, I_n), \qquad i = 1, \ldots, n$$

one can consider the issue of global asymptotic stability. This system is globally convergent if there is a unique stationary state to which it converges for $I = (I_1, \ldots, I_n)$. Moreover, if this equilibrium is stable then the system is called globally asymptotically stable. In case the system is globally convergent for any input vector $I$, one is not bound to specify the initial values of the $x_i$, since all trajectories approach the same unique stationary state, depending only on the particular $I$ injected. It is clear that such a system has the ability to map the space of input vectors to the space of activation vectors. Additionally, for e.g. systems running in real-time it is no longer necessary to reset the activations when input vectors are changing².

A quite simple condition for global asymptotic stability of the dynamical system $dx/dt = F(x)$ is given by the following inequality:

$$\langle Af \mid f \rangle \le -\mu \langle f \mid f \rangle \quad \text{for all } f \in R^n$$

for each Jacobian $A = DF(x)$ and a constant $\mu > 0$, where $\langle \cdot \mid \cdot \rangle$ denotes the inner product and $\langle f \mid f \rangle$ is the squared Euclidean norm. The inequality can be proved by a Taylor expansion of $F$, while estimating the distance between two solutions $x(t)$ and $y(t)$ assuming sufficient closeness of $x(0)$ and $y(0)$ (Hirsch, 1989). Without explicit proof it is still useful to know that this condition is equivalent to the largest eigenvalue of the Hermitian matrix (or symmetrized matrix in case of real values) $\frac{1}{2}(A + A^T)$ being $\le -\mu$, where $A^T$ is the transposed matrix. This condition guarantees that $x(t)$ converges to a stationary state, which is asymptotically stable.

Systems which are globally asymptotically stable have a strict L-function. However, to construct this function one should first know about the global asymptotic stability, so the function is not helpful in establishing that property in the first place. Yet, it may prove useful in determining convergence of layered or cascaded networks (see Section 6) in which this type of network appears as a subnetwork or component (i.e., node), respectively. The condition for global asymptotic stability can be applied to the general form for additive networks

$$dx_i/dt = -c_i x_i + \sum_j W_{ij}\, o_j(x_j) + I_i = F_i(x_1, \ldots, x_n)$$

Qualitatively it is not difficult to understand that the aforementioned condition for global asymptotic stability can be ensured by using output or transfer functions $o_i$ with gains $o_i'$ which are relatively small compared to the self-inhibitions $c_i$.

² See e.g. Kelly, D.G., Stability in contractive nonlinear neural networks. IEEE Transactions on Biomedical Engineering.
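As a rough numerical illustration of the eigenvalue form of this condition, the sketch below (Python; all sizes and parameter values are illustrative assumptions) builds the Jacobian of the additive network above at the maximal gain $\sigma$ of the output functions and inspects the largest eigenvalue of $\frac{1}{2}(A + A^T)$. Strictly speaking the condition must hold for every Jacobian $DF(x)$; evaluating only at the maximal gain is a heuristic shortcut, not a proof.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 5
    c = np.full(n, 2.0)                 # self-inhibitions c_i
    W = 0.3 * rng.normal(size=(n, n))   # weight matrix W_ij
    sigma = 1.0                         # maximal gain of o_j = tanh (o' <= 1)

    # Jacobian of F_i = -c_i x_i + sum_j W_ij o_j(x_j) + I_i is
    # A_ij = -c_i delta_ij + W_ij o_j'(x_j); here the gains are bounded by sigma.
    A = -np.diag(c) + W * sigma
    H = (A + A.T) / 2.0
    largest = np.max(np.linalg.eigvalsh(H))
    print("largest eigenvalue of (A + A^T)/2:", round(largest, 3))
    print("criterion satisfied (< 0)?       :", largest < 0.0)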
An alternative formulation can be given in terms of the weight matrix $W$: ensure that each diagonal element $W_{ii}$ (self-weight) is much more negative than the absolute values of the other weights (i.e., non-diagonal elements $W_{ij}$) on input lines connected to node $i$. Although such a condition may be in conflict with other, possibly architectural constraints, or with a concrete algorithm used for calculating the weights, it is at least helpful in knowing what to expect. Moreover, assuming uniformly bounded weights and gains,

$$0 < o_i' \le \sigma, \qquad W_{ii} \le \beta, \qquad |W_{ij}| \le \delta, \qquad 0 < c \le c_i$$

and accounting for the fact that the connectivity $m$, i.e. the maximum number of other nodes any node is connected to, also determines the relative strength of the net input compared to the self-inhibiting contribution (depending on $c_i$), the aforementioned condition can be used to infer that

$$\sigma(\beta + m\delta) < c$$

guarantees global asymptotic stability for general additive networks. Interestingly, this reformulated condition is independent of the number of nodes actually used for the ANN, while depending only on local properties of the network. Hirsch remarks that Kelly (see previous footnote) gives a different criterion for global asymptotic stability of a somewhat less general additive network, namely specialized to $c_i = 1$ for all $i$. However, he also notes that, as every eigenvalue of $\frac{1}{2}(W + W^T)$ is bounded in absolute value in terms of the matrix elements $W_{ij}$, one can infer again that the (upper) boundedness of the gains $o_i'$ and matrix elements $W_{ij}$ (as mentioned before) ensures a particular balance between the self-inhibitory and other inhibitory contributions to the node's activation. This balance then secures that e.g. the squared distance $\langle (x - p) \mid (x - p) \rangle$ is a strict L-function for additive networks, since it decreases strictly along the nonstationary trajectory $x(t)$ towards the unique stationary state $p$.
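The local character of this condition makes it easy to check mechanically. The following sketch (Python; the weight matrix, decay rates and gain bound are illustrative assumptions, and the reading of $\sigma$, $\beta$, $\delta$, $m$ and $c$ follows the reconstruction of the bounds given above) extracts the bounds from a concrete additive network and tests $\sigma(\beta + m\delta) < c$.

    import numpy as np

    def local_stability_check(W, c, sigma):
        # beta bounds the self-weights W_ii, delta the off-diagonal weights,
        # m is the maximal number of other nodes any node is connected to,
        # and c is taken as the smallest self-inhibition.
        off = W - np.diag(np.diag(W))
        beta = np.max(np.diag(W))
        delta = np.max(np.abs(off))
        m = int(np.max(np.sum(off != 0.0, axis=1)))
        return sigma * (beta + m * delta) < np.min(c)

    W = np.array([[-0.5,  0.1,  0.0],
                  [ 0.2, -0.4,  0.1],
                  [ 0.0,  0.1, -0.6]])
    c = np.array([2.0, 2.0, 2.5])
    print(local_stability_check(W, c, sigma=1.0))   # True for this network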
In the next section some convergence issues will be globally treated for networks consisting of sub-networks (thus acting as layers) and those consisting of super-nodes, which are ANNs too (thus establishing cascades). Interestingly, these convergence issues can often be dealt with by using only the results previously mentioned for simple networks.

6 Layered Networks and Cascades

Layered networks appear very often within the field of ANNs, since they generally allow one to retain a particular functionality without being forced to full connectivity of all nodes. A particular class of so-called feed-forward networks is, in fact, quite popular because they are conceptually rather simple; that is, each sub-network is feeding its output to (the nodes of) a subsequent sub-network or layer. Obviously, it would be nice if one could analyze the dynamics of the network in terms of the dynamics of the sub-networks. Although such analysis is possible in many cases, the results are not always as intelligible as one would expect, and sometimes even counter-intuitive. Therefore, emphasizing only the basic concepts, the following account gives some results without going into much technical detail.

A network with a layered structure consists of sub-networks $\pi_j$, each of which is feeding its output only to the next $\pi_{j+1}$. A generalization of such a structure is a cascaded network. When $\pi_0$ and $\pi_1$ are separate networks and some nodes of $\pi_0$ feed their outputs to nodes of $\pi_1$ by means of new connections, one gets a larger network $\Lambda$ (a cascade of $\pi_0$ and $\pi_1$). When again outputs from $\Lambda$ are fed into a third separate network $\pi_2$, one gets a larger cascade (a cascade of $\Lambda$ into $\pi_2$). It is clear that a cascade can be seen as a feed-forward (super-)network consisting of super-nodes, which are themselves networks.

An important concept is the reducibility of a network. A network is called irreducible if any two distinct nodes take part in a loop of (directed) connections, such that every node can influence the output of every other node (directly or indirectly). Absence of irreducibility means a reducible network (for example feed-forward networks or cascades, which are reducible to one-node networks). A maximal irreducible sub-network of a given network is called a basic sub-network. Even without proof it is not very difficult to see that any reducible network can be partitioned in such a way that the network is a cascade whose components are basic sub-networks of the reducible network. Furthermore, the possible irreducibility of a network represented by equation (1) can be expressed in terms of the irreducibility of the weight matrix $W$. Irreducibility of a square matrix means that one can construct, for any pair of distinct labels, a path of successive labels between such a pair, such that along this path every pair of subsequent labels denotes a matrix element unequal to zero. Formulated differently, the linear mapping expressed by the matrix has no invariant subspace which is proper and nontrivial. Another useful formulation is: there is no similarity transformation (based on permutations of coordinates) that will bring the matrix in one-upper-block quasi-diagonal form. Testing a weight matrix for irreducibility may be done by constructing the flow chart of the network and searching for loops of directed edges between any two distinct nodes $i$ and $j$. A consequence of reducibility is that the matrix can be brought in block triangular form with square matrices along the main diagonal and only zeros at some upper or lower part of the diagonal. As a matter of convention, often the lower block triangular form is chosen, such that the upper right part of the matrix above the block diagonal contains only zeroes.

In the following we will treat some pertinent results of studies about convergence of cascades and layered networks. Although no proofs are given (however, see (Hirsch, 1989)), these results are presented as guidelines that may help in constructing actual models of ANNs with particular dynamical properties.
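Testing irreducibility amounts to a reachability computation on the directed graph of nonzero weights. A minimal sketch (Python; the two example matrices are illustrative assumptions): the network is irreducible exactly when every node can reach every other node along edges with nonzero weight.

    import numpy as np

    def is_irreducible(W):
        # Edge k -> i whenever W[i, k] != 0; compute the Boolean transitive
        # closure and check that every node reaches every other node.
        n = W.shape[0]
        adj = (W != 0.0).astype(int)
        reach = adj.copy()
        for _ in range(n):
            reach = np.minimum(1, reach + reach @ adj)
        return bool(np.all(np.minimum(1, reach + np.eye(n, dtype=int))))

    ring = np.roll(np.eye(4), 1, axis=1)        # directed 4-cycle: irreducible
    feedforward = np.triu(np.ones((4, 4)), 1)   # strictly triangular: reducible
    print(is_irreducible(ring), is_irreducible(feedforward))   # True False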
6.1 Globally Stable Cascade Network

A cascade is globally asymptotically stable if each network in the cascade has this dynamical property. Formulated in terms of vector fields with parameters (modelling systems with inputs), one may introduce $F$ as a vector field on $R^m$ and the vector field $G$ on $R^n$ with input parameter from $R^m$ ($G$ a mapping $R^m \times R^n \to R^n$) to obtain the dynamical system

$$dx/dt = F(x), \qquad dy/dt = G(x, y)$$

which is a cascade of the two systems $dx/dt = F(x)$ and $dz/dt = G(p, z)$, with $p$ being a parameter for the latter system representing the input from the former system. Iteration, of course, leads to more complex cascades. Even so, based on the sketched procedure one can formulate a dynamical system on the Cartesian product state space $E^0 \times \cdots \times E^s$, with $E^k$ a Euclidean space of arbitrary dimension, $x^k$ a vector in $E^k$, and a vector field $F^k$ on $E^k$ with inputs from $E^0 \times \cdots \times E^{k-1}$, as the mapping $F^k : E^0 \times \cdots \times E^k \to E^k$ by

$$(dx/dt)^0 = F^0(x^0), \qquad (dx/dt)^j = F^j(x^0, \ldots, x^j), \qquad j = 1, \ldots, s$$

This system is globally asymptotically stable if, for every parameter value, each of the components is globally asymptotically stable. Furthermore, if any of the lower level cascades can only be proven to be convergent, whilst a higher level is globally asymptotically stable, then the resulting cascade of systems is also convergent (a similar result holds for almost convergence, meaning that it is very unlikely to observe a nonconvergent trajectory). These results appear to hold for discrete systems too.

6.2 Liapunov and Cascade Convergence

Not every cascade of convergent systems is also convergent. To get an idea of what would be required to make the cascade convergent: if the initial component or super-node (see e.g. $F$ above) is convergent, and for successive vector fields (like $G$ above) a strict L(iapunov)-function is given for each of its stationary states, while it is known that these vector fields have only a finite number of stationary states, then the cascade can be proven to be convergent. Similar conditions lead to almost convergence of the cascade.
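A toy cascade makes the definition concrete. In the sketch below (Python; the particular choices of $F$ and $G$ are illustrative assumptions) the lower-level system converges to $x^* = 1$ for any initial value, and for each frozen parameter $p$ the higher-level system $dz/dt = G(p, z)$ converges to $z^* = p/2$; the cascade therefore settles at $(1, 0.5)$.

    # Lower-level system dx/dt = F(x): globally stable equilibrium x* = 1.
    def F(x):
        return 1.0 - x

    # Higher-level system dy/dt = G(x, y): for clamped input p the system
    # dz/dt = G(p, z) is globally asymptotically stable with z* = p / 2.
    def G(x, y):
        return x - 2.0 * y

    x, y, dt = 5.0, -3.0, 0.01
    for _ in range(3000):
        x, y = x + dt * F(x), y + dt * G(x, y)   # Euler steps of the cascade
    print(round(x, 4), round(y, 4))              # -> 1.0 0.5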
6.3 Additive Cascades

Of special interest are additive cascades of networks in which functions of the outputs of component networks are added to the input nodes of subsequent networks in the cascade. If we take a collection of networks $\pi_j$, each of the Cohen-Grossberg form of equation (5) (see subSection 5.2), then we can build a cascade of such systems as follows: fix $j$ and let $y$ denote the activation vector of $\pi_j$, so that the dynamics must be of the form

$$dy_i/dt = a_i(y_i)\left[ b_i(y_i) - \sum_k c_{ik}\, d_k(y_k) \right] + h_i(z^j) \qquad (10)$$

Here, $z^j$ is a vector whose components are the activations of the nodes in the (lower-level) networks $\pi_1, \ldots, \pi_{j-1}$. We remark explicitly that $a_i$ is a function of $y_i$, while we assume as before that $a_i > 0$, $d_k' > 0$ and $[c_{ik}]$ is a symmetric matrix. Giving $h_i(z^j)$ explicitly the status of a parameter $r$, equation (10) can be reformulated as

$$dy_i/dt = a_i(y_i)\left[ B_i(y_i) - \sum_k c_{ik}\, d_k(y_k) \right] = G_i(r, y) \qquad (11)$$

with $B_i(y_i) = b_i(y_i) + r/a_i(y_i)$. Consequently, as this equation is in the form of equation (5) for each fixed $r$, the Liapunov function of equation (6) can be applied to give a function $V(r, y)$. This is for each fixed $r$ a strict Liapunov function for $G(r, y)$. As a result, it can be proved that if the vector fields $G_i$ and L-functions $V$ happen to be continuously differentiable functions in both variables, then the cascade has a strict L-function. Practically, this means that the functions $a_i$, $b_i$, $d_i$ and $h_i$ should be continuously differentiable. In fact, there is an important generalization with the usual conditions, except that $[c_{ij}]$ is merely required to be in block triangular form, while the blocks down the diagonal should still be symmetric. Even then the activation dynamics can be shown to have a strict continuously differentiable L-function.

6.4 An Extra Useful Ordering Condition

Without Liapunov functions or the use of global asymptotic stability it is much more difficult to prove that a cascade of convergent networks is convergent itself. A way in which one may proceed is building cascade systems in which the stable stationary states in basins of lower-level networks (i.e., in earlier stages in the cascade) are approached at much faster convergence rates than stationary states in the higher-level networks (i.e., in later stages). Technically, it can be arranged by ensuring that at the earlier stages the eigenvalues of the linearized vector fields at particular stable stationary states (say $x = p$ for $DF(p)$) have a more negative real part than the eigenvalues at particular stable states (say $y = q$ for $DG(p, q)$) of later stages. We remark only that with this additional condition nearly any initial state of the cascade network of type (11) can be shown to belong to a basin of a stable stationary state.

A useful example is given by a two-layer network $\pi$. Each layer is assumed to be a recurrent additive network, whilst the second layer $\pi_2$ is not sending inputs to the first network $\pi_1$. The dynamics of the cascade of $\pi_1$ into $\pi_2$ can be written as

$$dx_i/dt = -c_i x_i + \sum_j W_{ij}\, o_j(x_j + s_j) + I_i = F_i(x) \qquad (12)$$

$$dy_k/dt = -c_k y_k + \sum_i U_{ki}\, r_i(y_i + s_i) + \sum_m V_{km}\, o_m(x_m + s_m) = G_k(x, y) \qquad (13)$$

with weights $W_{ij}$ in the first layer, $U_{ki}$ in the second layer, and weights $V_{km}$ for connections from the first to the second layer. In the first layer the activation functions are $o_j(x_j)$, and in the second layer $r_i(y_i)$. Now, by invoking the additional condition about the ordering of (the real parts of) the eigenvalues belonging to a stationary state of the $x$-dynamics with respect to those of the $y$-dynamics, one can show that almost every initial value of the activation dynamics of the network $\pi$ will be located in the basin of a stable stationary state. Clearly, the additional ordering condition can be used in constructing a particular type of additive network.
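The ordering condition can be checked numerically once a stationary state is known. The sketch below (Python; the sizes, decay rates, random weights and the choices $o = r = \tanh$ with offsets $s = 0$ are illustrative assumptions) linearizes (12) at $x = p$ and (13) at $y = q$ and tests a strong form of the ordering, namely that every eigenvalue of $DF(p)$ lies to the left of every eigenvalue of the $y$-part of $DG(p, q)$ in the complex plane.

    import numpy as np

    rng = np.random.default_rng(3)
    n1 = n2 = 3
    c1, c2 = 4.0, 1.0                   # fast first layer, slow second layer
    W = 0.2 * rng.normal(size=(n1, n1))
    U = 0.2 * rng.normal(size=(n2, n2))

    def ordering_holds(p, q):
        gain = lambda z: 1.0 - np.tanh(z) ** 2      # derivative of tanh
        DF = -c1 * np.eye(n1) + W * gain(p)         # column j scaled by o_j'(p_j)
        DG = -c2 * np.eye(n2) + U * gain(q)         # y-part of the Jacobian of G
        return np.max(np.linalg.eigvals(DF).real) < np.min(np.linalg.eigvals(DG).real)

    print(ordering_holds(np.zeros(n1), np.zeros(n2)))   # True: layer 1 is faster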
Here we finish our overview of models for ANNs (as typified by particular general formulae). However, the reader interested in more detailed treatments of cooperative and competitive systems should consult the recent work of Hirsch and others, which provides a wealth of theoretical findings that are not very well known in the broader community of ANN-users. Especially, cooperative irreducible systems show almost quasiconvergent dynamics (Hirsch, 1984, 1985, 1988), meaning that trajectories appearing not to converge are seldom observed, whilst cycles or any other kinds of nonconvergent orbits are unstable. This also means that cooperative systems tend to have less exotic dynamics, which is clearly advantageous in the present state of modelling ANNs.

7 Resume and Some Conclusions

After expressing the difference between classical AI systems and Artificial Neural Networks as the difference between knowledge-based and behaviour-based systems, we have stated that ANNs should not be seen as systems based on explicitly acquired (human) knowledge, but instead as complex associative devices trying to optimally close the environmental loop. For this reason they can be a very valuable addition to the range of instruments for building a knowledge-based system, though there are domains (at present unsatisfactorily covered by classical AI) in which ANNs are clearly superior to any human knowledge acquired till now. Yet, one should actively resist any kind of superficial comparison between classical AI systems and ANNs, since the power of the former systems clearly depends on the amount of explicit knowledge supplied to them. And who knows what better ways of knowledge acquisition will be found? The power of ANNs together with new computational possibilities explains, of course, the renewed interest in building such devices and the exploration of their many-faceted behaviours. We referred to their powerful ability to search in parallel, to learn from examples, even in so-called ill-structured domains, and to become much more structured so that also inherently complex concepts can possibly be represented by the network. However, we were also rather critical of many such ready-made claims, since there are still too many unsolved problems which should not be passed over light-mindedly. Having said so, the purpose of this contribution was explained to focus more on the general behavioural side of ANNs than on the knowledge which they are said to embody (however, see e.g. Braspenning, 1989).
After outlining a basic generic architecture of ANNs we have explored mainly the activation dynamics of quite general models of ANNs. The idea was that these generic models may help in assimilating the rather diversified literature, which is also scattered over many reports, journals and books. Introducing the convergent, oscillatory and chaotic dynamics which may be shown by these generic models, we have stressed the fact that presently convergent dynamics seems to provide the easiest picture of how information-processing by these systems is possible. However, we have also pointed at future discoveries about how oscillatory dynamics might be used in information processing. The reader was reminded not to forget that in principle such dynamics provides much richer possibilities for representing information, though we still do not know very well how to use these representations. Also in view of the fact that our brain has been shown to favour more oscillatory and chaotic types of behaviour, very exciting results about these kinds of activation dynamics may be expected in the near future.

Focusing on the class of convergent dynamics and introducing quite general mathematical descriptions of ANN-models, we emphasized that there probably exist deep analogies between many network-type equations still waiting to be discovered. A heuristic guide to such findings may involve using at least some different representations of the same underlying dynamical system. As far as the external inputs are concerned, we have introduced the idea that the dynamics of a system becomes fixed by clamping the external input. Only when the external input has a burst-like character, aimed at locating the initial dynamics in the neighbourhood of a basin (with an attractor, of course), are these systems not dependent on the initial activation vectors. In most cases, however, initial activation vectors are to be reset if a new external input vector is injected.

Next, we have treated (in some depth) a number of convergence issues related to special circumstances in which convergence of activation dynamics can be proven or even established by constructive methods. Many methods make use of so-called Liapunov functions, being at least non-increasing functions of the state vector along a trajectory through activation state space. We paid relatively much attention to so-called additive networks, since this type of network is conceptually rather easy and in practice very often used. It is useful to distinguish the categories of cases in which additive networks may be proven to be convergent. These were enumerated in the introduction of Section 5, whereas the rest of the paper aimed at providing a working knowledge of necessary concepts and ideas for treating these different cases. After dealing with a number of weaker conditions, like those for stable and simple stationary states in the context of bounded dynamics, the concept and use of Liapunov functions have been sketched. Furthermore, a number of ways of constructing L-functions have been outlined. Moreover, a remarkable inequality (due to Grossberg) guaranteeing convergent or quasiconvergent dynamics for a quite general class of systems was introduced. These systems can act in a competitive or cooperative mode depending on the sign of the partial derivatives of a scalar field term.
Globally convergent dynamics has been introduced, whilst showing its importance for mapping input vectors to activation vectors. Although these systems have a Liapunov function, this function can be constructed only after one already knows that the system is globally asymptotically stable. None the less, such constructed Liapunov functions may prove useful in determining the dynamics of layered networks or, even more generally, cascade systems. With respect to these cascade systems only the most basic issues were treated. We provided (without any proofs) some guidelines to know better what to expect of particular types of cascade systems. Again, additive cascades allowed us to formulate simple convergence conditions. Besides, their explicitness may help the reader in conceptualizing how these cascade systems operate. Finally, we have hinted at the fact that, if necessary, the addition of a quite intelligible condition (which enforces certain convergence rates towards stable stationary states) renders nearly all additive cascade systems convergent (or at least quasiconvergent) for nearly all initial states of the cascade. This is a rather hopeful result, as long as one is able to apply the condition in the construction of a cascade network.

As a matter of fact, in practice one often constructs networks with too many degrees of freedom, so that the dynamics of the resulting network is completely unpredictable, even categorically unknown. We have come to believe that this situation will hamper real progress in finding reliable and applicable models for ANNs. Even in a field with such a strong experimental flavour one should know that experiment alone cannot provide the general knowledge that one is looking for. The researcher and ANN-engineer has to bifurcate his activities related to ANNs in such a way that the "chaos" of possible findings is "solved" through the combination of experiment and theoretically sound knowledge of ANN-models. The suggested literature hereafter helps in getting more knowledge about the many models.

References

S.-I. Amari (1972) Characteristics of random nets of analog neuron-like elements. IEEE Transactions on Systems, Man and Cybernetics, SMC-2, 643-653.
S.-I. Amari (1982) Competitive and cooperative aspects in dynamics of neural excitation and self-organization. In S. Amari and M. Arbib (Eds.), Competition and Cooperation in Neural Nets. Springer Lecture Notes in Biomathematics 45. New York, Springer.
J.D. Aplevich (1968) Models of certain nonlinear systems. In: E.R. Caianiello (Ed.), Neural Networks, 110-115. Berlin, Springer-Verlag.
P.J. Braspenning (1989) 'Out of Sight, Out of Mind' → 'Blind Idiot': A Review of Connectionism in the Courtroom. AI Communications, Vol. 2, Nos. 3/4, 168-176.
P.J. Braspenning and A. Lodder (1994) Generalized multiple-scattering theory. Phys. Review B49, 10222-10230.
P.J. Braspenning, R. Zeller, P.H. Dederichs and A. Lodder (1982) Electronic Structure of Nonmagnetic Impurities in Cu. J. Phys. F12, 105.
P.J. Braspenning, R. Zeller, A. Lodder and P.H. Dederichs (1984) Selfconsistent Cluster Calculations with Correct Embedding for 3d-, 4d- and some sp-Impurities in Copper. Phys. Review B29, 703.
M.A. Cohen and S. Grossberg (1983) Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics 13, 815-826.
J.D. Cowan (1967) A mathematical theory of central nervous activity. Unpublished dissertation, Imperial College, University of London.
W.J. Freeman (1991) The Physiology of Perception. Scientific American, V264, nr. 2, 34-41.
R.M. Golden (1986) The 'brain-state-in-a-box' neural model is a gradient descent algorithm. Journal of Mathematical Psychology, 30, 73-80.
R.M. Golden (1988) A unified framework for connectionist systems. Biological Cybernetics, 59, 109-120.
S. Grossberg (1969) On learning and energy-entropy dependence in recurrent and nonrecurrent signed networks. Journal of Statistical Physics, 1, 319-350.
S. Grossberg (1978a) A Theory of Human Memory: Self-Organization and Performance of Sensory-Motor Codes, Maps and Plans. Prog. Theor. Biol. 5, 233.
S. Grossberg (1978b) Competition, decision, and consensus. Journal of Mathematical Analysis and Applications, 66, 470-493.
S. Grossberg (Ed.) (1982) Studies of Mind and Brain: Neural Principles of Learning, Perception, Development, Cognition, and Motor Control. Reidel Press, Boston.
M.W. Hirsch and C. Pugh (1970) Stable manifolds and hyperbolic sets. Proceedings of Symposia in Pure Mathematics, 14, 133-164.
M.W. Hirsch and S. Smale (1974) Differential Equations, Dynamical Systems, and Linear Algebra. New York, Academic Press.
M.W. Hirsch (1984) The dynamical systems approach to differential equations. Bulletin of the American Mathematical Society, 11, 1-64.
M.W. Hirsch (1985) Systems of differential equations that are competitive or cooperative. II: Convergence almost everywhere. SIAM Journal of Mathematical Analysis, 16, 423-439.
M.W. Hirsch (1988) Systems of differential equations that are competitive or cooperative. III: Competing species. Nonlinearity, 1, 51-71.
M.W. Hirsch (1989) Convergent activation dynamics in continuous time networks. Neural Networks, Vol. 2, 331-349.
J.J. Hopfield (1984) Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences U.S.A. 81, 3088-3092.
J.P. LaSalle (1968) Stability theory for ordinary differential equations. Journal of Differential Equations, 4, 57-65.
Ch. v.d. Malsburg (1973) Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100.
T.J. Sejnowski (1977) Storing covariance with nonlinearly interacting neurons. Journal of Mathematical Biology, 4, 303-321.
Choosing and Using a Neural Net

P.T.W. Hudson and E.O. Postma
Department of Computer Sciences, University of Limburg, Maastricht

1 Introduction

The range of available types of neural networks is considerable, and increasing. To those who are not immediately involved with such objects (and even to those who are) the picture is one of 'blooming, buzzing confusion' (James, 1890). The potential user needs help to decide whether neural networks can help with solving problems and, if so, which network approach is most appropriate. In this paper we have no pretensions to academic depth of coverage. Instead, we have gone for breadth, selecting some of the better known neural network architectures and ordering them in terms of dimensions we have found useful in practice. Most researchers are, not surprisingly, interested in the behaviour and advantages of their own chosen network. They are not, therefore, always the best suited to provide an objective evaluation of the whole field to a potential customer. This paper is intended to provide a very brief set of guidelines for choosing a network architecture and then for using it appropriately; we have tried to provide a 'consumer's guide' to help those whose interest is primarily as a user of neural networks rather than as one driven by scientific curiosity. Our comparisons are done by analysing some very general features of neural networks and by categorising different types of network. By concentrating upon certain widely applicable features it is to be hoped that the same approach can be generalised to new network architectures, or to those which have been excluded here, should a user wish to know more. This book, as a whole, provides some rigorous introductions to a wide range of different networks¹. We also use the term architecture to emphasise the variations in structure as well as the differences in dynamic behaviour.

Neural networks provide very powerful ways of solving certain sorts of problems. They do not, nevertheless, provide a panacea, as one might imagine when reading certain types of more popular literature. Anyone who considers that applying a neural network approach might be an interesting or effective way of solving a problem needs to be helped in knowing where to look and what to select for. It is certainly not the case that a neural network technique will ever represent the only possible way of solving a problem. Marr (1982) distinguished very clearly between three levels of analysis: the computational, algorithmic and implementational levels. The first (the problem of what is being computed and why) is independent of the algorithm chosen to tackle a computational problem (how to compute it). The third level is implementational: how is the algorithm actually carried out.

¹ Because of our wish for breadth of coverage we have included some architectures not discussed in this book.
Neural networks are, in the first instance, algorithmic approaches. This said, there are nevertheless clear interrelations between the algorithmic and implementational levels. In particular, it may well be the case that a real-life problem, even if capable of an elegant and even slightly superior solution by a conventional algorithm, is still best tackled by a hardware implementation using a neural network algorithm, because that is the only effective way we have of ever reaching the desired speed with a parallel approach. Likewise, we may be interested in types of solutions that we ourselves produce and approaches we use; biology has had a long time to reach certain solutions and to define what is computable and worth computing.

There are a number of reasons why you might consider using a neural network to help solve one or more specific problems. First one should consider whether there is a well-defined alternative technique. If so, it may be necessary to evaluate the different approaches to see which offers the most advantages. Typically many neural network architectures offer a functionality similar to statistical approaches, and it is important to recognise this. If there is a well-defined alternative it is questionable whether neural networks offer any added benefit, although there are situations where they are still attractive. In particular the neural network approach is interesting when the input data is not as clean as many statistical approaches, or even the algorithms implementing such techniques, require. A special case of this is when the total expected range of input data exceeds the allowable variation, even though most of the time the data is within limits. Technically, statistical techniques may require exclusion of such non-standard data; a neural network may well be more forgiving. Another interesting possibility arises when the underlying model is not even clear, which implies that conventional statistical approaches will be suspect. A final possible reason for choosing a neural network, albeit more speculative, is the possibility of implementing a system in parallel hardware. In this latter case general-purpose neural network hardware may soon be more accessible than hardware implementations of specific statistical approaches, and for certain classes of problem may represent the only feasible way of computing results in any reasonable time.

In general it is possible to summarise the choice of a neural network approach as follows: If you know what to do or how to do it, choose a conventional approach. If, on the other hand, you need a solution but do not know exactly what to do, or the data is a mess, then a neural network becomes an, and possibly the only, available solution. While this may seem a rather extreme position, it is one guaranteed not to miss effective solutions, if they exist at all.

2 Types of Problem

We can distinguish three basic types of problem for which neural networks are appropriate:

1. Classification-Recognition and Completion;
2. Control;
3. Constraint Satisfaction.
These categories overlap to a certain extent, so we try here to distinguish them as far as is relevant. The important issue is that these are terms for very broad classes of problems which can serve as targets to see if a problem can be described as one of such a class.

2.1 Classification-Recognition and Completion

Classification or categorisation tasks require a system to make an identification on the basis of a wide range of information which may well be incomplete. In classification we expect to have more than one instance (or token) associated with each category (or type); classification tasks reduce the input. We use the term completion to refer to the process where corrupted or incomplete patterns are converted into a recognisable form. Given only half a face, a completion system gives as output the whole face, while a classification system merely reports a label associated with that face. Completion may thus be regarded as a special type of classification in which patterns can be reconstituted. There are many statistical approaches which may be applied to classification tasks. There are also more symbolic approaches, such as are used in simple expert systems, to identify and classify entities on the basis of the possession of some set of features.

Generally neural network classifiers will have a number of neurons, commonly called the input layer, onto which the features of the stimulus set are mapped. Figure 1 shows a typical layer, a number of neurons which are not interconnected but which receive input or provide their outputs to other parts of a system.

Fig. 1. A schematic layer of neurons. Another type of layer has connections between the layer neurons as well as from or to other elements, but in this case the connections are usually inhibitory, the so-called 'winner-takes-all' networks.

The output neurons are typically organised so that there is one neuron per possible response, and the neuron that 'fires' represents the selection of that category. The output is also seen as a layer.
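As a small illustration of the winner-takes-all behaviour mentioned in the caption of Figure 1, the sketch below (Python; the inhibition strength and initial activities are illustrative assumptions) lets each neuron in a layer inhibit the others until only the most active unit survives.

    import numpy as np

    def winner_takes_all(activity, inhibition=0.2, steps=50):
        # Each neuron is suppressed in proportion to the summed activity of
        # the other neurons; iterate until a single unit remains active.
        a = np.asarray(activity, dtype=float).copy()
        for _ in range(steps):
            a = np.maximum(0.0, a - inhibition * (np.sum(a) - a))
        return a

    print(winner_takes_all([0.30, 0.50, 0.45]))   # only the largest survives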
Layers of neurons which are not either input or output, but are connected to them, are then called 'hidden' layers because of their immediate inaccessibility. Figure 2 sketches such an architecture, commonly associated with the Back-Propagation algorithm (see the contribution by Henseler).

Fig. 2. A Multi-Layer architecture. The detail of Figure 1 is replaced by a general notion of a layer. Typically neighbouring layers are completely connected together.

Figure 3 shows an architecture capable of completion. It is intended to convey the idea that all the neurons are connected together and the weights between all elements form the representations of the different 'patterns' being stored. Typically a number of images, presented as distinct inputs, would be learnt by a network, for instance by presenting each as a matrix of points, encoded as grey levels, to the input layer. When such networks are subsequently presented with a partial and/or noisy image, they should be capable of reconstituting the original version if it was one that was in the original learnt set.

Fig. 3. An interconnected network. The inputs are mapped onto all of the neurons, as are the outputs. It is possible to see such a structure as a layer, but one in which there are interconnections which are not necessarily inhibitory.
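A minimal completion sketch in the spirit of Figure 3 (Python; the Hebbian outer-product storage rule, the two small patterns and the synchronous update schedule are illustrative assumptions): a fully interconnected network stores two binary patterns in its weights and reconstitutes one of them from a corrupted probe.

    import numpy as np

    # Store two +/-1 patterns in a fully interconnected network by Hebbian
    # outer products, then complete a corrupted version of the first one.
    patterns = np.array([[1, -1, 1, -1, 1, -1],
                         [1, 1, 1, -1, -1, -1]])
    W = sum(np.outer(p, p) for p in patterns).astype(float)
    np.fill_diagonal(W, 0.0)

    state = np.array([1, -1, 1, 1, 1, -1])     # fourth element corrupted
    for _ in range(5):                         # synchronous threshold updates
        state = np.where(W @ state >= 0, 1, -1)
    print(state)                               # recovers the first pattern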
2.2 Control

A task for which conventional approaches are ill-suited is that of controlling a complex object such as a robot arm (see the contribution by Henseler). Neural network approaches can be remarkably successful in this sort of real-time, real-world task. This is because they can be trained to operate within the constraints of the world as it is found and because, especially with truly parallel implementations, they exploit the inherent parallelism of solutions which rely upon rapid progressive approximation to the solution. Whereas a conventional approach may require an analysis of trajectories and the computation of an optimal path (well known to be difficult when there are many dimensions, such as in multiply-jointed arms), a neural network approach may rely upon successive approximations employing relaxation-type techniques, for which they are eminently well suited.

Another advantage is the response to change in certain aspects of the task. For instance, if a robot arm is bent, or the hand is replaced by a larger gripper, then many (learning) neural network approaches can adapt quite quickly, whereas many conventional approaches have to start again from scratch.
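To give a feel for the relaxation style of control described above, here is a deliberately simple sketch (Python; the two-joint arm geometry, the gain and the Jacobian-transpose update are illustrative assumptions, and no neural network is involved): instead of solving the inverse kinematics analytically, the joint angles are nudged a little at a time until the hand has relaxed onto the target.

    import numpy as np

    L1 = L2 = 1.0
    target = np.array([1.2, 0.8])

    def hand(theta):
        # forward kinematics of a planar two-joint arm
        return np.array([L1 * np.cos(theta[0]) + L2 * np.cos(theta[0] + theta[1]),
                         L1 * np.sin(theta[0]) + L2 * np.sin(theta[0] + theta[1])])

    def jacobian(theta, eps=1e-5):
        J = np.zeros((2, 2))
        for j in range(2):
            d = np.zeros(2)
            d[j] = eps
            J[:, j] = (hand(theta + d) - hand(theta - d)) / (2 * eps)
        return J

    theta = np.array([0.3, 0.3])
    for _ in range(200):                                  # successive approximation
        theta = theta + 0.2 * jacobian(theta).T @ (target - hand(theta))
    print(np.round(hand(theta), 3), "target:", target)    # hand reaches the target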
2.3 Constraint Satisfaction

These are problems in which a great many different factors all apply in defining solutions. For instance, a school timetable is dependent upon the presence of teachers, rooms, and class sizes; making a good timetable requires balancing all these different constraints at the same time. Another example might involve the setting up of a flight schedule for an airline: routes, expected numbers of passengers, and types and sizes of aircraft all play a role in what can be made available at what places and at what times.

While there exist many different approaches to constraint satisfaction (e.g., linear programming, heuristic search), the great advantage of neural networks lies in their ability rapidly to come to an acceptable solution that compares favourably with traditional methods (Peterson, 1990). The major difficulty with using neural networks to solve constraint-satisfaction problems lies in the selection of a good mapping between the problem and the network architecture. Connection weights between neurons can be used to encode constraints, while the neurons themselves must also have an interpretation. For instance, the travelling salesman problem is a classic constraint-satisfaction task which may be solved using a neural network (Hopfield and Tank, 1985; Peterson and Soderberg, 1989). In such a network, however, the neurons and connections do not have an immediately obvious interpretation. Unfortunately, mapping problems onto networks still appears to be highly task-specific. Clever problem representation can enhance the performance (quality and speed of solution) significantly (Peterson and Soderberg, 1989; see our contribution Solving Optimisation Problems with Neural Networks).

These three characteristic problem areas, taken together, suggest where the power of neural networks is to be sought and what it is exactly which gives them such power and advantages over more conventional approaches. Table 1 gives an overview of a number of neural network architectures rated in terms of a range of relevant parameters. These are subdivided into six main types. Given the many architectures now available, the list is by no means exhaustive; only the better known are reported here. Hertz, Krogh, and Palmer (1989) and Simpson (1990) provide more detailed overviews of the networks summarized here as well as some others. The table is organized in terms of six different types of architecture and a number of features forming dimensions along which networks can be evaluated.

3 General Classification of Architectures

We can distinguish a number of distinct types of neural network architecture. The first, Type I, comprises those that come to mind first for many people: the Perceptron and multi-layer perceptrons such as the Madaline and Back-Propagation networks. Such architectures are exemplified by Figure 1 and by Figure 2. All these are characterised by a fairly straightforward feedforward operation. They all learn, under supervision, and usually form quite simple representations of their inputs, although the existence of a hidden layer will often complicate matters. All such 'Perceptron-like' systems can be seen as representing weighted decisions about whether to proceed (i.e. fire) or not. Feedforward systems, such as these, rely upon learning in a different mode from actual operation, such as classification or recognition (see the contributions by Peters and by Henseler).

Type II are the associative memories. These architectures store many different patterns, which can be released upon presentation of limited input data, i.e., completion tasks. Some of the systems reported here, such as the Boltzmann (see the contribution by Spieksma) and Cauchy machines (Takefuji and Szu, 1989), can also be used as non-learning optimization systems, as defined for Type VI, the constraint-satisfaction networks. Type II systems can also be seen as non-linear energy-minimizing networks. These systems search in a complex state-space, defined by their connection strengths, for good solutions (i.e., local minima).
Table 1. Overview of common neural networks. Abbreviations and core references: Perc: Perceptron (Rosenblatt, 1958); BP: Back-Propagation network (Rumelhart, Hinton, and Williams, 1986); Hop: Hopfield network (Hopfield, 1982; 1984); Boltz: Boltzmann machine (Ackley, Hinton, and Sejnowski, 1985); BAM: Bidirectional Associative Memory (Kosko, 1988); AHC: Adaptive Heuristic Critic (Barto, Sutton, and Anderson, 1983); SOFM: Self-Organizing Feature Map (Kohonen, 1984); ART1/2: Adaptive Resonance Theory network (Carpenter and Grossberg, 1987a, 1987b); ARTMAP: Adaptive Resonance Theory mapping network (Carpenter, Grossberg, and Reynolds, 1991); CALM: CAtegorizing and Learning Module (Murre and Phaf, 1992); HT: Hopfield-Tank optimisation network (Hopfield and Tank, 1985); Potts: Potts optimisation network (Peterson and Soderberg, 1989). Partly based on Simpson (1990).

Supervised learning (Types I-III):

                      Perc (I)   BP (I)     Hop (II)   Boltz (II) (A)BAM (II)  AHC (III)
input type            analog     analog     binary     binary     analog       analog
output type           binary     analog     binary     binary     analog       binary
stability-novelty     -          -          -          -          -            -
representations       local      distr.     distr.     distr.     distr.       distr.
connectivity          full       high       full       high       full         high
scalability           -          -          -          -          -            -
locality of alg.      +          -          +          -          +            +
learning speed        +          -          -          -          +            +/-
execution speed       +          +          +          -          +            +
transfer function     step       sigmoid    probab.    probab.    ramp/        step
                                            sigmoid    sigmoid    sigmoid
learning type         error      error      hebb       hebb       hebb         reinforcement
main learning         learning   l.rate,    temp,      temp,      -            l.rate,
parameters            rate       momentum   l.rate     anneal                  internal reinf.

Unsupervised learning (Types IV-V) and non-learning (Type VI):

                      SOFM (IV)   ART1/2 (V)  ARTMAP (V)  CALM (V)    HT (VI)    Potts (VI)
input type            analog      bin./anal.  analog      analog      binary     binary
output type           binary      binary      analog      analog      binary     binary
stability-novelty     -           +           +           +           n.a.       n.a.
representations       local       local       local       local       distr.     distr.
connectivity          full        high        high        high        high       high
scalability           -           modular     modular     modular     -          +
locality of alg.      -           +           +           +           +          +
learning speed        -           +           +           +           n.a.       n.a.
execution speed       +           +           +           +           +          +
transfer function     n.a.        sigmoid     sigmoid     sigmoid     sigmoid    generalized
                                                                                 sigmoid
learning type         competitive competitive competitive competitive n.a.       n.a.
main learning         l.rate,     l.rate,     l.rates,    l.rate,     slope of   critical
parameters            neighbourh. vigilance   vigilances  noise       sigmoid    temperature
Type III are reinforcement-learning networks. These systems learn in a supervised way but do not receive the detailed feedback characteristic of Type I systems. Rather, the feedback acts as a critic that responds to network output with a right (reward) or wrong (punishment) reinforcement signal (see the contribution by Van Luenen).

Type IV are the feature-mapping systems such as the Self-Organising Feature Maps (SOFMs) of Kohonen (1984). The inputs are typically mapped onto a two-dimensional surface formed by neurons. The output represents a classification. The statistical properties of the input patterns are reflected in the two-dimensional spatial organisation of neuronal activity. Distance between common components in the input (conceptual relatedness) becomes mapped onto two-dimensional distances in the output. Similarity between input patterns becomes represented as spatial proximity, in the sense that adjacent nodes tend to develop similar weight patterns. The SOFM is a sort of competitive-learning network with particular learning characteristics (see the contribution by Vrieze).

Type V systems are much more complex architectures; they involve a number of distinct parts, both functionally and architecturally. Typically there are a number of layers with different functional properties combined together to form a single, albeit complex, processing unit. These systems are inherently modular and are capable of acting as classification, recognition and memory building blocks. Figure 4 shows an example of such a building block, the CALM module, which can be combined to form complex, highly specific processing systems. Grossberg's ART architectures form other examples of such detailed and highly modular architectures (see our contribution on Adaptive Resonance Theory).

Finally, Type VI systems involve combinatorial approaches to problems without the ability to learn. Examples are the fully-connected energy-minimizing networks of Hopfield and Tank (1985) and Peterson and Soderberg (1989) (see our contribution on Optimisation Networks). In Type VI systems, solving constraint-satisfaction problems, the connections represent the constraints and the system iterates until relaxing to a solution (i.e., a pattern of activation). Such architectures may have problems with local minima, and much research work has concentrated upon ensuring that they do not.

4 Considerations for Choosing a Network Architecture

Here follows a list of the features we feel important to consider if one wishes to use a neural network approach to a specific problem. What we have done here is just to annotate those features which appear in the comparison Table.

Learning or non-learning

Although learning is often regarded as the most important defining feature of neural networks, there are situations in which learning is less important. The other interesting ability, to provide reasonable and satisfactory (but not necessarily optimal) solutions to complex problems, may be sufficient reason to choose a network.
Fig. 4. The CALM module, with distinct layers arranged in a particular way. The bottom layer forms the Representation (R-nodes), with the connection weights from the input to the R-nodes being adjusted in the learning process. The top layer are 'Veto' nodes, controlling the R-nodes and providing inputs to a 'newness detector', the E-node. Finally, external to the module, there is an excitatory random node, the Arousal or A-node, which perturbs the incoming information. This module can then provide input to other CALM modules.

Type VI systems offer powerful optimisation abilities which may compete with other forms of algorithm and implementations.

Generalisation

Does a system provide solutions which are appropriate only to inputs already known, or can it associate good answers to previously unencountered situations that are somewhat similar to encountered ones? If the full range of possible input situations is known, generalisation may not be an issue. Generalisation is an ability associated with neural networks which many consider important, but it may lead to confusion. One should distinguish between expecting existing answers to be attached to new input data (categorisation) and the alternative, generating new answers (extrapolation and interpolation). It should be noted that interpolation will probably be effective, but that neural networks provide no guarantee when it comes to extrapolation.

Input type

Inputs can be either boolean (yes/no or present/absent) or continuous (numerical). A pattern can be presented as a number of simultaneous inputs of either sort. For instance a temporal pattern may be represented as a vector of values (generally continuous but not necessarily) with sequential elements in the original pattern mapping quite directly onto the network input vector. A picture becomes a 2-dimensional vector of the same sort, but now the vector may be made up of first the sequence of the top row, followed by subsequent rows in order. Another example is encoding letters, where there are two obvious possibilities: 8 binary nodes representing the ASCII codes, or 26 binary nodes, one node for each letter. The former is highly encoded, the latter sparsely; both are vector representations. Sparse representations make the interpretation of the network's behaviour much simpler, but at the cost of more nodes. The trick is understanding how best to convert one's input into a vector format. The only advice we can give here is that this translation to a vector is crucial, because finding a good representation is at least half of the solution.
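The two letter encodings just mentioned are easy to write down explicitly. A minimal sketch (Python; the helper names are ours) contrasting the dense 8-bit ASCII vector with the sparse 26-node one-per-letter vector:

    import numpy as np

    def ascii_code(letter):
        # dense encoding: 8 binary nodes carrying the ASCII bit pattern
        return np.array([int(b) for b in format(ord(letter), "08b")])

    def one_per_letter(letter):
        # sparse encoding: 26 binary nodes, a single node per letter
        v = np.zeros(26, dtype=int)
        v[ord(letter.lower()) - ord("a")] = 1
        return v

    print(ascii_code("c"))        # [0 1 1 0 0 0 1 1]
    print(one_per_letter("c"))    # a single 1 at position 2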
Output Type

A network can provide a single output, which can serve as a classification judgement when boolean, or a real, continuous output. The alternative is a number of outputs firing simultaneously, which represents a pattern as output in the same way as a pattern can be presented as input. Basically the vector rules for the input apply to the output as well. A picture can be output as easily as it can be input.

Stability-Novelty

One crucial issue, often overlooked by protagonists, is what happens to learning when novel stimuli are presented. Some systems can suffer seriously from disruption of the original learning set (e.g., Back-Propagation networks). This need not be a problem when the totality of the learning set is known in advance and a reasonable sample is selected. It is important to see stability in the context of generalisation; highly stable systems are unlikely to generalise well, and vice versa. This is because they preserve the original memory of the learning set and have a considerable degree of inertia (e.g., ART networks).

Scalability

The issue of scalability is whether the system suffers a penalty when it grows. Many systems operate effectively in small-scale or demonstration applications, but suffer badly when scaled up. This may be due to the number of connections or their required precision. Penalties are incurred in serial implementations when the growing number of connections slows down solution time considerably. They may also, and probably more critically, be induced by the size of the problem faced; relaxation approaches in which n potential solutions compete will generally show exponential time-complexity behaviour with increasing n (Minsky and Papert, 1969; 1988).

Execution Speed

It is necessary to consider whether the execution speed is determined by running the implementation on serial hardware (which is usually the case) or whether there are inherent limitations because of the nature of the algorithm. The latter problems may be found when there are, for instance, a large number of iterations required to arrive at a solution (as in the Boltzmann machine). In general, globally determined approaches to problems, where everything can influence everything else, will be inherently slow, as everything has to wait for everything else. The more of everything there is, the worse it gets.
5 Considerations for Using a Network

The considerations above were concerned mainly with the choice of a network architecture, although even there usability issues arise. The features discussed here are more, but again not exclusively, related to issues of use rather than initial choice.

Learning Speed

If a system is to learn, it is important to know how rapidly the learning takes place. Questions which may be asked refer to such issues as the effect of the size of the learning set, the nature of the examples which are to be presented, and the ease of discrimination between examples. Rapid learning may result in lowered discrimination in the final system. What is important to consider is whether all the learning can be performed in advance or whether the ability to learn should remain. Back-propagation, for instance, is a slow learner but allows very rapid performance once learning has been completed and turned off.

Learning Algorithm

The exact nature of the learning algorithm can be very important. Some systems, such as back-propagation, take a large number of trials to learn and can suffer catastrophic failure when exposed to new information. Back-propagation actually learns a whole learning set, so adding new information essentially involves providing a whole new set and destroying the old. For back-propagation this implies that the only way to have incremental learning is to present the union of the two sets, old and new, for learning and start over again. Other algorithms suffer less from this, but may not carry the guarantee that an exhaustive algorithm like back-propagation possesses (Hornik, Stinchcombe and White, 1989).

Learning Parameters

Some systems have more than one parameter which can be varied to affect either the speed or the accuracy of learning. Using a system with several learning parameters may require the user to understand enough about those parameters to be able to manipulate them with a view to maximising return on the learning set. For instance, Back-Propagation is one of a number of techniques with a learning-speed parameter (the learning rate) which can significantly affect the quality of what is learnt. The Vigilance parameter in ART determines the coarseness of the categories learned, while the Temperature in the Boltzmann machine specifically trades off solution speed against quality. Generally the different parameters allow one to trade off solution or learning speed against some other quality, such as the discrimination power of the final system.

Transfer Function

The transfer function determines the relationship between inputs and output. In general the function introduces a degree of non-linearity that functions as a (graded) threshold mechanism. Each form of transfer function effectively forces the neuron to represent a decision (yes/no or 1/0) but may allow, with the exception of the step function, a degree of fudge in the middle of the range. The ramp function is sometimes described as semi-linear. The step function is a limiting case of the sigmoid function (see our contribution on Optimisation Networks).
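The three transfer functions just named are easily compared side by side. A small sketch (Python; the slope parameter is an illustrative assumption) showing the step, the 'semi-linear' ramp and the sigmoid, and how a steep sigmoid approaches the step as its limiting case:

    import numpy as np

    def step(z):
        return (z >= 0.0).astype(float)

    def ramp(z):
        return np.clip(z, 0.0, 1.0)        # semi-linear: linear, clipped to [0, 1]

    def sigmoid(z, slope=1.0):
        return 1.0 / (1.0 + np.exp(-slope * z))

    z = np.array([-2.0, -0.1, 0.1, 2.0])
    print(step(z), ramp(z), np.round(sigmoid(z), 3))
    print(np.round(sigmoid(z, slope=50.0), 3))   # steep sigmoid -> step function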
Number of Layers

Many systems are layered, which represents an initial structure, and have non-linear behaviour through the use of transfer functions such as the sigmoid. In general there is a clear advantage to using a single hidden layer of non-linear neurons between the input and output layers, but having more than two hidden layers in a system with non-linearity does not increase the computational power (Cybenko, 1988; 1989; Hornik et al., 1989). This benefit accrues to the user who wishes to understand the internals of the system. The disadvantages lie in areas such as speed of learning and execution, because there are more nodes and connections to be computed.

Connectivity

Is it necessary to have all elements connected to all others? Fully interconnected networks produce considerable computational overheads. Types I, II, III, IV and VI are all either highly or fully interconnected. Type V systems, in contrast, have high local interconnection (within modules) and full interconnectivity only between separate modules. Reduced interconnectivity may be very advantageous in both learning and execution (Murre, 1992). Fully interconnected systems are, however, more amenable to analytical solutions. What this means is that you may be able to understand, mathematically, why the system is working badly, as opposed to having an effective but analytically intractable system.

Distributed and Localised Representations

This issue arises when one wishes to interpret what a network is doing. Much of the discussion around this area centres upon theoretical issues. In practice it may be easier to start with a localised representation (e.g. 26 nodes for 26 letters), but more complex applications are best served by a distributed representation. The modular approach tends to combine the two, with localised representation within modules and distributed representation between them. The damage resistance of networks is due, in considerable part, to the distribution of information across large portions of networks.

Locality of Algorithm

Some algorithms require that knowledge about the whole of the network be made available to any one part. Others, in contrast, require input from only a small number of neighbouring elements. These considerations are probably of little importance in small implementations, but become increasingly important when considering hardware-specific implementations with large numbers of processors.
6 Public-domain Software

The best way to understand neural networks is to experiment with neural-network simulation software. Before considering professional software packages, one may experiment with public-domain software, which can be obtained from one of the neural-network ftp sites. Below we list sites containing software for the main neural architectures described in this book.

SOM_PAK

Kohonen and coworkers have developed a program package for self-organizing feature maps and related neural architectures. The Internet site is cochlea.hut.fi (130.233.168.48); use anonymous as login name. Programs and documentation are in the directory /pub/som-pak.

Stuttgart Neural Network Simulator

The Stuttgart Neural Network Simulator (SNNS) from the University of Stuttgart, Germany, offers many types of neural networks for UNIX machines, e.g., back-propagation, ART1, ART2, and ARTMAP. The SNNS is available through anonymous ftp from ftp.informatik.uni-stuttgart.de, directory: /pub/SNNS.

Aspirin/MIGRAINES

Aspirin/MIGRAINES Version 6.0 generates C code for neural network models specified in a network-description language called Aspirin. MIGRAINES is an interface for exporting data from the neural network to visualisation software. Aspirin/MIGRAINES is available in compressed UNIX format from two FTP sites: pt.cs.cmu.edu (128.2.254.155), directory: /afs/cs/project/connect/code, and ftp.cognet.ucla.edu (128.97.50.19), directory: /alexis.

WinNN

WinNN is a shareware Neural Networks (NN) package for Windows 3.1. WinNN has a very user-friendly interface with extensive on-line help. WinNN is designed to experiment with the different parameters of backpropagation networks: they can be easily modified while WinNN is training. Available for ftp from winftp.cica.indiana.edu as /pub/pc/win3/programr/winnn093.zip (545 kB).
train a backpropagation network with one or more hidden layers. Conducting experiments with setting up and training simple Artificial Neural Networks (ANNs) will improve your understanding of the basic principles underlying backpropagation networks and ANNs in general. The program is written in ANSI C and can be used on any MS-DOS PC with 256 kB RAM. The site is ftp.cs.rulimburg.nl (137.120.13.8); program and documentation are in the directory /pub/software/ANN. People without ftp facilities can ask for a copy of the NNS software by sending a request to Ton Weijters, University of Limburg, Faculty of General Sciences, Department of Computer Science, P.O. Box 616, 6200 MD Maastricht, The Netherlands.

7 Conclusions

To recapitulate, there are a great many ways of solving particular problems; a neural network is only one of them. We should stress that the neural-network approach is essentially one at the algorithmic level, with possible implementational consequences in the future. Someone with a specific problem must first analyse that problem in terms of the features discussed here: whether learning is necessary, what generalisation is required, the type of input, and so on. Such an analysis provides a pattern against which it should be possible to see whether there is a match with one of the architectures in Table 1. If so, that is a strong suggestion that the matching architecture may prove useful and usable. If no such match exists, one may ask whether neural networks really are the appropriate tool at all or, alternatively, search either for the nearest network architecture or for one not mentioned in our scheme.

References

D. Ackley, G. Hinton, and T. Sejnowski (1985) A learning algorithm for Boltzmann machines. Cognitive Science 9, 147-169.
J.A. Anderson, J.W. Silverstein, S.A. Ritz, and R.S. Jones (1977) Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review 84, 413-451.
A. Barto, R. Sutton and C. Anderson (1983) Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13, 834-846.
G.A. Carpenter and S. Grossberg (1987a) A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing 37, 54-115.
G.A. Carpenter and S. Grossberg (1987b) ART 2: self-organization of stable category recognition codes for analog input patterns. Applied Optics 26, 4919-4930.
G.A. Carpenter, S. Grossberg and J.H. Reynolds (1991) ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks 4, 565-588.
G. Cybenko (1988) Continuous valued neural networks with two hidden layers are sufficient. Technical Report, Department of Computer Science, Tufts University, Medford, MA.
G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems 2, 303-314.
J.J. Hopfield (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences U.S.A. 79, 2554-2558.
J.J. Hopfield and D.W. Tank (1985) "Neural" computation of decisions in optimization problems. Biological Cybernetics 52, 141-152.
K. Hornik, M. Stinchcombe and H. White (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
W. James (1890) Principles of Psychology. Holt, New York.
T. Kohonen (1984) Self-Organization and Associative Memory. Springer-Verlag, Berlin.
B. Kosko (1988) Bidirectional associative memories. IEEE Transactions on Systems, Man, and Cybernetics SMC-18, 42-60.
D. Marr (1982) Vision. Freeman, San Francisco.
J.M.J. Murre (1992) Categorization and Learning in Modular Neural Networks. Harvester Wheatsheaf, Hemel Hempstead.
J.M.J. Murre, R.H. Phaf and G. Wolters (1992) CALM: Categorizing and Learning Module. Neural Networks 5, 55-82.
C. Peterson (1990) Parallel distributed approaches to combinatorial optimization: Benchmark studies on Traveling Salesman Problem. Neural Computation 2, 261-269.
C. Peterson and B. Soderberg (1989) A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems 1, 3-22.
F. Rosenblatt (1958) The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65, 386-408.
D.E. Rumelhart, G.E. Hinton and R.J. Williams (1986) Learning representations by back-propagating errors. Nature 323, 533-536.
P.K. Simpson (1990) Artificial Neural Systems: Foundations, Paradigms, Applications, and Implementations. Pergamon Press, New York, NY.
H. Szu (1986) Fast simulated annealing. In: J. Denker (Ed.), AIP Conference Proceedings 151: Neural Networks for Computing, 420-425. American Institute of Physics, New York.
Y. Takefuji and H. Szu (1989) Design of parallel distributed Cauchy machines. In: Proceedings of the International Joint Conference on Neural Networks, San Diego, CA, Vol. I, 529-532.
B. Widrow (1962) Generalization and information storage in networks of adaline "neurons". In: M. Yovits, G. Jacoby, and G. Goldstein (Eds.), Self-Organizing Systems 1962, 435-461. Spartan Books, Washington.
Supporting General Literature

H. Adeli and S.-L. Hung (1995) Machine Learning: Neural Networks, Genetic Algorithms and Fuzzy Systems. Wiley, New York.
S.-I. Amari (1977) A neural theory of association and concept formation. Biological Cybernetics 26, 175-185.
S.-I. Amari (1983) Field theory of self-organizing neural networks. IEEE Trans. Syst. Man Cybern. SMC-13, 741.
D.J. Amit (1989) Modeling Brain Function. Cambridge University Press, Cambridge.
D.Z. Anderson (ed.) (1988) Neural Information Processing Systems - Natural and Synthetic. American Institute of Physics, New York, NY.
J.A. Anderson, A. Pellionisz and E. Rosenfeld (eds.) (1990) Neurocomputing 2, Directions for Research. MIT Press, Cambridge, MA.
J.A. Anderson and E. Rosenfeld (eds.) (1988) Neurocomputing, Foundations of Research. MIT Press, Cambridge, MA.
A. Barr and E.A. Feigenbaum (1981) The Handbook of Artificial Intelligence. Kaufmann, Los Altos, CA.
A.G. Barto, R.S. Sutton and C.W. Anderson (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 13, 834-846.
H.D. Block (1962) The perceptron: a model for brain functioning. I. Review of Modern Physics 34, 123-135.
A. Blum (1992) Neural Networks in C++: An Object-Oriented Framework for Building Connectionist Systems. Wiley, New York.
E.R. Caianiello (1961) Outline of a theory of thought-processes and thinking machines. J. Theoret. Biol. 2, 204.
G.A. Carpenter (1989) Neural network models for pattern recognition and associative memory. Neural Networks 2, 243-257.
G.A. Carpenter and S. Grossberg (1987a) A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing 37, 54-115.
G.A. Carpenter and S. Grossberg (1987b) ART 2: self-organization of stable category recognition codes for analog input patterns. Applied Optics 26, 4919-4930.
G.A. Carpenter and S. Grossberg (1987) Neural dynamics of category learning and recognition: Structural invariants, reinforcement, and evoked potentials. In: Pattern Recognition and Concepts in Animals, People and Machines (eds. M.L. Commons, S.M. Kosslyn and R.J. Herrnstein), Erlbaum, Hillsdale, NJ.
G.A. Carpenter and S. Grossberg (1988) The ART of adaptive pattern recognition by a self-organizing neural network. IEEE Computer, Special issue on Artificial Neural Systems 21, 77-88.
G.A. Carpenter and S. Grossberg (1990) ART 3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks 3, 129-152.
M. Caudill and C. Butler (1990) Naturally Intelligent Systems. MIT Press, Cambridge, MA.
J. Denker (ed.) (1986) Neural Networks for Computing. AIP Conference Proceedings 151. American Institute of Physics, New York, NY.
R.M. Durbin and D. Willshaw (1987) An analogue approach to the travelling salesman problem using an elastic net method. Nature 326, 689-691.
R. Eckmiller and C. von der Malsburg (eds.) (1988) Neural Computers. NATO ASI Series, Series F: Computer and Systems Sciences, Vol. 41. Springer-Verlag, Berlin.
G.M. Edelman (1989) Neural Darwinism. The Theory of Neuronal Group Selection. Oxford University Press, Oxford.
J.A. Feldman and D.H. Ballard (1982) Connectionist models and their properties. Cognitive Science 6, 205-254.
W.J. Freeman (1991) The physiology of perception. Scientific American 264 (2), 34-41.
K. Fukushima (1975) Cognitron: A self-organizing multilayered neural network. Biological Cybernetics 20, 121-136.
K. Fukushima (1988) A neural network for visual pattern recognition. Computer 21, 65-74.
D. Gabor (1969) Associative holographic memories. IBM J. Res. Dev. 13, 156.
R.P. Gorman and T.J. Sejnowski (1988) Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75-89.
S. Grossberg (ed.) (1982) Studies of Mind and Brain: Neural Principles of Learning, Perception, Development, Cognition, and Motor Control. Reidel Press, Boston.
S. Grossberg (1987) Competitive learning: From interactive activation to adaptive resonance. Cognitive Science 11, 23-63.
S. Grossberg (1986) The Adaptive Brain I: Cognition, Learning, Reinforcement, and Rhythm. Elsevier/North-Holland, Amsterdam.
S. Grossberg (1987) The Adaptive Brain II: Vision, Speech, Language, and Motor Control. Elsevier/North-Holland, Amsterdam.
S. Grossberg (1988) Nonlinear neural networks: principles, mechanisms, and architectures. Neural Networks 1, 17-61.
S. Grossberg (ed.) (1988) Neural Networks and Natural Intelligence. Bradford Books, Cambridge, MA.
D.O. Hebb (1949) The Organization of Behavior. Wiley, New York, NY.
R. Hecht-Nielsen (1990) Neurocomputing. Addison-Wesley, Reading, MA.
P.J. van Heerden (1963) Theory of optical information storage in solids. Appl. Opt. 2, 393.
G.E. Hinton and J.A. Anderson (eds.) (1981) Parallel Models of Associative Memory. Erlbaum, Hillsdale, NJ.
J.J. Hopfield (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences U.S.A. 79, 2554-2558.
J.J. Hopfield and D.W. Tank (1985) Neural computation of decisions in optimization problems. Biological Cybernetics 52, 141-152.
J.J. Hopfield and D.W. Tank (1986) Computing with neural circuits: A model. Science 233, 625-633.
E.R. Kandel and J.H. Schwartz (1989) Principles of Neural Science, Second Edition. Springer-Verlag, New York.
T. Khanna (1990) Foundations of Neural Networks. Addison-Wesley, Reading, MA.
T. Kohonen (1977) Associative Memory: A System Theoretical Approach. Springer-Verlag, Berlin.
T. Kohonen (1982) Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59-69.
T. Kohonen (1987) Adaptive, associative, and self-organizing functions in neural computing. Applied Optics 26, 4910-4918.
T. Kohonen (1988) An introduction to neural computing. Neural Networks 1, 3-16.
T. Kohonen (1989) Self-Organization and Associative Memory, Third Edition. Springer-Verlag, New York.
B. Kosko (1987) Adaptive bidirectional associative memories. Applied Optics 26, 4947-4960.
A. Lapedes and R. Farber (1988) How neural nets work. In: Neural Information Processing Systems (ed. D.Z. Anderson), American Institute of Physics, New York, NY.
R.P. Lippmann (1987) An introduction to computing with neural nets. IEEE ASSP Magazine 3 (4), 4-22.
J.L. McClelland and D.E. Rumelhart (1988) Explorations in Parallel Distributed Processing, a Handbook of Models, Programs, and Exercises. MIT Press, Cambridge, MA.
J.L. McClelland, D.E. Rumelhart and the PDP Research Group (eds.) (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 2: Psychological and Biological Models. MIT Press, Cambridge, MA.
W.S. McCulloch (1988) Embodiments of Mind (new edition). MIT Press, Cambridge, MA.
W.S. McCulloch and W.A. Pitts (1943) A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5, 115-133.
C. Mead (1989) Analog VLSI and Neural Systems. Addison-Wesley, Reading, MA.
T.M. Miller, R.S. Sutton and P.J. Werbos (1990) Neural Networks for Control. MIT Press, Cambridge, MA.
M.L. Minsky (1985) The Society of Mind. Simon and Schuster, New York, NY.
M.L. Minsky and S.A. Papert (1988) Perceptrons: An Introduction to Computational Geometry, expanded edition. MIT Press, Cambridge, MA.
L. Nadel, L.A. Cooper, P. Culicover and R.M. Harnish (eds.) (1989) Neural Connections, Mental Computation. MIT Press, Cambridge, MA.
K. Nakano (1972) Associatron - a model of associative memory. IEEE Trans. Syst. Man Cybern. SMC-2, 380.
N.J. Nilsson (1990) The Mathematical Foundations of Learning Machines (new edition). Morgan Kaufmann Publishers, San Mateo, CA.
Y.-H. Pao (1989) Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, MA.
C. Parten, C. Harston, A. Maren and R. Pap (1990) Handbook of Neural Computing Applications. Academic Press, San Diego, CA.
R. Pfeifer, Z. Schreter, F. Fogelman-Soulie and L. Steels (eds.) (1989) Connectionism in Perspective. North-Holland, Amsterdam.
K.H. Pribram (ed.) (1991) Brain and Perception: Holonomy and Structure in Figural Processing. Lawrence Erlbaum Associates, Hillsdale, NJ.
F. Rosenblatt (1958) The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65, 386-408.
F. Rosenblatt (1962) Principles of Neurodynamics. Spartan, New York, NY.
D.E. Rumelhart, J.L. McClelland and the PDP Research Group (eds.) (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. MIT Press, Cambridge, MA.
D.E. Rumelhart, G.E. Hinton and R.J. Williams (1986) Learning representations by back-propagating errors. Nature 323, 533-536.
D.E. Rumelhart and D. Zipser (1986) Feature discovery by competitive learning. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition,
Vol. 1: Foundations (eds. J.L. McClelland, D.E. Rumelhart, and the PDP Research Group), MIT Press, Cambridge, MA.
T.J. Sejnowski and C.R. Rosenberg (1988) NETtalk: a parallel network that learns to read aloud. In: Neurocomputing, Foundations of Research (eds. J.A. Anderson and E. Rosenfeld), MIT Press, Cambridge, MA.
P.K. Simpson (1990) Artificial Neural Systems: Foundations, Paradigms, Applications, and Implementations. Pergamon Press, New York, NY.
D.W. Tank and J.J. Hopfield (1987) Collective computation in neuronlike circuits. Scientific American 257 (6), 62-70.
D.S. Touretzky (ed.) (1989) Advances in Neural Information Processing Systems 1. Morgan Kaufmann Publishers, San Mateo, CA.
D.S. Touretzky (ed.) (1990) Advances in Neural Information Processing Systems 2. Morgan Kaufmann Publishers, San Mateo, CA.
D.S. Touretzky, G. Hinton and T. Sejnowski (eds.) (1989) Proceedings of the Connectionist Models Summer School. Morgan Kaufmann Publishers, San Mateo, CA.
R.R. Trippi and E. Turban (eds.) (1993) Neural Networks in Finance and Investment: Using Artificial Intelligence to Improve Real World Performance. Probus, Chicago, IL.
C. von der Malsburg (1973) Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100.
C. von der Malsburg and E. Bienenstock (1986) Statistical coding and short-term synaptic plasticity: A scheme for knowledge representation in the brain. In: Disordered Systems and Biological Organization, NATO ASI Series, Vol. F20 (eds. E. Bienenstock, F. Fogelman-Soulie, and G. Weisbuch), Springer-Verlag, Berlin.
B. Widrow and M.E. Hoff (1960) Adaptive switching circuits. 1960 IRE WESCON Convention Record, Part 4, 96-104.
B. Widrow and M.E. Hoff (1985) Adaptive switching circuits. In: 1960 WESCON Convention Record, Part 4, 96-104; Human Neurobiol. 4, 229.
D.J. Willshaw and H.C. Longuet-Higgins (1970) Associative memory models. In: Machine Intelligence (eds. B. Meltzer and D. Michie), Edinburgh University Press.
G.V. Wilson and G.S. Pawley (1988) On the stability of the traveling salesman problem algorithm of Hopfield and Tank. Biological Cybernetics 58, 63-70.
M. Zeidenberg (1990) Neural Networks in Artificial Intelligence. Simon and Schuster, New York, NY.
S.F. Zornetzer, J.L. Davis and C. Lau (eds.) (1990) An Introduction to Neural and Electronic Networks. Academic Press, San Diego, CA.
J.M. Zurada (1992) Introduction to Artificial Neural Systems. West Publishing Company, St. Paul, USA.

Journals:

Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies. Kluwer Academic Publishers, Dordrecht. First issue: July 1991.
Connection Science: Journal of Neural Computing, Artificial Intelligence and Cognitive Research. Carfax Publishing Company, Abingdon, United Kingdom. First issue: January 1989.
IEEE Transactions on Neural Networks. IEEE, New York, NY. First issue: March 1990.
The International Journal of Neural Networks: Research and Applications. Learned Information (Europe) Ltd., Oxford, United Kingdom. First issue: January 1989.
International Journal of Neural Systems. World Scientific Publishing Co. Pte. Ltd., London, United Kingdom. First issue: 1989.
Journal of Neural Network Computing. Auerbach Publishers, Warren Gorham and Lamont Co., Boston, MA. First issue: 1989.
Network: Computation in Neural Systems. IOP Publishing Ltd., Bristol, UK. First issue: January 1990.
Neural Computation. MIT Press, Cambridge, MA. First issue: Spring 1989.
Neural Network News. AIWeek Inc., Atlanta, GA. First issue: January 1989.
Neural Network Review: The critical review journal for the neural network community. Lawrence Erlbaum Associates Inc., Hillsdale, NJ. First issue: 1987.
Neural Networks. The official journal of the International Neural Network Society (INNS). Pergamon Journals Ltd., Oxford, United Kingdom. First issue: January 1988.
Neurocomputing. North-Holland/Elsevier Science Publishers, Amsterdam, The Netherlands. First issue: January 1989.
Neural Processing Letters. D Facto Publications, Brussels. First issue: September 1994.

Special Issues:

Applied Optics (1986) Vol. 25, No. 18. Special issue on Neural Computation.
Applied Optics (1987) Vol. 26, No. 10. Special issue on Neural Computation.
Byte (1987) Vol. 12, No. 11. Special issue on heuristic algorithms.
Byte (1989) Vol. 14, No. 8. Special issue on neural networks.
Dr. Dobb's Journal (1990) April. Special issue on neural networks.
IEEE Computer (1988) Vol. 21, No. 3. Special issue on Artificial Neural Systems.
IEEE Transactions on Systems, Man, and Cybernetics (1983) Vol. 13, No. 5. Special issue on Neural and Sensory Information Processing.
Addresses of the Authors

P. Boekhoudt, Department of Mathematics, University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands.
P.J. Braspenning, Department of Computer Science, University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands.
H.R.A. Cardon, Shell Internationale Petroleum Mij. B.V., EPD/22, P.O. Box 162, 2501 AN The Hague, The Netherlands.
Y. Crama, Department of Economics and Business Administration, University of Liege, Boulevard du Rectorat 7 (B31), 4000 Liege, Belgium.
J. Henseler, Section of Forensic Computer Science, Forensic Science Laboratory of the Ministry of Justice, Volmerlaan 17, 2288 GD Rijswijk, The Netherlands.
R.J.W. van Hoogstraten, Shell Internationale Petroleum Mij. B.V., EPD/22, P.O. Box 162, 2501 AN The Hague, The Netherlands.
G.A.J. Hoppenbrouwers, Dutch State School of Interpreting, P.O. Box 964, 6200 AZ Maastricht, The Netherlands.
P.T.W. Hudson, Department of Computer Science, University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands.
A.W.J. Kolen, Department of Quantitative Economics, University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands.
J.H.J. Lenting, Department of Computer Science, University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands.
W.T.C. van Luenen, Unilever Research Laboratorium Vlaardingen, P.O. Box 114, 3130 AC Vlaardingen, The Netherlands.
E.J. Pesch, Department of Economics and Business Administration, University of Bonn, Adenauerallee 24-42, D-53113 Bonn, Germany.
H.J.M. Peters, Department of Quantitative Economics, University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands.
E.O. Postma, Department of Computer Science, University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands.
F.C.R. Spieksma, Department of Mathematics, University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands.
F. Thuijsman, Department of Mathematics, University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands.
O.J. Vrieze, Department of Mathematics, University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands.
A.J.M.M. Weijters, Department of Computer Science, University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands.